Correlation Bi-community Extraction method

Consider two sets of high-dimensional measurements on the same set of samples. CBCE (Correlation Bi-Community Extraction method) finds pairs of variable sets, one from the first measurement and one from the second, whose variables are correlated with each other.

cbce(
  X,
  Y,
  alpha = 0.05,
  alpha.init = alpha,
  cov = NULL,
  cache.size = (utils::object.size(X) + utils::object.size(Y))/2,
  start_frac = 1,
  start_nodes = list(x = sample(1:ncol(X), ceiling(ncol(X) * start_frac)), y =
    sample(1:ncol(Y), ceiling(ncol(Y) * start_frac))),
  max_iterations = 20,
  size_threshold = 0.5 * exp(log(ncol(X))/2 + log(ncol(Y))/2),
  interaction = interaction_none,
  heuristic_search = FALSE,
  filter_low_score = TRUE,
  diagnostic = diagnostics
)

Arguments

X, Y	Numeric matrices representing the two groups of variables. Rows represent samples and columns represent variables.
alpha	$\in (0,1)$. Controls the type I error for the update (for the multiple testing procedure).
alpha.init	$\in (0,1)$. Controls the type I error for the initialization step. This can be more liberal (i.e. greater) than the alpha for the update step.
cov	The covariates to account for. This should be a matrix with the same number of rows as X and Y. Each column represents a covariate whose effect needs to be removed. If this is NULL, no covariate effects are removed.
cache.size	integer. The amount of memory to dedicate to caching correlations, which speeds up the computation. Defaults to the average memory required by the X and Y matrices.
start_frac	$\in (0,1]$. The proportion of nodes, chosen at random, to start extractions from. This is used to randomly sample `start_nodes`. If `start_nodes` is provided, this parameter is ignored.
start_nodes	list. The initial set of variables to start with. If this is provided, `start_frac` will be ignored. If NULL, extractions are run starting from each variable from X and Y. Otherwise `start_nodes$x` gives the X variables to start from and `start_nodes$y` gives the Y variables to start from.
max_iterations	integer. The maximum number of iterations per extraction. If a fixed point is not found by this step, the extraction is terminated. This limit guarantees that the program terminates.
size_threshold	The maximum size of bimodule we want to search for. The search will be terminated when sets grow beyond this size. The size of a bimodule is defined as the geometric mean of its X and Y sizes.
interaction	(internal) A function that will be called between extractions to allow interaction with the program. For instance, one can pass the function `interaction_gui` (EXPERIMENTAL) or `interaction_cli`.
heuristic_search	Use a fast, but incomplete, version of heuristic search that doesn't start from nodes inside bimodules already found.
filter_low_score	Should we remove bimodules with low score? (recommended).
diagnostic	(internal) An internal function for probing the internal state of the method. It will be called at special hooks and can look into what the method is doing. Pass either `diagnostics` or `diagnostics_none`.
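
The defaults for `size_threshold` and `start_nodes` in the usage above can be reproduced by hand, which may help when choosing custom values. The following is a minimal sketch in base R only; the dimensions and the stand-in matrices `X` and `Y` are illustrative assumptions, and `start_frac = 0.25` is an arbitrary choice rather than a recommendation.

```r
# Stand-in data with assumed dimensions (40 samples, 20 and 50 variables).
n <- 40; dx <- 20; dy <- 50
set.seed(1)
X <- matrix(rnorm(n * dx), nrow = n)
Y <- matrix(rnorm(n * dy), nrow = n)

# Default size_threshold, copied from the usage section.
size_threshold <- 0.5 * exp(log(ncol(X)) / 2 + log(ncol(Y)) / 2)

# The same quantity written as half the geometric mean of the two dimensions.
half_geo_mean <- 0.5 * sqrt(ncol(X) * ncol(Y))
all.equal(size_threshold, half_geo_mean)

# start_nodes built the same way as the default, but with start_frac = 0.25.
start_frac <- 0.25
start_nodes <- list(
  x = sample(1:ncol(X), ceiling(ncol(X) * start_frac)),
  y = sample(1:ncol(Y), ceiling(ncol(Y) * start_frac))
)
lengths(start_nodes)
```

A list built this way could then be supplied as the `start_nodes` argument, in which case `start_frac` is ignored as described above.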

Value

The return value is a list with the results and metadata about the extraction. The most useful field is comms, a list of all the Correlation Bi-communities that were detected after filtering, while comms.fil consists of all the communities that were found after filtering similar communities.
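
As a small illustration of working with this structure, the sketch below computes the size of each bi-community using the geometric-mean definition given for `size_threshold`. It does not call `cbce`; the object `comms` is a hand-written mock shaped like the `res$comms` output printed in the Examples section, and `bimodule_size` is a hypothetical helper, not a function from the package.

```r
# Mock object mimicking the shape of res$comms: a list of communities,
# each holding integer index vectors x and y.
comms <- list(list(x = c(4L, 5L), y = 3:6))

# Hypothetical helper: geometric mean of the X and Y set sizes.
bimodule_size <- function(comm) sqrt(length(comm$x) * length(comm$y))

# One size per community.
sizes <- vapply(comms, bimodule_size, numeric(1))
sizes
```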

Details

cbce applies an update function (mapping subsets of variables to subsets of variables) iteratively until a fixed point is found. These fixed points are reported as communities. The update starts from a single variable (the initialization step) and is repeated until either a fixed point is found or some set repeats. Each such run is called an extraction. Since each extraction starts from a single node, there are ncol(X)+ncol(Y) possible extractions.
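
The control flow of a single extraction can be sketched as a generic fixed-point iteration. This is only an illustration of the loop described above, not the package's implementation: `iterate_to_fixed_point` and `toy_update` are hypothetical names, and the toy update rule stands in for the real update, which is based on a multiple testing procedure.

```r
# Apply `update` repeatedly to a set of variable indices until the set stops
# changing, a previously seen set recurs, or max_iterations is reached.
iterate_to_fixed_point <- function(start, update, max_iterations = 20) {
  current <- sort(unique(as.integer(start)))
  seen <- list(current)
  for (i in seq_len(max_iterations)) {
    nxt <- sort(unique(as.integer(update(current))))
    if (identical(nxt, current)) {
      return(list(set = current, status = "fixed_point", iterations = i))
    }
    # A set that was visited earlier means the iteration is cycling.
    if (any(vapply(seen, identical, logical(1), nxt))) {
      return(list(set = nxt, status = "cycle", iterations = i))
    }
    seen[[length(seen) + 1]] <- nxt
    current <- nxt
  }
  list(set = current, status = "max_iterations", iterations = max_iterations)
}

# Toy update (assumption): any set touching 3:6 is mapped to all of 3:6,
# and every other set is mapped to the empty set.
toy_update <- function(s) if (any(s %in% 3:6)) 3:6 else integer(0)

# One run per starting variable, mirroring "one extraction per single node".
runs <- lapply(1:8, iterate_to_fixed_point, update = toy_update)
from_4 <- runs[[4]]
from_4$set
from_4$status
```

In this toy setting every start inside 3:6 reaches the same fixed point and every other start collapses to the empty set, which illustrates why different extractions can report the same community.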

Examples

library(cbce)
#Sample size
n <- 40
#Dimension of measurement 1
dx <- 20
#Dimension of measurement 2
dy <- 50
#Correlation strength
rho <- 0.5
set.seed(1245)
# Assume first measurement is gaussian
X <- matrix(rnorm(dx*n), nrow=n, ncol=dx)
# Measurements 3:6 in set 2 are correlated to 4:5 in set 1
Y <- matrix(rnorm(dy*n), nrow=n, ncol=dy)
Y[, 3:6] <- sqrt(1-rho)*Y[, 3:6] + sqrt(rho)*rowSums(X[, 4:5])
res <- cbce(X, Y)
#Recovers the indices 4:5 for X and 3:6 for Y
#If the strength of the correlation was higher
#all the indices could be recovered.
res$comms
#> [[1]]
#> [[1]]$x
#> [1] 4 5
#> 
#> [[1]]$y
#> [1] 3 4 5 6
#> 
#>
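
To see why these indices stand out, the planted signal can be inspected directly with base R, without calling `cbce`. The sketch below regenerates the data from the example and compares the mean absolute cross-correlation inside the planted block with the mean over all other variable pairs. The names `abs_cor`, `in_block`, `planted`, and `background` are introduced here for illustration only.

```r
# Regenerate the data exactly as in the example above.
n <- 40; dx <- 20; dy <- 50; rho <- 0.5
set.seed(1245)
X <- matrix(rnorm(dx * n), nrow = n, ncol = dx)
Y <- matrix(rnorm(dy * n), nrow = n, ncol = dy)
Y[, 3:6] <- sqrt(1 - rho) * Y[, 3:6] + sqrt(rho) * rowSums(X[, 4:5])

# dx-by-dy matrix of absolute sample cross-correlations.
abs_cor <- abs(cor(X, Y))

# Logical mask marking the planted block (X columns 4:5, Y columns 3:6).
in_block <- outer(seq_len(dx) %in% 4:5, seq_len(dy) %in% 3:6, "&")

# Mean absolute correlation inside the block versus everywhere else.
planted <- mean(abs_cor[in_block])
background <- mean(abs_cor[!in_block])
c(planted = planted, background = background)
```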