Consider two sets of high-dimensional measurements on the same set of samples. CBCE (the Correlation Bi-Community Extraction method) finds sets of variables from the first measurement and sets of variables from the second measurement that are correlated with each other.
cbce(
  X, Y,
  alpha = 0.05,
  alpha.init = alpha,
  cov = NULL,
  cache.size = (utils::object.size(X) + utils::object.size(Y)) / 2,
  start_frac = 1,
  start_nodes = list(
    x = sample(1:ncol(X), ceiling(ncol(X) * start_frac)),
    y = sample(1:ncol(Y), ceiling(ncol(Y) * start_frac))
  ),
  max_iterations = 20,
  size_threshold = 0.5 * exp(log(ncol(X))/2 + log(ncol(Y))/2),
  interaction = interaction_none,
  heuristic_search = FALSE,
  filter_low_score = TRUE,
  diagnostic = diagnostics
)
X, Y: Numeric matrices representing the two groups of variables. Rows represent samples and columns represent variables.

alpha: \(\in (0,1)\). Controls the type I error for the update step (for the multiple testing procedure).

alpha.init: \(\in (0,1)\). Controls the type I error for the initialization step. This can be more liberal (i.e. greater) than the alpha used for the update step.

cov: The covariates to account for. This should be a matrix with the same number of rows as X and Y; each column represents a covariate whose effect needs to be removed. If this is NULL, no covariates are removed.

cache.size: integer. The amount of memory to dedicate to caching correlations; caching speeds things up. Defaults to the average of the memory required by the X and Y matrices.

start_frac: \(\in (0,1]\). The proportion of nodes to start extractions from. This is used to randomly sample the default start_nodes.

start_nodes: list. The initial set of variables to start extractions from. If this is provided, extractions are started only from these variables (and start_frac has no effect).

max_iterations: integer. The maximum number of iterations per extraction. If a fixed point is not found by this point, the extraction is terminated. This limit is set so that the program terminates.

size_threshold: The maximum size of bimodule we want to search for. The search is terminated when sets grow beyond this size. The size of a bimodule is defined as the geometric mean of its X and Y sizes.

interaction: (internal) A function that will be called between extractions to allow interaction with the program; for instance, one can pass a function that inspects or logs progress. The default is interaction_none.

heuristic_search: Use a fast, but incomplete, version of heuristic search that doesn't start extractions from nodes inside bimodules already found.

filter_low_score: Should bimodules with low scores be removed? (Recommended.)

diagnostic: (internal) An internal function for probing the internal state of the method. It will be called at special hooks and can inspect what the method is doing. The default is diagnostics.
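The default size_threshold in the usage section, 0.5 * exp(log(ncol(X))/2 + log(ncol(Y))/2), is just half the geometric mean of the two dimensions. A quick check in R (the dimension values here are arbitrary):

```r
# The default size_threshold equals half the geometric mean of
# ncol(X) and ncol(Y): 0.5 * exp(log(dx)/2 + log(dy)/2) == 0.5 * sqrt(dx * dy).
dx <- 20
dy <- 50
default_threshold <- 0.5 * exp(log(dx)/2 + log(dy)/2)
all.equal(default_threshold, 0.5 * sqrt(dx * dy))
#> TRUE
```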
The return value is a list with the results and metadata about the extraction. The most useful field is comms: the list of all the correlation bi-communities that were detected. The field comms.fil contains the communities that remain after filtering out similar communities.
cbce applies an update function (mapping subsets of variables to subsets of variables) iteratively until a fixed point is found. These fixed points are reported as communities. The update starts from a single variable (the initialization step) and is repeated until either a fixed point is found or some set repeats. Each such run is called an extraction. Since extractions start only from singleton nodes, there are ncol(X) + ncol(Y) possible extractions.
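The extraction loop described above can be sketched as follows. This is only an illustration of the control flow, not the package's actual implementation: `update` is a placeholder for cbce's real update map (which applies a multiple testing procedure at level alpha), and the stopping rules mirror the fixed-point, set-repetition, and max_iterations conditions.

```r
# Sketch of a single extraction (illustrative only; not cbce's code).
# `update` stands in for the real update map on variable sets.
extract <- function(start_set, update, max_iterations = 20) {
  current <- sort(start_set)
  seen <- list(current)                 # sets visited so far
  for (i in seq_len(max_iterations)) {
    nxt <- sort(update(current))
    if (identical(nxt, current)) {
      return(list(comm = current, fixed_point = TRUE))   # fixed point found
    }
    if (any(vapply(seen, identical, logical(1), y = nxt))) {
      return(list(comm = nxt, fixed_point = FALSE))      # some set repeats
    }
    seen <- append(seen, list(nxt))
    current <- nxt
  }
  list(comm = current, fixed_point = FALSE)              # iteration limit hit
}

# A toy update map: grow the set by +1 up to 5, after which it stabilizes.
toy_update <- function(s) sort(unique(c(s, pmin(s + 1, 5))))
extract(1, toy_update)   # reaches the fixed point {1, ..., 5}
```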
library(cbce)

# Sample size
n <- 40
# Dimension of measurement 1
dx <- 20
# Dimension of measurement 2
dy <- 50
# Correlation strength
rho <- 0.5

set.seed(1245)
# Assume the first measurement is Gaussian
X <- matrix(rnorm(dx*n), nrow=n, ncol=dx)
# Variables 3:6 in set 2 are correlated to variables 4:5 in set 1
Y <- matrix(rnorm(dy*n), nrow=n, ncol=dy)
Y[, 3:6] <- sqrt(1-rho)*Y[, 3:6] + sqrt(rho)*rowSums(X[, 4:5])

res <- cbce(X, Y)

# Recovers the indices 4:5 for X and 3:6 for Y.
# If the strength of the correlation was higher,
# all the indices could be recovered.
res$comms
#> [[1]]
#> [[1]]$x
#> [1] 4 5
#>
#> [[1]]$y
#> [1] 3 4 5 6
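To illustrate the cov argument, here is a hypothetical variation of the example in which a batch indicator shifts both measurement sets; the batch variable and its encoding are illustrative and not part of the package.

```r
library(cbce)

set.seed(1245)
n <- 40
# Hypothetical batch indicator affecting both measurement sets
# (illustrative; not part of the cbce package).
batch <- rep(c(0, 1), each = n / 2)

# batch (length n) recycles down each column, shifting every variable
X <- matrix(rnorm(20 * n), nrow = n) + batch
Y <- matrix(rnorm(50 * n), nrow = n) + batch

# Passing the covariate as a one-column matrix asks cbce to remove its
# effect before searching for correlated variable sets.
res <- cbce(X, Y, cov = matrix(batch, ncol = 1))
```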