Title: | Marker Gene Detection via Penalized Principal Component Analysis |
---|---|
Description: | Implementation of the 'MarkerPen' algorithm, short for marker gene detection via penalized principal component analysis, described in the paper by Qiu, Wang, Lei, and Roeder (2020, <doi:10.1101/2020.11.07.373043>). 'MarkerPen' is a semi-supervised algorithm for detecting marker genes by combining prior marker information with bulk transcriptome data. |
Authors: | Yixuan Qiu, Jiebiao Wang, Jing Lei, and Kathryn Roeder |
Maintainer: | Yixuan Qiu <[email protected]> |
License: | GPL |
Version: | 0.1.1 |
Built: | 2025-01-05 04:48:30 UTC |
Source: | https://github.com/yixuan/markerpen |
A data set showing the mapping between gene names and Ensembl gene IDs, derived from the EnsDb.Hsapiens.v79 Bioconductor package.
gene_mapping
gene_mapping
A data frame with 59074 rows and 2 variables:
Ensembl gene IDs
corresponding gene names
https://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html
This function solves the optimization problem
where means all eigenvalues of
are
between 0 and 1,
means all elements of
are nonnegative,
and
is a penalty function defined in the article
(see the References section).
pca_pen( S, gr, lambda, w = 1.5, alpha = 0.01, maxit = 1000, eps = 1e-04, verbose = 0 )
pca_pen( S, gr, lambda, w = 1.5, alpha = 0.01, maxit = 1000, eps = 1e-04, verbose = 0 )
S |
The sample correlation matrix of gene expression. |
gr |
Indices of genes that are treated as markers in the prior information. |
lambda |
Tuning parameter to control the sparsity of eigenvectors. |
w |
Tuning parameter to control the weight on prior information.
Larger |
alpha |
Step size of the optimization algorithm. |
maxit |
Maximum number of iterations. |
eps |
Tolerance parameter for convergence. |
verbose |
Level of verbosity. |
A list containing the following components:
The estimated projection matrix.
The estimated eigenvectors.
Number of iterations used in the optimization process.
The optimization error in each iteration.
Qiu, Y., Wang, J., Lei, J., & Roeder, K. (2020). Identification of cell-type-specific marker genes from co-expression patterns in tissue samples.
set.seed(123) n = 200 # Sample size p = 500 # Number of genes s = 50 # Number of true signals # The first s genes are true markers, and others are noise Sigma = matrix(0, p, p) Sigma[1:s, 1:s] = 0.9 diag(Sigma) = 1 # Simulate data from the covariance matrix x = matrix(rnorm(n * p), n) %*% chol(Sigma) # Sample correlation matrix S = cor(x) # Indices of prior marker genes # Note that we have omitted 10 true markers, and included 10 false markers gr = c(1:(s - 10), (s + 11):(s + 20)) # Run the algorithm res = pca_pen(S, gr, lambda = 0.1, verbose = 1) # See if we can recover the true correlation structure image(res$projection, asp = 1)
set.seed(123) n = 200 # Sample size p = 500 # Number of genes s = 50 # Number of true signals # The first s genes are true markers, and others are noise Sigma = matrix(0, p, p) Sigma[1:s, 1:s] = 0.9 diag(Sigma) = 1 # Simulate data from the covariance matrix x = matrix(rnorm(n * p), n) %*% chol(Sigma) # Sample correlation matrix S = cor(x) # Indices of prior marker genes # Note that we have omitted 10 true markers, and included 10 false markers gr = c(1:(s - 10), (s + 11):(s + 20)) # Run the algorithm res = pca_pen(S, gr, lambda = 0.1, verbose = 1) # See if we can recover the true correlation structure image(res$projection, asp = 1)
This function refines a prior marker gene list by combining information from bulk tissue data, based on the penalized principal component analysis. The current implementation computes on one cell type at a time. To get marker genes for multiple cell types, call this function iteratively.
refine_markers( mat_exp, range, markers, lambda, w = 1.5, thresh = 0.001, alpha = 0.01, maxit = 1000, eps = 1e-04, verbose = 0 )
refine_markers( mat_exp, range, markers, lambda, w = 1.5, thresh = 0.001, alpha = 0.01, maxit = 1000, eps = 1e-04, verbose = 0 )
mat_exp |
The gene expression matrix in the original scale (not logarithm-transformed), with rows standing for observations and columns for genes. The matrix should include gene names as column names. |
range |
A character vector of gene names, representing the range of genes in which markers are sought. |
markers |
A character vector of gene names giving the prior marker gene list. |
lambda |
A tuning parameter to control the number of selected marker genes. A larger value typically means a smaller number of genes. |
w |
Tuning parameter to control the weight on prior information.
Larger |
thresh |
Below this threshold small factor loadings are treated as zeros. |
alpha |
Step size of the optimization algorithm. |
maxit |
Maximum number of iterations. |
eps |
Tolerance parameter for convergence. |
verbose |
Level of verbosity. |
A list containing the following components:
The sparse PCA result as in pca_pen()
.
A character vector of selected markers genes.
The estimated factor loadings for the associated genes.
Qiu, Y., Wang, J., Lei, J., & Roeder, K. (2020). Identification of cell-type-specific marker genes from co-expression patterns in tissue samples.
# Data used in the vignette load(system.file("examples", "gene_expr.RData", package = "markerpen")) load(system.file("examples", "published_markers.RData", package = "markerpen")) load(system.file("examples", "markers_range.RData", package = "markerpen")) # Get expression matrix - rows are observations, columns are genes ind = match(rownames(dat), markerpen::gene_mapping$name) ind = na.omit(ind) ensembl = markerpen::gene_mapping$ensembl[ind] mat_exp = t(dat[markerpen::gene_mapping$name[ind], ]) colnames(mat_exp) = ensembl # We compute the marker genes for two cell types with a reduced problem size # See the vignette for the full example # Markers for astrocytes set.seed(123) search_range = intersect(markers_range$astrocytes, ensembl) search_range = sample(search_range, 300) prior_markers = intersect(pub_markers$astrocytes, search_range) ast_re = refine_markers( mat_exp, search_range, prior_markers, lambda = 0.35, w = 1.5, maxit = 500, eps = 1e-3, verbose = 0 ) # Remove selected markers from the expression matrix mat_rest = mat_exp[, setdiff(colnames(mat_exp), ast_re$markers)] # Markers for microglia search_range = intersect(markers_range$microglia, ensembl) search_range = sample(search_range, 300) prior_markers = intersect(pub_markers$microglia, search_range) mic_re = refine_markers( mat_exp, search_range, prior_markers, lambda = 0.35, w = 1.5, maxit = 500, eps = 1e-3, verbose = 0 ) # Refined markers markers_re = list(astrocytes = ast_re$markers, microglia = mic_re$markers) # Visualize the correlation matrix cor_markers = cor(mat_exp[, unlist(markers_re)]) image(cor_markers, asp = 1) # Post-process the selected markers # Pick the first 20 ordered markers markers_ord = sort_markers(cor_markers, markers_re) markers_ord = lapply(markers_ord, head, n = 20) # Visualize the correlation matrix image(cor(mat_exp[, unlist(markers_ord)]), asp = 1)
# Data used in the vignette load(system.file("examples", "gene_expr.RData", package = "markerpen")) load(system.file("examples", "published_markers.RData", package = "markerpen")) load(system.file("examples", "markers_range.RData", package = "markerpen")) # Get expression matrix - rows are observations, columns are genes ind = match(rownames(dat), markerpen::gene_mapping$name) ind = na.omit(ind) ensembl = markerpen::gene_mapping$ensembl[ind] mat_exp = t(dat[markerpen::gene_mapping$name[ind], ]) colnames(mat_exp) = ensembl # We compute the marker genes for two cell types with a reduced problem size # See the vignette for the full example # Markers for astrocytes set.seed(123) search_range = intersect(markers_range$astrocytes, ensembl) search_range = sample(search_range, 300) prior_markers = intersect(pub_markers$astrocytes, search_range) ast_re = refine_markers( mat_exp, search_range, prior_markers, lambda = 0.35, w = 1.5, maxit = 500, eps = 1e-3, verbose = 0 ) # Remove selected markers from the expression matrix mat_rest = mat_exp[, setdiff(colnames(mat_exp), ast_re$markers)] # Markers for microglia search_range = intersect(markers_range$microglia, ensembl) search_range = sample(search_range, 300) prior_markers = intersect(pub_markers$microglia, search_range) mic_re = refine_markers( mat_exp, search_range, prior_markers, lambda = 0.35, w = 1.5, maxit = 500, eps = 1e-3, verbose = 0 ) # Refined markers markers_re = list(astrocytes = ast_re$markers, microglia = mic_re$markers) # Visualize the correlation matrix cor_markers = cor(mat_exp[, unlist(markers_re)]) image(cor_markers, asp = 1) # Post-process the selected markers # Pick the first 20 ordered markers markers_ord = sort_markers(cor_markers, markers_re) markers_ord = lapply(markers_ord, head, n = 20) # Visualize the correlation matrix image(cor(mat_exp[, unlist(markers_ord)]), asp = 1)
This function reorders the selected marker genes using information of the sample correlation matrix.
sort_markers(corr, markers)
sort_markers(corr, markers)
corr |
The sample correlation matrix, whose row and column names are gene names. |
markers |
A list of marker genes. Each component of the list is a vector of marker
gene names corresponding to a cell type. All the gene names in this list
must appear in the row/column names of |
A list that has the same structure as the input markers
argument, with
the elements in each component reordered. See the example in
refine_markers()
.