Package 'markerpen' reference manual

Title:	Marker Gene Detection via Penalized Principal Component Analysis
Description:	Implementation of the 'MarkerPen' algorithm, short for marker gene detection via penalized principal component analysis, described in the paper by Qiu, Wang, Lei, and Roeder (2020, <doi:10.1101/2020.11.07.373043>). 'MarkerPen' is a semi-supervised algorithm for detecting marker genes by combining prior marker information with bulk transcriptome data.
Authors:	Yixuan Qiu, Jiebiao Wang, Jing Lei, and Kathryn Roeder
Maintainer:	Yixuan Qiu
License:	GPL
Version:	0.1.1
Built:	2025-03-06 04:48:51 UTC
Source:	https://github.com/yixuan/markerpen

Mapping gene names to Ensembl IDs

Description

A data set showing the mapping between gene names and Ensembl gene IDs, derived from the EnsDb.Hsapiens.v79 Bioconductor package.

Usage

gene_mapping

Format

A data frame with 59074 rows and 2 variables:

ensembl: Ensembl gene IDs
name: corresponding gene names
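The two columns can serve as a lookup table from gene names to Ensembl IDs. The following is a minimal sketch in base R. It uses a small hypothetical stand-in, `toy_mapping`, with the same column names but made-up values, so that it runs without the package; with the real data, `markerpen::gene_mapping` would take its place.

```r
# Hypothetical stand-in for markerpen::gene_mapping (same column names, made-up values)
toy_mapping = data.frame(
    ensembl = c("ENSG_A", "ENSG_B", "ENSG_C"),
    name    = c("geneA", "geneB", "geneC"),
    stringsAsFactors = FALSE
)

# Gene names to translate; "geneX" is deliberately absent from the mapping
query = c("geneB", "geneX", "geneA")

# match() returns the row index of each name, or NA if the name is not found
ind = match(query, toy_mapping$name)
ids = toy_mapping$ensembl[ind]

# Keep only the names that were successfully mapped
found = !is.na(ind)
mapped = data.frame(name = query[found], ensembl = ids[found],
                    stringsAsFactors = FALSE)
print(mapped)
```

Names without a match yield `NA` and are dropped here; whether to drop or report them depends on the analysis.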

Source

https://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html

Penalized Principal Component Analysis for Marker Gene Selection

Description

This function solves the optimization problem

$\min\quad-\mathrm{tr}(SX) + \lambda p(X),$

$\mathrm{s.t.}\quad O\preceq X \preceq I, \quad X \ge 0, \quad\mathrm{and}\quad \mathrm{tr}(X)=1,$

where $O\preceq X \preceq I$ means all eigenvalues of $X$ are between 0 and 1, $X \ge 0$ means all elements of $X$ are nonnegative, and $p(X)$ is a penalty function defined in the article (see the References section).
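To make the constraints concrete, the sketch below builds a candidate $X = vv^T$ from a nonnegative unit vector $v$ and checks each condition numerically. It is an illustration only, not the package's solver: the data are simulated, the names `v`, `X`, and `tol` are hypothetical, and the penalty $p(X)$ is not evaluated because it is defined only in the article.

```r
set.seed(1)
p = 5

# A toy sample correlation matrix S from simulated (hypothetical) data
S = cor(matrix(rnorm(50 * p), 50, p))

# Candidate solution: X = v v' with v nonnegative and of unit length
v = abs(rnorm(p))
v = v / sqrt(sum(v^2))
X = tcrossprod(v)

# Check the three constraints up to a small numerical tolerance
tol = 1e-8
ev = eigen(X, symmetric = TRUE, only.values = TRUE)$values
eig_ok   = all(ev >= -tol & ev <= 1 + tol)  # all eigenvalues between 0 and 1
nonneg   = all(X >= 0)                      # all elements nonnegative
trace_ok = abs(sum(diag(X)) - 1) < tol      # tr(X) = 1

# Unpenalized part of the objective; for a rank-one X it equals -v' S v
obj = -sum(diag(S %*% X))
print(c(eig_ok = eig_ok, nonneg = nonneg, trace_ok = trace_ok))
```

For such a rank-one $X$, the term $-\mathrm{tr}(SX)$ reduces to $-v^T S v$, the usual principal component objective with a sign flip.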

Usage

pca_pen(
  S,
  gr,
  lambda,
  w = 1.5,
  alpha = 0.01,
  maxit = 1000,
  eps = 1e-04,
  verbose = 0
)

Arguments

`S`	The sample correlation matrix of gene expression.
`gr`	Indices of genes that are treated as markers in the prior information.
`lambda`	Tuning parameter to control the sparsity of eigenvectors.
`w`	Tuning parameter to control the weight on prior information. Larger $w$ means genes not in the prior list are less likely to be selected as markers.
`alpha`	Step size of the optimization algorithm.
`maxit`	Maximum number of iterations.
`eps`	Tolerance parameter for convergence.
`verbose`	Level of verbosity.

Value

A list containing the following components:

projection: The estimated projection matrix.
evecs: The estimated eigenvectors.
niter: Number of iterations used in the optimization process.
err_v: The optimization error in each iteration.

References

Qiu, Y., Wang, J., Lei, J., & Roeder, K. (2020). Identification of cell-type-specific marker genes from co-expression patterns in tissue samples.

Examples

set.seed(123)
n = 200  # Sample size
p = 500  # Number of genes
s = 50   # Number of true signals

# The first s genes are true markers, and others are noise
Sigma = matrix(0, p, p)
Sigma[1:s, 1:s] = 0.9
diag(Sigma) = 1

# Simulate data from the covariance matrix
x = matrix(rnorm(n * p), n) %*% chol(Sigma)

# Sample correlation matrix
S = cor(x)

# Indices of prior marker genes
# Note that we have omitted 10 true markers, and included 10 false markers
gr = c(1:(s - 10), (s + 11):(s + 20))

# Run the algorithm
res = pca_pen(S, gr, lambda = 0.1, verbose = 1)

# See if we can recover the true correlation structure
image(res$projection, asp = 1)

Marker Gene Selection via Penalized Principal Component Analysis

Description

This function refines a prior marker gene list by combining it with information from bulk tissue data, using penalized principal component analysis. The current implementation handles one cell type at a time. To get marker genes for multiple cell types, call this function once per cell type.
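One way to organize the repeated calls is a loop that removes already-selected genes before moving on to the next cell type. The sketch below is an assumption about workflow, not part of the package: `refine_all` and `stub_refine` are hypothetical names, and the stub merely returns the prior markers so that the snippet runs without the package. In real use, `refine_fun` would be `markerpen::refine_markers`.

```r
# Loop over cell types, removing selected markers before the next call.
# `refine_fun` is expected to follow the interface of refine_markers().
refine_all = function(mat_exp, ranges, priors, lambda, refine_fun, ...) {
    result = list()
    for (type in names(ranges)) {
        # Restrict the search range to genes still present in the matrix
        search_range = intersect(ranges[[type]], colnames(mat_exp))
        prior_markers = intersect(priors[[type]], search_range)
        fit = refine_fun(mat_exp, search_range, prior_markers,
                         lambda = lambda, ...)
        result[[type]] = fit$markers
        # Drop selected markers so they cannot be chosen for another cell type
        keep = setdiff(colnames(mat_exp), fit$markers)
        mat_exp = mat_exp[, keep, drop = FALSE]
    }
    result
}

# Stand-in for refine_markers(), used only to make this sketch runnable:
# it returns the prior markers unchanged.
stub_refine = function(mat_exp, range, markers, lambda, ...) {
    list(markers = markers)
}

# Tiny hypothetical inputs
mat = matrix(1, 2, 4, dimnames = list(NULL, c("g1", "g2", "g3", "g4")))
ranges = list(typeA = c("g1", "g2", "g3"), typeB = c("g2", "g3", "g4"))
priors = list(typeA = c("g1", "g2"), typeB = c("g2", "g4"))
out = refine_all(mat, ranges, priors, lambda = 0.35, refine_fun = stub_refine)
print(out)
```

Because genes chosen for `typeA` are removed first, `g2` is no longer available as a prior marker for `typeB`. The order of the cell types therefore matters in this design.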

Usage

refine_markers(
  mat_exp,
  range,
  markers,
  lambda,
  w = 1.5,
  thresh = 0.001,
  alpha = 0.01,
  maxit = 1000,
  eps = 1e-04,
  verbose = 0
)

Arguments

`mat_exp`	The gene expression matrix in the original scale (not logarithm-transformed), with rows standing for observations and columns for genes. The matrix should include gene names as column names.
`range`	A character vector of gene names, representing the range of genes in which markers are sought.
`markers`	A character vector of gene names giving the prior marker gene list.
`lambda`	A tuning parameter to control the number of selected marker genes. A larger value typically means a smaller number of genes.
`w`	Tuning parameter to control the weight on prior information. Larger $w$ means genes not in the prior list are less likely to be selected as markers.
`thresh`	Threshold for factor loadings. Loadings smaller than this value are treated as zero.
`alpha`	Step size of the optimization algorithm.
`maxit`	Maximum number of iterations.
`eps`	Tolerance parameter for convergence.
`verbose`	Level of verbosity.
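The role of `thresh` can be illustrated with a small base R sketch: take the leading eigenvector of a projection matrix, set loadings below the threshold to zero, and keep the genes that remain. This is a hedged illustration of the idea under made-up inputs (`genes`, `v`, `P`), not the package's internal code.

```r
genes = paste0("gene", 1:6)

# Hypothetical loading vector: only the first three genes carry signal
v = c(0.7, 0.6, 0.39, 0.0004, 0.0002, 0)
v = v / sqrt(sum(v^2))

# Rank-one projection matrix with gene names as dimnames
P = tcrossprod(v)
dimnames(P) = list(genes, genes)

# Leading eigenvector; flip the sign if needed so that loadings are nonnegative
ev = eigen(P, symmetric = TRUE)$vectors[, 1]
ev = ev * sign(sum(ev))

# Treat loadings below the threshold as zero and keep the remaining genes
thresh = 0.001
coefs = ifelse(ev < thresh, 0, ev)
selected = genes[coefs > 0]
print(selected)
```

A larger `thresh` discards more weakly loaded genes, so it acts as a secondary filter alongside `lambda`.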

Value

A list containing the following components:

spca: The sparse PCA result as in pca_pen().
markers: A character vector of selected marker genes.
markers_coef: The estimated factor loadings for the associated genes.

References

Qiu, Y., Wang, J., Lei, J., & Roeder, K. (2020). Identification of cell-type-specific marker genes from co-expression patterns in tissue samples.

Examples

# Data used in the vignette
load(system.file("examples", "gene_expr.RData", package = "markerpen"))
load(system.file("examples", "published_markers.RData", package = "markerpen"))
load(system.file("examples", "markers_range.RData", package = "markerpen"))

# Get expression matrix - rows are observations, columns are genes
ind = match(rownames(dat), markerpen::gene_mapping$name)
ind = na.omit(ind)
ensembl = markerpen::gene_mapping$ensembl[ind]
mat_exp = t(dat[markerpen::gene_mapping$name[ind], ])
colnames(mat_exp) = ensembl

# We compute the marker genes for two cell types with a reduced problem size
# See the vignette for the full example

# Markers for astrocytes
set.seed(123)
search_range = intersect(markers_range$astrocytes, ensembl)
search_range = sample(search_range, 300)
prior_markers = intersect(pub_markers$astrocytes, search_range)
ast_re = refine_markers(
    mat_exp, search_range, prior_markers,
    lambda = 0.35, w = 1.5, maxit = 500, eps = 1e-3, verbose = 0
)
# Remove selected markers from the expression matrix
mat_rest = mat_exp[, setdiff(colnames(mat_exp), ast_re$markers)]

# Markers for microglia, searched among the genes not already selected
search_range = intersect(markers_range$microglia, colnames(mat_rest))
search_range = sample(search_range, 300)
prior_markers = intersect(pub_markers$microglia, search_range)
mic_re = refine_markers(
    mat_rest, search_range, prior_markers,
    lambda = 0.35, w = 1.5, maxit = 500, eps = 1e-3, verbose = 0
)

# Refined markers
markers_re = list(astrocytes = ast_re$markers,
                  microglia  = mic_re$markers)
# Visualize the correlation matrix
cor_markers = cor(mat_exp[, unlist(markers_re)])
image(cor_markers, asp = 1)

# Post-process the selected markers
# Pick the first 20 ordered markers
markers_ord = sort_markers(cor_markers, markers_re)
markers_ord = lapply(markers_ord, head, n = 20)
# Visualize the correlation matrix
image(cor(mat_exp[, unlist(markers_ord)]), asp = 1)

Post-processing Selected Marker Genes

Description

This function reorders the selected marker genes using information from the sample correlation matrix.

Usage

sort_markers(corr, markers)

Arguments

`corr`	The sample correlation matrix, whose row and column names are gene names.
`markers`	A list of marker genes. Each component of the list is a vector of marker gene names corresponding to a cell type. All the gene names in this list must appear in the row/column names of `corr`.
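Since every gene name in `markers` must appear in the row and column names of `corr`, it can help to check this before the call. The following sketch does so in base R with hypothetical inputs; it only validates and filters the list and does not reproduce the ordering performed by sort_markers().

```r
# Hypothetical correlation matrix with gene names as row and column names
genes = c("g1", "g2", "g3", "g4")
set.seed(1)
corr = cor(matrix(rnorm(40), 10, 4, dimnames = list(NULL, genes)))

# Marker list: one character vector per cell type; "g5" is not in corr
markers = list(typeA = c("g1", "g2"), typeB = c("g3", "g5"))

# Gene names that would violate the requirement
missing_genes = setdiff(unlist(markers, use.names = FALSE), rownames(corr))

# Drop unknown genes from each component, preserving the list structure
markers_ok = lapply(markers, intersect, rownames(corr))

print(missing_genes)
print(markers_ok)
```

The filtered list `markers_ok` keeps the same component names as `markers`, so it can be passed on in place of the original.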

Value

A list that has the same structure as the input markers argument, with the elements in each component reordered. See the example in refine_markers().
