Deciphering individual cell phenotypes from cell-specific transcriptional processes requires high-dimensional single cell RNA sequencing. However, current dimensionality reduction methods aggregate sparse gene information across cells without directly measuring the relationships that exist between genes. By performing dimensionality reduction with respect to gene co-expression, low-dimensional features can model these gene-specific relationships and leverage shared signal to overcome sparsity. We describe GeneVector, a scalable framework for dimensionality reduction implemented as a vector space model using mutual information between the expression of genes. Unlike other methods, including principal component analysis and variational autoencoders, GeneVector uses latent space arithmetic in a lower dimensional gene embedding to identify transcriptional programs and classify cell types. In this work, we show in four single cell RNA-seq datasets that GeneVector captures phenotype-specific pathways, performs batch effect correction, interactively annotates cell types, and identifies pathway variation with treatment over time.

Maintenance of cell state and execution of cellular function are based on coordinated activity within networks of related genes. To approximate these connections, transcriptomic studies have conceptually organized the transcriptome into sets of co-regulated genes, termed gene programs [1] or metagenes [2]. The first intuitive step in identifying such co-regulated genes is to reduce the dimensionality of sparse expression measurements: high-dimensional gene expression data is compressed into a minimal set of explanatory features that highlight similarities in cellular function. However, to map existing biological knowledge to each cell, the derived features must be interpretable at the gene level.

To find similarities in lower dimensions, biology can borrow from the field of natural language processing (NLP). NLP commonly uses dimensionality reduction to identify word associations within a body of text [3, 4]. To find contextually similar words, NLP methods use vector space models to represent similarities in a lower dimensional space. Similar methodology has been applied to bulk RNA-seq expression to find co-expression patterns [5]. Inspired by such work, we developed a tool that generates gene vectors from single cell RNA (scRNA)-seq expression data. While current methods reduce dimensionality with respect to sparse expression across each cell, our tool produces a lower dimensional embedding with respect to each gene. The vectors derived from GeneVector provide a framework for identifying metagenes within a gene co-expression graph and relating these metagenes back to each cell using latent space arithmetic.

The most pervasive method for identifying the sources of variation in scRNA-seq studies is principal component analysis (PCA) [6, 7, 8]. The relationship of principal components to gene expression is linear, allowing lower dimensional structure to be directly related to variation in expression. A PCA embedding is an ideal input for building a nearest neighbor graph for unsupervised clustering algorithms [9] and for visualization methods including t-SNE [10] and UMAP [11]. However, the assumption of a continuous multivariate Gaussian distribution distorts the modeling of read counts, whose true distribution is over-dispersed and possibly zero-inflated [12], with positive support and a mean close to zero [2]. Despite such issues, gene programs generated from PCA loadings have been used to generate metagenes that explain each principal component [13]. While these loadings highlight sets of genes that explain each orthogonal axis of variation, pathways and cell type signatures can be conflated within a single axis. In addition to PCA, more sophisticated methods have been developed to better handle the specific challenges of scRNA data. The single cell variational inference (scVI) framework [14] generates an embedding using non-linear autoencoders that can be used in a range of analyses including normalization, batch correction, gene-dropout correction, and visualization.
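The two ideas named in the text, a gene-level embedding built from mutual information and latent space arithmetic that relates cells to metagenes, can be illustrated with a small Python sketch. Everything here is an assumption made for illustration and is not GeneVector's API or training procedure: the counts are simulated, mutual information is estimated from a simple 2D histogram, the gene vectors come from an eigendecomposition of the mutual information matrix (a stand-in chosen for brevity, since the text does not specify how the vectors are fit), and a cell vector is taken to be the expression-weighted average of gene vectors. All function and variable names are hypothetical.

```python
import numpy as np


def mutual_information(x, y, bins=4):
    """Histogram estimate of mutual information (in nats) between two genes."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)  # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)  # marginal of y, shape (1, bins)
    expected = px @ py                   # joint distribution under independence
    nz = pxy > 0                         # skip empty bins to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / expected[nz])))


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Simulated counts (assumption): two independent programs drive three genes each.
rng = np.random.default_rng(0)
n_cells = 600
program_a = rng.poisson(3.0, n_cells)
program_b = rng.poisson(3.0, n_cells)
counts = np.column_stack(
    [program_a + rng.poisson(0.3, n_cells) for _ in range(3)]    # genes 0-2
    + [program_b + rng.poisson(0.3, n_cells) for _ in range(3)]  # genes 3-5
)
n_genes = counts.shape[1]

# Pairwise mutual information between genes, computed across cells.
mi = np.array([
    [mutual_information(counts[:, i], counts[:, j]) for j in range(n_genes)]
    for i in range(n_genes)
])
mi = (mi + mi.T) / 2  # remove floating-point asymmetry

# Stand-in embedding: scale the top eigenvectors of the MI matrix.
vals, vecs = np.linalg.eigh(mi)  # eigenvalues in ascending order
dim = 2
gene_vectors = vecs[:, -dim:] * np.sqrt(vals[-dim:])

# Metagene vectors: the mean gene vector of each co-expressed set.
metagene_a = gene_vectors[:3].mean(axis=0)
metagene_b = gene_vectors[3:].mean(axis=0)

# Latent space arithmetic: a cell is the expression-weighted average of
# gene vectors, scored against each metagene by cosine similarity.
cell_counts = np.array([9, 8, 10, 0, 1, 0])  # a cell dominated by program A
cell_vector = cell_counts @ gene_vectors / cell_counts.sum()
score_a = cosine(cell_vector, metagene_a)
score_b = cosine(cell_vector, metagene_b)
print(f"similarity to metagene A: {score_a:.3f}, to metagene B: {score_b:.3f}")
```

In this toy setting, genes driven by the same program share high mutual information and end up pointing in similar directions, so a cell that mostly expresses program A genes scores higher against metagene A than against metagene B. Because the embedding is indexed by gene rather than by cell, each score can be traced back to named genes.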