
Vector Representation of Gene Co-expression in Single Cell RNA-Seq

EasyChair Preprint no. 4571

11 pages. Date: November 15, 2020

Abstract

The sparsity of gene expression is a well-known problem in single-cell RNA-seq data. Because of this effect, known as dropout, the gene expression observed for each cell is only a fraction of its total transcriptome. Several techniques have been adopted to address this challenge, including variable gene selection and expression imputation. We present an approach for finding dense vector representations of genes from co-expression that can be used in place of the sparse expression profile over cells. Because it draws on co-expression across all cells, each gene vector is a meaningful representation that is independent of missing data from individual cells. Similar genes, measured by cosine similarity between vectors, are found to correspond to known cell type markers. Using latent space arithmetic, these gene vectors can be combined additively to describe each cell accurately and to generate a low-dimensional cell embedding. It is also possible to decompose and subtract sources of variation, including batch effects. Any feature that can be described as a set of genes can be represented as a composite of vectors. We demonstrate the application of these vectors to the identification of cell type markers, dimensionality reduction, and batch correction.
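
The vector operations named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the gene names and vector values are toy placeholders, the expression-weighted average used for the cell embedding is an assumed form of "latent space arithmetic", and removing a projection is only one possible way to subtract a source of variation such as a batch effect.

```python
import numpy as np

# Toy gene vectors (hypothetical names and values, for illustration only).
gene_vectors = {
    "GENE_A": np.array([1.0, 0.1, 0.0]),
    "GENE_B": np.array([0.9, 0.2, 0.0]),
    "GENE_C": np.array([0.0, 1.0, 0.1]),
    "GENE_D": np.array([0.0, 0.1, 1.0]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two gene vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(gene, vectors, top_n=1):
    """Rank the other genes by cosine similarity to `gene`."""
    query = vectors[gene]
    scores = {g: cosine_similarity(query, v)
              for g, v in vectors.items() if g != gene}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

def embed_cell(expression, vectors):
    """Assumed cell embedding: expression-weighted average of the
    vectors of the genes observed (count > 0) in the cell."""
    genes = [g for g, x in expression.items() if x > 0 and g in vectors]
    weights = np.array([expression[g] for g in genes], dtype=float)
    stacked = np.stack([vectors[g] for g in genes])
    return weights @ stacked / weights.sum()

def subtract_component(cell_vector, component):
    """Assumed correction step: remove the part of a cell vector that
    lies along a composite vector (e.g. one built from a gene set)."""
    unit = component / np.linalg.norm(component)
    return cell_vector - np.dot(cell_vector, unit) * unit

nearest = most_similar("GENE_A", gene_vectors)
cell = embed_cell({"GENE_A": 2, "GENE_B": 2, "GENE_C": 0}, gene_vectors)
corrected = subtract_component(cell, gene_vectors["GENE_C"])
```

Genes with zero counts do not contribute to the cell vector, so a dropout event changes the weighting but not the gene vectors themselves, which is the property the abstract emphasizes.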

Keyphrases: batch effect correction, dimensionality reduction, gene expression, machine learning, single-cell RNA-seq, vector representations

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@Booklet{EasyChair:4571,
  author = {Nicholas Ceglia and Florian Uhlitz and Andrew McPherson},
  title = {Vector Representation of Gene Co-expression in Single Cell RNA-Seq},
  howpublished = {EasyChair Preprint no. 4571},

  year = {EasyChair, 2020}}