  • Article
Discrete latent embedding of single-cell chromatin accessibility sequencing data for uncovering cell heterogeneity

Abstract

Single-cell epigenomic data have been growing continuously at an unprecedented pace, but their characteristics, such as high dimensionality and sparsity, pose substantial challenges to downstream analysis. Although deep learning models, especially variational autoencoders, have been widely used to capture low-dimensional feature embeddings, the prevalent Gaussian assumption somewhat disagrees with real data, and these models tend to struggle to incorporate reference information from abundant cell atlases. Here we propose CASTLE, a deep generative model based on the vector-quantized variational autoencoder framework to extract discrete latent embeddings that interpretably characterize single-cell chromatin accessibility sequencing data. We validate the performance and robustness of CASTLE for accurate cell-type identification and reasonable visualization compared with state-of-the-art methods. We demonstrate the advantages of CASTLE for effective incorporation of existing massive reference datasets in a weakly supervised or supervised manner. We further demonstrate CASTLE’s capacity for intuitively distilling cell-type-specific feature spectra that unveil cell heterogeneity and biological implications quantitatively.
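
To make the idea of a discrete latent embedding concrete, the following is a minimal sketch of the quantization step used in vector-quantized variational autoencoders in general (ref. 18): each continuous encoder output is replaced by its nearest vector in a learned codebook. It is an illustration only and is not taken from the CASTLE source code; the function names, the toy codebook and the toy encoder outputs are hypothetical, and the way CASTLE arranges its latent vectors is not described in this preview.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    encodings: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[List[float]]]:
    """Map each encoder output to the index and value of its nearest code vector."""
    indices: List[int] = []
    quantized: List[List[float]] = []
    for z in encodings:
        # Nearest-neighbour lookup over the codebook entries.
        k = min(range(len(codebook)), key=lambda j: squared_distance(z, codebook[j]))
        indices.append(k)
        quantized.append(list(codebook[k]))
    return indices, quantized


# Hypothetical codebook with three 2-dimensional code vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
# Hypothetical continuous encoder outputs for three cells.
encodings = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

indices, quantized = quantize(encodings, codebook)
print(indices)  # [0, 1, 2]
```

The integer indices form the discrete representation of each cell, and the selected code vectors are what a decoder would receive in place of the continuous encoder outputs.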

Fig. 1: Overview of CASTLE framework.
Fig. 2: Evaluation of CASTLE compared with baseline methods.
Fig. 3: Performance of batch correction for CASTLE compared with baseline methods.
Fig. 4: Robustness analysis for CASTLE compared with baseline methods.
Fig. 5: Performance of reference incorporation for CASTLE compared with other baseline methods.
Fig. 6: Feature spectrum analysis and biological implications of the cell-type-specific peaks identified by CASTLE.

Data availability

The splenocyte dataset, with peaks in the GRCm38/mm10 genome, was downloaded from ArrayExpress via accession no. E-MTAB-6714 (ref. 66). The InSilico dataset, with peaks in the GRCh37/hg19 genome, was constructed by computationally combining six scCAS experiments that were performed individually on different cell lines, and was downloaded from the Gene Expression Omnibus (GEO) under accession no. GSE65360 (ref. 1). The droplet dataset, with peaks in the GRCh37/hg19 genome, measures chromatin accessibility across 136,463 resting and stimulated human bone marrow-derived cells and was downloaded from GEO under accession no. GSE123580 (ref. 25). The mouse chromatin accessibility atlas datasets, with peaks in the GRCm37/mm9 genome, were downloaded from http://atlas.gs.washington.edu/mouse-atac (ref. 37). The human cell atlas of fetal chromatin accessibility, with peaks in the GRCh37/hg19 genome, was downloaded from GEO under accession no. GSE149683 (ref. 26). The brain dataset, with peaks in the GRCm38/mm10 genome, consists of two batches assayed by single-nucleus assay for transposase-accessible chromatin using sequencing (snATAC) and 10X (refs. 12,32), and was downloaded from GEO under accession no. GSE126724 and from https://support.10xgenomics.com/single-cell-atac/datasets/1.1.0/atac_v1_adult_brain_fresh_5k. The immune dataset, with peaks in the GRCh37/hg19 genome, is derived from human peripheral blood and bone marrow and was downloaded from GEO under accession no. GSE129785 (ref. 56). Source Data are provided with this paper.

Code availability

The CASTLE software, including detailed documentation and a tutorial, is freely available on GitHub (https://github.com/cuixj19/CASTLE). The source code is also available via Zenodo at https://doi.org/10.5281/zenodo.10906304 (ref. 67).

References

  1. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).

  2. Cusanovich, D. A. et al. Multiplex single-cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015).

  3. Klemm, S. L., Shipony, Z. & Greenleaf, W. J. Chromatin accessibility and the regulatory epigenome. Nat. Rev. Genet. 20, 207–220 (2019).

  4. Chen, H. et al. Assessment of computational methods for the analysis of single-cell ATAC-seq data. Genome Biol. 20, 1–25 (2019).

  5. Xiong, L. et al. SCALE method for single-cell ATAC-seq analysis via latent feature extraction. Nat. Commun. 10, 4576 (2019).

  6. Ding, J. & Regev, A. Deep generative model embedding of single-cell RNA-Seq profiles on hyperspheres and hyperbolic spaces. Nat. Commun. 12, 2554 (2021).

  7. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).

  8. Gao, Z. et al. scEpiTools: a database to comprehensively interrogate analytic tools for single-cell epigenomic data. J. Genet. Genom. 51, 462–465 (2024).

  9. Bravo González-Blas, C. et al. cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data. Nat. Methods 16, 397–400 (2019).

  10. Yuan, H. & Kelley, D. R. scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks. Nat. Methods 19, 1088–1096 (2022).

  11. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. In Proc. 2nd International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) http://arxiv.org/abs/1312.6114 (ICLR, 2014).

  12. Xiong, L. et al. Online single-cell data integration through projecting heterogeneous datasets into a common cell-embedding space. Nat. Commun. 13, 6118 (2022).

  13. Cao, Z.-J. & Gao, G. Multi-omics single-cell data integration and regulatory inference with graph-linked embedding. Nat. Biotechnol. 40, 1458–1466 (2022).

  14. Ashuach, T., Reidenbach, D. A., Gayoso, A. & Yosef, N. PeakVI: a deep generative model for single-cell chromatin accessibility analysis. Cell Rep. Methods 2, 100182 (2022).

  15. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).

  16. Gayoso, A. et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat. Methods 18, 272–282 (2021).

  17. Ashuach, T. et al. MultiVI: deep generative model for the integration of multimodal data. Nat. Methods 20, 1222–1231 (2023).

  18. van den Oord, A., Vinyals, O. & Kavukcuoglu, K. Neural discrete representation learning. In Proc. 31st Conference on Neural Information Processing Systems 6309–6318 (Curran Associates Inc., 2017).

  19. Razavi, A., van den Oord, A. & Vinyals, O. Generating diverse high-fidelity images with VQ-VAE-2. In Proc. 33rd Conference on Neural Information Processing Systems 1331 (Curran Associates Inc., 2019).

  20. Kobayashi, H., Cheveralls, K. C., Leonetti, M. D. & Royer, L. A. Self-supervised deep learning encodes high-resolution features of protein subcellular localization. Nat. Methods 19, 995–1003 (2022).

  21. Hubert, L. & Arabie, P. Comparing partitions. J. Classif. 2, 193–218 (1985).

  22. Romano, S., Vinh, N. X., Bailey, J. & Verspoor, K. Adjusting for chance clustering comparison measures. J. Mach. Learn. Res. 17, 1–32 (2016).

  23. Strehl, A. & Ghosh, J. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3, 583–617 (2002).

  24. Fowlkes, E. B. & Mallows, C. L. A method for comparing two hierarchical clusterings. J. Am. Stat. Assoc. 78, 553–569 (1983).

  25. Lareau, C. A. et al. Droplet-based combinatorial indexing for massive-scale single-cell chromatin accessibility. Nat. Biotechnol. 37, 916–924 (2019).

  26. Domcke, S. et al. A human cell atlas of fetal chromatin accessibility. Science 370, eaba7612 (2020).

  27. McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: uniform manifold approximation and projection. J. Open Source Softw. 3, 861 (2018).

  28. Argelaguet, R., Cuomo, A. S., Stegle, O. & Marioni, J. C. Computational principles and challenges in single-cell data integration. Nat. Biotechnol. 39, 1202–1215 (2021).

  29. Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).

  30. Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. 37, 685–691 (2019).

  31. Barkas, N. et al. Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat. Methods 16, 695–698 (2019).

  32. Fang, R. et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat. Commun. 12, 1337 (2021).

  33. Kopp, W., Akalin, A. & Ohler, U. Simultaneous dimensionality reduction and integration for single-cell ATAC-seq data using deep learning. Nat. Mach. Intell. 4, 162–168 (2022).

  34. Chen, S., Wang, R., Long, W. & Jiang, R. ASTER: accurately estimating the number of cell types in single-cell chromatin accessibility data. Bioinformatics 39, btac842 (2023).

  35. Dudoit, S. & Fridlyand, J. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol. 3, 1–21 (2002).

  36. Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 310 (2019).

  37. Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324.e18 (2018).

  38. Chen, S. et al. RA3 is a reference-guided approach for epigenetic characterization of single cells. Nat. Commun. 12, 2177 (2021).

  39. Lin, Y. et al. scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning. Nat. Biotechnol. 40, 703–710 (2022).

  40. Li, Z., Chen, X., Zhang, X., Jiang, R. & Chen, S. Latent feature extraction with a prior-based self-attention framework for spatial transcriptomics. Genome Res. 33, 1757–1773 (2023).

  41. Slowikowski, K., Hu, X. & Raychaudhuri, S. SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci. Bioinformatics 30, 2496–2497 (2014).

  42. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).

  43. Danese, A. et al. EpiScanpy: integrated single-cell epigenomic analysis. Nat. Commun. 12, 5228 (2021).

  44. Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37, 547–554 (2019).

  45. Street, K. et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19, 1–16 (2018).

  46. Chen, X. et al. Cell type annotation of single-cell chromatin accessibility data via supervised Bayesian embedding. Nat. Mach. Intell. 4, 116–126 (2022).

  47. Liu, Q., Chen, S., Jiang, R. & Wong, W. H. Simultaneous deep generative modelling and clustering of single-cell genomic data. Nat. Mach. Intell. 3, 536–544 (2021).

  48. Zamanighomi, M. et al. Unsupervised clustering and epigenetic classification of single cells. Nat. Commun. 9, 2410 (2018).

  49. Stuart, T., Srivastava, A., Madad, S., Lareau, C. A. & Satija, R. Single-cell chromatin state analysis with Signac. Nat. Methods 18, 1333–1341 (2021).

  50. Kaiser, L. et al. Fast decoding in sequence models using discrete latent variables. In Proc. 35th International Conference on Machine Learning 2390–2399 (PMLR, 2018).

  51. Peng, J., Liu, D., Xu, S. & Li, H. Generating diverse structure for image inpainting with hierarchical VQ-VAE. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 10775–10784 (IEEE, 2021).

  52. Williams, W. et al. Hierarchical quantized autoencoders. In Proc. 34th International Conference on Neural Information Processing Systems 4524–4535 (ACM, 2020).

  53. Takida, Y. et al. SQ-VAE: variational Bayes on discrete representation with self-annealed stochastic quantization. In Proc. 39th International Conference on Machine Learning 20987–21012 (PMLR, 2022).

  54. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Proc. 33rd Conference on Neural Information Processing Systems 721 (Curran Associates Inc., 2019).

  55. Kingma, D. & Ba, J. Adam: a method for stochastic optimization. In Proc. 3rd International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) http://arxiv.org/abs/1412.6980 (ICLR, 2015).

  56. Satpathy, A. T. et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 37, 925–936 (2019).

  57. Tang, S. et al. scCASE: accurate and interpretable enhancement for single-cell chromatin accessibility sequencing data. Nat. Commun. 15, 1629 (2024).

  58. Su, A. I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl Acad. Sci. USA 101, 6062–6067 (2004).

  59. Gazal, S. S-LDSC reference files. Zenodo https://doi.org/10.5281/zenodo.7768714 (2017).

  60. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).

  61. Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008, P10008 (2008).

  62. Levine, J. H. et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162, 184–197 (2015).

  63. Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).

  64. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  65. Büttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).

  66. Chen, X., Miragaia, R. J., Natarajan, K. N. & Teichmann, S. A. A rapid and robust method for single cell chromatin accessibility profiling. Nat. Commun. 9, 5345 (2018).

  67. Cui, X. et al. Discrete latent embedding of single-cell chromatin accessibility sequencing data for cell heterogeneity uncovering. Zenodo https://zenodo.org/doi/10.5281/zenodo.10906304 (2024).

Acknowledgements

This work was supported by the National Key Research and Development Program of China (grant nos. 2021YFF1200902 and 2023YFF1204802 to R.J.), the National Natural Science Foundation of China (grant nos. 62273194 to R.J. and 62203236 to S.C.), the Fundamental Research Funds for the Central Universities (grant no. Nankai University 63231137 to S.C.), and the Young Elite Scientists Sponsorship Program by CAST (grant no. 2023QNRC001 to S.C.).

Author information

Contributions

R.J. and S.C. conceived the study and supervised the project. X.C. and S.C. designed, implemented and validated CASTLE. Z.L. and Z.G. helped analyze the results. X.C., S.C. and X.C. wrote the paper, with input from all of the authors.

Corresponding authors

Correspondence to Shengquan Chen or Rui Jiang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Jianrong Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Editor recognition statement: Primary Handling Editor: Kaitlin McCardle, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes 1–12, Figs. 1–65 and Table 1.

Reporting Summary

Source data

Source Data Fig. 2

Original data for this figure.

Source Data Fig. 3

Original data for this figure.

Source Data Fig. 4

Original data for this figure.

Source Data Fig. 5

Original data for this figure.

Source Data Fig. 6

Original data for this figure.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cui, X., Chen, X., Li, Z. et al. Discrete latent embedding of single-cell chromatin accessibility sequencing data for uncovering cell heterogeneity. Nat Comput Sci 4, 346–359 (2024). https://doi.org/10.1038/s43588-024-00625-4

  • DOI: https://doi.org/10.1038/s43588-024-00625-4
