A Guide to 10x Genomics Datasets and Data Formats

High-throughput genomics has changed the study of biology by allowing scientists to analyze biological systems at an unprecedented scale. Traditional sequencing methods often relied on bulk tissue samples, which yield an average molecular measurement across millions of cells. This averaging obscures the differences between individual cells, a property known as cellular heterogeneity. 10x Genomics developed systems that partition complex biological samples into tiny, individual compartments. This advance allows researchers to profile thousands of cells separately, at single-cell and spatial resolution. The result is large, complex datasets.

The Diversity of 10x Genomic Data Types

Understanding the molecular profile of single cells starts with quantifying gene expression using Single-Cell RNA Sequencing (scRNA-seq). This method captures the messenger RNA (mRNA) within each cell, providing a snapshot of which genes are active and at what level. By profiling thousands of cells from a single sample, researchers can identify distinct cell populations and track subtle changes in cell states that might be missed in bulk analysis. The resulting dataset reveals the true heterogeneity of a tissue, which is foundational for mapping complex organs.

Beyond gene expression, other assays probe the regulatory landscape of the cell’s nucleus. The Single-Cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) measures chromatin accessibility. It tells scientists which regions of the DNA are “open”, or available to regulatory proteins, which indicates where genes are active or poised for activation. Such datasets are instrumental in mapping the regulatory elements that control a cell’s identity and function.

A distinct data type, Spatial Transcriptomics, specifically the Visium platform, adds a crucial layer of positional context to gene expression. Unlike single-cell methods where cells are dissociated, Visium retains the tissue architecture, allowing gene expression to be mapped back to its precise location on a tissue slice. This is achieved by using slides coated with thousands of spots, each containing unique spatial barcodes that link the gene expression data to a physical coordinate. Analyzing this data reveals how cell-to-cell communication and tissue organization influence molecular activity.
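The role of the spatial barcode can be illustrated with a minimal sketch. The barcodes, grid coordinates, and counts below are invented toy values, and `counts_by_position` is a hypothetical helper rather than part of any 10x tool. It shows only the core idea: the barcode is the shared key that joins a spot's counts to its position on the slide.

```python
# Toy data: barcodes, grid coordinates, and counts are invented for illustration.
spot_positions = {
    "AAACAAGT-1": (0, 0),  # spatial barcode -> (row, column) of the spot on the slide
    "AAACACCA-1": (0, 2),
    "AAACAGAG-1": (1, 1),
}

# UMI counts for a single gene, keyed by the same spatial barcodes.
gene_counts = {"AAACAAGT-1": 5, "AAACACCA-1": 0, "AAACAGAG-1": 12}


def counts_by_position(positions, counts):
    """Join per-spot counts to slide coordinates via the shared spatial barcode."""
    return {positions[bc]: n for bc, n in counts.items() if bc in positions}


expression_map = counts_by_position(spot_positions, gene_counts)
for (row, col), n in sorted(expression_map.items()):
    print(f"spot at row {row}, column {col}: {n} UMIs")
```

Real analyses perform the same join across every gene at once and overlay the result on the tissue image.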

Single-Cell Multiome assays represent a convergence of these approaches by measuring two different data modalities from the exact same nucleus, such as gene expression and chromatin accessibility. This combined dataset provides a more complete picture of cellular function by directly linking the open regulatory regions (ATAC-seq) to the resulting active genes (scRNA-seq). Analyzing these integrated datasets allows researchers to build regulatory networks, providing a deeper understanding of the mechanisms that define cell state.
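One simple way to propose a link between a regulatory region and a gene is to ask whether the two measurements rise and fall together across the same nuclei. The sketch below uses a plain Pearson correlation on invented toy values. It illustrates the idea only and is not the method of any particular pipeline; real analyses also restrict candidate pairs by genomic distance and correct for multiple testing.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of measurements."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values, one entry per nucleus (same nuclei, same order, in all three lists).
peak_accessibility = [0, 1, 2, 3, 4]  # ATAC counts in one peak
linked_gene = [1, 3, 5, 7, 9]         # expression that tracks the peak
unrelated_gene = [2, 0, 3, 0, 2]      # expression that does not

linked_score = pearson(peak_accessibility, linked_gene)
unrelated_score = pearson(peak_accessibility, unrelated_gene)
print(linked_score, unrelated_score)
```

A score near 1 for the first pair and near 0 for the second is what makes the first peak a candidate regulator of its gene.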

Locating and Interpreting 10x Dataset Formats

Accessing and processing 10x Genomics datasets requires navigating a specific set of file formats and public repositories. Researchers often share their raw and processed data through major public archives, such as the NCBI Gene Expression Omnibus (GEO) and the European Nucleotide Archive (ENA). The official 10x Genomics Data Portal is also a valuable resource, providing publicly available datasets generated by the company itself. Raw data are typically distributed as FASTQ files, which contain the sequence reads and their per-base quality scores, but these must be processed before biological analysis can begin.
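To make the FASTQ layout concrete, here is a minimal parser sketch. It assumes the common four-line record layout and Phred+33 quality encoding, and the read shown is an invented toy record. Production pipelines use dedicated, much faster readers.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a FASTQ text stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        # Assumed Phred+33 encoding: score = ASCII code of the character - 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores  # drop the leading '@'


# Toy record: one 4-base read with one quality character per base.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records)
```

Each quality score estimates the probability that the corresponding base call is wrong, with higher scores indicating more reliable calls.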

The initial processing of these raw reads is handled by specialized software, such as the company’s Cell Ranger pipeline. The pipeline aligns the reads to a reference genome, assigns them to cell barcodes, and counts unique molecular identifiers (UMIs) so that amplified copies of the same original molecule are counted only once. The core output of this process is a set of three files that together form the feature-barcode matrix, the usual starting point for downstream analysis. This matrix is the fundamental data structure used to quantify molecular abundance across all captured cells.
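The counting step can be sketched in a few lines. This is a deliberately simplified illustration, not Cell Ranger's implementation: it assumes each read has already been aligned and reduced to a (cell barcode, UMI, gene) triple, and it ignores the correction of sequencing errors in barcodes and UMIs. All identifiers are toy values.

```python
from collections import defaultdict


def count_umis(aligned_reads):
    """Count distinct UMIs per (cell barcode, gene) pair."""
    umis_seen = defaultdict(set)
    for barcode, umi, gene in aligned_reads:
        # A set keeps each UMI once, so duplicate reads do not inflate the count.
        umis_seen[(barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}


# Toy reads as (cell barcode, UMI, gene) triples.
reads = [
    ("CELL_A", "UMI1", "GeneX"),
    ("CELL_A", "UMI1", "GeneX"),  # duplicate read of the same molecule
    ("CELL_A", "UMI2", "GeneX"),
    ("CELL_B", "UMI1", "GeneX"),
    ("CELL_B", "UMI3", "GeneY"),
]
umi_counts = count_umis(reads)
print(umi_counts)
```

The five reads collapse to four molecules, because the two identical reads in CELL_A share one UMI.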

The three files are a list of features, a list of barcodes, and the count matrix itself. The `features.tsv.gz` file contains the genes, peaks, or other molecular features that were quantified, labeling the rows of the matrix. The `barcodes.tsv.gz` file lists the barcodes, labeling the columns of the matrix; in the filtered output these are the barcodes that the pipeline identified as cells. Finally, the `matrix.mtx.gz` file is the count matrix, where each entry represents the number of molecules (UMIs) detected for a specific feature in a specific cell.
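The sketch below shows how the three files fit together, using only the Python standard library. It writes a tiny toy dataset to a temporary directory and reads it back. `read_mex` and `write_gz` are hypothetical helpers, and the feature IDs, barcodes, and counts are invented. It assumes the three-column `features.tsv.gz` layout (ID, name, feature type) and the 1-based row/column indices of the Matrix Market coordinate format. In practice a library reader such as Scanpy's `read_10x_mtx` or Seurat's `Read10X` would normally be used instead.

```python
import gzip
import os
import tempfile


def read_mex(directory):
    """Load a feature-barcode matrix as (features, barcodes, counts)."""
    def read_lines(name):
        with gzip.open(os.path.join(directory, name), "rt") as fh:
            return [line.rstrip("\n") for line in fh]

    features = [line.split("\t") for line in read_lines("features.tsv.gz")]
    barcodes = read_lines("barcodes.tsv.gz")
    # Lines starting with '%' are the format header and comments.
    body = [ln for ln in read_lines("matrix.mtx.gz") if ln and not ln.startswith("%")]
    n_rows, n_cols, n_entries = map(int, body[0].split())
    counts = {}
    for line in body[1:]:
        row, col, value = line.split()
        counts[(int(row) - 1, int(col) - 1)] = int(value)  # convert to 0-based
    if (n_rows, n_cols, n_entries) != (len(features), len(barcodes), len(counts)):
        raise ValueError("matrix dimensions do not match the feature/barcode lists")
    return features, barcodes, counts


def write_gz(directory, name, text):
    with gzip.open(os.path.join(directory, name), "wt") as fh:
        fh.write(text)


# Toy dataset: 2 features x 3 barcodes with 3 non-zero entries.
with tempfile.TemporaryDirectory() as tmp:
    write_gz(tmp, "features.tsv.gz",
             "ENSG_TOY1\tGeneX\tGene Expression\nENSG_TOY2\tGeneY\tGene Expression\n")
    write_gz(tmp, "barcodes.tsv.gz", "AAAC-1\nAAAG-1\nAAAT-1\n")
    write_gz(tmp, "matrix.mtx.gz",
             "%%MatrixMarket matrix coordinate integer general\n"
             "2 3 3\n1 1 4\n2 1 1\n2 3 7\n")
    features, barcodes, counts = read_mex(tmp)

for (row, col), n in sorted(counts.items()):
    print(f"{features[row][1]} in {barcodes[col]}: {n} UMIs")
```

Only the non-zero entries are stored; every other feature and barcode combination is implicitly zero.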

This count matrix is stored in a sparse format to keep file sizes manageable: either the gzip-compressed Market Exchange Format (MEX), which is based on the Matrix Market text format, or a single Hierarchical Data Format (HDF5) file. Sparse storage works well because any given cell yields detectable transcripts from only a small fraction of genes, so the matrix consists predominantly of zeros. Tools such as the R package Seurat or the Python package Scanpy are commonly used to read and manipulate these sparse matrices efficiently. After loading, the raw counts are typically normalized and scaled to prepare the data for visualization and biological interpretation.
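Both ideas, sparsity and normalization, can be shown on a toy matrix. The sketch below stores only non-zero entries in a dictionary keyed by (feature index, cell index), then applies one widely used scheme: divide each count by the cell's total, multiply by a scale factor, and take log(1 + x). The scale factor of 10,000 is a common default rather than a requirement, and the helper names and counts are hypothetical. Seurat and Scanpy provide optimized equivalents that operate on real sparse matrices.

```python
import math


def sparsity(counts, n_features, n_cells):
    """Fraction of matrix entries that are zero (i.e. not stored)."""
    return 1 - len(counts) / (n_features * n_cells)


def log_normalize(counts, scale=10_000):
    """Scale each cell to a common total, then apply log(1 + x)."""
    totals = {}
    for (_, cell), n in counts.items():
        totals[cell] = totals.get(cell, 0) + n
    return {
        (feature, cell): math.log1p(n / totals[cell] * scale)
        for (feature, cell), n in counts.items()
    }


# Toy matrix: 2 features x 3 cells, non-zero entries only.
counts = {(0, 0): 4, (1, 0): 1, (1, 2): 7}
zero_fraction = sparsity(counts, n_features=2, n_cells=3)
normalized = log_normalize(counts)
print(zero_fraction, normalized)
```

Zero entries remain zero under this transformation, so the normalized matrix is exactly as sparse as the raw one.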

Translating Data into Biological Insight

The comprehensive datasets generated by 10x Genomics technology are being used to drive large-scale collaborative projects, such as the Human Cell Atlas initiative. This international effort aims to map every cell type in the human body, providing a reference for health and disease. The technology was used to profile hundreds of thousands of immune cells from various human tissues, leading to the identification of novel cell states and subtypes within the immune system. Detailed atlases of developing organs, such as the human heart, have been created by integrating single-cell and spatial data to map cellular composition and gene expression to precise anatomical regions.

In disease modeling, these datasets provide a detailed view into the complexity of conditions like cancer. Single-cell analysis is routinely used to reveal intratumoral heterogeneity, the diverse populations of cancer cells that exist within a single tumor. By tracing the molecular profiles of these subpopulations, researchers can track clonal evolution and identify the rare cell types that may be responsible for therapeutic resistance. Specific studies have used Multiome data to link gene expression changes to underlying changes in chromatin accessibility in aggressive cancers like glioblastoma.

This level of resolution is proving valuable in translational research for the discovery of new therapeutic targets. By comparing the single-cell expression profiles of healthy cells to those in a diseased or drug-resistant state, scientists can pinpoint specific, differentially active genes within a small, disease-driving cell population. For example, single-cell analysis of acute lymphoblastic leukemia identified a resistant subpopulation and a druggable target, BCL2, which led to the proposal of a new combination therapy. Furthermore, spatial data can be used to identify cell-specific biomarkers by revealing how the physical location of cells, such as in the tumor microenvironment, influences their response to treatment.