scConcept#
This repository contains the Python package for training and using scConcept (Single-cell contrastive cell pre-training), a method for single-cell transcriptomics.
Installation#
You need to have Python 3.12 or newer installed on your system. If you don’t have Python installed, we recommend installing uv.
Default installation#
Install the latest release of sc-concept from PyPI:
pip install sc-concept
Latest development version#
To install the latest development version directly from GitHub:
pip install git+https://github.com/theislab/scConcept.git@main
Optional Flash Attention speedup#
The standard installation is enough for loading pretrained models, extracting embeddings, and light adaptation. For faster inference, embedding extraction, adaptation, or large-scale training, install Flash Attention with one of the following options.
Recommended: cd to the project root and run ./scripts/setup_env.sh, which installs uv if needed and creates a virtual environment with the training dependencies.
Manual: make sure a CUDA-enabled version of PyTorch is installed. More information is available in the PyTorch installation guide. Then install Flash Attention:
MAX_JOBS=4 pip install "flash-attn>=2.7" --no-build-isolation
The build can take up to an hour, depending on the system specifications and on whether a pre-built release of flash-attn is available for your exact versions of Python, PyTorch, and CUDA. If the build takes too long, we recommend using the setup script instead.
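Before starting a long build, it can help to confirm what is already installed. The following is a minimal sketch, not part of scConcept: the function name `check_environment` is hypothetical, and it only reports whether the interpreter meets the Python 3.12 requirement, whether the `torch` and `flash_attn` packages can be found, and whether PyTorch sees a CUDA device.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 12)):
    """Report interpreter version and availability of the optional GPU packages.

    Hypothetical helper for illustration; it is not part of the scConcept API.
    """
    report = {
        "python_ok": sys.version_info >= min_python,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        # Imported lazily so the check also runs on systems without PyTorch.
        import torch

        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```

If `cuda_available` is False, the manual Flash Attention install is unlikely to be useful until a CUDA-enabled PyTorch build is in place.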
How to use#
scConcept provides a simple API to load and adapt pre-trained models and extract embeddings from scRNA-seq data.
Pre-trained models#
The following models are available from the scConcept Hugging Face repository. Use the value in the model_name column with concept.load_config_and_model(model_name=...).
| model_name | Training corpus | Architecture | Max tokens | Species | Notes |
|---|---|---|---|---|---|
| | 360M cells (CellxGene 2026 + scBaseCount 2025) | 170M parameters, 16 layers, 1024 hidden size, 16 heads | 20,000 | 16 species | Largest multi-species checkpoint; best suited for cross-species applications with sufficient memory. |
| corpus40M-model30M | 40M cells (CellxGene 2023) | 30M parameters, 8 layers, 512 hidden size, 8 heads | 1,000 | Human | Recommended default for embedding extraction and light adaptation. |
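When choosing between checkpoints, it may be useful to see how many genes are expressed per cell relative to the "Max tokens" value. The sketch below assumes that one token corresponds to one expressed (non-zero) gene in a cell; this interpretation is an assumption, as is the idea that cells above the limit are handled differently from cells below it. The helper names and the toy matrix are hypothetical and not part of scConcept.

```python
def count_expressed_genes(counts):
    """Return the number of non-zero genes for each cell (row) of a dense matrix."""
    return [sum(1 for value in row if value > 0) for row in counts]


def fraction_over_budget(counts, max_tokens):
    """Fraction of cells whose expressed-gene count exceeds max_tokens.

    Assumption: one token per expressed gene (not confirmed by the source).
    """
    per_cell = count_expressed_genes(counts)
    if not per_cell:
        return 0.0
    return sum(n > max_tokens for n in per_cell) / len(per_cell)


# Toy count matrix: 3 cells x 5 genes (real data has thousands of genes).
toy_counts = [
    [0, 3, 0, 1, 0],
    [2, 1, 4, 1, 7],
    [0, 0, 0, 5, 0],
]
print(count_expressed_genes(toy_counts))               # [2, 5, 1]
print(fraction_over_budget(toy_counts, max_tokens=3))  # one of three cells exceeds 3
```

For a real dataset the same per-cell counts can be computed from the expression matrix and compared against 1,000 or 20,000.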
Here’s a basic example:
from concept import scConcept
import scanpy as sc
# Load your single-cell data
adata = sc.read_h5ad("your_data.h5ad")
# Initialize scConcept and load a pretrained model
concept = scConcept(cache_dir='./cache/')
# Option 1: Load a model directly from HuggingFace
concept.load_config_and_model(model_name='corpus40M-model30M')
# Option 2: Load any local model
concept.load_config_and_model(
    config='<path-to-config.yaml>',
    model_path='<path-to-model.ckpt>',
    gene_mappings_path='<path-to-gene-mappings-directory>',
)
# scConcept expects Ensembl gene IDs as input. If your data uses gene names, map them with the built-in helper:
adata.var['gene_id'] = concept.map_gene_names_to_ids(
    species='hsapiens',  # see concept.species for available species names
    gene_names=adata.var_names.tolist(),
)
# Extract embeddings (adata.var['gene_id'] must hold Ensembl gene IDs, e.g. ENSGXXXXXXXXXXX)
result = concept.extract_embeddings(adata=adata, gene_id_column='gene_id')
# Use embeddings for downstream analysis
adata.obsm['X_scConcept'] = result['cls_cell_emb']
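Because the model takes Ensembl gene IDs, a quick format check on the ID column can catch unmapped gene names before extraction. The sketch below is not part of scConcept: `split_gene_ids` is a hypothetical helper, and the pattern assumes the common Ensembl scheme of "ENS", an optional species prefix, "G", and 11 digits (for example ENSG... for human or ENSMUSG... for mouse), optionally followed by a version suffix. Species that use other ID schemes would be reported as invalid, and how scConcept treats versioned IDs or unmapped genes is not stated in the source.

```python
import re

# Assumed pattern: "ENS" + optional species prefix + "G" + 11 digits,
# optionally followed by a version suffix such as ".12".
ENSEMBL_GENE_ID = re.compile(r"^(ENS[A-Z]*G\d{11})(\.\d+)?$")


def split_gene_ids(gene_ids):
    """Return (valid, invalid).

    valid: IDs matching the assumed Ensembl pattern, with any version suffix removed.
    invalid: every other entry, returned unchanged.
    """
    valid, invalid = [], []
    for gene_id in gene_ids:
        match = ENSEMBL_GENE_ID.match(str(gene_id))
        if match:
            valid.append(match.group(1))
        else:
            invalid.append(gene_id)
    return valid, invalid


valid, invalid = split_gene_ids(
    ["ENSG00000141510", "ENSMUSG00000059552.8", "TP53", None]
)
print(f"{len(valid)} valid, {len(invalid)} not recognized: {invalid}")
```

With the example above, the input would be `adata.var['gene_id'].tolist()`; a large `invalid` list suggests the mapping step needs attention.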
Model adaptation#
# Adapt a pre-trained model on your own data
concept.train(adata, max_steps=10000, batch_size=128)
# Important: pass multiple datasets as separate objects in a list rather than concatenating them
concept.train([adata1, adata2, ...], max_steps=20000, batch_size=128)
result = concept.extract_embeddings(adata=adata, gene_id_column='gene_id')
adata.obsm['X_scConcept_adapted'] = result['cls_cell_emb']
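To get a feel for how `max_steps` and `batch_size` relate to dataset size, the arithmetic can be sketched as follows. This assumes that each training step processes one batch of `batch_size` cells, which the source does not confirm; the function names and the dataset size of 50,000 cells are purely illustrative.

```python
def cells_seen(max_steps, batch_size):
    """Total number of cells processed, assuming one batch per step."""
    return max_steps * batch_size


def approx_passes(max_steps, batch_size, n_cells):
    """Approximate number of passes over a dataset of n_cells cells."""
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    return cells_seen(max_steps, batch_size) / n_cells


# Settings from the adaptation example, with a hypothetical 50,000-cell dataset.
print(cells_seen(10_000, 128))                      # 1280000
print(approx_passes(10_000, 128, n_cells=50_000))   # 25.6
```

Under this assumption, a much smaller dataset would be revisited many more times for the same `max_steps`, so fewer steps may be enough.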
Large-scale pre-training from scratch#
scConcept.train() is intended only for light adaptation of pretrained models or small on-the-fly training runs. For distributed pre-training from scratch on a large corpus of data, use train.py.
Before using train.py, follow the lamindb instructions for setting up a lamin instance.
Troubleshooting#
If you encounter an error when loading a pre-trained model, try the following:
Remove the repository and clone the most recent version.
Remove the cache directory (cache/ by default).
Run again.
This will force a fresh download of the pre-trained model and should resolve most loading issues.
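The cache-removal step can also be done from Python. The following is a minimal sketch using only the standard library; `clear_cache` is a hypothetical helper, not an scConcept function, and its default path assumes the `./cache/` directory used in the example above.

```python
import shutil
from pathlib import Path


def clear_cache(cache_dir="./cache/"):
    """Delete cache_dir and everything inside it.

    Returns True if a directory was removed and False if none existed.
    Hypothetical helper for illustration; not part of the scConcept API.
    """
    path = Path(cache_dir)
    if not path.is_dir():
        return False
    shutil.rmtree(path)
    return True


# Usage (uncomment to delete the cache directory):
# clear_cache("./cache/")
```

Make sure the path matches the `cache_dir` passed to `scConcept(...)`, since the deletion is recursive and cannot be undone.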
Citation#
Bahrami, M., Tejada-Lapuerta, A., Becker, S., Hashemi G, F.S. and Theis, F.J., 2025. scConcept: Contrastive pretraining for technology-agnostic single-cell representations beyond reconstruction. bioRxiv, pp.2025-10. doi: https://doi.org/10.1101/2025.10.14.682419