scConcept#

This repository contains the Python package for training and using scConcept (Single-cell contrastive cell pre-training), a method for single-cell transcriptomics.

Installation#

You need to have Python 3.12 or newer installed on your system. If you don’t have Python installed, we recommend installing uv.

Default installation#

Install the latest release of sc-concept from PyPI:

pip install sc-concept
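
To confirm that the environment meets the requirements above, a small check like the following can help. It is a minimal sketch using only the standard library; the helper names `python_is_supported` and `installed_version` are illustrative and not part of scConcept.

```python
import sys
from importlib import metadata
from typing import Optional

MIN_PYTHON = (3, 12)  # minimum version stated in the installation notes


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(dist_name: str = "sc-concept") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("sc-concept version:", installed_version() or "not installed")
```

Running this in the environment you installed into shows whether the interpreter is recent enough and whether pip registered the sc-concept distribution.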

Latest development version#

To install the latest development version directly from GitHub:

pip install git+https://github.com/theislab/scConcept.git@main

Optional Flash Attention speedup#

The standard installation is enough for loading pretrained models, extracting embeddings, and light adaptation. For faster inference, embedding extraction, adaptation, or large-scale training, install Flash Attention with one of the following options.

Recommended: cd to the project root and run ./scripts/setup_env.sh, which installs uv if needed and creates a virtual environment with the training dependencies.
Manual: make sure a CUDA-enabled version of PyTorch is installed. More information is available in the PyTorch installation guide. Then install Flash Attention:

MAX_JOBS=4 pip install "flash-attn>=2.7" --no-build-isolation

This can take up to an hour, depending on the system specifications and on whether a pre-built release of flash-attn is available for your exact versions of Python, PyTorch, and CUDA. If the build takes too long, we recommend using the setup script instead.
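
After either installation route, you may want to check what the environment actually provides. The sketch below assumes that flash-attn is importable as `flash_attn` and uses `torch.cuda.is_available()` to look for a usable GPU. The helper names are illustrative, and whether scConcept picks up Flash Attention automatically is not covered here.

```python
from importlib.util import find_spec


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(name) is not None


def describe_attention_setup() -> str:
    """Summarize whether PyTorch, CUDA, and flash-attn appear to be usable."""
    if not module_available("torch"):
        return "PyTorch is not installed"
    import torch  # imported lazily so the check also works without PyTorch

    if not torch.cuda.is_available():
        return "PyTorch is installed, but no CUDA device is available"
    if module_available("flash_attn"):
        return "CUDA-enabled PyTorch and flash-attn are both available"
    return "CUDA-enabled PyTorch is available, but flash-attn is not installed"


if __name__ == "__main__":
    print(describe_attention_setup())
```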

How to use#

scConcept provides a simple API to load and adapt pre-trained models and extract embeddings from scRNA-seq data.

Pre-trained models#

The following models are available from the scConcept Hugging Face repository. Use the value in the model_name column with concept.load_config_and_model(model_name=...).

`model_name`	Training corpus	Architecture	Max tokens	Species	Notes
`corpus360M[multi-species]-model170M`	360M cells (CellxGene 2026 + scBaseCount 2025)	170M parameters, 16 layers, 1024 hidden size, 16 heads	20,000	16 species	Largest multi-species checkpoint; best suited for cross-species applications with sufficient memory.
`corpus40M-model30M`	40M cells (CellxGene 2023)	30M parameters, 8 layers, 512 hidden size, 8 heads	1,000	Human	Recommended default for embedding extraction and light adaptation.
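
If you select a checkpoint programmatically, it can be convenient to validate the name before any download starts. The following is a minimal sketch that transcribes the table above into a dictionary. `PRETRAINED_MODELS` and `check_model_name` are hypothetical helpers, not part of the scConcept API, and the Hugging Face repository may list models that are missing here.

```python
# Values transcribed from the table above; not an authoritative registry.
PRETRAINED_MODELS = {
    "corpus360M[multi-species]-model170M": {
        "parameters": "170M", "layers": 16, "hidden_size": 1024,
        "heads": 16, "max_tokens": 20_000, "species": "16 species",
    },
    "corpus40M-model30M": {
        "parameters": "30M", "layers": 8, "hidden_size": 512,
        "heads": 8, "max_tokens": 1_000, "species": "Human",
    },
}


def check_model_name(model_name: str) -> dict:
    """Return the table entry for model_name or raise ValueError with known names."""
    try:
        return PRETRAINED_MODELS[model_name]
    except KeyError:
        known = ", ".join(sorted(PRETRAINED_MODELS))
        raise ValueError(
            f"Unknown model_name {model_name!r}; known names: {known}"
        ) from None
```

The returned entry can then inform choices such as available memory before the name is passed to concept.load_config_and_model(model_name=...).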

Here’s a basic example:

from concept import scConcept
import scanpy as sc

# Load your single-cell data
adata = sc.read_h5ad("your_data.h5ad")

# Initialize scConcept and load a pretrained model
concept = scConcept(cache_dir='./cache/')

# Option 1: Load a model directly from HuggingFace
concept.load_config_and_model(model_name='corpus40M-model30M') 

# Option 2: Load any local model
concept.load_config_and_model(
    config='<path-to-config.yaml>',
    model_path='<path-to-model.ckpt>',
    gene_mappings_path='<path-to-gene-mappings-directory>',
)

# scConcept accepts Ensembl gene IDs as input. You can use the built-in helper method to do the mapping if needed:
adata.var['gene_id'] = concept.map_gene_names_to_ids(
    species='hsapiens', # see concept.species for available species names
    gene_names=adata.var_names.tolist(),
)

# Extract embeddings; adata.var['gene_id'] should hold Ensembl gene IDs such as ENSGXXXXXXXXXXX
result = concept.extract_embeddings(adata=adata, gene_id_column='gene_id')

# Use embeddings for downstream analysis
adata.obsm['X_scConcept'] = result['cls_cell_emb']
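
Before extracting embeddings, it can be worth checking how many entries in the gene ID column actually look like Ensembl gene IDs. The sketch below assumes vertebrate-style stable IDs (the prefix ENS, an optional species code, G, then 11 digits). Other ID schemes and versioned IDs with a ".N" suffix are reported as non-matching, and how scConcept represents unmapped genes is not assumed. `summarize_gene_ids` is an illustrative helper, not part of the package.

```python
import re
from typing import Iterable, Optional

# Assumption: vertebrate-style Ensembl stable gene IDs, e.g. ENSG00000141510.
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}$")


def summarize_gene_ids(gene_ids: Iterable[Optional[str]]) -> dict:
    """Count entries that match the Ensembl gene ID pattern versus all others."""
    valid = 0
    invalid = []
    for gene_id in gene_ids:
        if isinstance(gene_id, str) and ENSEMBL_GENE_ID.match(gene_id):
            valid += 1
        else:
            # Covers gene symbols, None, NaN, and any other non-matching value.
            invalid.append(gene_id)
    return {"valid": valid, "invalid": len(invalid), "examples": invalid[:5]}


summary = summarize_gene_ids(["ENSG00000141510", "TP53", None])
print(summary)
```

With real data, the input would be something like adata.var['gene_id'].tolist(). A large "invalid" count suggests the mapping step needs attention before embeddings are extracted.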

Model adaptation#

# Adapt a pre-trained model on your own data
concept.train(adata, max_steps=10000, batch_size=128) 

# Important: for multiple datasets, pass them as separate objects in a list
concept.train([adata1, adata2, ...], max_steps=20000, batch_size=128) 

result = concept.extract_embeddings(adata=adata, gene_id_column='gene_id')
adata.obsm['X_scConcept_adapted'] = result['cls_cell_emb']
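
When choosing max_steps and batch_size, a rough estimate of how much data the adaptation run will see can be useful. The arithmetic below assumes that each step consumes exactly one batch of batch_size cells on a single device with no gradient accumulation. That is an assumption about the training loop, not something stated above, and the function names are illustrative.

```python
def cells_seen(max_steps: int, batch_size: int) -> int:
    """Total number of cells drawn during training under the stated assumption."""
    return max_steps * batch_size


def approx_passes(max_steps: int, batch_size: int, n_cells: int) -> float:
    """Approximate number of passes over a dataset containing n_cells cells."""
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    return cells_seen(max_steps, batch_size) / n_cells


# Settings from the first call above, with a hypothetical 100,000-cell dataset.
print(cells_seen(10_000, 128))              # 1280000
print(approx_passes(10_000, 128, 100_000))  # 12.8
```

Under this assumption, the first call above draws 1.28 million cells, which is roughly 13 passes over a 100,000-cell dataset.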

Large-scale pre-training from scratch#

scConcept.train() is intended only for light adaptation of pretrained models or for small, on-the-fly training runs. Use train.py for distributed pre-training from scratch on a large corpus of data.

Before using train.py, follow the lamindb instructions for setting up a lamin instance.

Troubleshooting#

If you encounter an error when loading a pre-trained model, try the following:

Remove the repository and clone the most recent version
Remove the cache directory (cache/ by default)
Run again

This will force a fresh download of the pre-trained model and should resolve most loading issues.
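
The cache-removal step can also be scripted. The following is a minimal sketch using the standard library. It assumes the cache lives at ./cache/ as in the example above, so adjust the path if you passed a different cache_dir. `clear_cache` is an illustrative helper, not part of scConcept.

```python
import shutil
from pathlib import Path


def clear_cache(cache_dir: str = "./cache/") -> bool:
    """Delete the cache directory if it exists; return True if it was removed."""
    path = Path(cache_dir)
    if path.is_dir():
        shutil.rmtree(path)  # removes the directory and everything inside it
        return True
    return False
```

Calling clear_cache() and then loading the model again should trigger the fresh download described above.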

Citation#

Bahrami, M., Tejada-Lapuerta, A., Becker, S., Hashemi G, F.S. and Theis, F.J., 2025. scConcept: Contrastive pretraining for technology-agnostic single-cell representations beyond reconstruction. bioRxiv, pp.2025-10. doi: https://doi.org/10.1101/2025.10.14.682419
