Pre-training from scratch#

Use src/concept/train.py for large-scale distributed pre-training from scratch. The scConcept.train() Python API is intended for light adaptation of pretrained models or small experiments.

Before running large-scale pre-training, make sure the training files listed in the split configuration files are available on disk.

LaminDB setup#

Follow the LaminDB setup guide before running the training pipeline: https://docs.lamin.ai/setup.

Expected project layout#

The root training configuration is src/concept/conf/config.yaml. Set PATH.PROJECT_PATH to a project directory with the following structure:

PROJECT_PATH/
|-- data/
|   `-- <dataset_name>/
|       `-- h5ads/
|           |-- <adata_1>.h5ad
|           |-- <adata_2>.h5ad
|           `-- ...
|-- model_checkpoints/
|-- panels/
|   |-- hsapiens/
|   |   |-- panel_1.csv
|   |   |-- panel_2.csv
|   |   `-- ...
|   `-- mmusculus/
|       |-- panel_1.csv
|       |-- panel_2.csv
|       `-- ...
`-- references/
    |-- embeddings/
    |   `-- <embedding_name>/
    |       |-- <human_gene_embeddings>.csv
    |       |-- <mouse_gene_embeddings>.csv
    |       `-- ...
    `-- vocabulary/
        `-- token_mappings_per_specie/
            |-- hsapiens.csv
            |-- mmusculus.csv
            `-- ...

Panel CSV files are expected to contain an Ensembl_ID column. The per-species token mapping CSV files are expected to contain gene_id and token columns. Example panel CSV files and token mapping files are available in the scConcept model repository on Hugging Face: https://huggingface.co/theislab/scConcept/tree/main.

The gene embeddings directory is optional. Provide it only if you want to initialize the model with pretrained gene embeddings. In that case, set PATH.PRETRAINED_VOCABULARY to a directory that contains one CSV file per species, for example:

references/
`-- embeddings/
    `-- esm2_t30/
        |-- hsapiens.csv
        |-- mmusculus.csv
        `-- ...

The token mapping files look like:

#hsapiens.csv
gene_id,gene_name,token
<cls>,,0
<pad>,,1
ENSG00000000003,TSPAN6,2
ENSG00000000005,TNMD,3
...

Command-line overrides#

The training script uses Hydra/OmegaConf. Any configuration value can be overridden from the command line with dotted keys:

python src/concept/train.py PATH.PROJECT_PATH=/path/to/project model.training.max_steps=100000

Configuration files#

The training configuration is composed from Hydra YAML files in src/concept/conf/.

config.yaml

The root configuration. It selects the default model and datamodule configs, defines PATH.* locations, controls Weights & Biases logging, and stores resume/profiler settings. The most important required value is PATH.PROJECT_PATH, because the default data, panel, reference, and checkpoint paths are derived from it.

model/ContrastiveModel.yaml

Model architecture and optimizer/training settings. This file controls the embedding size, transformer depth, attention heads, dropout, loss weights, Flash Attention flag, learning rate, scheduler, number of steps, GPU/device settings, validation interval, gradient accumulation, and checkpoint cadence.

datamodule/DataModuleBasic.yaml

Default human-only datamodule with simple validations. It defines the species list, observation columns read from adata.obs, normalization, sampling strategy, train/validation splits, panel sampling settings, maximum token count, batch size, workers, and validation loaders.

datamodule/split_*/split_*.yaml

Dataset split definitions. Each split file declares a source_name, a species, a source_path, and lists of .h5ad files for train and/or val. The training script expands these entries into concrete file paths and passes the species metadata to the multi-species tokenizer. A split entry can also choose a specific section with:

To use a different config group, override it on the command line. For example:

python src/concept/train.py \
  PATH.PROJECT_PATH=/path/to/project \
  datamodule=DataModuleAdvanced \
  model.training.max_steps=1000000

On SLURM, launch the same command through srun or your site-specific job script. Lightning reads SLURM environment variables and uses them for distributed training when available.

Checkpoints and the resolved training configuration are written to PATH.CHECKPOINT_ROOT, which defaults to PATH.PROJECT_PATH/model_checkpoints.