concept.scConcept

class concept.scConcept(cfg=None, repo_id='theislab/scConcept', cache_dir='./cache/')

High-level interface for loading, adapting, and applying scConcept models.

This wrapper handles model/config loading from Hugging Face or local paths, gene-tokenizer setup across species, embedding extraction from AnnData, and optional lightweight fine-tuning on user-provided datasets.

Attributes table

species

Return species names supported by the loaded model.

Methods table

`apply_compatibility_changes`(cfg)	Apply compatibility changes for older checkpoints.
`decode`(cell_embeddings[, batch_size])	Decode cell embeddings into gene expression predictions.
`estimate_mutual_information`(cell_embs_1, ...)	Estimate mutual information between two aligned sets of cell embeddings.
`extract_embeddings`(adata[, species, ...])	Extract embeddings from AnnData using the loaded model.
`get_gene_name_to_id_mapping`(species)	Return a `{gene_name: gene_id}` mapping for one species.
`load_config`(config)	Load configuration from file or dict.
`load_config_and_model`([model_name, config, ...])	Load configuration and initialize the model.
`map_gene_ids_to_names`(species, gene_ids)	Map gene IDs to gene names case-insensitively, using `nan` for unavailable gene IDs.
`map_gene_names_to_ids`(species, gene_names)	Map gene names to gene IDs case-insensitively, using `nan` for unavailable gene names.
`plot_training_curves`([metrics_path, ...])	Plot train/loss and train/recall@1 from the current CSVLogger metrics file.
`train`(adata_list[, species, max_steps, ...])	Train a new model using the configuration in self.cfg.
`validate_config`(cfg)	Validate configuration constraints.

Attributes

scConcept.species

Return species names supported by the loaded model.

Methods

static scConcept.apply_compatibility_changes(cfg)

Apply compatibility changes for older checkpoints. Returns the updated cfg.

scConcept.decode(cell_embeddings, batch_size=128)

Decode cell embeddings into gene expression predictions.

Return type: dict[str, Tensor | list[str]]

scConcept.estimate_mutual_information(cell_embs_1, cell_embs_2)

Estimate mutual information between two aligned sets of cell embeddings.

The estimate follows the contrastive objective used by scConcept: log(n_cells) - cross_entropy(normalized_embs_1 @ normalized_embs_2.T * logit_scale, arange(n_cells)).

Return type: float

Args:

cell_embs_1: Tensor of shape (n_cells, embedding_dim) for the first panel/view.
cell_embs_2: Tensor of shape (n_cells, embedding_dim) for the second panel/view.

Returns:

Estimated mutual information as log(n_cells) - contrastive_loss.
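The formula above can be spelled out step by step. The following is a minimal dependency-free sketch of the arithmetic, not the library's implementation: the function name `estimate_mi` is hypothetical, `logit_scale` is passed explicitly here (how scConcept obtains it from the model is not documented on this page), and the cross-entropy is assumed to be averaged over cells.

```python
import math


def estimate_mi(embs_1, embs_2, logit_scale=1.0):
    """Sketch of log(n_cells) - cross_entropy(logits, arange(n_cells)).

    embs_1 / embs_2 are aligned lists of row vectors: row i of each
    describes the same cell. Rows must have a non-zero norm.
    """
    def l2_normalize(rows):
        normalized = []
        for row in rows:
            norm = math.sqrt(sum(x * x for x in row))
            normalized.append([x / norm for x in row])
        return normalized

    a, b = l2_normalize(embs_1), l2_normalize(embs_2)
    n_cells = len(a)
    loss = 0.0
    for i in range(n_cells):
        # Scaled cosine similarity of cell i (view 1) to every cell in view 2.
        logits = [
            logit_scale * sum(x * y for x, y in zip(a[i], b[j]))
            for j in range(n_cells)
        ]
        # Cross-entropy with target index i, using a stable log-sum-exp.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(v - top) for v in logits))
        loss += log_sum_exp - logits[i]
    loss /= n_cells  # assumption: mean reduction over cells
    return math.log(n_cells) - loss


# Four mutually orthogonal cells, identical in both views.
views = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
perfectly_aligned = estimate_mi(views, views, logit_scale=50.0)
uninformative = estimate_mi(views, views, logit_scale=0.0)
```

With perfectly matched, well-separated views the estimate approaches its ceiling of log(n_cells); when the logits carry no information the loss equals log(n_cells) and the estimate is zero. The ceiling means estimates are only comparable across runs that use the same number of cells.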

scConcept.extract_embeddings(adata, species=None, gene_id_column=None, batch_size=32, max_tokens=None, gene_sampling_strategy=None, use_learnable_embs=True, return_type='numpy', num_workers=8)

Extract embeddings from AnnData using the loaded model.

Args:

adata: AnnData object containing single-cell data.
species: Species identifier (e.g. 'hsapiens'). If not provided, it is inferred from gene IDs using the tokenizer's gene mappings where possible.
gene_id_column: Column name in adata.var to use as gene IDs (e.g. ENSGXXXXXXXXXXX). Default: None, which uses the index.
batch_size: Batch size for the dataloader (default: 32).
max_tokens: Maximum number of tokens per cell. If None, uses the config default.
gene_sampling_strategy: Gene sampling strategy ('top-nonzero', etc.). If None, uses the config default.
use_learnable_embs: Whether to enable species-specific learnable embeddings during prediction for maximum single-species performance. Set to False for better cross-species alignment (default: True).
return_type: Output type for embeddings: "numpy" (default) or "torch".

Returns:

dict: Dictionary containing 'cls_cell_emb' and, optionally, 'context_sizes'.
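As an illustration of how these arguments combine, the sketch below wraps a call that disables the species-specific learnable embeddings, the setting described above for cross-species alignment. The helper name `embed_for_cross_species` is hypothetical; `model` is assumed to be an already loaded scConcept instance and `adata` an AnnData object. Only keyword arguments and the result key documented on this page are used.

```python
def embed_for_cross_species(model, adata, species=None, batch_size=32):
    """Return CLS cell embeddings with learnable species embeddings disabled.

    `model` is assumed to expose extract_embeddings() as documented above.
    """
    result = model.extract_embeddings(
        adata,
        species=species,           # None asks the model to infer species from gene IDs
        batch_size=batch_size,
        use_learnable_embs=False,  # documented setting for cross-species alignment
        return_type="numpy",
    )
    # 'cls_cell_emb' is the documented key; 'context_sizes' may also be present.
    return result["cls_cell_emb"]
```

Keeping `use_learnable_embs=True` (the default) is the documented choice when only single-species performance matters.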

scConcept.get_gene_name_to_id_mapping(species)

Return a {gene_name: gene_id} mapping for one species.

Return type: dict[str, str]

static scConcept.load_config(config)

Load configuration from file or dict.

scConcept.load_config_and_model(model_name=None, config=None, model_path=None, decoder_model_name='mlp-nb.ckpt', decoder_model_path=None, gene_mappings_path=None, panels_dir=None, pretrained_vocabulary_path=None)

Load configuration and initialize the model.

Args:

model_name: Model name to download from Hugging Face (e.g. 'corpus40M-model30M'). Required if direct paths are not provided. List of models: https://huggingface.co/theislab/scConcept/tree/main
config: Configuration. Can be a path to a config file (.yaml) given as str or Path, a dictionary, or a DictConfig. When used with model_name, it is merged on top of the downloaded config.
model_path: Path to a model checkpoint file (.ckpt). If provided, bypasses the Hugging Face download.
decoder_model_name: Name of the decoder model checkpoint to download (default: 'mlp-nb.ckpt').
decoder_model_path: Optional path to a decoder checkpoint file (.ckpt). If provided, the decoder is loaded and can be used through decode().
gene_mappings_path: Path to gene mappings. For multi-species models, a directory containing {species}.csv files (one per species). For single-species models, a .pkl or .csv file. If provided, bypasses the Hugging Face download.
panels_dir: Path to the panels directory. If provided, bypasses the Hugging Face download.
pretrained_vocabulary_path: Path to the pretrained vocabulary directory (containing .csv files). If provided, overrides config PATH.PRETRAINED_VOCABULARY.
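The phrase "merged on top of the downloaded config" can be pictured as a recursive update in which user-supplied values win and untouched keys are kept. The sketch below uses plain dictionaries to show that behavior; it is an assumption about the semantics, not the library's merge routine, and `merge_on_top` as well as all path values are hypothetical. Only the key names `PATH.PANELS_PATH` and `PATH.PRETRAINED_VOCABULARY` come from this page.

```python
from copy import deepcopy


def merge_on_top(base, override):
    """Return a new dict where `override` values replace `base` values.

    Nested dictionaries are merged key by key; everything else is replaced.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_on_top(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical values, standing in for a downloaded config and a user config.
downloaded = {"PATH": {"PANELS_PATH": "hub/panels", "PRETRAINED_VOCABULARY": "hub/vocab"}}
user = {"PATH": {"PANELS_PATH": "./my_panels"}}
effective = merge_on_top(downloaded, user)
```

Under these assumed semantics, a partial user config only needs to list the keys it changes.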

scConcept.map_gene_ids_to_names(species, gene_ids)

Map gene IDs to gene names case-insensitively, using nan for unavailable gene IDs.

Return type: list[object]

scConcept.map_gene_names_to_ids(species, gene_names)

Map gene names to gene IDs case-insensitively, using nan for unavailable gene names.

Return type: list[object]
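Both mapping methods return a list that mixes strings with `nan`, which is why the return type is `list[object]`. The sketch below mimics the described behavior on a small illustrative dictionary so that the handling of missing entries is concrete. The helper `map_names_to_ids` is hypothetical and is not the library's code; in practice the dictionary would come from `get_gene_name_to_id_mapping(species)`.

```python
import math


def map_names_to_ids(name_to_id, gene_names):
    """Case-insensitive lookup that yields nan for names without an ID."""
    lookup = {name.lower(): gene_id for name, gene_id in name_to_id.items()}
    return [lookup.get(name.lower(), float("nan")) for name in gene_names]


# Small illustrative mapping for two human genes.
toy_mapping = {"TP53": "ENSG00000141510", "GAPDH": "ENSG00000111640"}
queries = ["tp53", "Gapdh", "NOT_A_GENE"]
gene_ids = map_names_to_ids(toy_mapping, queries)

# nan never compares equal to itself, so test for it explicitly.
missing = [
    name
    for name, gene_id in zip(queries, gene_ids)
    if isinstance(gene_id, float) and math.isnan(gene_id)
]
```

Filtering on `isinstance(..., float)` before calling `math.isnan` avoids a `TypeError` on the string entries.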

scConcept.plot_training_curves(metrics_path=None, save_path=None, show=True, n_avg=1)

Plot train/loss and train/recall@1 from the current CSVLogger metrics file.

Args:

metrics_path: Optional path to a CSVLogger metrics.csv file, or to the log directory containing it. If omitted, uses the metrics file from the most recent train call.
save_path: Optional path where the generated figure should be saved.
show: Whether to display the figure with matplotlib.pyplot.show. Set to False for scripts or tests that only need the returned figure.
n_avg: Number of consecutive logged entries to average before plotting. Defaults to 1, which plots each logged value without smoothing.

Returns:

A tuple of (fig, axes) from matplotlib.
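To make the effect of `n_avg` concrete, here is one plausible reading of "average consecutive logged entries": non-overlapping blocks of `n_avg` values, each replaced by its mean. This is an assumption; the library may instead use a rolling window or treat a trailing partial block differently. The function name `average_consecutive` is hypothetical.

```python
def average_consecutive(values, n_avg=1):
    """Average non-overlapping blocks of `n_avg` consecutive values.

    A trailing block shorter than `n_avg` is averaged over its own length
    (an assumption made for this sketch).
    """
    if n_avg < 1:
        raise ValueError("n_avg must be >= 1")
    blocks = (values[i:i + n_avg] for i in range(0, len(values), n_avg))
    return [sum(block) / len(block) for block in blocks]


logged_loss = [4.0, 2.0, 3.0, 1.0, 5.0]
smoothed = average_consecutive(logged_loss, n_avg=2)
unsmoothed = average_consecutive(logged_loss, n_avg=1)
```

Larger `n_avg` values yield fewer, less noisy points, which helps when metrics are logged at every step.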

scConcept.train(adata_list, species=None, max_steps=None, batch_size=None, num_workers=8, save_dir='./training_logs/', panels_dir=None)

Train a new model using the configuration in self.cfg.

Uses self.model if it exists; otherwise initializes a new model. Assumes a single GPU device with num_nodes=1.

Args:

adata_list: A single AnnData object or file path string, or a list of these.
species: Species identifier(s). A single string when adata_list is a single item, or a list of strings with the same length as adata_list when it is a list. If None (or a list containing None entries), species is inferred from gene ID overlap with the tokenizer vocabularies. Inference is only possible for AnnData items, not file path strings.
max_steps: Optional maximum number of training steps. If provided, overrides the config value.
batch_size: Optional batch size for training. If provided, overrides the config value.
num_workers: Number of data loading workers for training. Clamped to available SLURM CPUs by the datamodule, matching extract_embeddings behavior. Defaults to 8.
save_dir: Directory where CSVLogger writes training logs. Defaults to ./training_logs/.
panels_dir: Optional panels directory to use for this training run. If provided, overrides the loaded model panels directory and cfg.PATH.PANELS_PATH.

Examples:

# Single AnnData — species inferred automatically
model.train(adata)

# Single AnnData — species provided explicitly
model.train(adata, species="hsapiens")

# List of AnnData — all species inferred
model.train([adata1, adata2])

# List of AnnData — all species provided explicitly
model.train([adata1, adata2], species=["hsapiens", "mmusculus"])

# List of AnnData — mixed: first inferred, second explicit
model.train([adata1, adata2], species=[None, "mmusculus"])

# File path strings — species must always be provided explicitly
model.train("path/to/data.h5ad", species="hsapiens")
model.train(["path/to/a.h5ad", "path/to/b.h5ad"], species=["hsapiens", "mmusculus"])
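The pairing rules between `adata_list` and `species` shown in the examples above can be checked before starting a long run. The sketch below encodes those documented rules in a standalone helper; `align_species` is hypothetical, and the use of `ValueError` is an assumption, since the errors raised by the library itself are not documented here.

```python
def align_species(adata_list, species=None):
    """Pair each dataset with its species label, following the documented rules.

    Returns a list of (item, species_or_None) tuples; None means "infer later".
    """
    if isinstance(adata_list, list):
        items = adata_list
        labels = [None] * len(items) if species is None else species
        if not isinstance(labels, list) or len(labels) != len(items):
            raise ValueError("species must be a list with one entry per dataset")
    else:
        if isinstance(species, list):
            raise ValueError("species must be a single string for a single dataset")
        items, labels = [adata_list], [species]

    for item, label in zip(items, labels):
        # File paths cannot be inspected for gene IDs, so inference is impossible.
        if isinstance(item, str) and label is None:
            raise ValueError(f"species must be given explicitly for path {item!r}")
    return list(zip(items, labels))


in_memory = object()  # stands in for an AnnData object in this sketch
pairs = align_species([in_memory, "path/to/b.h5ad"], species=[None, "mmusculus"])
```

Running such a check up front surfaces a missing species label for a file path before any data is loaded.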

static scConcept.validate_config(cfg)

Validate configuration constraints.
