concept.scConcept#

class concept.scConcept(cfg=None, repo_id='theislab/scConcept', cache_dir='./cache/')#

High-level interface for loading, adapting, and applying scConcept models.

This wrapper handles model/config loading from Hugging Face or local paths, gene-tokenizer setup across species, embedding extraction from AnnData, and optional lightweight fine-tuning on user-provided datasets.

Attributes table#

species

Return species names supported by the loaded model.

Methods table#

apply_compatibility_changes(cfg)

Apply compatibility changes for older checkpoints.

estimate_mutual_information(cell_embs_1, ...)

Estimate mutual information between two aligned sets of cell embeddings.

extract_embeddings(adata[, species, ...])

Extract embeddings from AnnData using the loaded model.

get_gene_name_to_id_mapping(species)

Return a {gene_name: gene_id} mapping for one species.

load_config(config)

Load configuration from file or dict.

load_config_and_model([model_name, config, ...])

Load configuration and initialize the model.

map_gene_names_to_ids(species, gene_names)

Map gene names to gene IDs case-insensitively, using nan for unavailable gene names.

train(adata_list[, species, max_steps, ...])

Train a new model using the configuration in self.cfg.

validate_config(cfg)

Validate configuration constraints.

Attributes#

scConcept.species#

Return species names supported by the loaded model.

Methods#

static scConcept.apply_compatibility_changes(cfg)#

Apply compatibility changes for older checkpoints. Returns updated cfg.

scConcept.estimate_mutual_information(cell_embs_1, cell_embs_2)#

Estimate mutual information between two aligned sets of cell embeddings.

The estimate follows the contrastive objective used by scConcept: log(n_cells) - cross_entropy(normalized_embs_1 @ normalized_embs_2.T * logit_scale, arange(n_cells)).

Return type:

float

Args:

cell_embs_1: Tensor of shape (n_cells, embedding_dim) for the first panel/view.

cell_embs_2: Tensor of shape (n_cells, embedding_dim) for the second panel/view.

Returns:

Estimated mutual information as log(n_cells) - contrastive_loss.
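The formula above can be illustrated with a small pure-Python sketch. This is not the library's implementation: the function names are hypothetical, `logit_scale` is passed as an explicit argument (the source does not say where scConcept takes it from), and plain lists of floats stand in for tensors.

```python
import math


def l2_normalize(rows):
    """Scale each row vector to unit Euclidean length (rows must be non-zero)."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row])
    return normalized


def estimate_mi_sketch(embs_1, embs_2, logit_scale=1.0):
    """Illustrative estimate: log(n_cells) - contrastive cross-entropy.

    Row i of embs_1 and row i of embs_2 are assumed to describe the same cell.
    """
    n_cells = len(embs_1)
    a = l2_normalize(embs_1)
    b = l2_normalize(embs_2)

    loss = 0.0
    for i in range(n_cells):
        # Scaled cosine similarity of cell i (view 1) to every cell in view 2.
        logits = [
            logit_scale * sum(x * y for x, y in zip(a[i], b[j]))
            for j in range(n_cells)
        ]
        # Cross-entropy with target index i, using a stable log-sum-exp.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
        loss += log_z - logits[i]
    loss /= n_cells  # mean over cells

    return math.log(n_cells) - loss
```

In this sketch, identical and mutually orthogonal embeddings combined with a large `logit_scale` drive the contrastive loss toward zero, so the estimate approaches its upper bound log(n_cells). With `logit_scale=0` all logits are equal and the estimate is zero.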

scConcept.extract_embeddings(adata, species=None, gene_id_column=None, batch_size=32, max_tokens=None, gene_sampling_strategy=None, use_learnable_embs=True, return_type='numpy', num_workers=8)#

Extract embeddings from AnnData using the loaded model.

Args:

adata: AnnData object containing single-cell data.

species: Species identifier (e.g. ‘hsapiens’). If not provided, it is inferred from gene IDs using the tokenizer’s gene mappings, if possible.

gene_id_column: Column name in adata.var to use as gene IDs (e.g. ENSGXXXXXXXXXXX). Default: None, which uses the index.

batch_size: Batch size for the dataloader (default: 32).

max_tokens: Maximum number of tokens per cell (if None, uses the config default).

gene_sampling_strategy: Gene sampling strategy (‘top-nonzero’, etc.) (if None, uses the config default).

use_learnable_embs: Whether to enable species-specific learnable embeddings during prediction for maximum single-species performance. Set to False for better cross-species alignment (default: True).

return_type: Output type for embeddings: "numpy" (default) or "torch".

Returns:

dict: Dictionary containing ‘cls_cell_emb’ and, optionally, ‘context_sizes’.

scConcept.get_gene_name_to_id_mapping(species)#

Return a {gene_name: gene_id} mapping for one species.

Return type:

dict[str, str]

static scConcept.load_config(config)#

Load configuration from file or dict.

scConcept.load_config_and_model(model_name=None, config=None, model_path=None, gene_mappings_path=None, panels_dir=None, pretrained_vocabulary_path=None)#

Load configuration and initialize the model.

Args:

model_name: Model name to download from Hugging Face (e.g., ‘corpus40M-model30M’). List of models: https://huggingface.co/theislab/scConcept/tree/main. Required if direct paths are not provided.

config: Configuration. Can be a path to a config file (.yaml) given as str or Path, a dictionary, or a DictConfig. When used with model_name, it is merged on top of the downloaded config.

model_path: Path to the model checkpoint file (.ckpt). If provided, bypasses the Hugging Face download.

gene_mappings_path: Path to gene mappings. For multi-species models, a directory containing {species}.csv files (one per species). For single-species models, a .pkl or .csv file. If provided, bypasses the Hugging Face download.

panels_dir: Path to the panels directory. If provided, bypasses the Hugging Face download.

pretrained_vocabulary_path: Path to the pretrained vocabulary directory (containing .csv files). If provided, overrides config PATH.PRETRAINED_VOCABULARY.
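To make "merged on top of the downloaded config" concrete, here is a minimal sketch using plain dictionaries. It assumes the merge is a recursive override in which user-supplied values win and untouched keys are kept; the helper `merge_on_top`, the `PANELS` key, and all path values are hypothetical. Only `PATH.PRETRAINED_VOCABULARY` is taken from the documentation above.

```python
import copy


def merge_on_top(base, override):
    """Return a new dict in which values from `override` take precedence over `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are nested sections: merge them key by key.
            merged[key] = merge_on_top(merged[key], value)
        else:
            # Scalars (or mismatched types) are simply replaced.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical stand-in for the config downloaded alongside a model.
downloaded = {"PATH": {"PRETRAINED_VOCABULARY": "vocab/", "PANELS": "panels/"}}
# Hypothetical user-supplied config that overrides a single entry.
user = {"PATH": {"PRETRAINED_VOCABULARY": "/data/my_vocab/"}}

cfg = merge_on_top(downloaded, user)
```

Under these assumptions, `cfg` keeps the downloaded `PANELS` entry and takes `PRETRAINED_VOCABULARY` from the user config, while `downloaded` itself is left unmodified.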

scConcept.map_gene_names_to_ids(species, gene_names)#

Map gene names to gene IDs case-insensitively, using nan for unavailable gene names.

Return type:

list[object]
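The behavior described here (case-insensitive lookup, nan for names without a match) can be sketched in a few lines. The helper `map_names_sketch` is hypothetical and takes the {gene_name: gene_id} mapping as an explicit argument; it is an illustration of the described behavior, not scConcept's code.

```python
import math


def map_names_sketch(name_to_id, gene_names):
    """Look up gene IDs ignoring case; names without a match map to nan."""
    # Build a lower-cased index once. If two names differ only by case,
    # the later entry wins in this sketch.
    lowered = {name.lower(): gene_id for name, gene_id in name_to_id.items()}
    return [lowered.get(name.lower(), math.nan) for name in gene_names]


# Illustrative mapping with a single human gene.
mapping = {"TP53": "ENSG00000141510"}
ids = map_names_sketch(mapping, ["tp53", "NotAGene"])
```

Because hits are strings and misses are float nan, the result mixes types, which is consistent with the documented return type list[object]. Misses can be detected with `isinstance(x, float) and math.isnan(x)`.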

scConcept.train(adata_list, species=None, max_steps=None, batch_size=None)#

Train a new model using the configuration in self.cfg.

Uses self.model if it exists, otherwise initializes a new model. Assumes single GPU device with num_nodes=1.

Args:

adata_list: A single AnnData object or file path string, or a list of these.

species: Species identifier(s). A single string when adata_list is a single item, or a list of strings with the same length as adata_list when it is a list. If None (or a list containing None entries), species will be inferred from gene ID overlap with the tokenizer vocabularies; inference is only possible for AnnData items, not file path strings.

max_steps: Optional maximum number of training steps. If provided, overrides the config value.

batch_size: Optional batch size for training. If provided, overrides the config value.

Examples:

# Single AnnData — species inferred automatically
model.train(adata)

# Single AnnData — species provided explicitly
model.train(adata, species="hsapiens")

# List of AnnData — all species inferred
model.train([adata1, adata2])

# List of AnnData — all species provided explicitly
model.train([adata1, adata2], species=["hsapiens", "mmusculus"])

# List of AnnData — mixed: first inferred, second explicit
model.train([adata1, adata2], species=[None, "mmusculus"])

# File path strings — species must always be provided explicitly
model.train("path/to/data.h5ad", species="hsapiens")
model.train(["path/to/a.h5ad", "path/to/b.h5ad"], species=["hsapiens", "mmusculus"])
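The pairing rules between adata_list and species shown in these examples can be summarized with a small validation sketch. The helper `pair_inputs_with_species` is hypothetical, and raising ValueError for a file path without a species is an assumption about how the "must be provided explicitly" rule might be enforced; the actual checks in scConcept may differ.

```python
def pair_inputs_with_species(adata_list, species=None):
    """Return (item, species_or_None) pairs; None means 'infer from gene IDs'."""
    if not isinstance(adata_list, list):
        # Single item: species must be a single string or None.
        if isinstance(species, list):
            raise ValueError("species must be a single string for a single item")
        adata_list, species = [adata_list], [species]
    elif species is None:
        # List of items with no species given: infer for every item.
        species = [None] * len(adata_list)
    elif not isinstance(species, list) or len(species) != len(adata_list):
        raise ValueError("species must be a list as long as adata_list")

    pairs = list(zip(adata_list, species))
    for item, item_species in pairs:
        # Inference needs gene IDs, which a bare path string does not expose.
        if isinstance(item, str) and item_species is None:
            raise ValueError(f"species is required for file path {item!r}")
    return pairs
```

Any non-string item stands in for an AnnData object here, so the sketch runs without the anndata package installed.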

static scConcept.validate_config(cfg)#

Validate configuration constraints.