concept.scConcept#
- class concept.scConcept(cfg=None, repo_id='theislab/scConcept', cache_dir='./cache/')#
High-level interface for loading, adapting, and applying scConcept models.
This wrapper handles model/config loading from Hugging Face or local paths, gene-tokenizer setup across species, embedding extraction from
AnnData, and optional lightweight fine-tuning on user-provided datasets.
Attributes table#
species | Return species names supported by the loaded model.
Methods table#
apply_compatibility_changes | Apply compatibility changes for older checkpoints.
estimate_mutual_information | Estimate mutual information between two aligned sets of cell embeddings.
extract_embeddings | Extract embeddings from AnnData using the loaded model.
get_gene_name_to_id_mapping | Return a {gene_name: gene_id} mapping for one species.
load_config | Load configuration from file or dict.
load_config_and_model | Load configuration and initialize the model.
map_gene_names_to_ids | Map gene names to gene IDs case-insensitively, using nan for unavailable gene names.
train | Train a new model using the configuration in self.cfg.
validate_config | Validate configuration constraints.
Attributes#
- scConcept.species#
Return species names supported by the loaded model.
Methods#
- static scConcept.apply_compatibility_changes(cfg)#
Apply compatibility changes for older checkpoints. Returns updated cfg.
- scConcept.estimate_mutual_information(cell_embs_1, cell_embs_2)#
Estimate mutual information between two aligned sets of cell embeddings.
The estimate follows the contrastive objective used by scConcept:
log(n_cells) - cross_entropy(normalized_embs_1 @ normalized_embs_2.T * logit_scale, arange(n_cells))
- Return type:
- Args:
cell_embs_1: Tensor of shape (n_cells, embedding_dim) for the first panel/view.
cell_embs_2: Tensor of shape (n_cells, embedding_dim) for the second panel/view.
- Returns:
Estimated mutual information as log(n_cells) - contrastive_loss.
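The formula above can be illustrated with a minimal, dependency-free sketch. This is not the library's implementation: the function name estimate_mi_sketch is hypothetical, plain Python lists stand in for tensors, logit_scale is passed as an argument (with an arbitrary default) rather than read from a model, and the cross-entropy is assumed to be averaged over cells.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def estimate_mi_sketch(embs_1, embs_2, logit_scale=1.0):
    """Illustrative estimate: log(n_cells) - mean cross-entropy.

    Row i of embs_1 is treated as the positive match of row i of embs_2;
    all other rows of embs_2 act as negatives.
    """
    n_cells = len(embs_1)
    a = [l2_normalize(v) for v in embs_1]
    b = [l2_normalize(v) for v in embs_2]

    loss = 0.0
    for i in range(n_cells):
        # Scaled cosine similarities between cell i (view 1) and every cell (view 2).
        logits = [logit_scale * sum(x * y for x, y in zip(a[i], b[j]))
                  for j in range(n_cells)]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with target index i (the aligned cell).
        loss += log_z - logits[i]

    return math.log(n_cells) - loss / n_cells
```

Because the cross-entropy term is non-negative, an estimate of this form cannot exceed log(n_cells); it approaches that bound when each cell is matched confidently to its aligned partner and falls to zero when the embeddings carry no information about the pairing.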
- scConcept.extract_embeddings(adata, species=None, gene_id_column=None, batch_size=32, max_tokens=None, gene_sampling_strategy=None, use_learnable_embs=True, return_type='numpy', num_workers=8)#
Extract embeddings from AnnData using the loaded model.
- Args:
adata: AnnData object containing single-cell data.
species: Species identifier (e.g. ‘hsapiens’). If not provided, it is inferred from the gene IDs using the tokenizer’s gene mappings, if possible.
gene_id_column: Column name in adata.var that holds the gene IDs, e.g. ENSGXXXXXXXXXXX (default: None, which uses the index).
batch_size: Batch size for the dataloader (default: 32).
max_tokens: Maximum number of tokens per cell (if None, uses the config default).
gene_sampling_strategy: Gene sampling strategy (‘top-nonzero’, etc.) (if None, uses the config default).
use_learnable_embs: Whether to enable species-specific learnable embeddings during prediction for maximum single-species performance. Set to False for better cross-species alignment (default: True).
return_type: Output type for embeddings: "numpy" (default) or "torch".
- Returns:
dict: Dictionary containing ‘cls_cell_emb’ and, optionally, ‘context_sizes’.
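As a usage sketch, the helper below requests embeddings intended for cross-species comparison by passing use_learnable_embs=False, as the parameter description suggests. The helper name embed_cross_species is hypothetical, and model is assumed to be an scConcept instance on which load_config_and_model has already been called.

```python
def embed_cross_species(model, adata, species=None, batch_size=32):
    """Hypothetical helper: return the CLS cell embeddings for `adata`.

    `model` is assumed to be a loaded scConcept instance and `adata`
    an AnnData object; neither is constructed here.
    """
    out = model.extract_embeddings(
        adata,
        species=species,           # None asks the model to infer the species
        batch_size=batch_size,
        use_learnable_embs=False,  # documented as better for cross-species alignment
        return_type="numpy",
    )
    # The returned dict is documented to contain 'cls_cell_emb'.
    return out["cls_cell_emb"]
```

For single-species work, the documented default use_learnable_embs=True would be the natural choice instead.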
- scConcept.get_gene_name_to_id_mapping(species)#
Return a {gene_name: gene_id} mapping for one species.
- static scConcept.load_config(config)#
Load configuration from file or dict.
- scConcept.load_config_and_model(model_name=None, config=None, model_path=None, gene_mappings_path=None, panels_dir=None, pretrained_vocabulary_path=None)#
Load configuration and initialize the model.
- Args:
model_name: Model name to download from Hugging Face (e.g., ‘corpus40M-model30M’). Required if direct paths are not provided. List of models: https://huggingface.co/theislab/scConcept/tree/main
config: Configuration, given as a path to a config file (.yaml) as str or Path, a dictionary, or a DictConfig. When used with model_name, it is merged on top of the downloaded config.
model_path: Path to the model checkpoint file (.ckpt). If provided, bypasses the Hugging Face download.
gene_mappings_path: Path to gene mappings. For multi-species models, a directory containing {species}.csv files (one per species). For single-species models, a .pkl or .csv file. If provided, bypasses the Hugging Face download.
panels_dir: Path to the panels directory. If provided, bypasses the Hugging Face download.
pretrained_vocabulary_path: Path to the pretrained vocabulary directory (containing .csv files). If provided, overrides config PATH.PRETRAINED_VOCABULARY.
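The two loading routes described above (a named model from Hugging Face, or direct local paths) can be sketched as two small wrappers. Both function names are hypothetical, model is assumed to be an scConcept instance, and which local paths must be supplied together is an assumption rather than something the reference states.

```python
def load_from_hub(model, model_name, config_overrides=None):
    """Hypothetical helper: load a named model from Hugging Face.

    `config_overrides` (path, dict, or DictConfig) is documented to be
    merged on top of the downloaded config when used with `model_name`.
    """
    model.load_config_and_model(model_name=model_name, config=config_overrides)
    return model


def load_from_local(model, config_path, checkpoint_path,
                    gene_mappings_path, panels_dir):
    """Hypothetical helper: load from local files instead of downloading.

    Assumes all four paths are available locally; each documented path
    argument bypasses the corresponding Hugging Face download.
    """
    model.load_config_and_model(
        config=config_path,
        model_path=checkpoint_path,
        gene_mappings_path=gene_mappings_path,
        panels_dir=panels_dir,
    )
    return model
```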
- scConcept.map_gene_names_to_ids(species, gene_names)#
Map gene names to gene IDs case-insensitively, using nan for unavailable gene names.
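The described behavior (case-insensitive lookup with nan for misses) can be illustrated with a short sketch. It is not the library's code: the function name, the plain-dict input, and the placeholder IDs in the usage note are all assumptions made for illustration.

```python
import math


def map_names_case_insensitive(name_to_id, gene_names):
    """Illustrative lookup: match gene names ignoring case.

    `name_to_id` is a plain {gene_name: gene_id} dict, such as the one
    get_gene_name_to_id_mapping is documented to return. Names with no
    match map to float nan.
    """
    # Index the mapping by lower-cased name so lookups ignore case.
    lowered = {name.lower(): gene_id for name, gene_id in name_to_id.items()}
    return [lowered.get(name.lower(), math.nan) for name in gene_names]
```

For example, with the made-up mapping {"TP53": "GENE_ID_1"}, the input ["tp53", "unknown"] yields "GENE_ID_1" followed by nan. Since nan never compares equal to itself, test for misses with math.isnan rather than ==.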
- scConcept.train(adata_list, species=None, max_steps=None, batch_size=None)#
Train a new model using the configuration in self.cfg.
Uses self.model if it exists, otherwise initializes a new model. Assumes single GPU device with num_nodes=1.
- Args:
adata_list: A single AnnData object or file path string, or a list of these.
species: Species identifier(s). A single string when adata_list is a single item, or a list of strings with the same length as adata_list when it is a list. If None (or a list containing None entries), species will be inferred from gene ID overlap with the tokenizer vocabularies. Inference is only possible for AnnData items, not file path strings.
max_steps: Optional maximum number of training steps. If provided, overrides config value.
batch_size: Optional batch size for training. If provided, overrides config value.
Examples:
# Single AnnData: species inferred automatically
model.train(adata)
# Single AnnData: species provided explicitly
model.train(adata, species="hsapiens")
# List of AnnData: all species inferred
model.train([adata1, adata2])
# List of AnnData: all species provided explicitly
model.train([adata1, adata2], species=["hsapiens", "mmusculus"])
# List of AnnData: mixed, first inferred, second explicit
model.train([adata1, adata2], species=[None, "mmusculus"])
# File path strings: species must always be provided explicitly
model.train("path/to/data.h5ad", species="hsapiens")
model.train(["path/to/a.h5ad", "path/to/b.h5ad"], species=["hsapiens", "mmusculus"])
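The rules for pairing adata_list with species can be checked before calling train. The sketch below is a hypothetical pre-flight helper, not part of scConcept: it only encodes the constraints stated in the argument descriptions, and it treats any str item as a file path.

```python
def normalize_train_inputs(adata_list, species=None):
    """Hypothetical check: return (items, species_list) of equal length.

    Raises ValueError when the documented pairing rules are violated.
    """
    if not isinstance(adata_list, list):
        # Single item: species must be a single string or None.
        if isinstance(species, list):
            raise ValueError("use a single species string for a single item")
        items, species_list = [adata_list], [species]
    else:
        items = list(adata_list)
        if species is None:
            species_list = [None] * len(items)
        elif isinstance(species, str):
            raise ValueError("use a list of species for a list of items")
        else:
            species_list = list(species)
        if len(species_list) != len(items):
            raise ValueError("species must have the same length as adata_list")

    for item, sp in zip(items, species_list):
        # Species inference is documented as impossible for file paths.
        if isinstance(item, str) and sp is None:
            raise ValueError(f"species must be given explicitly for {item!r}")
    return items, species_list
```

The returned pair could then be passed on as model.train(items, species=species_list), so that a malformed call fails before any data loading starts.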
- static scConcept.validate_config(cfg)#
Validate configuration constraints.