Annotations Module
This module provides functions to annotate genomic variants with gene information and effects.
Includes functions to:
- Find the closest gene to a given SNP position.
- Map chromosome numbers to identifiers.
- Convert GTF files to a format containing all genes.
- Annotate SNPs with gene names using Ensembl or RefSeq databases.
- Prepare genome data from GTF files.
- Annotate variants with their effects relying on Ensembl VEP.
- ideal_genom.annotations.annotate_snp(insumstats: DataFrame, gtf_path: str, chrom: str = 'CHR', pos: str = 'POS', build: str = '38', source: str = 'ensembl') DataFrame
Annotate SNPs with nearest gene name(s) using either Ensembl or RefSeq databases.
This function takes a DataFrame containing SNP data and annotates each variant with information about the nearest gene(s) based on genomic coordinates.
- Parameters:
insumstats (pandas.DataFrame) – DataFrame containing SNP data with chromosome and position information.
gtf_path (str) – Path to the GTF (Gene Transfer Format) file for gene annotations.
chrom (str, optional) – Column name in the DataFrame that contains chromosome information. Defaults to “CHR”.
pos (str, optional) – Column name in the DataFrame that contains position information. Defaults to “POS”.
build (str, optional) – Genome build version. Must be one of “19”, “37”, or “38”. Defaults to “38”.
source (str, optional) – Source for gene annotation. Must be either “ensembl” or “refseq”. Defaults to “ensembl”.
- Returns:
A copy of the input DataFrame with additional gene annotation columns.
- Return type:
pandas.DataFrame
- Raises:
TypeError – If input is not a pandas DataFrame or if GTF path is not a string.
ValueError – If required columns are missing in the input DataFrame or if build/source parameters are invalid.
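The error contract above can be sketched as a standalone check. This is a minimal illustration under stated assumptions, not the library's code: `check_annotate_args` is a hypothetical helper, and a plain list of column names stands in for `insumstats.columns` (the DataFrame type check is left out so the sketch needs no third-party packages).

```python
VALID_BUILDS = {"19", "37", "38"}
VALID_SOURCES = {"ensembl", "refseq"}


def check_annotate_args(columns, gtf_path, chrom="CHR", pos="POS",
                        build="38", source="ensembl"):
    """Hypothetical helper mirroring the documented TypeError/ValueError rules."""
    if not isinstance(gtf_path, str):
        raise TypeError("gtf_path must be a string")
    # Both the chromosome and the position column must be present.
    missing = [name for name in (chrom, pos) if name not in columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    if build not in VALID_BUILDS:
        raise ValueError(f"build must be one of {sorted(VALID_BUILDS)}")
    if source not in VALID_SOURCES:
        raise ValueError(f"source must be one of {sorted(VALID_SOURCES)}")


# Passes silently: both default column names are present.
check_annotate_args(["CHR", "POS", "P"], "genes.gtf")
```

Running the checks before any GTF file is loaded means a bad column name or build string fails fast, before any expensive work starts.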
- ideal_genom.annotations.annotate_variants(output: DataFrame, data: Genome, chrom: str, pos: str, source: str, build: str = '38') DataFrame
Annotate variants with their closest genes.
This function processes a DataFrame containing genomic variants and enriches it with gene annotation information by finding the closest gene for each variant.
- Parameters:
output (pandas.DataFrame) – DataFrame containing variant information to be annotated.
data (Genome) – Genome object containing reference data for annotation.
chrom (str) – Column name in the output DataFrame that contains chromosome information.
pos (str) – Column name in the output DataFrame that contains position information.
source (str) – Source of the gene annotation data.
build (str, default='38') – Genome build version; '38' corresponds to GRCh38.
- Returns:
DataFrame containing gene annotation information for each variant.
- Return type:
pandas.DataFrame
Notes
This function applies the get_closest_gene function to each row in the input DataFrame and returns the results as a DataFrame with the same index as the input.
- ideal_genom.annotations.annotate_with_ensembl(output: DataFrame, chrom: str, pos: str, build: str, gtf_path: str, is_gtf_path: bool) DataFrame
Annotate variants with gene information from Ensembl database.
This function adds gene annotations to a DataFrame containing variant information by looking up the genomic coordinates in Ensembl data. It adds ‘LOCATION’ and ‘GENE’ columns to the input DataFrame.
- Parameters:
output (pandas.DataFrame) – DataFrame containing variant information with chromosome and position columns.
chrom (str) – Name of the column in the DataFrame that contains chromosome information.
pos (str) – Name of the column in the DataFrame that contains position information.
build (str) – Genome build version to use. Must be one of ‘19’, ‘37’, or ‘38’. Note that ‘19’ and ‘37’ are treated as equivalent (GRCh37).
gtf_path (str) – Path to GTF file with gene annotations or None to use default paths. If None, the appropriate GTF file will be downloaded or used from cache.
is_gtf_path (bool) – If True, gtf_path is treated as a direct path to a GTF file. If False, gtf_path is treated as a directory where the GTF file should be downloaded.
- Returns:
The input DataFrame with additional ‘LOCATION’ and ‘GENE’ columns containing gene annotations from Ensembl.
- Return type:
pandas.DataFrame
- Raises:
TypeError – If output is not a pandas DataFrame or if gtf_path is not a string (when provided).
ValueError – If the required columns are not in the DataFrame or if the build is invalid.
Notes
The function supports both GRCh37 (build ‘19’ or ‘37’) and GRCh38 (build ‘38’) and will download the appropriate annotation files if not already available.
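The build handling described in these notes can be sketched as a small lookup. `build_to_assembly` is a hypothetical name used only for illustration; how the library resolves builds internally is not shown in this documentation.

```python
def build_to_assembly(build):
    """Hypothetical helper: map a build string to its GRCh assembly name."""
    # '19' (hg19) and '37' name the same assembly.
    if build in ("19", "37"):
        return "GRCh37"
    if build == "38":
        return "GRCh38"
    raise ValueError("build must be one of '19', '37', or '38'")


print(build_to_assembly("19"))  # prints GRCh37
```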
- ideal_genom.annotations.annotate_with_refseq(output: DataFrame, chrom: str, pos: str, build: str, gtf_path: str, is_gtf_path: bool) DataFrame
Annotate genomic variants with RefSeq gene information.
This function adds gene and location annotations to genomic variants using NCBI RefSeq data. It processes the input DataFrame and adds two new columns: ‘LOCATION’ and ‘GENE’.
- Parameters:
output (pandas.DataFrame) – DataFrame containing variant information to annotate.
chrom (str) – Column name in DataFrame that contains chromosome information.
pos (str) – Column name in DataFrame that contains position information.
build (str) – Genome build version. Must be one of ‘19’, ‘37’, or ‘38’.
gtf_path (str) – Path to the GTF file. If None, a default path will be used.
is_gtf_path (bool) – If True, gtf_path is treated as a direct file path. If False, gtf_path is treated as a directory.
- Returns:
The input DataFrame with added ‘LOCATION’ and ‘GENE’ columns.
- Return type:
pandas.DataFrame
- Raises:
TypeError – If output is not a pandas DataFrame or if gtf_path is provided but not a string.
ValueError – If required columns are missing from output or if build is invalid.
Notes
For builds ‘19’ and ‘37’, GRCh37 RefSeq annotations are used.
For build ‘38’, GRCh38 RefSeq annotations are used.
Only protein-coding genes are considered for annotation.
- ideal_genom.annotations.get_chr_to_NC(build: str = '38', inverse: bool = False) dict
Returns a dictionary mapping between chromosome names and NCBI NC identifiers.
This function provides a mapping between chromosome names (like “1”, “X”, “MT”) and their corresponding NCBI RefSeq accession numbers (like “NC_000001.10”) for different human genome builds.
- Parameters:
build (str, optional) – The genome build version. Accepted values are “19”, “37”, or “38”. Note that builds “19” and “37” return the same mapping.
inverse (bool, optional) – If True, returns an inverted dictionary where NC identifiers are keys and chromosome names are values. Defaults to False.
- Returns:
A dictionary mapping chromosome names to NC identifiers (if inverse=False) or NC identifiers to chromosome names (if inverse=True).
- Return type:
dict
- Raises:
TypeError – If build is not a string or inverse is not a boolean.
ValueError – If build is not one of “19”, “37”, or “38”.
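To illustrate what inverse=True changes, here is a minimal sketch using a two-entry excerpt of the GRCh37 mapping. The names `CHR_TO_NC_GRCH37` and `invert` are assumptions made for this example; the real function returns the full mapping for the requested build.

```python
# Two-entry excerpt of the GRCh37 mapping, for illustration only.
CHR_TO_NC_GRCH37 = {"1": "NC_000001.10", "X": "NC_000023.10"}


def invert(mapping):
    """Swap keys and values, as described for inverse=True."""
    return {value: key for key, value in mapping.items()}


nc_to_chr = invert(CHR_TO_NC_GRCH37)
print(nc_to_chr["NC_000001.10"])  # prints 1
```

The inverted form is the one needed when reading RefSeq GTF files, where sequences are named by NC accession rather than by chromosome label.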
- ideal_genom.annotations.get_closest_gene(x, data: Genome, chrom: str = 'CHR', pos: str = 'POS', max_iter: int = 20000, step: int = 50, source: str = 'ensembl', build: str = '38') tuple
Find the closest gene to a given position in the genome.
This function searches for the closest gene to a specified SNP position in the genome. It checks the position in the specified chromosome and returns the distance to the closest gene along with the gene name(s). If no gene is found within the specified distance, it returns “intergenic”.
- Parameters:
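x – SNP record (for example a DataFrame row or a dictionary) holding the chromosome and position under the keys named by chrom and pos.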
data (pyensembl.Genome) – An instance of the Genome class containing gene annotations.
chrom (str, optional) – The key in the dictionary x that corresponds to the chromosome. Default is “CHR”.
pos (str, optional) – The key in the dictionary x that corresponds to the position. Default is “POS”.
max_iter (int, optional) – The maximum number of iterations to search for a gene. Default is 20000.
step (int, optional) – The step size for each iteration when searching for a gene. Default is 50.
source (str, optional) – The source of the gene annotations, either “ensembl” or “refseq”. Default is “ensembl”.
build (str, optional) – The genome build version, used when source is “refseq”. Default is “38”.
- Returns:
A tuple containing the distance to the closest gene and the gene name(s). If no gene is found, returns the distance and “intergenic”.
- Return type:
tuple
- Raises:
TypeError – If data is not an instance of Genome, or if chrom or pos are not strings, or if max_iter or step are not integers.
ValueError – If source is not “ensembl” or “refseq”, or if build is not “37” or “38”.
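The roles of max_iter and step can be pictured as an expanding-window search. The sketch below is an assumption about the general technique, not the library's implementation: it works on a toy list of (name, start, end) tuples for a single chromosome instead of a Genome object, and `closest_gene` and `gap` are hypothetical names.

```python
def gap(position, gene):
    """Distance from position to a (name, start, end) gene; 0 if inside it."""
    _, start, end = gene
    if start <= position <= end:
        return 0
    return min(abs(position - start), abs(position - end))


def closest_gene(position, genes, max_iter=20000, step=50):
    """Toy expanding-window search over (name, start, end) gene tuples."""
    for i in range(max_iter + 1):
        radius = i * step
        low, high = position - radius, position + radius
        # Genes overlapping the current window [low, high].
        hits = [g for g in genes if g[1] <= high and g[2] >= low]
        if hits:
            best = min(gap(position, g) for g in hits)
            names = sorted(g[0] for g in hits if gap(position, g) == best)
            return best, ",".join(names)
    # Nothing within max_iter * step bases of the position.
    return max_iter * step, "intergenic"


genes = [("GENE_A", 1000, 2000), ("GENE_B", 5000, 6000)]
print(closest_gene(1500, genes))  # prints (0, 'GENE_A')
print(closest_gene(2120, genes))  # prints (120, 'GENE_A')
```

In this sketch the search radius is capped at max_iter * step bases, which is 1,000,000 with the documented defaults.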
- ideal_genom.annotations.get_number_to_chr(in_chr: bool = False, xymt: list = ['X', 'Y', 'MT'], xymt_num: list = [23, 24, 25], prefix: str = '') dict
Creates a dictionary mapping chromosome numbers to chromosome identifiers.
This function generates a mapping between chromosome numbers (as keys) and chromosome identifiers (as values), with special handling for sex chromosomes and mitochondrial chromosome.
- Parameters:
in_chr (bool, default=False) – If True, dictionary keys will be strings; if False, keys will be integers.
xymt (list, default=["X","Y","MT"]) – List of string identifiers for the X, Y, and mitochondrial chromosomes.
xymt_num (list, default=[23,24,25]) – List of numeric identifiers corresponding to X, Y, and MT chromosomes.
prefix (str, default="") – String prefix to add to all chromosome identifiers.
- Returns:
A dictionary mapping chromosome numbers to chromosome identifiers. For autosomal chromosomes (1-199), maps to prefix+number. For sex and mitochondrial chromosomes, maps to prefix+X/Y/MT.
- Return type:
dict
- Raises:
TypeError – If in_chr is not a boolean, xymt or xymt_num are not lists, or prefix is not a string.
Examples
>>> get_number_to_chr()
{1: '1', 2: '2', ..., 23: 'X', 24: 'Y', 25: 'MT', ...}
>>> get_number_to_chr(in_chr=True, prefix="chr")
{'1': 'chr1', '2': 'chr2', ..., '23': 'chrX', '24': 'chrY', '25': 'chrMT', ...}
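A mapping with the documented shape could be built roughly as follows. This is a sketch inferred from the parameter and return descriptions above, not the library's source; `number_to_chr` is a hypothetical stand-in, and tuples replace the list defaults to avoid mutable default arguments.

```python
def number_to_chr(in_chr=False, xymt=("X", "Y", "MT"),
                  xymt_num=(23, 24, 25), prefix=""):
    """Hypothetical re-creation of the documented mapping."""
    # Numeric labels for 1-199, as stated in the Returns description.
    mapping = {i: prefix + str(i) for i in range(1, 200)}
    # Overwrite the sex and mitochondrial entries with their labels.
    for number, label in zip(xymt_num, xymt):
        mapping[number] = prefix + label
    if in_chr:
        mapping = {str(key): value for key, value in mapping.items()}
    return mapping


print(number_to_chr()[23])  # prints X
print(number_to_chr(in_chr=True, prefix="chr")["25"])  # prints chrMT
```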
- ideal_genom.annotations.gtf_to_all_genes(gtfpath: str) str
Extract all gene records from a GTF file and save them to a new file.
This function reads a GTF file, extracts all gene records, and saves them to a new file with the suffix ‘_all_genes.gtf.gz’. If the output file already exists, it will be returned without regenerating it.
- Parameters:
gtfpath (str) – Path to the input GTF file.
- Returns:
Path to the output file containing all gene records.
- Return type:
str
- Raises:
TypeError – If gtfpath is not a string.
Notes
The function uses the read_gtf function for initial parsing and pandas for extraction. The function assumes the GTF file has a standard format with gene_id attributes.
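The extraction step amounts to keeping only rows whose feature type is "gene". The sketch below shows that idea with the standard library on toy GTF lines; the documented function relies on read_gtf and pandas instead, and `gene_records` and the sample identifiers are made up for illustration.

```python
def gene_records(lines):
    """Yield GTF lines whose feature column (third field) is 'gene'."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and comment lines
        fields = line.rstrip("\n").split("\t")
        # A GTF record has nine tab-separated fields.
        if len(fields) >= 9 and fields[2] == "gene":
            yield line


# Toy input with made-up identifiers.
sample = [
    "#!toy header\n",
    '1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "GENE_A";\n',
    '1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    '2\ttoy\tgene\t50\t400\t.\t-\t.\tgene_id "G2"; gene_name "GENE_B";\n',
]
kept = list(gene_records(sample))
print(len(kept))  # prints 2
```

Dropping transcript and exon rows keeps the resulting file much smaller than the full annotation, which is the point of caching it separately.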
- ideal_genom.annotations.prepare_genome(gtf_path: str, reference_name: str, annotation_name: str) Genome
Prepare a genome annotation by loading or creating a database from a GTF file.
This function creates a Genome object from a GTF file and ensures that the corresponding database is indexed for efficient access.
- Parameters:
gtf_path (str) – Path to the GTF (Gene Transfer Format) file
reference_name (str) – Name of the reference genome
annotation_name (str) – Name of the annotation
- Returns:
A Genome object initialized with the provided reference and annotation
- Return type:
pyensembl.Genome
Notes
If the database file (with extension .db) doesn’t exist, this function will create it by calling the index() method on the Genome object.
- ideal_genom.annotations.prepare_gtf_path(gtf_path: str, is_gtf_path: bool, source: str, build: str) str
Prepares the path to a GTF (Gene Transfer Format) file for annotation purposes.
This function either uses a user-provided GTF file or downloads one from the specified source (Ensembl or RefSeq) for the given genome build (GRCh37 or GRCh38). If a download is required, it fetches the latest release, unzips it, and processes it to extract all genes.
- Parameters:
gtf_path (str) – Path to an existing GTF file, or None if one should be downloaded.
is_gtf_path (bool) – If True, gtf_path is treated as a direct path to an existing GTF file. If False, a GTF file is downloaded instead.
source (str) – Source database for GTF file (‘ensembl’ or ‘refseq’).
build (str) – Genome build version (‘37’ or ‘38’).
- Returns:
Path to the prepared GTF file with all genes.
- Return type:
str
Notes
If gtf_path is None or is_gtf_path is False, a new GTF file will be downloaded.
If a valid gtf_path is provided, it will be processed using gtf_to_all_genes.
The function logs the actions being performed.
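The branching described in these notes reduces to one condition. The following sketch only models that decision; `choose_gtf_action` and its return labels are hypothetical, and no download or file processing takes place.

```python
def choose_gtf_action(gtf_path, is_gtf_path):
    """Hypothetical helper: decide between downloading and reusing a GTF."""
    if gtf_path is None or not is_gtf_path:
        return "download"
    # A usable path was supplied; it would go through gtf_to_all_genes.
    return "use_existing"


print(choose_gtf_action(None, False))  # prints download
print(choose_gtf_action("genes.gtf", True))  # prints use_existing
```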