Zoom-Heatmap Plot Module

This module provides functions to create a zoomed heatmap visualization of SNP associations, gene annotations, and linkage disequilibrium (LD) patterns.

It includes filtering and annotating SNP data, calculating LD matrices, and generating a three-panel plot with:

1. Association plot with SNPs colored by functional consequence
2. Gene track showing gene locations and orientations
3. LD heatmap showing correlation patterns between SNPs

ideal_genom.visualization.zoom_heatmap.draw_zoomed_heatmap(data_df: DataFrame, lead_snp: str, snp_col: str, p_col: str, pos_col: str, chr_col: str, output_folder: str, bfile_folder: str, bfile_name: str, pval_threshold: float = 5e-06, radius: int | float = 1000000.0, build: str = '38', gtf_path: str | None = None, anno_source: str = 'ensembl', batch_size: int = 100, effect_dict: dict = {}, extension: str = 'pdf', request_persec: int = 15) → bool

Creates a zoomed heatmap visualization around a lead SNP showing LD patterns and gene annotations.

This function generates a three-panel plot:

1. Association plot with SNPs colored by functional consequence
2. Gene track showing gene locations and orientations
3. LD heatmap showing correlation patterns between SNPs

Parameters:

data_df (pandas.DataFrame) – Input DataFrame containing GWAS summary statistics
lead_snp (str) – Identifier of the lead SNP to center the plot around
snp_col (str) – Column name containing SNP identifiers
p_col (str) – Column name containing p-values
pos_col (str) – Column name containing genomic positions
chr_col (str) – Column name containing chromosome numbers
output_folder (str) – Path to save output files
bfile_folder (str) – Folder containing PLINK binary files
bfile_name (str) – Base name of PLINK binary files (without extensions)
pval_threshold (float, optional) – P-value threshold for significance, default 5e-6
radius (Union[int, float], optional) – Distance in base pairs to plot around lead SNP, default 1e6
build (str, optional) – Genome build version, default ‘38’
gtf_path (str, optional) – Path to custom GTF file, default None
anno_source (str, optional) – Source for gene annotations (‘ensembl’ or ‘refseq’), default ‘ensembl’
batch_size (int, optional) – Batch size for API requests, default 100
effect_dict (dict, optional) – Dictionary mapping functional effects to display names, default empty dict
extension (str, optional) – File extension for output plot, default ‘pdf’
request_persec (int, optional) – Number of API requests per second allowed, default 15

Returns:

True if plot was generated successfully

Return type:

bool

Raises:

TypeError – If input parameters are not of the expected types
FileNotFoundError – If specified folders or PLINK binary files do not exist
ValueError – If the required columns are not found in the DataFrame

Notes

The input DataFrame must contain columns for SNP IDs, p-values, positions, and chromosomes. The PLINK binary files (.bed, .bim, .fam) must exist in the specified folder. The function generates and saves a zoomed heatmap plot in the specified output folder.
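The requirements above can be checked before calling the function. The following is a minimal sketch, not part of ideal_genom: the helper name check_inputs and its messages are hypothetical, and it only mirrors the documented ValueError and FileNotFoundError conditions.

```python
from pathlib import Path


def check_inputs(columns, required_cols, bfile_folder, bfile_name, output_folder):
    """Hypothetical pre-flight check; not part of ideal_genom."""
    # ValueError: a named column is missing from the table
    missing = [c for c in required_cols if c not in columns]
    if missing:
        raise ValueError(f"Columns not found: {missing}")

    # FileNotFoundError: the output folder does not exist
    if not Path(output_folder).is_dir():
        raise FileNotFoundError(f"Output folder not found: {output_folder}")

    # FileNotFoundError: any of the three PLINK binary files is missing
    for ext in (".bed", ".bim", ".fam"):
        path = Path(bfile_folder) / f"{bfile_name}{ext}"
        if not path.is_file():
            raise FileNotFoundError(f"PLINK file not found: {path}")
    return True
```

With a pandas DataFrame, the first argument can be data_df.columns and the second a list of the four column names passed as snp_col, p_col, pos_col, and chr_col.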

ideal_genom.visualization.zoom_heatmap.filter_sumstats(data_df: DataFrame, lead_snp: str, snp_col: str, p_col: str, pos_col: str, chr_col: str, pval_threshold: float = 5e-08, radius: float | int = 10000000.0) → DataFrame

Filter GWAS summary statistics based on a lead SNP, p-value threshold and genomic region.

This function filters a DataFrame containing GWAS summary statistics to return variants within a specified genomic region around a lead SNP that meet a p-value significance threshold.

Parameters:

data_df (pandas.DataFrame) – DataFrame containing GWAS summary statistics
lead_snp (str) – Identifier of the lead SNP to center the region on
snp_col (str) – Name of column containing SNP identifiers
p_col (str) – Name of column containing p-values
pos_col (str) – Name of column containing genomic positions
chr_col (str) – Name of column containing chromosome numbers/identifiers
pval_threshold (float, optional) – P-value significance threshold for filtering variants (default: 5e-8)
radius (Union[float, int], optional) – Size of region to include around lead SNP in base pairs (default: 1e7, i.e. 10 Mb)

Returns:

Filtered DataFrame containing only variants that:

Are on the same chromosome as the lead SNP
Meet the p-value threshold
Fall within the specified region around the lead SNP

The returned DataFrame also includes a calculated -log10(p-value) column.

Return type:

pandas.DataFrame

Raises:

TypeError – If input parameters are not of the expected types
ValueError – If the specified columns are not found in the DataFrame, or if the lead SNP is not found in the DataFrame

Notes

The function adds a ‘log10p’ column containing -log10 transformed p-values to the filtered DataFrame before returning it.
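The filtering logic described above can be sketched with pandas as follows. This is an illustrative re-implementation, not the library's code: the function name filter_region is hypothetical, and two details are assumptions, namely that the p-value comparison is inclusive and that radius is measured on each side of the lead SNP.

```python
import numpy as np
import pandas as pd


def filter_region(df, lead_snp, snp_col, p_col, pos_col, chr_col,
                  pval_threshold=5e-8, radius=1e7):
    """Hypothetical re-implementation for illustration only."""
    lead = df.loc[df[snp_col] == lead_snp]
    if lead.empty:
        raise ValueError(f"Lead SNP {lead_snp} not found")
    lead_chr = lead[chr_col].iloc[0]
    lead_pos = lead[pos_col].iloc[0]

    mask = (
        (df[chr_col] == lead_chr)                     # same chromosome
        & (df[p_col] <= pval_threshold)               # assumption: inclusive
        & ((df[pos_col] - lead_pos).abs() <= radius)  # assumption: each side
    )
    out = df.loc[mask].copy()
    out["log10p"] = -np.log10(out[p_col])             # -log10 transformed p-value
    return out.reset_index(drop=True)
```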

ideal_genom.visualization.zoom_heatmap.get_gene_information(genes: list, gtf_path: str | None = None, build: str = '38', anno_source: str = 'ensembl') → DataFrame

Retrieves genomic information for a list of genes using Ensembl annotation.

This function fetches start position, end position, strand, and length information for each gene in the provided list using either Ensembl GRCh37 or GRCh38 annotations.

Parameters:

genes (list) – List of gene IDs (Ensembl format)
gtf_path (str, optional) – Path to a custom GTF file. If None, will download and use Ensembl GTF.
build (str, default "38") – Human genome build version. Supported values: “19”, “37”, “38”. Note: “19” and “37” are equivalent.
anno_source (str, default "ensembl") – Source of genome annotations. Currently only supports “ensembl”

Returns:

DataFrame containing gene information with columns:

gene: gene ID
start: gene start position
end: gene end position
strand: gene strand
length: gene length

Return type:

pandas.DataFrame

Raises:

ValueError – If unsupported build version or annotation source is provided
FileNotFoundError – If provided GTF file path does not exist
TypeError – If provided GTF path is not a string

Notes

When gtf_path is None, the function will automatically download and process the appropriate Ensembl GTF file based on the specified build version. The function uses the Ensembl Python API to fetch gene information.
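To illustrate what the returned columns mean, the sketch below extracts gene records from GTF-formatted lines using only the standard library. It is not how the library itself reads annotations: the helper genes_from_gtf is hypothetical, and the length formula (end - start + 1, since GTF coordinates are 1-based and inclusive) is an assumption about how "length" is defined.

```python
import re

# GTF attribute column holds entries such as: gene_id "ENSG00000223972";
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def genes_from_gtf(lines, gene_ids):
    """Hypothetical helper: start, end, strand and length per requested gene."""
    wanted = set(gene_ids)
    records = []
    for line in lines:
        if line.startswith("#"):          # skip header comments
            continue
        fields = line.rstrip("\n").split("\t")
        # A GTF row has 9 tab-separated columns; keep only 'gene' features
        if len(fields) != 9 or fields[2] != "gene":
            continue
        match = GENE_ID.search(fields[8])
        if match is None or match.group(1) not in wanted:
            continue
        start, end = int(fields[3]), int(fields[4])
        records.append({
            "gene": match.group(1),
            "start": start,
            "end": end,
            "strand": fields[6],
            "length": end - start + 1,    # assumption: inclusive length
        })
    return records
```

The list of dictionaries can be passed to pandas.DataFrame to obtain a table with the same five columns.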

ideal_genom.visualization.zoom_heatmap.get_ld_matrix(data_df: DataFrame, snp_col: str, pos_col: str, bfile_folder: str, bfile_name: str, output_path: str) → dict

Calculate LD matrix using PLINK for a set of SNPs.

This function takes a DataFrame containing SNP information and calculates the linkage disequilibrium (LD) matrix using PLINK. The SNPs are first sorted by position, and then PLINK is used to compute pairwise r² values between SNPs.

Parameters:

data_df (pandas.DataFrame) – DataFrame containing SNP information
snp_col (str) – Name of the column containing SNP IDs
pos_col (str) – Name of the column containing SNP positions
bfile_folder (str) – Path to the folder containing PLINK binary files
bfile_name (str) – Base name of the PLINK binary files (without extensions)
output_path (str) – Path where output files will be saved

Returns:

Dictionary containing:

‘pass’ (bool) – True if process completed successfully
‘step’ (str) – Name of the processing step (‘get_ld_matrix’)
‘output’ (dict) – Dictionary with output file paths

Return type:

dict

Raises:

FileNotFoundError – If any required files or directories are not found
TypeError – If input parameters are not of the expected types
ValueError – If specified columns are not found in the DataFrame
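The sort-then-PLINK step described for get_ld_matrix could look roughly like the sketch below. The exact command the library runs is not documented here, so this is an assumption: it uses the PLINK 1.9 flags --bfile, --extract, --r2 square, and --out, and the helper name and the file names ld_snps.txt and ld_matrix are hypothetical.

```python
from pathlib import Path


def build_ld_command(snps_with_pos, bfile_folder, bfile_name, output_path):
    """Hypothetical helper: returns (ordered SNP ids, SNP list path, argv)."""
    # Sort by position so the SNP labels follow genomic order
    ordered = [snp for snp, _ in sorted(snps_with_pos, key=lambda pair: pair[1])]

    out_dir = Path(output_path)
    snp_list = out_dir / "ld_snps.txt"    # hypothetical file name
    out_prefix = out_dir / "ld_matrix"    # hypothetical output prefix

    cmd = [
        "plink",
        "--bfile", str(Path(bfile_folder) / bfile_name),
        "--extract", str(snp_list),       # one SNP id per line
        "--r2", "square",                 # square matrix of pairwise r2
        "--out", str(out_prefix),
    ]
    return ordered, snp_list, cmd
```

To run it, the ordered ids would be written to the SNP list file, one per line, and the command passed to subprocess.run(cmd, check=True). PLINK keeps variants in the order of the .bim file, so the sorted labels match the matrix only if that file is itself sorted by position.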

ideal_genom.visualization.zoom_heatmap.get_zoomed_data(data_df: DataFrame, lead_snp: str, snp_col: str, p_col: str, pos_col: str, chr_col: str, output_folder: str, pval_threshold: float = 5e-06, radius: float | int = 1000000.0, build: str = '38', anno_source: str = 'ensembl', gtf_path: str | None = None, batch_size: int = 100, request_persec: int = 15) → DataFrame

Filter and annotate SNP data around a lead SNP within a specified radius.

This function filters significant SNPs in a region around a lead SNP and annotates them with gene names and functional consequences. The position values are scaled to Megabase pairs (Mbp).

Parameters:

data_df (pandas.DataFrame) – Input DataFrame containing SNP data
lead_snp (str) – Identifier of the lead SNP to center the region around
snp_col (str) – Name of the column containing SNP identifiers
p_col (str) – Name of the column containing p-values
pos_col (str) – Name of the column containing position information
chr_col (str) – Name of the column containing chromosome information
output_folder (str) – Path to the output folder (must exist)
pval_threshold (float, optional) – P-value threshold for significance filtering (default: 5e-6)
radius (Union[float, int], optional) – Radius around the lead SNP in base pairs (default: 1e6)
build (str, optional) – Genome build version (‘38’ or ‘37’) (default: ‘38’)
anno_source (str, optional) – Source for annotations (‘ensembl’ or other supported sources) (default: ‘ensembl’)
gtf_path (str, optional) – Path to GTF file for annotations (default: None)
batch_size (int, optional) – Number of SNPs to process in each batch (default: 100)
request_persec (int, optional) – Number of API requests per second allowed (default: 15)

Returns:

Filtered and annotated DataFrame with an added Mbp column and duplicates removed

Return type:

pandas.DataFrame

Raises:

TypeError – If input parameters are not of the expected types
FileNotFoundError – If output_folder does not exist
ValueError – If no significant SNPs are found in the specified region

Notes

The function removes duplicate SNPs, keeping the first occurrence only. Position values are converted to Megabase pairs in the output DataFrame.
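The two post-processing steps in the note can be sketched with pandas as follows. The helper name is hypothetical, and the output column name "Mbp" and the order of the two steps are assumptions based on the description above.

```python
import pandas as pd


def add_mbp_and_dedup(df, snp_col, pos_col):
    """Hypothetical helper: drop repeated SNPs, then scale positions to Mbp."""
    # Keep only the first row for each SNP identifier
    out = df.drop_duplicates(subset=snp_col, keep="first").copy()
    # Base pairs to megabase pairs
    out["Mbp"] = out[pos_col] / 1e6
    return out.reset_index(drop=True)
```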

ideal_genom.visualization.zoom_heatmap.snp_annotations(data_df: DataFrame, snp_col: str, pos_col: str, chr_col: str, build: str = '38', anno_source: str = 'ensembl', gtf_path: str | None = None, batch_size: int = 100, request_persec: int = 15) → DataFrame

Annotate SNPs with gene names and functional consequences using Ensembl databases.

This function takes a DataFrame containing SNP information and adds gene name annotations and functional consequence annotations using the Ensembl VEP (Variant Effect Predictor) API.

Parameters:

data_df (pandas.DataFrame) – Input DataFrame containing SNP information
snp_col (str) – Name of column containing SNP IDs
pos_col (str) – Name of column containing genomic positions
chr_col (str) – Name of column containing chromosome numbers
build (str, optional) – Genome build version (‘38’, ‘37’, or ‘19’), by default ‘38’
anno_source (str, optional) – Source for annotations (‘ensembl’), by default ‘ensembl’
gtf_path (str, optional) – Path to GTF file for annotations, by default None
batch_size (int, optional) – Number of SNPs to process in each API request batch, by default 100
request_persec (int, optional) – Maximum number of API requests per second, by default 15

Returns:

Input DataFrame augmented with gene name and functional consequence annotations. Added columns:

GENENAME: Gene name from Ensembl
Functional_Consequence: Most severe consequence from VEP

Return type:

pandas.DataFrame

Raises:

ValueError – If the specified genome build version is not supported, if the annotation source is not supported, or if the specified columns are not found in the DataFrame
TypeError – If input parameters are not of the expected types

Notes

Supports genome builds 19/37 and 38 using different Ensembl REST API endpoints. Implements rate limiting and batch processing for API requests.
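How batch_size and request_persec might interact can be sketched as below. This is a generic illustration of batching with a simple rate limit, not the library's implementation: both function names are hypothetical, and send stands in for whatever function performs one API request per batch.

```python
import time


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_throttled(batches, send, requests_per_sec,
                  sleep=time.sleep, clock=time.monotonic):
    """Call send(batch) for each batch, at most requests_per_sec times per second."""
    min_interval = 1.0 / requests_per_sec
    results = []
    last = None
    for batch in batches:
        now = clock()
        # Wait if the previous request started less than min_interval ago
        if last is not None and now - last < min_interval:
            sleep(min_interval - (now - last))
        last = clock()
        results.append(send(batch))
    return results
```

With the documented defaults, 250 SNP identifiers would be split into batches of 100, 100, and 50, with consecutive requests started at least 1/15 of a second apart.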