Zoom-Heatmap Plot Module

This module provides functions to create a zoomed heatmap visualization of SNP associations, gene annotations, and linkage disequilibrium (LD) patterns.

It includes filtering and annotating SNP data, calculating LD matrices, and generating a three-panel plot with:

  1. Association plot with SNPs colored by functional consequences

  2. Gene track showing gene locations and orientations

  3. LD heatmap showing correlation patterns between SNPs

ideal_genom.visualization.zoom_heatmap.draw_zoomed_heatmap(data_df: DataFrame, lead_snp: str, snp_col: str, p_col: str, pos_col: str, chr_col: str, output_folder: str, bfile_folder: str, bfile_name: str, pval_threshold: float = 5e-06, radius: int | float = 1000000.0, build: str = '38', gtf_path: str | None = None, anno_source: str = 'ensembl', batch_size: int = 100, effect_dict: dict = {}, extension: str = 'pdf', request_persec: int = 15) → bool

Creates a zoomed heatmap visualization around a lead SNP showing LD patterns and gene annotations.

This function generates a three-panel plot:

  1. Association plot with SNPs colored by functional consequences

  2. Gene track showing gene locations and orientations

  3. LD heatmap showing correlation patterns between SNPs

Parameters:
  • data_df (pandas.DataFrame) – Input DataFrame containing GWAS summary statistics

  • lead_snp (str) – Identifier of the lead SNP to center the plot around

  • snp_col (str) – Column name containing SNP identifiers

  • p_col (str) – Column name containing p-values

  • pos_col (str) – Column name containing genomic positions

  • chr_col (str) – Column name containing chromosome numbers

  • output_folder (str) – Path to save output files

  • bfile_folder (str) – Folder containing PLINK binary files

  • bfile_name (str) – Base name of PLINK binary files (without extensions)

  • pval_threshold (float, optional) – P-value threshold for significance, default 5e-6

  • radius (Union[int, float], optional) – Distance in base pairs to plot around lead SNP, default 1e6

  • build (str, optional) – Genome build version, default ‘38’

  • gtf_path (str, optional) – Path to custom GTF file, default None

  • anno_source (str, optional) – Source for gene annotations (‘ensembl’ or ‘refseq’), default ‘ensembl’

  • batch_size (int, optional) – Batch size for API requests, default 100

  • effect_dict (dict, optional) – Dictionary mapping functional effects to display names, default empty dict

  • extension (str, optional) – File extension for output plot, default ‘pdf’

  • request_persec (int, optional) – Number of API requests per second allowed, default 15

Returns:

True if plot was generated successfully

Return type:

bool

Raises:
  • TypeError – If input parameters are not of the expected types

  • FileNotFoundError – If specified folders or PLINK binary files do not exist

  • ValueError – If the required columns are not found in the DataFrame

Notes

The input DataFrame must contain columns for SNP IDs, p-values, positions, and chromosomes. The PLINK binary files (.bed, .bim, .fam) must exist in the specified folder. The function generates a zoomed heatmap plot and saves it in the specified output folder.
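The preconditions above can be checked before calling the function. The following is a minimal sketch, not the module's own validation code: the helper names missing_plink_files and missing_columns are hypothetical, and the checks only mirror the requirements stated in the notes.

```python
from pathlib import Path

# The three files that make up a PLINK binary fileset.
PLINK_EXTENSIONS = (".bed", ".bim", ".fam")


def missing_plink_files(bfile_folder: str, bfile_name: str) -> list[str]:
    """Return the expected PLINK files that are absent from bfile_folder.

    Hypothetical helper: an empty list means the fileset is complete.
    """
    folder = Path(bfile_folder)
    candidates = [folder / f"{bfile_name}{ext}" for ext in PLINK_EXTENSIONS]
    return [str(path) for path in candidates if not path.is_file()]


def missing_columns(columns, required) -> list[str]:
    """Return the required column names that are not present in columns."""
    present = set(columns)
    return [name for name in required if name not in present]
```

With a pandas DataFrame, missing_columns(data_df.columns, [snp_col, p_col, pos_col, chr_col]) would list any column names that the function is documented to reject with a ValueError.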

ideal_genom.visualization.zoom_heatmap.filter_sumstats(data_df: DataFrame, lead_snp: str, snp_col: str, p_col: str, pos_col: str, chr_col: str, pval_threshold: float = 5e-08, radius: float | int = 10000000.0) → DataFrame

Filter GWAS summary statistics based on a lead SNP, p-value threshold and genomic region.

This function filters a DataFrame containing GWAS summary statistics to return variants within a specified genomic region around a lead SNP that meet a p-value significance threshold.

Parameters:
  • data_df (pandas.DataFrame) – DataFrame containing GWAS summary statistics

  • lead_snp (str) – Identifier of the lead SNP to center the region on

  • snp_col (str) – Name of column containing SNP identifiers

  • p_col (str) – Name of column containing p-values

  • pos_col (str) – Name of column containing genomic positions

  • chr_col (str) – Name of column containing chromosome numbers/identifiers

  • pval_threshold (float, optional) – P-value significance threshold for filtering variants (default: 5e-8)

  • radius (Union[float, int], optional) – Size of region to include around lead SNP in base pairs (default: 1e7, i.e. 10 Mb)

Returns:

Filtered DataFrame containing only variants that:

  • Are on the same chromosome as lead SNP

  • Meet p-value threshold

  • Fall within specified region around lead SNP

Also includes calculated -log10(p-value) column

Return type:

pandas.DataFrame

Raises:
  • TypeError – If input parameters are not of the expected types

  • ValueError – If specified columns are not found in the DataFrame, or if the lead SNP is not found in the DataFrame

Notes

The function adds a ‘log10p’ column containing -log10 transformed p-values to the filtered DataFrame before returning it.
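The three filtering criteria and the added ‘log10p’ column can be illustrated with plain pandas. This is a sketch under stated assumptions rather than the module's implementation: the name filter_region_sketch is hypothetical, and treating the p-value threshold and the region boundaries as inclusive is an assumption.

```python
import numpy as np
import pandas as pd


def filter_region_sketch(df, lead_snp, snp_col, p_col, pos_col, chr_col,
                         pval_threshold=5e-8, radius=10_000_000):
    """Keep variants near the lead SNP that pass the p-value threshold."""
    lead_rows = df.loc[df[snp_col] == lead_snp]
    if lead_rows.empty:
        raise ValueError(f"Lead SNP {lead_snp!r} not found in the DataFrame")
    lead = lead_rows.iloc[0]

    mask = (
        (df[chr_col] == lead[chr_col])                       # same chromosome
        & (df[p_col] <= pval_threshold)                      # assumed inclusive
        & ((df[pos_col] - lead[pos_col]).abs() <= radius)    # assumed inclusive
    )
    out = df.loc[mask].copy()
    out["log10p"] = -np.log10(out[p_col])  # -log10 transformed p-values
    return out


# Toy data: rs3 is too far away and rs4 is on another chromosome.
toy = pd.DataFrame({
    "SNP": ["rs1", "rs2", "rs3", "rs4"],
    "CHR": [1, 1, 1, 2],
    "POS": [1_000_000, 1_500_000, 30_000_000, 1_000_000],
    "P": [1e-10, 1e-9, 1e-12, 1e-15],
})
region = filter_region_sketch(toy, "rs1", "SNP", "P", "POS", "CHR")
```

Here region contains only rs1 and rs2, each with its -log10(p) value in the ‘log10p’ column.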

ideal_genom.visualization.zoom_heatmap.get_gene_information(genes: list, gtf_path: str | None = None, build: str = '38', anno_source: str = 'ensembl') → DataFrame

Retrieves genomic information for a list of genes using Ensembl annotation.

This function fetches start position, end position, strand, and length information for each gene in the provided list using either Ensembl GRCh37 or GRCh38 annotations.

Parameters:
  • genes (list) – List of gene IDs (Ensembl format)

  • gtf_path (str, optional) – Path to a custom GTF file. If None, will download and use Ensembl GTF.

  • build (str, default "38") – Human genome build version. Supported values: “19”, “37”, “38”. Note that “19” and “37” are equivalent.

  • anno_source (str, default "ensembl") – Source of genome annotations. Currently only supports “ensembl”

Returns:

DataFrame containing gene information with columns:
  • gene: gene ID

  • start: gene start position

  • end: gene end position

  • strand: gene strand

  • length: gene length

Return type:

pandas.DataFrame

Raises:
  • ValueError – If unsupported build version or annotation source is provided

  • FileNotFoundError – If provided GTF file path does not exist

  • TypeError – If provided GTF path is not a string

Notes

When gtf_path is None, the function will automatically download and process the appropriate Ensembl GTF file based on the specified build version. The function uses the Ensembl Python API to fetch gene information.
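To make the shape of the returned table concrete, the sketch below derives the same five fields from the text of a GTF file. It is not how the module reads annotations: the function name genes_from_gtf_lines is hypothetical, the gene IDs in the sample are made up, and computing length as end - start + 1 (GTF coordinates are 1-based and inclusive) is an assumption about the length convention.

```python
import re

# GTF attributes look like: gene_id "ENSG..."; gene_name "..."; ...
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def genes_from_gtf_lines(lines, wanted):
    """Collect start, end, strand and length for the wanted gene IDs."""
    wanted = set(wanted)
    rows = []
    for line in lines:
        if line.startswith("#"):          # header / comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated fields; field 3 is the feature type.
        if len(fields) != 9 or fields[2] != "gene":
            continue
        match = GENE_ID_RE.search(fields[8])
        if match is None or match.group(1) not in wanted:
            continue
        start, end = int(fields[3]), int(fields[4])
        rows.append({
            "gene": match.group(1),
            "start": start,
            "end": end,
            "strand": fields[6],
            "length": end - start + 1,    # assumed convention
        })
    return rows


# Made-up records for illustration only.
sample_gtf = [
    "#!genome-build GRCh38\n",
    '13\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "ENSG_A"; gene_name "A";\n',
    '13\thavana\ttranscript\t100\t200\t.\t+\t.\tgene_id "ENSG_A";\n',
    '13\thavana\tgene\t500\t900\t.\t-\t.\tgene_id "ENSG_B";\n',
]
gene_rows = genes_from_gtf_lines(sample_gtf, ["ENSG_A"])
```

Passing gene_rows to pandas.DataFrame would give a table with the columns listed under Returns.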

ideal_genom.visualization.zoom_heatmap.get_ld_matrix(data_df: DataFrame, snp_col: str, pos_col: str, bfile_folder: str, bfile_name: str, output_path: str) → dict

Calculate LD matrix using PLINK for a set of SNPs.

This function takes a DataFrame containing SNP information and calculates the linkage disequilibrium (LD) matrix using PLINK. The SNPs are first sorted by position, and then PLINK is used to compute pairwise r² values between SNPs.

Parameters:
  • data_df (pandas.DataFrame) – DataFrame containing SNP information

  • snp_col (str) – Name of the column containing SNP IDs

  • pos_col (str) – Name of the column containing SNP positions

  • bfile_folder (str) – Path to the folder containing PLINK binary files

  • bfile_name (str) – Base name of the PLINK binary files (without extensions)

  • output_path (str) – Path where output files will be saved

Returns:

Dictionary containing:
  • ‘pass’ (bool) – True if the process completed successfully

  • ‘step’ (str) – Name of the processing step (‘get_ld_matrix’)

  • ‘output’ (dict) – Dictionary with output file paths

Return type:

dict

Raises:
  • FileNotFoundError – If any required files or directories are not found

  • TypeError – If input parameters are not of the expected types

  • ValueError – If specified columns are not found in the DataFrame
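The sort-then-PLINK workflow described for get_ld_matrix can be sketched as follows. The exact PLINK invocation used by the module is not shown in this reference, so the command below is an assumption: it uses the PLINK 1.9 options --bfile, --extract, --r2 square and --out, and the helper names, the ‘ld_matrix’ output prefix and the ‘plink’ executable name are all hypothetical. The command is only built, not executed.

```python
from pathlib import Path


def ordered_snp_ids(snp_positions):
    """Return SNP IDs sorted by genomic position.

    snp_positions is an iterable of (snp_id, position) pairs.
    """
    return [snp for snp, _ in sorted(snp_positions, key=lambda pair: pair[1])]


def build_ld_command(bfile_folder, bfile_name, snp_list_path, output_path,
                     plink_exe="plink"):
    """Build (but do not run) a PLINK command for a square r² matrix."""
    bfile_prefix = Path(bfile_folder) / bfile_name   # prefix of .bed/.bim/.fam
    out_prefix = Path(output_path) / "ld_matrix"     # hypothetical prefix
    return [
        plink_exe,
        "--bfile", str(bfile_prefix),
        "--extract", str(snp_list_path),   # file with one SNP ID per line
        "--r2", "square",                  # full pairwise r² matrix
        "--out", str(out_prefix),
    ]


snp_order = ordered_snp_ids([("rs3", 300), ("rs1", 100), ("rs2", 200)])
ld_cmd = build_ld_command("data", "study", "snps.txt", "results")
```

A list such as ld_cmd could be passed to subprocess.run(ld_cmd, check=True) on a machine where PLINK is installed.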

ideal_genom.visualization.zoom_heatmap.get_zoomed_data(data_df: DataFrame, lead_snp: str, snp_col: str, p_col: str, pos_col: str, chr_col: str, output_folder: str, pval_threshold: float = 5e-06, radius: float | int = 1000000.0, build: str = '38', anno_source: str = 'ensembl', gtf_path: str | None = None, batch_size: int = 100, request_persec: int = 15) → DataFrame

Filter and annotate SNP data around a lead SNP within a specified radius.

This function filters significant SNPs in a region around a lead SNP and annotates them with gene names and functional consequences. The position values are scaled to Megabase pairs (Mbp).

Parameters:
  • data_df (pandas.DataFrame) – Input DataFrame containing SNP data

  • lead_snp (str) – Identifier of the lead SNP to center the region around

  • snp_col (str) – Name of the column containing SNP identifiers

  • p_col (str) – Name of the column containing p-values

  • pos_col (str) – Name of the column containing position information

  • chr_col (str) – Name of the column containing chromosome information

  • output_folder (str) – Path to the output folder (must exist)

  • pval_threshold (float, optional) – P-value threshold for significance filtering (default: 5e-6)

  • radius (Union[float, int], optional) – Radius around the lead SNP in base pairs (default: 1e6)

  • build (str, optional) – Genome build version (‘38’ or ‘37’) (default: ‘38’)

  • anno_source (str, optional) – Source for annotations (‘ensembl’ or other supported sources) (default: ‘ensembl’)

  • gtf_path (str, optional) – Path to GTF file for annotations (default: None)

  • batch_size (int, optional) – Number of SNPs to process in each batch (default: 100)

  • request_persec (int, optional) – Number of API requests per second allowed (default: 15)

Returns:

Filtered and annotated DataFrame with added Mbp column and removed duplicates

Return type:

pandas.DataFrame

Raises:
  • TypeError – If input parameters are not of the expected types

  • FileNotFoundError – If output_folder does not exist

  • ValueError – If no significant SNPs are found in the specified region

Notes

The function removes duplicate SNPs, keeping the first occurrence only. Position values are converted to Megabase pairs in the output DataFrame.
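The two post-processing steps in these notes can be illustrated with pandas. This is a minimal sketch, assuming that duplicates are identified by the SNP identifier column and that the scaled positions are stored in a column named ‘Mbp’; the helper name add_mbp_and_dedupe is hypothetical.

```python
import pandas as pd


def add_mbp_and_dedupe(df, snp_col, pos_col):
    """Drop repeated SNPs (keeping the first) and add positions in Mbp."""
    out = df.drop_duplicates(subset=snp_col, keep="first").copy()
    out["Mbp"] = out[pos_col] / 1e6   # base pairs -> megabase pairs
    return out


# Toy data: rs1 appears twice, only its first row should survive.
annotated = pd.DataFrame({
    "SNP": ["rs1", "rs1", "rs2"],
    "POS": [1_500_000, 1_500_000, 2_250_000],
    "P": [1e-8, 2e-8, 3e-7],
})
zoomed = add_mbp_and_dedupe(annotated, "SNP", "POS")
```

The input frame is left unchanged because the helper works on a copy.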

ideal_genom.visualization.zoom_heatmap.snp_annotations(data_df: DataFrame, snp_col: str, pos_col: str, chr_col: str, build: str = '38', anno_source: str = 'ensembl', gtf_path: str | None = None, batch_size: int = 100, request_persec: int = 15) → DataFrame

Annotate SNPs with gene names and functional consequences using Ensembl databases.

This function takes a DataFrame containing SNP information and adds gene name annotations and functional consequence annotations using Ensembl VEP (Variant Effect Predictor) API.

Parameters:
  • data_df (pandas.DataFrame) – Input DataFrame containing SNP information

  • snp_col (str) – Name of column containing SNP IDs

  • pos_col (str) – Name of column containing genomic positions

  • chr_col (str) – Name of column containing chromosome numbers

  • build (str, optional) – Genome build version (‘38’, ‘37’, or ‘19’), by default ‘38’

  • anno_source (str, optional) – Source for annotations (‘ensembl’), by default ‘ensembl’

  • gtf_path (str, optional) – Path to GTF file for annotations, by default None

  • batch_size (int, optional) – Number of SNPs to process in each API request batch, by default 100

  • request_persec (int, optional) – Maximum number of API requests per second, by default 15

Returns:

Input DataFrame augmented with gene name and functional consequence annotations. Added columns:

  • GENENAME: Gene name from Ensembl

  • Functional_Consequence: Most severe consequence from VEP

Return type:

pandas.DataFrame

Raises:
  • ValueError – If the specified genome build version is not supported, if the annotation source is not supported, or if the specified columns are not found in the DataFrame

  • TypeError – If input parameters are not of the expected types

Notes

Supports genome builds 19/37 and 38 using different Ensembl REST API endpoints. Implements rate limiting and batch processing for API requests.
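How batch_size and request_persec might interact can be shown without touching the network. The sketch below is not the module's code: batched, RateLimiter and annotate_in_batches are hypothetical names, the fetch callable stands in for whatever performs the real API request, and spacing requests by a fixed minimum interval is only one possible rate-limiting strategy.

```python
import time


def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


class RateLimiter:
    """Space calls so that at most request_persec happen per second."""

    def __init__(self, request_persec, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / request_persec
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._last = None        # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def annotate_in_batches(snp_ids, fetch, batch_size=100, request_persec=15,
                        limiter=None):
    """Call fetch(batch) once per batch, pausing between calls as needed.

    fetch is a placeholder that returns a dict mapping SNP ID to annotation.
    """
    limiter = limiter or RateLimiter(request_persec)
    results = {}
    for batch in batched(snp_ids, batch_size):
        limiter.wait()
        results.update(fetch(batch))
    return results
```

With the defaults documented above, 250 SNPs would be sent as three requests of 100, 100 and 50 identifiers, with consecutive requests at least 1/15 of a second apart.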