Manhattan-type Plots Module
Generate Manhattan and Miami plots for genome-wide association studies (GWAS).
This module provides functions to create Manhattan and Miami plots to visualize the significance of genetic variants across the genome.
Features:
- Data processing and visualization of GWAS summary statistics.
- Annotation of SNPs with gene information from various sources.
- Highlighting and labeling of specific SNPs of interest.
- ideal_genom.visualization.manhattan_type.compute_relative_pos(data: DataFrame, chr_col: str = 'CHR', pos_col: str = 'POS', p_col: str = 'p') -> DataFrame
Compute the relative position of probes/SNPs across chromosomes and add a -log10(p-value) column.
- Parameters:
data (pandas.DataFrame) – Input DataFrame containing genomic data.
chr_col (str) – Column name for chromosome identifiers. Default is ‘CHR’.
pos_col (str) – Column name for base pair positions. Default is ‘POS’.
p_col (str) – Column name for p-values. Default is ‘p’.
- Returns:
DataFrame with additional columns for relative positions and -log10(p-values).
- Return type:
pandas.DataFrame
- Raises:
TypeError – If input data is not a pandas DataFrame.
ValueError – If chr_col, pos_col or p_col columns are not found in the DataFrame.
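The transformation described above can be sketched with plain pandas. This is a minimal illustration, not the module's actual implementation: the helper name `relative_positions_sketch` is hypothetical, and the layout rule (each chromosome starts where the previous one ends) is an assumption. The output column names 'rel_pos' and 'log10p' follow the names used elsewhere in this reference.

```python
import numpy as np
import pandas as pd


def relative_positions_sketch(data, chr_col="CHR", pos_col="POS", p_col="p"):
    """Illustrative only: add 'rel_pos' and 'log10p' columns to a copy of data."""
    if not isinstance(data, pd.DataFrame):
        raise TypeError("data must be a pandas DataFrame")
    missing = [c for c in (chr_col, pos_col, p_col) if c not in data.columns]
    if missing:
        raise ValueError(f"columns not found: {missing}")

    out = data.sort_values([chr_col, pos_col]).reset_index(drop=True)
    # Assumed layout: each chromosome starts where the previous one ended.
    chr_max = out.groupby(chr_col)[pos_col].max()
    offsets = chr_max.cumsum().shift(1, fill_value=0)
    out["rel_pos"] = out[pos_col] + out[chr_col].map(offsets)
    # Small p-values become large positive values on the y-axis.
    out["log10p"] = -np.log10(out[p_col])
    return out


gwas = pd.DataFrame({
    "CHR": [1, 1, 2, 2],
    "POS": [100, 250, 50, 300],
    "p": [1e-2, 1e-8, 1e-5, 0.5],
})
result = relative_positions_sketch(gwas)
print(result[["CHR", "POS", "rel_pos", "log10p"]])
```

In this toy input, chromosome 2 is shifted by 250 (the largest position on chromosome 1), so its SNPs land to the right of chromosome 1 on a shared x-axis.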
- ideal_genom.visualization.manhattan_type.find_chromosomes_center(data: DataFrame, chr_col: str = 'CHR', chr_pos_col: str = 'rel_pos') -> DataFrame
Calculate the center positions of chromosomes in a given DataFrame.
This function takes a DataFrame containing chromosome data and calculates the center position for each chromosome based on the specified chromosome column and chromosome position column.
- Parameters:
data (pandas.DataFrame) – The input DataFrame containing chromosome data.
chr_col (str, optional) – The name of the column representing chromosome identifiers (default is ‘CHR’).
chr_pos_col (str, optional) – The name of the column representing relative positions within chromosomes (default is ‘rel_pos’).
- Returns:
A DataFrame with columns ‘CHR’ and ‘center’, where ‘CHR’ contains chromosome identifiers and ‘center’ contains the calculated center positions for each chromosome.
- Return type:
pandas.DataFrame
- Raises:
TypeError – If the input data is not a pandas DataFrame.
ValueError – If the specified chromosome or chromosome position columns are not found in the DataFrame.
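One plausible way to obtain such centers is shown below. The sketch assumes that "center" means the midpoint between the smallest and largest relative position on each chromosome; the module may define it differently, and `chromosome_centers_sketch` is a hypothetical name.

```python
import pandas as pd


def chromosome_centers_sketch(data, chr_col="CHR", chr_pos_col="rel_pos"):
    """Illustrative only: one row per chromosome with a 'center' column."""
    if not isinstance(data, pd.DataFrame):
        raise TypeError("data must be a pandas DataFrame")
    for col in (chr_col, chr_pos_col):
        if col not in data.columns:
            raise ValueError(f"column '{col}' not found")

    grouped = data.groupby(chr_col)[chr_pos_col]
    # Assumed definition of "center": midpoint of the chromosome's extent.
    centers = (grouped.min() + grouped.max()) / 2
    return centers.rename("center").reset_index().rename(columns={chr_col: "CHR"})


positions = pd.DataFrame({"CHR": [1, 1, 2, 2], "rel_pos": [100, 250, 300, 550]})
centers = chromosome_centers_sketch(positions)
print(centers)
```

Values like these are commonly used as x-tick positions so that each chromosome label sits under the middle of its block of points.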
- ideal_genom.visualization.manhattan_type.manhattan_draw(data_df: DataFrame, snp_col: str, chr_col: str, pos_col: str, p_col: str, plot_dir: str, to_highlight: DataFrame = DataFrame(), highlight_hue: str = 'hue', to_annotate: DataFrame = DataFrame(), gen_col: str | None = None, build: str = '38', anno_source='ensembl', gtf_path: str | None = None, save_name: str = 'manhattan_plot.png', upper_cap: float | None = None, genome_line: float = 5e-08, suggestive_line: float = 1e-05, yaxis_margin: float = 10, dpi: int = 400) -> bool
Generate a Manhattan plot for genomic data.
This function creates a Manhattan plot for visualizing the statistical significance of genetic variants across the genome, typically used in Genome-Wide Association Studies (GWAS). The plot shows -log10(p-values) against genomic positions, organized by chromosome.
- Parameters:
data_df (pandas.DataFrame) – DataFrame containing the SNP data to be plotted.
snp_col (str) – Column name in data_df that contains the SNP identifiers.
chr_col (str) – Column name in data_df that contains the chromosome information.
pos_col (str) – Column name in data_df that contains the position information.
p_col (str) – Column name in data_df that contains the p-values.
plot_dir (str) – Directory path where the plot will be saved.
to_highlight (pandas.DataFrame, optional) – DataFrame containing SNPs to be highlighted in the plot.
highlight_hue (str, default='hue') – Column name in to_highlight to be used for color-coding highlighted SNPs.
to_annotate (pandas.DataFrame, optional) – DataFrame containing SNPs to be annotated in the plot.
gen_col (str, optional) – Column name for gene information to be used in annotation.
build (str, default='38') – Genome build version (‘38’ for GRCh38, etc.).
anno_source (str, default='ensembl') – Source for gene annotation information.
gtf_path (str, optional) – Path to a GTF file for gene annotation if not using online sources.
save_name (str, default='manhattan_plot.png') – Filename for the saved plot.
upper_cap (float, optional) – Upper limit at which -log10(p-values) are capped.
genome_line (float, default=5e-8) – P-value threshold for genome-wide significance line.
suggestive_line (float, default=1e-5) – P-value threshold for suggestive significance line.
yaxis_margin (float, default=10) – Additional margin to add to the y-axis maximum.
dpi (int, default=400) – Resolution of the saved image in dots per inch.
- Returns:
True if the plot was successfully created and saved.
- Return type:
bool
- Raises:
TypeError – If input data is not a pandas DataFrame or to_highlight/to_annotate are not of correct type.
ValueError – If required columns are not found in the input DataFrame.
FileNotFoundError – If the specified plot directory does not exist.
Notes
Saves a Manhattan plot image to the specified directory.
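Because the plot shows -log10(p-values), the two p-value thresholds appear as horizontal lines at their -log10 heights. The short calculation below works this out for the documented defaults. The capping step is only an assumed reading of `upper_cap` (values above the cap are clipped to it), and the sample values are made up for illustration.

```python
import math

genome_line = 5e-8      # documented default for genome-wide significance
suggestive_line = 1e-5  # documented default for suggestive significance

# Heights of the threshold lines on the -log10(p) axis.
genome_y = -math.log10(genome_line)
suggestive_y = -math.log10(suggestive_line)
print(round(genome_y, 2), round(suggestive_y, 2))

# Assumed behavior of upper_cap: clip very large -log10(p) values
# so a few extreme hits do not compress the rest of the plot.
upper_cap = 20.0
log10p_values = [3.2, 7.9, 45.0]  # made-up example values
capped = [min(value, upper_cap) for value in log10p_values]
print(capped)
```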
- ideal_genom.visualization.manhattan_type.manhattan_process_data(data_df: DataFrame, chr_col: str = 'CHR', pos_col: str = 'POS', p_col: str = 'p') -> dict
Processes the input DataFrame to prepare data for a Manhattan plot.
The function computes the relative positions of SNPs across chromosomes, calculates the -log10(p-values), and finds the center positions of each chromosome.
- Parameters:
data_df (pandas.DataFrame) – The input DataFrame containing genomic data.
chr_col (str, optional) – The column name for chromosome data. Defaults to ‘CHR’.
pos_col (str, optional) – The column name for position data. Defaults to ‘POS’.
p_col (str, optional) – The column name for p-value data. Defaults to ‘p’.
- Raises:
TypeError – If data_df is not a pandas DataFrame, or if chr_col, pos_col, or p_col are not strings.
ValueError – If the specified columns (chr_col, pos_col, p_col) are not found in the DataFrame.
- Returns:
A dictionary containing processed data for the Manhattan plot with the following keys:
- data (pandas.DataFrame)
The processed DataFrame with relative positions and log-transformed p-values.
- axis (dict)
The center positions of each chromosome for plotting.
- maxp (float)
The maximum log-transformed p-value.
- Return type:
dict
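The error cases listed above amount to a type check followed by a column check. A minimal sketch of that pattern, using a hypothetical helper name `validate_gwas_input` (not part of the module), might look like this:

```python
import pandas as pd


def validate_gwas_input(data_df, chr_col="CHR", pos_col="POS", p_col="p"):
    """Hypothetical helper mirroring the documented TypeError/ValueError cases."""
    if not isinstance(data_df, pd.DataFrame):
        raise TypeError("data_df must be a pandas DataFrame")
    for name, value in (("chr_col", chr_col), ("pos_col", pos_col), ("p_col", p_col)):
        if not isinstance(value, str):
            raise TypeError(f"{name} must be a string")
    missing = [c for c in (chr_col, pos_col, p_col) if c not in data_df.columns]
    if missing:
        raise ValueError(f"columns not found in data_df: {missing}")


df = pd.DataFrame({"CHR": [1], "POS": [100], "p": [0.05]})
validate_gwas_input(df)  # passes silently

error_message = ""
try:
    # Column names are case-sensitive: 'P' is not the same as 'p'.
    validate_gwas_input(df, p_col="P")
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

Summary statistics files often use different header names (for example 'P' or 'BP'), so passing the column names explicitly is usually safer than relying on the defaults.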
- ideal_genom.visualization.manhattan_type.manhattan_type_annotate(axes: Axes, data: DataFrame, variants_toanno: DataFrame, max_x_axis: float, suggestive_line: float, genome_line: float) -> Axes
Annotates a Manhattan plot with gene names.
This function uses the textalloc library to place gene names on a Manhattan plot.
- Parameters:
axes (Axes (matplotlib.axes.Axes)) – The matplotlib axes object where the Manhattan plot is drawn.
data (pandas.DataFrame) – DataFrame containing the scatter plot data with columns ‘rel_pos’ and ‘log10p’.
variants_toanno (pandas.DataFrame) – DataFrame containing the variants to annotate with columns ‘rel_pos’, ‘log10p’, and ‘GENENAME’.
max_x_axis (float) – The maximum value for the x-axis.
suggestive_line (float) – The y-value for the suggestive significance line.
genome_line (float) – The y-value for the genome-wide significance line.
- Returns:
The matplotlib axes object with annotations.
- Return type:
matplotlib.axes.Axes
- Raises:
TypeError – If the input parameters are not of the expected types.
- ideal_genom.visualization.manhattan_type.miami_draw(df_top: DataFrame, df_bottom: DataFrame, snp_col: str, chr_col: str, pos_col: str, p_col: str, plots_dir: str, top_highlights: list = [], top_annotations: DataFrame = DataFrame(), bottom_highlights: list = [], bottom_annotations: DataFrame = DataFrame(), top_gen_col: str | None = None, bottom_gen_col: str | None = None, gtf_path: str | None = None, source: str = 'ensemble', build: str = '38', save_name: str = 'miami_plot.jpeg', legend_top: str = 'top GWAS', legend_bottom: str = 'bottom GWAS', dpi: int = 400) -> bool
Draws a Miami plot (a combination of two Manhattan plots) for visualizing GWAS results.
- Parameters:
df_top (pandas.DataFrame) – DataFrame containing the top plot data.
df_bottom (pandas.DataFrame) – DataFrame containing the bottom plot data.
snp_col (str) – Column name for SNP identifiers.
chr_col (str) – Column name for chromosome identifiers.
pos_col (str) – Column name for base pair positions.
p_col (str) – Column name for p-values.
plots_dir (str) – Directory where the plot will be saved.
top_highlights (list, optional) – List of SNPs to highlight in the top plot.
top_annotations (pandas.DataFrame, optional) – DataFrame containing SNPs to annotate in the top plot.
bottom_highlights (list, optional) – List of SNPs to highlight in the bottom plot.
bottom_annotations (pandas.DataFrame, optional) – DataFrame containing SNPs to annotate in the bottom plot.
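top_gen_col (str, optional) – Column name for gene information to be used in annotating the top plot.
bottom_gen_col (str, optional) – Column name for gene information to be used in annotating the bottom plot.
gtf_path (str, optional) – Path to the GTF file for gene annotation. If None, the file will be downloaded.
source (str, optional) – Source for gene annotation information. Default is ‘ensemble’.
build (str, optional) – Genome build version. Default is ‘38’.
save_name (str, optional) – Name of the file to save the plot as. Default is ‘miami_plot.jpeg’.
legend_top (str, optional) – Legend label for the top plot. Default is ‘top GWAS’.
legend_bottom (str, optional) – Legend label for the bottom plot. Default is ‘bottom GWAS’.
dpi (int, optional) – Resolution of the saved image in dots per inch. Default is 400.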
- Returns:
True if the plot is successfully created and saved, False otherwise.
- Return type:
bool
- Raises:
TypeError – If df_top, df_bottom, top_annotations, or bottom_annotations are not pandas DataFrames.
TypeError – If top_highlights or bottom_highlights are not lists of SNP identifiers.
TypeError – If save_name, legend_top, or legend_bottom are not strings.
ValueError – If required columns (chr_col, pos_col, p_col) are not found in the DataFrames.
Notes
Saves a Miami plot image to the specified directory.
- ideal_genom.visualization.manhattan_type.miami_draw_anno_lines(renderer: RendererBase, axes: Axes, texts: list, variants_toanno: DataFrame)
Draws annotation lines from text labels to their corresponding data points on a plot.
- Parameters:
renderer (RendererBase) – The renderer used to draw the plot.
axes (Axes (matplotlib.axes.Axes)) – The axes on which the plot is drawn.
texts (list) – A list of text objects to annotate.
variants_toanno (pandas.DataFrame) – A DataFrame containing the data points to annotate, with columns ‘GENENAME’, ‘rel_pos’, and ‘log10p’.
- Returns:
The axes with the annotation lines drawn.
- Return type:
matplotlib.axes.Axes
- ideal_genom.visualization.manhattan_type.miami_process_data(data_top: DataFrame, data_bottom: DataFrame, chr_col: str, pos_col: str, p_col: str) -> dict
Processes Miami plot data by preparing, computing relative positions, and splitting the data.
For each part of the Miami plot (top and bottom), this function computes the relative positions of SNPs, calculates the -log10(p-values), and finds the center positions of chromosomes.
- Parameters:
data_top (pandas.DataFrame) – The top part of the data to be processed.
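data_bottom (pandas.DataFrame) – The bottom part of the data to be processed.
chr_col (str) – Column name for chromosome identifiers.
pos_col (str) – Column name for base pair positions.
p_col (str) – Column name for p-values.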
- Returns:
A dictionary containing the processed data with the following keys:
- ‘upper’ (pandas.DataFrame)
DataFrame containing the top part of the processed data.
- ‘lower’ (pandas.DataFrame)
DataFrame containing the bottom part of the processed data.
- ‘axis’ (pandas.DataFrame)
DataFrame with the center positions of the chromosomes.
- ‘maxp’ (float)
The maximum -log10(p-value) in the data.
The maximum -log10(p-value) in the data.
- Return type:
dict
- Raises:
TypeError – If data_top or data_bottom is not a pandas DataFrame.
ValueError – If the specified columns (chr_col, pos_col, p_col) are not found in the DataFrames.
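The combine, compute, and split flow described for this function can be sketched as follows. This is an illustration under stated assumptions rather than the module's code: the helper name `miami_process_sketch` and the temporary 'split_by' column are hypothetical, offsets are assumed to come from the combined data so that both panels share one x-axis, and centers are assumed to be midpoints of each chromosome's extent.

```python
import numpy as np
import pandas as pd


def miami_process_sketch(data_top, data_bottom, chr_col, pos_col, p_col):
    """Illustrative only: shared x-coordinates for both Miami panels."""
    # Tag each row with its panel so the data can be split again later.
    top = data_top.assign(split_by="top")
    bottom = data_bottom.assign(split_by="bottom")
    combined = pd.concat([top, bottom], ignore_index=True)

    # Assumed layout: offsets come from the combined data so both panels align.
    chr_max = combined.groupby(chr_col)[pos_col].max()
    offsets = chr_max.cumsum().shift(1, fill_value=0)
    combined["rel_pos"] = combined[pos_col] + combined[chr_col].map(offsets)
    combined["log10p"] = -np.log10(combined[p_col])

    # Assumed definition of "center": midpoint of each chromosome's extent.
    grouped = combined.groupby(chr_col)["rel_pos"]
    axis = ((grouped.min() + grouped.max()) / 2).rename("center").reset_index()

    return {
        "upper": combined[combined["split_by"] == "top"].drop(columns="split_by"),
        "lower": combined[combined["split_by"] == "bottom"].drop(columns="split_by"),
        "axis": axis,
        "maxp": float(combined["log10p"].max()),
    }


top = pd.DataFrame({"CHR": [1, 2], "POS": [100, 50], "p": [1e-3, 1e-6]})
bottom = pd.DataFrame({"CHR": [1, 2], "POS": [200, 80], "p": [1e-9, 0.1]})
processed = miami_process_sketch(top, bottom, "CHR", "POS", "p")
print(processed["upper"])
print(processed["lower"])
```

Computing the offsets on the concatenated data means a SNP at the same chromosome and position gets the same 'rel_pos' in both panels, which is what lets the two Manhattan plots be mirrored around a common axis.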