GWAS with Generalized Linear Models (GLM)
This module provides a class for performing Genome-Wide Association Studies (GWAS) using a Generalized Linear Model (GLM) with PLINK2.
It includes methods for association analysis, obtaining top hits, and annotating SNPs with gene information.
- class ideal_genom.gwas.gwas_glm.GWASfixed
Bases: object
Class for performing Genome-Wide Association Studies (GWAS) using a Generalized Linear Model (GLM) with PLINK2.
This class provides methods to perform association analysis, obtain top hits, and annotate SNPs with gene information.
- input_path
Path to the input directory.
- Type:
str
- output_path
Path to the output directory.
- Type:
str
- input_name
Base name of the input PLINK files.
- Type:
str
- output_name
Base name for the output files.
- Type:
str
- recompute
Flag indicating whether to recompute the analysis.
- Type:
bool
- results_dir
Directory where the results will be saved.
- Type:
str
- Raises:
ValueError – If input_path, output_path, input_name, or output_name are not provided.
FileNotFoundError – If the specified input_path or output_path does not exist.
FileNotFoundError – If the required PLINK files (.bed, .bim, .fam) are not found in the input_path.
TypeError – If input_name or output_name are not strings, or if recompute is not a boolean.
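The checks listed under Raises can be illustrated with a minimal sketch. The helper name `check_plink_inputs` is hypothetical and is not part of ideal_genom; it only mirrors the documented behavior for the input side (required arguments, string type, directory existence, and presence of the .bed, .bim, and .fam files). The order of the checks is an assumption.

```python
from pathlib import Path


def check_plink_inputs(input_path: str, input_name: str) -> list:
    """Hypothetical helper mirroring the documented constructor checks."""
    if not input_path or not input_name:
        raise ValueError("input_path and input_name must be provided")
    if not isinstance(input_name, str):
        raise TypeError("input_name must be a string")
    base = Path(input_path)
    if not base.is_dir():
        raise FileNotFoundError(f"input_path does not exist: {base}")
    # A PLINK binary fileset is three files sharing one base name.
    files = [base / f"{input_name}{ext}" for ext in (".bed", ".bim", ".fam")]
    missing = [f.name for f in files if not f.is_file()]
    if missing:
        raise FileNotFoundError(f"missing PLINK files: {', '.join(missing)}")
    return files
```

Running such a check before constructing the class reports every missing file in a single error message instead of failing on the first one.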
- __init__(input_path: str, input_name: str, output_path: str, output_name: str, recompute: bool = True) None
- annotate_top_hits(gtf_path: str | None = None, build: str = '38', anno_source: str = 'ensembl') dict
Annotate top SNP hits from COJO analysis with gene information.
This method reads the COJO joint analysis results, extracts the top SNPs, and annotates them with gene information using the specified genome build and annotation source. The annotated results are saved to a TSV file.
- Parameters:
gtf_path (Optional[str], default=None) – Path to the GTF (Gene Transfer Format) file for custom annotation. If None, the annotation will use default resources.
build (str, default='38') – Genome build version to use for annotation (‘38’ for GRCh38, etc.).
anno_source (str, default="ensembl") – Source of annotations to use (e.g., “ensembl”, “refseq”).
- Returns:
A dictionary containing:
- ‘pass’: Boolean indicating whether the process completed successfully
- ‘step’: The name of the step (‘annotate_hits’)
- ‘output’: Dictionary with output file paths
- Return type:
dict
- Raises:
FileExistsError – If the COJO results file is not found in the results directory.
Notes
The annotated results are saved to ‘top_hits_annotated.tsv’ in the results directory.
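How a SNP is mapped to a gene is not described here. As a rough illustration only, the sketch below assigns each SNP the gene whose interval is closest on the same chromosome. The function `nearest_gene`, the tuple layout, and the toy coordinates are all hypothetical; the actual method may rely on a GTF parser and different matching rules.

```python
def nearest_gene(chrom, pos, genes):
    """Return (gene_name, distance) for the closest gene on `chrom`.

    `genes` is a list of (chrom, start, end, name) tuples, a hypothetical
    stand-in for records parsed from a GTF file. The distance is 0 when
    the SNP lies inside the gene interval. Returns None when no gene is
    listed for the chromosome.
    """
    best = None
    for g_chrom, start, end, name in genes:
        if g_chrom != chrom:
            continue
        if start <= pos <= end:
            dist = 0
        else:
            dist = min(abs(pos - start), abs(pos - end))
        if best is None or dist < best[1]:
            best = (name, dist)
    return best


# Toy coordinates, invented for illustration only.
genes = [
    ("1", 100, 200, "GENE_A"),
    ("1", 500, 900, "GENE_B"),
    ("2", 50, 80, "GENE_C"),
]
print(nearest_gene("1", 150, genes))  # SNP inside GENE_A
print(nearest_gene("1", 400, genes))  # SNP between GENE_A and GENE_B
```

A linear scan is enough for a handful of top hits; annotating many variants would usually call for sorted intervals or an interval tree.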
- fixed_model_association_analysis(maf: float = 0.01, mind: float = 0.1, hwe: float = 5e-06, ci: float = 0.95) dict
Perform fixed model association analysis using PLINK2.
This method performs a fixed model association analysis on genomic data using PLINK2. It checks the validity of the input parameters, ensures necessary files exist, and executes the PLINK2 command to perform the analysis.
- Parameters:
maf (float) – Minor allele frequency threshold. Must be between 0 and 0.5.
mind (float) – Individual missingness threshold. Must be between 0 and 1.
hwe (float) – Hardy-Weinberg equilibrium threshold. Must be between 0 and 1.
ci (float) – Confidence level for the reported confidence intervals (e.g., 0.95 for 95% intervals). Must be between 0 and 1.
- Returns:
A dictionary containing the status of the process, the step name, and the output directory.
- Return type:
dict
- Raises:
TypeError – If any of the input parameters are not of type float.
ValueError – If any of the input parameters are out of their respective valid ranges.
FileNotFoundError – If the required PCA file is not found.
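The validation and command construction described for this method can be sketched as follows. This is an assumption-laden illustration, not the library's code: the helper `build_glm_command`, its argument names, the inclusive range bounds, and the choice of the `hide-covar` modifier are hypothetical. Only the PLINK2 flags themselves (`--bfile`, `--maf`, `--mind`, `--hwe`, `--ci`, `--covar`, `--glm`, `--threads`, `--out`) are standard.

```python
def build_glm_command(bfile, covar_file, out_prefix,
                      maf=0.01, mind=0.1, hwe=5e-6, ci=0.95, threads=1):
    """Hypothetical helper: validate thresholds, then assemble PLINK2 arguments."""
    # Each entry maps a parameter name to (value, upper bound).
    limits = {
        "maf": (maf, 0.5),
        "mind": (mind, 1.0),
        "hwe": (hwe, 1.0),
        "ci": (ci, 1.0),
    }
    for name, (value, upper) in limits.items():
        if not isinstance(value, float):
            raise TypeError(f"{name} must be a float")
        if not 0.0 <= value <= upper:
            raise ValueError(f"{name} must be between 0 and {upper}")
    # Return an argument list; nothing is executed here.
    return [
        "plink2",
        "--bfile", bfile,
        "--maf", str(maf),
        "--mind", str(mind),
        "--hwe", str(hwe),
        "--ci", str(ci),
        "--covar", covar_file,  # e.g. principal components used as covariates
        "--glm", "hide-covar",
        "--threads", str(threads),
        "--out", out_prefix,
    ]
```

Building the command as a list keeps it safe to hand to `subprocess.run(cmd, check=True)` without shell quoting, provided `plink2` is available on the PATH.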
- get_top_hits(maf: float = 0.01) dict
Get the top hits from the GWAS results.
- Parameters:
maf (float) – Minor allele frequency threshold. Must be a float between 0 and 0.5.
- Returns:
A dictionary containing the process status, step name, and output directory.
- Return type:
dict
- Raises:
TypeError – If maf is not of type float.
ValueError – If maf is not between 0 and 0.5.
Notes
The function performs the following steps:
1. Validates the type and range of the maf parameter.
2. Computes the number of threads to use based on the available CPU cores.
3. Loads the results of the association analysis and renames columns according to GCTA requirements.
4. Prepares a .ma file with the necessary columns.
5. If recompute is True, constructs and executes a GCTA command to perform conditional and joint analysis.
6. Returns a dictionary with the process status, step name, and output directory.
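The .ma preparation and the GCTA call can be sketched as below. The GCTA-COJO summary format expects the columns SNP, A1, A2, freq, b, se, p, and N. Everything else is an assumption made for illustration: the helper names `to_ma_lines` and `build_cojo_command`, the `gcta64` executable name, the rule of leaving two cores free, and the use of stepwise selection (`--cojo-slct`) are not confirmed by this page.

```python
import os


def to_ma_lines(rows):
    """Format association rows as GCTA-COJO summary lines.

    Each row is a dict keyed by the GCTA column names; the renaming from
    the PLINK2 output columns is assumed to have happened already.
    """
    columns = ["SNP", "A1", "A2", "freq", "b", "se", "p", "N"]
    lines = [" ".join(columns)]
    for row in rows:
        lines.append(" ".join(str(row[c]) for c in columns))
    return lines


def build_cojo_command(bfile, ma_file, out_prefix, maf=0.01):
    """Hypothetical helper assembling a GCTA-COJO stepwise selection call."""
    # Leave two cores free when possible; this policy is an assumption.
    threads = max(1, (os.cpu_count() or 1) - 2)
    return [
        "gcta64",
        "--bfile", bfile,
        "--maf", str(maf),
        "--cojo-file", ma_file,
        "--cojo-slct",
        "--thread-num", str(threads),
        "--out", out_prefix,
    ]
```

Writing the lines with `"\n".join(to_ma_lines(rows))` produces a whitespace-delimited file with a header row, which is the layout GCTA reads for COJO input.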