Preparatory Module
Class designed to run preparatory steps before conducting Genome-Wide Association Studies (GWAS).
This class prunes variants in high linkage disequilibrium (LD) regions and performs Principal Component Analysis (PCA) on the pruned data. Both operations run through PLINK, after the class has checked that the input data is in the expected format and that the required files are present. If no high LD regions file is provided, the class fetches one using the FetcherLDRegions class from the ideal_genom package. Thresholds such as missing rate and minor allele frequency, and the number of principal components to compute, are configurable, as are the memory and thread count used when running PLINK.
- class ideal_genom.preprocessing.preparatory.Preparatory
Bases: object
A class for preprocessing genomic data in preparation for analysis.
This class handles the preparatory steps needed for genomic data analysis, including input validation, LD (Linkage Disequilibrium) pruning, and PCA (Principal Component Analysis) decomposition.
- input_path
Path to the directory containing input PLINK files (.bed, .bim, .fam)
- Type:
str or Path
- input_name
Base name of the input PLINK files (without extension)
- Type:
str
- output_path
Path to the directory where output files will be saved
- Type:
str or Path
- output_name
Base name for the output files
- Type:
str
- high_ld_file
Path to the high LD regions file. If not found, will be fetched automatically
- Type:
str or Path
- build
Genome build version, either ‘38’ or ‘37’
- Type:
str, default=’38’
- Raises:
ValueError – If input_path or output_path is None, or if input_name or output_name is None
TypeError – If input_path or output_path is not of type str or Path, or if input_name or output_name is not of type str, or if build is not of type str
FileNotFoundError – If the specified input_path or output_path does not exist, or if the required PLINK files (.bed, .bim, .fam) are not found, or if the high LD file is not found and cannot be fetched.
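The input checks listed above can be illustrated with a minimal sketch. The helper name check_plink_inputs is hypothetical and is not part of ideal_genom; it only mirrors the documented exception types for the input directory and the three PLINK files, and the library's actual checks may differ in order and wording.

```python
from pathlib import Path


# Hypothetical helper, not part of ideal_genom: mirrors the documented checks.
def check_plink_inputs(input_path, input_name):
    """Return the fileset prefix, or raise the documented exception types."""
    if input_path is None or input_name is None:
        raise ValueError("input_path and input_name must not be None")
    if not isinstance(input_path, (str, Path)):
        raise TypeError("input_path must be str or Path")
    if not isinstance(input_name, str):
        raise TypeError("input_name must be str")

    directory = Path(input_path)
    if not directory.exists():
        raise FileNotFoundError(f"Directory not found: {directory}")

    # A binary PLINK fileset is three files sharing one base name.
    missing = [
        ext for ext in (".bed", ".bim", ".fam")
        if not (directory / f"{input_name}{ext}").is_file()
    ]
    if missing:
        raise FileNotFoundError(
            f"Missing PLINK files for '{input_name}': {missing}"
        )
    return directory / input_name
```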
Notes
This class uses PLINK for genomic data processing operations. It assumes that PLINK is installed and available in the system PATH.
- __init__(input_path: str | Path, input_name: str, output_path: str | Path, output_name: str, high_ld_file: str | Path, build: str = '38') None
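A typical call sequence, based only on the signatures documented on this page, might look like the sketch below. The wrapper function run_preparatory, the base names, and the high LD file name are hypothetical placeholders; the import sits inside the function so the sketch can be loaded without the package installed.

```python
from pathlib import Path


def run_preparatory(data_dir: Path, results_dir: Path) -> None:
    """Sketch: construct Preparatory, prune for LD, then compute PCs."""
    # Imported lazily so this sketch loads even if ideal_genom is absent.
    from ideal_genom.preprocessing.preparatory import Preparatory

    prep = Preparatory(
        input_path=data_dir,
        input_name="study",          # hypothetical: study.bed/.bim/.fam
        output_path=results_dir,
        output_name="study_pruned",  # hypothetical output base name
        high_ld_file=data_dir / "high-LD-regions.txt",  # hypothetical name
        build="38",
    )

    # Pruning must run first: the PCA step requires the pruned fileset.
    prep.execute_ld_prunning(maf=0.01, geno=0.1, hwe=5e-6, ind_pair=[50, 5, 0.2])
    prep.execute_pc_decomposition(pca=10)
```

Both directories must already exist and PLINK must be on the system PATH, as noted above.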
- execute_ld_prunning(mind: float = 0.2, maf: float = 0.01, geno: float = 0.1, hwe: float = 5e-06, ind_pair: list = [50, 5, 0.2], memory: int | None = None, threads: int | None = None) None
Execute LD (Linkage Disequilibrium) pruning on genetic data using PLINK.
This method performs LD pruning in two steps:
1. Excludes high LD regions and identifies independent SNPs.
2. Extracts the identified independent SNPs.
- Parameters:
mind (float, optional (default=0.2)) – Missing rate per individual threshold. Excludes individuals with missing rate higher than threshold.
maf (float, optional (default=0.01)) – Minor allele frequency threshold. Must be between 0 and 0.5.
geno (float, optional (default=0.1)) – Missing rate per SNP threshold. Must be between 0 and 1.
hwe (float, optional (default=5e-6)) – Hardy-Weinberg equilibrium exact test p-value threshold. Must be between 0 and 1.
ind_pair (list, optional (default=[50, 5, 0.2])) – Parameters for pairwise pruning: [window size(variants), step size(variants), r^2 threshold]
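memory (int, optional (default=None)) – Memory in MB to allocate. If None, uses 2/3 of available system memory.
threads (int, optional (default=None)) – Number of threads for PLINK to use. If None, the thread count is determined automatically from the system CPU cores.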
- Returns:
The results are saved to disk and the pruned file path is stored in self.pruned_file
- Return type:
None
- Raises:
TypeError – If mind, maf, geno, or hwe is not a float
ValueError – If maf is not between 0 and 0.5, if geno is not between 0 and 1, or if hwe is not between 0 and 1
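The documented type and range checks could be sketched as follows. The function validate_thresholds is a hypothetical stand-in, and treating the bounds as inclusive is an assumption, since the text only says "between".

```python
# Hypothetical stand-in for the documented checks; not the library's code.
def validate_thresholds(mind=0.2, maf=0.01, geno=0.1, hwe=5e-6):
    """Raise TypeError or ValueError as described in the Raises section."""
    for name, value in (("mind", mind), ("maf", maf),
                        ("geno", geno), ("hwe", hwe)):
        # Integers such as geno=1 are rejected: the docs require float.
        if not isinstance(value, float):
            raise TypeError(
                f"{name} must be a float, got {type(value).__name__}"
            )

    # Assumption: bounds are inclusive.
    if not 0.0 <= maf <= 0.5:
        raise ValueError("maf must be between 0 and 0.5")
    if not 0.0 <= geno <= 1.0:
        raise ValueError("geno must be between 0 and 1")
    if not 0.0 <= hwe <= 1.0:
        raise ValueError("hwe must be between 0 and 1")
```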
Notes
Uses PLINK software for the pruning operations. Operates on chromosomes 1-22 only. Automatically determines optimal thread count based on system CPU cores.
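To make the two pruning steps concrete, the sketch below assembles PLINK 1.9 argument lists without running them. The function build_pruning_commands and the "-pruned" output suffix are hypothetical, and the exact command line the library issues is not shown in the source; only the flags themselves (--exclude range, --indep-pairwise, --extract, --make-bed, --chr, --memory, --threads) are standard PLINK 1.9 options.

```python
# Hypothetical helper: builds, but does not execute, PLINK 1.9 commands.
def build_pruning_commands(bfile, out_prefix, high_ld_file,
                           mind=0.2, maf=0.01, geno=0.1, hwe=5e-6,
                           ind_pair=(50, 5, 0.2), memory=None, threads=None):
    window, step, r2 = ind_pair

    # Optional resource limits are appended to both commands.
    resources = []
    if memory is not None:
        resources += ["--memory", str(memory)]
    if threads is not None:
        resources += ["--threads", str(threads)]

    # Step 1: drop high LD regions, apply the filters, and list
    # independent SNPs in <out_prefix>.prune.in.
    step1 = [
        "plink", "--bfile", str(bfile), "--chr", "1-22",
        "--exclude", "range", str(high_ld_file),
        "--mind", str(mind), "--maf", str(maf),
        "--geno", str(geno), "--hwe", str(hwe),
        "--indep-pairwise", str(window), str(step), str(r2),
        "--out", str(out_prefix),
    ] + resources

    # Step 2: keep only the SNPs listed by step 1 and write a new fileset.
    # Assumption: filters are not re-applied here; the library may differ.
    step2 = [
        "plink", "--bfile", str(bfile), "--chr", "1-22",
        "--extract", f"{out_prefix}.prune.in",
        "--make-bed", "--out", f"{out_prefix}-pruned",
    ] + resources

    return step1, step2
```

Each list can be passed to subprocess.run once PLINK is available on the PATH.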
- execute_pc_decomposition(pca: int = 10, threads: int | None = None) None
Execute PCA decomposition on pruned PLINK binary files.
This method performs Principal Component Analysis (PCA) on the pruned genotype data using PLINK software. It requires the existence of pruned binary PLINK files (.bed, .bim, .fam) and generates PCA eigenvectors and eigenvalues.
- Parameters:
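pca (int, default=10) – Number of principal components to compute. Must be greater than 0.
threads (int, optional (default=None)) – Number of threads for PLINK to use. If None, the thread count is determined automatically from the CPU count.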
- Return type:
None
- Raises:
TypeError – If pca parameter is not an integer.
ValueError – If pca parameter is less than 1.
FileNotFoundError – If any of the required pruned PLINK files (.bed, .bim, .fam) are not found.
Notes
The method automatically determines the optimal number of threads to use based on CPU count, reserving 2 cores for other processes. If CPU count cannot be determined, it defaults to 10 threads.
The output files will be created in the same directory as the input files, using the input name as prefix with extensions .eigenvec and .eigenval.
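The thread-count rule described in the Notes can be sketched with the standard library. The function default_thread_count is a hypothetical name, and the floor of one thread on machines with two or fewer cores is an assumption that the source does not state.

```python
import os


# Hypothetical helper illustrating the documented rule.
def default_thread_count(reserved=2, fallback=10):
    """Return CPU count minus reserved cores, or fallback if unknown."""
    cpu_count = os.cpu_count()  # returns None if it cannot be determined
    if cpu_count is None:
        return fallback
    # Assumption: never go below one thread on small machines.
    return max(1, cpu_count - reserved)
```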