Preparatory Module

Class designed to run preparatory steps before conducting Genome-Wide Association Studies (GWAS).

This class excludes high linkage disequilibrium (LD) regions, prunes the remaining variants, and performs Principal Component Analysis (PCA) on the pruned data. Both operations are run with PLINK, after checking that the input data is in the correct format and that the necessary files are present. If the high LD regions file is not provided, it is fetched with the FetcherLDRegions class from the ideal_genom package. Parameters such as the missing rate, minor allele frequency, and number of principal components are configurable, as are the memory and thread count used when running PLINK.

class ideal_genom.preprocessing.preparatory.Preparatory

Bases: object

A class for preprocessing genomic data in preparation for analysis.

This class handles the preparatory steps needed for genomic data analysis, including input validation, LD (Linkage Disequilibrium) pruning, and PCA (Principal Component Analysis) decomposition.

input_path

Path to the directory containing input PLINK files (.bed, .bim, .fam)

Type:

str or Path

input_name

Base name of the input PLINK files (without extension)

Type:

str

output_path

Path to the directory where output files will be saved

Type:

str or Path

output_name

Base name for the output files

Type:

str

high_ld_file

Path to the high LD regions file. If not found, will be fetched automatically

Type:

str or Path

build

Genome build version, either '38' or '37'

Type:

str, default='38'

Raises:
  • ValueError – If input_path or output_path is None, or if input_name or output_name is None

  • TypeError – If input_path or output_path is not of type str or Path, or if input_name or output_name is not of type str, or if build is not of type str

  • FileNotFoundError – If the specified input_path or output_path does not exist, or if the required PLINK files (.bed, .bim, .fam) are not found, or if the high LD file is not found and cannot be fetched.
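The input checks listed above can be sketched as a standalone helper. This is only an illustration of the documented error types, not the class's actual code; the function name `check_plink_fileset` is hypothetical.

```python
from pathlib import Path


def check_plink_fileset(input_path, input_name):
    """Return the .bed/.bim/.fam paths, raising the documented error types.

    Hypothetical helper: it mirrors the Raises section, not the library code.
    """
    if input_path is None or input_name is None:
        raise ValueError("input_path and input_name must not be None")
    if not isinstance(input_path, (str, Path)):
        raise TypeError("input_path must be str or Path")
    if not isinstance(input_name, str):
        raise TypeError("input_name must be str")

    folder = Path(input_path)
    if not folder.exists():
        raise FileNotFoundError(f"input_path does not exist: {folder}")

    # A PLINK binary fileset is three files sharing one base name.
    files = [folder / f"{input_name}{ext}" for ext in (".bed", ".bim", ".fam")]
    missing = [str(f) for f in files if not f.is_file()]
    if missing:
        raise FileNotFoundError(f"missing PLINK files: {missing}")
    return files
```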

Notes

This class uses PLINK for its genomic data processing operations and assumes that PLINK is installed and available on the system PATH.
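A typical call sequence might look like the sketch below. It relies only on the signatures documented on this page; the wrapper name `run_preparatory` and the example paths are hypothetical, and the import is deferred so the function can be defined without the package installed.

```python
def run_preparatory(input_path, input_name, output_path, output_name,
                    high_ld_file, build="38"):
    """Run LD pruning followed by PCA with the documented default parameters."""
    # Imported lazily; requires the ideal_genom package and PLINK at call time.
    from ideal_genom.preprocessing.preparatory import Preparatory

    prep = Preparatory(
        input_path=input_path,
        input_name=input_name,
        output_path=output_path,
        output_name=output_name,
        high_ld_file=high_ld_file,
        build=build,
    )
    # Method name is spelled as in the library ("prunning").
    prep.execute_ld_prunning(mind=0.2, maf=0.01, geno=0.1, hwe=5e-6,
                             ind_pair=[50, 5, 0.2])
    prep.execute_pc_decomposition(pca=10)
    return prep


# Example call with hypothetical paths:
# run_preparatory("data/raw", "study", "data/prep", "study_prep",
#                 "data/high-LD-regions.txt")
```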

__init__(input_path: str | Path, input_name: str, output_path: str | Path, output_name: str, high_ld_file: str | Path, build: str = '38') -> None

execute_ld_prunning(mind: float = 0.2, maf: float = 0.01, geno: float = 0.1, hwe: float = 5e-06, ind_pair: list = [50, 5, 0.2], memory: int | None = None, threads: int | None = None) -> None

Execute LD (Linkage Disequilibrium) pruning on genetic data using PLINK.

This method performs LD pruning in two steps:

  1. Excludes high LD regions and identifies independent SNPs.

  2. Extracts the identified independent SNPs.

Parameters:
  • mind (float, optional (default=0.2)) – Missing rate per individual threshold. Excludes individuals with missing rate higher than threshold.

  • maf (float, optional (default=0.01)) – Minor allele frequency threshold. Must be between 0 and 0.5.

  • geno (float, optional (default=0.1)) – Missing rate per SNP threshold. Must be between 0 and 1.

  • hwe (float, optional (default=5e-6)) – Hardy-Weinberg equilibrium exact test p-value threshold. Must be between 0 and 1.

  • ind_pair (list, optional (default=[50, 5, 0.2])) – Parameters for pairwise pruning: [window size(variants), step size(variants), r^2 threshold]

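  • memory (int, optional (default=None)) – Memory in MB to allocate. If None, uses 2/3 of available system memory.

  • threads (int, optional (default=None)) – Number of threads for PLINK to use. If None, the thread count is determined automatically from the system CPU cores.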

Returns:

The results are saved to disk, and the pruned file path is stored in self.pruned_file.

Return type:

None

Raises:
  • TypeError – If mind, maf, geno, or hwe is not a float.

  • ValueError – If maf is not between 0 and 0.5, if geno is not between 0 and 1, or if hwe is not between 0 and 1.
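The checks in the Raises list can be sketched as follows. The function name `validate_qc_thresholds` is hypothetical, and treating the bounds as inclusive is an assumption, since the page only says "between".

```python
def validate_qc_thresholds(mind, maf, geno, hwe):
    """Mirror the documented TypeError/ValueError checks (illustrative only)."""
    for name, value in (("mind", mind), ("maf", maf), ("geno", geno), ("hwe", hwe)):
        # The documentation asks for float, so an int such as 0 is rejected here.
        if not isinstance(value, float):
            raise TypeError(f"{name} must be a float, got {type(value).__name__}")

    # Assumption: bounds are inclusive.
    if not 0.0 <= maf <= 0.5:
        raise ValueError("maf must be between 0 and 0.5")
    if not 0.0 <= geno <= 1.0:
        raise ValueError("geno must be between 0 and 1")
    if not 0.0 <= hwe <= 1.0:
        raise ValueError("hwe must be between 0 and 1")
```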

Notes

Uses PLINK software for the pruning operations. Operates on chromosomes 1-22 only. Automatically determines optimal thread count based on system CPU cores.
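This page does not show the exact PLINK invocations, but the two pruning steps could plausibly map to commands like the ones built below. The flags are standard PLINK 1.9 options; the helper name `build_pruning_commands`, the `-pruned` output suffix, and the choice of which filters go in which step are assumptions rather than the library's behaviour.

```python
def build_pruning_commands(bfile, out_prefix, high_ld_file,
                           mind=0.2, maf=0.01, geno=0.1, hwe=5e-6,
                           ind_pair=(50, 5, 0.2), memory=None, threads=None):
    """Return two PLINK argument lists: find independent SNPs, then extract them."""
    resources = []
    if memory is not None:
        resources += ["--memory", str(memory)]   # workspace size in MB
    if threads is not None:
        resources += ["--threads", str(threads)]

    window, step, r2 = ind_pair
    # Step 1: drop high LD regions, apply QC filters, list independent SNPs.
    # PLINK writes the kept variant IDs to <out_prefix>.prune.in.
    step1 = ["plink", "--bfile", str(bfile), "--chr", "1-22",
             "--mind", str(mind), "--maf", str(maf),
             "--geno", str(geno), "--hwe", str(hwe),
             "--exclude", "range", str(high_ld_file),
             "--indep-pairwise", str(window), str(step), str(r2),
             "--out", str(out_prefix)] + resources

    # Step 2: keep only the independent SNPs and write a new binary fileset.
    step2 = ["plink", "--bfile", str(bfile), "--chr", "1-22",
             "--extract", f"{out_prefix}.prune.in",
             "--make-bed", "--out", f"{out_prefix}-pruned"] + resources
    return step1, step2
```

Each list can be passed to `subprocess.run(cmd, check=True)` once PLINK is on the PATH.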

execute_pc_decomposition(pca: int = 10, threads: int | None = None) -> None

Execute PCA decomposition on pruned PLINK binary files.

This method performs Principal Component Analysis (PCA) on the pruned genotype data using PLINK software. It requires the existence of pruned binary PLINK files (.bed, .bim, .fam) and generates PCA eigenvectors and eigenvalues.

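Parameters:
  • pca (int, default=10) – Number of principal components to compute. Must be greater than 0.

  • threads (int, optional (default=None)) – Number of threads for PLINK to use. If None, the thread count is determined automatically.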

Return type:

None

Raises:
  • TypeError – If pca parameter is not an integer.

  • ValueError – If pca parameter is less than 1.

  • FileNotFoundError – If any of the required pruned PLINK files (.bed, .bim, .fam) are not found.

Notes

The method automatically determines the optimal number of threads to use based on CPU count, reserving 2 cores for other processes. If CPU count cannot be determined, it defaults to 10 threads.
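The thread rule described above can be written as a small pure function. The name `pick_thread_count` is hypothetical, and clamping to at least one thread on machines with two or fewer cores is an assumption this page does not cover.

```python
import os


def pick_thread_count(cpu_count, reserve=2, fallback=10):
    """Choose a thread count from a CPU count as returned by os.cpu_count()."""
    if cpu_count is None:
        # os.cpu_count() returns None when the count cannot be determined.
        return fallback
    # Assumption: never return fewer than one thread.
    return max(1, cpu_count - reserve)


threads = pick_thread_count(os.cpu_count())
```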

The output files will be created in the same directory as the input files, using the input name as prefix with extensions .eigenvec and .eigenval.
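The resulting files can be loaded with a few lines of standard-library Python. This sketch assumes the PLINK 1.9 layout, where .eigenval holds one eigenvalue per line and .eigenvec holds whitespace-separated rows of FID, IID, and the component values without a header. Lines starting with `#` are skipped to tolerate a PLINK 2 style header, but other PLINK 2 column layouts are not handled. The function name `read_pca_output` is hypothetical.

```python
from pathlib import Path


def read_pca_output(prefix):
    """Read <prefix>.eigenval and <prefix>.eigenvec into plain Python objects."""
    eigenvalues = [float(v) for v in Path(f"{prefix}.eigenval").read_text().split()]

    samples = []
    for line in Path(f"{prefix}.eigenvec").read_text().splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and any header line
        fields = line.split()
        # Assumed columns: family ID, individual ID, then PC1..PCn.
        samples.append((fields[0], fields[1], [float(x) for x in fields[2:]]))
    return eigenvalues, samples
```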