References Module
- class ideal_genom.get_references.AssemblyReferenceFetcher
Bases:
object
A class for fetching and preparing genomic reference files from online repositories.
This class handles the process of:
1. Finding the appropriate reference file URL based on build parameters
2. Downloading the reference file
3. Unzipping compressed reference files if necessary
- Parameters:
base_url (str) – The base URL where reference files are hosted
build (str) – The genome build identifier (e.g., ‘GRCh38’, ‘hg19’)
extension (str) – File extension to look for (e.g., ‘.gtf.gz’, ‘.fa.gz’)
destination_folder (Optional[str], default=None) – Path where files should be downloaded. If None, uses project_root/data/assembly_references
avoid_substring (str, default='extra') – Substring to avoid when selecting reference files
- reference_url
URL of the identified reference file
- Type:
str or None
- reference_file
Filename of the identified reference file
- Type:
str or None
- file_path
Local path to the downloaded reference file
- Type:
Path or None
- Raises:
Exception – If the base URL cannot be accessed
FileNotFoundError – If no matching reference file is found
AttributeError – If methods are called out of sequence
ValueError – If required attributes are None when needed
- __init__(base_url: str, build: str, extension: str, destination_folder: str | None = None, avoid_substring: str = 'extra') None
- _download_file(url: str, file_path: Path) None
Download a file from a given URL and save it to the specified path.
- Parameters:
url (str) – The URL to download the file from.
file_path (Path) – The path where the downloaded file will be saved.
- Return type:
None
- Raises:
HTTPError – If the HTTP request returns an unsuccessful status code.
- download_reference_file() str
Downloads a reference file from the specified URL to the destination folder.
This method first checks whether the reference file already exists locally. If not found, it also looks for an alternative version with a ‘.fa’ extension. If neither is present, it downloads the file from the given URL.
- Raises:
AttributeError – If self.reference_url or self.reference_file are not set.
ValueError – If self.reference_url or self.reference_file are set to None.
- Returns:
The path to the downloaded or existing reference file.
- Return type:
str
Note
self.reference_url and self.reference_file must be set by calling get_reference_url() before using this method.
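The check-before-download logic described for download_reference_file() can be sketched as below. This is a minimal illustration, not the library's code: ensure_reference_file and its download callback are hypothetical names, and treating the '.fa' alternative as the file name with a trailing '.gz' removed is an assumption.

```python
from pathlib import Path
from typing import Callable


def ensure_reference_file(
    url: str,
    filename: str,
    folder: Path,
    download: Callable[[str, Path], None],
) -> Path:
    """Return a local copy of the reference file, downloading only if needed.

    `download` is a hypothetical callback (url, target_path) that performs
    the actual transfer; injecting it keeps the sketch free of network calls.
    """
    target = folder / filename
    if target.exists():
        # The compressed file is already present.
        return target

    if filename.endswith(".gz"):
        # Assumption: the alternative is the same name without ".gz",
        # e.g. "genome.fa.gz" -> "genome.fa".
        unzipped = folder / filename[: -len(".gz")]
        if unzipped.exists():
            return unzipped

    # Neither version exists locally, so fetch the file.
    download(url, target)
    return target
```

Passing the downloader as an argument is a design choice made for this sketch so that the lookup order can be exercised without any HTTP request.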
- get_destination_folder() Path
Determines and creates (if necessary) the destination folder for reference files.
If a destination folder was provided during initialization, it uses that path. Otherwise, it defaults to a ‘data/assembly_references’ directory in the project root.
- Returns:
The path to the destination folder where reference files will be stored.
- Return type:
Path
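The fallback behaviour of get_destination_folder() might look like the following sketch. resolve_destination is a hypothetical helper, and taking the project root as an explicit argument is an assumption, since the documentation does not say how the root is located.

```python
from pathlib import Path
from typing import Optional


def resolve_destination(destination_folder: Optional[str], project_root: Path) -> Path:
    """Pick the destination folder and create it if it does not exist."""
    if destination_folder is not None:
        # An explicit folder given at initialization takes precedence.
        folder = Path(destination_folder)
    else:
        # Default described in the docs: <project_root>/data/assembly_references
        folder = project_root / "data" / "assembly_references"

    # parents=True creates intermediate directories; exist_ok=True makes
    # repeated calls harmless.
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```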
- get_reference_url() str
Retrieves the URL for the reference file from the base URL.
This method performs an HTTP GET request to the base URL, parses the HTML content, and searches for links that contain the build version string, end with the specified extension, and do not contain the substring to avoid. The first matching link is taken as the reference file.
- Returns:
The complete URL to the reference file
- Return type:
str
- Raises:
Exception – If the base URL cannot be accessed
FileNotFoundError – If no matching reference file is found
Notes
Sets self.reference_file to the name of the found file
Sets self.reference_url to the complete URL
Logs information about the found file and URL
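The three link criteria used by get_reference_url() can be illustrated with a standard-library sketch. select_reference_link and _LinkCollector are hypothetical names, the HTML is passed in as a string rather than fetched, and the library itself may use a different HTML parser.

```python
from html.parser import HTMLParser
from typing import List, Optional
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def select_reference_link(
    html: str,
    base_url: str,
    build: str,
    extension: str,
    avoid_substring: str = "extra",
) -> Optional[str]:
    """Return the full URL of the first link matching all three criteria."""
    collector = _LinkCollector()
    collector.feed(html)
    for href in collector.hrefs:
        if (
            build in href                       # contains the build string
            and href.endswith(extension)        # ends with the extension
            and avoid_substring not in href     # lacks the substring to avoid
        ):
            return urljoin(base_url, href)
    # No match; a caller could raise FileNotFoundError here.
    return None
```

Note that urljoin only appends a relative link to the directory when base_url ends with a slash.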
- unzip_reference_file() str
Unzips a reference genome file (typically .fa.gz to .fa) and returns the path to the unzipped file.
This method checks if the file is already unzipped, and if not, unzips it using gzip. After successful unzipping, the original compressed file is deleted.
- Returns:
Path to the unzipped reference file (.fa)
- Return type:
str
- Raises:
AttributeError – If self.reference_file is not set (get_reference_url should be called first)
AttributeError – If self.file_path is not set or None (download_reference_file should be called first)
OSError – If an error occurs during the unzipping process
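The unzip-then-delete behaviour of unzip_reference_file() corresponds roughly to the sketch below. gunzip_and_remove is a hypothetical helper written only with the standard library; it is not the method's actual implementation.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_and_remove(gz_path: Path) -> Path:
    """Decompress `name.fa.gz` to `name.fa` and delete the compressed file."""
    # with_suffix("") drops only the last suffix: "genome.fa.gz" -> "genome.fa"
    out_path = gz_path.with_suffix("")

    if out_path.exists():
        # Already unzipped; nothing to do.
        return out_path

    # Stream the data in chunks so large genomes are never fully in memory.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Remove the compressed original only after a successful write.
    gz_path.unlink()
    return out_path
```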
- class ideal_genom.get_references.Ensembl37Fetcher
Bases:
ReferenceDataFetcher
A class for fetching reference genome data from Ensembl’s GRCh37 (hg19) repository.
This class specializes the ReferenceDataFetcher to work specifically with Ensembl’s GRCh37 human genome build. It provides functionality to automatically detect and download the latest available GTF file for Homo sapiens from the Ensembl GRCh37 archive. The fetcher connects to Ensembl’s FTP server, identifies the most recent release available for GRCh37, and locates the chromosome GTF file for human genome data.
- base_url
The base URL for Ensembl’s GRCh37 repository
- Type:
str
- build
The genome build identifier (‘37’)
- Type:
str
- source
The data source identifier (‘ensembl’)
- Type:
str
- latest_url
The complete URL to the latest GTF file, populated after calling get_latest_release()
- Type:
str
- __init__(destination_folder=None)
Initialize a reference genome downloader for Ensembl GRCh37. This constructor configures the downloader to retrieve data from Ensembl’s GRCh37 repository.
- Parameters:
destination_folder (str, optional) – The folder where downloaded files will be stored. If None, a default location will be used based on the parent class implementation.
- get_latest_release() None
Fetches the URL of the latest GTF file for Homo sapiens GRCh37 from Ensembl.
This method:
1. Connects to the base URL and identifies all available release folders
2. Determines the latest release by finding the highest release number
3. Navigates to the GTF directory for that release
4. Locates the Homo sapiens GRCh37 chromosome GTF file
5. Stores the complete download URL in self.latest_url
- Raises:
Exception – If the base URL cannot be accessed
Exception – If no release folders are found
Exception – If the latest release folder cannot be accessed
FileNotFoundError – If the GTF file is not found in the latest release
- Return type:
None
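Step 2 of get_latest_release(), choosing the highest release number, can be sketched as follows. latest_release_folder is a hypothetical helper, and the 'release-<number>' folder naming is an assumption made for the example.

```python
import re
from typing import Iterable, Optional


def latest_release_folder(hrefs: Iterable[str]) -> Optional[str]:
    """Return the folder link with the highest release number, or None."""
    # Assumed naming: "release-<number>" with an optional trailing slash.
    pattern = re.compile(r"release-(\d+)/?$")

    best_href: Optional[str] = None
    best_number = -1
    for href in hrefs:
        match = pattern.search(href)
        if match is None:
            continue  # not a release folder
        # Compare as integers: as strings, "release-9" would sort
        # after "release-110".
        number = int(match.group(1))
        if number > best_number:
            best_number = number
            best_href = href
    return best_href
```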
- class ideal_genom.get_references.Ensembl38Fetcher
Bases:
ReferenceDataFetcher
A class for fetching human genome reference data from Ensembl based on GRCh38 build.
This class extends ReferenceDataFetcher to specifically handle Ensembl’s human genome data with build 38. It provides functionality to find and retrieve the latest GTF file from Ensembl’s FTP server.
- base_url
Base URL for Ensembl FTP server where GTF files are stored
- Type:
str
- build
Genome build version (‘38’)
- Type:
str
- source
Data source (‘ensembl’)
- Type:
str
- destination_folder
Local folder to store downloaded files
- Type:
str
- latest_url
URL of the latest GTF file after calling get_latest_release()
- Type:
str
- Raises:
Exception – If the Ensembl FTP server cannot be accessed
FileNotFoundError – If no matching GTF file is found
- __init__(destination_folder=None)
- get_latest_release() None
Retrieves the URL of the latest GTF file for human genome (GRCh38) from the base URL.
This method scrapes the base URL to find the most recent Homo_sapiens GRCh38 GTF file available for download. Upon finding the file, it constructs the complete URL and stores it in the instance variable latest_url.
- Return type:
None
- Raises:
Exception – If the base URL cannot be accessed (non-200 response)
FileNotFoundError – If no GTF file matching the criteria is found
- class ideal_genom.get_references.FetcherLDRegions
Bases:
object
A class for fetching high Linkage Disequilibrium (LD) regions files.
This class handles downloading or creating files containing genomic regions with high LD for different genome builds (37 or 38). These regions are often excluded in GWAS analyses to avoid confounding effects.
- Parameters:
destination (Path, optional) – Directory path where the LD regions files will be stored. Default is “../data/ld_regions_files” relative to the module location.
built (str, default '38') – Genome build version. Must be either ‘37’ or ‘38’.
- destination
Directory where LD regions files are stored.
- Type:
Path
- built
Genome build version being used.
- Type:
str
- ld_regions
Path to the LD regions file once retrieved, None otherwise.
- Type:
Path or None
- __init__(destination: Path | None = None, built: str = '38')
- get_ld_regions() Path
Downloads or creates high LD regions file based on genome build version.
This method handles the retrieval of high Linkage Disequilibrium (LD) regions for different genome builds (37 or 38). For build 37, it downloads the regions from a GitHub repository. For build 38, it creates the file from predefined coordinates.
- Returns:
Path to the created or downloaded LD regions file. Returns an empty Path if the download fails for build 37.
- Return type:
Path
Notes
For build 37: Downloads from genepi-freiburg/gwas repository
For build 38: Creates file from hardcoded coordinates from GWAS-pipeline
Files are named as ‘high-LD-regions_GRCh{build}.txt’
Creates destination directory if it doesn’t exist
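The file naming and directory handling from the notes above can be sketched like this. ld_regions_path is a hypothetical helper, and raising ValueError for an unsupported build is an assumption; the documentation only states that the build must be '37' or '38'.

```python
from pathlib import Path

VALID_BUILDS = ("37", "38")


def ld_regions_path(destination: Path, built: str = "38") -> Path:
    """Build the expected path of the high-LD regions file for a genome build."""
    if built not in VALID_BUILDS:
        # Assumption: the error type is chosen for this sketch.
        raise ValueError(f"built must be one of {VALID_BUILDS}, got {built!r}")

    # Create the destination directory if it does not exist yet.
    destination.mkdir(parents=True, exist_ok=True)

    # Naming pattern from the notes: high-LD-regions_GRCh{build}.txt
    return destination / f"high-LD-regions_GRCh{built}.txt"
```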
- class ideal_genom.get_references.RefSeqFetcher
Bases:
ReferenceDataFetcher
A class for fetching and downloading reference genome data from NCBI’s RefSeq repository.
This class extends ReferenceDataFetcher to specifically handle downloading human genome reference files from the RefSeq database. It supports different genome builds (e.g., ‘GRCh37’, ‘GRCh38’) and automatically identifies the latest version available for the specified build. The class handles navigating the NCBI FTP directory structure, finding the appropriate GTF files for the requested genome build, and managing the download process.
- base_url
The base URL for the NCBI RefSeq FTP server directory.
- Type:
str
- build
The genome build version (‘37’ for GRCh37, ‘38’ for GRCh38).
- Type:
str
- source
The source of the reference data (set to ‘refseq’).
- Type:
str
- latest_url
URL to the latest GTF file, set after calling get_latest_release().
- Type:
str
- __init__(build: str, destination_folder: str | None = None)
- get_latest_release() None
Fetches the latest GTF file dynamically from the specified base URL.
This method sends a GET request to the base URL, parses the HTML response to find the latest GTF file link, and sets the latest_url attribute to the full URL of the latest GTF file.
- Raises:
FileNotFoundError – If no GTF file is found in the HTML response.
- Return type:
None
- class ideal_genom.get_references.ReferenceDataFetcher
Bases:
object
A class for fetching, downloading, and processing reference genome data.
This class provides a framework for retrieving genomic reference data from various sources. It handles downloading compressed files, unzipping them, and extracting gene information from GTF files.
- build
The genome build (e.g., ‘hg38’, ‘GRCh38’).
- Type:
str
- source
The data source (e.g., ‘ensembl’, ‘ucsc’).
- Type:
str
- base_url
The base URL to fetch data from.
- Type:
str
- destination_folder
The directory to save downloaded files. If None, defaults to project_root/data/{source}_latest.
- Type:
Optional[str]
- latest_url
The URL of the latest release after calling get_latest_release().
- Type:
Optional[str]
- gz_file
Path to the downloaded compressed file.
- Type:
Optional[str]
- gtf_file
Path to the uncompressed GTF file.
- Type:
Optional[str]
Notes
This is an abstract base class that requires subclasses to implement the get_latest_release() method for specific data sources.
- __init__(base_url: str, build: str, source: str, destination_folder: str | None = None) None
- _download_file(url: str, file_path: Path) None
Download a file from the given URL and save it to file_path.
- download_latest() str
Downloads the latest file from self.latest_url to self.destination_folder.
- Raises:
AttributeError – If self.latest_url is not set.
requests.exceptions.RequestException – If the HTTP request fails.
- get_all_genes() str
Extract all genes from the GTF file and save them to a new compressed file.
This method reads the GTF file specified in self.gtf_file, filters for gene features, and creates a new GTF file containing only the gene entries. If the output file already exists, it will return the path without reprocessing.
- Returns:
Path to the compressed GTF file containing all genes
- Return type:
str
- Raises:
FileNotFoundError – If the reference GTF file (self.gtf_file) is not found
TypeError – If read_gtf does not return a pandas DataFrame
Note
The output file will be named based on the input GTF file with “-all_genes.gtf.gz” suffix
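The gene-filtering step of get_all_genes() can be approximated with a standard-library sketch. The documented method reads the file with read_gtf into a pandas DataFrame; the hypothetical write_gene_records below instead filters lines on the third GTF column, so it illustrates the idea rather than reproducing the implementation.

```python
import gzip
from pathlib import Path


def write_gene_records(gtf_path: Path) -> Path:
    """Copy only the `gene` feature lines of a GTF file to a gzipped file."""
    if not gtf_path.exists():
        raise FileNotFoundError(f"GTF file not found: {gtf_path}")

    # "toy.gtf" -> "toy-all_genes.gtf.gz", following the documented suffix.
    out_path = gtf_path.with_name(gtf_path.stem + "-all_genes.gtf.gz")
    if out_path.exists():
        # Skip reprocessing when the output is already there.
        return out_path

    with open(gtf_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line in src:
            if line.startswith("#"):
                continue  # header/comment lines carry no features
            fields = line.rstrip("\n").split("\t")
            # Column 3 of a GTF record holds the feature type.
            if len(fields) >= 3 and fields[2] == "gene":
                dst.write(line)
    return out_path
```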
- get_destination_folder() Path
Determine the destination folder for downloads.
- get_latest_release() None
Determine the specific URL for fetching data.
- unzip_latest() str
Unzips the latest downloaded file and stores it as a GTF file.
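The abstract-base pattern noted for ReferenceDataFetcher, where each subclass supplies its own get_latest_release(), can be sketched with two toy classes. MiniFetcher, StaticFetcher, describe_download, and the example URL are hypothetical and stand in for the real classes; no network access is involved.

```python
from typing import Optional


class MiniFetcher:
    """Toy base class: shared workflow, source-specific URL discovery."""

    def __init__(self, base_url: str, build: str, source: str) -> None:
        self.base_url = base_url
        self.build = build
        self.source = source
        self.latest_url: Optional[str] = None

    def get_latest_release(self) -> None:
        # Each data source locates its latest file differently.
        raise NotImplementedError("subclasses must set self.latest_url")

    def describe_download(self) -> str:
        # Mirrors the documented ordering: the URL must be resolved first.
        if self.latest_url is None:
            raise AttributeError("call get_latest_release() first")
        return f"{self.source}/{self.build}: {self.latest_url}"


class StaticFetcher(MiniFetcher):
    """Toy subclass that 'discovers' a fixed file name."""

    def get_latest_release(self) -> None:
        self.latest_url = self.base_url + "example.gtf.gz"
```

The shared method only depends on latest_url being set, which is why a subclass needs to override nothing but get_latest_release().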