References Module

class ideal_genom.get_references.AssemblyReferenceFetcher

Bases: object

A class for fetching and preparing genomic reference files from online repositories.

This class handles the process of:

  1. Finding the appropriate reference file URL based on build parameters
  2. Downloading the reference file
  3. Unzipping compressed reference files if necessary

Parameters:
  • base_url (str) – The base URL where reference files are hosted

  • build (str) – The genome build identifier (e.g., ‘GRCh38’, ‘hg19’)

  • extension (str) – File extension to look for (e.g., ‘.gtf.gz’, ‘.fa.gz’)

  • destination_folder (Optional[str], default=None) – Path where files should be downloaded. If None, uses project_root/data/assembly_references

  • avoid_substring (str, default='extra') – Substring to avoid when selecting reference files

reference_url

URL of the identified reference file

Type:

str or None

reference_file

Filename of the identified reference file

Type:

str or None

file_path

Local path to the downloaded reference file

Type:

Path or None

Raises:
  • Exception – If the base URL cannot be accessed

  • FileNotFoundError – If no matching reference file is found

  • AttributeError – If methods are called out of sequence

  • ValueError – If required attributes are None when needed

__init__(base_url: str, build: str, extension: str, destination_folder: str | None = None, avoid_substring: str = 'extra') → None

_download_file(url: str, file_path: Path) → None

Download a file from a given URL and save it to the specified path.

Parameters:
  • url (str) – The URL to download the file from.

  • file_path (Path) – The path where the downloaded file will be saved.

Return type:

None

Raises:

HTTPError – If the HTTP request returns an unsuccessful status code.
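A streamed download of this kind can be sketched with the standard library alone. This section does not say which HTTP client the class uses, so the code below is an assumption: download_to is a hypothetical helper built on urllib, which raises urllib.error.HTTPError when the server answers with an error status.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to(url: str, file_path: Path, chunk_size: int = 1 << 20) -> None:
    """Stream the content at `url` into `file_path` (hypothetical helper)."""
    # Make sure the target directory exists before opening the output file.
    file_path.parent.mkdir(parents=True, exist_ok=True)
    # urlopen raises urllib.error.HTTPError for HTTP error responses.
    with urllib.request.urlopen(url) as response, open(file_path, "wb") as out:
        # Copy in chunks so large reference files are never held in memory.
        shutil.copyfileobj(response, out, chunk_size)
```

Copying in fixed-size chunks matters here because genome reference files can be several gigabytes.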

download_reference_file() → str

Downloads a reference file from the specified URL to the destination folder.

This method first checks whether the reference file already exists locally. If not found, it also looks for an alternative version with a ‘.fa’ extension. If neither is present, it downloads the file from the given URL.

Raises:
  • AttributeError – If self.reference_url or self.reference_file are not set.

  • ValueError – If self.reference_url or self.reference_file are set to None.

Returns:

The path to the downloaded or existing reference file.

Return type:

str

Note

self.reference_url and self.reference_file must be set by calling get_reference_url() before using this method.
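The lookup order described for this method can be sketched as follows. This is a minimal illustration, not the library's code: find_local_reference is a hypothetical helper, and it assumes the ‘.fa’ alternative is the reference filename with its trailing ‘.gz’ removed.

```python
from pathlib import Path
from typing import Optional


def find_local_reference(destination: Path, reference_file: str) -> Optional[Path]:
    """Return an existing local copy of the reference file, or None."""
    # First choice: the file exactly as named on the server.
    compressed = destination / reference_file
    if compressed.exists():
        return compressed
    # Assumption: the alternative is the same name without the '.gz' suffix,
    # i.e. a copy that was already unzipped by an earlier run.
    if reference_file.endswith(".gz"):
        unzipped = destination / reference_file[:-3]
        if unzipped.exists():
            return unzipped
    # Neither copy exists, so the caller would need to download the file.
    return None
```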

get_destination_folder() → Path

Determines and creates (if necessary) the destination folder for reference files.

If a destination folder was provided during initialization, it uses that path. Otherwise, it defaults to a ‘data/assembly_references’ directory in the project root.

Returns:

The path to the destination folder where reference files will be stored.

Return type:

Path
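The fallback logic can be illustrated with a short sketch. How the library locates the project root is not documented here, so resolve_destination is a hypothetical helper that takes the root as an explicit argument.

```python
from pathlib import Path
from typing import Optional


def resolve_destination(destination_folder: Optional[str], project_root: Path) -> Path:
    """Pick the user-supplied folder or the default, creating it if needed."""
    if destination_folder is not None:
        folder = Path(destination_folder)
    else:
        # Default location named in the documentation above.
        folder = project_root / "data" / "assembly_references"
    # parents=True builds intermediate directories; exist_ok=True makes
    # repeated calls harmless.
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```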

get_reference_url() → str

Retrieves the URL for the reference file from the base URL.

This method performs an HTTP GET request to the base URL, parses the HTML content, and searches for links matching specific criteria:

  • Contains the build version string

  • Ends with the specified extension

  • Does not contain the specified substring to avoid

The first matching link is considered the reference file.

Returns:

The complete URL to the reference file

Return type:

str

Raises:
  • Exception – If the base URL cannot be accessed

  • FileNotFoundError – If no matching reference file is found

Notes

  • Sets self.reference_file to the name of the found file

  • Sets self.reference_url to the complete URL

  • Logs information about the found file and URL
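The three link criteria can be sketched without any network access by parsing an HTML string directly. This is an illustration under assumptions: pick_reference_link and _HrefCollector are hypothetical names, and the standard-library HTML parser stands in for whatever parser the library actually uses.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class _HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def pick_reference_link(html: str, base_url: str, build: str,
                        extension: str, avoid_substring: str = "extra") -> str:
    """Return the full URL of the first link meeting all three criteria."""
    collector = _HrefCollector()
    collector.feed(html)
    for href in collector.hrefs:
        if (build in href
                and href.endswith(extension)
                and avoid_substring not in href):
            # Join with the base URL so relative links become absolute.
            return urljoin(base_url, href)
    raise FileNotFoundError(f"No {extension} file found for build {build}")
```

Because the first match wins, the order of links on the listing page decides which file is chosen when several satisfy the criteria.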

unzip_reference_file() → str

Unzips a reference genome file (typically .fa.gz to .fa) and returns the path to the unzipped file.

This method checks if the file is already unzipped, and if not, unzips it using gzip. After successful unzipping, the original compressed file is deleted.

Returns:

Path to the unzipped reference file (.fa)

Return type:

str

Raises:
  • AttributeError – If self.reference_file is not set (get_reference_url should be called first)

  • AttributeError – If self.file_path is not set or None (download_reference_file should be called first)

  • OSError – If an error occurs during the unzipping process
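The check, decompress, and delete sequence can be sketched as below. gunzip_and_remove is a hypothetical helper, and the cleanup of a partially written output file on failure is an addition of this sketch rather than documented behaviour.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_and_remove(gz_path: Path) -> Path:
    """Decompress `gz_path` next to itself and delete the compressed copy."""
    # with_suffix("") drops only the last suffix: 'ref.fa.gz' -> 'ref.fa'.
    out_path = gz_path.with_suffix("")
    if out_path.exists():
        # Already unzipped by an earlier run; nothing to do.
        return out_path
    try:
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    except OSError:
        # Do not leave a truncated file behind that a later run would
        # mistake for a finished one.
        out_path.unlink(missing_ok=True)
        raise
    # Remove the archive only after the decompression succeeded.
    gz_path.unlink()
    return out_path
```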

class ideal_genom.get_references.Ensembl37Fetcher

Bases: ReferenceDataFetcher

A class for fetching reference genome data from Ensembl’s GRCh37 (hg19) repository.

This class specializes the ReferenceDataFetcher to work specifically with Ensembl’s GRCh37 human genome build. It provides functionality to automatically detect and download the latest available GTF file for Homo sapiens from the Ensembl GRCh37 archive. The fetcher connects to Ensembl’s FTP server, identifies the most recent release available for GRCh37, and locates the chromosome GTF file for human genome data.

base_url

The base URL for Ensembl’s GRCh37 repository

Type:

str

build

The genome build identifier (‘37’)

Type:

str

source

The data source identifier (‘ensembl’)

Type:

str

latest_url

The complete URL to the latest GTF file, populated after calling get_latest_release()

Type:

str

__init__(destination_folder=None)

Initialize a reference genome downloader for Ensembl GRCh37. This constructor configures the downloader to retrieve data from Ensembl’s GRCh37 repository.

Parameters:

destination_folder (str, optional) – The folder where downloaded files will be stored. If None, a default location will be used based on the parent class implementation.

get_latest_release() → None

Fetches the URL of the latest GTF file for Homo sapiens GRCh37 from Ensembl.

This method:

  1. Connects to the base URL and identifies all available release folders
  2. Determines the latest release by finding the highest release number
  3. Navigates to the GTF directory for that release
  4. Locates the Homo sapiens GRCh37 chromosome GTF file
  5. Stores the complete download URL in self.latest_url

Raises:
  • Exception – If the base URL cannot be accessed

  • Exception – If no release folders are found

  • Exception – If the latest release folder cannot be accessed

  • FileNotFoundError – If the GTF file is not found in the latest release

Return type:

None
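Step 2, choosing the highest release number, can be sketched as follows. The folder naming pattern ‘release-NN/’ is an assumption about the directory listing, and latest_release is a hypothetical helper.

```python
import re
from typing import Dict, Iterable


def latest_release(folder_names: Iterable[str]) -> str:
    """Return the folder name carrying the highest release number."""
    releases: Dict[int, str] = {}
    for name in folder_names:
        # Assumed pattern: 'release-87/' (the trailing slash is optional).
        match = re.fullmatch(r"release-(\d+)/?", name)
        if match:
            releases[int(match.group(1))] = name
    if not releases:
        raise Exception("No release folders found")
    # Compare as integers: as strings, 'release-9' sorts after 'release-100'.
    return releases[max(releases)]
```

Comparing the numbers as integers rather than sorting the folder names avoids picking an old single-digit or two-digit release over a newer three-digit one.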

class ideal_genom.get_references.Ensembl38Fetcher

Bases: ReferenceDataFetcher

A class for fetching human genome reference data from Ensembl for the GRCh38 build.

This class extends ReferenceDataFetcher to specifically handle Ensembl’s human genome data with build 38. It provides functionality to find and retrieve the latest GTF file from Ensembl’s FTP server.

base_url

Base URL for Ensembl FTP server where GTF files are stored

Type:

str

build

Genome build version (‘38’)

Type:

str

source

Data source (‘ensembl’)

Type:

str

destination_folder

Local folder to store downloaded files

Type:

str

latest_url

URL of the latest GTF file after calling get_latest_release()

Type:

str

Raises:
  • Exception – If the Ensembl FTP server cannot be accessed

  • FileNotFoundError – If no matching GTF file is found

__init__(destination_folder=None)

get_latest_release() → None

Retrieves the URL of the latest GTF file for the human genome (GRCh38) from the base URL.

This method scrapes the base URL to find the most recent Homo_sapiens GRCh38 GTF file available for download. Upon finding the file, it constructs the complete URL and stores it in the instance variable latest_url.

Return type:

None

Raises:
  • Exception – If the base URL cannot be accessed (non-200 response)

  • FileNotFoundError – If no GTF file matching the criteria is found

class ideal_genom.get_references.FetcherLDRegions

Bases: object

A class for fetching high Linkage Disequilibrium (LD) regions files.

This class handles downloading or creating files containing genomic regions with high LD for different genome builds (37 or 38). These regions are often excluded in GWAS analyses to avoid confounding effects.

Parameters:
  • destination (Path, optional) – Directory path where the LD regions files will be stored. Default is “../data/ld_regions_files” relative to the module location.

  • built (str, default '38') – Genome build version. Must be either ‘37’ or ‘38’.

destination

Directory where LD regions files are stored.

Type:

Path

built

Genome build version being used.

Type:

str

ld_regions

Path to the LD regions file once retrieved, None otherwise.

Type:

Path or None

__init__(destination: Path | None = None, built: str = '38')

get_ld_regions() → Path

Downloads or creates the high-LD regions file for the configured genome build.

This method handles the retrieval of high Linkage Disequilibrium (LD) regions for different genome builds (37 or 38). For build 37, it downloads the regions from a GitHub repository. For build 38, it creates the file from predefined coordinates.

Returns:

Path to the created or downloaded LD regions file. Returns an empty Path if the download fails for build 37.

Return type:

Path

Notes

  • For build 37: Downloads from genepi-freiburg/gwas repository

  • For build 38: Creates file from hardcoded coordinates from GWAS-pipeline

  • Files are named as ‘high-LD-regions_GRCh{build}.txt’

  • Creates destination directory if it doesn’t exist
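The file naming rule and the empty-Path failure signal can be sketched as below. Both helpers are hypothetical, and raising ValueError for an unsupported build is an assumption of this sketch, not documented behaviour.

```python
from pathlib import Path


def ld_regions_file(destination: Path, built: str = "38") -> Path:
    """Build the expected file path for a given genome build."""
    if built not in ("37", "38"):
        # Assumption: how the class reacts to other values is not documented.
        raise ValueError("built must be '37' or '38'")
    destination.mkdir(parents=True, exist_ok=True)
    return destination / f"high-LD-regions_GRCh{built}.txt"


def download_failed(result: Path) -> bool:
    """Detect the empty Path that signals a failed build 37 download."""
    # Path() is the same as Path('.'), and Path objects are always truthy,
    # so `if not result:` would never detect the failure.
    return result == Path()
```

A caller that receives the return value of get_ld_regions() would therefore compare it against Path() rather than rely on truthiness.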

class ideal_genom.get_references.RefSeqFetcher

Bases: ReferenceDataFetcher

A class for fetching and downloading reference genome data from NCBI’s RefSeq repository.

This class extends ReferenceDataFetcher to specifically handle downloading human genome reference files from the RefSeq database. It supports different genome builds (e.g., ‘GRCh37’, ‘GRCh38’) and automatically identifies the latest version available for the specified build. The class handles navigating the NCBI FTP directory structure, finding the appropriate GTF files for the requested genome build, and managing the download process.

base_url

The base URL for the NCBI RefSeq FTP server directory.

Type:

str

build

The genome build version (‘37’ for GRCh37, ‘38’ for GRCh38).

Type:

str

source

The source of the reference data (set to ‘refseq’).

Type:

str

latest_url

URL to the latest GTF file, set after calling get_latest_release().

Type:

str

__init__(build: str, destination_folder: str | None = None)

get_latest_release() → None

Fetches the latest GTF file dynamically from the specified base URL.

This method sends a GET request to the base URL, parses the HTML response to find the latest GTF file link, and sets the latest_url attribute to the full URL of the latest GTF file.

Raises:

FileNotFoundError – If no GTF file is found in the HTML response.

Return type:

None

class ideal_genom.get_references.ReferenceDataFetcher

Bases: object

A class for fetching, downloading, and processing reference genome data.

This class provides a framework for retrieving genomic reference data from various sources. It handles downloading compressed files, unzipping them, and extracting gene information from GTF files.

build

The genome build (e.g., ‘hg38’, ‘GRCh38’).

Type:

str

source

The data source (e.g., ‘ensembl’, ‘ucsc’).

Type:

str

base_url

The base URL to fetch data from.

Type:

str

destination_folder

The directory to save downloaded files. If None, defaults to project_root/data/{source}_latest.

Type:

Optional[str]

latest_url

The URL of the latest release after calling get_latest_release().

Type:

Optional[str]

gz_file

Path to the downloaded compressed file.

Type:

Optional[str]

gtf_file

Path to the uncompressed GTF file.

Type:

Optional[str]

Notes

This is an abstract base class that requires subclasses to implement the get_latest_release() method for specific data sources.
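The subclassing contract can be illustrated with a stand-in base class. Nothing below is the library's code: FetcherSketch and StaticFetcher are hypothetical, the use of the abc module is an assumption, and download_latest only returns the URL instead of downloading anything.

```python
from abc import ABC, abstractmethod
from typing import Optional


class FetcherSketch(ABC):
    """Stand-in for the base class; holds the shared state and call order."""

    def __init__(self, base_url: str, build: str, source: str) -> None:
        self.base_url = base_url
        self.build = build
        self.source = source
        self.latest_url: Optional[str] = None

    @abstractmethod
    def get_latest_release(self) -> None:
        """Each data source decides how to locate its latest file."""

    def download_latest(self) -> str:
        if self.latest_url is None:
            raise AttributeError("call get_latest_release() first")
        # Placeholder: a real fetcher would download the file here.
        return self.latest_url


class StaticFetcher(FetcherSketch):
    """Toy subclass that 'finds' a fixed file name under the base URL."""

    def get_latest_release(self) -> None:
        self.latest_url = self.base_url + "latest.gtf.gz"
```

The point of the pattern is that the download and unzip steps live in one place, while each source only supplies the logic for finding its latest file.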

__init__(base_url: str, build: str, source: str, destination_folder: str | None = None) → None

_download_file(url: str, file_path: Path) → None

Download a file from the given URL and save it to file_path.

download_latest() → str

Downloads the latest file from self.latest_url to self.destination_folder.

Raises:
  • AttributeError – If self.latest_url is not set.

  • requests.exceptions.RequestException – If the HTTP request fails.

get_all_genes() → str

Extract all genes from the GTF file and save them to a new compressed file.

This method reads the GTF file specified in self.gtf_file, filters for gene features, and creates a new GTF file containing only the gene entries. If the output file already exists, the method returns its path without reprocessing.

Returns:

Path to the compressed GTF file containing all genes

Return type:

str

Raises:
  • FileNotFoundError – If the reference GTF file (self.gtf_file) is not found

  • TypeError – If read_gtf does not return a pandas DataFrame

Note

The output file will be named based on the input GTF file with “-all_genes.gtf.gz” suffix
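The gene filter can be sketched with plain text processing. The documented method relies on read_gtf and a pandas DataFrame; the sketch below deliberately swaps that for a line-by-line scan so it needs only the standard library. extract_gene_rows is a hypothetical helper, and deriving the output name from the input file's stem is an assumption about how the suffix is applied.

```python
import gzip
from pathlib import Path


def extract_gene_rows(gtf_path: Path) -> Path:
    """Write the 'gene' rows of a GTF file to '<stem>-all_genes.gtf.gz'."""
    if not gtf_path.exists():
        raise FileNotFoundError(gtf_path)
    # 'annotation.gtf' -> 'annotation-all_genes.gtf.gz'
    out_path = gtf_path.with_name(gtf_path.stem + "-all_genes.gtf.gz")
    if out_path.exists():
        # Reuse the earlier result instead of reprocessing.
        return out_path
    with open(gtf_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line in src:
            if line.startswith("#"):
                continue  # skip header and comment lines
            fields = line.split("\t")
            # Column 3 of a GTF record holds the feature type.
            if len(fields) >= 3 and fields[2] == "gene":
                dst.write(line)
    return out_path
```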

get_destination_folder() → Path

Determine the destination folder for downloads.

get_latest_release() → None

Determine the specific URL for fetching data.

unzip_latest() → str

Unzips the latest downloaded file and stores it as a GTF file.