FASTA files for BRCA1, BRCA2, SMA, KIR, MHC (bash scripting Entrez Direct)

I needed to create some FASTA files of important regions of the human genome for work. BRCA1 & 2 are important in breast cancer research, SMA is implicated in spinal muscular atrophy, and KIR and MHC are important for your immune system (e.g. why you have to have organ donor “compatibility”). I wrote a bash script that used Entrez Direct to automatically download these files from the NCBI servers.

FASTA files

If you just want the FASTA files to play with, they are here.

They were downloaded from the NCBI website and based on the NCBI Gene database coordinates against hg38.

source code

https://github.com/nouyang-curoverse/GA4GH_regions

This repository contains a link to download the final FASTA files, the ids.csv file I used as the master list of “mhc” and “kir” genes, the bash script file, and the xsl “xml transform” file I used to extract the information I needed from the xml file.

what i learned

Hey, what’s a gene anyway? If you use various databases (ensembl, ncbi, genenames, lumc, omim, lrg), you”ll get a whole range of coordinates for the same gene.

Take, for instance, BRCA2:

Start	End	Length	Source
32,889,611	32,974,403	84,792	ensembl
32,884,617	32,975,809	91,192	lrg
32,889,617	32,973,809	84,192	ncbi entrez
32,889,616	32,973,808	84,192	OMIM
32,890,598	32,972,907	82,309	COSMIC sanger
32,889,641	32,907,422	17,781	CGAP

In the end I standardized around using the NCBI Gene database and ignored the rest.

Hey, what’s the KIR gene region anyway? Turns out there’s a gazillion KIR genes, and there’s not exactly a “list” of them. Same for HLA. I just used my best human judgement and culled them from searches on NCBI Gene. From the git repo Readme:

For the HLA genes, I used the IMGT list of gene names, found at ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/ . For the KIR genes, I used NCBI Gene query to list all the KIR genes. http://www.ncbi.nlm.nih.gov/gene and "Homo sapiens"[porgn] AND KIR

The final list is in ids.csv.

The manual process was to download the FASTA file from the NCBI Gene database.

But for some genes, that is the KIR and HLA genes, each gene (e.g. just HLA-E) has many “alternate locii” versions.

Downloading these by hand would take days. Therefore, I needed to figure out a way to script this. My initial thought was to write a scraper using Google Sheet Scripts or a python library like Beautiful Soup. However, I thought this was dumb, because this NCBI Gene site is clearly the front-end to some database that hopefully had an API for programmatic access of some kind.

After a few days (seriously, this all took way longer than I thought it would) and with help from the biostars community I was able to figure out how to use Entrez Direct and write a bash script to automate the process for all the genes.

(It has a hacky fix, where the bash script needs to be run again for each line in the file,

for run in {1..48} #change this!

I was too lazy to debug it that day and just wanted to get this finally done).

The source code is provided here, and contains reasonable comments and spits out some debugging info when you run it.

That’s all.

Leave a comment if you have questions 🙂

appendix

Emailed from me to ga4gh-dwg :

With help from biostars*, I finished pulling out the KIR and HLA regions and fixed the SMA region.

The updated collection may be publicly viewed at
https://workbench.qr1hi.arvadosapi.com/collections/download/qr1hi-4zz18-7zk4muy5grnaqpv/4qji0cfumh25dttlwteo6rj2b83z2b8vz1l0rja3uzo82bf3s/

The updated collection README and scripts are at
https://github.com/nouyang-curoverse/GA4GH_regions (the FASTA files are named by gene name and ncbi gene id, e.g. BRCA1-672.fa)

Of note, I did not mirror the IMGT HLA contents (ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/), even though that was requested on the minutes from the DWG meeting, due to their policy https://www.ebi.ac.uk/ipd/imgt/hla/licence.html .

Feedback appreciated!

Thanks,
–Nancy

*credit to https://www.biostars.org/p/122680/ and https://www.biostars.org/p/122522/

orange narwhals

FASTA files for BRCA1, BRCA2, SMA, KIR, MHC (bash scripting Entrez Direct)

FASTA files

source code

what i learned

appendix

projects blog (nouyang)