I took these two screenshots for work recently. They show our evolving understanding of the complexity of the human genome at population scale, as reflected in the changes to the reference genome between hg19 and hg38.
hg19 / GRCh37: 2009
hg38 / GRCh38: 2013
(ignore the temp folder, not part of hg19)
Note: I make no claims as to the actual accuracy or completeness of these two directory listings; I haven’t taken a close look at them in a while. But they’re interesting from a big-picture perspective.
Note to self: continue uploading interesting pictures to imgur album, http://imgur.com/a/PKYyX. (reduce bandwidth consumption on orangenarwhals)
I am posting this on my blog because I spent a lot of effort on this email & there’s no reason for it to be buried inside a private mailing list.
if you, dear reader, are like most of my real-life friends and call biology “bi-lol-logy” — ignore this post, save your sanity, and come back to bioinformatics in a year. i think things will be much better then. in fact, i’m not even going to attempt to explain what’s going on here except to link to ga4gh: http://ga4gh.org/#/beacon.
otherwise… down the rabbit hole we go…
I don’t want to stall momentum, since I care much more that a Beacon v0.2 happens than that a particular Beacon v0.2 happens, but as an engineer I’d also hate to see us be too hasty and make poor design choices.
Unfortunately it’s possible to describe a single variant in multiple ways in VCF
Yep, that’s concisely the problem with the state of the art.
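To make the "multiple ways" problem concrete, here is a minimal sketch. It assumes a toy reference sequence and a hypothetical helper, `apply_variant`, that applies a single VCF-style record (1-based POS, REF, ALT) to that reference. Neither comes from any GA4GH spec or library; they are only for illustration. Records that look different on paper produce the same sequence:

```python
# Toy reference sequence (an assumption for illustration, not real genome data).
REFERENCE = "ATTATAGAGAG"


def apply_variant(reference: str, pos: int, ref: str, alt: str) -> str:
    """Apply one VCF-style record (1-based POS, REF, ALT) to a reference string."""
    start = pos - 1  # VCF positions are 1-based
    if reference[start:start + len(ref)] != ref:
        raise ValueError(f"REF {ref!r} does not match the reference at position {pos}")
    return reference[:start] + alt + reference[start + len(ref):]


# Two ways to write "insert AG between positions 1 and 2".
insertion_records = [(1, "A", "AAG"), (1, "AT", "AAGT")]

# Three ways to write "delete one AG from the GAGAG repeat".
deletion_records = [(5, "TAG", "T"), (7, "GAG", "G"), (9, "GAG", "G")]

# Collapse each group to the set of distinct sequences it produces.
insertion_haplotypes = {apply_variant(REFERENCE, *rec) for rec in insertion_records}
deletion_haplotypes = {apply_variant(REFERENCE, *rec) for rec in deletion_records}

print(insertion_haplotypes)
print(deletion_haplotypes)
```

Each set ends up with a single element, which is the point: a server that compares records field by field would treat these as different variants unless everything is normalized the same way first.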
From my perspective, there are three conflicting use cases and we’re trying to smush them into one Beacon/Server/Variants API spec, which may or may not be advisable.
You may only query for one position, limited to precise string
“Does AAG exist at position 1” –> implicitly asking, does an insertion of “AG” exist between positions 1 and 2 on the reference genome
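A rough sketch of what this exact-string query amounts to, assuming a hypothetical in-memory list of records and a made-up `exact_beacon` function (this is not the Beacon API, just the matching rule described above):

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    """A stripped-down, hypothetical stand-in for one VCF row."""
    chrom: str
    pos: int  # 1-based position on the reference
    ref: str
    alt: str


def exact_beacon(records: List[Record], chrom: str, pos: int, allele: str) -> bool:
    """Answer yes only if some record has exactly this allele string at this position."""
    return any(
        r.chrom == chrom and r.pos == pos and r.alt == allele
        for r in records
    )


# The same insertion of AG after position 1, stored in two different ways.
dataset_a = [Record("1", 1, "A", "AAG")]
dataset_b = [Record("1", 1, "AT", "AAGT")]

answer_a = exact_beacon(dataset_a, "1", 1, "AAG")
answer_b = exact_beacon(dataset_b, "1", 1, "AAG")
print(answer_a, answer_b)
```

Both datasets contain the insertion, but only the first answers yes to "Does AAG exist at position 1". The query is easy to implement, and its answer depends on how the data owner happened to write the variant down.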
…I hope this lamppost analogy makes sense outside the confines of my brain…
population-based / reference-free
“Does an insertion of AG between query coordinates 1 and 2 exist wherever the query ‘ATTATAGAGAG’ is best aligned on each genome in the population”
query string ‘ATTATAGAGAG’ used to locate position on genome
specific variant we’re looking for is AG, that is, we want to find genomes that say “AAGTTATAGAGAG” in the place where population-wide most genomes say “ATTATAGAGAG”
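Here is a very simplified sketch of that reference-free query. The big assumption: exact substring search stands in for "best aligned", which a real implementation would do with an actual aligner. The genomes and the function names (`insert_between`, `population_beacon`) are hypothetical, made up for illustration:

```python
from typing import Dict


def insert_between(query: str, left_coord: int, inserted: str) -> str:
    """Insert a string between 1-based query coordinates left_coord and left_coord + 1."""
    return query[:left_coord] + inserted + query[left_coord:]


def population_beacon(genomes: Dict[str, str], query: str,
                      left_coord: int, inserted: str) -> bool:
    """Answer yes if any genome carries the query context with the insertion in it.

    Assumption: exact substring matching replaces real alignment.
    """
    target = insert_between(query, left_coord, inserted)
    return any(target in sequence for sequence in genomes.values())


QUERY = "ATTATAGAGAG"

# Toy population: sample_a matches the query as-is, sample_b carries the AG insertion.
genomes = {
    "sample_a": "CCATTATAGAGAGCC",
    "sample_b": "GGGAAGTTATAGAGAGTT",
}

variant_sequence = insert_between(QUERY, 1, "AG")
found = population_beacon(genomes, QUERY, 1, "AG")
print(variant_sequence, found)
```

Note that no reference coordinate appears anywhere in the query. The query string itself does the job of locating the spot on each genome, so the answer does not depend on how anyone chose to write the variant against a reference.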
Vision is “future-oriented standard for developer to implement toward / iteratively develop”
IN MY OPINION
My gut feeling is #3 is beyond the scope of Beacon v0.2 and we should be clear that Beacon v0.2 is meant to support the #2 use case.
My personal opinion is that Beacon v0.2 should actually be a standardization of use case #1, but it seems like I’m in the minority (if anyone else cares about #1, please speak up!).
With respect to, “+1 for consistency with other GA4GH APIs” —
My concern is that currently the GA4GH APIs are very VCF-oriented, and VCF is very reference-oriented and not very population-scale-oriented. On the other hand, Beacon is population-oriented (no sense in having a Beacon to query two genomes, that doesn’t preserve anonymity at all).
My gut instinct is that the Variants API will move toward being population-oriented (reference-free). Consistency is very important; however, I think we should be cautious about moving toward consistency with the Variants API in its old state. In fact it’s already starting to reflect this shift:
“graph”, in which all variation is associated with `Allele`s which may participate in `Variant`s or be called on their own. The “graph” mode is to be preferred in new client and server implementations.
People are spending months merging VCF-based datasets and then indexing them with Tabix and wormtable, then they have to reindex for something as simple as querying a subset of the population … oh, I could go on but I hope I’m preaching to the choir here. If not, I’d much appreciate knowing where I’m incorrect if you’d care to explain. I’m certainly not an expert in bioinformatics.
Thanks Mark Fiume for taking the lead and Stephen Keenan for organizing Beacon work.
I think more lists (specifically ga4gh schema & ga4gh server) need to be included in this discussion, or we need an “Issues” for all of GA4GH, or something, because it’s getting very hard to keep tabs on Issues, some of which are closed, in three repositories at once. Or maybe I just need to “watch” and get email notifications on all three repos? How are people handling this crazy explosion of GA4GH work?
PUBLIC MAILING LIST
I also would note that I strongly prefer all ga4gh mailing lists be made public going forward. It’s really ridiculous to have people forward me emails from 3 different private mailing lists and link me to 10 issues on 3 repositories.