Foreword
I am posting this on my blog because I spent a lot of effort on this email & there’s no reason for it to be buried inside a private mailing list.
if you, dear reader, are like most of my real-life friends and call biology “bi-lol-logy” — ignore this post, save your sanity, and come back to bioinformatics in a year. i think things will be much better then. in fact, i’m not even going to attempt to explain what’s going on here except to link to ga4gh: http://ga4gh.org/#/beacon.
otherwise… down the rabbit hole we go…
Hi all,
I don’t want to stall momentum, since I care much more that a Beacon v0.2 happens than that a particular Beacon v0.2 happens, but as an engineer I’d also hate to see us be too hasty and make poor design choices.
Unfortunately it’s possible to describe a single variant in multiple ways in VCF
Yep, that’s concisely the problem with the state of the art.
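To make the quoted problem concrete, here is a minimal sketch with a made-up toy reference. The helper `apply_variant` and the three records are hypothetical, not taken from any GA4GH code. It shows three different VCF-style records (1-based POS, REF, ALT) that all produce the same sequence, because the insertion falls inside a repeat.

```python
def apply_variant(reference: str, pos: int, ref: str, alt: str) -> str:
    """Apply one VCF-style record (1-based POS, REF, ALT) to a reference string."""
    start = pos - 1
    if reference[start:start + len(ref)] != ref:
        raise ValueError("REF does not match the reference at POS")
    return reference[:start] + alt + reference[start + len(ref):]


# Toy reference (an assumption for illustration only).
reference = "ATTATAGAGAG"

# Three distinct records, each inserting two bases somewhere in the GAGAG repeat.
records = [(6, "A", "AGA"), (7, "G", "GAG"), (8, "A", "AGA")]

# Collapse the results into a set: one element means all records are equivalent.
haplotypes = {apply_variant(reference, *record) for record in records}
print(haplotypes)
```

A Beacon that compares records field by field would treat these as three different variants, even though the resulting genome is identical.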
From my perspective, there are three conflicting use cases and we’re trying to smush them into one Beacon/Server/Variants API spec, which may or may not be advisable.
USE CASES
1. Simple
You may only query one position, limited to a precise string
- “Does AAG exist at position 1” –> implicitly asking, does an insertion of “AG” exist between positions 1 and 2 on the reference genome
- Vision is “painless way for organizations to visibly commit to wanting to share genetic data by adopting a single standard, i.e. GA4GH”
- Sidestep genetic data privacy and security issues by trading [usefulness to research] for [painless adoption]
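As a rough sketch of how little use case #1 asks of an implementer, assume a beacon is nothing more than a set of (position, string) pairs answering yes or no. The class name `SimpleBeacon` and its data are hypothetical.

```python
class SimpleBeacon:
    """Toy beacon: answers only 'does this exact string exist at this position?'"""

    def __init__(self, observed):
        # observed: iterable of (position, allele string) pairs seen in the data
        self._observed = set(observed)

    def query(self, position: int, allele: str) -> bool:
        # Exact match only; no normalization, no alignment, no sample details.
        return (position, allele) in self._observed


# Hypothetical data: some genome behind this beacon has "AAG" at position 1.
beacon = SimpleBeacon([(1, "AAG"), (7, "G")])
print(beacon.query(1, "AAG"))
print(beacon.query(1, "AG"))
```

The only thing that leaves the server is a boolean, which is the trade of usefulness for painless adoption described above.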
2. Current
VCF-based
- “Does an insertion of AG exist between reference coordinates 1 and 2”
- Vision is “people share useful data for researching functional impact using the current industry-standard, VCF” — definitely better than silo-ed no-sharing world, but
- Lamppost = current standards, which sort of support population/functional impact research if you try really hard
- Dark = hopefully the future, where it’s painless to query for things like frame-restoring indels
- …I hope this lamppost analogy makes sense outside the confines of my brain…
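The link between the #1 and #2 phrasings can be sketched as below. This assumes the simple query string is the reference base at that position followed by the inserted bases; the names `Insertion` and `as_insertion` and the toy reference are hypothetical.

```python
from typing import NamedTuple, Optional


class Insertion(NamedTuple):
    left: int      # reference coordinate before the insertion (1-based)
    right: int     # reference coordinate after the insertion
    inserted: str  # the bases that were inserted


def as_insertion(reference: str, position: int, allele: str) -> Optional[Insertion]:
    """Reinterpret 'does `allele` exist at `position`' as an explicit insertion."""
    ref_base = reference[position - 1]
    if len(allele) > 1 and allele[0] == ref_base:
        return Insertion(position, position + 1, allele[1:])
    return None  # not expressible as an insertion under this assumption


reference = "ATTATAGAGAG"  # toy reference, for illustration only
print(as_insertion(reference, 1, "AAG"))
```

Note the guesswork involved: a query like "AT" at position 1 could be an insertion of "T" or simply the first two reference bases, which is why the simple form only implicitly asks about an insertion.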
3. Future
population-based / reference-free
- “Does an insertion of AG between query coordinates 1 and 2 exist wherever the query ‘ATTATAGAGAG’ is best aligned on each genome in the population”
- query string ‘ATTATAGAGAG’ used to locate position on genome
- specific variant we’re looking for is AG, that is, we want to find genomes that say “AAGTTATAGAGAG” in the place where population-wide most genomes say “ATTATAGAGAG”
- Vision is “future-oriented standard for developer to implement toward / iteratively develop”
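A very rough sketch of the reference-free query follows. The big assumption is that exact substring matching stands in for "best aligned", which a real implementation would replace with a proper aligner. The function names and the two toy genomes are hypothetical.

```python
def insert_between(query: str, left: int, inserted: str) -> str:
    """Insert `inserted` between query coordinates `left` and `left + 1` (1-based)."""
    return query[:left] + inserted + query[left:]


def population_beacon(genomes, query: str, left: int, inserted: str) -> bool:
    """True if any genome carries the query with the insertion applied.

    Exact substring search is a stand-in for alignment (an assumption).
    """
    variant = insert_between(query, left, inserted)
    return any(variant in genome for genome in genomes)


# Toy population: the first genome matches the common sequence,
# the second carries the AG insertion.
genomes = ["CCATTATAGAGAGTT", "GGAAGTTATAGAGAGCC"]

print(insert_between("ATTATAGAGAG", 1, "AG"))
print(population_beacon(genomes, "ATTATAGAGAG", 1, "AG"))
```

No reference coordinates appear anywhere: the query string itself locates the site on each genome.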
IN MY OPINION
My gut feeling is #3 is beyond the scope of Beacon v0.2 and we should be clear that Beacon v0.2 is meant to support the #2 use case.
My personal opinion is that Beacon v0.2 should actually be a standardization of use case #1, but it seems like I’m in the minority (if anyone else cares about #1, please speak up!).
FURTHER NOTES
With respect to, “+1 for consistency with other GA4GH APIs” —
My concern is that currently the GA4GH APIs are very VCF-oriented, and VCF is very reference-oriented and not very population-scale-oriented [1]. On the other hand, Beacon is population-oriented (no sense in having a Beacon to query two genomes, since that doesn’t preserve anonymity at all).
My gut instinct is that the Variants API will move toward being population-oriented (reference-free). Consistency is very important; however, I think we should be cautious about moving toward consistency with the Variants API in its old state. In fact, it’s already starting to reflect this shift:
“graph”, in which all variation is associated with `Allele`s which may participate in `Variants` or be called on their own. The “graph” mode is to be preferred in new client and server implementations.
[1] People are spending months merging VCF-based datasets and then indexing them with Tabix and wormtable, then they have to reindex for something as simple as querying a subset of the population … oh, I could go on, but I hope I’m preaching to the choir here. If not, I’d much appreciate knowing where I’m incorrect if you’d care to explain. I’m certainly not an expert in bioinformatics.
THANKS
Thanks Mark Fiume for taking the lead and Stephen Keenan for organizing Beacon work.
CARBON COPY?
I think more lists (specifically ga4gh schema & ga4gh server) need to be included in this discussion, or we need an “Issues” for all of GA4GH, or something, because it’s getting very hard to keep tabs on Issues, some of which are closed, in three repositories at once. Or maybe I just need to “watch” and get email notifications on all three repos? How are people handling this crazy explosion of GA4GH work?
PUBLIC MAILING LIST
I also would note that I strongly prefer all ga4gh mailing lists be made public going forward. It’s really ridiculous to have people forward me emails from 3 different private mailing lists and link me to 10 issues on 3 repositories.
- https://github.com/ga4gh/schemas/issues/243
- https://github.com/ga4gh/server/issues/183
- https://github.com/ga4gh/schemas/issues/256
- https://github.com/ga4gh/schemas/pull/244#issuecomment-78610169
Although ga4gh-dwb-beacon is a private mailing list :/ I’m still emailing instead of opening a public Issue on Github because it keeps feeling like “my calls are dropping” and no one hears me…
other links
wow, the more i poke around on ga4gh github the more related links I see… here are some I need to read