Thursday, 11 August 2016

Database headaches for genetics data

Turns out designing a database for genetics data is quite hard because:

  1. Genetic data is very large, and variants are both common and sparse. In a whole genome (4M variants), each individual harbours thousands of variants which are only seen in that individual and close relatives (so very rare at population level). Conversely, many variants are very common, and this is often because what we think of as the alternative allele should actually be the reference allele. This could have been caused by a population effect: we started sequencing Europeans when we should have started with Africans (thank you, Craig Venter!).
  2. Genetic data is messy. I mentioned the problem of flipped alleles; well, there is worse. Some sites are multiallelic for single nucleotide variants, and if one starts to consider indels of variable length then these sites are nearly always multiallelic. Furthermore, if you are in a repetitive region, a variant can have more than one name (right normalisation seems to be the rule, but it is unclear what happens in practice). Also, sometimes at a multiallelic site the reference allele may not even be seen!
  3. We need to query data column-wise and row-wise. Besides the basic filtering of variants, there are many biological questions which require looking at large numbers of variants shared between individuals.
  4. The distribution of variants per gene is very skewed. There are a few genes, e.g. TTN, with a huge number of variants.
  5. In biology there are always exceptions to the rules, or rather anything is possible: one variant belonging to many genes, etc.
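
To make the naming problem in point 2 concrete, here is a minimal Python sketch of how a multiallelic site might be split into one record per alternative allele, with shared bases trimmed so that equivalent spellings collapse to the same key. The function names and the (chrom, pos, ref, alt) key are my own illustration, not an existing tool's API. It only trims; proper normalisation also has to shift indels within repeats, which needs the reference sequence and is not attempted here.

```python
from typing import List, Tuple

# Hypothetical key type for this sketch: (chrom, pos, ref, alt).
VariantKey = Tuple[str, int, str, str]


def trim_alleles(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Remove bases shared by ref and alt, keeping at least one base in each."""
    # Drop shared trailing bases first.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then drop shared leading bases, moving the position along.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def split_multiallelic(chrom: str, pos: int, ref: str,
                       alts: List[str]) -> List[VariantKey]:
    """Turn one multiallelic record into one trimmed key per alternative allele."""
    keys = []
    for alt in alts:
        new_pos, new_ref, new_alt = trim_alleles(pos, ref, alt)
        keys.append((chrom, new_pos, new_ref, new_alt))
    return keys


# A made-up site with a deletion and a SNV sharing the same record.
keys = split_multiallelic("1", 100, "CAG", ["CG", "TAG"])
for key in keys:
    print(key)
```

Even with this trimming, two callers can still report the same indel at different positions inside a repeat, so a key like this is only a partial answer.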

Anyway, all the above reasons make db design and implementation difficult:

  1. Primary key design. This is difficult when a variant has many synonyms.
  2. SQL-style databases are not good at storing large many-to-many relationships. Our exome database with 1.3M variants and 5k individuals has 144M individual-to-variant relationships!
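
A quick back-of-the-envelope check on those numbers, plus a toy sketch of what querying both row-wise and column-wise asks of a store. The counts are the ones quoted above; the GenotypeIndex class and its method names are purely hypothetical, an in-memory illustration rather than how our database is built.

```python
from collections import defaultdict

# Counts quoted in the text.
n_variants = 1_300_000
n_individuals = 5_000
n_links = 144_000_000

# Fraction of the full individual x variant matrix that is non-empty (about 2.2%).
density = n_links / (n_variants * n_individuals)
# Average number of variants carried by one individual (28,800).
links_per_individual = n_links / n_individuals
# Average number of carriers per variant (about 111).
links_per_variant = n_links / n_variants

print(f"density: {density:.2%}")
print(f"variants per individual: {links_per_individual:,.0f}")
print(f"carriers per variant: {links_per_variant:,.0f}")


class GenotypeIndex:
    """Toy many-to-many store, indexed in both directions (hypothetical)."""

    def __init__(self):
        self.by_individual = defaultdict(set)  # individual -> variants
        self.by_variant = defaultdict(set)     # variant -> individuals

    def add(self, individual, variant):
        self.by_individual[individual].add(variant)
        self.by_variant[variant].add(individual)

    def variants_of(self, individual):
        return set(self.by_individual.get(individual, set()))

    def carriers_of(self, variant):
        return set(self.by_variant.get(variant, set()))

    def shared_variants(self, individuals):
        """Variants carried by every one of the given individuals."""
        sets = [self.by_individual.get(i, set()) for i in individuals]
        return set.intersection(*sets) if sets else set()


index = GenotypeIndex()
index.add("ind1", "1:100:C:T")
index.add("ind2", "1:100:C:T")
index.add("ind2", "2:500:CA:C")
print(index.carriers_of("1:100:C:T"))
print(index.shared_variants(["ind1", "ind2"]))
```

Keeping both directions means storing every relationship twice, which is part of why the 144M figure hurts.
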
Solutions?

  1. Distributed databases which can be run on multiple nodes, like Cassandra, SOLR, ElasticSearch, BigTable.
  2. Specialised indexing of VCF with bgt or gqt.
The future:

We would like to be in a place where we do not need to worry about data formatting and the practicalities of indexing data.  Ideally, I want to give a VCF file to a program and get a queryable database of variants out, so I can focus on the analysis of my data rather than the formatting.

Statistical intuition

I'm often asked how to go about developing some statistical intuition.
Not being a proper statistician, I find this hard to answer because it is not intuitive to me.

Some non-intuitive statistical principles that everyone should know:


  • Winner's curse or regression to the mean
  • Multiple testing
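
Both principles are easy to see in a small simulation. The sketch below uses only the Python standard library; the effect size, standard error, number of tests and the seed are arbitrary values I picked for illustration, not results from any real study.

```python
import random
import statistics

rng = random.Random(2016)  # arbitrary seed, only for reproducibility

# Multiple testing: when nothing is going on, p-values are uniform on [0, 1],
# so a fixed threshold still lets through a predictable share of false hits.
n_tests = 10_000
alpha = 0.05
p_values = [rng.random() for _ in range(n_tests)]
naive_hits = sum(p < alpha for p in p_values)
bonferroni_hits = sum(p < alpha / n_tests for p in p_values)
print(f"hits at p < {alpha}: {naive_hits} out of {n_tests} null tests")
print(f"hits after Bonferroni correction: {bonferroni_hits}")

# Winner's curse: many noisy estimates of the same small true effect.
true_effect = 0.1
standard_error = 0.1
threshold = 1.96 * standard_error  # estimate must clear this to be "significant"
estimates = [rng.gauss(true_effect, standard_error) for _ in range(10_000)]
winners = [e for e in estimates if e > threshold]
mean_all = statistics.mean(estimates)
mean_winners = statistics.mean(winners)

# Regression to the mean: re-measure the winners with fresh noise.
replications = [rng.gauss(true_effect, standard_error) for _ in winners]
mean_replication = statistics.mean(replications)

print(f"true effect: {true_effect}")
print(f"mean of all estimates: {mean_all:.3f}")
print(f"mean of significant estimates: {mean_winners:.3f}")
print(f"mean when the winners are re-measured: {mean_replication:.3f}")
```

The selected estimates overshoot the true effect simply because they had to clear the threshold, and they fall back towards it on replication.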