Saturday, 31 January 2015

The coming of age of humanity

Here is just a thought that has crossed my mind a few times. This might not be the best place to share it, but I'd be interested to hear any views on the matter.


The collective being which is humanity has learned a lot since its birth.
A giant with a hazy memory of its early days and a disturbing past, it still faces inner qualms.
It has grown to adulthood; its cells multiply and specialise, some faster than others, and it still holds vestiges of its early childhood.


Humanity, as an entity, is a bit like a person going through different stages of life.  In its infancy, humanity was uncoordinated, knowledge was lost, and many beliefs were held which were not founded on any sort of empirical evidence.
As humanity matured it went through a pubescent crisis.

I believe (albeit naively) that humanity, the collective consciousness that has evolved through the transmission of knowledge across generations by the spoken and written word, has come of age; that our tolerance, understanding, and scientific openness have reached a point where data can be put in the public space without concern for confidentiality and without fear of judgement or reprisal.
That we can become as open about our genetics and medical problems as about our thoughts, religion and sexual orientation.  That these things are no longer newsworthy.  If anything, genetics shows us that we are all exceptions: we all carry private mutations, minor alleles, strange intricacies in our DNA that distinguish us from everyone else.

What is interesting is not what makes us the same but what makes us different.

Group dynamics or group psychology

Democracy is founded on the "if everyone did this" thought exercise.

Statistics and Geometry

Correlation is a cosine: the cosine of the angle between X and Y once each has been centred on its mean.

http://www.johndcook.com/blog/2010/06/17/covariance-and-law-of-cosines/

r(X,Y) = cov(X,Y) / sqrt(var(X)*var(Y))
<X,Y> = ||X|| ||Y|| cos(theta)

cos(theta) = <X,Y> / (||X|| ||Y||) = cov(X,Y) / sqrt(var(X)*var(Y)) = r(X,Y)
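As a quick sanity check of the identity above, here is a minimal Python sketch. The data values and the function names pearson_r and cosine_of_centred are made up for illustration; the only assumption is that both vectors are centred on their means before the angle is taken.

```python
import math

def pearson_r(x, y):
    """r(X,Y) = cov(X,Y) / sqrt(var(X) * var(Y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    return cov / math.sqrt(var_x * var_y)

def cosine_of_centred(x, y):
    """cos(theta) = <X,Y> / (||X|| ||Y||) for the mean-centred vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [a - mx for a in x]
    yc = [b - my for b in y]
    dot = sum(a * b for a, b in zip(xc, yc))
    norm_x = math.sqrt(sum(a * a for a in xc))
    norm_y = math.sqrt(sum(b * b for b in yc))
    return dot / (norm_x * norm_y)

# Illustrative data only
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.0, 1.0, 5.0, 6.0, 12.0]
r = pearson_r(x, y)
c = cosine_of_centred(x, y)
```

The two numbers agree up to floating point error, because the 1/n factors in the covariance and the variances cancel.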

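With one row of X per variable and one column per observation, the sum-of-squares (and cross-products) matrix of X is X X'.

Matrix approach to regression.

The least-squares coefficients B then solve the normal equations:

X X' B = X Y

(In the more common one-row-per-observation convention this reads X'X B = X'Y.)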
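To make the normal equations concrete, here is a small Python sketch for a straight-line fit. It assumes the convention in which each row of X is a variable (an intercept row of ones, then x) and each column is an observation; fit_line and the data are hypothetical, and the 2 x 2 system is solved by Cramer's rule purely to keep the example dependency-free.

```python
def fit_line(x, y):
    """Solve X X' B = X Y for B = (intercept, slope)."""
    n = len(x)
    # Rows of X are variables, columns are observations
    X = [[1.0] * n, list(x)]
    # A = X X' is 2 x 2, b = X Y has length 2
    A = [[sum(u * v for u, v in zip(ri, rj)) for rj in X] for ri in X]
    b = [sum(u * v for u, v in zip(ri, y)) for ri in X]
    # Cramer's rule for the 2 x 2 system A B = b
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    intercept = (b[0] * A[1][1] - A[0][1] * b[1]) / det
    slope = (A[0][0] * b[1] - A[1][0] * b[0]) / det
    return intercept, slope

# Illustrative data lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
```

For real problems one would hand the system to a linear algebra library rather than invert it by hand; the point here is only that the fit reduces to one small linear system.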


Avoiding repetition in science writing

Although English is possibly the richest language in terms of the number of words, it may not be the richest in terms of structure or conjunctive terms (terms used to string sentences together to build logical arguments).  These are especially useful in science, where we build logical arguments and marry many threads of seemingly contradictory evidence.

Words like "however" and "since", which introduce nuance or causation, get used a lot in English, and even more so in scientific writing.

To avoid repetition, I've regularly looked up synonyms for these words.
I've now done this enough times that I thought I might blog about it.

Formal synonyms of suggest:

The hybrid method we advocate leverages the information available from
targeted qPCR assays ...

The hybrid method we present leverages the information available from
targeted qPCR assays ...

Formal synonyms of since:

The differential bias between cases and controls is very likely to be the result of a batch effect since the case and control DNA samples were prepared and processed in two different centers.

The differential bias between cases and controls is very likely to be the result of a batch effect as the case and control DNA samples were prepared and processed in two different centers.

The differential bias between cases and controls is very likely to be the result of a batch effect given that the case and control DNA samples were prepared and processed in two different centers.

The differential bias between cases and controls is very likely to be the result of a batch effect considering that the case and control DNA samples were prepared and processed in two different centers.

Formal synonyms of have implications:

bearing on


Formal synonyms of however:

We believe that those genes are unlikely to show association in the sample sizes currently available, in light of copy numbers greater than 2 being rare for all KIR genes.
However, given that LD only accounts for presence/absence and not for copy number (i.e. the LD pattern between 0 and 1 is the same as that between 0 and 2), the reviewer is right to remark that these genes cannot be definitively excluded based on the LD frequencies obtained from the Allele Frequency Net database.

We believe that those genes are unlikely to show association in the sample sizes currently available, in light of copy numbers greater than 2 being rare for all KIR genes.
Nonetheless, given that LD only accounts for presence/absence and not for copy number (i.e. the LD pattern between 0 and 1 is the same as that between 0 and 2), the reviewer is right to remark that these genes cannot be definitively excluded based on the LD frequencies obtained from the Allele Frequency Net database.

We believe that those genes are unlikely to show association in the sample sizes currently available, in light of copy numbers greater than 2 being rare for all KIR genes.
On the other hand, since LD only accounts for presence/absence and not for copy number (i.e. the LD pattern between 0 and 1 is the same as that between 0 and 2), the reviewer is right to remark that these genes cannot be definitively excluded based on the LD frequencies obtained from the Allele Frequency Net database.

Formal synonyms of due to:

This one is a bit tougher...

Thank you for noticing this omission, which is due to a typographical mistake.

Thank you for noticing this omission, which is because of a typographical mistake.

Thank you for noticing this omission, which is the result of a typographical mistake.



Population theory: within individual, between individual and between population variation

The underlying idea in clustering is to find a labelling of the data (clusters) which minimises the ratio of within to between cluster variation.
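To illustrate the within-to-between ratio, here is a minimal Python sketch on one-dimensional data. The values, the two candidate labellings and the function name within_between_ratio are invented for the example; real clustering criteria differ in the details, so this is only one simple way of scoring a labelling.

```python
def within_between_ratio(values, labels):
    """Within-cluster sum of squares divided by between-cluster sum of squares."""
    grand_mean = sum(values) / len(values)
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    within = 0.0
    between = 0.0
    for members in groups.values():
        m = sum(members) / len(members)
        within += sum((v - m) ** 2 for v in members)
        between += len(members) * (m - grand_mean) ** 2
    # Undefined (division by zero) if every cluster mean equals the grand mean
    return within / between

# Illustrative data: two obvious groups around 1 and 5
values = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
good = within_between_ratio(values, ["a", "a", "a", "b", "b", "b"])
bad = within_between_ratio(values, ["a", "b", "a", "b", "a", "b"])
```

The labelling that follows the two groups gives a ratio close to zero, while the alternating labelling gives a ratio well above one.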


Some parallels may be drawn between analysing populations of cells within one individual and analysing populations of individuals.


Obviously there are clear differences:
In the case of cells the difference is at the expression level (mRNA and protein), whereas at the individual level the difference is at the sequence (DNA) level.
The time scales and dynamics are very different.  Gene expression variation is very flexible, whereas the mechanisms underlying DNA sequence variation are much better understood.


Some principles from analysing populations of cells are applicable to populations of individuals, although the data are longer rather than wider.


Both are snapshots of a dynamic system working on very different timescales.
In both scenarios we are trying to find latent classes/clusters which explain the variation.
Clustering methods are particularly relevant.
While density-based methods are more suited to data matrices with many rows, distance-based methods may be more suited when the number of columns is greater.
For example, the idea of a cell lineage is comparable to the idea of a lineage of individuals.
Population bottlenecks are as applicable to genetic diversity as to cellular diversity.


Identification of latent factors or discovery of events which influence the population dynamics.


Within these datasets I have identified, using various clustering approaches, populations of cells, and phenotypes of those populations, which correlate with genetic variation.
I have developed efficient clustering methods using Bayesian mixture models to identify rare clusters while accounting for large variation between samples.
I have also extended this mixture model approach to clustering genetic data, and this has led to the largest association study of genes of the KIR region with type 1 diabetes.


Individuals, like cells within an individual, can be considered independent.
There is proximal structure in the genetic code (LD) which can be exploited.


Statistical physics, Gaussian probability distribution, additivity

All of deterministic physics is emergent behaviour from quantum physics?
Law of large numbers?

The seemingly deterministic large scale is an emergent behaviour of the probabilistic small scale.

The trajectory of a single particle is probabilistic but the trajectory of the larger object is deterministic.
In the same way that, in a large crowd of people, the movement of a single individual is unpredictable while the movement of the crowd as an entity is much more predictable.

In weather prediction, we can predict large-scale weather patterns, say over a large area, but we cannot predict with much certainty whether it will rain in Cambridge tomorrow.

When observing very small amounts of data, our measurements are far more uncertain.

On the large scale things tend to be normally distributed, so the emergent behaviour of the system tends to be symmetric, since most small-scale forces cancel out to give rise to large-scale equilibrium.
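The crowd analogy can be sketched with a toy simulation in Python. Each "particle" takes a random step of -1 or +1, and we look at how much the average step of the whole population varies between repeats. The step model, the population sizes and the function name spread_of_mean are all assumptions made for illustration, not a physical model.

```python
import random
import statistics

def spread_of_mean(n_particles, n_repeats, rng):
    """Standard deviation, across repeats, of the population's mean step."""
    means = []
    for _ in range(n_repeats):
        # Every individual step is a fair coin flip: entirely unpredictable
        steps = [rng.choice((-1.0, 1.0)) for _ in range(n_particles)]
        means.append(sum(steps) / n_particles)
    return statistics.pstdev(means)

rng = random.Random(0)  # fixed seed so that the run is repeatable
small = spread_of_mean(10, 500, rng)
large = spread_of_mean(1000, 500, rng)
```

The spread of the mean shrinks roughly as one over the square root of the population size, so the population of 1000 is about ten times more predictable than the population of 10, even though every individual is just as random.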

Does non-linear imply non-additive?
Does non-linear imply interaction effects?  I.e. non-marginal effects.
Marginal effects can be detected by summing/integrating over all latent variables.
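A toy Python sketch of summing over a latent variable, assuming a binary exposure x, a binary latent variable z with known probabilities, and two made-up outcome functions (one additive, one pure interaction). It is only meant to show that a pure interaction can have no marginal effect at all.

```python
def marginal_effect(outcome, z_probs):
    """E[Y | x=1] - E[Y | x=0], with the latent variable z summed out."""
    def expected(x):
        return sum(p * outcome(x, z) for z, p in z_probs.items())
    return expected(1) - expected(0)

# Hypothetical latent variable: z is 0 or 1 with equal probability
z_probs = {0: 0.5, 1: 0.5}

def additive(x, z):
    # x and z contribute independently
    return 2.0 * x + 3.0 * z

def xor_like(x, z):
    # Pure interaction: the effect of x flips sign with z
    return float(x != z)

additive_effect = marginal_effect(additive, z_probs)
interaction_effect = marginal_effect(xor_like, z_probs)
```

In the additive case the marginal effect of x is recovered after summing over z, whereas in the XOR-like case the marginal effect is exactly zero although x clearly matters once z is known.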

Biologists and statisticians: what can we learn from each other?

Disclaimer: All the views here are merely based on observations and the author's own experience.

Biology: learning how systems work by breaking them.

What can we learn from biologists?

Ask important and fundamental questions about the real world.
For example why do vaccinations work?
Come up with interesting and innovative ideas and experiments.
They tend to be more in the real world (although that's debatable).
All data is not equal, some datasets are worse quality than others.
We statisticians have metrics like the CV to measure the signal-to-noise ratio, but the problem is that we don't really understand what the signal is in the first place.
We also only know about what is actually measured in the experiment and not about confounders.

What can biologists learn from us?

Rigour.  Naming conventions and standards are important dammit.
Being organised!  Store data preferably in text files so that it is machine readable, easily searchable without proprietary software, etc.
Efficiency!
Objectivity: not letting our beliefs cloud our judgement or influence our analyses and experiments.
This goes for both of us, but "don't let data get in the way of a good story" is not science.
All datasets are important.  Small datasets do contain useful information which can sometimes be exploited when joined on other datasets.

Statistical common sense often not shared with biologists:

Winner's curse or regression to the mean

Multiple testing

Simpson's paradox
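Of the three items above, Simpson's paradox is perhaps the easiest to show with numbers. The Python sketch below uses entirely hypothetical counts for two treatments, A and B, in two strata of patients; the strata names and the counts are invented to make the reversal obvious.

```python
def rate(successes, total):
    return successes / total

# Hypothetical counts per stratum: treatment -> (successes, total)
counts = {
    "mild":   {"A": (8, 10),   "B": (70, 100)},
    "severe": {"A": (30, 100), "B": (2, 10)},
}

# Within every stratum, A has the higher success rate
a_better_in_every_stratum = all(
    rate(*group["A"]) > rate(*group["B"]) for group in counts.values()
)

def pooled(treatment):
    """Success rate after pooling the strata together."""
    successes = sum(group[treatment][0] for group in counts.values())
    total = sum(group[treatment][1] for group in counts.values())
    return rate(successes, total)

pooled_a = pooled("A")  # 38 / 110
pooled_b = pooled("B")  # 72 / 110
```

A wins in each stratum yet loses once the strata are pooled, simply because A was mostly given to the severe cases and B to the mild ones.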

Optimality vs Consistency?


Is it better to consistently overshoot or to be inconsistent?
It depends on whether the data moves, no?


Bias vs variance?

Finding reproducible cluster partitions for the k-means algorithm

K-means clustering is widely used for exploratory data analysis. While its dependence on initialisation is well-known, it is common practice to assume that the partition with lowest sum-of-squares (SSQ) total i.e. within cluster variance, is both reproducible under repeated initialisations and also the closest that k-means can provide to true structure, when applied to synthetic data. We show that this is generally the case for small numbers of clusters, but for values of k that are still of theoretical and practical interest, similar values of SSQ can correspond to markedly different cluster partitions.
This paper extends stability measures previously presented in the context of finding optimal values of cluster number, into a component of a 2-d map of the local minima found by the k-means algorithm, from which not only can values of k be identified for further analysis but, more importantly, it is made clear whether the best SSQ is a suitable solution or whether obtaining a consistently good partition requires further application of the stability index. The proposed method is illustrated by application to five synthetic datasets replicating a real world breast cancer dataset with varying data density, and a large bioinformatics dataset.
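The question raised in this abstract (do restarts with similar SSQ give the same partition?) can be explored with a rough Python sketch. This is not the stability index or the 2-d map proposed in the paper. It is a plain one-dimensional k-means restarted from several seeds, with partitions compared by the Rand index, and the synthetic data, the number of restarts and the function names are all assumptions.

```python
import random
from itertools import combinations

def kmeans_1d(data, k, rng, n_iter=50):
    """Plain Lloyd iterations on 1-d data; returns (labels, SSQ)."""
    centres = rng.sample(data, k)
    labels = [0] * len(data)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: (x - centres[j]) ** 2) for x in data]
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centres[j] = sum(members) / len(members)
    ssq = sum((x - centres[lab]) ** 2 for x, lab in zip(data, labels))
    return labels, ssq

def rand_index(a, b):
    """Fraction of point pairs on which two partitions agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# Synthetic data: three groups of 20 points around 0, 5 and 10
data_rng = random.Random(1)
data = [data_rng.gauss(centre, 0.3) for centre in (0.0, 5.0, 10.0) for _ in range(20)]

# Twenty restarts, each from a different random initialisation
runs = [kmeans_1d(data, 3, random.Random(seed)) for seed in range(20)]
best_labels, best_ssq = min(runs, key=lambda run: run[1])
agreement = [rand_index(best_labels, labels) for labels, _ in runs]
```

Printing the SSQ of each run next to its agreement with the best run gives a crude picture of the local minima: restarts that land in a different minimum show up with both a higher SSQ and a lower Rand index.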

Circular arguments in biology: Ouroboros



Once bias is introduced, we enter the dangerous realm of self-fulfilling prophecies.

Vicious circle

Winner's curse.
Prior knowledge in the form of pathway databases. For example we cannot distinguish between missing data and false negatives. If something is not there does it mean it has not been found yet or does it truly not exist?

Posterior feedback into priors

In GWAS, associated SNPs are preferentially annotated (Mike Keale).  Does the annotation add any value?

Minor allele frequencies obtained from previous studies guide the clustering of genotypes in current studies. Will this online updating of probabilities eventually converge to the true MAF, or are we reinforcing wrong frequency estimates by letting them influence how new data are interpreted?
The posterior should become the new prior.

Double counting
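A minimal Python sketch of "the posterior becomes the new prior" and of double counting, assuming a Beta prior on the minor allele frequency and binomial allele counts. The study counts are hypothetical, and the sketch deliberately ignores the harder feedback problem raised above, where the prior also shapes how the new genotypes are called.

```python
def update(prior, minor_count, total_count):
    """Beta(a, b) prior plus binomial allele counts gives a Beta posterior."""
    a, b = prior
    return (a + minor_count, b + total_count - minor_count)

def posterior_mean(beta):
    a, b = beta
    return a / (a + b)

flat = (1.0, 1.0)   # uniform prior on the MAF
study1 = (12, 100)  # hypothetical: 12 minor alleles in 100 chromosomes
study2 = (30, 200)  # hypothetical: 30 minor alleles in 200 chromosomes

# Posterior of study 1 used as the prior for study 2
sequential = update(update(flat, *study1), *study2)
# Same data analysed in one go
pooled = update(flat, study1[0] + study2[0], study1[1] + study2[1])
# Study 1 mistakenly fed in twice
once = update(flat, *study1)
double_counted = update(once, *study1)
```

Sequential and pooled updating give the same posterior, which is the reassuring part. Double counting barely moves the estimate of the MAF but roughly doubles a + b, so the posterior looks about twice as well supported as the data justify.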


What I hate about R


Greatly inspired by:

http://www.burns-stat.com/pages/Tutor/R_inferno.pdf

R is not the best designed or most intuitive programming language in the world, and it's very easy to make mistakes.
It is a somewhat organic language which has grown out of need, and some of the syntax is Perlish.


1) Namespaces!  Accidental variable reuse or function masking (one package silently overriding a function of the same name from another) is a very easy mistake to make in R.  I try to prepend package names as often as I can to specify the actual function I need (e.g. stats::filter).

2) Factors and strings.  Enough said.

3) Welcome to indexing hell.
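Sometimes things work too well. Imagine index = c(0,1,...) is a vector of 0s and 1s of size n and I do this:
x[index]
then I just get element 1 over and over (the 0s are silently dropped and every 1 selects the first element).
What I wanted to do was:
x[as.logical(index)]
Sometimes things don't work well enough.  A matrix indexed down to a single row or column is no longer a matrix!
This means that special cases need to be written when accessing a single row or column, or drop=FALSE has to be passed every time (fml).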


Why did I choose science?

It took me 3 years to realise how stupid I am (a PhD).
I did some great things (at my scale) but still feel unfulfilled on my academic side.
I now appreciate how long good research can take and that my time is precious.
I have rediscovered the importance of long periods of uninterrupted concentration.
My obsession with productivity has often made me unproductive: I often have to remind myself that real science is done with a pen and paper and by following ideas to the end.
Part of the problem is that computers are a big toy.

I got a motivational message from my Dad yesterday (31/05/2014).
My Dad is a particularly passionate man which makes him a great motivator:

"
On Francis Crick
It was like sitting next to an intellectual nuclear reactor. I never had a feeling of such incandescence.

These are the words of Oliver Sacks, a famous American psychologist and writer (author of Awakenings that was made a film with de Niro in one of his best performances), on first meeting Francis Crick. The article, from the New York Review of Books, written by Sacks, goes on to describe the immense impression the fertile mind of Crick made to Sacks.
Crick showed an interest on colour-blindedness and how the sense of colour is generated in the retina. Sacks had sent him his paper on a colour-blinded patient of his and Crick responded with a five page letter, saying that some of his ideas were wild speculation. 
Sacks: The letter seemed to get deeper and more suggestive every time we read it, and we got the sense that it would need a decade or more of work, by a dedicated team of psychophysicists, neuroscientists, brain imagers and others to follow up on the torrent of suggestions Crick had made.

I am writing this to you, my son, because I realize upon reading such articles, how only science can focus a man's mind and give real meaning to his days. Literature and the arts can easily lead one astray, sometimes into paths that are very difficult to tread. The driving presence of focus, the incandescent idea that dominates a scientists mind, can prove excessively taxing, but at the end of the day, his work and effort have not gone to waste. Someone, humanity, will benefit from them.
Consider yourself privileged, my son, that you are into science, and although at times you are disheartened by the sluggishness of things and the hardheadedness of colleagues, the end result will carry part of your signature, and that is never lost.
Have a good, productive day, my Nikolas

Papa
"

Important things to know about human chromosomes

Human chromosomes are numbered approximately from largest to smallest (chromosome 21 is in fact slightly smaller than chromosome 22).
As well as having different sizes, they have different shapes too, with some having proportionally shorter or longer p and q arms and differently placed centromeres.

http://book.bionumbers.org/