The underlying idea in clustering is to find a labelling of the data (clusters) which minimises the ratio of within-cluster to between-cluster variation.
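As a rough illustration of that ratio, the sketch below computes the within-cluster sum of squares divided by the between-cluster sum of squares for a given labelling. The function name `within_between_ratio` and the toy points are made up for this example, and sum of squares is only one of several possible measures of "variation".

```python
from statistics import fmean

def within_between_ratio(points, labels):
    """Within-cluster sum of squares divided by between-cluster sum of squares."""
    dims = range(len(points[0]))
    # Centre of the whole data set
    grand = [fmean(p[d] for p in points) for d in dims]
    # Group the points by their cluster label
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    within = 0.0
    between = 0.0
    for members in clusters.values():
        centre = [fmean(p[d] for p in members) for d in dims]
        # Spread of the members around their own cluster centre
        within += sum((p[d] - centre[d]) ** 2 for p in members for d in dims)
        # Spread of the cluster centre around the grand centre, weighted by size
        between += len(members) * sum((centre[d] - grand[d]) ** 2 for d in dims)
    return within / between

# Toy data: two obvious groups, one near (0, 0) and one near (10, 10)
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = within_between_ratio(points, [0, 0, 0, 1, 1, 1])
bad = within_between_ratio(points, [0, 1, 0, 1, 0, 1])
print(good < bad)
```

A labelling that matches the two groups gives a much smaller ratio than one that mixes them, which is the sense in which a clustering method "prefers" the first labelling.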
Some parallels may be drawn between analysing populations of cells within one individual and analysing populations of individuals.
Obviously there are clear differences:
In the case of cells the difference is at the expression level (mRNA and protein), whereas between individuals the difference is at the sequence (DNA) level.
The time scales and dynamics are very different. Gene expression variation is very flexible, whereas the mechanisms underlying DNA sequence variation are much better understood.
Some principles from analysing populations of cells are applicable to populations of individuals, although the data is longer instead of wider.
Both are snapshots of dynamic systems, although these work on very different timescales.
In both scenarios we are trying to find latent classes/clusters which explain the variation.
Clustering methods are particularly relevant.
While density-based methods are more suited to data matrices with many rows, distance-based methods may be more suited when the number of columns is greater.
For example, the idea of a cell lineage is comparable to the lineage of individuals within a population.
Population bottlenecks are as applicable to genetic diversity as to cellular diversity.
In both cases the aim is to identify latent factors or discover events which influence the population dynamics.
Within these datasets I have used various clustering approaches to identify populations of cells, and phenotypes of those populations which correlate with genetic variation.
I have developed efficient clustering methods using Bayesian mixture models to identify rare clusters and to account for large variation between samples.
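To give a feel for how a mixture model can pick out a rare cluster, here is a minimal sketch that fits a two-component one-dimensional Gaussian mixture by plain maximum-likelihood EM. This is a deliberately simpler stand-in, not the Bayesian methods described above: there are no priors and no handling of between-sample variation. The function names, the simulated data (950 points around 0 and 50 points around 8) and the fixed number of iterations are all assumptions made for the example.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_component_mixture(data, n_iter=100):
    """Fit a two-component 1D Gaussian mixture by EM (maximum likelihood)."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    # Crude initialisation: one component at each extreme of the data
    means = [min(data), max(data)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # guard against a collapsing component
    return weights, means, variances

# Simulated data: a common population and a rare one making up 5% of the points
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(950)]
        + [random.gauss(8.0, 1.0) for _ in range(50)])
weights, means, variances = fit_two_component_mixture(data)
print(weights, means)
```

The second component should end up with a small mixing weight and a mean near the rare population. A Bayesian treatment would additionally place priors on the weights and component parameters, which is what helps when clusters are very rare or samples differ a lot.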
I have also extended this mixture model approach to clustering genetic data, and this has led to the largest association study of genes of the KIR region with type 1 diabetes.
Individuals, like cells within an individual, can be considered independent.
There is proximal structure in the genetic code (linkage disequilibrium, LD) which can be exploited.
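As a small worked example of what LD measures, the sketch below computes the standard r² statistic between two biallelic loci from phased haplotypes, using D = p_AB - p_A p_B and r² = D² / (p_A (1 - p_A) p_B (1 - p_B)). The function name `ld_r2`, the 0/1 allele coding and the toy haplotypes are assumptions for illustration only.

```python
def ld_r2(haplotypes):
    """r^2 between two biallelic loci; each haplotype is an (a, b) pair of 0/1 alleles."""
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n    # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n    # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                       # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom

# Alleles always travel together: complete LD
linked = [(1, 1)] * 5 + [(0, 0)] * 5
# Every allele combination equally common: no LD
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)] * 3
print(ld_r2(linked), ld_r2(unlinked))
```

When r² between nearby loci is high, one locus carries much of the information in the other, which is the kind of proximal structure a clustering or imputation method can exploit.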