The power law is as universal as the exponential function; you might have heard of it as the 90-percent-10-percent rule. Maybe a better name for it would be the negative exponential law.
An exponential decay process
This pattern arises naturally from count data.
The first time I heard of it was in the context of software development, where 10% of the code takes up 90% of the time to write. This tends to be the case because programming bugs can be very difficult to discover, and when they are found the error usually originates from a single line of code.
However, I am now seeing a similar sort of pattern emerge in the context of biological diversity.
For example, if we consider the cells in our blood, the great majority (maybe up to 99%) of them are, as you would expect, red, whereas a small percentage of them are white.
However, there is much greater polymorphism (a fancy way of saying diversity) in the minority of white blood cells than there is in the red blood cells. Red blood cells pretty much all do the same thing, whereas white blood cells, which are part of the immune system, are incredibly functionally diverse.
Another example from biology comes from looking at polymorphic regions of DNA across populations. If we consider a region across different people, the majority of people will probably have similar sequences, but a small proportion will account for most of the diversity.
This is why biology is getting more and more complicated: we are constantly opening small Pandora's boxes of diversity as we dig deeper and deeper. New discoveries are always by definition rare. This is something that might be obvious to some of you but wasn't to me.
The logic here is that if they were not rare then we would have found them earlier.
Either there is some intrinsic property that makes rare events heterogeneous, because they originate from diverse sources, or maybe it's simply because patterns only emerge once you have a sufficient sample size.
And that is, I believe, a key insight:
Rare things will always seem more heterogeneous because patterns only emerge once you have a sufficient sample size. As an example, if you generate numbers from a known distribution, say the normal distribution, you need a sufficient sample size before you can ascertain with any degree of certainty what the nature of the distribution is (until then you will have competing models in your mind; think Bayesian). Also, groups can only be formed once you have seen something substantially different.
For example, if I were to place points randomly on a plane you might not see any pattern. But say I placed a point much further away from the center of the group of points: suddenly, because I have increased the scale (zoomed out), you would tend to want to cluster those original points, and the new point might be classified as an outlier from the existing random generation process or become a cluster of its own.
An example would be:
Say I were to generate random numbers from a normal distribution with a mean of 0 and a standard deviation of 1. Only once a sufficient number of points had been generated would you be able to assert whether there was any relationship between these points. A more visual example: say you had customers coming to sit at invisible tables in a restaurant. Only once you had observed enough customers would you have any idea of the number, shape and capacity of these tables. In this particular context you would probably want to make some assumptions about the tables.
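To put a rough number on "sufficient sample size", here is a minimal sketch. It draws repeated samples from the same normal distribution (mean 0, standard deviation 1) and measures how much the sample mean wobbles from one sample to the next. The function name `estimate_spread` and the choice of 200 repeats are my own assumptions for illustration.

```python
import random
import statistics

def estimate_spread(sample_size, repeats=200, seed=0):
    """Standard deviation of the sample mean across repeated samples.

    Each sample holds `sample_size` draws from a normal(0, 1) distribution.
    A large value means two observers with samples of this size could
    easily disagree about where the distribution is centred.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

# The wobble shrinks roughly like 1 / sqrt(sample_size).
for n in (5, 50, 500):
    print(n, round(estimate_spread(n), 3))
```

With only a handful of points the estimates are all over the place, which is one way of seeing why small (rare) groups look heterogeneous.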
In networks, the edge count (degree) distribution in scale-free networks follows a power law, which means that there are a few hubs of high connectivity and many nodes of low connectivity. The hubs can be thought of as clusters.
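One common way to get such a network is preferential attachment, where new nodes prefer to link to already well-connected nodes. The sketch below is a simplified version in which each new node adds a single edge. This growth rule is an assumption on my part, not something the examples above depend on, and the function name `preferential_attachment` is purely illustrative.

```python
import random
from collections import Counter

def preferential_attachment(n_nodes, seed=0):
    """Grow a network one node at a time and return each node's degree.

    Every new node links to one existing node, chosen with probability
    proportional to that node's current degree.
    """
    rng = random.Random(seed)
    # Start from a single edge between nodes 0 and 1.
    # `endpoints` lists both ends of every edge, so a node of degree k
    # appears k times and a uniform draw is degree-proportional.
    endpoints = [0, 1]
    for new_node in range(2, n_nodes):
        target = rng.choice(endpoints)
        endpoints.extend((new_node, target))
    return Counter(endpoints)  # node -> degree

degrees = preferential_attachment(10_000)
degree_counts = Counter(degrees.values())  # degree -> number of nodes

print("nodes with a single edge:", degree_counts[1])
print("largest hub degree:", max(degrees.values()))
```

Most nodes end up with a single edge while a handful of early nodes become hubs, which is the few-hubs-many-leaves shape described above.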
The distribution of wealth follows a negative exponential distribution, or at least it used to; it is now becoming more multimodal.
The distribution of city size follows a power law.
The tail of the distribution will get longer as more cities pop up.
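To make the lengthening tail concrete, here is a small sketch. It assumes city sizes are independent draws from a Pareto distribution (a standard power-law model). The shape parameter `alpha=1.0` and the function name `largest_city_so_far` are illustrative choices of mine and are not fitted to any real data.

```python
import random

def largest_city_so_far(n_cities, alpha=1.0, seed=0):
    """Running maximum of simulated city sizes as cities are added.

    Sizes are Pareto draws with shape `alpha`, in arbitrary units >= 1
    (an assumption made for illustration only).
    """
    rng = random.Random(seed)
    largest = 0.0
    running_max = []
    for _ in range(n_cities):
        largest = max(largest, rng.paretovariate(alpha))
        running_max.append(largest)
    return running_max

running_max = largest_city_so_far(10_000)
for n in (10, 100, 1_000, 10_000):
    print(n, round(running_max[n - 1], 1))
```

The largest size seen can only stay put or grow as more cities appear, and under a heavy-tailed model it tends to keep jumping by large amounts.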
Mark McCarthy said at GCD2013: "taken individually variants are rare but genetic variation is common". This is perhaps a bit of a tautology, but diversity implies that things are different in unique ways, meaning that taken individually they are rare, much like the long tail of the negative exponential function.
Another impact of new allele discoveries is that they decrease the overall frequency of older alleles.
This decreases the MAF and consequently the odds ratio since the MAF in the cases stays the same but the OR in the controls decreases.
What about the diversity of alleles in common genes vs the allele diversity in rarer genes?
For example, KIR3DL1 and KIR3DS1?
The Gini index is a measure of statistical dispersion.
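For reference, here is a minimal sketch of how the Gini index can be computed from a list of non-negative values (incomes, allele counts, city sizes and so on). It uses the standard rank-weighted formula. The function name `gini` and the toy inputs are mine.

```python
def gini(values):
    """Gini index of non-negative values.

    0 means everyone has the same amount; values approaching 1 mean
    nearly everything is held by a single item.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a positive total")
    # Rank-weighted formula with ranks starting at 1:
    # G = 2 * sum(rank_i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

print(gini([1, 1, 1, 1]))  # perfectly even
print(gini([0, 0, 0, 1]))  # one item holds everything
```

For a finite list the maximum is (n - 1) / n rather than exactly 1, so the second case tops out at 0.75 with four items.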
And yes you will probably spend 90% of your time looking at the rarest 10%...
Stumpf, M. P. H., & Porter, M. A. (2012). Critical Truths About Power Laws. Science, 335(6069), 665–666. doi:10.1126/science.1216142
Where λ is the exponent of the power law.