Sunday, October 12, 2008

Statistics: Hype'n'fantastic

Statistics are everywhere, everyone is doing them, and there are a whole bunch of tools to help you produce them and then present them as pretty little charts. They give a report a certain standing, they are eminently quotable at meetings, and they stand as evidence of the work that has been put in - quite the stamp of research. However, do they mean anything?

Sadly, statistics have three essential problems:

- They represent only the data that has been included in the research or calculations
- People tend to believe them over the reality they describe
- The calculations themselves ignore certain kinds of data

Since statistics are most often applied to supersystems, some of the data you gather will not fit - if not today, then when someone tries to repeat the test. If you are not aware of this and tighten down the tolerances of the calculations to give more exact, repeatable figures, what you actually get is a set of answers that varies from week to week, from test to test. This might be a brilliant way of generating more reports, but it can make the results much harder to understand. And if you bring in a better-trained statistician to help refine the process, they will not understand the significance of the data they are seeing. Statistics are nothing more than a guide, and the best way to prepare results using them is to remember that you should be generating insights, not definite answers.
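As a rough illustration of that week-to-week wobble (the numbers and the `noisy_measurement` function below are invented for the sketch), here is what happens when you quote an over-precise figure from a noisy process - each repeat of the 'same' test yields a different, equally exact-looking answer:

```python
import random

def noisy_measurement():
    # Hypothetical process: a true value of 100 plus real-world scatter
    return random.gauss(100, 15)

for test in range(3):
    sample = [noisy_measurement() for _ in range(30)]
    mean = sum(sample) / len(sample)
    # Quoting six decimal places implies a precision the data cannot support
    print(f"Test {test + 1}: mean = {mean:.6f}")
```

Run it twice and you get six different 'precise' means. None of them is wrong, exactly; the error is in believing the digits rather than the spread.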

Because statistics seem so authoritative, they are seductive in their believability. If you want to use results presented this way for anything beyond forming an opinion, you have to learn more about the situation and the conditions under which the data were collected. Generally you will find a ton of special circumstances that invalidate the statistics for other situations.

The main thing I want to focus on here is ignored data, the greatest problem for people who apply statistics. I see countless scientific papers that leave major questions unanswered because the researcher seems unaware there was a problem at all.

If you look at data relating to people during their lifetimes, you have a great advantage - no one existed before they were born, they continue to exist throughout their lives, and they never return after they die - yet. Statistics works well with such life-cycle data. However, imagine for a moment that you could die at forty and be reborn at fifty. If this were possible, it would completely invalidate life-cycle statistics, because you could not predict when, where and for how long people would be dead, and so the number of 'living' people you use to calculate the mean, for example, would be false. Luckily, people do not tend to die and get reborn, but something similar does occur in other data. I have read a report which observed that dandelions remove heavy metals from the soil, but failed to analyse what happened to the heavy metals in dandelions that died, or the weight of dandelions mown or eaten and so removed from the parks in question - an omission that completely invalidated the monitoring of heavy metals in those soils.
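A toy mass-balance sketch makes the dandelion problem concrete (the quantities and rates here are entirely invented; the point is the shape of the error, not the numbers). Metal that leaves in cuttings is never measured, so the monitored totals stop adding up:

```python
# Hypothetical mass-balance sketch: units are arbitrary 'metal units',
# and the uptake and removal rates are invented for illustration.
soil_metal = 1000.0
metal_in_standing_plants = 0.0
metal_removed_unobserved = 0.0   # mown, eaten or dead dandelions

for season in range(10):
    uptake = 0.05 * soil_metal                 # plants draw metal from the soil
    soil_metal -= uptake
    metal_in_standing_plants += uptake

    removed = 0.4 * metal_in_standing_plants   # cuttings leave the park unmeasured
    metal_in_standing_plants -= removed
    metal_removed_unobserved += removed

observed = soil_metal + metal_in_standing_plants
print(f"observed in soil and standing plants: {observed:.1f}")
print(f"left the system unseen:               {metal_removed_unobserved:.1f}")
# A monitor who only measures soil and standing plants sees metal
# 'disappear' with no way of saying where it went.
```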

Statistics can also be applied to continuous systems, such as flow in a pipe or the grading of translated texts. Again, a good use - as long as the pipe does not leak between testing points, and no new translators begin work while others leave. I have seen both of these problems ignored in scientific papers written in the relevant fields of engineering and linguistics. There are uses elsewhere: I remember my father describing how, as a Royal Engineer building roads during the Mau Mau uprising in Kenya, they would count the locals they employed in the morning and again at the end of the day - assuming that if the latter count was larger, there were Mau Mau infiltrators in the group. However, the method could not possibly detect Mau Mau replacing workers one for one, or any change in sympathies during the working day, and as such it was a dangerously unsafe methodology.
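To see why a pure count is unsafe, consider a toy sketch (the names are invented): two rosters of the same size can have different membership, so comparing counts detects additions but never substitutions:

```python
# Hypothetical worker rosters for one day (names invented for the sketch)
morning = {"Mwangi", "Kamau", "Njoroge", "Otieno"}
evening = {"Mwangi", "Kamau", "Njoroge", "Wanjiru"}  # one worker replaced

# The headcount test only flags a problem if the evening count is larger
if len(evening) > len(morning):
    print("Headcount test: possible infiltration")
else:
    print("Headcount test: nothing detected")   # this branch runs

# An identity check would have caught the substitution
print("Joined unseen:", evening - morning)   # {'Wanjiru'}
print("Left unseen:  ", morning - evening)   # {'Otieno'}
```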

The kinds of data that the straight application of statistics ignores are, therefore, those little datalets that enter and leave the system unobserved. Rarely do I see checks for this kind of problem: too often there is overconfidence in the use of statistics, too much effort spent on their application, and a consequent lack of assessment of likely data leaks.
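One simple check for such leaks is a conservation test (a sketch with invented numbers; `leak_residual` is a hypothetical helper, not a standard routine): anything the inflows, outflows and change in stock cannot account for must have entered or left unobserved.

```python
def leak_residual(inflow, outflow, stock_start, stock_end):
    """Conservation check: whatever does not balance
    entered or left the system unobserved."""
    return inflow - outflow - (stock_end - stock_start)

# Hypothetical pipe readings: 500 units in, 480 out, stock grew by 5
residual = leak_residual(inflow=500, outflow=480, stock_start=100, stock_end=105)
print(f"unaccounted for: {residual} units")  # 15 units leaked or were missed
```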

So, next time an authoritative report thumps onto your desk or into your inbox, spend some time thinking about what might have happened to the tested quantities between observations.
