February 4, 2014

Why you need to learn statistical thinking

"Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write," is a quote by H.G. Wells, and one of the four quotes that open the book How to Lie With Statistics by Darrell Huff. The book is from the 1950s, hence the frequent use of the "N-word", but I believe the quote is truer today than ever before. We need statistics, but without writers who use the words with honesty and understanding, and readers who know what they mean, the result will be a filthy mess.

It's common to read an article based on statistical data and then read a blog post where the same data is explained in a different way. A recent example is a Swedish newspaper article on the Los Angeles Police Department (LAPD). The article compared the number of police officers in Los Angeles with the number of police officers in Sweden. It said that Los Angeles has 10 000 police officers and Sweden, with twice as many people as Los Angeles, has 20 000 police officers. But the article forgot that Los Angeles is also covered by other organizations that can fight crime, such as the California Bureau of Firearms, California Bureau of Investigation, California Highway Patrol, California State Parks Police, FBI, and so on. Sweden doesn't have these organizations, so the two headcounts can't be compared directly.

How to Lie With Statistics is a short book of 129 pages, filled with funny pictures and without a single equation, so it doesn't take long to read. According to the book, there are a number of ways to use statistics to deceive, and we need to learn them in self-defense.
  • The result of a sampling study is no better than the sample it is based on. US Navy recruiters explained that it was safe to join the Navy because only 9 per 1000 Navy sailors died during the Spanish-American War. That was less than the 16 per 1000 civilians who died in New York during the same period. But the Navy consists of young men, while the civilian population includes infants, the elderly, and the ill, so you can't compare them. A report based on sampling must use a representative sample, which is one from which every source of bias has been removed. What we need to ask ourselves is: "Does every name or thing in the whole group have an equal chance to be in the sample?" The problem is that it's almost impossible to achieve this "equal chance."
  • The word "average" has a very loose meaning. When you are told that something is an average, you still don't know very much about it unless you can find out which of the common kinds of average it is: mean, median, or mode.
  • If you use a small group as a sample, the results will be more random than if you use a large group. Only when there is a substantial number of trials involved is the law of averages a useful description or prediction. If you toss a coin 10 times, you will probably not get heads exactly 50 percent of the time.
  • The only way to think about sampling results is in ranges. If your result is 100, then the true value is probably somewhere in a range, say 100 +/- 10. So comparisons between figures with small differences are meaningless. Even comparisons between figures with large differences are meaningless if the error is large.
  • Charts can be manipulated to strengthen the message you want to send.
  • Flaws in assumptions of causality. It was said that cigarette smokers get lower grades than non-smokers, but that doesn't prove that smoking causes low grades. There may be a third factor involved: maybe extroverts smoke more than introverts, who prefer to sit at home and study. Or the result may simply be due to chance.
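
Two of the points above, the loose meaning of "average" and the randomness of small samples, are easy to check yourself. Here is a small Python sketch of my own. The income figures are made up for illustration and are not from the book, and the helper name prob_heads_between is just something I picked.

```python
from math import comb
from statistics import mean, median, mode

# Hypothetical yearly incomes (made-up numbers, not from the book).
# One very high income is enough to pull the mean far above the others.
incomes = [20_000, 20_000, 25_000, 30_000, 35_000, 40_000, 250_000]

print("mean:  ", mean(incomes))    # the arithmetic average
print("median:", median(incomes))  # the middle value when sorted
print("mode:  ", mode(incomes))    # the most common value


def prob_heads_between(tosses: int, low: int, high: int) -> float:
    """Probability that a fair coin shows between `low` and `high` heads
    (inclusive) in `tosses` tosses, using the binomial distribution."""
    favourable = sum(comb(tosses, k) for k in range(low, high + 1))
    return favourable / 2 ** tosses


# Exactly 5 heads in 10 tosses, i.e. exactly 50 percent heads.
print("10 tosses, exactly 50% heads:", round(prob_heads_between(10, 5, 5), 3))

# Between 45% and 55% heads: with 10 tosses only 5 heads qualifies,
# with 1000 tosses anything from 450 to 550 heads does.
print("10 tosses, 45-55% heads:  ", prob_heads_between(10, 5, 5))
print("1000 tosses, 45-55% heads:", prob_heads_between(1000, 450, 550))
```

With these made-up incomes all three "averages" are different numbers, so whoever reports "the average income" can pick the one that suits them. And the coin example shows the small-sample point: getting exactly half heads in 10 tosses happens only about a quarter of the time, while with 1000 tosses the share of heads almost always lands close to 50 percent.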

Source: Cornubot, How to Lie With Statistics (available for free)