When God made people He had statistics in mind!

Milan Mitrović
6 min read · Mar 4, 2020

When you decide to learn statistics, you soon realize that there is little truly new to learn. You are already, subconsciously, using its basic principles and building blocks.

If you want to understand only one idea in data science, choose the central limit theorem. It is the father of much of classical statistical inference.

After reading this article you will:

  • Understand the big picture of statistics
  • Find out how to use the central limit theorem
  • Figure out how to become a better decision maker

What is this article not?

  • It is not a comprehensive introduction to statistics
  • It is not going to make you a millionaire after you read it
  • It is not about mathematical details. It is about building intuition.

The central limit theorem is all about making decisions based on evidence. If you have bad, wrong, or misleading evidence, you will make wrong decisions. Garbage in, garbage out!

The frustrating thing about statistics is that you can never be 100% sure whether your data is of good enough quality. The most you will be able to say is something like “There is a high chance that the data is of good (or bad) quality.” Later on we will find out what exactly data quality means.

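The central limit theorem is easy to write down mathematically but difficult to understand conceptually. Roughly, it says that the average of many independent observations is approximately normally distributed around the true mean, whatever the shape of the individual observations. A numerical demonstration of the theorem takes only a few lines of code.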
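As a rough illustration of the theorem, here is a minimal Python sketch using only the standard library. It assumes, purely for demonstration, that single observations come from a skewed exponential distribution with true mean 1; the function name `sample_means` and all of its parameters are made up for this example. If the theorem holds, the averages should cluster around 1 and their spread should shrink roughly like 1/√n as the sample size n grows.

```python
import random
import statistics


def sample_means(n, trials=5000, seed=0):
    """Return `trials` sample averages, each computed from n exponential draws (true mean 1)."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(trials)
    ]


for n in (1, 5, 50):
    means = sample_means(n)
    # Print: sample size, average of the averages, and how widely the averages are scattered
    print(n, round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```

The second printed column should stay close to 1 for every n, while the third column should get smaller as n grows. Plotting a histogram of `means` for n = 50 would show the familiar bell shape, even though a single exponential draw looks nothing like a bell.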

For beginners, it is very important to accept that you cannot know whether you are right in any particular case. What you can know is how often you will be right: if you follow the CLT and work at, say, a 90% confidence level, you will be right about 90% of the time. The remaining 10% of the time you will be misled.

Liars

Imagine you got a job in a different city in a different country. You are living in a city with no family and no friends, so you decide to spend more time downtown in order to meet new people. After a while, you have some new acquaintances. How do you know whether you can rely on them?

That’s where statistics comes in! Imagine you met three girls during your first month in the new country. You need to figure out whether you can trust them, so you start going out with them and talking about different topics. After you have met each of them several times and talked about different topics, the outcome is the following:

  • Girl A, a pretty cool girl, talked correctly and honestly about 9 out of 10 topics
  • Girl B, a really bad girl, talked correctly and honestly about 2 out of 10 topics
  • Girl C, who has an extremely volatile temperament, talked honestly about 5 out of 10 topics

Unfortunately, life is too short and you do not have time to waste. You need to choose one to hang out with. Which one do you choose?

Based on the evidence, girl A is the most trustworthy and girl B is the least trustworthy. Girl C is very changeable: some of the time she lies, some of the time she tells the truth.

What should you learn from this example? How can you put your findings into action?

Of course, assessing the credibility of someone’s story takes a lot of energy and time, and you want to spend as little as possible on it. So you check the credibility of the first several stories and draw a conclusion about the person’s general credibility. Later on, you use that conclusion when you need to judge the truthfulness of a new story.

Put simply, suppose you go jogging with girl A and listen to her talk about something that happened the day before. You should be fairly confident, about 90%, that the story is true.

Similar logic applies to girl B: there is only about a 20% chance that her new story is true.

For girl C, the situation is more complex. You cannot know whether her new story is true: sometimes she lies, sometimes she tells the truth. The only thing you can be sure of is that you cannot be sure about the truthfulness of her story.

From this example we can observe a correlation between the girl and the ‘level of truthfulness’.

  • Girl A, strong positive correlation. If a story comes from the mouth of girl A, there is a high chance it is true.
  • Girl B, strong negative correlation. If you heard a story from girl B, there is a low chance that it is true.
  • Girl C, absence of correlation. Knowing that a story comes from girl C tells you nothing about its credibility.
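The three cases can be put into numbers with a small Python sketch. It assumes, as a simplification, that each new story is honest independently with the rate observed so far. The helper `best_guess_accuracy` is made up for this example: it returns how often you would be right if you always bet on the more likely outcome (believe the story when the rate is above 50%, disbelieve it otherwise).

```python
# Observed record per girl: (honest conversations, total conversations)
records = {"A": (9, 10), "B": (2, 10), "C": (5, 10)}

# Observed honesty rate, used here as a rough estimate of the chance that a new story is true
rates = {name: honest / total for name, (honest, total) in records.items()}


def best_guess_accuracy(p):
    """Chance of being right when you always bet on the more likely outcome."""
    return max(p, 1 - p)


accuracy = {name: best_guess_accuracy(p) for name, p in rates.items()}

for name in records:
    print(name, rates[name], accuracy[name])
```

Under this assumption girl B is almost as informative as girl A (you simply bet against her story), while girl C leaves you at a coin flip. That is the sense in which A and B show a pattern and C does not.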

A significant correlation is called a pattern. For a correlation to be significant, it should be calculated on a large sample. Sampling is the topic of the next section. Keep on reading!
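Why the sample has to be large can be sketched with the usual normal-approximation margin of error for a proportion, z·√(p(1−p)/n). This is only a rough sketch: the function name `margin_of_error` and the choice z = 1.645 (a 90% level) are assumptions made for illustration, and the approximation itself is crude for samples as small as ten conversations.

```python
import math


def margin_of_error(p, n, z=1.645):
    """Normal-approximation margin of error for an observed proportion p from n observations."""
    return z * math.sqrt(p * (1 - p) / n)


# Girl A's observed honesty rate of 0.9, imagined at different sample sizes
for n in (10, 100, 1000):
    print(n, round(margin_of_error(0.9, n), 3))
```

The margin shrinks with the square root of the sample size, so a hundred times more conversations buys a margin about ten times narrower.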

Avoid people who constantly tell lies. Invest time in people who tell the truth.

In the previous section we built some intuition about statistical inference. Let’s get more technical!

Statistics is all about drawing conclusions from a sample of evidence. When you go out and talk to the people you meet in a new city, you are sampling.

To be good, a sample should be representative. As the name says, a representative sample represents well. Wait, wait, what represents what?!

Sample VS Population

A sample describes a population. A sample should be credible: it should be a good description of the population.

Imagine you want to measure the average height of the world’s population. It is not possible to measure every human being, so we call on inferential statistics for help. We take a sample, i.e. we choose a group of people and measure their height. We calculate the average height of this group and call it an estimate of the average height of the world’s population.

We need to collect a sample that is as diverse as possible. It should include people from all around the world. Imagine what would happen if you took your sample in a single village where everyone happens to be unusually tall: you would get a non-representative sample. Age should be diverse too, since children are of course shorter than adults. Women are, on average, shorter than men, so the sample should contain men and women in roughly the same proportion as the population, about 50:50.
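The difference between a representative and a non-representative sample can be sketched in Python. Everything about the population below is invented for illustration: 9,000 people with heights around 168 cm plus one unusually tall group of 1,000 people around 190 cm. The sketch compares an estimate from a random sample with one taken only from the tall group.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population (heights in cm); the last 1,000 entries are the unusually tall group
population = [rng.gauss(168, 8) for _ in range(9000)] + [rng.gauss(190, 8) for _ in range(1000)]
true_mean = statistics.fmean(population)

# Random sample: every member of the population has the same chance of being drawn
random_sample = rng.sample(population, 200)
random_estimate = statistics.fmean(random_sample)

# Convenience sample: 200 people measured only in the tall group
village_sample = population[-200:]
village_estimate = statistics.fmean(village_sample)

print(round(true_mean, 1), round(random_estimate, 1), round(village_estimate, 1))
```

The random estimate should land close to the true average, while the single-group estimate should overshoot it by a wide margin, no matter how carefully each person in that group is measured.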

Technical definition of a random sample: “For a sample to be random, every member of the population should have the same probability of being drawn.”

In our story of the three girls, the population is everything a girl has ever said to anyone. The sample is what she told you.

For a sample to be good, it should be drawn from the population independently. This means we should talk to her about a whole range of topics, over a longer period of time, at different times of day, in different social circles…

Maybe people tend to be more honest in the morning, or in some particular social circle, or during some temporary period of their life. To avoid these effects, let’s randomize! Let’s draw the sample from all those circumstances!

People look at you through the lens of your past behavior. The credibility of your new story is assessed based on the credibility of your past stories. This algorithm is naturally embedded in the human brain; we have learned that it is the best way to go through life. It will make us right most of the time, say 90%. But think for a moment about the rest of the cases, the times when we make wrong decisions. Just because someone has lied before doesn’t mean they always lie. Just because someone has told the truth before doesn’t mean they never lie. What is the impact on our lives of those cases when we make the wrong decision? How costly can those 10% of cases be?
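The 90% and 10% figures can be given a concrete meaning with a small simulation. The sketch below is an illustration under stated assumptions, not a recipe: heights are drawn from a hypothetical normal population (mean 170 cm, standard deviation 10 cm), each sample of 50 people yields an interval of the form mean ± 1.645 · s/√n, and the made-up function `coverage` counts how often that interval contains the true mean.

```python
import random
import statistics


def coverage(trials=2000, n=50, z=1.645, seed=1):
    """Fraction of samples whose interval mean ± z * s / sqrt(n) contains the true mean."""
    rng = random.Random(seed)
    true_mean, true_sd = 170.0, 10.0  # hypothetical heights in cm
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        centre = statistics.fmean(sample)
        half_width = z * statistics.stdev(sample) / n ** 0.5
        if abs(centre - true_mean) <= half_width:
            hits += 1
    return hits / trials


print(coverage())
```

The printed fraction should come out near 0.9. In roughly one sample out of ten the interval misses the true mean, and nothing inside that particular sample tells you it is one of the unlucky ones.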

Trust your intuitions, but be aware that you can be deeply wrong!

Written by Milan Mitrović

B.A. in Economics, Faculty of Economics, University of Belgrade