Lance Darragh

Statistical Significance Demystified

Updated: Mar 12, 2021

Here we want to provide you with a straightforward explanation of statistical significance, what it means to you, and how to use it right away in Einstein Discovery.


Why do we need to know about statistical significance?


The most common answer to this question is that it tells us that the correlation between two variables is not due to random chance. But how do we apply this answer? Currently, the most common application of statistical significance is feature selection for machine learning models. If two variables appear related only because they share the same source of random variation, neither one really tells us anything about the other. If that is the case for two variables, one input and one output, that input variable will not make a good predictor of that output variable. However, if we can show that their relationship is almost certainly not due to chance alone, then it's possible that one could have caused the other, or at least played a part in causing it.


You may be familiar with correlation. If so, you may have heard the phrase, "Correlation does not imply causation." A model built on correlation alone can show good accuracy during development, but if it's not based on sound reasoning, it may not perform well when deployed in the real world. Correlation is convenient algorithmically, but statistical significance is more likely to uncover meaningful relationships between variables.


The term statistical significance is arguably a misnomer. A more accurate term is statistically significant difference, and we can really just focus on the last two words: significant difference. A significant difference in what? For two variables, a significant difference in their distributions. (To demystify distributions, click here.)



On the left, we see two overlapping distributions. If we randomly picked a value and tried to determine which distribution it came from, it would not be easy. Because they have similar means and spreads, they were most likely generated by the same event. There is not likely a significant difference between them.


The middle distributions are a little farther apart, so it's more likely that two different events caused them, but it's still not entirely clear, much like the overlapping pollination areas of two similar species of iris flowers.


The rightmost distributions are most likely caused by different events; even though they have the same standard deviations, they have very different mean values. There are tests to quantify the similarity of distributions, such as the two-sample t-test.


This t-test compares the means of the two distributions, µ1 and µ2, by subtracting them. It normalizes the difference by a term that combines the standard deviations, σ1 and σ2 (actually the variances, which are the squares of the standard deviations), and the number of observations in each sample, n1 and n2.
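The calculation just described can be sketched in a few lines of Python. This is a minimal illustration of the t-value formula t = (µ1 − µ2) / √(σ1²/n1 + σ2²/n2) (the unequal-variances, or Welch's, form); the function name and sample data are just for illustration, not the routine any particular tool uses:

```python
from statistics import mean, variance

def welch_t(sample1, sample2):
    """Two-sample (Welch's) t-value: the difference in means,
    normalized by sqrt(s1^2/n1 + s2^2/n2)."""
    m1, m2 = mean(sample1), mean(sample2)
    v1, v2 = variance(sample1), variance(sample2)  # sample variances
    n1, n2 = len(sample1), len(sample2)
    return (m1 - m2) / ((v1 / n1 + v2 / n2) ** 0.5)

# Two samples with clearly different means produce a large |t|.
a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0]
b = [6.0, 6.2, 5.9, 6.1, 6.3, 5.8, 6.0, 6.1]
print(welch_t(a, b))
```

Identical samples give a t-value of zero; the farther apart the means are relative to the spread, the larger the t-value grows in magnitude.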



The result is known as a test statistic, specifically a t-value. It represents a number of standard deviations from the mean of a distribution. We then take that result and calculate a probability using a distribution (if you're unclear on this process, see Distributions Demystified). The only difference is that we use our t-value with what's called a t-distribution, a specific type of distribution that's very similar to the normal distribution but has slightly heavier tails. The t-distribution matters most when we have small samples (a common rule of thumb is about 30 data points or fewer); with more data, it converges to the standard normal distribution. In some applications, people use the standard normal distribution instead, because for larger samples the results are nearly identical and it is simpler to work with.


Subtracting the resulting probability from one gives us a p-value. If it's less than our chosen threshold (0.05 is a common choice), our two variables are significantly different from each other, so any correlation between them is unlikely to be due to chance, and it is more plausible that one could have affected the other.
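As a rough sketch, here is the step from t-value to p-value, using the standard-normal approximation mentioned above rather than a true t-distribution, and the two-sided convention (counting extreme differences in either direction). The example t-value and threshold are illustrative:

```python
import math

def normal_cdf(x):
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_value(t):
    """Two-sided p-value: the probability of a difference at least this
    extreme if both samples really came from the same source. Uses the
    standard normal in place of the t-distribution, which is a close
    approximation for larger samples."""
    return 2.0 * (1.0 - normal_cdf(abs(t)))

alpha = 0.05  # significance threshold
t = 2.5       # example t-value from a two-sample t-test
print(p_value(t) < alpha)  # is the difference significant at this threshold?
```

A t-value of zero (identical means) gives a p-value of 1, meaning no evidence of a difference, while larger |t| values drive the p-value toward zero.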


How to Apply Statistical Significance in Einstein


Einstein uses the t-test in the "What Happened" and "Why It Happened" insights pages, where insights are listed in order of decreasing statistical significance. Within each insight, the blue bars are those determined to be statistically significant using a two-sample t-test that compares two distributions: the output variable for rows with that value of the input variable, and the output variable for rows with all other values of that input variable. Using these insights can help you understand which features matter most to your model and your business. To drill down further, use the "What Could Happen" page; its results are also listed by decreasing statistical significance. For a method to optimize your Einstein Discovery model using statistical significance, click here.
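Einstein Discovery's internal implementation isn't shown here, but the one-value-versus-all-other-values comparison described above can be sketched in plain Python. The records, column names, and values below are entirely hypothetical:

```python
from statistics import mean, variance

# Hypothetical records: an input category ("region") and a numeric
# output ("amount"). Names and numbers are illustrative only.
records = [
    ("West", 120), ("West", 135), ("West", 128), ("West", 140),
    ("East", 95),  ("East", 102), ("East", 99),
    ("South", 101), ("South", 97), ("South", 104),
]

def t_value(sample1, sample2):
    """Two-sample (Welch's) t-value."""
    m1, m2 = mean(sample1), mean(sample2)
    v1, v2 = variance(sample1), variance(sample2)
    n1, n2 = len(sample1), len(sample2)
    return (m1 - m2) / ((v1 / n1 + v2 / n2) ** 0.5)

def one_vs_rest(records, value):
    """Compare the output for one input value against all other values."""
    group = [y for x, y in records if x == value]
    rest = [y for x, y in records if x != value]
    return t_value(group, rest)

for region in ("West", "East", "South"):
    print(region, round(one_vs_rest(records, region), 2))
```

Each category gets its own t-value against the rest of the data; ranking categories by the resulting significance mirrors the idea of ordering insight bars from most to least significant.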





