Distributions Demystified
Why Do We Need to Know About Distributions?
We may see distributions when editing our model in Einstein Discovery, but what are they really telling us? Distributions tell us many things, but we'll focus on one aspect of them for our purposes: they help us calculate statistical significance between two variables for feature selection in machine learning applications. (If you're not familiar with statistical significance, don't worry. This blog is intended to serve as a prerequisite for the blog Statistical Significance Demystified.) Distributions are functions derived from phenomena in the real world; their qualitative characteristics give us insight into what may have caused the numeric values we have in our data. Using them is more likely to help us select features that have a meaningful relationship with the outcome variable than using correlation alone. Distributions are best learned when illustrated with examples.
Examples of Distributions
If we flip a coin and plot how many times we get heads and how many times we get tails, then graph them, we get something like this.
We used 100 coin flips to illustrate that each result occurs about 50% of the time. Each result has a 50% probability of occurring because only two results can occur, and each is equally probable (equiprobable). These results are theoretical, but in reality, when we increase the number of repetitions of the experiment (times we flip the coin), the actual values converge to the theoretical values. This is an established principle known as the Law of Large Numbers. These types of equiprobable events have what is called a discrete uniform distribution. Discrete means we have integer values. There is also a continuous uniform distribution for continuous values.
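The Law of Large Numbers is easy to see for yourself. Here is a minimal simulation sketch using Python's standard `random` module (the seed and flip counts are arbitrary choices for illustration): as the number of flips grows, the observed fraction of heads converges toward the theoretical 0.5.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the run is reproducible

# Flip a fair coin n times and record the fraction of heads
for n in (100, 10_000, 1_000_000):
    flips = Counter(random.choice("HT") for _ in range(n))
    frac_heads = flips["H"] / n
    print(f"{n:>9} flips: {frac_heads:.4f} heads")
```

With 100 flips the fraction can wander a few percentage points from 0.5; by a million flips it is typically within a fraction of a percent.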
Another example of a discrete uniform distribution is rolling a die. Each number on the die is equally probable, with a probability of 1/6 (about 16.67%). To get whole-number counts for each result, we'll simulate rolling a die 102 times (the multiple of 6 closest to 100). Theoretically, we get the following results.
Each value occurs 17 times. Now let’s see what happens when we roll two dice and record the sum of them. Here, not all values are equiprobable. There are eleven possible values, the integers 2 through 12, inclusive. But each value can occur in a different number of ways. A 2 can only occur in one way, when both dice equal a 1.
Similarly, a 12 can only occur in one way: when both dice show a 6. Yet a 3 can occur in two ways: when one die shows a 1 and the other a 2, or vice versa. Similarly for 11: one die shows a 5 and the other a 6, or vice versa. A 4 or a 10 can be rolled in 3 ways, a 5 or a 9 in 4 ways, a 6 or an 8 in 5 ways, and a 7 in 6 ways.
The total number of ways the dice can roll is 1 + 1 + 2 + 2 + 3 + 3 + 4 + 4 + 5 + 5 + 6 = 36. We calculate the probability of an event (such as rolling a 2) as the number of ways that event can occur divided by the total number of ways all possible events can occur. So 2 and 12 have a probability of 1/36, 7 has a probability of 6/36 or 1/6, and so on. And in a distribution recorded over a large number of trials, 2 and 12 will take up about 1/36 of the values, and 7 will take up about 1/6 of the values.
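The counting argument above can be checked by brute force. This sketch enumerates all 36 equally likely (die 1, die 2) outcomes and tallies each sum:

```python
from collections import Counter
from fractions import Fraction

# Enumerate all 36 equally likely (die1, die2) outcomes and tally each sum
ways = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
total = sum(ways.values())  # 36 outcomes in all

for s in range(2, 13):
    print(f"sum {s:>2}: {ways[s]} way(s), probability {Fraction(ways[s], total)}")
```

The tally reproduces the pattern in the text: one way each for 2 and 12, up to six ways for 7, for 36 outcomes in total.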
To get integer occurrences of each value again, we’ll simulate rolling the dice 108 times. Our theoretical results will look like this.
Read as percentages, the counts are slightly higher than the true probabilities (by 8%, since we used 108 trials rather than 100). This is what's known as a discrete triangular distribution. If we calculate the distribution's mean, it comes out to the number in the middle: [(2 + 12)*3 + (3 + 11)*6 + (4 + 10)*9 + (5 + 9)*12 + (6 + 8)*15 + 7*18] / 108 = 7. This is because the distribution is symmetric. There's a mean value, and there's a certain amount of variation from that mean, which we measure with the standard deviation (σ). We can calculate these two statistics for any distribution. Let µ be the mean and n the number of values; then

σ = √[ Σ(xᵢ − µ)² / (n − 1) ]

(For a sample we divide by n − 1 rather than n: once the mean is fixed, only n − 1 of the deviations from it are free to vary.)
The standard deviation of this distribution is about 2.415, so most of the values lie within the range 7 ± 2.415. We'll talk more about that soon. We used this example to segue into the distribution that comes next, the (continuous) normal distribution, which looks like this.
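Both statistics can be verified with a short script. A note on the sketch below: because we hold the complete theoretical distribution (all 108 counts) rather than a sample, it uses the population standard deviation (`statistics.pstdev`, which divides by n), which matches the 2.415 quoted above.

```python
import statistics

# Theoretical counts out of 108 rolls: sum s occurs 3 * (ways to roll s) times
counts = {2: 3, 3: 6, 4: 9, 5: 12, 6: 15, 7: 18, 8: 15, 9: 12, 10: 9, 11: 6, 12: 3}
rolls = [value for value, count in counts.items() for _ in range(count)]

mu = statistics.mean(rolls)       # 7
sigma = statistics.pstdev(rolls)  # population standard deviation, about 2.415
print(mu, round(sigma, 3))
```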
It is symmetric; its mean value is right in the middle, and it has a particular standard deviation. It's described by the following general equation, known as its probability density function:

f(x) = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²))
This is how most numeric values in nature occur, continuous and discrete alike. If we counted the number of people that come into a mall each day over many days, we'd find some variation of a (discrete) normal distribution. If we found a field of flowers, all of the same kind, and measured many of their heights, they'd follow some sort of normal distribution. We'd find that they have an average value and a certain variation from that value, some a little taller and some a little shorter. A popular dataset used to introduce statistical concepts is the Iris dataset. It measures the length and width of sepals for three species of iris flowers, 50 samples of each. The plots of the sepal length for the three species are shown below.
These are all different types of normal distributions. They all have different means. They also have different standard deviations. But they're all normal distributions. The one on the left (Iris setosa) has the smallest mean and the smallest standard deviation, as seen by the spread of values to the mean's left and right. And the one on the right (Iris virginica) has the highest mean and the highest standard deviation.
A standard normal distribution has a mean of zero and a standard deviation of 1.
Any normal distribution can be converted to a standard normal distribution by subtracting its mean and dividing by its standard deviation, a process known as standardization. Let X be the normal distribution, µ be the mean, σ be the standard deviation, and Z be the standard normal distribution. We convert (standardize) with the following formula.
Z = (X - µ) / σ
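As a quick illustration of the formula, here is a minimal sketch that standardizes a value from the two-dice distribution above (µ = 7, σ ≈ 2.415):

```python
# Standardize values from the two-dice distribution (mu = 7, sigma = 2.415)
mu, sigma = 7.0, 2.415

def standardize(x, mu, sigma):
    """Convert a raw value x to its z-score in the standard normal distribution."""
    return (x - mu) / sigma

z = standardize(12, mu, sigma)
print(round(z, 2))  # a roll of 12 sits about 2.07 standard deviations above the mean
```

A z-score of 0 means the value equals the mean; a z-score of 2.07 means it sits a bit more than two standard deviations above it.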
Ok, so what do we do with distributions? We calculate probabilities and statistical significance from them. We’ll be using the above equation to do it. The probabilities are necessary to calculate the statistical significance, so we’ll cover those first.
Calculating Probabilities from Distributions
Probabilities are calculated from normal distributions by calculating the area under the curve above or below a given value (or between two values). For example, normal distributions have the interesting property that the same percentage of values lies within plus or minus one standard deviation from the mean, no matter what normal distribution it is. That number is approximately 68.3% of the data.
It's also true that 95.5% of the data lies within plus or minus two standard deviations from the mean, and 99.7% of the data lies within plus or minus three standard deviations from the mean. So if we had a number equal to three standard deviations above the mean, we would know that about 0.15% of the data lies above it ((100% − 99.7%) / 2, since a normal distribution has two tails). Any number above this value has roughly a 0.15% probability of occurring in this distribution.
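These rule-of-thumb percentages can be computed directly: for any normal distribution, the area within ±k standard deviations of the mean equals erf(k/√2), which Python exposes as `math.erf`. One caveat: the 0.15% figure above comes from the rounded 99.7%; the exact one-tail area beyond 3σ is closer to 0.135%.

```python
import math

def within_k_sigma(k):
    """Fraction of any normal distribution lying within ±k std devs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    inside = within_k_sigma(k)
    tail = (1 - inside) / 2  # area in one tail, beyond +k sigma
    print(f"+/-{k} sigma: {inside:.1%} inside, {tail:.3%} in the upper tail")
```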
Calculating Statistical Significance of Numerical Values
This value of 0.15%, expressed as a decimal (0.0015), is known as a p-value. It is the number used to determine whether a given value is statistically significant (significantly different) from the values in the distribution we have. To do this, we set a threshold, known as a confidence level: how far out in the distribution we place our cutoff. The most common value is 95%, because of the 5% probability threshold proposed by the statistician Ronald Fisher in 1925, who popularized the use of p-values.
If the p-value we calculate is below our threshold, the result is unlikely to have come from our distribution: it occupies only a small fraction of it (and keep in mind that the tails of the distribution, the far left and far right portions, extend to infinity). In other words, any value beyond the point corresponding to a p-value equal to our threshold is very unlikely to occur. So we conclude that it is (statistically) significantly different from our distribution, and we say that the value is statistically significant.
Using the sum of rolling two dice example above, if we set a confidence level of 95%, our values of 2 and 12 would be considered statistically significant since they each occur less than 5% of the time. However, if we set our confidence level to 99% (which is also common), we would have zero statistically significant values because every value occurs at least 1% of the time. Running the same statistical significance tests with different confidence levels can give different results. A good approach is to use more than one value and compare the results.
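The two-dice comparison above can be sketched as a small significance check. This is only an illustration of the thresholding idea, not a full statistical test: each sum's probability is compared against the significance threshold α (1 minus the confidence level).

```python
from collections import Counter

# Probability of each two-dice sum (out of 36 equally likely outcomes)
ways = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
prob = {s: n / 36 for s, n in ways.items()}

def significant_sums(alpha):
    """Sums rarer than the significance threshold alpha."""
    return sorted(s for s, p in prob.items() if p < alpha)

print(significant_sums(0.05))  # 95% confidence level -> [2, 12]
print(significant_sums(0.01))  # 99% confidence level -> []
```

At α = 0.05 only 2 and 12 (probability 1/36 ≈ 2.8% each) fall below the cutoff; at α = 0.01 nothing does, matching the text.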
Calculating Numerical Values from Probabilities
Given a probability and a distribution, how do we calculate the numeric value that corresponds to it? To answer that, we'll talk about a common type of probability you're probably familiar with: a percentile. If you're at the 90th percentile, you're above 90 percent of the population, i.e. in the top 10 percent. To calculate a percentile from any set of numerical data, we can order the values and count how many fall above or below a given one. So, for example, if we were comparing the miles per gallon of cars and wanted to know the percentile of our car, we'd take our mpg, count how many values fall below it, and express that as a percentage of all values. But what if we wanted to go the other way? Say someone told us that our car was at the 90th percentile of all vehicles. What's our mpg?
For that, we need to integrate the normal distribution. When we do that, we integrate its probability density function (shown earlier). To make the calculation more general for anyone’s use, we use the standard normal distribution’s probability density function, which is the same function with a mean (µ) of zero and a standard deviation (σ) equal to one.
The integral can't be solved in closed form, so we have to use numerical methods. A standard table of values is commonly used to look up percentiles for given values. It's known as a Z-table, and it looks like this.
The row indices give the z-score to one decimal place, and the column indices give the second decimal place. The entries in the table are the cumulative probabilities (the percentage of the area under the curve to the left of that z-score). To find the 90th percentile, we look for an entry near 0.90: the entry at row 1.2, column 0.09 is about 0.9015, which tells us that roughly 90% of the data lies below 1.29 standard deviations above the mean. This 1.29 is known as a Z-score.
To convert this back to our distribution, we rearrange the equation used to standardize our normal distribution to solve for X, plug in the values, and we have our answer. Only we let x be a value in X and z be a value in Z.
x = µ + zσ
So, if our mean is 30 mpg and we have a standard deviation of 5, the mpg that corresponds to the 90th percentile is x = 30 + 1.29*5 = 36.45 mpg.
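The table lookup can also be done in code. Python's `statistics.NormalDist` provides the exact inverse CDF, so no printed Z-table is needed; note it returns roughly 1.2816 where the table rounds to about 1.29, so the result differs slightly from the hand calculation.

```python
from statistics import NormalDist

# Exact inverse CDF instead of a printed Z-table lookup
z = NormalDist().inv_cdf(0.90)  # about 1.2816 (the table rounds to about 1.29)

mu, sigma = 30, 5  # mean mpg and standard deviation from the example
x = mu + z * sigma
print(round(z, 4), round(x, 2))  # about 36.41 mpg, close to the table-based 36.45
```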
Ok, this works for a value in a distribution. But how do we use this to tell if two variables are different? We see if their distributions are different.
Different Types of Distributions
A distribution that's different from anything we've seen so far is the count of remaining atoms (really, isotopes) of an element as it radioactively decays over time.
This is known as an exponential distribution, which is quite different from the distributions we got from rolling dice, or from a normal distribution.
Here's another exponentially shaped distribution: population size increasing as measurements are taken over time. These two examples reiterate how different causes generate different distributions.
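The qualitative difference between these shapes shows up even in simple summary statistics. This sketch (parameters chosen arbitrarily) samples from an exponential and a normal distribution with Python's `random` module: exponential data are right-skewed, so the mean sits well above the median, while normal data are symmetric and the two nearly coincide.

```python
import random
import statistics

random.seed(0)

# Large samples from an exponential and a normal distribution
exp_sample = [random.expovariate(1.0) for _ in range(100_000)]
norm_sample = [random.gauss(0, 1) for _ in range(100_000)]

# Exponential: right-skewed, so mean - median is clearly positive
# (theoretically 1 - ln 2, about 0.31, for rate 1).
print(statistics.mean(exp_sample) - statistics.median(exp_sample))

# Normal: symmetric, so mean - median is close to zero.
print(statistics.mean(norm_sample) - statistics.median(norm_sample))
```

A gap like this between mean and median is one quick signal that a variable's distribution is not normal.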
If we saw a normal distribution for our output variable and one of these for an input variable, we would know that they come from different causes and look at the correlation to determine if we should include it in our model (possibly with a data transformation). To learn how to determine statistical significance between two variables using their distributions quantitatively, click here.