A statistical distribution is a parameterized mathematical function that gives the probabilities of different outcomes for a random variable. A distribution can be discrete or continuous, depending on the random variable it models. This article introduces the seven most important statistical distributions, shows how to simulate them in Python using either Numpy's built-in functions or a random variable generator, and discusses the relationships among the distributions and their applications in data science.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Bernoulli distribution is a discrete distribution. The assumptions of the Bernoulli distribution are: 1. there are only two outcomes; 2. there is only one trial. The Bernoulli distribution describes a random variable that has only two outcomes. For example, when tossing a coin once, you can only get “Head” or “Tail.” We can generalize this by defining the outcomes as “success” and “failure.” If I roll a die and only care whether I get a six, I can define the die showing six as “success” and all other outcomes as “failure.” Even though rolling a die has six outcomes, in the experiment as I define it there are only two outcomes, so I can use the Bernoulli distribution.
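For reference, the Bernoulli PMF, expected value, and variance are the standard textbook results:

P(X = 1) = p, \quad P(X = 0) = 1 - p, \quad E[X] = p, \quad Var(X) = p(1 - p)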
Simulating a Bernoulli trial is straightforward by defining a random variable that only generates two outcomes with a certain “success” probability p:
#success probability is the same as failure probability
np.random.choice(['success','failure'], p=(0.5, 0.5))
#probabilities are different
np.random.choice(['success','failure'], p=(0.9, 0.1))
Binomial distribution is also a discrete distribution, and it describes the random variable x as the number of successes in n Bernoulli trials. You can think of the Binomial distribution as the outcome distribution of n identical Bernoulli distributed random variables. The assumptions of the Binomial distribution are: 1. each trial has only two outcomes (like tossing a coin); 2. there are n identical trials in total (tossing the same coin n times); 3. each trial is independent of the other trials (getting “Head” on the first trial doesn’t affect the chance of getting “Head” on the second trial); 4. p and 1-p are the same for all trials (the chance of getting “Head” is the same across all trials).
There are two parameters in the distribution: the success probability p and the number of trials n. The probability that we have x successes out of n trials involves choosing x out of n when order doesn’t matter.
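Concretely, the Binomial PMF and its moments are the standard textbook results:

P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad E[X] = np, \quad Var(X) = np(1 - p)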
Think of the Binomial distribution as n identical Bernoulli distributions: summing the n Bernoulli outcomes gives the Binomial count of successes.
Python’s Numpy library has a built-in Binomial distribution function. To simulate it, define n and p, and run the simulation 1000 times:
n = 100
p = 0.5
size = 1000
binomial = np.random.binomial(n,p,size)
plt.hist(binomial)
plt.axvline(binomial.mean(), color='r', linestyle='dashed', linewidth=2) # mean around n*p
# Setting n equal to 1 simulates the Bernoulli distribution:
n = 1
p = 0.5
size = 10000
bernoulli = np.random.binomial(n,p,size)
plt.hist(bernoulli)
Geometric distribution is a discrete distribution that models the number of failures (x failures) before the first success in repeated, independent Bernoulli trials. For example, the random variable can be how many “Tails” you get before your first “Head.” It can also model the number of trials needed to get the first success (x-1 failures), like how many times you have to toss a coin until you get the first “Head.” The only difference between these two random variables is how the failures are counted. The Geometric distribution’s assumptions are the same as the Binomial distribution’s because both derive from identical, independent Bernoulli trials.
When the random variable x is the number of failures before the first success, the PMF follows directly from the underlying Bernoulli trials, and we need the geometric series to derive the expected value and variance of the Geometric distribution.
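For reference, the standard results are (x independent failures, each with probability 1 - p, followed by one success):

P(X = x) = (1 - p)^x \, p, \quad E[X] = \frac{1 - p}{p}, \quad Var(X) = \frac{1 - p}{p^2}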
To simulate the Geometric distribution, we can run Bernoulli trials, count the number of failures before the first success, and then plot the number of failures:
geometric = []
failure = 0
n = 0
p = 0.5
while n < 10000:
    result = np.random.choice(['success','failure'], p=(p, 1-p))
    if result == 'failure':
        failure += 1
    else:
        geometric.append(failure)
        failure = 0
    n += 1
plt.hist(geometric)
plt.axvline(np.mean(geometric), color='red')
Uniform distribution models a random variable whose outcomes are equally likely to occur. The outcomes can be discrete, like the outcomes from rolling a die, or continuous, like the waiting time for a bus to arrive. Thus, the Uniform distribution can be a discrete or continuous distribution depending on the random variable. The assumptions are: 1. there are n outcomes (discrete), or a range for the outcomes to fall in (continuous); 2. all values in the outcome set or the range are equally likely to occur.
The discrete uniform distribution is straightforward, and it is easy to calculate the expected values and variance.
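For the continuous Uniform distribution on [a, b], the standard results are:

f(x) = \frac{1}{b - a} \ \text{for} \ a \le x \le b, \quad E[X] = \frac{a + b}{2}, \quad Var(X) = \frac{(b - a)^2}{12}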
To simulate the Uniform distribution, we can use Numpy’s embedded function and specify the range of the distribution. Based on a Uniform distribution on [0,1], we can generate a Uniform distribution on [a, b]:
#generate a random variable that follows U(0,1)
uniform = np.random.uniform(0,1,size=10000)
plt.hist(uniform)
#use U(0,1) to generate U(a,b)
def uniform(a, b):
    return a + (b-a) * np.random.uniform(0, 1, size=10000)
plt.hist(uniform(1,10))
Normal (Gaussian) distribution is the most widely used continuous distribution because it represents so many real-world situations. Many random variables are normally distributed because of the Central Limit Theorem, or they are assumed to be normally distributed before being fit into a statistical model. The Normal distribution has some unique characteristics: 1. mean = mode = median = µ; 2. the PDF is bell-shaped and symmetric about x = µ; 3. the values in [µ-σ, µ+σ] contain roughly 68% of the data, where σ is the standard deviation.
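For reference, the Normal PDF is:

f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}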
mu = 0
sigma = 1
normal = np.random.normal(mu, sigma, 1000)
plt.hist(normal)
# Or we can use the Central Limit Theorem to simulate the Normal distribution:
def clm(N, n):
    #generate a random population of size N
    lis = np.random.random(size=N)
    means = []
    #take repeated random sub-samples of size n from the population
    for i in range(n):
        lis_index = np.random.randint(0, N, n)
        means.append(lis[lis_index].mean())
    #plot the sample means, which are approximately normally distributed
    return plt.hist(means)
clm(10000,1000)
Poisson distribution is a discrete distribution that models the probability of a number of events occurring in a fixed interval of time or space. It can be used to represent the number of customers arriving at a store in an hour, or the number of phone calls a company receives in a day, etc. The Poisson distribution is closely related to the Binomial distribution if you measure the number of event occurrences as the number of successes. For example, when measuring how many cars will pass a particular street in an hour, the number of cars passing is a random variable that follows a Poisson distribution. To understand the car-passing event, you can break one hour down into 60 minutes, see how many cars pass in a minute, and then generalize to an hour. In a minute, more than one car may pass the street, so it is not a binary random variable. However, if we break one minute down into 60 seconds, it is likely that only one car or no car passes in a single second. We can keep subdividing the interval to make this claim with more confidence. We can then consider a car passing as a success and no car passing as a failure, and model how many successes (how many cars pass) occur in 3600 trials (3600 seconds in an hour), which is a Binomial distribution with success probability p and the number of trials equal to 3600.

Connecting the Poisson distribution with the Binomial distribution helps us understand the assumptions and PMF of the Poisson distribution. The assumptions of the Poisson distribution are: 1. any successful event should not influence the outcome of other successful events (observing one car in the first second doesn’t affect the chance of observing another car in the next second); 2. the probability of success p is the same across all intervals (there is no difference between this hour and other hours for observing cars passing by); 3. the probability of success p in an interval goes to zero as the interval gets smaller (if we are discussing how many cars will pass in a millisecond, the probability is close to zero because the time is too short).
The Poisson distribution gives the probability of observing a certain number of events in an interval, given the average number of events in the interval. The expected value and variance of a random variable that follows Poisson distribution are both λ.
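For reference, the Poisson PMF with average rate λ is:

P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad E[X] = Var(X) = \lambda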
To simulate Poisson distribution, we can use the Numpy function:
lam = 1
poi = np.random.poisson(lam=lam,size=1000)
plt.hist(poi)
plt.axvline(x=lam,c='red')
The Exponential distribution is a continuous distribution that is closely related to the Poisson distribution. It is the probability distribution of the time intervals between Poisson events. If the number of calls a company receives in an hour follows Poisson distribution, then the time interval between calls follows Exponential distribution. Since exponential distribution is closely related to Poisson distribution, its assumptions follow the Poisson distribution’s assumptions.
We can derive the PDF of the Exponential distribution by first deriving its Cumulative Distribution Function (CDF) and then taking the derivative of the CDF. We use this trick because we can get the CDF from the Poisson distribution. Suppose we are still observing how many cars pass on the same street, and now we care about the random variable ω, defined as the time after seeing one car passing until we observe another car passing. Let’s start from the beginning: how long does it take to observe the first car? If it takes at least ω minutes to observe the first car, then in those ω minutes, zero cars pass.
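Filling in the step: with rate λ cars per minute, the number of cars passing in ω minutes follows a Poisson distribution with mean λω, so

P(T > \omega) = P(\text{zero cars in } \omega \text{ minutes}) = \frac{(\lambda\omega)^0 e^{-\lambda\omega}}{0!} = e^{-\lambda\omega}

F(\omega) = P(T \le \omega) = 1 - e^{-\lambda\omega}, \qquad f(\omega) = \frac{d}{d\omega}F(\omega) = \lambda e^{-\lambda\omega}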
To simulate the Exponential distribution, we can use the Numpy random function:
#note: np.random.exponential takes the scale parameter, which is the mean 1/λ
lam = 2
exp = np.random.exponential(lam, 10000)
plt.hist(exp)
plt.axvline(x=lam, c='red')  #the red line marks the mean of the distribution
# Or we can use the inverse CDF and a Uniform distribution on (0,1) to simulate it.
rate = 1/20
u = np.random.uniform(0, 1, size=1000)
#invert the CDF: if u = 1 - exp(-rate*x), then x = -log(1-u)/rate
interval = [-np.log(1-x)/rate for x in u]
plt.hist(interval)
As discussed partially above, these distributions are closely related to each other: the Bernoulli distribution is a Binomial distribution with n = 1; the Binomial distribution is the sum of n identical, independent Bernoulli trials; the Geometric distribution counts the failures before the first success in repeated Bernoulli trials; the Poisson distribution is the limit of a Binomial distribution with a large number of trials and a small success probability; and the Exponential distribution describes the time intervals between Poisson events.

One special property of the Exponential distribution is that it is memoryless:

P( X > a + b | X > a ) = P( X > b )
One of the implications of this memoryless property is that if the lifetime of batteries has an Exponential distribution, then a used battery is as good as a new one, as long as it’s not dead.
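We can also check the memoryless property empirically; the following is a minimal sketch (the sample size, the scale of 20, and the values of a and b are arbitrary choices for illustration):

#empirically verify P(X > a+b | X > a) ≈ P(X > b)
x = np.random.exponential(scale=20, size=100000)
a, b = 10, 15
p_cond = np.mean(x[x > a] > a + b)  #P(X > a+b | X > a)
p_uncond = np.mean(x > b)           #P(X > b)
print(p_cond, p_uncond)             #both should be close to exp(-b/20)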
Understanding the assumptions and properties of different statistical distributions definitely helps data scientists with their daily tasks: 1. Monte Carlo simulations start with generating random variables from a Uniform distribution; 2. the Binomial distribution sets the foundation for any binary classification model; 3. all regression models that use the least-squares cost function assume the residuals follow a Normal distribution with mean equal to zero; 4. distributions transformed from the Normal distribution, like the Student’s t distribution and the Standard Normal distribution, are used for conducting hypothesis tests, computing p-values, and building confidence intervals, while the Log-normal distribution is often used to model right-skewed data; 5. the Poisson and Exponential distributions are great for modeling events with a fixed rate (λ). For example, we can use the Poisson distribution to model how many customers will show up in a shop in a day, and the Exponential distribution to model how much time passes between two consecutive customers entering the shop.