The Gaussian distribution is often called the normal distribution. It provides a parameterized mathematical function that can be used to calculate the probability of any individual observation from the sample space. The function that describes the grouping or density of observations is called the probability density function, and the function that summarizes the cumulative relationship between observations is called the cumulative density function.
In this blog, you will discover the Gaussian and related distribution functions and how to calculate probability and cumulative density functions for each. After completing this blog, you will know how to think of a distribution as a function that describes the relationship between observations in a sample space, and how to calculate probability and cumulative density functions for the Gaussian, Student's t, and Chi-Squared distributions.
A distribution is a mathematical function that describes the relationship between observations; for example, a distribution of human heights describes how observations of different heights relate to one another.
A distribution is simply a collection of data, or scores, on a variable. Usually, these scores are arranged in order from smallest to largest and can then be presented graphically.
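As a minimal sketch of this idea (the scores below are made-up values, used only for illustration), we can sort a small collection of scores and present them graphically:
- # sort a small collection of scores and present them graphically
- from numpy import array, sort
- from matplotlib import pyplot
- # an arbitrary sample of scores on a single variable
- scores = array([7, 2, 9, 4, 4, 6, 3, 8, 5, 5])
- # arrange the scores from smallest to largest
- ordered = sort(scores)
- # present the ordered scores graphically
- pyplot.plot(ordered, marker='o')
- pyplot.show()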
Distributions are often described in terms of their density or density functions. Density functions describe how the proportion of data, or the likelihood of observations, changes over the range of the distribution. Two types of density functions are probability density functions and cumulative density functions.
A probability density function (PDF) can be used to calculate the likelihood of a given observation in a distribution. It can also be used to summarize the likelihood of observations across the distribution's sample space.
A cumulative density function (CDF) is a different way of thinking about the likelihood of observed values. A CDF gives the probability of an observation at or below a given value, and is often plotted as a curve rising from 0 to 1 across the distribution.
Both PDFs and CDFs are continuous functions. The equivalent of a PDF for a discrete distribution is called a probability mass function, or PMF.
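As a quick illustration of a PMF (the binomial distribution with 10 trials and a success probability of 0.5 is an arbitrary choice here, not something used elsewhere in this blog), we can calculate the probability of each discrete outcome:
- # plot the pmf of a discrete (binomial) distribution
- from numpy import arange
- from matplotlib import pyplot
- from scipy.stats import binom
- # sample space of counts from 10 trials with success probability 0.5
- sample_space = arange(0, 11)
- # calculate the probability of each count
- pmf = binom.pmf(sample_space, 10, 0.5)
- # plot the discrete probabilities
- pyplot.bar(sample_space, pmf)
- pyplot.show()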
The Gaussian distribution is the focus of much of the field of statistics. Perhaps surprisingly, data from many fields of study can be described using a Gaussian distribution; it is so common that it is often called the normal distribution. A Gaussian distribution can be described using two parameters: the mean, which is the expected value or center of the distribution, and the variance, which describes the spread of observations around the mean (in practice, the standard deviation, the square root of the variance, is often used instead).
We can work with the Gaussian distribution via the norm object in the scipy.stats module. The norm.pdf() function can be used to create a Gaussian probability density function with a given sample space, mean, and standard deviation. The example below creates a Gaussian PDF with a sample space from -5 to 5, a mean of 0, and a standard deviation of 1. A Gaussian with these values for the mean and standard deviation is called the standard Gaussian.
- # plot the gaussian pdf
- from numpy import arange
- from matplotlib import pyplot
- from scipy.stats import norm
- # define the distribution parameters
- sample_space = arange(-5, 5, 0.001)
- mean = 0.0
- stdev = 1.0
- # calculate the pdf
- pdf = norm.pdf(sample_space, mean, stdev)
- # plot
- pyplot.plot(sample_space, pdf)
- pyplot.show()
Running the example creates a line plot showing the sample space on the x-axis and the likelihood of each value on the y-axis. The line plot shows the familiar bell shape of the Gaussian distribution. The top of the bell marks the most likely value of the distribution, called the expected value or the mean, which in this case is zero, as we specified when creating the distribution.
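We can sanity-check these properties numerically; the snippet below is a quick check rather than part of the original example:
- # check the peak and total area of the standard Gaussian pdf
- from numpy import arange
- from scipy.stats import norm
- sample_space = arange(-5, 5, 0.001)
- pdf = norm.pdf(sample_space, 0.0, 1.0)
- # the most likely value is (approximately) the mean of zero
- print(sample_space[pdf.argmax()])
- # the area under the pdf is approximately 1 (the step size is 0.001)
- print(pdf.sum() * 0.001)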
The norm.cdf() function can be used to create a Gaussian cumulative density function. The example below creates a Gaussian CDF for the same sample space.
- # plot the gaussian cdf
- from numpy import arange
- from matplotlib import pyplot
- from scipy.stats import norm
- # define the distribution parameters
- sample_space = arange(-5, 5, 0.001)
- # calculate the cdf
- cdf = norm.cdf(sample_space)
- # plot
- pyplot.plot(sample_space, cdf)
- pyplot.show()
Running the example creates a plot showing an S-shape with the sample space on the x-axis and the cumulative probability on the y-axis. We can see that a value of 2 covers close to 100% of the observations, with only a very thin tail of the distribution beyond that point. We can also see that the mean value of zero has 50% of the observations before and 50% after that point.
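We can confirm both observations directly from the CDF with a quick check:
- # check specific values of the standard Gaussian cdf
- from scipy.stats import norm
- # half of the probability mass lies at or below the mean of zero
- print(norm.cdf(0.0))
- # a value of 2 covers roughly 97.7% of the observations
- print(norm.cdf(2.0))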
The Student’s t-distribution, or just t-distribution for short, is named for the pseudonym “Student” used by William Sealy Gosset. It is a distribution that arises when attempting to estimate the mean of a normal distribution using samples of different sizes. As such, it is a helpful shortcut when describing uncertainty or error in estimating population statistics for data drawn from Gaussian distributions when the size of the sample must be taken into account. Although you may not use the Student’s t-distribution directly, you may estimate values from the distribution that are required as parameters in other statistical methods, such as statistical significance tests. The distribution can be described using a single parameter: the number of degrees of freedom.
Key to the use of the t-distribution is knowing the desired number of degrees of freedom. The number of degrees of freedom describes the number of pieces of information used to describe a population quantity. For example, the mean has n degrees of freedom, as all n observations in the sample are used to calculate the estimate of the population mean. A statistical quantity that makes use of another statistical quantity in its calculation must subtract 1 from the degrees of freedom, such as the use of the sample mean in the calculation of the sample variance. Observations in a Student’s t-distribution are calculated from observations in a normal distribution in order to describe the interval for the population mean in the normal distribution. Observations are calculated as:
data = (x − mean(x)) / (S / sqrt(n))
Where x is an observation from the Gaussian distribution, mean(x) is the average of the observations, S is the standard deviation, and n is the total number of observations. The resulting observations form the t-observation with (n − 1) degrees of freedom. In practice, if you require a value from a t-distribution in the calculation of a statistic, then the number of degrees of freedom will likely be n − 1, where n is the size of your sample drawn from a Gaussian distribution.
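The sketch below mirrors this formula as written; the sample size, random seed, and use of NumPy's default random number generator are arbitrary choices for illustration:
- # transform a Gaussian sample into t-observations as described above
- from numpy import sqrt
- from numpy.random import default_rng
- # draw an arbitrary sample from a standard Gaussian
- rng = default_rng(1)
- n = 30
- x = rng.standard_normal(n)
- # apply (x - mean(x)) / (S / sqrt(n)); the result has (n - 1) degrees of freedom
- t_observations = (x - x.mean()) / (x.std(ddof=1) / sqrt(n))
- print(t_observations[:5])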
Which specific t-distribution you use for a given problem depends on the size of your sample.
SciPy provides tools for working with the t-distribution in the stats.t module. The t.pdf() function can be used to create a Student’s t-distribution with the specified degrees of freedom. The example below creates a t-distribution using the sample space from -5 to 5 and (10,000 - 1) degrees of freedom.
- # plot the t-distribution pdf
- from numpy import arange
- from matplotlib import pyplot
- from scipy.stats import t
- # define the distribution parameters
- sample_space = arange(-5, 5, 0.001)
- dof = len(sample_space) - 1
- # calculate the pdf
- pdf = t.pdf(sample_space, dof)
- # plot
- pyplot.plot(sample_space, pdf)
- pyplot.show()
Running the example creates and plots the t-distribution PDF. We can see the familiar bell shape, much like that of the normal distribution. A key difference is the fatter tails of the distribution (hard to see by eye here), highlighting the increased likelihood of observations in the tails compared to the Gaussian.
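At 9,999 degrees of freedom the difference from the Gaussian is negligible, so one way to make the fatter tails visible is to compare tail probabilities at a much smaller number of degrees of freedom (5 is an arbitrary choice here):
- # compare tail probabilities of the t-distribution and the Gaussian
- from scipy.stats import norm, t
- # probability of an observation more than 3 standard deviations above the mean
- print(norm.sf(3.0))
- # the same tail probability for a t-distribution with few degrees of freedom is larger
- print(t.sf(3.0, 5))
- # with many degrees of freedom, the t-distribution closely matches the Gaussian
- print(t.sf(3.0, 9999))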
The t.cdf() function can be used to create the cumulative density function for the t-distribution. The example below creates the CDF over the same sample space as before.
- # plot the t-distribution cdf
- from numpy import arange
- from matplotlib import pyplot
- from scipy.stats import t
- # define the distribution parameters
- sample_space = arange(-5, 5, 0.001)
- dof = len(sample_space) - 1
- # calculate the cdf
- cdf = t.cdf(sample_space, dof)
- # plot
- pyplot.plot(sample_space, cdf)
- pyplot.show()
Running the example, we see the familiar S-shaped curve as with the Gaussian distribution, although with slightly softer transitions from zero probability to one probability because of the fatter tails.
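As mentioned earlier, values from the t-distribution are often needed as parameters in significance tests. The inverse of the CDF, the percent point function t.ppf(), returns such critical values; the 95% two-tailed level below is an arbitrary choice for illustration:
- # look up a critical value from the t-distribution via the inverse cdf
- from scipy.stats import t
- dof = 10000 - 1
- # two-tailed 95% critical value (close to 1.96, as for the Gaussian)
- critical = t.ppf(0.975, dof)
- print(critical)
- # the cdf maps the critical value back to its cumulative probability
- print(t.cdf(critical, dof))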
The Chi-Squared distribution is denoted by the lowercase Greek letter chi (χ), pronounced “ki” as in “kite”, raised to the second power (χ²). It is easier to write Chi-Squared. Like the Student’s t-distribution, the Chi-Squared distribution is also used in statistical methods on data drawn from a Gaussian distribution to quantify uncertainty. For example, the Chi-Squared distribution is used in the Chi-Squared statistical test for independence. In fact, the Chi-Squared distribution is used in the derivation of the Student’s t-distribution. The Chi-Squared distribution has one parameter: the number of degrees of freedom, denoted k.
An observation in a Chi-Squared distribution is calculated as the sum of k squared observations drawn from a Gaussian distribution:
chi^2 = x_1^2 + x_2^2 + ... + x_k^2
Where chi^2 is an observation that has a Chi-Squared distribution, the x_i are observations drawn from a Gaussian distribution, and k is the number of x observations, which is also the number of degrees of freedom for the Chi-Squared distribution. Again, as with the Student’s t-distribution, data does not fit a Chi-Squared distribution; instead, observations are drawn from this distribution in the calculation of statistical methods for a sample of Gaussian data. SciPy provides the stats.chi2 module for calculating statistics for the Chi-Squared distribution. The chi2.pdf() function can be used to calculate the Chi-Squared distribution for a sample space between 0 and 50 with 20 degrees of freedom. Recall that the summed squared values must be positive, hence the need for a positive sample space.
- # plot the chi-squared pdf
- from numpy import arange
- from matplotlib import pyplot
- from scipy.stats import chi2
- # define the distribution parameters
- sample_space = arange(0, 50, 0.01)
- dof = 20
- # calculate the pdf
- pdf = chi2.pdf(sample_space, dof)
- # plot
- pyplot.plot(sample_space, pdf)
- pyplot.show()
Running the example calculates the Chi-Squared PDF and presents it as a line plot. With 20 degrees of freedom, we can see that the peak of the distribution sits just short of the value 20 on the sample space, and the expected value of the distribution is 20 itself. This is intuitive: each squared observation from a standard Gaussian has an expected value of 1, so the sum of the squared observations is, on average, roughly the number of degrees of freedom, in this case 20 (approximately 20 × 1). Although the distribution has a bell-like shape, it is not symmetric.
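We can check this intuition by simulation; in the sketch below, the number of simulated observations and the random seed are arbitrary choices:
- # simulate chi-squared observations as sums of squared Gaussian observations
- from numpy.random import default_rng
- from scipy.stats import chi2
- rng = default_rng(1)
- dof = 20
- # each row sums 20 squared standard Gaussian values into one chi-squared observation
- samples = (rng.standard_normal((10000, dof)) ** 2).sum(axis=1)
- # the simulated average is close to the theoretical mean of the distribution
- print(samples.mean())
- print(chi2.mean(dof))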
The chi2.cdf() function can be used to calculate the cumulative density function over the same sample space.
- # plot the chi-squared cdf
- from numpy import arange
- from matplotlib import pyplot
- from scipy.stats import chi2
- # define the distribution parameters
- sample_space = arange(0, 50, 0.01)
- dof = 20
- # calculate the cdf
- cdf = chi2.cdf(sample_space, dof)
- # plot
- pyplot.plot(sample_space, cdf)
- pyplot.show()
Running the example creates a plot of the cumulative density function for the Chi-Squared distribution. The plot shows the cumulative probability rising most quickly around the value 20, with a fat tail to the right of the distribution that would continue well beyond the end of the plot.
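Because the Chi-Squared distribution is typically used through its critical values, for example in tests for independence, the CDF and its inverse chi2.ppf() can be queried directly; the 95% level below is an arbitrary choice for illustration:
- # query the chi-squared cdf and its inverse for a critical value
- from scipy.stats import chi2
- dof = 20
- # cumulative probability of an observation at or below the number of degrees of freedom
- print(chi2.cdf(20, dof))
- # 95% critical value, roughly 31.4 for 20 degrees of freedom
- print(chi2.ppf(0.95, dof))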