• Gaussian and Summary Stats


    1. Gaussian Distribution

    2. Sample vs Population

    3. Test Dataset

    4. Central Tendency

    5. Variance

    6. Describing a Gaussian

    1.2 Gaussian Distribution

    Let’s look at a normal distribution. Below is some code to generate and plot an idealized Gaussian distribution.

    # generate and plot an idealized gaussian
    from numpy import arange
    from matplotlib import pyplot
    from scipy.stats import norm
    # x-axis for the plot
    x_axis = arange(-3, 3, 0.001)
    # y-axis as the gaussian (mean 0, standard deviation 1)
    y_axis = norm.pdf(x_axis, 0, 1)
    # plot data
    pyplot.plot(x_axis, y_axis)
    pyplot.show()
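
As a rough sketch of what norm.pdf() evaluates, the Gaussian density can be written out directly with NumPy. The helper name gaussian_pdf is hypothetical and used only for illustration; it assumes the usual density formula with mean mu and standard deviation sigma.

```python
# sketch: evaluate the Gaussian density by hand
# (gaussian_pdf is a hypothetical helper, not part of the original listing)
from numpy import arange, exp, pi, sqrt

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    # density of a Gaussian with mean mu and standard deviation sigma
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

# same x-axis as the plot
step = 0.001
x_axis = arange(-3, 3, step)
y_axis = gaussian_pdf(x_axis)
# the curve is highest at the mean
peak = y_axis.max()
# approximate area under the curve between -3 and 3 (simple Riemann sum)
area = y_axis.sum() * step
print('Peak: %.4f' % peak)
print('Area between -3 and 3: %.4f' % area)
```

The area comes out slightly below 1 because the x-axis stops at three standard deviations on either side of the mean and the tails are cut off.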

    1.3 Sample vs Population

    The data that we collect is called a data sample, whereas all possible data that could be collected is called the population.

    • Data Sample: A subset of observations from a group.
    • Data Population: All possible observations from a group.

    Two examples of data samples that you will encounter in machine learning include:

    • The train and test datasets.
    • The performance scores for a model.

    When using statistical methods, we often want to make claims about the population using only observations in the sample. Two clear examples of this include:

    • The training sample must be representative of the population of observations so that we can fit a useful model.
    • The test sample must be representative of the population of observations so that we can develop an unbiased evaluation of the model skill.
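
To make the idea of a sample standing in for a population more concrete, the sketch below draws samples of increasing size and compares each sample mean with the population mean. The population is an assumption made purely for illustration: a Gaussian with a mean of 50 and a standard deviation of 5.

```python
# sketch: larger samples tend to estimate the population mean more closely
from numpy.random import seed
from numpy.random import randn
from numpy import mean

# seed the random number generator
seed(1)
# assumed population parameters (illustrative values only)
pop_mean, pop_std = 50.0, 5.0
errors = {}
for n in [10, 100, 10000]:
    # draw a sample of n observations from the assumed population
    sample = pop_std * randn(n) + pop_mean
    # absolute difference between the sample mean and the population mean
    errors[n] = abs(mean(sample) - pop_mean)
    print('n=%d, sample mean=%.3f, error=%.3f' % (n, mean(sample), errors[n]))
```

The error generally shrinks as the sample grows, although any single run is not guaranteed to improve at every step.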

    1.4 Test Dataset

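    Before we explore some important summary statistics for data with a Gaussian distribution, we need a sample of data to work with. We can use the randn() NumPy function to generate a sample of random numbers drawn from a Gaussian distribution with a mean of 0 and a standard deviation of 1. Multiplying by 5 and adding 50 rescales the sample to a mean of 50 and a standard deviation of 5.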

    We can then plot the dataset using a histogram and look for the expected shape of the plotted data. The complete example is listed below.

    # generate a sample of random gaussians
    from numpy.random import seed
    from numpy.random import randn
    from matplotlib import pyplot
    # seed the random number generator
    seed(1)
    # generate univariate observations
    data = 5 * randn(10000) + 50
    # histogram of generated data
    pyplot.hist(data)
    pyplot.show()

    The default of 10 bins gives a blocky view of the data. Increasing the number of bins to 100 shows the Gaussian shape more clearly. The updated example is listed below.

    # generate a sample of random gaussians
    from numpy.random import seed
    from numpy.random import randn
    from matplotlib import pyplot
    # seed the random number generator
    seed(1)
    # generate univariate observations
    data = 5 * randn(10000) + 50
    # histogram of generated data with 100 bins
    pyplot.hist(data, bins=100)
    pyplot.show()
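
The effect of the bin count can also be inspected without a plot. The sketch below assumes that numpy.histogram() is an acceptable stand-in for the binning performed when plotting, and compares bin widths for 10 and 100 bins on the same data.

```python
# sketch: count observations per bin for a coarse and a fine histogram
from numpy.random import seed
from numpy.random import randn
from numpy import histogram

# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# same data, two different bin counts
counts_10, edges_10 = histogram(data, bins=10)
counts_100, edges_100 = histogram(data, bins=100)
print('Bin width with 10 bins: %.3f' % (edges_10[1] - edges_10[0]))
print('Bin width with 100 bins: %.3f' % (edges_100[1] - edges_100[0]))
# every observation lands in exactly one bin in both cases
print('Total counts:', counts_10.sum(), counts_100.sum())
```

Narrower bins reveal more of the shape of the distribution, at the cost of fewer observations per bin.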

    1.5 Central Tendency

    The central tendency of a distribution refers to the middle or typical value in the distribution, that is, the most common or most likely value. In the Gaussian distribution, the central tendency is called the mean, or more formally, the arithmetic mean, and is one of the two main parameters that define any Gaussian distribution.

    The mean of a sample is calculated as the sum of the observations divided by the number of observations in the sample.

    The example below demonstrates this on the test dataset developed in the previous section.

    # calculate the mean of a sample
    from numpy.random import seed
    from numpy.random import randn
    from numpy import mean
    # seed the random number generator
    seed(1)
    # generate univariate observations
    data = 5 * randn(10000) + 50
    # calculate mean
    result = mean(data)
    print('Mean: %.3f' % result)
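
As a small check on what mean() does, the sketch below computes the arithmetic mean by hand, as the sum of the observations divided by their count, and compares it with the NumPy result. The variable name manual is chosen here for illustration.

```python
# sketch: compute the sample mean by hand and compare with numpy
from numpy.random import seed
from numpy.random import randn
from numpy import mean

# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# sum of the observations divided by the number of observations
manual = data.sum() / len(data)
print('Manual mean: %.3f' % manual)
print('NumPy mean: %.3f' % mean(data))
```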

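    The median is another measure of central tendency. It is calculated by first sorting all data and then locating the middle value in the sample. If the sample has an even number of observations, the median is the mean of the two middle values.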

    The example below demonstrates this on the test dataset.

    # calculate the median of a sample
    from numpy.random import seed
    from numpy.random import randn
    from numpy import median
    # seed the random number generator
    seed(1)
    # generate univariate observations
    data = 5 * randn(10000) + 50
    # calculate median
    result = median(data)
    print('Median: %.3f' % result)
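
The sort-then-pick-the-middle procedure can be sketched by hand as follows. The variable names ordered and manual are illustrative, and the even-length branch assumes the common convention of averaging the two middle values.

```python
# sketch: compute the median by sorting and taking the middle value
from numpy.random import seed
from numpy.random import randn
from numpy import median
from numpy import sort

# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# sort the observations
ordered = sort(data)
n = len(ordered)
if n % 2 == 1:
    # odd number of observations: take the single middle value
    manual = ordered[n // 2]
else:
    # even number of observations: average the two middle values
    manual = (ordered[n // 2 - 1] + ordered[n // 2]) / 2.0
print('Manual median: %.3f' % manual)
print('NumPy median: %.3f' % median(data))
```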

    1.6 Variance

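    The variance of a distribution refers to how much, on average, observations vary or differ from the mean value. It is useful to think of the variance as a measure of the spread of a distribution. A low variance will have values grouped around the mean, whereas a high variance will have values spread out from the mean.

    We can demonstrate this by plotting idealized Gaussians with low and high variance. The complete example is listed below.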

    # generate and plot gaussians with different variance
    from numpy import arange
    from matplotlib import pyplot
    from scipy.stats import norm
    # x-axis for the plot
    x_axis = arange(-3, 3, 0.001)
    # plot low variance
    pyplot.plot(x_axis, norm.pdf(x_axis, 0, 0.5))
    # plot high variance
    pyplot.plot(x_axis, norm.pdf(x_axis, 0, 1))
    pyplot.show()

    Running the example plots two idealized Gaussian distributions: the blue curve has a low variance and is grouped tightly around the mean, and the orange curve has a higher variance and more spread.
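
Note that the third argument to norm.pdf() is the standard deviation, not the variance, so the two curves correspond to variances of 0.25 and 1. The sketch below draws random samples with the same two standard deviations (rather than evaluating the idealized curves) and measures how tightly each is grouped around the mean. The variable names narrow and wide are illustrative.

```python
# sketch: compare the spread of samples with low and high variance
from numpy.random import seed
from numpy.random import randn
from numpy import absolute
from numpy import mean
from numpy import var

# seed the random number generator
seed(1)
# samples with mean 0 and standard deviations 0.5 and 1.0
narrow = 0.5 * randn(10000)
wide = 1.0 * randn(10000)
for name, sample in [('std=0.5', narrow), ('std=1.0', wide)]:
    # fraction of observations no more than 1 unit away from the mean of 0
    within = mean(absolute(sample) <= 1.0)
    print('%s: variance=%.3f, fraction within 1 unit of the mean=%.3f' % (name, var(sample), within))
```

A larger share of the low-variance sample falls close to the mean, which is what the taller, narrower curve in the plot reflects.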

    The variance of a data sample drawn from a Gaussian distribution is calculated as the average squared difference of each observation in the sample from the sample mean:

    variance = 1 / n * sum((x_i - mean)^2), for i = 1 to n

    where n is the number of observations in the sample. The example below demonstrates calculating variance on the test problem.

    # calculate the variance of a sample
    from numpy.random import seed
    from numpy.random import randn
    from numpy import var
    # seed the random number generator
    seed(1)
    # generate univariate observations
    data = 5 * randn(10000) + 50
    # calculate variance
    result = var(data)
    print('Variance: %.3f' % result)
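
The sketch below computes the variance by hand as the average squared difference from the sample mean and compares it with var(). It also shows the ddof=1 argument of var(), which divides by n - 1 instead of n and is commonly used when estimating a population variance from a sample. The variable names manual and unbiased are illustrative.

```python
# sketch: compute the sample variance by hand and compare with numpy
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import var

# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
n = len(data)
# squared difference of each observation from the sample mean
squared_diffs = (data - mean(data)) ** 2
# divide by n: matches var() with its default ddof=0
manual = squared_diffs.sum() / n
# divide by n - 1: matches var() with ddof=1
unbiased = squared_diffs.sum() / (n - 1)
print('Manual variance (divide by n): %.3f' % manual)
print('NumPy var(): %.3f' % var(data))
print('Manual variance (divide by n - 1): %.3f' % unbiased)
print('NumPy var(ddof=1): %.3f' % var(data, ddof=1))
```

With 10,000 observations the two versions differ only slightly; the gap matters more for small samples.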

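    Because the variance is expressed in squared units of the observations, it can be hard to interpret. Taking the square root of the variance gives the standard deviation, which is in the original units of the observations. The standard deviation is often written as s or as the Greek lowercase letter sigma (σ). It can be calculated directly in NumPy for an array via the std() function. The example below demonstrates the calculation of the standard deviation on the test problem.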

    # calculate the standard deviation of a sample
    from numpy.random import seed
    from numpy.random import randn
    from numpy import std
    # seed the random number generator
    seed(1)
    # generate univariate observations
    data = 5 * randn(10000) + 50
    # calculate standard deviation
    result = std(data)
    print('Standard Deviation: %.3f' % result)

    Running the example calculates and prints the standard deviation of the sample. The value matches the square root of the variance and is very close to 5.0, the value specified in the definition of the problem.
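
The relationship between the two statistics can be checked directly. This is a minimal sketch that takes the square root of var() and compares it with std() on the same data; the variable name root is illustrative.

```python
# sketch: the standard deviation is the square root of the variance
from numpy.random import seed
from numpy.random import randn
from numpy import sqrt
from numpy import std
from numpy import var

# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# square root of the variance
root = sqrt(var(data))
print('Square root of variance: %.3f' % root)
print('Standard deviation: %.3f' % std(data))
```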

  • Original article: https://blog.csdn.net/u011868279/article/details/125438911