• About Examples of Statistics in Machine Learning


    After reading this post, you will know:

    • That exploratory data analysis, data summarization, and data visualization can be used to help frame your predictive modeling problem and better understand the data.
    • That statistical methods can be used to clean and prepare data for modeling.
    • That statistical hypothesis tests and estimation statistics can aid in model selection and in presenting the skill and predictions of final models.

    1.1 Overview

    In this post, we look at 10 examples of where statistical methods are used in an applied machine learning project. They demonstrate that a working knowledge of statistics is essential for successfully working through a predictive modeling problem.

    1. Problem Framing
    2. Data Understanding
    3. Data Cleaning
    4. Data Selection
    5. Data Preparation
    6. Model Evaluation
    7. Model Configuration
    8. Model Selection
    9. Model Presentation
    10. Model Predictions

    1.2 Problem Framing

    Perhaps the point of biggest leverage in a predictive modeling problem is the framing of the problem.

    Statistical methods that can aid in the exploration of the data during the framing of a problem include:

    • Exploratory Data Analysis. Summarization and visualization in order to explore ad hoc views of the data.
    • Data Mining. Automatic discovery of structured relationships and patterns in the data.

    1.3 Data Understanding

    Data understanding means having an intimate grasp of both the distributions of variables and the relationships between variables.

    Two large branches of statistical methods are used to aid in understanding data:

    • Summary Statistics. Methods used to summarize the distribution and relationships between variables using statistical quantities.
    • Data Visualizations. Methods used to summarize the distribution and relationships between variables using visualizations such as charts, plots, and graphs.
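
As a minimal sketch of the summary-statistics branch, the snippet below computes a few distribution summaries and a Pearson correlation using only the Python standard library. The `hours` and `scores` values and the `pearson` helper are made up for illustration and are not taken from any real dataset.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: hours studied and exam score for six students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
scores = [52.0, 55.0, 61.0, 64.0, 70.0, 73.0]

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "stdev": statistics.stdev(scores),  # sample standard deviation
    "r": pearson(hours, scores),        # relationship between the two variables
}
print(summary)
```

The first three entries describe the distribution of a single variable, while `r` summarizes the linear relationship between two variables.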

    1.4 Data Cleaning

    Observations from a domain are rarely pristine. Common problems include:

    • Data corruption
    • Data errors
    • Data loss

    The process of identifying and repairing issues with the data is called data cleaning. Statistical methods used for data cleaning include:

    • Outlier detection. Methods for identifying observations that are far from the expected value in a distribution.
    • Imputation. Methods for repairing or filling in corrupt or missing values in observations.
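
The two methods above can be sketched as follows. This is one simple choice among many: missing values are filled with the median, and outliers are flagged with the interquartile-range (IQR) rule. The `readings` list, the helper names, and the `k=1.5` multiplier are assumptions made for illustration.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical sensor readings with one missing value and one extreme value.
readings = [10.1, 9.8, None, 10.3, 10.0, 55.0, 9.9, 10.2]

filled = impute_median(readings)
low, high = iqr_bounds(filled)
outliers = [v for v in filled if v < low or v > high]

print(filled)
print(outliers)
```

The median is used for imputation here because, unlike the mean, it is barely affected by the extreme reading.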

    1.5 Data Selection

    The process of reducing the scope of data to those elements that are most useful for making predictions is called data selection. Two types of statistical methods that are used for data selection include:

    • Data Sampling. Methods to systematically create smaller representative samples from larger datasets.
    • Feature Selection. Methods to automatically identify those variables that are most relevant to the outcome variable.
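
A minimal sketch of both ideas, assuming a simple random sample without replacement and a correlation-based feature ranking. The feature names, the values, and the `rank_features` helper are hypothetical, and absolute Pearson correlation is only one of several possible relevance scores.

```python
import math
import random

def abs_correlation(xs, ys):
    """Absolute Pearson correlation between a feature column and the target."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return abs(cov / math.sqrt(var_x * var_y))

def rank_features(features, target):
    """Sort feature names by relevance to the target, strongest first."""
    strength = {name: abs_correlation(col, target) for name, col in features.items()}
    return sorted(strength, key=strength.get, reverse=True)

# Data sampling: draw 100 row indices out of 1000 without replacement.
rng = random.Random(42)  # fixed seed so the sample is reproducible
sample = rng.sample(range(1000), k=100)

# Feature selection: hypothetical housing features against a price target.
features = {
    "size": [50, 60, 80, 100, 120],
    "age": [30, 5, 18, 22, 9],
}
price = [150, 180, 240, 310, 355]
ranking = rank_features(features, price)
print(ranking)
```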

    1.6 Data Preparation

    Data often cannot be used directly for modeling. Some transformation is usually required to change the shape or structure of the data so that it better suits the chosen framing of the problem or the learning algorithm. Data preparation is performed using statistical methods. Some common examples include:

    • Scaling. Methods such as standardization and normalization.
    • Encoding. Methods such as integer encoding and one hot encoding.
    • Transforms. Methods such as power transforms like the Box-Cox method.
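
The scaling and encoding methods above can be sketched in a few lines of standard-library Python. The helper names and the sample values are assumptions for illustration. In practice a library implementation would normally be used.

```python
import statistics

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def normalize(values):
    """Rescale linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(labels):
    """Encode each label as a binary vector, one column per sorted category."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

# Hypothetical numeric and categorical columns.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
colors = ["red", "green", "blue", "green"]

z_scores = standardize(heights)
unit_range = normalize(heights)
encoded = one_hot(colors)  # columns are ordered: blue, green, red

print(z_scores)
print(unit_range)
print(encoded)
```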

    1.7 Model Evaluation

    A crucial part of a predictive modeling problem is evaluating a learning method. This often requires the estimation of the skill of the model when making predictions on data not seen during the training of the model.

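    Planning this process of training and evaluating a predictive model is called experimental design, which is a whole subfield of statistical methods.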

    • Experimental Design. Methods to design systematic experiments to compare the effect of independent variables on an outcome, such as the choice of a machine learning algorithm on prediction accuracy.

    As part of implementing an experimental design, resampling methods are used to make economical use of the available data when estimating the skill of the model.

    • Resampling Methods. Methods for systematically splitting a dataset into subsets for the purposes of training and evaluating a predictive model.
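
One widely used resampling method is k-fold cross-validation. The sketch below only produces the index splits, and the `k_fold_indices` name and its parameters are assumptions for illustration. Training and scoring a model on each split is left out.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th is used for training.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), len(train_idx))
```

Each observation appears in exactly one test fold, so every row is used for evaluation once and for training k - 1 times.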

    1.8 Model Configuration

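    A given machine learning algorithm often has a suite of hyperparameters that allow the learning method to be tailored to a specific problem. Two classes of statistical methods can be used to interpret and compare the results of different hyperparameter configurations:

    • Statistical Hypothesis Tests. Methods that quantify the likelihood of observing the result given an assumption or expectation about the result (presented using critical values and p-values).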
    • Estimation Statistics. Methods that quantify the uncertainty of a result using confidence intervals.
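
As a hedged illustration of a hypothesis test in this setting, the sketch below compares two hyperparameter configurations that were scored on the same cross-validation folds. It uses an exact sign-flip permutation test, which is one option among several (a paired Student's t-test is another). The fold scores are invented for the example.

```python
import itertools

def paired_permutation_test(scores_a, scores_b):
    """Two-sided exact sign-flip test on paired score differences.

    Null hypothesis: both configurations perform the same, so each
    paired difference is equally likely to be positive or negative.
    Returns the p-value.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate every way of flipping the sign of each paired difference.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        permuted = abs(sum(s * d for s, d in zip(signs, diffs)))
        if permuted >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        total += 1
    return extreme / total

# Hypothetical accuracy on the same 8 folds for two configurations.
config_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85]
config_b = [0.78, 0.77, 0.80, 0.79, 0.80, 0.76, 0.79, 0.81]

p_value = paired_permutation_test(config_a, config_b)
print(p_value)
```

A small p-value suggests the observed difference would be unlikely if the two configurations were equally skillful. Exhaustive enumeration is only practical for a small number of folds, since the number of sign patterns doubles with each additional pair.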

    1.9 Model Selection

    The process of selecting one method as the solution is called model selection.

    As with model configuration, two classes of statistical methods can be used to interpret the estimated skill of different models for the purposes of model selection.

    • Statistical Hypothesis Tests. Methods that quantify the likelihood of observing the result given an assumption or expectation about the result (presented using critical values and p-values).
    • Estimation Statistics. Methods that quantify the uncertainty of a result using confidence intervals.
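
For the estimation-statistics side, a minimal sketch is a percentile bootstrap confidence interval on the mean per-fold difference in skill between two candidate models. The scores, the number of resamples, and the `bootstrap_ci` helper are assumptions for illustration.

```python
import random
import statistics

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical accuracy of two models on the same 8 folds.
model_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85]
model_b = [0.78, 0.77, 0.80, 0.79, 0.80, 0.76, 0.79, 0.81]
diffs = [a - b for a, b in zip(model_a, model_b)]

lower, upper = bootstrap_ci(diffs)
print(f"mean difference 95% CI: [{lower:.4f}, {upper:.4f}]")
```

If the interval lies entirely above zero, the data are consistent with model A being more skillful than model B. The width of the interval shows how uncertain that difference is.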

    1.10 Model Presentation

    Methods from the field of estimation statistics can be used to quantify the uncertainty in the estimated skill of the machine learning model through the use of tolerance intervals and confidence intervals.

    • Estimation Statistics. Methods that quantify the uncertainty in the skill of a model via confidence intervals.
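
For a classification model, one simple way to present skill with its uncertainty is a confidence interval on accuracy. The sketch below assumes the normal approximation to the binomial (the Wald interval), which is reasonable only for fairly large test sets and accuracies not too close to 0 or 1. The counts are hypothetical.

```python
import math
from statistics import NormalDist

def accuracy_interval(correct, total, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for accuracy."""
    acc = correct / total
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * math.sqrt(acc * (1 - acc) / total)
    # Clip to the valid range of a proportion.
    return max(0.0, acc - half_width), min(1.0, acc + half_width)

# Hypothetical result: 88 correct predictions on a 100-row test set.
low, high = accuracy_interval(correct=88, total=100)
print(f"accuracy = 0.88, 95% CI = [{low:.3f}, {high:.3f}]")
```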

    1.11 Model Predictions

    A prediction made by a final model on new data carries uncertainty. We can use methods from the field of estimation statistics to quantify this uncertainty, such as confidence intervals and prediction intervals.

    • Estimation Statistics. Methods that quantify the uncertainty for a prediction via prediction intervals.
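
A rough sketch of a prediction interval for a simple linear regression is shown below. It is deliberately simplified: it assumes Gaussian residuals, uses a normal quantile rather than Student's t, and ignores uncertainty in the fitted coefficients, so the interval is somewhat too narrow for small samples. The data and helper names are made up for illustration.

```python
import math
from statistics import NormalDist, fmean

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

def prediction_interval(xs, ys, x_new, confidence=0.95):
    """Approximate interval: y_hat +/- z * residual standard deviation."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Residual standard deviation with n - 2 degrees of freedom.
    s = math.sqrt(sum(r * r for r in residuals) / (len(xs) - 2))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    y_hat = intercept + slope * x_new
    return y_hat - z * s, y_hat, y_hat + z * s

# Hypothetical training data: hours studied vs. exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
scores = [52.0, 55.0, 61.0, 64.0, 70.0, 73.0]

low, y_hat, high = prediction_interval(hours, scores, x_new=7.0)
print(f"prediction = {y_hat:.1f}, 95% PI = [{low:.1f}, {high:.1f}]")
```

Unlike a confidence interval on model skill, this interval describes the range in which a single new observation is expected to fall.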
  • Original article: https://blog.csdn.net/u011868279/article/details/125438215