All code in this article was run in PyCharm.
The companion code for this article has been uploaded.
Hand-Written Linear Regression, Part 1: Implementing the Linear Regression Class
Hand-Written Linear Regression, Part 2: Single-Feature Linear Regression
Hand-Written Linear Regression, Part 3: Multi-Feature Linear Regression
Hand-Written Linear Regression, Part 4: Non-Linear Regression

# 5. Data Preprocessing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from linear_regression import LinearRegression
First, the usual trio of imports: numpy, pandas, and matplotlib. Then import the LinearRegression class from the linear_regression module.
data = pd.read_csv('../data/world-happiness-report-2017.csv')
train_data = data.sample(frac=0.8)          # random 80% of rows for training
test_data = data.drop(train_data.index)     # the remaining 20% for testing
input_param_name = 'Economy..GDP.per.Capita.'
output_param_name = 'Happiness.Score'
x_train = train_data[[input_param_name]].values    # (m, 1) column vectors
y_train = train_data[[output_param_name]].values
x_test = test_data[[input_param_name]].values
y_test = test_data[[output_param_name]].values
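As a side note, the `sample`/`drop` pair above is a compact way to make a disjoint random split. A small self-contained illustration (the DataFrame and column names here are made up, not from the happiness dataset):

```python
# Toy illustration of the train/test split used above:
# sample(frac=0.8) picks a random 80% of rows, drop() keeps the rest.
import numpy as np
import pandas as pd

df = pd.DataFrame({'gdp': np.arange(10, dtype=float),
                   'score': np.arange(10, dtype=float) * 2})

train = df.sample(frac=0.8, random_state=0)   # random 80% of the rows
test = df.drop(train.index)                   # the remaining 20%

print(len(train), len(test))                       # 8 2
print(train.index.intersection(test.index).empty)  # True: no overlap
```

Because `drop` removes exactly the sampled index labels, the two sets are guaranteed not to overlap.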
plt.scatter(x_train,y_train,label='Train data')
plt.scatter(x_test,y_test,label='Test data')
plt.xlabel(input_param_name)
plt.ylabel(output_param_name)
plt.title('Happy')
plt.legend()
plt.show()
Output:

num_iterations = 500
learning_rate = 0.01
linear_regression = LinearRegression(x_train,y_train)
(theta,cost_history) = linear_regression.train(learning_rate,num_iterations)
print('Initial cost:', cost_history[0])
print('Final cost:', cost_history[-1])
Output:
Initial cost: 14.633306098916812
Final cost: 0.2275173194286417
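The `LinearRegression` implementation itself lives in linear_regression.py (Part 1 of this series) and isn't shown here. As a rough sketch only, under my own assumptions (MSE cost divided by 2m, batch gradient descent, a prepended bias column), `train()` could look like this:

```python
# Sketch only: the real class is in linear_regression.py; the function
# name, cost definition, and update rule here are my assumptions.
import numpy as np

def train_sketch(x, y, learning_rate, num_iterations):
    m = x.shape[0]
    X = np.hstack((np.ones((m, 1)), x))   # prepend bias column
    theta = np.zeros((X.shape[1], 1))
    cost_history = []
    for _ in range(num_iterations):
        error = X @ theta - y                          # (m, 1) residuals
        cost_history.append((error.T @ error).item() / (2 * m))
        theta -= learning_rate / m * (X.T @ error)     # batch gradient step
    return theta, cost_history

# quick self-check on synthetic data: y = 3x + 1 plus small noise
rng = np.random.default_rng(0)
x = rng.random((50, 1))
y = 3 * x + 1 + 0.01 * rng.standard_normal((50, 1))
theta, cost_history = train_sketch(x, y, 0.5, 500)
print(cost_history[0] > cost_history[-1])   # True: the cost decreases
```

This matches the interface used above: `train` returns the learned `theta` together with the per-iteration `cost_history`, which is what the plots below visualize.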
plt.plot(range(num_iterations),cost_history)
plt.xlabel('Iter')
plt.ylabel('cost')
plt.title('GD')
plt.show()
Output:

predictions_num = 100
x_predictions = np.linspace(x_train.min(),x_train.max(),predictions_num).reshape(predictions_num,1)
y_predictions = linear_regression.predict(x_predictions)
plt.scatter(x_train,y_train,label='Train data')
plt.scatter(x_test,y_test,label='Test data')
plt.plot(x_predictions,y_predictions,'r',label = 'Prediction')
plt.xlabel(input_param_name)
plt.ylabel(output_param_name)
plt.title('Happy')
plt.legend()
plt.show()
Output:

Every machine-learning workflow includes a data preprocessing stage; in many real-world tasks, preprocessing is even more complex and more important than designing the model itself.
The functions in this part map the raw data into a suitable range, typically [0, 1] or [-1, 1]. Humans can interpret raw data, but computers only work with numbers, and machine-learning models only understand features.
def normalize(features):
    features_normalized = np.copy(features).astype(float)
    # per-column mean and standard deviation
    features_mean = np.mean(features, 0)
    features_deviation = np.std(features, 0)
    # center the data (only meaningful with more than one example)
    if features.shape[0] > 1:
        features_normalized -= features_mean
    # avoid division by zero for constant (zero-std) columns
    features_deviation[features_deviation == 0] = 1
    features_normalized /= features_deviation
    return features_normalized, features_mean, features_deviation
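To see concretely what `normalize` computes, here is my own toy example reproducing its core steps on a small matrix (per-column z-score standardization, with the zero-std guard for constant columns):

```python
# Toy example of the standardization performed by normalize():
# each column is centered to zero mean and scaled to unit std.
import numpy as np

features = np.array([[1.0, 10.0, 5.0],
                     [2.0, 20.0, 5.0],
                     [3.0, 30.0, 5.0]])

mean = features.mean(axis=0)       # per-column mean: [2., 20., 5.]
std = features.std(axis=0)         # per-column std; third column is 0
std[std == 0] = 1                  # same guard as in normalize()
normalized = (features - mean) / std

print(normalized.mean(axis=0))     # each column now has (near-)zero mean
print(normalized[:, 2])            # constant column becomes all zeros
```

Note that despite the name, this is standardization (zero mean, unit standard deviation) rather than min-max scaling into [0, 1].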
In this preprocessing step only normalization is used, i.e. z-score standardization: each feature is shifted to zero mean and scaled to unit standard deviation.
def prepare_for_training(data, polynomial_degree=0, sinusoid_degree=0, normalize_data=True):
    num_examples = data.shape[0]
    data_processed = np.copy(data)
    # standardize the features if requested
    features_mean = 0
    features_deviation = 0
    data_normalized = data_processed
    if normalize_data:
        (data_normalized, features_mean, features_deviation) = normalize(data_processed)
        data_processed = data_normalized
    # optionally append sinusoidal features (used for non-linear regression)
    if sinusoid_degree > 0:
        sinusoids = generate_sinusoids(data_normalized, sinusoid_degree)
        data_processed = np.concatenate((data_processed, sinusoids), axis=1)
    # optionally append polynomial features
    if polynomial_degree > 0:
        polynomials = generate_polynomials(data_normalized, polynomial_degree, normalize_data)
        data_processed = np.concatenate((data_processed, polynomials), axis=1)
    # prepend a column of ones for the bias (intercept) term
    data_processed = np.hstack((np.ones((num_examples, 1)), data_processed))
    return data_processed, features_mean, features_deviation
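The last `np.hstack` line is worth a closer look: it prepends a column of ones so that the first model parameter acts as the intercept. A minimal stand-alone demonstration of just that step (the toy data here is made up):

```python
# Toy example of the final step of prepare_for_training:
# prepending a bias column of ones to the feature matrix.
import numpy as np

data = np.array([[0.5], [1.5], [2.5]])          # 3 examples, 1 feature
num_examples = data.shape[0]
data_processed = np.hstack((np.ones((num_examples, 1)), data))

print(data_processed.shape)   # (3, 2): one extra bias column
print(data_processed[0])      # [1.  0.5]
```

With the bias column in place, the hypothesis can be written compactly as the matrix product X @ theta, where theta[0] is the intercept.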
The non-linear regression walkthrough starts from this line in the Non-linearRegression.py file:
data = pd.read_csv('../data/non-linear-regression-x-y.csv')