Machine Learning in Practice, Tutorial Series 2: Linear Regression 1 (Hands-On Project, Theory, and Source Code Walkthrough)


All code in this article is run in PyCharm.
The code resources accompanying this article have been uploaded.

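Linear Regression from Scratch 1: Implementing the Linear Regression Class
Linear Regression from Scratch 2: Single-Feature Linear Regression
Linear Regression from Scratch 3: Multi-Feature Linear Regression
Linear Regression from Scratch 4: Nonlinear Regression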

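1. Overall Workflow

Obtain the data.
Preprocess the data (normalization, standardization).
Find the combination of x and k (the weights) that fits the true values most accurately.
Use the gradient descent (GD) algorithm.
Use GD to update k until the loss converges (a library such as Scikit-learn hides this fitting step from us; in this series we implement it by hand). Once gradient descent is in place, the linear regression can be fitted.
Compare the single-feature and multi-feature cases and implement both in code.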
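The steps above can be sketched end to end with NumPy alone. This is a minimal illustration under assumptions, not the article's class: the toy data (y = 3x + 2), the learning rate 0.1, and the 500 iterations are made up for the example, and a column of ones is prepended so that the first weight plays the role of an intercept.

```python
import numpy as np

# Hypothetical toy data (not from the article): y = 3x + 2, no noise.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = 3.0 * x + 2.0

# Preprocessing: standardize the feature to zero mean and unit variance.
x_norm = (x - x.mean(axis=0)) / x.std(axis=0)

# Assumption: prepend a column of ones so the first weight is the intercept.
X = np.hstack([np.ones((x_norm.shape[0], 1)), x_norm])

# Gradient descent on the halved mean squared error.
theta = np.zeros((X.shape[1], 1))
alpha = 0.1
for _ in range(500):
    delta = X @ theta - y                          # prediction minus true value
    theta -= alpha * (X.T @ delta) / X.shape[0]    # one full-batch update

final_loss = float(((X @ theta - y) ** 2).mean() / 2)
print(theta.ravel(), final_loss)
```

Because the feature is standardized, the loss shrinks quickly, and the first weight settles near the mean of y (11 for this toy data).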

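2. Initialization

The initialization function converts the data into the format and value range that machine learning requires and pairs every sample with its label, i.e. its true value. Depending on the task, the data is also split into training, validation, and test sets.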

import numpy as np
from utils.features import prepare_for_training


class LinearRegression:
    def __init__(self, data, labels, polynomial_degree=0, sinusoid_degree=0, normalize_data=True):
        # Preprocess the data; this also returns the per-feature mean and deviation.
        (data_processed, features_mean, features_deviation) = prepare_for_training(
            data, polynomial_degree, sinusoid_degree, normalize_data=normalize_data)

        # Keep everything needed later as instance attributes.
        self.data = data_processed
        self.labels = labels
        self.features_mean = features_mean
        self.features_deviation = features_deviation
        self.polynomial_degree = polynomial_degree
        self.sinusoid_degree = sinusoid_degree
        self.normalize_data = normalize_data

        # One parameter per feature column, initialized to zero.
        num_features = self.data.shape[1]
        self.theta = np.zeros((num_features, 1))

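Import the packages.
Define a linear regression class.
Define the initialization function, which receives the values needed later: the data, the labels, the highest degree of the polynomial features, the degree of the nonlinear (sinusoidal) transform, and whether to normalize.
Run the preprocessing step.
Store the resulting values as instance attributes (the Python self.x = x idiom).
self.data.shape[1] is the number of columns in the data, so num_features is the number of features.
theta holds the k values, one per feature; here they are all initialized to zero.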
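The article does not show utils.features.prepare_for_training, so the following is only a hypothetical stand-in to make the shapes concrete. The name prepare_for_training_sketch is invented here, it ignores polynomial_degree and sinusoid_degree, and the prepended column of ones is an assumption about what the real helper does.

```python
import numpy as np

def prepare_for_training_sketch(data, normalize_data=True):
    """Hypothetical stand-in for utils.features.prepare_for_training."""
    data = np.asarray(data, dtype=float)
    features_mean = data.mean(axis=0)
    features_deviation = data.std(axis=0)
    data_processed = data.copy()
    if normalize_data:
        # Guard against division by zero for constant columns.
        safe_deviation = np.where(features_deviation == 0, 1.0, features_deviation)
        data_processed = (data - features_mean) / safe_deviation
    # Assumption: a bias column of ones is prepended to the features.
    ones = np.ones((data_processed.shape[0], 1))
    data_processed = np.hstack([ones, data_processed])
    return data_processed, features_mean, features_deviation

data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
processed, mean, deviation = prepare_for_training_sketch(data)

num_features = processed.shape[1]        # 3 columns: bias plus two features
theta = np.zeros((num_features, 1))      # one weight per column
print(processed.shape, theta.shape)
```

Under these assumptions theta gets one more entry than the raw data has columns, because the bias column counts as a feature.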

3. Training

3.1 Prediction Function

@staticmethod
def hypothesis(data,theta):   
    predictions = np.dot(data,theta)
    return predictions

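The prediction function computes the matrix product of the data and the weight parameters theta. The data passed in here has already been preprocessed. On the first call theta still holds its initial value, which in this implementation is all zeros (np.zeros in the initializer); you could instead initialize it randomly, for example from a normal distribution.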
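A small worked example may help with the shapes involved. The numbers below are made up for illustration; the first column of data is assumed to be a bias column of ones.

```python
import numpy as np

def hypothesis(data, theta):
    # (num_examples, num_features) @ (num_features, 1) -> (num_examples, 1)
    return np.dot(data, theta)

# Hypothetical values: 3 samples, 2 columns (bias column plus one feature).
data = np.array([[1.0, 2.0],
                 [1.0, 3.0],
                 [1.0, 4.0]])
theta = np.array([[0.5],
                  [2.0]])

predictions = hypothesis(data, theta)
print(predictions.ravel())   # each entry is 0.5 + 2.0 * feature

# With an all-zero theta, as right after initialization, every prediction is 0.
zero_predictions = hypothesis(data, np.zeros((2, 1)))
```

Each row of data produces exactly one prediction, so the result always has the same number of rows as the input.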

3.2 Parameter Update Function

def gradient_step(self, alpha):
    num_examples = self.data.shape[0]
    prediction = LinearRegression.hypothesis(self.data, self.theta)
    delta = prediction - self.labels
    theta = self.theta
    theta = theta - alpha*(1/num_examples)*(np.dot(delta.T, self.data)).T
    self.theta = theta

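num_examples: the number of samples (data.shape[0] is the number of rows, data.shape[1] the number of columns).
prediction: the predicted values, obtained by calling the hypothesis function on the LinearRegression class and passing in the data and the parameters.
delta: the predicted value minus the true value:
$\delta^k = h_\theta(x^k) - y^k$
theta: the parameters learned during training; $\theta_1$ corresponds to $x_1$, and so on.
theta update rule, with $m$ = num_examples: $\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{k=1}^{m}\delta^k x_j^k$ (for a mini-batch of 10 samples starting at index $i$, the factor becomes $\frac{1}{10}$ and the sum runs from $k=i$ to $i+9$).
self.theta = theta: stores the updated parameters back on the instance; the function itself returns nothing.

The parameter update function calls the prediction function, computes delta, and updates theta according to the update rule.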
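To see that the vectorized line in gradient_step matches the per-parameter update rule, the sketch below computes one step both ways on random data. The data, the seed, and alpha = 0.1 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(6, 3))      # 6 samples, 3 features (made-up data)
labels = rng.normal(size=(6, 1))
theta = np.zeros((3, 1))
alpha = 0.1
m = data.shape[0]

# Error term for every sample: prediction minus true value.
delta = data @ theta - labels

# Vectorized update, written the same way as in gradient_step.
theta_vectorized = theta - alpha * (1 / m) * (delta.T @ data).T

# The same update written out one parameter at a time.
theta_loop = theta.copy()
for j in range(data.shape[1]):
    grad_j = sum(delta[k, 0] * data[k, j] for k in range(m)) / m
    theta_loop[j, 0] = theta[j, 0] - alpha * grad_j

print(np.allclose(theta_vectorized, theta_loop))
```

The matrix product delta.T @ data performs the sum over samples for all parameters at once, which is why the class needs no explicit loop.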

3.3 Loss Function

def cost_function(self, data, labels):
    """Compute the loss."""
    num_examples = data.shape[0]
    # Use the data argument so the loss can also be computed on a test set.
    delta = LinearRegression.hypothesis(data, self.theta) - labels
    cost = (1/2)*np.dot(delta.T, delta)/num_examples
    return cost[0][0]

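The loss function measures how far the predicted values are from the true values.
The loss is not accumulated per sample; it is averaged over the samples.
num_examples: the number of samples.
delta: the predicted value minus the true value.
cost: the sum of the squared entries of delta, divided by 2 and then by the number of samples.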
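A worked example with made-up numbers shows the arithmetic. With errors of 1, 0, and 2 over three samples, the loss is (1 + 0 + 4) / 2 / 3 = 5/6.

```python
import numpy as np

# Hypothetical predictions and true values for 3 samples.
predictions = np.array([[2.0], [4.0], [7.0]])
labels = np.array([[1.0], [4.0], [5.0]])

num_examples = labels.shape[0]
delta = predictions - labels                 # errors: 1, 0, 2

# delta.T @ delta is a 1x1 matrix holding the sum of squared errors.
cost = (1 / 2) * np.dot(delta.T, delta) / num_examples
cost_value = cost[0][0]

# The same quantity as half the mean squared error.
half_mse = np.mean(delta ** 2) / 2
print(cost_value, half_mse)
```

The indexing cost[0][0] is needed only because the dot product of a row vector and a column vector returns a 1x1 array rather than a scalar.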

3.4 Gradient Descent Function

def gradient_descent(self,alpha,num_iterations):
    cost_history = []
    for _ in range(num_iterations):
        self.gradient_step(alpha)
        cost_history.append(self.cost_function(self.data,self.labels))
    return cost_history

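The gradient descent function simply runs the parameter update function and the loss function many times. Each iteration uses the whole dataset to perform one parameter update and then records the loss.
cost_history stores the loss from every iteration. cost_function computes that loss by getting predictions from the data and measuring their difference from the true values.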
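The following standalone sketch mirrors that loop outside the class so the behavior of cost_history can be checked directly. The function signature, the toy data (labels = 3 + 2x), and the settings alpha = 0.1 with 100 iterations are assumptions for illustration.

```python
import numpy as np

def gradient_descent(data, labels, alpha, num_iterations):
    """Full-batch gradient descent; returns theta and the loss per iteration."""
    m = data.shape[0]
    theta = np.zeros((data.shape[1], 1))
    cost_history = []
    for _ in range(num_iterations):
        # One parameter update using every sample.
        delta = data @ theta - labels
        theta = theta - alpha * (data.T @ delta) / m
        # Record the loss after the update.
        new_delta = data @ theta - labels
        cost_history.append(float((new_delta.T @ new_delta)[0, 0] / (2 * m)))
    return theta, cost_history

# Hypothetical data: a bias column plus one centered feature.
data = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]])
labels = np.array([[1.0], [3.0], [5.0]])

theta, cost_history = gradient_descent(data, labels, alpha=0.1, num_iterations=100)
print(cost_history[0], cost_history[-1])
```

With a learning rate this small the recorded loss never increases from one iteration to the next; plotting cost_history is a common way to check that training is converging.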

3.5 Training Function

def train(self,alpha,num_iterations = 500):
    cost_history = self.gradient_descent(alpha,num_iterations)
    return self.theta,cost_history

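alpha: the learning rate.
num_iterations: the number of iterations.
cost_history: all the recorded loss values.

The training function calls the gradient descent function and returns the learned parameters together with the loss history. It also sets the default number of iterations; one iteration is one full pass over the dataset.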

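4. Testing

The test set also needs its own functions for computing the loss and making predictions; it takes no part in training or parameter updates.

4.1 Getting the Current Loss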

def get_cost(self, data, labels):
    # Preprocess the incoming data with the same settings used for training.
    data_processed = prepare_for_training(
        data,
        self.polynomial_degree,
        self.sinusoid_degree,
        self.normalize_data
    )[0]
    return self.cost_function(data_processed, labels)

prepare_for_training is the data preprocessing function.
cost_function is the function that computes the loss.

4.2 Test-Set Prediction Function

def predict(self, data):
    """Predict regression values with the trained parameters."""
    # Preprocess the incoming data with the same settings used for training.
    data_processed = prepare_for_training(
        data,
        self.polynomial_degree,
        self.sinusoid_degree,
        self.normalize_data
    )[0]

    predictions = LinearRegression.hypothesis(data_processed, self.theta)

    return predictions

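The data is preprocessed first, then the processed data and theta are passed to the prediction function to obtain the predictions. This function is meant specifically for the testing stage.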