• Machine Learning (study notes)


    There is no studying without going crazy

    Studying always drives us crazy

    Notes from Andrew Ng's machine learning course series (吴恩达机器学习系列课程)

    Definition

    Machine Learning

    A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

    In other words: the program learns from experience E, for task T under performance measure P, when its performance on T as measured by P improves with E. (The Chinese translation of this definition really rhymes.)

    Supervised Learning (监督学习)

    the "right answers" are given for each training example

    Regression problem

    trying to predict a continuous-valued output

    In this kind of problem, the training set supplies the correct output value for each example, and the learning algorithm learns to predict that value for new inputs.

    [figure]

    Classification

    discrete-valued output (e.g., zero or one)

    In a classification problem you again provide labeled examples, but unlike regression the output to predict is a discrete category rather than a continuous value.
    [figure]

    Unsupervised Learning

    Clustering

    A clustering algorithm can, for example, break the data into two separate clusters.

    We are not told what the data means or what its labels are; the learning algorithm has to group the data into different clusters on its own.

    A classic example is the cocktail party problem: given recordings that mix background music and voices, an algorithm separates the two audio sources.
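
    As a rough illustration of clustering (not from the course, which uses Octave), here is a minimal k-means sketch in Python; scikit-learn, the toy data, and the cluster count are all assumptions:

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    # Toy 2-D data with two obvious groups (made-up numbers).
    X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                  [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

    # Ask k-means to break the data into two separate clusters.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)           # cluster index for each point, e.g. [0 0 0 1 1 1]
    print(kmeans.cluster_centers_)  # one center per cluster
    ```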

    Study

    Model representation (模型概述)

    We use the housing-price example from the course.
    [figure]

    We are given a training set.
    [figure]
    As you can see, we define:
    m = the number of training examples,
    x's = the "input" variables / features,
    y's = the "output" variable / "target" variable,
    (x, y) = one training example,
    (x^(i), y^(i)) = the i-th training example. (The superscript i here is not exponentiation; the i in parentheses is just an index into the training set.)

    [figure]
    With a training set like our training set of housing prices, we feed it to our learning algorithm. The job of the learning algorithm is to output a function, which by convention is usually denoted lowercase h (for hypothesis), as sketched below.
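
    For univariate linear regression the hypothesis takes the form hθ(x) = θ0 + θ1·x. A minimal Python sketch, with a made-up toy training set:

    ```python
    # Training set: x = size of the house, y = price (made-up numbers).
    x = [2104, 1416, 1534, 852]   # "input" features, m = 4 training examples
    y = [460, 232, 315, 178]      # "output" / target values

    def h(theta0, theta1, x_i):
        """Hypothesis for univariate linear regression: h(x) = theta0 + theta1 * x."""
        return theta0 + theta1 * x_i

    print(h(0.0, 0.2, x[0]))  # prediction for the first training example x^(1)
    ```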

    Cost function

    [figure]
    In this chart, we want the difference between h(x) and y to be small. One thing we can do is try to minimize the squared difference between the output of the hypothesis and the actual price of the house.

    [figure]
    What we want to do is minimize, over θ0 and θ1, the function J(θ0, θ1).

    The squared error cost function is probably the most commonly used cost function for regression problems; it is written out below.
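
    Written out, the squared error cost function from the course is:

    $$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

    The 1/2 is there only to make the derivative cleaner; minimizing J with or without it gives the same θ.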

    How to use it

    [figure]
    Here is something we will use.

    To build intuition about how to use it, we simplify by setting the parameter θ0 equal to 0, so we have only one parameter, θ1.

    [figure]
    On the left, each line we fit (each choice of θ1) maps to one point on the plot of J(θ1) on the right.

    Solving the problem

    [figure]
    Here is our problem formulation as usual, with the hypothesis, parameters, cost function, and optimization objective.

    [figure]
    Plotting that function over both parameters, we finally get this plot.

    [figure]
    Here is an example of a contour figure

    Gradient descent

    It turns out gradient descent is a more general algorithm, and is used not only in linear regression. It’s actually used all over the place in machine learning.

    [figure]
    Here is the problem setup.
    We have some function J(θ0, θ1); maybe it is the cost function from linear regression. We want to come up with an algorithm for minimizing it as a function of θ0 and θ1.

    For brevity and succinctness of notation, we will pretend we have only these two parameters for the rest of this video.


    The idea of gradient descent:

    We start off with some initial guesses for θ0 and θ1.
    Then, in gradient descent, we keep changing θ0 and θ1 a little bit to try to reduce J(θ0, θ1).

    Summary of gradient descent

    [figure]
    Here is the gradient descent algorithm that we saw last time; the update rule is restated below.
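
    For reference, the update rule in that figure is:

    $$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad (j = 0, 1)$$

    with both parameters updated simultaneously: compute both right-hand sides first, then assign.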

    To convey these intuitions, we use a slightly simpler example where we minimize a function of just one parameter.
    So we have a cost function J of just one parameter, θ1.

    [figure]

    Suppose we start θ1 at the red point, to the right of the minimum. The derivative (d/dθ1)J(θ1) is the slope of the tangent line at that point, and here it is a positive number. So the update is θ1 := θ1 − α·(positive number). The learning rate α is always a positive number, so θ1 decreases.
    And that is the right behavior: θ1 moves in the direction that gets it closer to the minimum.

    Similarly, when the starting point is to the left of the minimum, the derivative is negative, so the update moves θ1 to the right, again toward the minimum.

    [figure]
    Now suppose you initialize θ1 exactly at a local minimum, so it is already at a local optimum. It turns out that at a local optimum the derivative equals zero, so the gradient descent update becomes θ1 := θ1 − α·0.
    What this means is that if you are already at a local optimum, the update leaves θ1 unchanged.

    在这里插入图片描述

    Gradient descent for linear regression

    [figure]

    [figure]
    Here is gradient descent for linear regression, which repeats the updates until convergence.
    [figure]

    [figure]
    This kind of algorithm is sometimes called batch gradient descent, because every step uses the entire batch of training examples; a sketch follows.
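
    A minimal sketch of batch gradient descent for univariate linear regression in Python (the course itself uses Octave; the toy data, α, and iteration count below are made up):

    ```python
    def batch_gradient_descent(x, y, alpha=0.01, iters=1000):
        """Batch gradient descent for h(x) = theta0 + theta1 * x."""
        m = len(x)
        theta0, theta1 = 0.0, 0.0
        for _ in range(iters):
            # Prediction errors over the whole batch of m examples.
            errors = [theta0 + theta1 * x[i] - y[i] for i in range(m)]
            # Partial derivatives of J with respect to theta0 and theta1.
            grad0 = sum(errors) / m
            grad1 = sum(errors[i] * x[i] for i in range(m)) / m
            # Simultaneous update of both parameters.
            theta0 -= alpha * grad0
            theta1 -= alpha * grad1
        return theta0, theta1

    # Toy data lying exactly on y = 1 + 2x; should recover theta ~ (1, 2).
    print(batch_gradient_descent([0, 1, 2, 3], [1, 3, 5, 7], alpha=0.1, iters=5000))
    ```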

    Matrices and vectors (basic knowledge)

    First, we learn what a matrix is.
    [figure]
    Next, let us talk about how to refer to specific elements of a matrix.

    [figure]
    What is a vector?
    A vector turns out to be a special case of a matrix:
    a matrix that has only 1 column.

    [figure]
    About indexing:
    vectors can be 1-indexed or 0-indexed. On a machine, indices start from zero, so in code the 0-indexed vector is usually the more convenient notation. A short sketch follows.
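
    A small indexing sketch, assuming numpy (Octave, used in the course, is 1-indexed; Python is 0-indexed):

    ```python
    import numpy as np

    A = np.array([[1402, 191],
                  [1371, 821],
                  [949, 1437]])   # a 3x2 matrix

    print(A.shape)   # (3, 2): 3 rows, 2 columns
    print(A[0, 1])   # 191 -- the entry written A_{1,2} in the course's 1-indexed notation

    y = np.array([460, 232, 315])  # a vector: a matrix with only 1 column
    print(y[0])      # 460 -- written y_1 in 1-indexed notation
    ```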

    Addition and scalar multiplication

    [figure]
    It turns out you can only add two matrices that are of the same dimensions.

    [figure]

    [figure]
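
    A quick sketch of both operations, assuming numpy; the matrices are made up:

    ```python
    import numpy as np

    A = np.array([[1.0, 0.0], [2.0, 5.0], [3.0, 1.0]])
    B = np.array([[4.0, 0.5], [2.0, 5.0], [0.0, 1.0]])

    print(A + B)  # element-wise addition; requires A and B to have the same dimensions
    print(3 * A)  # scalar multiplication: every element is multiplied by 3
    ```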

    Matrix-vector multiplication

    [figure]
    [figure]

    Matrix-matrix multiplication

    [figure]
    [figure]
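
    A numpy sketch of both kinds of product (the matrices are made up; `@` is Python's matrix-multiplication operator):

    ```python
    import numpy as np

    A = np.array([[1, 3], [4, 0], [2, 1]])  # 3x2
    v = np.array([1, 5])                    # length-2 vector

    print(A @ v)  # matrix-vector product: the length-3 vector [16 4 7]

    B = np.array([[1, 3], [0, 1]])          # 2x2
    print(A @ B)  # matrix-matrix product: a 3x2 matrix (A times each column of B)
    ```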

    Matrix multiplication properties

    [figure]
    [figure]

    Identity Matrix

    The identity matrix has ones along the diagonal and zeros everywhere else.

    [figure]
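
    A sketch checking the multiplication properties and the identity numerically with numpy (made-up matrices):

    ```python
    import numpy as np

    A = np.array([[1, 2], [3, 4]])
    B = np.array([[0, 1], [1, 0]])
    C = np.array([[2, 0], [0, 2]])
    I = np.eye(2)  # 2x2 identity: ones on the diagonal, zeros elsewhere

    print(np.array_equal(A @ B, B @ A))              # False: multiplication is not commutative
    print(np.array_equal((A @ B) @ C, A @ (B @ C)))  # True: multiplication is associative
    print(np.array_equal(A @ I, A), np.array_equal(I @ A, A))  # True True: A*I = I*A = A
    ```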

    Inverse and transpose

    [figure]
    It turns out only square matrices have inverses
    Matrices that don’t have an inverse are “singular” or “degenerate”

    [figure]
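
    A numpy sketch of the inverse and the transpose (the matrix is made up and chosen to be non-singular):

    ```python
    import numpy as np

    A = np.array([[3.0, 4.0], [2.0, 16.0]])

    A_inv = np.linalg.inv(A)      # inverse: A @ A_inv gives the identity
    print(np.round(A @ A_inv, 6))

    print(A.T)                    # transpose: rows become columns

    # A singular ("degenerate") matrix has no inverse;
    # np.linalg.inv on it raises numpy.linalg.LinAlgError.
    ```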

    Multiple features

    [figure]
    We had a single feature x, the size of the house, and we wanted to use it to predict y, the price of the house; hθ was the form of our hypothesis.
    But now imagine that we had not only the size of the house as a feature with which to predict the price, but also the number of bedrooms, the number of floors, and the age of the home in years. This would give us a lot more information with which to predict the price.

    [figure]
    If we have n features, then rather than summing over our four features, we sum over all n features.
    [figure]
    To simplify the expression, we define x0 = 1, and the final hypothesis can then be written compactly, as shown below.
    [figure]
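
    With x0 = 1, the hypothesis from the course becomes

    $$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \theta^T x$$

    which in numpy is a single dot product (the θ values below are made up):

    ```python
    import numpy as np

    theta = np.array([80.0, 0.1, 0.01, 3.0, -2.0])  # theta_0 .. theta_4 (made up)
    x = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])     # x[0] = 1 by convention

    print(theta @ x)  # h_theta(x) = theta^T x
    ```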

    Gradient descent for multiple variables

    [figure]
    [figure]
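
    For reference, the multivariate update rule shown in those figures is:

    $$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \qquad (j = 0, 1, \ldots, n)$$

    with all θj updated simultaneously; the univariate rule is the special case n = 1.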

    Gradient descent in practice (多元梯度下降法)

    I: Feature Scaling (特征缩放)

    We repeatedly update each parameter θj according to θj minus α times the derivative term.
    A useful thing to do is to scale the features. Concretely, if you instead define the feature x1 to be the size of the house divided by 2000, and define x2 to be the number of bedrooms divided by five, then the contours of the cost function J become much less skewed and look more like circles. If you run gradient descent on a cost function like this, it can be shown mathematically that gradient descent finds a much more direct path to the global minimum, rather than taking a much more convoluted path.

    [figure]
    Each feature should stay roughly in the range −3 to +3; ranges much larger or much smaller than that should be rescaled (see the sketch below).
    [figure]
    [figure]
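
    A minimal sketch of mean normalization, x := (x − μ) / s, assuming numpy (s can be the standard deviation, as here, or the range max − min):

    ```python
    import numpy as np

    # Columns: size of the house, number of bedrooms (made-up numbers).
    X = np.array([[2104.0, 5.0], [1416.0, 3.0], [1534.0, 3.0], [852.0, 2.0]])

    mu = X.mean(axis=0)      # per-feature mean
    s = X.std(axis=0)        # per-feature spread

    X_scaled = (X - mu) / s  # every feature now has mean 0 and roughly unit spread
    print(X_scaled)
    ```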

    II: Learning rate

    The goal in choosing the learning rate is to make sure that gradient descent is working correctly.
    [figure]
    This plot shows the value of the cost function after each iteration of gradient descent. If gradient descent is working properly, then J(θ) should decrease after every iteration.

    [figure]
    [figure]
    In summary, to choose α, try a range of values spaced roughly 3× apart (e.g., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1), as sketched below:

    [figure]
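
    A self-contained sketch of that procedure on the toy data from the earlier gradient descent example: run a few iterations for each candidate α and keep only those for which J decreases at every step (all names here are made up):

    ```python
    def cost(theta0, theta1, x, y):
        """Squared error cost J(theta0, theta1)."""
        m = len(x)
        return sum((theta0 + theta1 * x[i] - y[i]) ** 2 for i in range(m)) / (2 * m)

    def decreases_every_iteration(x, y, alpha, iters=100):
        """Run gradient descent; report whether J went down on every iteration."""
        m = len(x)
        theta0 = theta1 = 0.0
        prev = cost(theta0, theta1, x, y)
        for _ in range(iters):
            errors = [theta0 + theta1 * x[i] - y[i] for i in range(m)]
            theta0 -= alpha * sum(errors) / m
            theta1 -= alpha * sum(errors[i] * x[i] for i in range(m)) / m
            j = cost(theta0, theta1, x, y)
            if j > prev:      # J went up: alpha is too large
                return False
            prev = j
        return True

    x, y = [0, 1, 2, 3], [1, 3, 5, 7]
    for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
        print(alpha, decreases_every_iteration(x, y, alpha))
    ```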

    Features and polynomial regression

    Using the example of selling a house

    We have two features, called frontage and depth. You might build a linear regression model like this:
    [figure]
    where frontage is your first feature x_1 and depth is your second feature x_2. But when you apply linear regression, you do not have to use only the features x_1 and x_2 you are given; you can create new features yourself. If I want to predict the price of a house, what really determines the size of the house is the land area I own. So I might create a new feature x = frontage × depth (the area of a rectangle is the product of the lengths of its sides), and then select a hypothesis that uses just that one feature, the land area. Depending on what insight you have into a particular problem, rather than just taking the features frontage and depth that you happened to start with, defining new features can sometimes give you a better model.

    Closely related to the idea of choosing your features is this idea called polynomial regression.
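
    A small sketch of both ideas, assuming numpy; the features and the choice of powers are illustrative:

    ```python
    import numpy as np

    frontage = np.array([40.0, 30.0, 25.0])
    depth = np.array([70.0, 60.0, 50.0])

    # Hand-crafted feature: land area = frontage * depth.
    area = frontage * depth

    # Polynomial regression: treat powers of a feature as extra features, so
    # h(x) = t0 + t1*x + t2*x^2 + t3*x^3 is linear in the features [1, x, x^2, x^3].
    X = np.column_stack([np.ones_like(area), area, area**2, area**3])
    print(X.shape)  # (3, 4): ready for the usual linear regression machinery
    ```

    Because x, x², and x³ have wildly different ranges, feature scaling becomes especially important with polynomial features.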

    Normal equation

    Its essence is to take the partial derivatives of the cost function and set them to zero, so that the minimizing parameters can be solved for directly. Moreover, the same idea carries over from the univariate to the multivariate case.

    [figure]
    Why do we add x_0? It just carries the constant term: in a line a·x + b, x_0 = 1 plays the role that multiplies b.
    [figure]
    Here is the change to the features x: they are stacked into a matrix.
    [figure]
    This equation, restated below, directly computes the parameter values that minimize the cost (though these notes do not derive why it works).
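
    The equation in that figure is the normal equation:

    $$\theta = (X^T X)^{-1} X^T y$$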

    The advantage of the normal equation is that the features can take any values; you do not have to think about their ranges (no feature scaling is needed).
    [figure]
    Then, when should you choose gradient descent and when should you choose the normal equation? Here is some advice, and a numpy sketch follows.
    [figure]
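
    A numpy sketch of the normal equation on made-up data:

    ```python
    import numpy as np

    # Design matrix X: first column is x0 = 1, the rest are the features.
    X = np.array([[1.0, 2104.0, 5.0],
                  [1.0, 1416.0, 3.0],
                  [1.0, 1534.0, 3.0],
                  [1.0, 852.0, 2.0]])
    y = np.array([460.0, 232.0, 315.0, 178.0])

    # Normal equation: theta = (X^T X)^{-1} X^T y.
    theta = np.linalg.inv(X.T @ X) @ X.T @ y
    # np.linalg.pinv is the safer choice if X^T X is singular ("degenerate").
    print(theta)
    ```

    Rule of thumb from the course: the normal equation inverts an n×n matrix, roughly O(n³) work, so for a very large number of features gradient descent is usually the better choice.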

  • Original post: https://blog.csdn.net/yjnlovepy/article/details/133199387