What is Machine Learning?

“Field of study that gives computers the ability to learn without being explicitly programmed.”

— Arthur Samuel

Supervised Learning

Refers to algorithms that learn a mapping from input x to output y, learning from examples that come with the given "right answers" (labels).

  • Regression (predict numeric value)
  • Classification (predict categories)

Unsupervised Learning

Only the input x is given, with no output labels. The algorithm has to find structure in the data on its own.

  • Clustering algorithm

Linear Regression

The training set is the data used to train the model. It includes:

  • x = input variable (feature)
  • y = output variable (target)

Cost Function

Definition: the squared-error cost function,

$$J(w, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$$

Objective: find the parameters that minimize the cost, $\min_{w,b} J(w, b)$.

Instead of using a 3D surface plot to visualize the cost function, we can use a contour plot.

import numpy as np

def computeCost(X, y, theta):
    # theta.T is the transpose of the matrix theta
    # np.power(A, B) raises each element of A to the power B
    inner = np.power((X * theta.T) - y, 2)
    return np.sum(inner) / (2 * len(X))
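
As an illustration of the contour-plot idea above, here is a hedged sketch (the toy data, grid ranges, and plotting calls are my own assumptions, not from the original notes) that evaluates computeCost over a grid of (b, w) values and draws the contours:

import matplotlib.pyplot as plt

# Toy data: y = 2x, with a column of ones for the intercept term.
X = np.matrix([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.matrix([[2], [4], [6], [8]])

# Evaluate the cost on a grid of (b, w) values.
b_vals = np.linspace(-4, 4, 50)
w_vals = np.linspace(-2, 6, 50)
J = np.zeros((len(b_vals), len(w_vals)))
for i, b in enumerate(b_vals):
    for j, w in enumerate(w_vals):
        J[i, j] = computeCost(X, y, np.matrix([[b, w]]))

# Each contour line connects (b, w) pairs with equal cost.
plt.contour(w_vals, b_vals, J, levels=30)
plt.xlabel('w')
plt.ylabel('b')
plt.show()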

Gradient Descent

Say we want to find the minimum of a function (note that it does not have to be the cost function of a linear regression; gradient descent works for functions of any form).

Outline:

Pseudocode

# 1. Start with some initial w, b (for example, w = 0 and b = 0).
# 2. Keep changing w, b to reduce J(w, b), moving in the direction of steepest descent at the current point.
# 3. Settle at or near a minimum.

Algorithm: repeat until convergence,

$$w := w - \alpha \frac{\partial J(w, b)}{\partial w}, \qquad b := b - \alpha \frac{\partial J(w, b)}{\partial b}$$

  • $\alpha$ is the learning rate.
  • If $\alpha$ is too small, gradient descent may be too slow.
  • If $\alpha$ is too large, gradient descent may overshoot and never reach the minimum (diverge).

Gradient descent can reach a local minimum even with a fixed learning rate, because the derivative (and therefore the update step) shrinks as the parameters approach the minimum.
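
To make the procedure concrete, here is a minimal sketch of gradient descent on a simple one-variable function (the function f(w) = (w − 3)² and the learning rate 0.1 are illustrative choices, not from the course material):

def f(w):
    # An arbitrary convex function chosen for illustration; its minimum is at w = 3.
    return (w - 3) ** 2

def df(w):
    # Derivative of f with respect to w.
    return 2 * (w - 3)

w = 0.0      # 1. start with some w
alpha = 0.1  # learning rate
for i in range(100):
    w = w - alpha * df(w)  # 2. keep changing w to reduce f

print(w)  # 3. settles at or near the minimum, w ≈ 3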

Example

For the cost function in the linear regression setting, the partial derivatives work out so that each update becomes:

repeat until convergence {

$$w := w - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}$$

$$b := b - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$

}

def gradientDescent(X, y, theta, alpha, iters):
    # alpha is the learning rate, iters is the number of iterations
    temp = np.matrix(np.zeros(theta.shape))
    # np.zeros(theta.shape) = [0., 0.]; temp holds the updated parameters as a matrix
    parameters = int(theta.ravel().shape[1])
    # theta.ravel() flattens theta to one dimension; .shape[1] counts how many parameters there are
    cost = np.zeros(iters)
    # initialize the cost history to zeros, one entry per iteration

    for i in range(iters):
        error = (X * theta.T) - y

        for j in range(parameters):
            term = np.multiply(error, X[:, j])  # multiply the error by the training data; term gives the partial derivative
            temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))  # update theta

        theta = temp
        cost[i] = computeCost(X, y, theta)  # record the cost of each iteration

    return theta, cost
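
A minimal usage sketch (the toy data, learning rate, and iteration count below are illustrative assumptions, not from the original notes):

# Toy data: one feature plus a column of ones for the intercept term.
X = np.matrix([[1, 1], [1, 2], [1, 3], [1, 4]])  # m = 4 examples
y = np.matrix([[2], [4], [6], [8]])              # targets (y = 2x)
theta = np.matrix(np.zeros((1, 2)))              # initial parameters [b, w]

final_theta, cost_history = gradientDescent(X, y, theta, alpha=0.1, iters=500)
print(final_theta)       # approaches [0, 2]
print(cost_history[-1])  # final cost, close to 0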

Linear Regression with Multiple Features (Multiple Linear Regression)

Vectorization (and its benefits)

w = np.array([1, 2, -3])
b = 4
x = np.array([10, 20, 30])
f = np.dot(w, x) + b  # f = w·x + b, computed with a single vectorized dot product instead of a loop
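
To make the benefit concrete, here is a rough sketch (the array size and timing code are illustrative, not from the original notes) comparing an explicit Python loop with the vectorized np.dot:

import time
import numpy as np

n = 1_000_000
w = np.random.rand(n)
x = np.random.rand(n)
b = 4

# Explicit loop: one multiply-add per feature, executed in Python.
start = time.time()
f_loop = 0.0
for j in range(n):
    f_loop += w[j] * x[j]
f_loop += b
loop_time = time.time() - start

# Vectorized: NumPy performs the multiply-adds in optimized native code.
start = time.time()
f_vec = np.dot(w, x) + b
vec_time = time.time() - start

print(loop_time, vec_time)  # the vectorized version is typically orders of magnitude faster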

Normal Equation: An Alternative to Gradient Descent

  • Works only for linear regression; it does not apply to most other algorithms
  • Solves for $\theta$ directly, without iterations
  • Does not generalize to other learning algorithms
  • Quite slow if the number of features is large (over ~10,000)
  • Often used in machine learning libraries that implement linear regression

A good article about the Normal Equation (in Chinese): 正规方程详细推导过程 ("Detailed derivation of the normal equation").

Derivation of the Normal Equation

Definition 1: Matrix derivatives

For a function $f: \mathbb{R}^{m \times n} \to \mathbb{R}$ that maps matrices to real numbers, define its partial derivative with respect to the matrix $A$ as

$$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$$

Definition 2: Trace

For a square matrix $A \in \mathbb{R}^{n \times n}$, the trace is the sum of its diagonal entries:

$$\operatorname{tr} A = \sum_{i=1}^{n} A_{ii}$$

For the training set, define the design matrix $X$ as the $m \times n$ matrix whose rows are the training inputs (if the intercept term $x_0 = 1$ is also included, it becomes an $m \times (n+1)$ matrix):

$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$$

Similarly, let:

$$\vec{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$

be the vector of target values of the training set.

The cost function for linear regression can then be written as

$$J(\theta) = \frac{1}{2}(X\theta - \vec{y})^T(X\theta - \vec{y})$$

To minimize the cost function, take its derivative with respect to $\theta$:

$$\nabla_\theta J(\theta) = X^T X \theta - X^T \vec{y}$$

Notice that $\nabla_\theta J(\theta)$ is an $n \times 1$ matrix, the same shape as $\theta$.

By setting this derivative equal to 0, we obtain the normal equation:

$$X^T X \theta = X^T \vec{y} \quad \Longrightarrow \quad \theta = (X^T X)^{-1} X^T \vec{y}$$

# Normal equation
def normalEqn(X, y):
    theta = np.linalg.inv(X.T @ X) @ X.T @ y
    # X.T @ X is equivalent to X.T.dot(X); @ (or .dot()) denotes matrix multiplication
    return theta

final_theta2 = normalEqn(X, y)
final_theta2
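
As a sanity check (my own addition, not part of the original notes), the closed-form result can be compared against NumPy's least-squares solver, which solves the same problem without explicitly inverting $X^T X$:

# np.linalg.lstsq avoids forming the explicit inverse, which is numerically
# more stable when X.T @ X is close to singular.
theta_lstsq, residuals, rank, sv = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)
print(np.asarray(final_theta2).ravel())
print(theta_lstsq.ravel())  # the two solutions should agree up to numerical precision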

Feature Scaling

  • Mean normalization: $x := \dfrac{x - \mu}{x_{max} - x_{min}}$, where $\mu$ is the mean of the feature.

  • Z-score normalization: $x := \dfrac{x - \mu}{\sigma}$, where $\sigma$ is the standard deviation of the feature (a code sketch of both follows this list).
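
A minimal NumPy sketch of the two normalizations (the feature matrix below is made up for illustration):

import numpy as np

# Made-up feature matrix: each column is one feature (e.g. size in square feet, number of bedrooms).
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])

mu = X.mean(axis=0)                 # per-feature mean
sigma = X.std(axis=0)               # per-feature standard deviation
x_range = X.max(axis=0) - X.min(axis=0)

X_mean_norm = (X - mu) / x_range    # mean normalization
X_zscore = (X - mu) / sigma         # z-score normalization

print(X_mean_norm)
print(X_zscore)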

Checking Convergence of Gradient Descent

Plot the learning curve (the cost J versus the iteration number). If gradient descent is working properly, J should decrease on every iteration and eventually level off.

Values of the learning rate to try: 0.01, 0.03, 0.1, 0.3, 1, …
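
A short sketch of such learning curves (it reuses the gradientDescent function and the toy X, y from the examples above, which are my own illustrative data; the alpha values follow the list just given):

import matplotlib.pyplot as plt

for alpha in [0.01, 0.03, 0.1]:
    theta0 = np.matrix(np.zeros((1, 2)))
    _, cost_history = gradientDescent(X, y, theta0, alpha, iters=200)
    plt.plot(cost_history, label=f'alpha = {alpha}')  # J versus iteration number

plt.xlabel('iteration')
plt.ylabel('cost J')
plt.legend()
plt.show()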

Polynomial Regression

How do we decide which features to use? One common option is to create new features from existing ones, for example powers of a feature (x, x², x³), and then fit a linear model on the expanded feature set, as sketched below.
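
A brief sketch (the feature values, the synthetic target, and the polynomial degree are chosen only for illustration) of building polynomial features and solving with the normal equation from above:

import numpy as np

# One raw feature, e.g. house size; values are made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_poly = 1 + 2 * x + 0.5 * x ** 2          # synthetic target with a quadratic trend

# Build polynomial features: a column of ones (intercept), x, and x^2.
X_poly = np.column_stack([np.ones_like(x), x, x ** 2])

# Feature scaling matters here if gradient descent is used, because x^2 has a much
# larger range than x; the normal equation itself does not require scaling.
theta_poly = np.linalg.inv(X_poly.T @ X_poly) @ X_poly.T @ y_poly
print(theta_poly)   # recovers approximately [1, 2, 0.5]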