What is Machine Learning?

“Field of study that gives computers the ability to learn without being explicitly programmed.”

— Arthur Samuel

Supervised Learning

Refers to algorithms that learn a mapping from input x to output y, learning from examples that come with the given "right answers" (labels).

  • Regression (predict numeric value)
  • Classification (predict categories)

Unsupervised Learning

Only the input x is given, with no output labels. The algorithm has to find structure in the data on its own.

  • Clustering algorithm

Linear Regression

The training set is the data used to train the model. It includes:

  • x = input variable (feature)
  • y = output variable (target)

Cost Function

Definition: the squared-error cost function,

$$J(w, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$$

Objective: find the parameters that minimize the cost, $\min_{w,b} J(w, b)$.

Instead of using a 3D surface plot to visualize the cost function, we can use a contour plot.

import numpy as np

def computeCost(X, y, theta):
    # theta.T is the transpose of the matrix theta
    # np.power(A, B) raises each element of A to the power B
    inner = np.power((X * theta.T) - y, 2)
    return np.sum(inner) / (2 * len(X))
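
As an illustration of the contour-plot idea above, here is a hedged sketch (the toy data, grid ranges, and plotting calls are my own assumptions, not from the original notes) that evaluates computeCost over a grid of (b, w) values and draws the contours:

import matplotlib.pyplot as plt

# Toy data: y = 2x, with a column of ones for the intercept term.
X = np.matrix([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.matrix([[2], [4], [6], [8]])

# Evaluate the cost on a grid of (b, w) values.
b_vals = np.linspace(-4, 4, 50)
w_vals = np.linspace(-2, 6, 50)
J = np.zeros((len(b_vals), len(w_vals)))
for i, b in enumerate(b_vals):
    for j, w in enumerate(w_vals):
        J[i, j] = computeCost(X, y, np.matrix([[b, w]]))

# Each contour line connects (b, w) pairs with equal cost.
plt.contour(w_vals, b_vals, J, levels=30)
plt.xlabel('w')
plt.ylabel('b')
plt.show()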

Gradient Descent

Say we want to find the minimum of a function (note that it does not have to be the cost function of a linear regression; gradient descent works for functions of any form).

Outline:

Pseudocode

# 1. Start with some initial w, b (for example, w = 0 and b = 0).
# 2. Keep changing w, b to reduce J(w, b), moving in the direction of steepest descent at the current point.
# 3. Settle at or near a minimum.

Algorithm: repeat until convergence,

$$w := w - \alpha \frac{\partial J(w, b)}{\partial w}, \qquad b := b - \alpha \frac{\partial J(w, b)}{\partial b}$$

  • $\alpha$ is the learning rate.
  • If $\alpha$ is too small, gradient descent may be too slow.
  • If $\alpha$ is too large, gradient descent may overshoot and never reach the minimum (diverge).

Gradient descent can reach a local minimum even with a fixed learning rate, because the derivative (and therefore the update step) shrinks as the parameters approach the minimum.
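
To make the procedure concrete, here is a minimal sketch of gradient descent on a simple one-variable function (the function f(w) = (w − 3)² and the learning rate 0.1 are illustrative choices, not from the course material):

def f(w):
    # An arbitrary convex function chosen for illustration; its minimum is at w = 3.
    return (w - 3) ** 2

def df(w):
    # Derivative of f with respect to w.
    return 2 * (w - 3)

w = 0.0      # 1. start with some w
alpha = 0.1  # learning rate
for i in range(100):
    w = w - alpha * df(w)  # 2. keep changing w to reduce f

print(w)  # 3. settles at or near the minimum, w ≈ 3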

Example

For the cost function in the linear regression setting, the partial derivatives work out so that each update becomes:

repeat until convergence {

$$w := w - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}$$

$$b := b - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$

}

def gradientDescent(X, y, theta, alpha, iters):
    # alpha is the learning rate, iters is the number of iterations
    temp = np.matrix(np.zeros(theta.shape))
    # np.zeros(theta.shape) = [0., 0.]; temp holds the updated parameters as a matrix
    parameters = int(theta.ravel().shape[1])
    # theta.ravel() flattens theta to one dimension; .shape[1] counts how many parameters there are
    cost = np.zeros(iters)
    # initialize the cost history to zeros, one entry per iteration

    for i in range(iters):
        error = (X * theta.T) - y

        for j in range(parameters):
            term = np.multiply(error, X[:, j])  # multiply the error by the training data; term gives the partial derivative
            temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))  # update theta

        theta = temp
        cost[i] = computeCost(X, y, theta)  # record the cost of each iteration

    return theta, cost
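
A minimal usage sketch (the toy data, learning rate, and iteration count below are illustrative assumptions, not from the original notes):

# Toy data: one feature plus a column of ones for the intercept term.
X = np.matrix([[1, 1], [1, 2], [1, 3], [1, 4]])  # m = 4 examples
y = np.matrix([[2], [4], [6], [8]])              # targets (y = 2x)
theta = np.matrix(np.zeros((1, 2)))              # initial parameters [b, w]

final_theta, cost_history = gradientDescent(X, y, theta, alpha=0.1, iters=500)
print(final_theta)       # approaches [0, 2]
print(cost_history[-1])  # final cost, close to 0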

Linear Regression with Multiple Features (Multiple Linear Regression)

Vectorization (and its benefits)

w = np.array([1, 2, -3])
b = 4
x = np.array([10, 20, 30])
f = np.dot(w, x) + b  # f = w·x + b, computed with a single vectorized dot product instead of a loop
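
To make the benefit concrete, here is a rough sketch (the array size and timing code are illustrative, not from the original notes) comparing an explicit Python loop with the vectorized np.dot:

import time
import numpy as np

n = 1_000_000
w = np.random.rand(n)
x = np.random.rand(n)
b = 4

# Explicit loop: one multiply-add per feature, executed in Python.
start = time.time()
f_loop = 0.0
for j in range(n):
    f_loop += w[j] * x[j]
f_loop += b
loop_time = time.time() - start

# Vectorized: NumPy performs the multiply-adds in optimized native code.
start = time.time()
f_vec = np.dot(w, x) + b
vec_time = time.time() - start

print(loop_time, vec_time)  # the vectorized version is typically orders of magnitude faster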

Normal Equation: An Alternative to Gradient Descent

  • Works only for linear regression; it does not apply to most other algorithms
  • Solves for $\theta$ directly, without iterations
  • Does not generalize to other learning algorithms
  • Quite slow if the number of features is large (over ~10,000)
  • Often used in machine learning libraries that implement linear regression

A good article about the Normal Equation (in Chinese): 正规方程详细推导过程 ("Detailed derivation of the normal equation").

Derivation of the Normal Equation

Definition 1: Matrix derivatives

For a function $f: \mathbb{R}^{m \times n} \to \mathbb{R}$ that maps matrices to real numbers, define its partial derivative with respect to the matrix $A$ as

$$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$$

Definition 2: Trace

For a square matrix $A \in \mathbb{R}^{n \times n}$, the trace is the sum of its diagonal entries:

$$\operatorname{tr} A = \sum_{i=1}^{n} A_{ii}$$

For the training set, define the design matrix $X$ as the $m \times n$ matrix whose rows are the training inputs (if the intercept term $x_0 = 1$ is also included, it becomes an $m \times (n+1)$ matrix):

$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$$

Similarly, let:

$$\vec{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$

be the vector of target values of the training set.

The cost function for linear regression can then be written as

$$J(\theta) = \frac{1}{2}(X\theta - \vec{y})^T(X\theta - \vec{y})$$

To minimize the cost function, take its derivative with respect to $\theta$:

$$\nabla_\theta J(\theta) = X^T X \theta - X^T \vec{y}$$

Notice that $\nabla_\theta J(\theta)$ is an $n \times 1$ matrix, the same shape as $\theta$.

By setting this derivative equal to 0, we obtain the normal equation:

$$X^T X \theta = X^T \vec{y} \quad \Longrightarrow \quad \theta = (X^T X)^{-1} X^T \vec{y}$$

# Normal equation
def normalEqn(X, y):
    theta = np.linalg.inv(X.T @ X) @ X.T @ y
    # X.T @ X is equivalent to X.T.dot(X); @ (or .dot()) denotes matrix multiplication
    return theta

final_theta2 = normalEqn(X, y)
final_theta2
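
As a sanity check (my own addition, not part of the original notes), the closed-form result can be compared against NumPy's least-squares solver, which solves the same problem without explicitly inverting $X^T X$:

# np.linalg.lstsq avoids forming the explicit inverse, which is numerically
# more stable when X.T @ X is close to singular.
theta_lstsq, residuals, rank, sv = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)
print(np.asarray(final_theta2).ravel())
print(theta_lstsq.ravel())  # the two solutions should agree up to numerical precision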

Feature Scaling

  • Mean normalization: $x := \dfrac{x - \mu}{x_{max} - x_{min}}$, where $\mu$ is the mean of the feature.

  • Z-score normalization: $x := \dfrac{x - \mu}{\sigma}$, where $\sigma$ is the standard deviation of the feature (a code sketch of both follows this list).
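
A minimal NumPy sketch of the two normalizations (the feature matrix below is made up for illustration):

import numpy as np

# Made-up feature matrix: each column is one feature (e.g. size in square feet, number of bedrooms).
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])

mu = X.mean(axis=0)                 # per-feature mean
sigma = X.std(axis=0)               # per-feature standard deviation
x_range = X.max(axis=0) - X.min(axis=0)

X_mean_norm = (X - mu) / x_range    # mean normalization
X_zscore = (X - mu) / sigma         # z-score normalization

print(X_mean_norm)
print(X_zscore)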

Checking Convergence of Gradient Descent

Plot the learning curve (the cost J versus the iteration number). If gradient descent is working properly, J should decrease on every iteration and eventually level off.

Values of the learning rate to try: 0.01, 0.03, 0.1, 0.3, 1, …
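
A short sketch of such learning curves (it reuses the gradientDescent function and the toy X, y from the examples above, which are my own illustrative data; the alpha values follow the list just given):

import matplotlib.pyplot as plt

for alpha in [0.01, 0.03, 0.1]:
    theta0 = np.matrix(np.zeros((1, 2)))
    _, cost_history = gradientDescent(X, y, theta0, alpha, iters=200)
    plt.plot(cost_history, label=f'alpha = {alpha}')  # J versus iteration number

plt.xlabel('iteration')
plt.ylabel('cost J')
plt.legend()
plt.show()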

Polynomial Regression

How do we decide which features to use? One common option is to create new features from existing ones, for example powers of a feature (x, x², x³), and then fit a linear model on the expanded feature set, as sketched below.
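
A brief sketch (the feature values, the synthetic target, and the polynomial degree are chosen only for illustration) of building polynomial features and solving with the normal equation from above:

import numpy as np

# One raw feature, e.g. house size; values are made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_poly = 1 + 2 * x + 0.5 * x ** 2          # synthetic target with a quadratic trend

# Build polynomial features: a column of ones (intercept), x, and x^2.
X_poly = np.column_stack([np.ones_like(x), x, x ** 2])

# Feature scaling matters here if gradient descent is used, because x^2 has a much
# larger range than x; the normal equation itself does not require scaling.
theta_poly = np.linalg.inv(X_poly.T @ X_poly) @ X_poly.T @ y_poly
print(theta_poly)   # recovers approximately [1, 2, 0.5]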