Machine Learning
What is Machine Learning?
“Field of study that gives computers the ability to learn without being explicitly programmed.”
— Arthur Samuel
Supervised Learning
Refers to algorithms that learn a mapping from input x to output y, learning from examples that come with the given “right answers”.
- Regression (predict numeric value)
- Classification (predict categories)
Unsupervised Learning
Given inputs x with no output labels y, the algorithm has to find structure in the data on its own.
- Clustering algorithm
Linear Regression
The training set is the data used to train the model. It includes:
- x = “input variable” (feature)
- y = “output variable” (target)
Cost Function
Definition: the squared error cost

$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2$

where $f_{w,b}(x) = wx + b$ and $m$ is the number of training examples.
Objective: find $w, b$ that minimize $J(w, b)$.
Instead of using a 3D plot to show the cost function, we can use a contour plot.
```python
import numpy as np

def computeCost(X, y, theta):
    # Squared error cost: J = (1/2m) * sum((X @ theta - y)^2)
    m = len(y)
    return np.sum((X @ theta - y) ** 2) / (2 * m)
```
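To see where the contour plot comes from, we can evaluate the cost on a grid of (w, b) values; each contour line then connects points of equal cost. A minimal self-contained sketch (the toy data and grid ranges here are made up for illustration):

```python
import numpy as np

# Toy data lying exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0])
y = 2 * x + 1

def cost(w, b):
    # Squared error cost J(w, b) = (1/2m) * sum((w*x + b - y)^2)
    m = len(y)
    return np.sum((w * x + b - y) ** 2) / (2 * m)

# Evaluate J over a grid of (w, b); plotting J as contours gives the contour plot
ws = np.linspace(0, 4, 81)
bs = np.linspace(-1, 3, 81)
J = np.array([[cost(w, b) for w in ws] for b in bs])

# The grid minimum sits near (w, b) = (2, 1), where J is (near) zero
b_idx, w_idx = np.unravel_index(J.argmin(), J.shape)
```

Feeding `ws`, `bs`, and `J` to `matplotlib.pyplot.contour` draws the elliptical contours described above.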
Gradient Descent
Suppose we want to find the minimum of a function (note that the function does not need to be the cost function of a linear regression; it could be of any form).
Outline:
- Start with some w, b.
- Keep changing w, b to reduce J(w, b).
- Repeat until we settle at or near a minimum.
Pseudocode
```python
# 1. Start with SOME w, b (for example, w = 0 and b = 0)
# 2. Keep changing w, b to reduce J(w, b)
# 3. Repeat until we settle at or near a minimum
```
Algorithm: repeat until convergence, updating w and b simultaneously:

$w := w - \alpha \frac{\partial}{\partial w} J(w, b)$

$b := b - \alpha \frac{\partial}{\partial b} J(w, b)$

where $\alpha$ is the learning rate.
- Learning rate $\alpha$:
  - If too small, gradient descent may be too slow.
  - If too large, gradient descent may overshoot and never reach the minimum (it may diverge).
Gradient descent can reach a local minimum with a fixed learning rate: as it approaches the minimum, the derivative shrinks, so the update steps automatically get smaller.
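This shrinking-step behavior can be seen on a tiny example. A minimal sketch minimizing $f(x) = x^2$ (the starting point, learning rate, and iteration count are arbitrary choices for illustration):

```python
# Minimize f(x) = x^2 with gradient descent; the gradient is f'(x) = 2x.
alpha = 0.1   # fixed learning rate
x = 10.0
steps = []
for _ in range(50):
    grad = 2 * x
    step = alpha * grad
    steps.append(abs(step))
    x = x - step

# x approaches the minimum at 0, and the step sizes shrink on their own,
# even though alpha never changes.
```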
Example
For the cost function in a linear regression setting:
```python
import numpy as np

def gradientDescent(X, y, theta, alpha, iters):
    # Batch gradient descent for linear regression
    m = len(y)
    for _ in range(iters):
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
    return theta
```
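Putting the pieces together, here is a self-contained sketch of batch gradient descent fitting a line to toy data (the data, learning rate, and iteration count are made up for illustration):

```python
import numpy as np

# Toy training set: y = 3x + 2, with a column of ones prepended to X
# so that theta[0] plays the role of the intercept b
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 3 * x + 2

theta = np.zeros(2)
alpha = 0.1
for _ in range(5000):
    # Simultaneous update of all parameters
    theta = theta - (alpha / len(y)) * (X.T @ (X @ theta - y))

# theta converges to approximately [2, 3], i.e. (intercept, slope)
```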
Linear Regression with Multiple Features
Vectorization (and its benefits)
```python
import numpy as np

w = np.array([1, 2, -3])
x = np.array([10, 20, 30])
b = 4
f = np.dot(w, x) + b  # vectorized prediction: w · x + b
```
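The benefit shows up when comparing against the non-vectorized version. A minimal sketch (the numbers are arbitrary): the loop computes one term at a time in Python, while `np.dot` performs the whole dot product in optimized numeric code, which is dramatically faster for long feature vectors.

```python
import numpy as np

w = np.array([1, 2, -3])
x = np.array([10, 20, 30])
b = 4

# Non-vectorized: accumulate w[j] * x[j] one feature at a time
f_loop = b
for j in range(len(w)):
    f_loop += w[j] * x[j]

# Vectorized: a single dot product
f_vec = np.dot(w, x) + b

# Both compute 1*10 + 2*20 + (-3)*30 + 4 = -36
```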
Normal Equation: An Alternative to Gradient Descent
- Works only for linear regression; it does not generalize to other learning algorithms
- Solves for w and b analytically, without iterations
- Quite slow if the number of features is large (over ~10,000)
- Often used internally by machine learning libraries that implement linear regression
A good article about the Normal Equation: 正规方程详细推导过程 (“A Detailed Derivation of the Normal Equation”)
Derivation of Normal Equation
Definition 1: Matrix derivatives

For a function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$, define the derivative of $f$ with respect to $A$ as the matrix of partial derivatives:

$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$

Definition 2: Trace

For a square matrix $A$, the trace is the sum of its diagonal entries: $\operatorname{tr} A = \sum_i A_{ii}$.

For the training set, define an $m \times n$ design matrix $X$ whose rows are the training inputs:

$X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$

Similarly, let:

$\vec{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix}$

be the $m$-dimensional vector of target values of the training set.

The cost function for a linear regression will be:

$J(\theta) = \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y})$

Taking the gradient with respect to $\theta$ gives $\nabla_\theta J(\theta) = X^T X \theta - X^T \vec{y}$. By setting it equal to 0, we have the normal equation:

$X^T X \theta = X^T \vec{y} \quad \Rightarrow \quad \theta = (X^T X)^{-1} X^T \vec{y}$
```python
import numpy as np

# Normal equation: theta = (X^T X)^(-1) X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
```
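As a sanity check, the closed-form solution can be verified on toy data whose true parameters are known (the data here is made up; y = 5 + 2x exactly, so the recovered theta should be [5, 2]):

```python
import numpy as np

# Toy design matrix with an intercept column; targets follow y = 5 + 2x exactly
x = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = 5 + 2 * x

# Normal equation: no iterations, no learning rate
theta = np.linalg.inv(X.T @ X) @ X.T @ y
# theta is approximately [5, 2] = (intercept, slope)
```

In practice, `np.linalg.solve(X.T @ X, X.T @ y)` or `np.linalg.lstsq` is preferred over forming the explicit inverse, which is slower and less numerically stable.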
Feature Scaling
Mean normalization: $x := \dfrac{x - \mu}{x_{\max} - x_{\min}}$

Z-score normalization: $x := \dfrac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ the standard deviation of the feature.
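Z-score normalization is easy to vectorize over a whole feature matrix. A minimal sketch (the feature values are made up to mimic two very differently scaled features):

```python
import numpy as np

# Two features on very different scales (values made up for illustration)
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])

mu = X.mean(axis=0)        # per-feature mean
sigma = X.std(axis=0)      # per-feature standard deviation
X_norm = (X - mu) / sigma  # z-score normalization

# After scaling, every column has mean ~0 and standard deviation ~1,
# so gradient descent treats the features on equal footing
```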
Checking Convergence of Gradient Descent
Plot the learning curve (the J vs. iteration curve); J should decrease after every iteration and flatten out once gradient descent has converged.
Values of the learning rate to try: 0.01, 0.03, 0.1, 0.3, 1, …
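The learning curve is just the recorded cost at each iteration. A self-contained sketch (toy data and hyperparameters are arbitrary choices for illustration):

```python
import numpy as np

# Track the cost at every iteration of gradient descent
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 3 * x + 2

theta = np.zeros(2)
alpha = 0.1
history = []
for _ in range(200):
    err = X @ theta - y
    history.append(float(err @ err) / (2 * len(y)))  # J at this iteration
    theta -= (alpha / len(y)) * (X.T @ err)

# A healthy learning curve decreases at every iteration and flattens out;
# if it ever goes up, the learning rate is likely too large
```

Plotting `history` against the iteration index with matplotlib gives the J-iteration curve described above.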
Polynomial Regression
How to decide which features to use?
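One common choice is to engineer powers of an existing feature, then fit an ordinary linear regression on the expanded feature matrix. A minimal sketch (the feature values and the choice of degree 3 are arbitrary):

```python
import numpy as np

# Turn a single feature x into polynomial features x, x^2, x^3
x = np.array([1.0, 2.0, 3.0, 4.0])
X_poly = np.column_stack([x, x ** 2, x ** 3])

# Note: x^2 and x^3 span much larger ranges than x itself,
# so feature scaling matters even more for polynomial regression
```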
