Gradient descent — How it works!
Gradient descent is a really simple and very powerful mathematical algorithm, built on an elegant connection between a function's optimum and its derivative.
There are two ways to achieve success in life. You can wander randomly, left to right and vice versa, and maybe one day hit the target. Or you can take small steps in the right direction. Gradient descent helps you pursue the second way.
The idea of optimization is to find the optimal point. An optimum is defined as the point where a function reaches its minimal or maximal value. In other words, it is the value of the input parameter for which the function returns its highest or lowest output.
Note the equivalence: min f(x) = max -f(x)!
f(x) = x²
We are looking for the minimum of this function.
f(x) = -x²
We are looking for the maximum of this function.
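As a quick numerical sketch of these two examples and the equivalence above (Python, on a coarse grid from -5 to 5; an illustration, not a proof):

```python
# The minimizer of f(x) = x^2 is also the maximizer of -f(x),
# illustrating min f(x) = max -f(x). Checked on a coarse grid.
def f(x):
    return x ** 2

xs = [i / 10 for i in range(-50, 51)]        # grid from -5 to 5

print(min(xs, key=f))                        # 0.0, where f(x) = x^2 is lowest
print(max(xs, key=lambda x: -f(x)))          # 0.0, where -f(x) = -x^2 is highest
```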
Intelligent optimization steps:
1. Make a random guess of the function's optimum
2. If you haven't hit the optimum, determine whether to go left or right (choose a direction)
3. Choose how far left or right you are going to move (choose a step size); a sketch of these steps follows right after this list
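Here is a minimal sketch of these three steps taken literally (Python), with the direction chosen by comparing the function value a small step to the left and to the right, and a fixed step size. This is a naive search, not yet gradient descent; the rest of the article shows how the derivative gives us the direction, and a sensible step size, directly.

```python
# A naive, literal implementation of the three steps above
# (not yet gradient descent): the direction is found by comparing
# function values to the left and to the right, the step size is fixed.
def naive_search(f, initial_guess, step=0.1, n_steps=100):
    guess = initial_guess                      # 1. random initial guess
    for _ in range(n_steps):
        if f(guess - step) < f(guess + step):  # 2. choose a direction
            guess = guess - step               # 3. move by the fixed step
        else:
            guess = guess + step
    return guess

print(naive_search(lambda x: x ** 2, -5))      # ends up near 0 for f(x) = x^2
```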
Let's make a random guess. Let's try the number -5.
We see that our initial guess lies to the left of the real optimum (0). We need a scientific way to tell us to go to the right, some tool that gives us a clue about which direction (left or right) to move.
Here the function's derivative comes into action. Here is the pattern: the gradient tells us whether the initial guess is to the left or to the right of the optimal point. When our guess is to the left of the optimal point (of a minimum, as here), the derivative is negative, and vice versa.
The derivative is just the slope of the tangent line through some point, in our case through the blue point.
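As a quick numerical sketch (assuming f(x) = x² and the guess x = -5 from above), we can approximate that slope with a finite difference and confirm that it is negative:

```python
# Approximate the slope (derivative) of f(x) = x^2 at x = -5
# with a central finite difference; the exact value is 2 * (-5) = -10.
def f(x):
    return x ** 2

def slope(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

print(slope(f, -5))   # about -10: negative, so the guess is left of the minimum
```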
When we are at points to the left of the optimum, the following is true:
- We need to increase the initial guess x = -5 (like -5, -4, …, -1, 0) in order to reach the optimal point
- The gradient has a negative value
We can merge those two facts into one equation. We go through iterations and keep updating the initially guessed value:
new_guess = old_guess - gradient(at old_guess)
Why does this formula work?
When we are to the left of the optimal point, the gradient is negative and we need to increase the guess. We simply set new_guess = old_guess - gradient; since the gradient is negative, subtracting it (minus a minus) makes the new guess increase.
Similarly, when the guess is higher than the optimal point, we need to decrease it.
new_guess = old_guess - gradient
The gradient is positive in this case, so new_guess is going to decrease.
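For example, take f(x) = x² from above, whose derivative is 2x, and our guess x = -5. The gradient there is 2 · (-5) = -10, so new_guess = -5 - (-10) = 5: we do move to the right, toward the optimum, but we overshoot it completely (and from 5 the next step lands back at -5). This is exactly why the raw gradient step needs to be scaled down.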
In order to make the steps even smaller, we multiply the value of the gradient by some small number, for example between 0.01 and 0.03 (often called the learning rate). That is good because it makes travelling to the optimal point even smoother.
So the final formula for updating the guess would be:
new_guess = old_guess - alpha * gradient(at old_guess)
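Putting it all together, here is a minimal sketch in Python, assuming f(x) = x² (so the gradient is 2x), the initial guess -5 and alpha = 0.03 from the examples above:

```python
# Gradient descent on f(x) = x^2, whose gradient is 2x.
# Initial guess -5 and alpha = 0.03 follow the examples in the article.
def gradient(x):
    return 2 * x

def gradient_descent(initial_guess, alpha=0.03, n_steps=200):
    guess = initial_guess
    for _ in range(n_steps):
        # new_guess = old_guess - alpha * gradient(at old_guess)
        guess = guess - alpha * gradient(guess)
    return guess

print(gradient_descent(-5))   # very close to 0, the true minimum of f(x) = x^2
```

For this particular function, each iteration shrinks the distance to the optimum by a factor of (1 - 2 · alpha), so the guess glides smoothly toward 0 instead of jumping over it.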
I hope you have enjoyed this. Feel free to comment and share how gradient descent has changed the way you look at optimization.