You’re stuck on a mountain, surrounded by a dense fog. You didn’t sign up for this. You simply wanted to solve for the optimal values of a regression, but it was too computationally inefficient and expensive because you had too many features. So, here you magically are. You can get down the mountain and out of the fog the fastest by going down the steepest slope. You do this. You find a Saint Bernard with a flagon of “water” around her neck and you’re saved.
This is the idea behind Gradient Descent, an iterative process (steps) that takes us to the minimum of a function (bottom of the mountain).
For those curious about what the math behind gradient descent looks like:
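In standard notation (my own rendering, since the original equations aren't reproduced here): for a line ŷ = mx + b with learning rate α, each step of gradient descent updates the parameters in the direction opposite the gradient of the RSS:

```latex
RSS(m, b) = \sum_{i=1}^{n} \big(y_i - (m x_i + b)\big)^2

m \leftarrow m - \alpha \frac{\partial RSS}{\partial m}
  = m + \alpha \sum_{i=1}^{n} 2\, x_i \big(y_i - (m x_i + b)\big)

b \leftarrow b - \alpha \frac{\partial RSS}{\partial b}
  = b + \alpha \sum_{i=1}^{n} 2 \big(y_i - (m x_i + b)\big)
```

Here α is the step size (learning rate) discussed below.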
Objective of gradient descent and a brief dive down the cost curve:
The objective of gradient descent is to find the global minimum of the function: the line with the lowest Residual Sum of Squares (RSS), and therefore the best-fit line. We move step-by-step down the slope of the function, updating our b and m terms (think y = mx + b) to find the line with the lowest RSS. We take these steps at a certain step size, or learning rate, while being careful not to overshoot the lowest RSS. We mitigate this risk by choosing a step size that is neither so large that we fly right past the minimum nor so small that training takes forever computationally.
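To make one of those steps concrete, here is a minimal sketch for a one-feature linear model. The function name `gradient_step` and the variable names are my own, not from any particular library:

```python
def gradient_step(m, b, xs, ys, learning_rate):
    """One gradient-descent step for fitting y = m*x + b by minimizing RSS."""
    # Partial derivatives of RSS = sum((y - (m*x + b))**2) with respect to m and b
    grad_m = sum(-2 * x * (y - (m * x + b)) for x, y in zip(xs, ys))
    grad_b = sum(-2 * (y - (m * x + b)) for x, y in zip(xs, ys))
    # Move opposite the gradient, scaled by the learning rate (step size)
    return m - learning_rate * grad_m, b - learning_rate * grad_b
```

Repeating this step walks m and b down the slope; too large a `learning_rate` makes the updates overshoot and diverge, which is exactly the risk described above.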
For the visual learners out there, here is a helpful graphic of step sizes leading to the global minimum:
Above, we see a cost curve, which plots the cost (RSS) of the model against its parameter values. Our model iteratively moves down the cost curve in an attempt to find the minimum of the cost function.
Not all cost functions are created equal. Not every cost function looks like a perfect halfpipe for a snowboarder. Some cost functions zig and others zag.
An integral feature of gradient descent that we haven’t covered yet is the Cost Function.
The cost function is a measurement of how wrong our model is when estimating the relationship between our feature (x) and our target variable (y). It measures the gap between the actual values and the predicted values.
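As a quick illustration (a sketch, not a library function), the RSS cost of a candidate line can be computed directly from that gap:

```python
def rss(m, b, xs, ys):
    """Residual Sum of Squares: total squared gap between actual and predicted y."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
```

A perfect fit gives a cost of zero; every unit of error is squared, so bad fits are punished heavily.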
Now, not to throw you for too much of a loop, but there are three flavors of gradient descent: Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent. Don’t worry, we’ll get you down this mountain with as little harm as possible.
Batch Gradient Descent
Batch gradient descent looks at all of the data you use in order to calculate a single iteration, thus making it extremely time consuming for larger datasets. One epoch in batch gradient descent consists of taking the average of the gradients over the entire training set and then using that average gradient to update the parameters. Batch gradient descent isn’t recommended for larger datasets, but it can be helpful for smaller ones since it does take into account all of the data being used. The steps taken towards the global minimum of the loss function are less noisy, giving a more stable convergence and error gradient.
Pros:
- A more stable error gradient, since Batch updates less frequently, and therefore a more stable convergence
- The algorithm is more computationally efficient since there are fewer updates than Stochastic
- Great for small datasets
Cons:
- The entire training dataset must be in memory and available to the algorithm
- Larger datasets will cause issues because of the lower training speed and higher number of updates
- Although the error gradient is more stable, this may cause a premature convergence of the model
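Following that description, one batch epoch might be sketched like this (a toy implementation for y = mx + b; the name `batch_epoch` is my own):

```python
def batch_epoch(m, b, xs, ys, lr):
    """One epoch of batch gradient descent: a single update from the whole dataset."""
    n = len(xs)
    # Average the RSS gradient over the *entire* training set...
    grad_m = sum(-2 * x * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
    grad_b = sum(-2 * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
    # ...then apply exactly one parameter update per epoch.
    return m - lr * grad_m, b - lr * grad_b
```

Note that every sample must be visited before the parameters move at all, which is where the memory and speed concerns above come from.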
Stochastic Gradient Descent
In stochastic gradient descent, only one sample is passed through each learning step. Whereas batch gradient descent returns a less noisy error, stochastic gradient descent can return a much noisier error given how little data the algorithm takes into account per update. Stochastic gradient descent is much preferred for larger datasets since each update is computationally cheap.
Pros:
- Simplest form of gradient descent and therefore good for beginners
- Model performance is easily gauged given frequent updates
- Allows for the usage of large datasets because only one piece of data needs to be processed at a time
- Computationally cheaper than Batch
Cons:
- Because of its stochastic (random) nature, the algorithm is far less regular than Batch
- Can miss the minimum entirely due to its noisy update process
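One stochastic epoch could be sketched as follows, again for y = mx + b (the name `sgd_epoch` and the `seed` parameter are my own additions for reproducibility):

```python
import random

def sgd_epoch(m, b, xs, ys, lr, seed=0):
    """One epoch of stochastic gradient descent: an update after every single sample."""
    rng = random.Random(seed)
    order = list(range(len(xs)))
    rng.shuffle(order)  # visit the samples in a random (stochastic) order
    for i in order:
        x, y = xs[i], ys[i]
        # Gradient of the squared error for this one sample only
        grad_m = -2 * x * (y - (m * x + b))
        grad_b = -2 * (y - (m * x + b))
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b
```

Because each update sees a single sample, the path down the cost curve jitters, but only one data point ever needs to be in memory at a time.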
Mini-Batch Gradient Descent
The general concept behind Mini-Batch Gradient Descent is that it’s a pared-down, small-batch version of standard gradient descent. Mini-Batch Gradient Descent attempts to marry Batch and Stochastic gradient descent, taking the stability of Batch and the cheap updates of Stochastic. We divide our training set into mini-batches of size n, commonly a power of two such as 32, 64, or 128, and compute a gradient update from each mini-batch in turn.
Pros:
- Fits into memory easily and is computationally cheap
- More computationally efficient than Stochastic due to the batched updates
- Update frequency in Mini-Batch is higher than Batch, giving a more robust convergence
- Good middle ground between Batch and Stochastic gradient descent
Cons:
- You must introduce a new hyperparameter, n, called the ‘mini-batch size’, which may take a good amount of time to tune
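The middle ground can be sketched by combining the two previous ideas: update once per small batch, using that batch’s average gradient (a toy implementation for y = mx + b; `minibatch_epoch` and its parameter names are my own):

```python
def minibatch_epoch(m, b, xs, ys, lr, batch_size=32):
    """One epoch of mini-batch gradient descent: one update per batch of size n."""
    for start in range(0, len(xs), batch_size):
        bx = xs[start:start + batch_size]
        by = ys[start:start + batch_size]
        n = len(bx)
        # Average the RSS gradient over just this mini-batch...
        grad_m = sum(-2 * x * (y - (m * x + b)) for x, y in zip(bx, by)) / n
        grad_b = sum(-2 * (y - (m * x + b)) for x, y in zip(bx, by)) / n
        # ...and update once per batch, several times per epoch.
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b
```

With `batch_size=1` this degenerates into stochastic gradient descent, and with `batch_size=len(xs)` into batch gradient descent, which is exactly why the mini-batch size is the hyperparameter worth tuning.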
You now have a decent (descent pun?) grasp on gradient descent and the three flavors it comes in. Use this knowledge wisely in your foray into machine learning. In case any of you were wondering, the dog that saved you at the beginning of this article is named Patty, and she’s just as adorable as you assumed.