Gradient Descent. Image taken from https://community.deeplearning.ai/t/difference-between-rmsprop-and-adam/310187
What Is Gradient Descent?
Gradient descent is like hiking downhill with your eyes closed, following the slope until you hit the bottom (or at least a nice flat spot to rest). Technically, it is a method to minimize an objective function $F(\theta)$, parameterized by a model's parameters $\theta \in \mathbb{R}^n$, by updating them in the opposite direction of the gradient $\nabla F(\theta)$.
The size of each step is controlled by the learning rate $\alpha$. Think of $\alpha$ as the cautious guide who ensures you don't tumble down too fast. If this sounds like Greek, feel free to check out my previous article, Quick Glance At Gradient Descent In Machine Learning. I promise it's friendlier than it sounds!
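To see what the learning rate does in practice, here is a minimal sketch on a toy objective, F(theta) = theta squared, whose gradient is 2 * theta. The function name `descend` and the two learning rates are illustrative choices, not part of the original article.

```python
def descend(alpha, theta=1.0, steps=20):
    """Run gradient descent on F(theta) = theta**2, whose gradient is 2*theta."""
    for _ in range(steps):
        grad = 2.0 * theta            # gradient of theta**2 at the current point
        theta = theta - alpha * grad  # step against the gradient, scaled by alpha
    return theta


small = descend(alpha=0.1)  # each step multiplies theta by 0.8, so it shrinks
large = descend(alpha=1.1)  # each step multiplies theta by -1.2, so it overshoots
print(abs(small), abs(large))
```

With the cautious guide (0.1), theta creeps toward the minimum at 0. With the reckless one (1.1), every step jumps past the bottom and lands farther away than before.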
Circle back: An objective function is the mathematical function that your model aims to minimize (or maximize, depending on the goal) during training. It measures how far off your model's predictions are from the actual outcomes or desired values.
For example:
In machine learning, it could be a loss function like Mean Squared Error (MSE) for regression tasks or Cross-Entropy Loss for classification tasks.
The objective function maps the model's predictions to a numerical value, where smaller values indicate better performance.
In simpler terms:
It's like a "fitness tracker" for your model: it tells you how good or bad your model's predictions are. During optimization, gradient descent adjusts the model's parameters $\theta$ to reduce this value step by step, moving closer to an ideal solution. Got it?
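As a quick illustration of the "fitness tracker" idea, the sketch below computes Mean Squared Error for two sets of predictions. The helper name `mse` and the numbers are made up for the example.

```python
def mse(predictions, targets):
    """Mean Squared Error: the average of the squared differences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


targets = [3.0, 5.0, 7.0]
good_predictions = [2.9, 5.1, 7.0]  # close to the targets
bad_predictions = [0.0, 0.0, 0.0]   # far from the targets

good_loss = mse(good_predictions, targets)
bad_loss = mse(bad_predictions, targets)
print(good_loss < bad_loss)
```

The closer predictions get the smaller score, which is exactly the signal gradient descent follows.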
Gradient Descent Variants
Gradient descent isn't a one-size-fits-all deal. It comes in three variants, each like a different hiker: some take the scenic route, others sprint downhill, and a few prefer shortcuts (like me). These variants balance accuracy and speed, depending on how much data they use to calculate the gradient.
1. Batch Gradient Descent
Batch gradient descent, the "all-or-nothing" hiker, uses the entire dataset to compute the gradient of the cost function:

$$\theta \leftarrow \theta - \alpha \, \nabla F(\theta)$$
Imagine stopping to look at every rock, tree, and bird before deciding where to place your foot next. It's thorough, but not ideal if your dataset is as big as, say, the Amazon rainforest. It's also not great if you need to learn on the fly, like updating your hiking route after spotting a bear.
Code Example:
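for i in range(nb_epochs):
    # Gradient of the loss computed over the entire dataset
    params_grad = evaluate_gradient(loss_function, data, params)
    # One parameter update per epoch, against the gradient
    params = params - learning_rate * params_grad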
Batch gradient descent shines when you have all the time in the world and a dataset that fits neatly into memory. With a suitably small learning rate, it is guaranteed to converge to the global minimum for convex surfaces (smooth hills) and to a local minimum for non-convex surfaces (rugged mountains).
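The batch loop leaves `evaluate_gradient` and `loss_function` undefined. Here is one way to fill them in so the loop runs, as a sketch only: the finite-difference gradient, the tiny dataset (points on the line y = 2x + 1), and the hyperparameter values are all assumptions made for illustration.

```python
def loss_function(params, data):
    """Mean squared error of the line w*x + b over all examples in data."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def evaluate_gradient(loss, data, params, h=1e-6):
    """Hypothetical helper: central-difference estimate of the gradient."""
    grad = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += h
        down[i] -= h
        grad.append((loss(up, data) - loss(down, data)) / (2 * h))
    return grad


data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # points on y = 2x + 1
params = [0.0, 0.0]  # [w, b]
learning_rate = 0.1
nb_epochs = 500

for _ in range(nb_epochs):
    # Every update looks at the whole dataset before taking a single step
    params_grad = evaluate_gradient(loss_function, data, params)
    params = [p - learning_rate * g for p, g in zip(params, params_grad)]

print(params)
```

Because this loss is convex, the parameters settle near w = 2 and b = 1. In real projects the gradient usually comes from an automatic differentiation library rather than finite differences.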
2. Stochastic Gradient Descent (SGD)
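SGD is the "impulsive" hiker who takes one step at a time based on the current terrain:

$$\theta \leftarrow \theta - \alpha \, \nabla F(\theta; x^{(i)}, y^{(i)})$$

Here $(x^{(i)}, y^{(i)})$ is a single training example and its label.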
It's faster because it doesn't bother calculating gradients for the entire landscape. Instead, it uses one training example at a time. While this saves time, the frequent updates can make SGD look like it's zigzagging downhill, which can be both exciting and a little chaotic.
Imagine updating your grocery list after each aisle: you get to the essentials faster, but your cart might look wild in the process. However, with a learning rate that slowly decreases over time, the zigzagging dies down and SGD can settle at the bottom (the global minimum on convex surfaces, or a local minimum on non-convex ones).
Code Example:
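for i in range(nb_epochs):
    # Visit the examples in a new random order each epoch
    np.random.shuffle(data)
    for example in data:
        # Gradient of the loss on a single training example
        params_grad = evaluate_gradient(loss_function, example, params)
        # One parameter update per example
        params = params - learning_rate * params_grad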
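A runnable take on the SGD loop, again only a sketch. It uses the standard library's `random.shuffle` in place of NumPy, an analytic per-example gradient (the hypothetical `example_gradient`), and a simple decay schedule for the learning rate. The dataset, schedule, and constants are assumptions chosen for illustration.

```python
import random


def example_gradient(example, params):
    """Gradient of the squared error (w*x + b - y)**2 on a single example."""
    x, y = example
    w, b = params
    error = w * x + b - y
    return [2.0 * error * x, 2.0 * error]


random.seed(0)  # fixed seed so the run is repeatable
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # points on y = 2x + 1
params = [0.0, 0.0]  # [w, b]
initial_rate = 0.05
nb_epochs = 200

for epoch in range(nb_epochs):
    # Learning rate that slows down over time (an assumed 1/(1 + k*epoch) decay)
    learning_rate = initial_rate / (1.0 + 0.01 * epoch)
    random.shuffle(data)
    for example in data:
        # Update after every single example instead of after the full dataset
        grad = example_gradient(example, params)
        params = [p - learning_rate * g for p, g in zip(params, grad)]

print(params)
```

Each pass makes four updates instead of one, which is where the speed (and the zigzag) comes from.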
3. The Hero of the Day: Adam
Now let's talk about Adam, the "hiking guru" who combines the wisdom of Momentum and RMSprop. Adaptive Moment Estimation (Adam) is like having a smart guide who tracks the terrain and adjusts your steps based on past experiences and current conditions. It's the go-to optimizer when you want to train neural networks and still have time for coffee.
Why You'll Love Adam
Low Memory Requirements: It's like carrying a lightweight backpack, efficient but still packed with essentials.
Minimal Hyperparameter Tuning: Adam works well out of the box, so you won't need to fiddle with too many knobs (just keep an eye on the learning rate).
Practical Use: From improving product recommendations to recognizing images of cats on the internet, Adam powers machine learning systems that make your everyday tech smarter.
How Adam Works
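Adam maintains moving averages of gradients ($m_t$) and squared gradients ($v_t$) to adapt learning rates for each parameter:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2$$

where $g_t$ is the gradient at step $t$, and $\beta_1$ and $\beta_2$ are decay rates that control how quickly old gradients are forgotten.
Here's the cool part: because both averages start at zero, they are biased toward zero during the first steps. Adam corrects this bias:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

and updates parameters using this formula:

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$$

where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected estimates, and $\epsilon$ is a small number that prevents division by zero.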
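The Adam update can be sketched in a few lines of plain Python. This is an illustration under assumptions, not a reference implementation: the function name `adam_minimize`, the toy objective, and the step size of 0.05 are invented for the example, and the defaults (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the values commonly used for Adam.

```python
import math


def adam_minimize(grad_fn, theta, alpha=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Minimize a function with Adam, given a function that returns its gradient."""
    theta = list(theta)      # work on a copy of the starting point
    m = [0.0] * len(theta)   # moving average of gradients
    v = [0.0] * len(theta)   # moving average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction for m
            v_hat = v[i] / (1 - beta2 ** t)  # bias correction for v
            theta[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def grad_fn(theta):
    """Gradient of the toy objective (theta0 - 3)**2 + 10 * (theta1 + 1)**2."""
    return [2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)]


first = adam_minimize(grad_fn, [0.0, 0.0], alpha=0.05, steps=1)
theta = adam_minimize(grad_fn, [0.0, 0.0], alpha=0.05, steps=2000)
print(first, theta)
```

Notice the first step: the two gradients differ in size (-6 versus 20), yet both parameters move by about 0.05. Dividing by the square root of the squared-gradient average is what gives each parameter its own adapted step size.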
Conclusion
Adam is the Swiss Army knife of optimizers: versatile, efficient, and reliable. Whether you're training neural networks to detect fraud or create next-gen chatbots, Adam helps you get there faster and with fewer headaches. So, embrace Adam, take confident steps, and enjoy the view from the summit of machine learning success!