Fine-Tune Your Gradient Descent Training with Powerful Optimisation Techniques

Gradient descent trains your neural network model by minimising your defined cost function, updating the weights in the direction opposite to the gradient.


With deep learning and neural networks, we often have to experiment with different hyperparameter values to find the best model for our use case.

Algorithms that speed up training can therefore help a lot in the process, which makes them important to learn.

These algorithms are based on the concept of weighted moving averages, so let's discuss that first.

Weighted Moving Averages

In a weighted moving average, we define a tuneable parameter β and calculate the moving average in the following way…

where v_{t} represents the average at the current element, v_{t-1} represents the average up to the previous element, and a_{t} is the current element.
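In symbols, the standard weighted-moving-average update (consistent with the definitions above) is:

```latex
v_{t} = \beta \, v_{t-1} + (1 - \beta) \, a_{t}
```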

Image taken from DeepLearning.AI

The effect of high β can be seen in the image above. The curve in green represents the moving average with a higher β than the red one.

As seen, increasing β smooths the curve, but it also shifts it slightly to the right, since the average lags behind the most recent values.

Weighted moving averages are used in multiple areas, like machine learning (duh!) and finance, to name a few.
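As a minimal sketch, a weighted moving average can be computed like this (the function name and sample data here are illustrative, not from the original article):

```python
# Minimal sketch of a weighted (exponential) moving average.
# beta is the tuneable smoothing parameter from the formula above.
def moving_average(data, beta):
    averages = []
    v = 0.0  # the running average v_t, started at zero
    for a in data:
        v = beta * v + (1 - beta) * a
        averages.append(v)
    return averages

noisy = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
smooth = moving_average(noisy, beta=0.9)  # high beta: smoother, lags behind
```

With β = 0.9, each average blends 90% of the previous average with 10% of the new value, which is what produces the smoothing (and the slight lag) described above.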

Now let's learn about the different optimisations that improve gradient descent!

Momentum
With the original gradient descent, the formula to update the weights W and the biases b is…

Where α is the learning rate and dW and db are the derivatives of the loss w.r.t W and b respectively.
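Written out, this standard update rule is:

```latex
W := W - \alpha \, dW \qquad b := b - \alpha \, db
```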


With momentum, this update formula changes to:

As we can see above, we use the concept of moving averages to compute the amount by which the weights are to be updated.
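For reference, the momentum update is usually written as follows (same symbols as above, with v_{dW} and v_{db} the moving averages of the gradients):

```latex
v_{dW} = \beta \, v_{dW} + (1 - \beta) \, dW \qquad v_{db} = \beta \, v_{db} + (1 - \beta) \, db
W := W - \alpha \, v_{dW} \qquad b := b - \alpha \, v_{db}
```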

RMSProp
RMSProp (Root Mean Square Propagation) is another algorithm that helps to speed up gradient descent.

It also uses the concept of weighted moving averages to update the weights and biases in the neural network model.

With RMSProp, the update formulas become…

RMSProp weight update formula

where dW² and db² are computed element-wise, i.e. each entry of dW and db is multiplied by itself when calculating the moving averages.

We also add a small constant ϵ to the square root so that the denominator does not get too close to zero.
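Putting these pieces together, the RMSProp update is usually written as (with s_{dW} and s_{db} the moving averages of the squared gradients):

```latex
s_{dW} = \beta \, s_{dW} + (1 - \beta) \, dW^{2} \qquad s_{db} = \beta \, s_{db} + (1 - \beta) \, db^{2}
W := W - \alpha \, \frac{dW}{\sqrt{s_{dW}} + \epsilon} \qquad b := b - \alpha \, \frac{db}{\sqrt{s_{db}} + \epsilon}
```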

Adam
We have seen that both Momentum and RMSProp use the concept of weighted moving averages.

But what happens when we combine the two? Yes, we get the Adam (Adaptive Moment Estimation) algorithm.


With Adam, the weight update formula becomes…

Adam weight update formula

The bias-correction terms divide each moving average by (1 − β^t), where the power t is the number of times this step has been run; this takes care of the initial deviation caused by starting the moving averages at zero.
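A single Adam step can be sketched in a few lines of Python (a scalar version for clarity; the default values chosen for the hyperparameters beta1, beta2 and eps are the commonly used ones, not prescribed by this article):

```python
import math

# One Adam update step for a single weight w with gradient grad.
# v and s are the running moving averages; t is the step count (1-based).
def adam_step(w, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * grad       # Momentum-style average of gradients
    s = beta2 * s + (1 - beta2) * grad ** 2  # RMSProp-style average of squared gradients
    v_hat = v / (1 - beta1 ** t)             # bias correction for v
    s_hat = s / (1 - beta2 ** t)             # bias correction for s
    w = w - lr * v_hat / (math.sqrt(s_hat) + eps)
    return w, v, s

# One step from w = 1.0 with gradient 2.0:
w, v, s = adam_step(1.0, 2.0, v=0.0, s=0.0, t=1)
```

The v term plays the role of Momentum, the s term the role of RMSProp, and the divisions by (1 − β^t) are the bias corrections discussed above.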

Why do these algorithms work?

To understand this, suppose we draw a contour of our loss function with the minimum in the centre and start gradient descent from a point, as shown below.

Loss function contour | Image by the Author

Now, with regular gradient descent, we don't move directly towards the minimum; there is also some movement perpendicular to that direction, which shows up as a sort of "vertical oscillation" on the contour plot.

By employing moving averages, we essentially dampen the vertical oscillations as their values tend to move towards zero with more iterations.

This also keeps the “horizontal component” (movement towards minimum) large enough to perform gradient descent effectively.

In this article we learned about the optimisation algorithms (Momentum, RMSProp and Adam) used along with gradient descent to improve its speed. All of these algorithms are implemented in popular ML libraries like TensorFlow.

If you like this content, please give a clap. I will be writing about different things I learn and posting regularly. You can even comment on what you would like to see in the coming weeks. Happy coding…


Reference: the Improving Deep Neural Networks course on Coursera.

Beyond Vanilla: Powerful Optimisers to Enhance Gradient Descent was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.