Writing Gradient Descent From Scratch

A ground-up look at gradient descent — what it is, why it matters, and how to implement it in plain Python without any libraries.

March 16, 2026

Gradient descent is one of those ideas that sounds more complicated than it is until you sit down and write it yourself.

At a high level, it is just a method for improving parameters step by step. You start with a guess, measure how wrong that guess is, figure out which direction reduces the error, and then move a little in that direction. Repeating that process is, at its core, how a model learns.

What It Actually Is

Suppose you have a model with some parameters. Take a simple line:

y = wx + b

Here, w and b are the values we want to learn.

We also need a way to measure how bad the model is. That is the loss function. For linear regression, mean squared error is a natural choice:

L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Gradient descent tries to find values of w and b that make this loss as small as possible.
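Written out directly in plain Python, that formula is only a few lines (the sample values below are mine, just for illustration):

```python
# Mean squared error: average of squared differences between targets
# and predictions, exactly as in the formula above.
def mse(y_true, y_pred):
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n

print(mse([5, 7, 9], [5, 7, 9]))   # perfect predictions -> 0.0
print(mse([5, 7, 9], [4, 7, 11]))  # errors of 1, 0, -2 -> (1 + 0 + 4) / 3
```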

The gradient of a function points in the direction of its steepest increase. To minimize the loss, you move in the opposite direction. That gives you the update rule:

\theta = \theta - \alpha \nabla L(\theta)

Where θ is a parameter, α is the learning rate, and ∇L(θ) is the gradient of the loss with respect to that parameter.

That is the entire idea. Compute the slope. Step downhill. Repeat.
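In miniature, on a one-parameter function of my choosing, the whole loop looks like this. Here f(θ) = (θ − 3)² has gradient 2(θ − 3), and the learning rate is an arbitrary illustrative pick:

```python
# Minimize f(theta) = (theta - 3) ** 2 by repeated downhill steps.
theta = 0.0
alpha = 0.1  # learning rate

for _ in range(100):
    grad = 2 * (theta - 3)        # slope of f at the current theta
    theta = theta - alpha * grad  # step in the opposite direction

print(theta)  # very close to 3, the minimizer
```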

The Components Worth Understanding

Before writing any code, it helps to be clear about what each piece is doing.

Parameters are the values the model is trying to learn — weights and biases.

The model maps inputs to predictions. For our line: ŷ = wx + b.

The loss function tells us how wrong the model is. Bigger errors produce bigger loss.

Gradients tell us how the loss changes if we nudge a parameter slightly. They tell the model which direction to move.

The learning rate controls the size of each step. Too small, and training crawls. Too large, and it can overshoot or diverge entirely. In practice, learning rate is one of the most consequential hyperparameters you will tune.

Iterations are why it works at all. A single update does not solve the problem. Hundreds or thousands of updates, each making a small improvement, gradually get you somewhere useful.
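To make the gradient idea concrete, you can nudge a parameter numerically and watch the loss respond. This is a finite-difference check on the same line-fitting data used later in this post, not how gradients are normally computed:

```python
x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 11, 13]

def loss(w, b):
    return sum((yi - (w * xi + b)) ** 2 for xi, yi in zip(x, y)) / len(x)

# Nudge w by a tiny eps in both directions; the slope of the loss
# between those two points approximates the gradient at (w, b).
eps = 1e-6
w, b = 0.0, 0.0
dw_numeric = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
print(dw_numeric)  # negative, so increasing w decreases the loss
```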

Why Gradient Descent Shows Up Everywhere

The reason gradient descent is so universal is that it does not care about the exact shape of your model.

As long as you can define parameters, define a loss, and compute gradients, you can use it. That is why you see the same basic loop — forward pass, compute loss, backpropagate gradients, update parameters — whether someone is training a linear regression or a massive neural network.

Many problems in machine learning reduce to: define a loss, then minimize it. Gradient descent is the most common tool for doing that minimization.

Implementing It in Python

Let us fit a line to some data. The setup is deliberately simple.

x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 11, 13]

This data follows y = 2x + 3. The goal is to recover something close to w = 2 and b = 3 using gradient descent, without any libraries.

def gradient_descent(x, y, learning_rate=0.01, epochs=1000):
    n = len(x)

    w = 0.0
    b = 0.0

    for epoch in range(epochs):
        # forward pass: current predictions
        y_pred = [w * xi + b for xi in x]

        # mean squared error
        loss = sum((yi - y_hat) ** 2 for yi, y_hat in zip(y, y_pred)) / n

        # gradients of loss with respect to w and b
        dw = (-2 / n) * sum(xi * (yi - y_hat) for xi, yi, y_hat in zip(x, y, y_pred))
        db = (-2 / n) * sum((yi - y_hat) for yi, y_hat in zip(y, y_pred))

        # update parameters
        w = w - learning_rate * dw
        b = b - learning_rate * db

        if epoch % 100 == 0:
            print(f"epoch={epoch}, loss={loss:.4f}, w={w:.4f}, b={b:.4f}")

    return w, b


x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 11, 13]

w, b = gradient_descent(x, y, learning_rate=0.01, epochs=1000)

print("\nFinal parameters:")
print("w =", w)
print("b =", b)

Walking Through It

We start with w = 0.0 and b = 0.0. The model's first predictions are terrible — ŷ = 0 for every input — and the loss reflects that.

The forward pass produces predictions using the current line. Then we compute the loss to see how wrong we are.

The two gradient lines are where the real work happens:

dw = (-2 / n) * sum(xi * (yi - y_hat) for xi, yi, y_hat in zip(x, y, y_pred))
db = (-2 / n) * sum((yi - y_hat) for yi, y_hat in zip(y, y_pred))

These answer a specific question: if we nudge w a little, does the loss go up or down? Same for b. The sign of the gradient tells us which direction to move. The update step moves each parameter opposite to its gradient:

w = w - learning_rate * dw
b = b - learning_rate * db

Doing this once will not solve anything. Doing it a thousand times gradually steers w toward 2 and b toward 3. That is learning, in its most basic form.

What the Variants Are Doing

Vanilla gradient descent, the full-batch version where every update step uses the entire dataset, is conceptually simple, and that simplicity is comforting when you're first learning. But as soon as your data grows past a toy example, that approach falls over. That's why almost everything practical in machine learning leans on variants that make subtle tradeoffs for the sake of efficiency.

Stochastic Gradient Descent (SGD): Rather than looking at the whole dataset each time, SGD makes a tiny update using just one data point (or, more commonly, a small “mini-batch”). It’s less exact at each step, but vastly more practical, and it’s basically what every deep learning system actually uses.
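A rough sketch of single-example SGD on the same line-fitting data (the learning rate, step count, and seed are arbitrary choices for illustration):

```python
import random

x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 11, 13]

random.seed(0)  # deterministic run for the example
w, b = 0.0, 0.0
lr = 0.01

for _ in range(5000):
    i = random.randrange(len(x))  # one randomly chosen example
    err = y[i] - (w * x[i] + b)   # residual for that example only
    w += lr * 2 * err * x[i]      # gradient of the single-point squared error
    b += lr * 2 * err

print(w, b)  # close to w = 2, b = 3
```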

Momentum: This is a small trick that accumulates a kind of running average of the gradients. It’s not just moving downhill—it builds up velocity, which helps push through flat spots and avoids that annoying zig-zag pattern you get in certain loss landscapes.
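A sketch of the momentum update on a one-parameter quadratic (the function, learning rate, and β = 0.9 are my illustrative choices; 0.9 is a common default):

```python
# Minimize f(theta) = theta ** 2 with momentum: the velocity term
# accumulates past gradients instead of using each one in isolation.
theta, velocity = 5.0, 0.0
alpha, beta = 0.1, 0.9

for _ in range(300):
    grad = 2 * theta                   # gradient of theta^2
    velocity = beta * velocity + grad  # running accumulation of gradients
    theta -= alpha * velocity          # step using the velocity, not the raw gradient

print(theta)  # settled near the minimum at 0
```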

Adam: This is the optimizer you reach for when you just want things to work. Adam stacks together momentum and an adaptive step size for every parameter. It’s not magic, but it feels like it, because it works on a huge range of problems with barely any tuning.
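In sketch form, the Adam update combines both ideas. The constants below are the conventional defaults, and the one-parameter quadratic is again just a toy, not a drop-in optimizer:

```python
import math

# Adam on f(theta) = theta ** 2: momentum (m, first moment) plus an
# adaptive per-parameter step size from a running average of squared
# gradients (v, second moment), with bias correction for early steps.
theta = 5.0
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m, v = 0.0, 0.0

for t in range(1, 501):
    grad = 2 * theta
    m = beta1 * m + (1 - beta1) * grad       # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)

print(theta)  # near the minimum at 0
```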

The point isn’t that vanilla gradient descent is obsolete. The point is that most of what you actually use—SGD, Adam, RMSProp—are all variations on this same theme. If you understand the basic algorithm deeply, the “fancy” ones start to feel like tweaks and not black boxes.

A few things are worth knowing if you move beyond toy examples.

Learning rate choice matters more than most tutorials suggest. Too high and loss can explode. Too low and you will be waiting a long time for convergence. For non-trivial problems, this is usually the first thing to experiment with.
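You can see all three regimes on a toy quadratic (the function and the specific rates are my picks, but the behavior is typical):

```python
# Run gradient descent on f(theta) = theta ** 2 with different rates.
def descend(lr, steps=50, theta=5.0):
    for _ in range(steps):
        theta -= lr * 2 * theta  # gradient of theta^2 is 2 * theta
    return theta

print(descend(0.001))  # too small: barely moves in 50 steps
print(descend(0.1))    # reasonable: essentially at the minimum
print(descend(1.5))    # too large: each step overshoots; the value blows up
```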

For non-convex problems — deep neural networks being the obvious case — there is no guarantee of finding a global minimum. The loss landscape has saddle points, plateaus, and local structure that can slow or confuse optimization. In practice, gradient-based methods still work remarkably well on these problems, but the theoretical guarantees are much weaker than in the convex case.

And full-batch gradient descent becomes expensive as dataset size grows. At some point, computing gradients over millions of examples per update is not worth it. Mini-batch SGD is the practical answer to that.

Why It Is Worth Writing From Scratch

Honestly, I wrote this post so I can keep revisiting it. I have found that understanding some of the fundamental math behind these algorithms instills a little more confidence in my toolkit.

When I use PyTorch, gradient descent happens behind the scenes. That is super useful in practice, but it can leave the mechanism feeling opaque. Writing dw and db by hand, seeing where those formulas come from, and watching w and b converge over a thousand epochs grounds you in the core principle. At least it did for me.