This article is written for absolute beginners who want a clear, intuitive, and deeply detailed understanding of how neural networks and backpropagation truly work from the ground up.

We're going to focus on just one neuron, the simplest building block of any neural network. No fancy architectures, no complex layers, just the fundamental math that makes everything else possible.

By working through this single neuron example step-by-step, you'll understand the core mechanics that power all of deep learning: how a neuron makes predictions, how we measure mistakes, and how we use those mistakes to get better. These same principles scale up to the massive networks behind modern AI.

Think of this as learning to walk before you run. Master this one neuron, and you'll have the foundation to understand any neural network architecture.

Imagine you are trying to teach a tiny piece of a neural network to decide whether something belongs to a certain class. Think of a single neuron that should output:

• class 1 → "yes, this is a tiger"
• class 0 → "no, this is not a tiger"

Deep down, that neuron is just doing very simple math.

We will build everything step by step, from the raw equation to the full backpropagation, and we will follow the training process until the neuron confidently predicts the correct class.

We will not jump directly to CNNs. Instead, we will deeply understand the backbone that all of deep learning is built on.

The Simplest Neuron: From Input to Prediction

First, we define a very simple neuron.

It takes one input number $x$, multiplies it by a weight $w$, adds a bias $b$, and then passes the result through a squashing function (sigmoid) so that the output is between 0 and 1.

We write:

$$z = w \cdot x + b$$ (this is the linear part)

$$\hat{y} = \text{sigmoid}(z)$$ (this is the squashing to get a probability)

Here:

• x is the input (for example, a simple feature such as "intensity of some pattern")
• w is the weight (the neuron learns this)
• b is the bias (the neuron learns this too)
• z is the pre-activation (the linear output, before the non-linearity is applied)
• ŷ is the predicted probability of class 1 (for example, "probability this is a tiger")

The sigmoid function is:

$$\text{sigmoid}(z) = \frac{1}{1 + e^{-z}}$$

This function takes any real number and squashes it into the interval (0, 1). When z is very negative, sigmoid(z) is close to 0. When z is very positive, sigmoid(z) is close to 1.
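To make this concrete, here is a minimal Python sketch of the sigmoid (the helper name `sigmoid` is just our own choice):

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))    # exactly 0.5: right at the decision boundary
print(sigmoid(-10.0))  # very close to 0
print(sigmoid(10.0))   # very close to 1
```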

Defining the True Label and Loss Function

For classification, our true label y is either:

• y = 1 for the positive class (e.g., tiger)
• y = 0 for the negative class (e.g., not tiger)

The neuron's job is to output ŷ close to y.

To measure how wrong the neuron is, we use the binary cross-entropy loss:

$$L = -[y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y})]$$

This has some nice properties:

• If y = 1 and ŷ is close to 1, the loss is small.
• If y = 1 and ŷ is close to 0, the loss is large.
• If y = 0 and ŷ is close to 0, the loss is small.
• If y = 0 and ŷ is close to 1, the loss is large.

You can think of L as a penalty that becomes very big when the model is "confident and wrong" and small when it is "confident and correct".
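The four cases above are easy to verify in a few lines of Python (the helper name `bce_loss` is our own):

```python
import math

def bce_loss(y, y_hat):
    """Binary cross-entropy for a single example."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Confident and correct -> tiny penalty
print(bce_loss(1, 0.99))  # about 0.01
# Confident and wrong -> huge penalty
print(bce_loss(1, 0.01))  # about 4.6
```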

The Forward Pass for One Example

For a single input x and label y, the forward pass is:

1. Compute z:
$$z = w \cdot x + b$$

2. Compute ŷ using the sigmoid:
$$\hat{y} = \frac{1}{1 + e^{-z}}$$

3. Compute the loss:
$$L = -[y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y})]$$

At this point, we know how wrong the neuron is. But we want to fix w and b so that next time it will be less wrong. That is where backpropagation comes in.
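The three steps above can be sketched as one small function (a minimal version, with names of our own choosing). With w = 0 and b = 0 the neuron has no opinion yet and outputs 0.5:

```python
import math

def forward(x, y, w, b):
    """One forward pass: linear step, sigmoid, then the loss."""
    z = w * x + b                       # step 1: linear part
    y_hat = 1.0 / (1.0 + math.exp(-z))  # step 2: sigmoid
    loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))  # step 3
    return z, y_hat, loss

z, y_hat, loss = forward(2.0, 1, 0.0, 0.0)
print(z, y_hat, loss)  # 0.0, 0.5, ~0.6931
```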

Backpropagation: How Weight Updates Are Computed

Backpropagation is just an application of the chain rule from calculus.

We want:

• $\frac{\partial L}{\partial w}$ (how much the loss changes if w changes)
• $\frac{\partial L}{\partial b}$ (how much the loss changes if b changes)

We will compute these step by step.

First, notice the dependency chain:

$$w, b \rightarrow z = w \cdot x + b \rightarrow \hat{y} = \text{sigmoid}(z) \rightarrow L = \text{loss}(\hat{y}, y)$$

So:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$$

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial b}$$

Let's compute each term.

Step A: Derivative of L with respect to ŷ

The loss is:

$$L = -[y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y})]$$

Derivative with respect to ŷ:

$$\frac{\partial L}{\partial \hat{y}} = -\left[\frac{y}{\hat{y}} - \frac{1 - y}{1 - \hat{y}}\right]$$
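If this derivative looks unfamiliar, you can sanity-check it with a finite-difference estimate. Here we use y = 1 and ŷ = 0.5, the same values that appear in iteration 1 of the numerical example:

```python
import math

y = 1
y_hat = 0.5

def bce(p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

eps = 1e-6
numeric = (bce(y_hat + eps) - bce(y_hat - eps)) / (2 * eps)  # measured slope
analytic = -(y / y_hat - (1 - y) / (1 - y_hat))              # the formula above
print(numeric, analytic)  # both approximately -2.0
```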

Step B: Derivative of ŷ with respect to z

$$\hat{y} = \text{sigmoid}(z) = \frac{1}{1 + e^{-z}}$$

The derivative of sigmoid is:

$$\frac{\partial \hat{y}}{\partial z} = \hat{y} \cdot (1 - \hat{y})$$
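You can verify this identity numerically as well: the measured slope of the sigmoid should match ŷ · (1 − ŷ):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 1.25
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # measured slope
y_hat = sigmoid(z)
analytic = y_hat * (1 - y_hat)                               # the identity above
print(numeric, analytic)  # both approximately 0.1731
```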

Step C: Derivative of z with respect to w and b

$$z = w \cdot x + b$$

So:

$$\frac{\partial z}{\partial w} = x$$

$$\frac{\partial z}{\partial b} = 1$$

Now combine them.

For the weight w:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$$

$$= \left[-\left(\frac{y}{\hat{y}} - \frac{1 - y}{1 - \hat{y}}\right)\right] \cdot [\hat{y} \cdot (1 - \hat{y})] \cdot x$$

For the bias b:

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial b}$$

$$= \left[-\left(\frac{y}{\hat{y}} - \frac{1 - y}{1 - \hat{y}}\right)\right] \cdot [\hat{y} \cdot (1 - \hat{y})] \cdot 1$$

This is the full backpropagation for a single logistic neuron with cross-entropy loss.
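Before moving on, notice a pleasant simplification: when you multiply the Step A and Step B factors together, almost everything cancels:

$$\frac{\partial L}{\partial z} = -\left(\frac{y}{\hat{y}} - \frac{1 - y}{1 - \hat{y}}\right) \cdot \hat{y} \cdot (1 - \hat{y}) = -[y \cdot (1 - \hat{y}) - (1 - y) \cdot \hat{y}] = \hat{y} - y$$

so the gradients reduce to:

$$\frac{\partial L}{\partial w} = (\hat{y} - y) \cdot x$$

$$\frac{\partial L}{\partial b} = \hat{y} - y$$

In words: the gradient is just the prediction error, scaled by the input in the case of $w$. This compact form is part of why sigmoid plus cross-entropy is such a popular pairing.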

In practice, deep learning frameworks (like PyTorch and TensorFlow) compute these automatically, but under the hood, they are doing exactly this.

Gradient Descent: How We Update w and b

Once we have $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$, we update w and b using gradient descent.

We pick a learning rate, call it η (for example, 0.5).

Then we update:

$$w_{\text{new}} = w_{\text{old}} - \eta \cdot \frac{\partial L}{\partial w}$$

$$b_{\text{new}} = b_{\text{old}} - \eta \cdot \frac{\partial L}{\partial b}$$

We move in the opposite direction of the gradient because the gradient points in the direction of steepest increase of the loss. We want to decrease the loss, so we walk in the opposite direction.
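In code, the update rule is one line per parameter (a small sketch; the function name and the illustrative gradient values are our own):

```python
def gradient_step(w, b, dL_dw, dL_db, lr):
    """Take one step against the gradient to reduce the loss."""
    return w - lr * dL_dw, b - lr * dL_db

# Negative gradients mean "increase the parameter to reduce the loss":
w, b = gradient_step(0.0, 0.0, -1.0, -0.5, lr=0.5)
print(w, b)  # 0.5 0.25
```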


A Complete Numerical Example: Training Until the Neuron Is Confidently Right

Now we will work through a full, concrete example.

We will use:
- a single input $x = 2.0$
- true label $y = 1$ (meaning this example belongs to class 1)
- initial weight $w = 0.0$
- initial bias $b = 0.0$
- learning rate $\eta = 0.5$

**Important Note:** In practice, the parameters $w$ and $b$ are typically initialized randomly (or with a specific initialization strategy), while the learning rate $\eta$ is a hyperparameter chosen by hand. Here we start from simple values ($w = 0.0$, $b = 0.0$) for clarity. Bear with me through these steps: you'll see how gradient descent automatically updates these parameters, using the loss function as a "punishment" signal that guides learning.

At each step, we will:
1. compute $z$
2. compute $\hat{y}$
3. compute the loss $L$
4. compute the gradients $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$
5. update $w$ and $b$
6. observe how $\hat{y}$ moves closer to $1$ (the correct class)

As soon as $\hat{y}$ rises above $0.5$, the neuron is classifying this example correctly (since $0.5$ is the usual decision threshold). But we will continue training until the neuron is not just correct, but confident ($\hat{y}$ close to $1$).

Iteration 1

Current parameters:
• w = 0.0
• b = 0.0

Forward pass:
$z = w \cdot x + b = 0.0 \cdot 2.0 + 0.0 = 0.0$
$\hat{y} = \text{sigmoid}(0.0) = \frac{1}{1 + e^0} = 0.5$

Loss:
$L = -[y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y})]$
$= -[1 \cdot \log(0.5) + 0 \cdot \log(0.5)]$
$= -\log(0.5) \approx 0.6931$

Gradients:

First compute $\frac{\partial L}{\partial \hat{y}}$:
$\frac{\partial L}{\partial \hat{y}} = -\left[\frac{y}{\hat{y}} - \frac{1 - y}{1 - \hat{y}}\right]$
$= -\left[\frac{1}{0.5} - \frac{0}{1 - 0.5}\right]$
$= -[2 - 0] = -2$

Then $\frac{\partial \hat{y}}{\partial z}$:
$\hat{y} = 0.5$, so
$\frac{\partial \hat{y}}{\partial z} = \hat{y} \cdot (1 - \hat{y}) = 0.5 \cdot 0.5 = 0.25$

Then $\frac{\partial L}{\partial z}$:
$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = -2 \cdot 0.25 = -0.5$

Then $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$:
$\frac{\partial z}{\partial w} = x = 2.0$
$\frac{\partial z}{\partial b} = 1$

$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial w} = -0.5 \cdot 2.0 = -1.0$
$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial b} = -0.5 \cdot 1.0 = -0.5$

Update step:
$w_{\text{new}} = w - \eta \cdot \frac{\partial L}{\partial w} = 0.0 - 0.5 \cdot (-1.0) = 0.5$
$b_{\text{new}} = b - \eta \cdot \frac{\partial L}{\partial b} = 0.0 - 0.5 \cdot (-0.5) = 0.25$

After iteration 1:
• w = 0.5
• b = 0.25
• ŷ = 0.5

A prediction of exactly 0.5 sits right on the decision boundary: the neuron is completely undecided. We want it to become much more confident.

Iteration 2

Current parameters:
• w = 0.5
• b = 0.25

Forward pass:
$z = w \cdot x + b = 0.5 \cdot 2.0 + 0.25 = 1.25$
$\hat{y} = \text{sigmoid}(1.25) \approx 0.7773$

Loss:
$L \approx -[1 \cdot \log(0.7773) + 0 \cdot \log(0.2227)] \approx 0.2519$

You can see the loss has already dropped from 0.6931 to about 0.2519.

Gradients (summarized numerically):
$\frac{\partial L}{\partial w} \approx -0.4454$
$\frac{\partial L}{\partial b} \approx -0.2227$

Update:
$w_{\text{new}} = 0.5 - 0.5 \cdot (-0.4454) \approx 0.7227$
$b_{\text{new}} = 0.25 - 0.5 \cdot (-0.2227) \approx 0.3614$

After iteration 2:
• w ≈ 0.7227
• b ≈ 0.3614
• ŷ ≈ 0.7773
• L ≈ 0.2519

The neuron is now predicting about 0.78 probability for the correct class 1. This is not only correct (above 0.5) but reasonably confident.

Iteration 3

Current parameters:
• w ≈ 0.7227
• b ≈ 0.3614

Forward pass:
$z = 0.7227 \cdot 2.0 + 0.3614 \approx 1.8068$
$\hat{y} \approx \text{sigmoid}(1.8068) \approx 0.8590$

Loss:
$L \approx 0.1520$

Gradients:
$\frac{\partial L}{\partial w} \approx -0.2821$
$\frac{\partial L}{\partial b} \approx -0.1410$

Update:
$w_{\text{new}} \approx 0.7227 - 0.5 \cdot (-0.2821) \approx 0.8637$
$b_{\text{new}} \approx 0.3614 - 0.5 \cdot (-0.1410) \approx 0.4319$

After iteration 3:
• w ≈ 0.8637
• b ≈ 0.4319
• ŷ ≈ 0.8590
• L ≈ 0.1520

The neuron now predicts about 86 percent probability that the example is class 1.

Iteration 4

Current parameters:
• w ≈ 0.8637
• b ≈ 0.4319

Forward pass:
$z \approx 0.8637 \cdot 2.0 + 0.4319 \approx 2.1593$
$\hat{y} \approx \text{sigmoid}(2.1593) \approx 0.8965$

Loss:
$L \approx 0.1092$

Gradients:
$\frac{\partial L}{\partial w} \approx -0.2069$
$\frac{\partial L}{\partial b} \approx -0.1035$

Update:
$w_{\text{new}} \approx 0.8637 - 0.5 \cdot (-0.2069) \approx 0.9672$
$b_{\text{new}} \approx 0.4319 - 0.5 \cdot (-0.1035) \approx 0.4836$

After iteration 4:
• w ≈ 0.9672
• b ≈ 0.4836
• ŷ ≈ 0.8965
• L ≈ 0.1092

Now the neuron is almost 90 percent sure the example is class 1.

Iteration 5

Current parameters:
• w ≈ 0.9672
• b ≈ 0.4836

Forward pass:
$z \approx 0.9672 \cdot 2.0 + 0.4836 \approx 2.4180$
$\hat{y} \approx \text{sigmoid}(2.4180) \approx 0.9182$

Loss:
$L \approx 0.0854$

Gradients:
$\frac{\partial L}{\partial w} \approx -0.1636$
$\frac{\partial L}{\partial b} \approx -0.0818$

Update:
$w_{\text{new}} \approx 0.9672 - 0.5 \cdot (-0.1636) \approx 1.0490$
$b_{\text{new}} \approx 0.4836 - 0.5 \cdot (-0.0818) \approx 0.5245$

After iteration 5:
• w ≈ 1.0490
• b ≈ 0.5245
• ŷ ≈ 0.9182
• L ≈ 0.0854

Iteration 6

Current parameters:
• w ≈ 1.0490
• b ≈ 0.5245

Forward pass:
$z \approx 1.0490 \cdot 2.0 + 0.5245 \approx 2.6225$
$\hat{y} \approx \text{sigmoid}(2.6225) \approx 0.9323$

Loss:
$L \approx 0.0701$

Gradients:
$\frac{\partial L}{\partial w} \approx -0.1354$
$\frac{\partial L}{\partial b} \approx -0.0677$

Update:
$w_{\text{new}} \approx 1.0490 - 0.5 \cdot (-0.1354) \approx 1.1167$
$b_{\text{new}} \approx 0.5245 - 0.5 \cdot (-0.0677) \approx 0.5584$

After iteration 6:
• w ≈ 1.1167
• b ≈ 0.5584
• ŷ ≈ 0.9323
• L ≈ 0.0701

At this point, the neuron is over 93 percent sure that the example belongs to class 1. It is clearly predicting the correct class, and it is getting more confident each step.

If we continue a few more iterations:

• Iteration 7: ŷ ≈ 0.9422, L ≈ 0.0595
• Iteration 8: ŷ ≈ 0.9496, L ≈ 0.0517
• Iteration 9: ŷ ≈ 0.9553, L ≈ 0.0457
• Iteration 10: ŷ ≈ 0.9598, L ≈ 0.0410

By iteration 8 to 10, the neuron is around 95 to 96 percent confident that this example is class 1. The loss is small, and further improvements become slower. This is typical: at the beginning, training moves quickly, then it gradually stabilizes as the neuron approaches a good solution.
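The entire walkthrough above can be reproduced in a short Python script (a minimal sketch; the variable names are our own, and the backward pass uses the fact that the two middle chain-rule factors multiply out to ŷ − y):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 2.0, 1    # the single training example from this section
w, b = 0.0, 0.0  # initial parameters
lr = 0.5         # learning rate

for i in range(1, 11):
    # Forward pass
    z = w * x + b
    y_hat = sigmoid(z)
    loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    # Backward pass: dL/dy_hat * dy_hat/dz simplifies to (y_hat - y)
    dL_dz = y_hat - y
    dL_dw = dL_dz * x
    dL_db = dL_dz * 1.0
    # Gradient descent update
    w -= lr * dL_dw
    b -= lr * dL_db
    print(f"iter {i:2d}: y_hat={y_hat:.4f}  loss={loss:.4f}  w={w:.4f}  b={b:.4f}")
```

Running it prints the same trajectory as the tables above: ŷ = 0.5 on iteration 1, about 0.7773 on iteration 2, and roughly 0.9598 by iteration 10.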

Putting It All Together: What Backprop Really Does

If we zoom out, the process looks like this:

1. We start with random or zero values for w and b.
2. For a given input x and true label y, we run the following steps:
3. We compute $z = w \cdot x + b$.
4. We compute $\hat{y} = \text{sigmoid}(z)$.
5. We compute the loss L.
6. We compute how L changes with respect to ŷ ($\frac{\partial L}{\partial \hat{y}}$).
7. We compute how ŷ changes with respect to z ($\frac{\partial \hat{y}}{\partial z}$).
8. We compute how z changes with respect to w and b ($\frac{\partial z}{\partial w} = x$, $\frac{\partial z}{\partial b} = 1$).
9. We combine these with the chain rule to get $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$.
10. We update w and b in the negative direction of these gradients.
11. We repeat this for many iterations and/or many training examples.

That is all backpropagation is: a careful, systematic application of the chain rule to push the loss downward and make predictions better.

Why This Simple Example Is the Foundation of Deep Learning

In this example, we had:

• one input x
• one neuron
• one weight w
• one bias b

In real neural networks:

• x becomes a vector (or an image, or a sequence)
• w becomes a matrix or a large tensor
• we have many neurons per layer
• we have many layers stacked on top of each other

But the core idea does not change. Each neuron still does:

$$z = w \cdot x + b$$

$$\hat{y} = \text{activation}(z)$$

and the training loop is still the same:

• the loss is computed
• gradients are computed via the chain rule
• parameters are updated by gradient descent

This fundamental understanding forms the backbone of all deep learning architectures, including Convolutional Neural Networks (CNNs), which we'll explore in future articles.