Building a Neural Network from Scratch to Recognize Handwritten Digits

I learned something mind-blowing today: you can teach a computer to recognize handwritten digits with just a few hundred lines of code! Inspired by 3Blue1Brown’s neural network video series, I dove into building a neural network from scratch in Python to tackle the MNIST dataset—a collection of 28x28 pixel images of handwritten digits (0-9). This project was a journey of discovery, filled with surprises and “aha!” moments. Let’s explore how I built it, what I learned, and the code that made it happen.

Why This Project Sparked My Curiosity

3Blue1Brown’s videos made neural networks feel like magic decoded into math. Neurons, weights, biases, and activation functions clicked for me as a system that mimics learning. I didn’t just want to nod along—I wanted to build it. The MNIST dataset, with 60,000 training images and 10,000 test images, was the perfect challenge. My goal? Feed pixel values into a neural network and have it guess the correct digit. Spoiler: It worked, but not without some hiccups!

Designing the Neural Network

I tried picturing a neural network as a layered sandwich of math. Each layer processes inputs, applies weights and biases, and passes the result through an activation function. For MNIST, my network needed to handle 784 input pixels (28x28 images flattened) and output probabilities for 10 digits.

My architecture:

  • Input layer: 784 neurons (one per pixel).
  • Hidden layer: 16 neurons (a compromise after testing 10 and 32).
  • Output layer: 10 neurons (one per digit, 0-9).

Here’s a diagram to visualize it:

Diagram of neural network with input, hidden, and output layers

What surprised me: Choosing 16 hidden neurons was trial and error. Too few (10), and the model struggled; too many (32), and it got sluggish. This made me wonder: How do experts pick the “perfect” number of neurons?
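One thing that helped me think about the trade-off is counting parameters. Here's a quick back-of-the-envelope sketch (plain Python, separate from the network code) showing how the hidden-layer size changes the number of weights and biases the network has to learn:

# Rough parameter count for a 784 -> hidden -> 10 network
for hidden in (10, 16, 32):
    params = 784 * hidden + hidden + hidden * 10 + 10  # weights1 + bias1 + weights2 + bias2
    print(f"{hidden} hidden neurons -> {params:,} parameters")

With 16 hidden neurons that works out to 12,730 parameters, which turned out to be plenty for this task.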

The Math That Powers It

Thanks to 3Blue1Brown, I grasped the core steps:

  1. Forward propagation: Pass inputs through layers to get predictions.
  2. Loss function: Measure errors (I used mean squared error for simplicity).
  3. Backpropagation: Adjust weights and biases to reduce errors via gradient descent.

Backpropagation felt like reverse-engineering a mistake. It uses the chain rule to figure out how much each weight and bias contributed to the error, then nudges them slightly. Coding this was like solving a puzzle I didn’t know I could crack!
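To convince myself the chain rule was doing what I thought, a tiny toy case helped: one input, one weight, a sigmoid, and a squared error, comparing the analytic gradient against a finite-difference estimate. (The values x, w, and t below are made up for illustration; this isn't part of the MNIST network itself.)

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Toy setup: loss = (t - sigmoid(w * x))^2
x, w, t = 0.5, 0.8, 1.0

# Analytic gradient via the chain rule:
# dL/dw = -2 * (t - a) * a * (1 - a) * x, where a = sigmoid(w * x)
a = sigmoid(w * x)
analytic = -2 * (t - a) * a * (1 - a) * x

# Finite-difference estimate of the same gradient
eps = 1e-6
loss = lambda w_: (t - sigmoid(w_ * x)) ** 2
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(analytic, numeric)  # the two values should agree closely

Seeing the two numbers match is a nice sanity check that the backpropagation math is wired up correctly before scaling it to full weight matrices.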

The Code: Bringing It to Life

I used Python and NumPy for matrix math, loading MNIST via tensorflow.keras.datasets for convenience. The network itself? Purely from scratch. Here’s the code, with comments on my thought process:

import numpy as np
from tensorflow.keras.datasets import mnist

# Load and preprocess MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(60000, 784) / 255.0  # Flatten and normalize
X_test = X_test.reshape(10000, 784) / 255.0
y_train = np.eye(10)[y_train]  # One-hot encode labels
y_test = np.eye(10)[y_test]

# Sigmoid activation function (inputs clipped to avoid overflow in np.exp)
def sigmoid(x):
    x = np.clip(x, -500, 500)
    return 1 / (1 + np.exp(-x))

# Derivative of sigmoid for backpropagation
def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

class NeuralNetwork:
    def __init__(self):
        # Initialize weights and biases
        self.weights1 = np.random.randn(784, 16) * 0.01  # Input to hidden
        self.weights2 = np.random.randn(16, 10) * 0.01   # Hidden to output
        self.bias1 = np.zeros((1, 16))                   # Hidden layer bias
        self.bias2 = np.zeros((1, 10))                   # Output layer bias

    def forward(self, X):
        # Forward propagation
        self.z1 = np.dot(X, self.weights1) + self.bias1
        self.a1 = sigmoid(self.z1)  # Hidden layer activation
        self.z2 = np.dot(self.a1, self.weights2) + self.bias2
        self.a2 = sigmoid(self.z2)  # Output layer activation
        return self.a2

    def backward(self, X, y, output, learning_rate=0.1):
        # Backpropagation: work out how much each weight and bias contributed
        # to the error. Gradients are averaged over the batch so the step size
        # doesn't scale with the number of samples.
        m = X.shape[0]
        self.error = y - output
        self.delta2 = self.error * sigmoid_derivative(self.z2)
        self.error_hidden = np.dot(self.delta2, self.weights2.T)
        self.delta1 = self.error_hidden * sigmoid_derivative(self.z1)

        # Update weights and biases (step in the direction that reduces the loss)
        self.weights2 += learning_rate * np.dot(self.a1.T, self.delta2) / m
        self.bias2 += learning_rate * np.sum(self.delta2, axis=0, keepdims=True) / m
        self.weights1 += learning_rate * np.dot(X.T, self.delta1) / m
        self.bias1 += learning_rate * np.sum(self.delta1, axis=0, keepdims=True) / m

    def train(self, X, y, epochs=100):
        for epoch in range(epochs):
            output = self.forward(X)
            self.backward(X, y, output)
            if epoch % 10 == 0:
                loss = np.mean(np.square(y - output))
                print(f"Epoch {epoch}, Loss: {loss:.4f}")

# Initialize and train the network
nn = NeuralNetwork()
nn.train(X_train, y_train, epochs=100)

# Test accuracy
predictions = nn.forward(X_test)
accuracy = np.mean(np.argmax(predictions, axis=1) == np.argmax(y_test, axis=1))
print(f"Test Accuracy: {accuracy:.4f}")

I didn’t know this before, but normalizing pixel values (dividing by 255) was a game-changer. Without it, the gradients went haywire, and the loss skyrocketed. Also, scaling weights by 0.01 kept the sigmoid from flattening out early on—a trick I stumbled upon after debugging.
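One way to see part of the problem: with raw 0-255 pixels, the first layer's pre-activations are already huge, so the sigmoid starts out saturated and the hidden units get almost no gradient to learn from. Here's a small sketch using made-up pixel values and random weights (not the trained network):

import numpy as np

rng = np.random.default_rng(0)
raw_pixels = rng.integers(0, 256, size=784).astype(float)  # unnormalized input
weights = rng.standard_normal((784, 16)) * 0.01

z_raw = raw_pixels @ weights             # pre-activations with raw 0-255 pixels
z_norm = (raw_pixels / 255.0) @ weights  # pre-activations with normalized pixels

print(np.abs(z_raw).mean())   # typically in the tens: sigmoid is saturated here
print(np.abs(z_norm).mean())  # well under 1: sigmoid stays in its responsive range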

Here are some MNIST digits to show what the network works with:

Sample MNIST digits (0-9)

Training Insights

Here’s how the training loss evolved:

Graph of training loss decreasing over 100 epochs

After 100 epochs, the network hit 85-90% accuracy on the test set. Not perfect, but I was floored that my code could recognize digits! One surprise: The learning rate (0.1) was finicky—too high (1.0), and the loss jumped; too low (0.01), and training was slow.
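If you want to see that finickiness yourself, a quick sweep like the one below reuses the NeuralNetwork class from above and runs only a few epochs per rate, so the losses are just a rough comparison (exact numbers will vary with the random initialization):

for lr in (1.0, 0.1, 0.01):
    net = NeuralNetwork()
    for _ in range(20):
        out = net.forward(X_train)
        net.backward(X_train, y_train, out, learning_rate=lr)
    loss = np.mean(np.square(y_train - net.forward(X_train)))
    print(f"learning rate {lr}: loss after 20 epochs = {loss:.4f}")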

A big “oops” moment: I forgot to normalize inputs early on, and the loss exploded. Fixing that taught me how sensitive neural networks are to data prep. This made me wonder: How do real-world models handle messier data, like smudged digits?

What’s Next?

This project felt like building a tiny brain from scratch. I’m curious to try:

  • Adding a second hidden layer to boost accuracy.
  • Swapping sigmoid for ReLU to see if it trains faster (a rough sketch of the swap follows this list).
  • Visualizing weights to understand what the network “sees.”
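For the ReLU idea, the swap would look roughly like this. The relu and relu_derivative helpers are my own sketch, not part of the code above, and the output layer would need rethinking too, since ReLU outputs aren't squashed into the 0-1 range the way sigmoid's are:

# ReLU and its derivative, as drop-in candidates for the hidden layer
def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

# In forward():  self.a1 = relu(self.z1)
# In backward(): self.delta1 = self.error_hidden * relu_derivative(self.z1)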

If you’ve built a neural network or played with MNIST, what surprised you most? Drop a comment—I’d love to hear your insights! Try tweaking the code above. What happens if you change the neurons or epochs?

Photo Credits

Neural network diagram and MNIST digits by Grok.

Training loss graph by Grok.