Neural Network Inference on FPGA (Verilog)
  
I implemented a compact feed-forward neural network for MNIST digit recognition that runs entirely on an FPGA. The model is trained in Python and deployed to hardware with quantized weights stored on-chip. The design avoids high-level loops and vendor IP multipliers: everything is built from basic RTL, including a custom multiply-accumulate datapath constructed from add/shift operations.
  
Highlights
  
- Architecture: 2-layer fully-connected network (input 28×28 → hidden → 10 classes) with a lightweight nonlinearity implemented in combinational logic.
- Fixed-point inference: integer Q-format throughout (activations and weights), with per-layer scaling to keep values in range.
- No loops / minimal primitives: control is explicit via an FSM; MACs are constructed from add/shift operations (no DSP blocks), matching course constraints.
- On-chip storage: weights are embedded as initialized memory arrays to fit within block RAM and simplify bring-up.
- Deterministic latency: a streaming controller feeds pixels, sequences layers, and produces a 10-way argmax within a fixed cycle budget.
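The add/shift MAC mentioned above can be modeled in software for reference. A minimal sketch, assuming 8-bit signed operands; it mirrors the idea of the RTL datapath (one conditional shifted add per multiplier bit), not the actual module:

```python
def shift_add_mul(a: int, b: int, width: int = 8) -> int:
    """Shift-and-add multiply, modeling an add/shift RTL datapath.

    Handles signed operands by multiplying magnitudes and restoring the sign,
    since the hardware typically works on magnitudes or two's-complement
    partial products.
    """
    sign = -1 if (a < 0) != (b < 0) else 1
    a, b = abs(a), abs(b)
    acc = 0
    for i in range(width):       # one adder stage per multiplier bit in hardware
        if (b >> i) & 1:         # bit i of the multiplier selects a shifted copy of a
            acc += a << i
    return sign * acc
```

In hardware this loop is either unrolled into a chain of adders or sequenced over several cycles by the FSM, which is what makes the design DSP-free.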
How it works
  
- Train & quantize (Python): train the network, then quantize weights/activations to fixed-point. Export weights as hex arrays.
- HDL integration (Verilog): include the weight arrays as initial memory contents in BRAM-synthesizable modules.
- Datapath: a time-multiplexed MAC unit (add/shift multiplier + accumulator) iterates over inputs and neurons; saturation and shifting handle scaling.
- Control: a finite-state machine orchestrates load → accumulate → activate → next neuron/layer → argmax.
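The quantize-and-export step can be sketched as follows. This is a minimal illustration assuming 8-bit signed weights with 4 fractional bits and a saturating right shift after accumulation; the actual Q-format, shift amounts, and memory-file format (`$readmemh`-style hex is assumed here) are project-specific:

```python
import numpy as np

FRAC_BITS = 4              # assumed fractional bits (Q-format); project-specific
W_MIN, W_MAX = -128, 127   # 8-bit signed range

def quantize(w: np.ndarray, frac_bits: int = FRAC_BITS) -> np.ndarray:
    """Quantize float weights to 8-bit signed fixed point with clipping."""
    q = np.round(w * (1 << frac_bits)).astype(np.int64)
    return np.clip(q, W_MIN, W_MAX)

def saturating_rescale(acc: np.ndarray, shift: int = FRAC_BITS) -> np.ndarray:
    """Arithmetic right shift plus saturation, modeling the per-layer rescale."""
    out = acc >> shift
    return np.clip(out, W_MIN, W_MAX)

def to_hex(q: np.ndarray) -> list:
    """Emit two's-complement hex strings suitable for a Verilog memory file."""
    return [f"{int(v) & 0xFF:02x}" for v in q.flatten()]
```

For example, `to_hex(quantize(np.array([1.0, -1.0])))` yields `["10", "f0"]`: +1.0 becomes 16 in Q4 format and -1.0 becomes -16, stored as `0xf0` in two's complement.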
Testing & Validation
  
- Python testbench emits fixed-point test vectors; the Verilog testbench compares HDL outputs to Python ground truth.
- Cycle-accurate simulation verifies controller sequencing and saturation behavior before programming the board.
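The golden-model comparison reduces to checking per-logit equality and matching argmax decisions. A hypothetical harness, assuming the HDL simulation produces one integer logit per class (function names and the tie-breaking convention are illustrative):

```python
def compare_outputs(golden, hdl):
    """Return (index, golden, hdl) tuples for every mismatched logit."""
    assert len(golden) == len(hdl), "logit vectors must be the same length"
    return [(i, g, h) for i, (g, h) in enumerate(zip(golden, hdl)) if g != h]

def argmax10(logits):
    """10-way argmax; first maximum wins on ties, matching a simple
    compare-and-hold hardware implementation."""
    best = 0
    for i in range(1, len(logits)):
        if logits[i] > logits[best]:
            best = i
    return best
```

An empty mismatch list per test vector, plus agreeing argmax results, is the pass criterion before moving to the board.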
What I built/learned
  
- Designed a fully fixed-point NN inference path from first principles (quantization, scaling, saturation).
- Implemented a resource-lean MAC and memory layout to meet BRAM and logic constraints.
- Wrote deterministic, loop-free RTL with a clear FSM interface for portability and timing closure.
▶︎ See it running on hardware: Demo video