Neural Network Inference on FPGA (Verilog)
  
I implemented a compact feed-forward neural network for MNIST digit recognition that runs entirely on an FPGA. The model is trained in Python and deployed to hardware with quantized weights stored on-chip. The design avoids high-level loops and vendor IP multipliers: everything is built from basic RTL, including a custom multiply-accumulate datapath constructed from add/shift operations.
  
Highlights
  
- Architecture: 2-layer fully-connected network (input 28×28 → hidden → 10 classes) with a lightweight nonlinearity implemented in combinational logic.
- Fixed-point inference: integer Q-format throughout (activations and weights), with per-layer scaling to keep values in range.
- No loops / minimal primitives: control is explicit via an FSM; MACs are constructed from add/shift operations (no DSP blocks), matching course constraints.
- On-chip storage: weights are embedded as initialized memory arrays to fit within block RAM and simplify bring-up.
- Deterministic latency: a streaming controller feeds pixels, sequences layers, and produces a 10-way argmax within a fixed cycle budget.
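The add/shift MAC mentioned above can be modeled in software for reference. A minimal sketch, assuming 8-bit signed operands; it mirrors the idea of the RTL datapath (one conditional shifted add per multiplier bit), not the actual module:

```python
def shift_add_mul(a: int, b: int, width: int = 8) -> int:
    """Shift-and-add multiply, modeling an add/shift RTL datapath.

    Handles signed operands by multiplying magnitudes and restoring the sign,
    since the hardware typically works on magnitudes or two's-complement
    partial products.
    """
    sign = -1 if (a < 0) != (b < 0) else 1
    a, b = abs(a), abs(b)
    acc = 0
    for i in range(width):       # one adder stage per multiplier bit in hardware
        if (b >> i) & 1:         # bit i of the multiplier selects a shifted copy of a
            acc += a << i
    return sign * acc
```

In hardware this loop is either unrolled into a chain of adders or sequenced over several cycles by the FSM, which is what makes the design DSP-free.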
How it works
  
- Train & quantize (Python): train the network, then quantize weights/activations to fixed-point. Export weights as hex arrays.
- HDL integration (Verilog): include the weight arrays as initial memory contents in BRAM-synthesizable modules.
- Datapath: a time-multiplexed MAC unit (add/shift multiplier + accumulator) iterates over inputs and neurons; saturation and shifting handle scaling.
- Control: a finite-state machine orchestrates load → accumulate → activate → next neuron/layer → argmax.
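The quantize-and-export step can be sketched as follows. This is a minimal illustration assuming 8-bit signed weights with 4 fractional bits and a saturating right shift after accumulation; the actual Q-format, shift amounts, and memory-file format (`$readmemh`-style hex is assumed here) are project-specific:

```python
import numpy as np

FRAC_BITS = 4              # assumed fractional bits (Q-format); project-specific
W_MIN, W_MAX = -128, 127   # 8-bit signed range

def quantize(w: np.ndarray, frac_bits: int = FRAC_BITS) -> np.ndarray:
    """Quantize float weights to 8-bit signed fixed point with clipping."""
    q = np.round(w * (1 << frac_bits)).astype(np.int64)
    return np.clip(q, W_MIN, W_MAX)

def saturating_rescale(acc: np.ndarray, shift: int = FRAC_BITS) -> np.ndarray:
    """Arithmetic right shift plus saturation, modeling the per-layer rescale."""
    out = acc >> shift
    return np.clip(out, W_MIN, W_MAX)

def to_hex(q: np.ndarray) -> list:
    """Emit two's-complement hex strings suitable for a Verilog memory file."""
    return [f"{int(v) & 0xFF:02x}" for v in q.flatten()]
```

For example, `to_hex(quantize(np.array([1.0, -1.0])))` yields `["10", "f0"]`: +1.0 becomes 16 in Q4 format and -1.0 becomes -16, stored as `0xf0` in two's complement.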
Testing & Validation
  
- Python testbench emits fixed-point test vectors; the Verilog testbench compares HDL outputs to Python ground truth.
- Cycle-accurate simulation verifies controller sequencing and saturation behavior before programming the board.
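The golden-model comparison reduces to checking per-logit equality and matching argmax decisions. A hypothetical harness, assuming the HDL simulation produces one integer logit per class (function names and the tie-breaking convention are illustrative):

```python
def compare_outputs(golden, hdl):
    """Return (index, golden, hdl) tuples for every mismatched logit."""
    assert len(golden) == len(hdl), "logit vectors must be the same length"
    return [(i, g, h) for i, (g, h) in enumerate(zip(golden, hdl)) if g != h]

def argmax10(logits):
    """10-way argmax; first maximum wins on ties, matching a simple
    compare-and-hold hardware implementation."""
    best = 0
    for i in range(1, len(logits)):
        if logits[i] > logits[best]:
            best = i
    return best
```

An empty mismatch list per test vector, plus agreeing argmax results, is the pass criterion before moving to the board.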
What I built/learned
  
- Designed a fully fixed-point NN inference path from first principles (quantization, scaling, saturation).
- Implemented a resource-lean MAC and memory layout to meet BRAM and logic constraints.
- Wrote deterministic, loop-free RTL with a clear FSM interface for portability and timing closure.
▶︎ See it running on hardware: Demo video