
Making RiscyOS: I SPI an SD Card

In this post, I dive into the design and implementation of an SPI-based SD card interface for RiscyOS. I walk through building a Verilog SPI peripheral and my own debugging experience.

Published July 4, 2026

#OS #Verilog #Embedded #RISC-V

This is part two of my RiscyOS development blog. In this part I outline the process of creating an SPI peripheral for communicating with an SD card.

For reference, the breakout board that I bought was the Adafruit SD 3V Breakout Board.

An SD card supports two main communication protocols: Serial Peripheral Interface (SPI) and Secure Digital Input Output (SDIO). SDIO was introduced in 2001 and is generally faster, because it allows a higher clock rate (50 MHz for SDIO vs 40 MHz for SPI) and has four data pins (DAT0-3), so it transfers 4 bits per clock, whereas SPI has only MISO and MOSI and transfers 1 bit per clock in each direction.

Despite being faster and having a higher bandwidth, SDIO is not as well documented as SPI, is more complex to implement and is not universally supported. For these reasons I chose to use SPI.

I had two options: either bit-bang the protocol or create a dedicated peripheral that handles it in hardware. Let's analyse these two options.

Definition: Bit banging is the technique of implementing a hardware communication protocol entirely in software by directly controlling and monitoring general-purpose I/O (GPIO) pins, rather than using dedicated hardware peripherals.

My hardware runs at 40 MHz, and to bit-bang SPI the general sequence would be:

# in:  a0 = byte to transmit
# out: a0 = byte received

transfer_byte:
    la   t0, SPI_GPIO        # GPIO register
    li   t1, 7               # bit index (MSB first)
    li   t4, 0               # received byte

loop:
    # Get transmit bit
    srl  t2, a0, t1
    andi t2, t2, 1

    # Set MOSI
    lw   t3, 0(t0)
    li   t5, MOSI_BIT
    not  t6, t5
    and  t3, t3, t6          # clear MOSI
    beqz t2, 1f
    or   t3, t3, t5          # set MOSI if bit=1
1:
    sw   t3, 0(t0)

    # Clock high
    ori  t3, t3, SCK_BIT
    sw   t3, 0(t0)

    # Sample MISO
    lw   t3, 0(t0)
    andi t3, t3, MISO_BIT
    beqz t3, 2f
    li   t5, 1
    sll  t5, t5, t1
    or   t4, t4, t5          # record a 1 at the current bit index
2:

    # Clock low
    lw   t3, 0(t0)
    li   t5, SCK_BIT
    not  t6, t5
    and  t3, t3, t6
    sw   t3, 0(t0)

    addi t1, t1, -1
    bgez t1, loop

    mv   a0, t4
    ret

I know this code is not optimised, but it serves as a demonstration of what a bit-banged SPI transfer looks like. By my calculations, the worst case (sending 0x00 and receiving 0xFF) takes 213 clock cycles for the function to complete. Reading a standard 512-byte SD card block would therefore take about 109,056 clock cycles, which at 40 MHz is roughly 2.7 ms and corresponds to an effective SPI clock of about 1.5 MHz.

However, a dedicated hardware peripheral lets us offload the communication. It can run at a much higher clock rate: dividing the 40 MHz master clock gives an SPI clock of up to 20 MHz, so the same 512-byte transfer takes roughly 0.2 ms, and the peripheral raises an interrupt when the transfer is complete. This frees the CPU for more urgent tasks whilst the SPI transfer happens in the background, which is of course assuming that we have DMA.
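To sanity-check the numbers above, here is the arithmetic as a short Python sketch. The 213-cycle figure is my own estimate for the bit-banged routine, and the hardware figure assumes back-to-back bytes with no gap between them, which is optimistic.

```python
CPU_HZ = 40_000_000              # system clock
BLOCK_BYTES = 512                # one SD card block
CYCLES_PER_BYTE_BITBANG = 213    # estimated worst case for transfer_byte

# Bit-banged transfer: every byte costs a full call to transfer_byte.
bitbang_cycles = CYCLES_PER_BYTE_BITBANG * BLOCK_BYTES
bitbang_seconds = bitbang_cycles / CPU_HZ
bitbang_sclk_hz = 8 * CPU_HZ / CYCLES_PER_BYTE_BITBANG  # 8 bits per call

# Hardware peripheral: the fastest SPI clock is half the system clock.
HW_SCLK_HZ = CPU_HZ // 2
hw_seconds = BLOCK_BYTES * 8 / HW_SCLK_HZ

print(f"bit-bang: {bitbang_cycles} cycles, {bitbang_seconds * 1e3:.2f} ms, "
      f"{bitbang_sclk_hz / 1e6:.2f} MHz effective")
print(f"hardware: {hw_seconds * 1e3:.2f} ms at {HW_SCLK_HZ / 1e6:.0f} MHz")
```

On these assumptions the peripheral is a little over thirteen times faster per block, before counting the CPU time it frees up.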

The SPI protocol

I have mentioned SPI many times in this post, so I should explain what it is. SPI stands for Serial Peripheral Interface and uses four wires:

MOSI (Master Out Slave In): the data line on which we send bits to the slave
MISO (Master In Slave Out): the data line on which the slave sends bits to us
SCLK (Serial Clock): a shared clock which is generated by the master
CS (Chip Select): the line which tells a slave that we are talking or listening to it

SPI defines four operating modes, determined by two configuration bits: clock polarity (CPOL) and clock phase (CPHA). These control the idle state of the clock and the clock edge on which data is sampled.

Clock Polarity (CPOL) defines the idle state of the serial clock (SCLK). When CPOL = 0, the clock idles low, making the leading edge a rising edge and the trailing edge a falling edge. Conversely, when CPOL = 1, the clock idles high, so the leading edge is a falling edge and the trailing edge is a rising edge.

Clock Phase (CPHA) determines on which clock edge data is sampled. When CPHA = 0, data is sampled on the leading edge and changes on the trailing edge. When CPHA = 1, data changes on the leading edge and is sampled on the trailing edge.

This can be quite confusing, so I would recommend watching Understanding SPI by Rohde & Schwarz.

SD cards typically operate in SPI Mode 0 (CPOL = 0, CPHA = 0). However, it is good practice to consult the documentation for the specific SD card being used to verify the required operating mode.
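The four modes can be summarised in a few lines of Python. This is only an illustrative sketch: the function name describe_spi_mode is made up for this post, and the mode number follows the usual convention of (CPOL << 1) | CPHA.

```python
def describe_spi_mode(cpol, cpha):
    """Return the idle level and the sample/shift edges for a CPOL/CPHA pair."""
    # CPOL fixes the idle level, which in turn fixes what the leading edge is.
    leading = "rising" if cpol == 0 else "falling"
    trailing = "falling" if cpol == 0 else "rising"
    # CPHA picks which of the two edges is used for sampling.
    sample = leading if cpha == 0 else trailing
    shift = trailing if cpha == 0 else leading
    return {
        "mode": (cpol << 1) | cpha,
        "idle": "low" if cpol == 0 else "high",
        "sample": sample,
        "shift": shift,
    }


for cpol in (0, 1):
    for cpha in (0, 1):
        print(describe_spi_mode(cpol, cpha))
```

Mode 0, the one SD cards typically use, comes out as idle low, sample on the rising edge and change data on the falling edge.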

Memory-mapped peripheral design

The SPI controller is exposed to the processor as a memory-mapped peripheral. This allows software to configure and control the peripheral using ordinary load and store instructions without requiring any specialised I/O instructions. The peripheral occupies a dedicated address range, containing both control registers and the internal transmit and receive buffers used for block transfers.

The address space is divided into two regions. The first contains the control and status registers, while the second provides two 512-byte buffers, Tx and Rx. On a block transfer, the peripheral sends the specified number of bytes from the Tx buffer and overwrites the Rx buffer with the bytes it receives, always starting at the first byte of each buffer.

Address Range	Description	Access Type
0x000 - 0x1FF	Memory-mapped registers	Word aligned
0x200 - 0x7FF	Buffer RAM (Tx and Rx)	Byte, half-word or word

Register Map

Offset	Name	Type	Description
`0x000`	CONTROL	WO	Trigger pulses for Start, Stop, and Mode selection.
`0x004`	STATUS	RO/W1C	Read status flags; Write any value to clear IRQs.
`0x008`	CONFIG	RW	SPI Clock polarity, phase, and clock divider.
`0x00C`	CS	RW	Manual Chip Select control.
`0x010`	TXDATA	RW	Data to be transmitted in manual mode.
`0x014`	RXDATA	RO	Last received byte. Reading clears `rx_valid_flag`.
`0x018`	BLOCK_LEN	RW	Number of bytes to transfer in Block Mode.
`0x01C`	IRQ_ENABLE	RW	Interrupt mask for done and error events.

For specific information about how each register works, I direct you to the SPI Docs.
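As a rough illustration of the map above, here is a small Python sketch that classifies an offset into the peripheral's address space. The offsets and ranges come from the two tables; the function name decode_offset and the "RESERVED" label are hypothetical and not part of the real hardware or driver.

```python
REGISTERS = {
    0x000: "CONTROL",
    0x004: "STATUS",
    0x008: "CONFIG",
    0x00C: "CS",
    0x010: "TXDATA",
    0x014: "RXDATA",
    0x018: "BLOCK_LEN",
    0x01C: "IRQ_ENABLE",
}
BUFFER_BASE = 0x200   # start of the buffer RAM region
BUFFER_END = 0x7FF    # last address of the buffer RAM region


def decode_offset(offset):
    """Classify an offset as a register access or a buffer access."""
    if 0 <= offset < BUFFER_BASE:
        if offset % 4 != 0:
            raise ValueError("register accesses must be word aligned")
        return ("register", REGISTERS.get(offset, "RESERVED"))
    if BUFFER_BASE <= offset <= BUFFER_END:
        return ("buffer", offset - BUFFER_BASE)
    raise ValueError("offset is outside the peripheral")


print(decode_offset(0x018))
print(decode_offset(0x204))
```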

Unlike many SPI peripherals, SD cards require manual control of the chip select (CS) signal rather than automatically asserting and deasserting it for each byte or word transferred. This is because the CS line must remain low for the entire duration of a command and its response. During the initialisation sequence, however, the SD card specification requires that at least 74 dummy clock cycles are provided with CS held high so that the card can enter SPI mode correctly. Due to these constraints, the SPI controller must allow software to control the CS line independently of individual data transfers. This is discussed in more detail later on but is mentioned here to justify why I provided a direct interface to control CS.
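To make the CS requirement concrete, here is a sketch of the start of the initialisation sequence in Python. FakeSpiBus is a made-up stand-in for the peripheral (it is not my driver and not a model of a real card); it only records what was sent and answers CMD0 with the idle response 0x01. Ten 0xFF bytes give 80 clocks, which covers the 74 required.

```python
CMD0 = bytes([0x40, 0x00, 0x00, 0x00, 0x00, 0x95])  # GO_IDLE_STATE and its fixed CRC


class FakeSpiBus:
    """Hypothetical test double: logs (cs_high, byte) and fakes a card's CMD0 reply."""

    def __init__(self):
        self.cs_high = True
        self.log = []
        self._selected_bytes = []
        self._replies = []

    def set_cs(self, high):
        self.cs_high = high
        if high:
            self._selected_bytes.clear()

    def transfer(self, tx):
        rx = self._replies.pop(0) if self._replies else 0xFF
        self.log.append((self.cs_high, tx))
        if not self.cs_high:
            self._selected_bytes.append(tx)
            if bytes(self._selected_bytes[-6:]) == CMD0:
                self._replies = [0xFF, 0x01]  # one filler byte, then R1 = idle
        return rx


def enter_spi_mode(bus, max_polls=8):
    bus.set_cs(True)
    for _ in range(10):            # 80 dummy clocks with CS held high
        bus.transfer(0xFF)
    bus.set_cs(False)              # CS stays low for the command AND its response
    for byte in CMD0:
        bus.transfer(byte)
    r1 = 0xFF
    for _ in range(max_polls):
        r1 = bus.transfer(0xFF)
        if (r1 & 0x80) == 0:       # a valid R1 response has its top bit clear
            break
    bus.set_cs(True)
    return r1


bus = FakeSpiBus()
print(hex(enter_spi_mode(bus)))
```

If CS were toggled automatically around every byte, neither the dummy clocks with CS high nor the command-plus-response with CS low could be expressed, which is why the CS register exists.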

The Verilog architecture

To create this peripheral we were given a blank module which was already integrated into the rest of the processor. This was quite different from the Verilog tasks that we had been given previously. To start with, it was completely optional (although it was noted that projects with both hardware and software would receive more marks). Other courseworks were also more structured and asked us to implement one specific thing, like an ALU or a datapath. This one was instead "here is an interface, make whatever you want", which gave me a lot more freedom. The most annoying thing was that each compilation took around half an hour, which made debugging quite terrible.

I tried to keep the peripheral design as general as possible and avoided tailoring the hardware to an SD card specifically.

I decided to split up the design into smaller components each with a specific purpose. Here is the top level architecture diagram: spi_peripheral_svg

This is a diagram of the peripheral created using yosys: spi_yosys

The main engine

SPI Engine Source Code

module SPI_Engine (
    input  logic       clk,
    input  logic       reset,

    // Byte interface
    input  logic [7:0] tx_byte,
    input  logic       start_byte,
    output logic [7:0] rx_byte,
    output logic       byte_done,

    // Configuration
    input  logic       cpol,
    input  logic       cpha,
    input  logic [7:0] clk_divider,
    input  logic       cs,

    // SPI pins
    output logic       mosi,
    input  logic       miso,
    output logic       sclk,
    output logic       cs_out
);

    logic [3:0] bit_count;   
    logic [7:0] tx_shifter;  
    logic [7:0] rx_shifter;  
    logic       working;     
    logic       pending;     // High when we are waiting for the next tick to start
    logic       sclk_reg;    
    logic       tick;        

    assign sclk = sclk_reg;
    assign rx_byte = rx_shifter;
    assign cs_out = cs;
    assign mosi = tx_shifter[7];

    // Clock divider is now free-running (only reset by global reset)
    Clock_Divider clk_div_inst (
        .clk_in(clk),
        .reset(reset), 
        .clk_divisor(clk_divider),
        .tick(tick)
    );

    logic [4:0] tick_count; // Counts 0 to 15 half-cycles

    always_ff @(posedge clk) begin
        if (reset) begin
            working    <= 1'b0;
            pending    <= 1'b0;
            sclk_reg   <= cpol;
            tick_count <= 5'd0;
            byte_done  <= 1'b0;
        end else begin
            byte_done <= 1'b0;

            if (start_byte && !working) begin
                pending    <= 1'b1;
                tx_shifter <= tx_byte;
            end

            // synchronize start with the next available tick
            if (pending && tick) begin
                pending    <= 1'b0;
                working    <= 1'b1;

                sclk_reg   <= !sclk_reg; // First toggle (leading edge of bit 7)

                // IMMEDIATE SAMPLE when CPHA = 0 (Modes 0 and 2)
                if (cpha == 1'b0) begin
                    rx_shifter <= {rx_shifter[6:0], miso};
                end
                tick_count <= 5'd1; // half-cycle 0 has just been issued
            end
            
            else if (working && tick) begin
                sclk_reg <= !sclk_reg;
                
                // tick_count[0] == 0: Leading Edge (half-cycles 0, 2, 4, etc.)
                // tick_count[0] == 1: Trailing Edge (half-cycles 1, 3, 5, etc.)
                if (tick_count[0] == cpha) begin
                    // SAMPLE PHASE
                    rx_shifter <= {rx_shifter[6:0], miso};
                end else begin
                    // SHIFT PHASE
                    if (tick_count < 15) begin
                        tx_shifter <= {tx_shifter[6:0], 1'b0};
                    end
                end

                // Increment and Exit
                if (tick_count == 5'd15) begin
                    working    <= 1'b0;
                    byte_done  <= 1'b1;
                    sclk_reg   <= cpol; // Force return to idle
                end else begin
                    tick_count <= tick_count + 1'b1;
                end
            end
        end
    end

endmodule

The SPI engine is responsible solely for serialising a byte onto the SPI bus. It has no knowledge of block transfers, buffers or SD card commands; it simply accepts a byte, shifts it out according to the configured SPI mode whilst also sampling and shifting in from MISO and raises a byte_done pulse once eight bits have been transmitted.
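The half-cycle bookkeeping is easier to follow in software, so here is a small Python behavioural sketch of one byte transfer. It mirrors the structure of the engine (16 half-cycles, sample when the low bit of the tick count equals CPHA, shift otherwise) but it is a simplified model of my own, not generated from the Verilog, and the function name is made up.

```python
def spi_engine_transfer(tx_byte, miso_byte, cpol=0, cpha=0):
    """Model one 8-bit transfer, MSB first.

    Returns (rx_byte, mosi_seen, sclk_trace), where mosi_seen lists the MOSI
    level at each sample point and sclk_trace the clock after each half-cycle.
    """
    tx_shifter = tx_byte & 0xFF
    rx_shifter = 0
    sclk = cpol                       # the clock idles at CPOL
    mosi_seen, sclk_trace = [], []

    for tick_count in range(16):      # 16 half-cycles = 8 full clock periods
        sclk ^= 1
        sclk_trace.append(sclk)
        if (tick_count & 1) == cpha:
            # Sample phase: the slave presents bit 7 first, then 6, and so on.
            miso = (miso_byte >> (7 - tick_count // 2)) & 1
            rx_shifter = ((rx_shifter << 1) | miso) & 0xFF
            mosi_seen.append(tx_shifter >> 7)
        elif 0 < tick_count < 15:
            # Shift phase: move the next bit onto MOSI (7 shifts in total).
            tx_shifter = (tx_shifter << 1) & 0xFF

    return rx_shifter, mosi_seen, sclk_trace


rx, mosi, sclk = spi_engine_transfer(0xA5, 0x3C, cpol=0, cpha=0)
print(hex(rx), mosi, sclk)
```

Because the clock toggles an even number of times, it always ends back at its idle level, matching the forced return to idle in the hardware.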

The main FSM is as follows: spi_fsm_svg

Clock Divider

Clock Divider Source Code

module Clock_Divider (
  input  logic       clk_in,
  input  logic       reset,
  input  logic [7:0] clk_divisor, // number of clk cycles per HALF SCLK period
  output logic       tick         // 1-cycle pulse every divisor cycles
);

  logic [7:0] counter;
  logic [7:0] divisor_reg;

  always_ff @(posedge clk_in) begin
      if (reset) begin
          counter     <= 0;
          divisor_reg <= 1;
          tick        <= 0;
      end else begin
          // Latch divisor safely at boundary
          if (counter == 0)
              divisor_reg <= (clk_divisor == 0) ? 1 : clk_divisor;

          if (counter == divisor_reg - 1) begin
              counter <= 0;
              tick    <= 1;
          end else begin
              counter <= counter + 1;
              tick    <= 0;
          end
      end
  end

endmodule

The clock divider is very simple. All it does is maintain an internal counter which increases on the positive edge of the global clock. Once the internal counter reaches the desired threshold it produces a tick which is what the main engine uses.

On my first attempt, I restarted the divider for every transfer, which caused timing problems: the time from finishing one byte to starting the next was less than a full clock cycle. During initialisation, where the clock was set to 300 kHz, this violated consistent timing and created problems. I had to redesign the divider to run continuously, so that when a new transfer starts it waits for the next divider tick before beginning. This avoids glitches, ensures consistent SPI timing and allows the divider to be reconfigured safely by latching new values only at counter boundaries.
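Here is a cycle-by-cycle Python sketch of the free-running divider, plus the resulting SPI clock frequency. It is a hand-written model that follows the same update order as the module above; the class name and the sclk_hz helper are made up for illustration. Since the divisor counts system clocks per half SCLK period, the SPI clock is the system clock divided by twice the divisor.

```python
class ClockDividerModel:
    """Software model of the divider: call step() once per system clock."""

    def __init__(self):
        self.counter = 0
        self.divisor_reg = 1
        self.tick = 0

    def step(self, clk_divisor):
        # Read the old register values first, like non-blocking assignments.
        counter, divisor_reg = self.counter, self.divisor_reg
        if counter == 0:
            # New divisors are only latched at a counter boundary.
            self.divisor_reg = 1 if clk_divisor == 0 else clk_divisor
        if counter == divisor_reg - 1:
            self.counter = 0
            self.tick = 1
        else:
            self.counter = counter + 1
            self.tick = 0
        return self.tick


def sclk_hz(clk_hz, divisor):
    """SPI clock frequency: one tick per half period."""
    return clk_hz / (2 * max(divisor, 1))


divider = ClockDividerModel()
print([divider.step(4) for _ in range(13)])
print(sclk_hz(40_000_000, 1), sclk_hz(40_000_000, 67))
```

With a 40 MHz system clock, a divisor of 1 gives the 20 MHz maximum and a divisor of 67 lands just under 300 kHz for initialisation.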

Buffer RAM

Buffer RAM Source Code

module Buffer_RAM #(
  parameter ADDR_BIT = 10,
  parameter DEPTH    = 1024,
  parameter BUFFER_START_ADDR = 'h200
)(
  input  logic        clk,
  input  logic        reset,
  
  // CPU interface
  input  logic [31:0] cpu_addr,
  input  logic [31:0] cpu_write_data,
  input  logic        cpu_write_en,   
  output logic [31:0] cpu_read_data,
  input  logic [1:0]  cpu_read_mode,  // 00: Byte, 01: Half, 10: Word
  input  logic        cpu_read_en,
  
  // SPI/Transfer_Controller interface
  input  logic [ADDR_BIT-1:0]  tx_addr,
  input  logic [7:0]  tx_write_data,
  input  logic        tx_write_en,
  output logic [7:0]  tx_read_data
);

  // Internal memory
  logic [7:0] buffer [0:DEPTH-1];

  // Local index calculation
  logic [ADDR_BIT-1:0] ram_index;
  assign ram_index = cpu_addr[ADDR_BIT-1:0] - BUFFER_START_ADDR[ADDR_BIT-1:0];

  // CPU Expects zero cycle read
  always_comb begin
    if (cpu_read_en) begin
      cpu_read_data = {buffer[{ram_index[ADDR_BIT-1:2], 2'b11}], 
                              buffer[{ram_index[ADDR_BIT-1:2], 2'b10}], 
                              buffer[{ram_index[ADDR_BIT-1:2], 2'b01}], 
                              buffer[{ram_index[ADDR_BIT-1:2], 2'b00}]};
      end else begin cpu_read_data = 32'h0; end
  end

  // Synchronous Logic
  always_ff @(posedge clk) begin
      if (reset) begin
          tx_read_data  <= 8'h0;
      end else begin
          // WRITE PORT 
          if (tx_write_en) begin
              buffer[tx_addr] <= tx_write_data;
          end else if (cpu_write_en) begin
              buffer[{ram_index[ADDR_BIT-1:2], 2'b00}] <= cpu_write_data[7:0];
              buffer[{ram_index[ADDR_BIT-1:2], 2'b01}] <= cpu_write_data[15:8];
              buffer[{ram_index[ADDR_BIT-1:2], 2'b10}] <= cpu_write_data[23:16];
              buffer[{ram_index[ADDR_BIT-1:2], 2'b11}] <= cpu_write_data[31:24];
          end

          // READ PORT
          tx_read_data <= buffer[tx_addr];

      end
  end

endmodule

Initially, I intended to infer Block RAM (BRAM) for the transmit and receive buffers. However, by this point in the project, all of the FPGA's available BRAM resources had already been allocated elsewhere, making that approach impossible. Instead, I implemented the buffers using distributed RAM, allowing the synthesis tool to map the memory onto lookup tables (LUTs).

This decision came with a number of trade-offs. The most significant disadvantage is resource usage: implementing memory in LUTs is considerably more expensive than using dedicated BRAM and would not scale well to larger buffer sizes. On the other hand, for a relatively small 1 KB buffer, the resource cost was acceptable within the constraints of my design.

An unexpected benefit of this approach was that the CPU could perform combinational (zero-cycle) reads from the buffer, eliminating the latency that would normally be associated with synchronous BRAM reads. Although this behaviour was not a design goal, it simplified the processor interface and slightly improved software performance when accessing the transmit and receive buffers.
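The buffer's two views (32-bit words for the CPU, single bytes for the transfer controller) can be sketched in Python as below. This is a simplified model under two assumptions: ADDR_BIT is 10 as in the module's defaults, and the Rx half starts at byte index 512, which is how the transfer controller addresses it when its counter is 9 bits wide. The class and method names are hypothetical.

```python
ADDR_BIT = 10
DEPTH = 1 << ADDR_BIT          # 1024 bytes: 512 for Tx, 512 for Rx
BUFFER_START_ADDR = 0x200


class BufferRamModel:
    def __init__(self):
        self.mem = bytearray(DEPTH)

    def _word_base(self, cpu_addr):
        # Same idea as ram_index: drop the region offset, then word-align.
        index = (cpu_addr - BUFFER_START_ADDR) & (DEPTH - 1)
        return index & ~0x3

    def cpu_write_word(self, cpu_addr, value):
        base = self._word_base(cpu_addr)
        # The least significant byte lands at the lowest index.
        self.mem[base:base + 4] = (value & 0xFFFFFFFF).to_bytes(4, "little")

    def cpu_read_word(self, cpu_addr):
        base = self._word_base(cpu_addr)
        return int.from_bytes(self.mem[base:base + 4], "little")

    def spi_read_byte(self, index):
        return self.mem[index]

    def spi_write_byte(self, index, value):
        self.mem[index] = value & 0xFF


ram = BufferRamModel()
ram.cpu_write_word(0x200, 0x44332211)
print([hex(ram.spi_read_byte(i)) for i in range(4)])
```

The practical consequence for software is that the lowest byte of a stored word is the first one sent over SPI.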

Transfer Controller

Transfer Controller Source Code

module Transfer_Controller #(
  parameter ADDR_BIT = 12
  )(
  input  logic        clk,
  input  logic        reset,
  
  // Control signals
  input  logic        start,
  input  logic        stop,
  input  logic        block_mode,
  input  logic [ADDR_BIT-2:0] block_len,

  // CPU byte interface (single-byte mode)
  input  logic [7:0]  cpu_tx_byte,
  output logic [7:0]  cpu_rx_byte,
  // SPI engine interface
  output logic [7:0]  tx_byte,
  input  logic [7:0]  rx_byte,
  output logic        start_byte,
  input  logic        byte_done,
  
  // Buffer interface
  input  logic [7:0]  buffer_read_data,
  output logic [ADDR_BIT-1:0]  buffer_addr,
  output logic [7:0]  buffer_write_data,
  output logic        buffer_write_en,
  
  // Status
  output logic         busy,
  output logic         transfer_done,
  output logic         block_done,
  output logic         error
);
enum {
  IDLE,
  LOAD_BYTE,
  FETCH_BYTE,
  START_BYTE,
  WAIT_BYTE,
  BYTE_DONE,
  TRANSFER_DONE,
  ERROR
} state;

logic internal_block_mode;
logic [ADDR_BIT-2:0] block_counter = 0;
logic [ADDR_BIT-2:0] block_terminator = 0;
logic [7:0]  cpu_tx_passthrough;
logic last_cycle;
// NOTE: This does work but is very hacky: a length of 512 wraps to 0, which means a block length of 0 transfers 512 bytes.
assign last_cycle = (block_counter == block_terminator - 1'b1);

always_ff @(posedge clk or posedge reset) begin
  if (reset || stop) begin
  	// reset all states 
  	state               <= IDLE;
  	internal_block_mode <= 0;
  	transfer_done       <= 0;
  	block_done          <= 0;
  	block_terminator    <= 0;
  	error               <= 0;
  	buffer_write_en     <= 0;
  	buffer_addr         <= 0;
  	busy                <= 0;
  	start_byte          <= 0;
  	cpu_rx_byte         <= 8'h00;
  end else
  	case(state)
  		IDLE:
  			begin
  				transfer_done <= 0;
  				block_done <= 0;
  				error <= 0;
  				buffer_write_en <= 0;
  				buffer_addr <= 0;
  				busy <= 0;
  				start_byte <= 0;

  				if(start) begin 
  					state <= LOAD_BYTE; 
  					busy  <= 1;
  					block_counter <= 0;
  					if(block_mode) begin 
  							block_terminator    <= block_len; 
  							internal_block_mode <= 1; 
  					end else begin 
  							internal_block_mode <= 0; 
  							block_terminator    <= 1; 
  							cpu_tx_passthrough  <= cpu_tx_byte; 
  					end
  			end
  		end
  		LOAD_BYTE:
  			begin
  				buffer_addr <= block_counter;
  				state <= FETCH_BYTE;
  			end

    FETCH_BYTE:
      // Wait one clock cycle for byte to fetch
      begin
        state <= START_BYTE;
      end

  		START_BYTE:
  			begin
  				tx_byte <= internal_block_mode ? buffer_read_data : cpu_tx_passthrough;
  				start_byte <= 1'b1; // send start signal to engine
  				state <= WAIT_BYTE;
  			end


  		WAIT_BYTE:
  			begin
  				start_byte <= 1'b0;
  				if (byte_done) begin
  					state             <= BYTE_DONE;
  					if (internal_block_mode) begin
  						buffer_addr       <= {1'b1, block_counter[ADDR_BIT-2:0]};
  						buffer_write_data <= rx_byte;
  						buffer_write_en   <= 1'b1;
  					end
  					else cpu_rx_byte <= rx_byte;
  				end
  			end

  		BYTE_DONE:
  			begin
  				buffer_write_en <= 1'b0;
  				if (last_cycle) 
  					state <= TRANSFER_DONE;
  				else begin
  					block_counter <= block_counter + 1;
  					state         <= LOAD_BYTE;
  				end
  			end

  		TRANSFER_DONE:
  			begin
  				busy <= 1'b0;
  				if(internal_block_mode) block_done <= 1'b1;
  				else transfer_done <= 1'b1;
  				state <= IDLE;
  			end


  		ERROR:
  			begin
  				state <= IDLE;
  				error <= 1'b1;
  			end
  		default: state <= ERROR;
  	endcase
end

endmodule

This module sits one level above the SPI engine and is responsible for turning higher-level transfer requests into a sequence of byte transfers. In other words, it acts as the “glue” between the CPU-visible interface and the low-level SPI hardware.

It supports two modes:

Single-byte mode, where the CPU directly provides a byte to transmit
Block mode, where the controller streams data from the buffer RAM and writes responses back into it

So instead of the CPU manually driving SPI byte-by-byte, it can trigger a transfer and let the controller handle the rest.

The overall flow is implemented as a simple finite state machine:

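IDLE: clears the status outputs and waits for start, then latches either the block length (block mode) or the CPU's byte (single-byte mode)
LOAD_BYTE: presents the current byte index to the buffer RAM
FETCH_BYTE: waits one clock cycle for the synchronous buffer read
START_BYTE: hands the byte to the SPI engine and raises start_byte
WAIT_BYTE: waits for byte_done, then stores the received byte in the Rx buffer (block mode) or in cpu_rx_byte (single-byte mode)
BYTE_DONE: either moves on to the next byte or finishes if this was the last one
TRANSFER_DONE: clears busy and raises block_done or transfer_done
ERROR: raises error and returns to IDLE (only reached from an invalid state)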
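The state sequence can be walked through with a short Python sketch. It is a simplified software model, not the hardware: each state takes one loop iteration, the SPI exchange is replaced by a stand-in function, and the counter is assumed to be 9 bits wide (ADDR_BIT = 10). It also reproduces the quirk noted in the source, where a block length of 0 transfers 512 bytes.

```python
COUNTER_BITS = 9   # assumption: ADDR_BIT = 10, so counters hold 0..511


def run_transfer(tx_buffer, block_len, block_mode=True, cpu_tx_byte=0xFF,
                 exchange=lambda byte: byte ^ 0xFF):
    """Return (received_bytes, state_trace) for one transfer.

    `exchange` stands in for the SPI engine; by default the fake slave
    simply answers with the inverted byte.
    """
    mask = (1 << COUNTER_BITS) - 1
    terminator = (block_len & mask) if block_mode else 1
    counter, received, trace = 0, [], ["IDLE"]
    tx = 0
    state = "LOAD_BYTE"               # what IDLE does when start is raised
    while state != "IDLE":
        trace.append(state)
        if state == "LOAD_BYTE":
            state = "FETCH_BYTE"
        elif state == "FETCH_BYTE":
            state = "START_BYTE"
        elif state == "START_BYTE":
            tx = tx_buffer[counter] if block_mode else cpu_tx_byte
            state = "WAIT_BYTE"
        elif state == "WAIT_BYTE":
            received.append(exchange(tx))
            state = "BYTE_DONE"
        elif state == "BYTE_DONE":
            if counter == ((terminator - 1) & mask):   # last_cycle
                state = "TRANSFER_DONE"
            else:
                counter = (counter + 1) & mask
                state = "LOAD_BYTE"
        elif state == "TRANSFER_DONE":
            state = "IDLE"
    return received, trace


tx_buffer = bytes(range(256)) * 2     # 512 bytes of test data
print(run_transfer(tx_buffer, 3))
print(len(run_transfer(tx_buffer, 0)[0]))
```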

The bugs

For this section I am going to put the GCSE English writing skills that earned me a 5 to use and write about my personal hardware debugging experience.

A lot of things went wrong during this phase of the project. What was meant to be a one-week project turned into two and a half weeks. Going in to uni every other day during Easter was quite nice. The lab was quiet, with only a few people in working on coursework. I had a good laugh and maybe one or two more drinks than I should have at the pub after the building closed.

One fateful Wednesday a terrible thing happened.

I was wiring up the Adafruit breakout board and accidentally swapped VCC and GND. At first, nothing seemed out of the ordinary. Nothing caught on fire. But then I inserted the SD card, and that's when it happened: a cacophonous buzzing noise erupted from the board.

Alas, the SD card was fried. All data, lost. And that is the story of how a single stupid mistake cost me £18.

Panicked, I rushed down the street to the nearest Ryman, bought a new card, and vowed that this would never happen again. Back at the lab, in a moment of pure gambling, I slotted the fresh SD card into the breakout board and clicked 'run' on the debugging software.

Nothing. It didn't work.

It was at this point I concluded that I hadn't just murdered the card; I had killed the board too. Feeling defeated, I went home, fished out a trusty Arduino, and loaded up some driver code written by someone who actually knew what they were doing. I wired up the potentially dead board, inserted the new SD card, and tried again. Still, absolute silence. With a wounded bank account, my anger and frustration drove me to order three more boards just to have some desperate redundancy.

As it turns out, I am an idiot.

I would later find out that the original board was perfectly fine. I had foolishly assumed that because I was using the 3.3V power pin on the Arduino, the SPI logic pins would magically operate at 3.3V too. I was incredibly wrong. Once I actually wired it up correctly with voltage dividers, the Arduino responded perfectly with the correct card details.

Wiring of the Adafruit breakout board and Arduino with voltage dividers

Once the new boards arrived a few days later, I went to uni, soldered up the pins and rang the bell for round two.

Confident that neither the board nor the SD card was the problem, that left only one place: my Verilog code. The thing is, running the testbench yielded a perfect result. Under simulation everything worked just as it should, but reality had a different song to sing (of course, looking back, this just meant that my testbench wasn't good enough). I spent countless hours making a tiny change, waiting ages for the compiler to run, flashing the board, and then praying for a single response of 0x1 to confirm the SD card was in SPI mode.

It never happened. No matter what I tried I failed.

On the brink of insanity, I gave up and went to my course leader. He pulled out an ancient relic of the past, an oscilloscope coated in dust, and then proceeded to sit with me whilst probing the clock pins and data lines. We arrived at a consensus as to what the problem might have been. At this point it was quite late and I headed off to the pub with a couple of mates. Then it was the weekend and all I could do was wait until Monday.

Monday came, the Easter break was over and the lab was filled once more. This was the make or break day. If I couldn't get it to work today, I would admit defeat. After a few lectures I rushed to the lab and set everything up (being very careful to plug in VCC and GND the right way round), compiled and flashed the new design, and clicked run.

It worked. The SD card responded with 0x1, and now that it worked I could finally begin the mountain of work that was creating the SD and FAT drivers.
