NodeLoop

The Complete Guide to DMA (Direct Memory Access) for Microcontrollers

A practical, in-depth guide to mastering DMA on microcontrollers. From theory to hands-on examples on STM32, learn how to free your CPU and optimize your embedded applications.

In the world of embedded systems, every CPU cycle is precious. Yet, many applications spend a considerable amount of time on simple, repetitive tasks, like waiting for an Analog-to-Digital Converter (ADC) to finish its measurement. This is a monumental waste. Fortunately, there is an elegant solution: DMA, or Direct Memory Access.

Imagine your CPU is a creative and very busy head chef. DMA is their personal prep cook: a highly efficient assistant who handles the thankless tasks, like fetching ingredients (data) from the fridge (peripheral) and bringing them to the countertop (RAM). Meanwhile, the chef is free to focus on preparing the main dish (your program's logic). This guide will teach you how to master your prep cook.

💡 Why DMA Matters

In data-heavy tasks such as audio streaming or continuous ADC sampling, DMA can cut CPU load substantially, by up to 80% depending on the workload.

Because the CPU is not needed while the data moves, the microcontroller can spend more time in sleep mode, which saves power.

The Core Concepts of DMA

Before diving into code, it's essential to understand the building blocks of a DMA controller. Every microcontroller has its specifics, but these concepts are nearly universal.

Channels and Streams

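A DMA controller has several "Channels" (or "Streams"). Each one is an independent "prep cook," capable of managing one data transfer task. For example, channel 1 can handle the ADC while channel 2 sends data via the UART. Terminology varies by vendor: on the STM32F4 used later in this guide, the independent transfer units are called streams, and each stream's channel setting selects which peripheral request drives it.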

[Figure: CPU, DMA, ADC, and UART. Each DMA channel is an independent data highway.]

Transfer Directions

A DMA transfer always has a source and a destination. The three possible directions are:

  • Peripheral-to-Memory: Reading the ADC and storing the result in RAM.
  • Memory-to-Peripheral: Sending data from RAM to UART or SPI.
  • Memory-to-Memory: Copying data between two memory areas.

Addressing Modes

The controller needs to know whether to increment (advance) the source and destination addresses after each data item is transferred.

  • Increment Mode: The address advances by one item after each transfer. Used to fill an array.
  • Fixed Mode: The address stays the same for every transfer. Typical for a peripheral data register.

[Figure: reading an ADC. The source (peripheral register) is fixed; the destination (RAM[0], RAM[1], RAM[2]) increments.]

Transfer Modes

  • Normal Mode: DMA performs a fixed number of transfers and stops.
  • Circular Mode: DMA restarts automatically for continuous streaming.
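
To make the addressing and transfer modes concrete, here is a minimal host-side sketch that models one channel in software. It is not driver code: the names `dma_model_t`, `dma_model_start`, and `dma_model_step` are hypothetical, the destination is assumed to always increment, and each call to `dma_model_step` stands in for one hardware request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of one DMA channel (not a real driver API). */
typedef struct {
    const volatile uint16_t *src; /* source: peripheral register or array */
    uint16_t *dst;                /* destination buffer in RAM (increments) */
    size_t count;                 /* items per pass over the buffer */
    size_t remaining;             /* items left in the current pass */
    bool src_increment;           /* false = fixed source address */
    bool circular;                /* true = restart after the last item */
    bool enabled;                 /* cleared when a normal-mode transfer ends */
} dma_model_t;

/* Arm the channel, as writing the address, count, and mode registers would. */
void dma_model_start(dma_model_t *ch, const volatile uint16_t *src,
                     uint16_t *dst, size_t count,
                     bool src_increment, bool circular)
{
    assert(ch != NULL && src != NULL && dst != NULL && count > 0);
    ch->src = src;
    ch->dst = dst;
    ch->count = count;
    ch->remaining = count;
    ch->src_increment = src_increment;
    ch->circular = circular;
    ch->enabled = true;
}

/* Move one item. Returns true when this item completed a full pass. */
bool dma_model_step(dma_model_t *ch)
{
    if (!ch->enabled) {
        return false;
    }
    size_t index = ch->count - ch->remaining;
    /* Fixed source: always read the same location. Incrementing: read item[index]. */
    ch->dst[index] = ch->src_increment ? ch->src[index] : ch->src[0];
    ch->remaining--;
    if (ch->remaining > 0) {
        return false;
    }
    if (ch->circular) {
        ch->remaining = ch->count; /* circular mode: wrap to the start */
    } else {
        ch->enabled = false;       /* normal mode: stop after the last item */
    }
    return true;
}
```

With a fixed source and an incrementing destination, successive values of one "register" land in consecutive array slots, which is the ADC case described above. In circular mode the same model keeps overwriting the buffer from the start instead of stopping.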

Summary Table

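| Concept   | Description                       | Example             |
|-----------|-----------------------------------|---------------------|
| Channel   | Independent DMA transfer unit     | ADC → RAM           |
| Stream    | Hardware data path within the DMA | Stream 0 for ADC1   |
| Direction | Data flow type                    | Peripheral → Memory |
| Mode      | Operation type                    | Normal / Circular   |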

Practical Example 1: Reading an ADC Continuously on STM32

Enough theory. Let's take an STM32 (e.g., a NUCLEO-F446RE) and configure it to read 100 values from a potentiometer via the ADC, without ever blocking the CPU.

Configuration with STM32CubeMX

ST's graphical tool is perfect for initializing the DMA. We enable the ADC in "Continuous Conversion Mode" and in the "DMA Settings" tab, we add a channel in "Normal" mode (not circular for now).

Configuration Summary

  • ADC1: Enabled in Continuous Conversion Mode.
  • DMA (for ADC1):
    • Stream: DMA2 Stream 0
    • Direction: Peripheral to Memory
    • Mode: Normal (for this first example)
    • Priority: Low
    • Data Width (Peripheral & Memory): Half Word (16-bit)
    • Memory Address: Increment Enabled
    • Peripheral Address: Increment Disabled

The Code Explained

In our `main.c`, we first declare an array to receive the data:


// Buffer to store the 100 ADC readings
// The 'volatile' keyword is crucial to prevent the compiler from optimizing away
// access to this array, as it is modified by the DMA in the background.
volatile uint16_t adc_values[100];
      

Then, in the `main` function, after initializations, we start the DMA transfer:


// Start the ADC conversion in DMA mode
// The function returns immediately, the CPU is free!
HAL_ADC_Start_DMA(&hadc1, (uint32_t*)adc_values, 100);

// The CPU can do other things here while the DMA works.
while (1)
{
  // ... do calculations, blink an LED, etc ...
}
      

But how do we know when the transfer is complete? The DMA notifies us via an interrupt, and the HAL library provides a callback for us to implement:


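// Latest average, published for the main loop (illustrative name).
volatile uint16_t adc_average;

// Called by the HAL, in interrupt context, once the DMA has transferred all 100 values.
void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef* hadc)
{
  // Ignore completions from any other ADC that shares this callback.
  if (hadc->Instance != ADC1)
  {
    return;
  }

  // The 'adc_values' buffer is now full. We can process the data.
  // For example, calculate the average. Keep this short: we are inside an interrupt.
  uint32_t sum = 0;
  for (int i = 0; i < 100; i++)
  {
    sum += adc_values[i];
  }
  adc_average = (uint16_t)(sum / 100);

  // We could restart a new transfer if we wanted to
  // HAL_ADC_Start_DMA(&hadc1, (uint32_t*)adc_values, 100);
}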
      

Practical Example 2: Audio Streaming with Circular Mode

For continuous data like audio, we can't afford to lose a single sample. If we only process data at the end of a large buffer, the processing time might cause us to miss the start of the next buffer. The solution is Circular Mode with half-transfer and full-transfer interrupts.

The principle: we split our buffer in two. The DMA fills the first half. When it's done, it notifies us (Half-Complete interrupt) and starts filling the second half. Meanwhile, our CPU has plenty of time to process the first half. And so on.

[Figure: Buffer A and Buffer B, with the DMA writing one while the CPU processes the other. The principle of a double-buffer or ping-pong buffer with DMA circular mode.]

In CubeMX, we simply select "Circular" mode. The code changes in the callbacks:


// 200-sample buffer for our "ping-pong" setup
volatile uint16_t audio_buffer[200];

// ... in main() ...
// Start the DMA in circular mode
HAL_ADC_Start_DMA(&hadc1, (uint32_t*)audio_buffer, 200);


// Called when the first half (first 100 samples) is ready
void HAL_ADC_ConvHalfCpltCallback(ADC_HandleTypeDef* hadc)
{
  // Process the first half of the buffer: &audio_buffer[0]
}

// Called when the second half (last 100 samples) is ready
void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef* hadc)
{
  // Process the second half of the buffer: &audio_buffer[100]
}
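
The callbacks above run in interrupt context, so one common approach is to keep them tiny and do the real work in the main loop. The following host-side sketch illustrates that pattern under stated assumptions: `on_half_complete` and `on_full_complete` stand in for the bodies of the two HAL callbacks, and `first_half_ready`, `second_half_ready`, `average_half`, and `process_ready_half` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HALF_LEN 100u

/* Stand-in for the buffer that the DMA fills in circular mode. */
volatile uint16_t audio_buffer[2 * HALF_LEN];

/* One flag per half: set in interrupt context, cleared by the main loop. */
volatile bool first_half_ready = false;
volatile bool second_half_ready = false;

/* Assumed body of HAL_ADC_ConvHalfCpltCallback: just record the event. */
void on_half_complete(void) { first_half_ready = true; }

/* Assumed body of HAL_ADC_ConvCpltCallback: just record the event. */
void on_full_complete(void) { second_half_ready = true; }

/* Hypothetical processing step: average of one half of the buffer. */
uint16_t average_half(const volatile uint16_t *half)
{
    assert(half != NULL);
    uint32_t sum = 0;
    for (size_t i = 0; i < HALF_LEN; i++) {
        sum += half[i];
    }
    return (uint16_t)(sum / HALF_LEN);
}

/* Call from the main loop. Returns true and stores a result if a half was ready. */
bool process_ready_half(uint16_t *average_out)
{
    if (first_half_ready) {
        first_half_ready = false;
        *average_out = average_half(&audio_buffer[0]);
        return true;
    }
    if (second_half_ready) {
        second_half_ready = false;
        *average_out = average_half(&audio_buffer[HALF_LEN]);
        return true;
    }
    return false;
}
```

The scheme only works if each half is processed before the DMA wraps around and starts overwriting it, so the processing time per half must stay below the time it takes to fill one half.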
      

Performance Visualization

It's one thing to say the CPU is free, but it's another to see it. Here are two ways to visualize the impact of DMA.

CPU Load Comparison

The graph below is a conceptual representation of the CPU's workload when continuously sampling a peripheral. Without DMA, the CPU is stuck in a busy-wait loop. With DMA, its load drops dramatically, leaving it free for other tasks.

[Figure: conceptual CPU load, without DMA vs. with DMA. DMA frees up the processor for other tasks.]

Logic Analyzer: SPI Transfer

This is the ultimate proof. Let's imagine sending 3 bytes over SPI. Without DMA, the CPU must write each byte to the data register, wait for the transfer to complete, and then write the next one. This creates visible gaps on the bus.

[Figure: SPI trace without DMA. Notice the gaps between bytes as the CPU prepares the next transfer.]

With DMA, we configure the entire transfer upfront. The DMA controller feeds the SPI peripheral back-to-back, without any CPU intervention, resulting in maximum throughput.

[Figure: SPI trace with DMA. The transfer is seamless with no gaps.]

Expert Corner: Bare-Metal Configuration

While the HAL is convenient, understanding direct register access is key for optimization and non-ST ecosystems. Here is a conceptual look at configuring the same ADC-to-memory transfer on an STM32F4 by writing directly to registers.


// 1. Enable DMA2 Clock in RCC
RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN;

// 2. Configure DMA2 Stream 0, Channel 0 for ADC1
// Disable the stream first
DMA2_Stream0->CR &= ~DMA_SxCR_EN;
while(DMA2_Stream0->CR & DMA_SxCR_EN) {}; // Wait until disabled

// Set peripheral address (ADC1 data register)
DMA2_Stream0->PAR = (uint32_t)&(ADC1->DR);

// Set memory address
DMA2_Stream0->M0AR = (uint32_t)adc_values;

// Set number of items to transfer
DMA2_Stream0->NDTR = 100;

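// Clear any stale event flags for stream 0 before enabling it
DMA2->LIFCR = DMA_LIFCR_CTCIF0 | DMA_LIFCR_CHTIF0 | DMA_LIFCR_CTEIF0 |
              DMA_LIFCR_CDMEIF0 | DMA_LIFCR_CFEIF0;

// Configure control register:
// - Channel 0 selection
// - Memory increment mode
// - 16-bit data sizes
// - Peripheral-to-memory direction (DIR = 00, the reset value)
// - Normal mode (OR in DMA_SxCR_CIRC for circular mode)
// - Enable transfer complete interrupt
DMA2_Stream0->CR = (0U << DMA_SxCR_CHSEL_Pos) |
                   DMA_SxCR_MINC |
                   DMA_SxCR_PSIZE_0 |
                   DMA_SxCR_MSIZE_0 |
                   DMA_SxCR_TCIE;

// 3. Configure and enable the DMA interrupt in the NVIC (CMSIS calls, no HAL needed).
// A DMA2_Stream0_IRQHandler that clears TCIF0 through DMA2->LIFCR must be defined.
NVIC_SetPriority(DMA2_Stream0_IRQn, 0);
NVIC_EnableIRQ(DMA2_Stream0_IRQn);

// 4. Enable the stream
DMA2_Stream0->CR |= DMA_SxCR_EN;

// Note: the ADC must also be set up to issue DMA requests (DMA bit in ADC1->CR2).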
      

Advanced Topic: Scatter-Gather DMA

On more advanced controllers, DMA can perform even more complex tasks. Scatter-Gather DMA allows the controller to read from or write to multiple, non-contiguous memory blocks in a single operation. It does this by following a linked list of "descriptors" in memory, where each descriptor tells the DMA engine the source, destination, and size of the next block.

[Figure: RAM holding a Header, a Payload, and a Footer. Descriptor 1 (SRC: &data_header, LEN: 12 bytes, NEXT: &desc_2) links to Descriptor 2 (SRC: &data_payload, LEN: 64 bytes, NEXT: &desc_3). Scatter-Gather DMA follows a linked list of descriptors in RAM to process non-contiguous data blocks.]

This is extremely powerful for applications like networking, where you might build a packet from a header, a payload, and a footer that are all in different memory locations, and send them in one seamless SPI or Ethernet transfer.
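
As a rough illustration of the descriptor idea, the sketch below walks a linked list in software and gathers the blocks into one contiguous output, the way a gather transfer would feed a single peripheral. The descriptor layout `sg_descriptor_t` and the function `sg_gather` are hypothetical; real controllers define their own descriptor format and do this walk in hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor layout; real controllers define their own format. */
typedef struct sg_descriptor {
    const uint8_t *src;               /* start of this block */
    size_t len;                       /* block length in bytes */
    const struct sg_descriptor *next; /* next descriptor, NULL ends the chain */
} sg_descriptor_t;

/* Software stand-in for the DMA engine: follow the chain and copy each block
 * to a contiguous output. Returns the number of bytes written, or 0 if the
 * output buffer is too small for the whole chain. */
size_t sg_gather(const sg_descriptor_t *desc, uint8_t *out, size_t out_size)
{
    assert(out != NULL);
    size_t written = 0;
    for (; desc != NULL; desc = desc->next) {
        if (desc->len > out_size - written) {
            return 0; /* block does not fit in the remaining space */
        }
        memcpy(out + written, desc->src, desc->len);
        written += desc->len;
    }
    return written;
}
```

Building a packet then comes down to filling in three descriptors (header, payload, footer) and handing the first one to the engine; none of the blocks has to be copied into a staging buffer by the CPU first.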

Common Pitfalls & Best Practices

  • Forgetting the DMA Clock: The DMA controller is a peripheral like any other. If you don't enable its clock in the RCC (Reset and Clock Control), it won't work. CubeMX does this for us, but in bare-metal programming, it's a classic mistake.
  • The volatile Keyword: This is probably the most common error. The compiler cannot see the DMA writing to your buffer. If the buffer is not declared `volatile`, the optimizer may assume it never changes and keep earlier reads in registers or hoist them out of a loop, so your program keeps working with stale values.
  • Cache Coherency: On more powerful microcontrollers (Cortex-M7), the CPU has a data cache. The DMA writes to RAM without going through that cache, so the CPU may still read an old copy of the data. Invalidate the cache lines covering the buffer before reading DMA-written data (and clean them before the DMA reads data the CPU has written), or place DMA buffers in a non-cacheable region.
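
Cache maintenance works on whole cache lines, which are 32 bytes on the Cortex-M7, so the range passed to an invalidate or clean routine is usually widened to line boundaries. The helper below is a minimal sketch of that arithmetic; `cache_range_t` and `cache_align_range` are hypothetical names, and on a real target the result would typically be handed to a CMSIS routine such as `SCB_InvalidateDCache_by_Addr`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u /* Cortex-M7 data cache line size in bytes */

/* A byte range expressed as start address and length. */
typedef struct {
    uintptr_t addr;
    size_t size;
} cache_range_t;

/* Widen [addr, addr + size) so it starts and ends on cache-line boundaries. */
cache_range_t cache_align_range(uintptr_t addr, size_t size)
{
    assert(size > 0);
    const uintptr_t mask = ~(uintptr_t)(DCACHE_LINE_SIZE - 1u);
    uintptr_t start = addr & mask;                                 /* round down */
    uintptr_t end = (addr + size + DCACHE_LINE_SIZE - 1u) & mask;  /* round up */
    cache_range_t range = { start, (size_t)(end - start) };
    return range;
}
```

Widening the range also touches whatever else shares those lines, and invalidating a line discards any unwritten CPU data in it. For that reason it is generally safer to align DMA buffers to 32 bytes and pad them to a multiple of 32 bytes, so that the widened range covers the buffer and nothing else.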

Conclusion

DMA transforms how you design an embedded application. By delegating data transfers, it frees up the CPU to focus on higher-value tasks. While its configuration may seem intimidating, a structured approach and an understanding of its key concepts make it an extraordinarily powerful tool. Don't be afraid of it anymore: make it your best ally.

FAQ - DMA

Does DMA use more power?
It's counter-intuitive, but using DMA can greatly reduce power consumption. Because the CPU does not have to stay awake to move data, it can sleep during transfers and wake only when a buffer is ready.

Can DMA be used with all peripherals?
No, only peripherals that can issue DMA requests (ADC, SPI, I2C, UART, Timers...) can use it. Consult the microcontroller's reference manual for the DMA request mapping, which shows which peripheral is wired to which stream and channel.
