The Complete Guide to DMA (Direct Memory Access) for Microcontrollers
A practical, in-depth guide to mastering DMA on microcontrollers. From theory to hands-on examples on STM32, learn how to free your CPU and optimize your embedded applications.
In the world of embedded systems, every CPU cycle is precious. Yet, many applications spend a considerable amount of time on simple, repetitive tasks, like waiting for an Analog-to-Digital Converter (ADC) to finish its measurement. This is a monumental waste. Fortunately, there is an elegant solution: DMA, or Direct Memory Access.
Imagine your CPU is a creative and very busy head chef. DMA is their personal prep cook: a highly efficient assistant who handles the thankless tasks, like fetching ingredients (data) from the fridge (peripheral) and bringing them to the countertop (RAM). Meanwhile, the chef is free to focus on preparing the main dish (your program's logic). This guide will teach you how to master your prep cook.
💡 Why DMA Matters
Using DMA can reduce CPU load by up to 80% in data-heavy tasks, such as audio streaming or continuous ADC sampling.
Because the CPU no longer has to supervise each transfer, the microcontroller can also spend more time in sleep mode, which saves power.
The Core Concepts of DMA
Before diving into code, it's essential to understand the building blocks of a DMA controller. Every microcontroller has its specifics, but these concepts are nearly universal.
Channels and Streams
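A DMA controller has several "Channels" (called "Streams" on some families). Each one is an independent "prep cook," capable of managing one data transfer task. For example, one can handle the ADC while another sends data via the UART. The naming varies by vendor: on the STM32F4 used later in this guide, the independent transfer unit is called a stream, and the channel number selects which peripheral request drives that stream.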
Transfer Directions
A DMA transfer always has a source and a destination. The three possible directions are:
- Peripheral-to-Memory: Reading the ADC and storing the result in RAM.
- Memory-to-Peripheral: Sending data from RAM to UART or SPI.
- Memory-to-Memory: Copying data between two memory areas.
Addressing Modes
The controller needs to know whether to increment (advance) the source and destination addresses after each data item is transferred.
- Increment Mode: Address increments — used to fill an array.
- Fixed Mode: Address fixed — typical for a peripheral register.
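To make the two addressing modes concrete, here is a minimal sketch that models them in plain C on a host machine. It is not real driver code: `dma_model_t`, `dma_model_transfer`, `read_register_into_array` and `copy_memory` are hypothetical names invented for this illustration, and a real controller does this work in hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of one DMA channel's address logic. */
typedef struct {
    const volatile uint16_t *src; /* current source address */
    uint16_t *dst;                /* current destination address */
    int src_increment;            /* 1 = increment mode, 0 = fixed mode */
    int dst_increment;            /* 1 = increment mode, 0 = fixed mode */
} dma_model_t;

/* Move `count` items, advancing an address only if its increment flag is set. */
void dma_model_transfer(dma_model_t *ch, size_t count)
{
    assert(ch != NULL && ch->src != NULL && ch->dst != NULL);
    for (size_t i = 0; i < count; i++) {
        *ch->dst = *ch->src;
        if (ch->src_increment) { ch->src++; }
        if (ch->dst_increment) { ch->dst++; }
    }
}

/* Peripheral-to-memory: fixed source (a "data register"), incrementing destination. */
void read_register_into_array(const volatile uint16_t *data_register,
                              uint16_t *array, size_t count)
{
    dma_model_t ch = { data_register, array, 0, 1 };
    dma_model_transfer(&ch, count);
}

/* Memory-to-memory: both addresses increment. */
void copy_memory(const uint16_t *from, uint16_t *to, size_t count)
{
    dma_model_t ch = { from, to, 1, 1 };
    dma_model_transfer(&ch, count);
}
```

In this model the "register" never changes between reads, so every array slot receives the same value. On real hardware the peripheral would place a fresh sample in the register before each DMA request.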
Transfer Modes
- Normal Mode: DMA performs a fixed number of transfers and stops.
- Circular Mode: DMA restarts automatically for continuous streaming.
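The difference between the two modes comes down to what happens when the transfer counter reaches zero. The sketch below models that behavior on a host machine. `stream_model_t`, `stream_start` and `stream_push` are hypothetical names for this illustration only; on an STM32 the counter lives in a hardware register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a DMA transfer counter. */
typedef struct {
    uint16_t *buffer;
    size_t length;             /* items per pass (the counter's reload value) */
    size_t remaining;          /* items left in the current pass */
    int circular;              /* 1 = circular mode, 0 = normal mode */
    int enabled;               /* 0 once a normal-mode transfer has finished */
    unsigned completed_passes; /* how many times the buffer was filled */
} stream_model_t;

void stream_start(stream_model_t *s, uint16_t *buffer, size_t length, int circular)
{
    assert(s != NULL && buffer != NULL && length > 0);
    s->buffer = buffer;
    s->length = length;
    s->remaining = length;
    s->circular = circular;
    s->enabled = 1;
    s->completed_passes = 0;
}

/* Store one sample. Returns 1 if it was stored, 0 if the stream has stopped. */
int stream_push(stream_model_t *s, uint16_t sample)
{
    if (!s->enabled) { return 0; }
    s->buffer[s->length - s->remaining] = sample;
    s->remaining--;
    if (s->remaining == 0) {
        s->completed_passes++;
        if (s->circular) {
            s->remaining = s->length; /* reload the counter and wrap to the start */
        } else {
            s->enabled = 0;           /* normal mode: stop after one pass */
        }
    }
    return 1;
}
```

Pushing five samples into a three-item buffer stores only the first three in normal mode. In circular mode all five are stored, and the last two overwrite the start of the buffer.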
Summary Table
| Concept | Description | Example |
|---|---|---|
| Channel | Independent transfer unit (on STM32F4: the request selector for a stream) | Channel 0 selects the ADC1 request |
| Stream | STM32F4 name for the hardware transfer unit | DMA2 Stream 0 for ADC1 |
| Direction | Data flow type | Peripheral → Memory |
| Mode | Operation type | Normal / Circular |
Practical Example 1: Reading an ADC Continuously on STM32
Enough theory. Let's take an STM32 (e.g., a NUCLEO-F446RE) and configure it to read 100 values from a potentiometer via the ADC, without ever blocking the CPU.
Configuration with STM32CubeMX
ST's graphical tool is perfect for initializing the DMA. We enable the ADC in "Continuous Conversion Mode" and in the "DMA Settings" tab, we add a channel in "Normal" mode (not circular for now).
Configuration Summary
- ADC1: Enabled in Continuous Conversion Mode.
- DMA (for ADC1):
  - Stream: DMA2 Stream 0
  - Direction: Peripheral to Memory
  - Mode: Normal (for this first example)
  - Priority: Low
  - Data Width (Peripheral & Memory): Half Word (16-bit)
  - Memory Address: Increment Enabled
  - Peripheral Address: Increment Disabled
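For readers who want to see what these settings look like in code, here is a sketch of the equivalent HAL initialization, similar in spirit to what CubeMX generates. It is a configuration fragment for an STM32F4 target and will not build on a host. The handle name `hdma_adc1` and the function name `adc1_dma_init_sketch` are assumptions, and the ADC handle `hadc1` is assumed to be defined elsewhere in the project.

```c
#include "stm32f4xx_hal.h"

extern ADC_HandleTypeDef hadc1; /* assumed: ADC handle created elsewhere */
DMA_HandleTypeDef hdma_adc1;    /* assumed handle name */

void adc1_dma_init_sketch(void)
{
    /* The DMA controller needs its clock before any register access. */
    __HAL_RCC_DMA2_CLK_ENABLE();

    hdma_adc1.Instance = DMA2_Stream0;
    hdma_adc1.Init.Channel = DMA_CHANNEL_0;                        /* ADC1 request */
    hdma_adc1.Init.Direction = DMA_PERIPH_TO_MEMORY;
    hdma_adc1.Init.PeriphInc = DMA_PINC_DISABLE;                   /* fixed register address */
    hdma_adc1.Init.MemInc = DMA_MINC_ENABLE;                       /* walk through the array */
    hdma_adc1.Init.PeriphDataAlignment = DMA_PDATAALIGN_HALFWORD;  /* 16-bit */
    hdma_adc1.Init.MemDataAlignment = DMA_MDATAALIGN_HALFWORD;     /* 16-bit */
    hdma_adc1.Init.Mode = DMA_NORMAL;
    hdma_adc1.Init.Priority = DMA_PRIORITY_LOW;
    hdma_adc1.Init.FIFOMode = DMA_FIFOMODE_DISABLE;

    if (HAL_DMA_Init(&hdma_adc1) != HAL_OK) {
        /* Error handling is project-specific. */
    }

    /* Attach the DMA handle to the ADC handle so HAL_ADC_Start_DMA can use it. */
    __HAL_LINKDMA(&hadc1, DMA_Handle, hdma_adc1);

    /* Let the DMA interrupt reach the CPU so the HAL callbacks can fire. */
    HAL_NVIC_SetPriority(DMA2_Stream0_IRQn, 0, 0);
    HAL_NVIC_EnableIRQ(DMA2_Stream0_IRQn);
}
```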
The Code Explained
In our `main.c`, we first declare an array to receive the data:
// Buffer to store the 100 ADC readings
// The 'volatile' keyword is crucial to prevent the compiler from optimizing away
// access to this array, as it is modified by the DMA in the background.
volatile uint16_t adc_values[100];
Then, in the `main` function, after initializations, we start the DMA transfer:
// Start the ADC conversion in DMA mode
// The function returns immediately, the CPU is free!
HAL_ADC_Start_DMA(&hadc1, (uint32_t*)adc_values, 100);
// The CPU can do other things here while the DMA works.
while (1)
{
// ... do calculations, blink an LED, etc ...
}
But how do we know when the transfer is complete? The DMA notifies us via an interrupt, and the HAL library provides a callback for us to implement:
// Called by the HAL, from the DMA interrupt, once all 100 values have been transferred.
// It runs in interrupt context, so keep the work here short.
void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef* hadc)
{
    // This callback is shared by all ADCs; ignore the ones we did not start.
    if (hadc->Instance != ADC1)
    {
        return;
    }
    // The 'adc_values' buffer is now full. We can process the data.
    // For example, calculate the average.
    uint32_t sum = 0;
    for (int i = 0; i < 100; i++)
    {
        sum += adc_values[i];
    }
    uint16_t average = (uint16_t)(sum / 100);
    (void)average; // Use the result here (store it, transmit it, etc.)
    // We could restart a new transfer if we wanted to
    // HAL_ADC_Start_DMA(&hadc1, (uint32_t*)adc_values, 100);
}
Practical Example 2: Audio Streaming with Circular Mode
For continuous data like audio, we can't afford to lose a single sample. If we only process data at the end of a large buffer, the processing time might cause us to miss the start of the next buffer. The solution is Circular Mode with half-transfer and full-transfer interrupts.
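The principle: we split our buffer in two. The DMA fills the first half. When it's done, it notifies us (half-transfer interrupt) and starts filling the second half. Meanwhile, our CPU has plenty of time to process the first half. When the second half is full, the transfer-complete interrupt fires, the DMA wraps around to the first half, and the cycle repeats.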
In CubeMX, we simply select "Circular" mode. The code changes in the callbacks:
// 200-sample buffer for our "ping-pong" setup
volatile uint16_t audio_buffer[200];
// ... in main() ...
// Start the DMA in circular mode
HAL_ADC_Start_DMA(&hadc1, (uint32_t*)audio_buffer, 200);
// Called when the first half (first 100 samples) is ready
void HAL_ADC_ConvHalfCpltCallback(ADC_HandleTypeDef* hadc)
{
// Process the first half of the buffer: &audio_buffer[0]
}
// Called when the second half (last 100 samples) is ready
void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef* hadc)
{
// Process the second half of the buffer: &audio_buffer[100]
}
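The ping-pong hand-off can be tried out on a host machine with the simulation below. It is only a sketch: `sim_dma_store` stands in for the DMA hardware, `on_half_transfer` and `on_full_transfer` stand in for the two HAL callbacks, and all names are hypothetical. One design choice it illustrates is keeping the callbacks tiny (they only set a flag) and doing the actual processing from the main loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUF_LEN  200u
#define HALF_LEN (BUF_LEN / 2u)

static volatile uint16_t sim_buffer[BUF_LEN];
static volatile int half_ready = 0; /* first half is ready to process */
static volatile int full_ready = 0; /* second half is ready to process */
static size_t write_index = 0;

/* Stand-ins for the half-transfer and transfer-complete callbacks. */
void on_half_transfer(void) { half_ready = 1; }
void on_full_transfer(void) { full_ready = 1; }

/* Simulated DMA: store one sample, signal at the midpoint and the end, then wrap. */
void sim_dma_store(uint16_t sample)
{
    sim_buffer[write_index++] = sample;
    if (write_index == HALF_LEN) {
        on_half_transfer();
    } else if (write_index == BUF_LEN) {
        on_full_transfer();
        write_index = 0; /* circular mode: wrap to the start */
    }
}

/* Average one half of the buffer (HALF_LEN samples). */
uint32_t average_half(const volatile uint16_t *half)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < HALF_LEN; i++) {
        sum += half[i];
    }
    return sum / HALF_LEN;
}

/* Main-loop step: returns 1 and writes the average if a half was waiting, else 0. */
int process_ready_half(uint32_t *average_out)
{
    assert(average_out != NULL);
    if (half_ready) {
        half_ready = 0;
        *average_out = average_half(&sim_buffer[0]);
        return 1;
    }
    if (full_ready) {
        full_ready = 0;
        *average_out = average_half(&sim_buffer[HALF_LEN]);
        return 1;
    }
    return 0;
}
```

The scheme only works if each half is processed before the DMA comes back around to overwrite it, so the processing time per half must stay below the time it takes to fill one half.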
Performance Visualization
It's one thing to say the CPU is free, but it's another to see it. Here are two ways to visualize the impact of DMA.
CPU Load Comparison
The graph below is a conceptual representation of the CPU's workload when continuously sampling a peripheral. Without DMA, the CPU is stuck in a busy-wait loop. With DMA, its load drops dramatically, leaving it free for other tasks.
[Figure: conceptual CPU load, without DMA vs. with DMA]
Logic Analyzer: SPI Transfer
This is the ultimate proof. Let's imagine sending 3 bytes over SPI. Without DMA, the CPU must write each byte to the data register, wait for the transfer to complete, and then write the next one. This creates visible gaps on the bus.
With DMA, we configure the entire transfer upfront. The DMA controller feeds the SPI peripheral back-to-back, without any CPU intervention, resulting in maximum throughput.
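The effect of those gaps can be estimated with simple arithmetic. The helper below is a rough model, not a measurement: `spi_bus_time_ns` is a hypothetical function, and the clock rate and gap length used in the note that follows are assumed values chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Estimated time to send `n_bytes` over SPI at `sck_hz`, with an idle gap
 * of `gap_ns` between consecutive bytes (0 for back-to-back transfers). */
uint32_t spi_bus_time_ns(uint32_t n_bytes, uint32_t sck_hz, uint32_t gap_ns)
{
    assert(sck_hz > 0);
    if (n_bytes == 0) {
        return 0;
    }
    uint64_t byte_ns = (8ull * 1000000000ull) / sck_hz; /* 8 clock cycles per byte */
    uint64_t total = (uint64_t)n_bytes * byte_ns
                   + (uint64_t)(n_bytes - 1u) * gap_ns;
    return (uint32_t)total;
}
```

With an assumed 1 MHz clock and an assumed 2 µs CPU-induced gap, three bytes take 28 µs (`spi_bus_time_ns(3, 1000000, 2000)` returns 28000), against 24 µs when the bytes are sent back-to-back. The relative cost of the gaps grows as the SPI clock gets faster.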
Expert Corner: Bare-Metal Configuration
While the HAL is convenient, understanding direct register access is key for optimization and non-ST ecosystems. Here is a conceptual look at configuring the same ADC-to-memory transfer on an STM32F4 by writing directly to registers.
// 1. Enable DMA2 Clock in RCC
RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN;
// 2. Configure DMA2 Stream 0, Channel 0 for ADC1
// Disable the stream first
DMA2_Stream0->CR &= ~DMA_SxCR_EN;
while(DMA2_Stream0->CR & DMA_SxCR_EN) {}; // Wait until disabled
// Set peripheral address (ADC1 data register)
DMA2_Stream0->PAR = (uint32_t)&(ADC1->DR);
// Set memory address
DMA2_Stream0->M0AR = (uint32_t)adc_values;
// Set number of items to transfer
DMA2_Stream0->NDTR = 100;
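// Configure control register (the stream must be disabled while writing it):
// - Channel 0 selection (ADC1 request)
// - Memory increment mode
// - 16-bit peripheral and memory data sizes
// - Peripheral-to-memory direction (DIR = 00, the reset value)
// - Enable transfer complete interrupt
// For circular mode, additionally OR in DMA_SxCR_CIRC.
DMA2_Stream0->CR = (0U << DMA_SxCR_CHSEL_Pos) |
                   DMA_SxCR_MINC |
                   DMA_SxCR_PSIZE_0 |
                   DMA_SxCR_MSIZE_0 |
                   DMA_SxCR_TCIE;
// 3. Configure and enable the DMA interrupt in the NVIC (CMSIS calls, no HAL needed)
NVIC_SetPriority(DMA2_Stream0_IRQn, 0);
NVIC_EnableIRQ(DMA2_Stream0_IRQn);
// 4. Clear any stale Stream 0 flags, then enable the stream
DMA2->LIFCR = DMA_LIFCR_CTCIF0 | DMA_LIFCR_CHTIF0 | DMA_LIFCR_CTEIF0 |
              DMA_LIFCR_CDMEIF0 | DMA_LIFCR_CFEIF0;
DMA2_Stream0->CR |= DMA_SxCR_EN;
// Note: the ADC must also be set up to issue DMA requests (DMA bit in ADC1->CR2).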
Advanced Topic: Scatter-Gather DMA
On more advanced controllers, DMA can perform even more complex tasks. Scatter-Gather DMA allows the controller to read from or write to multiple, non-contiguous memory blocks in a single operation. It does this by following a linked list of "descriptors" in memory, where each descriptor tells the DMA engine the source, destination, and size of the next block.
This is extremely powerful for applications like networking, where you might build a packet from a header, a payload, and a footer that are all in different memory locations, and send them in one seamless SPI or Ethernet transfer.
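The descriptor chain can be illustrated with a small host-side model. This is a sketch under assumptions: the `sg_descriptor_t` layout and the `sg_gather` function are hypothetical, each real controller defines its own descriptor format, and the copying done here by `memcpy` would be done by the DMA engine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor: one block of data plus a link to the next one. */
typedef struct sg_descriptor {
    const uint8_t *src;               /* start of this block */
    size_t length;                    /* size of this block in bytes */
    const struct sg_descriptor *next; /* NULL marks the end of the chain */
} sg_descriptor_t;

/* Walk the chain and copy every block to `dst`, back-to-back.
 * Returns the number of bytes written, or 0 if the chain does not fit
 * in `dst_capacity` (in that case `dst` may be partially written). */
size_t sg_gather(const sg_descriptor_t *first, uint8_t *dst, size_t dst_capacity)
{
    assert(dst != NULL);
    size_t written = 0;
    for (const sg_descriptor_t *d = first; d != NULL; d = d->next) {
        if (d->length > dst_capacity - written) {
            return 0;
        }
        memcpy(dst + written, d->src, d->length);
        written += d->length;
    }
    return written;
}
```

Chaining three descriptors (header, payload, footer) and calling `sg_gather` once produces a single contiguous frame, which mirrors how a scatter-gather engine would stream the three blocks to a peripheral without the CPU assembling them first.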
Common Pitfalls & Best Practices
- Forgetting the DMA Clock: The DMA controller is a peripheral like any other. If you don't enable its clock in the RCC (Reset and Clock Control), it won't work. CubeMX does this for us, but in bare-metal programming, it's a classic mistake.
- The `volatile` Keyword: This is probably the most common error. If you don't declare your DMA buffer as `volatile`, the compiler, in an effort to optimize, might decide that your code never modifies this buffer (since the DMA does it) and keep reusing stale values it has already loaded. Your program may then never see the new data.
- Cache Coherency: On more powerful microcontrollers (Cortex-M7), the CPU has a data cache. It's possible for the DMA to write to RAM, but the CPU still reads an old version of the data from its cache. You must invalidate the affected cache lines (for example with the CMSIS function `SCB_InvalidateDCache_by_Addr`) to ensure you're reading fresh data.
Conclusion
DMA transforms how you design an embedded application. By delegating data transfers, it frees up the CPU to focus on higher-value tasks. While its configuration may seem intimidating, a structured approach and an understanding of its key concepts make it an extraordinarily powerful tool. Don't be afraid of it anymore: make it your best ally.