The Complete Guide to DMA (Direct Memory Access) for Microcontrollers
A practical engineering guide to choosing normal vs circular mode, wiring DMA to ADC, UART, and SPI flows, and debugging the failure patterns that usually cost the most time.
Quick answer
| Prompt | Short answer |
|---|---|
| What DMA is | Hardware that moves data between a peripheral and memory without the CPU copying every sample, byte, or word itself. |
| What it is not | It does not replace protocol logic, fix framing bugs, or make a peripheral clock faster than the rest of the system allows. |
| When it helps most | Repetitive transfers such as ADC capture, UART receive buffers, SPI bursts, audio streaming, and memory shoveling between blocks. |
| First checklist | Direction, request source, increment settings, data width, buffer length, and where the buffer actually lives in memory. |
Why DMA actually matters
Without DMA, the CPU often ends up servicing transfers that are mechanically simple but frequent: read one register, store one sample, write one SPI data word, repeat. That is not where you want your firmware budget to go.
DMA lets the CPU configure the transfer once, then step aside while the hardware moves data in the background. The gain is usually not that the peripheral itself runs faster. The gain is that the CPU stops burning cycles on bookkeeping and can focus on control logic, signal processing, power management, or simply staying asleep longer.
| Question | CPU-driven transfer | DMA-driven transfer |
|---|---|---|
| Who moves each item? | The CPU services every byte, sample, or word. | The DMA engine services the transfer after setup. |
| CPU involvement | High and repetitive. | Front-loaded at setup, then mostly interrupt-driven. |
| Best for | Very small or rare transfers where simplicity wins. | Streams, bursts, and background I/O that would otherwise dominate servicing time. |
| Typical trap | Busy waiting, jitter, and missed deadlines elsewhere. | Misconfigured request mapping, width, or cache handling. |
The mental model: CPU, peripheral, DMA
The cleanest way to reason about DMA is to treat it as a third actor in the system, not as an opaque configuration toggle in CubeMX. The CPU configures the rules, the peripheral generates data or consumes data, and the DMA performs the repetitive bus transactions in between.
Core building blocks you must get right
| Item | Why it matters | Typical mistake |
|---|---|---|
| Transfer direction | Defines whether the peripheral is the source or destination, and whether the buffer is being filled or drained. | Configuring the channel for the opposite direction of the real data path. |
| Address increment | Peripheral registers usually stay fixed while RAM buffers usually increment across elements. | Incrementing the peripheral address or forgetting memory increment on an array transfer. |
| Transfer width | The peripheral data register width and the RAM element width must make sense together. | Reading a 12-bit or 16-bit peripheral into the wrong container size and getting garbage-looking samples. |
| Request routing | The peripheral still needs to trigger the right channel, stream, or DMAMUX slot. | Using the wrong request source and concluding that DMA itself is broken. |
| Completion events | Software needs a clear handoff point such as half-transfer, transfer-complete, error, or protocol framing. | Treating callback selection as cosmetic instead of part of the ownership model. |
| Memory visibility | Not every RAM bank is DMA-visible, and caches can make valid transfers look stale. | Debugging application code first when the real issue is RAM placement or cache maintenance. |
A setup workflow that survives framework differences
Whether you use CubeMX, HAL, LL drivers, Zephyr, or bare-metal code, the bring-up sequence is almost always the same. The API names change. The engineering order does not.
- 1. Define the data path: peripheral to memory, memory to peripheral, or memory to memory. If this is fuzzy, everything downstream gets fuzzy too.
- 2. Choose the buffer shape first: pick the element type, element count, alignment, and RAM region before touching the DMA configuration UI.
- 3. Map the request source: pick the actual channel, stream, or DMAMUX request that belongs to the peripheral instance you are using.
- 4. Set width, increment, and mode: these three settings are where most first-pass DMA setups go wrong. Review them together, not one by one.
- 5. Define the handoff event: decide whether software should react to half-transfer, transfer-complete, IDLE detect, or a higher-level framing rule.
- 6. Verify ownership and cache behavior: make it explicit when DMA owns the buffer and when firmware is allowed to read or overwrite it.
Normal, circular, and double-buffer thinking
Picking the wrong transfer mode is one of the fastest ways to make DMA feel unreliable. The mode must match the lifecycle of the data, not just the peripheral.
| Mode | Best fit | What to watch |
|---|---|---|
| Normal mode | Bounded transfers such as one ADC burst, one SPI block, or one memory copy. | The DMA stops after the programmed count, so software must restart it for the next burst. |
| Circular mode | Continuous streams such as ADC sampling, I2S audio, or long-lived UART RX paths. | The stream keeps wrapping, so the consumer side still has to keep up with the producer. |
| Double-buffer pattern | Pipelines where processing must overlap the next fill window without racing the buffer writer. | You still need a clean ownership rule between the half being processed and the half being filled. |
| If your problem looks like... | Start with... | Why |
|---|---|---|
| Take 128 ADC samples, process once, stop | Normal mode | The transfer has a clear end and software regains ownership after completion. |
| Keep sampling forever and process windows in the background | Circular mode | The stream is continuous, so the transfer should not need manual restart logic. |
| Audio or streaming data where processing must overlap the next fill | Double-buffer pattern | It creates an explicit ownership window for one half while the other half is still moving. |
| UART receive path with unknown frame boundaries | Circular DMA plus framing event | DMA absorbs the byte flow while IDLE, delimiters, or protocol logic decide where frames end. |
Worked patterns you will actually reuse
Pattern 1: ADC burst capture in normal mode
This is the clean starting point: trigger a bounded conversion run, fill a buffer, then process it after transfer complete. It is ideal when you need a snapshot rather than a permanent stream.
```c
volatile uint16_t adcSamples[256];

// Arm a one-shot, 256-sample conversion run (normal mode).
HAL_ADC_Start_DMA(&hadc1, (uint32_t *)adcSamples, 256);

void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef *hadc)
{
    if (hadc != &hadc1) return;

    uint32_t sum = 0;
    for (size_t i = 0; i < 256; ++i) {
        sum += adcSamples[i];
    }
    uint16_t average = (uint16_t)(sum / 256U);
    (void)average;  // hand the result off to the application here
}
```
What must match
- Peripheral-to-memory direction
- Peripheral address fixed, memory increment enabled
- ADC data width mapped into a 16-bit buffer element
- Normal mode if you really want a one-shot burst
Pattern 2: UART receive without losing bytes
UART RX is where DMA often becomes mandatory in practice. Polling or per-byte interrupts are easy to start with, but they become fragile once traffic gets bursty or the firmware has other deadlines.
```c
uint8_t uartRx[256];

// Receive until the line goes idle or the buffer fills.
HAL_UARTEx_ReceiveToIdle_DMA(&huart2, uartRx, sizeof(uartRx));

void HAL_UARTEx_RxEventCallback(UART_HandleTypeDef *huart, uint16_t size)
{
    if (huart != &huart2) return;

    // 'size' bytes are ready in uartRx[0..size-1].
    // Parse the frame here, then re-arm if your stack requires it.
}
```
Design note
API names vary by MCU family and HAL version. The reusable idea is the same: let DMA absorb incoming bytes into RAM while software decides where frame boundaries actually are.
On stacks without a receive-to-idle helper, the common fallback is circular DMA plus an IDLE interrupt or a software read pointer.
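That fallback is mostly ring-buffer bookkeeping, so it is worth seeing the consumer side spelled out. The sketch below models the software read pointer against a circular DMA buffer; in real firmware the write index would be derived from the channel's remaining-count register (for example `RX_BUF_LEN - NDTR` on STM32 parts), but here it is passed in as a parameter so the wrap logic stands on its own. The names `rxBuf`, `rxTail`, and `uart_rx_drain` are illustrative, not from any vendor API.

```c
#include <stddef.h>
#include <stdint.h>

#define RX_BUF_LEN 256u

static uint8_t rxBuf[RX_BUF_LEN];   // filled by circular DMA in a real system
static size_t  rxTail;              // software read pointer

// Copy every byte the DMA has written since the last call into 'out'.
// 'writeIndex' is where the DMA will write next (derived from the
// channel's remaining-count register in firmware; injected here so the
// wrap-around logic can be exercised on a host).
size_t uart_rx_drain(size_t writeIndex, uint8_t *out, size_t outCap)
{
    size_t copied = 0;
    while (rxTail != writeIndex && copied < outCap) {
        out[copied++] = rxBuf[rxTail];
        rxTail = (rxTail + 1u) % RX_BUF_LEN;   // wrap exactly like the DMA does
    }
    return copied;
}
```

Call this from the IDLE interrupt, a periodic tick, or both; because the read pointer only ever chases the write index, the same function works for all three triggers.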
Pattern 3: SPI or I2S streaming with half-buffer processing
Streaming paths need continuity more than they need extra callback churn. This is where circular DMA plus half-transfer and transfer-complete events turns into a dependable pipeline rather than a glorified copy loop.
```c
int16_t audioBuffer[512];

// Circular DMA keeps filling the buffer; callbacks hand off one half at a time.
HAL_I2S_Receive_DMA(&hi2s2, (uint16_t *)audioBuffer, 512);

void HAL_I2S_RxHalfCpltCallback(I2S_HandleTypeDef *hi2s)
{
    processBlock(&audioBuffer[0], 256);    // first half is stable while DMA fills the second
}

void HAL_I2S_RxCpltCallback(I2S_HandleTypeDef *hi2s)
{
    processBlock(&audioBuffer[256], 256);  // second half is stable while DMA wraps to the first
}
```
The real rule
Your processing time for each half-buffer must stay comfortably below the time it takes DMA to fill the other half. If not, the design is already late even if the callbacks keep firing.
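That deadline can be computed before any code runs. A minimal sketch, assuming you know the sample rate and can bound your worst-case processing time (the function names and the margin policy are my own, not from any HAL):

```c
#include <stdint.h>

// Microseconds the DMA takes to fill one half of a circular buffer.
static uint32_t half_fill_budget_us(uint32_t sampleRateHz, uint32_t halfLenSamples)
{
    return (uint32_t)(((uint64_t)halfLenSamples * 1000000u) / sampleRateHz);
}

// Viable only if worst-case processing fits inside the budget with headroom,
// e.g. marginPercent = 20 demands processing stay under 80% of the budget.
static int pipeline_is_viable(uint32_t budgetUs, uint32_t worstCaseProcessingUs,
                              uint32_t marginPercent)
{
    return (uint64_t)worstCaseProcessingUs * 100u
         <= (uint64_t)budgetUs * (100u - marginPercent);
}
```

For the 512-element buffer above at an assumed 48 kHz sample rate, each 256-sample half gives roughly a 5.3 ms budget; if your worst-case `processBlock` time does not fit inside that with margin, no amount of callback tuning will save the design.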
Performance visualization
Saying that the CPU is free is not enough. The interesting question is what work disappears from the foreground path once the transfer becomes hardware-driven.
| Foreground view | CPU-driven transfer | DMA-assisted transfer |
|---|---|---|
| What the CPU keeps doing | Repeatedly services each byte, word, or sample while the transfer is active. | Mostly handles setup and completion events, then spends its budget elsewhere. |
| Timing risk | Inter-byte or inter-word gaps appear if firmware cannot reload the peripheral fast enough. | The burst stays fed as long as DMA, memory, and the peripheral path keep up. |
| What still remains | Protocol logic, framing, and application processing all compete with the transfer service loop. | Application work, cache management on some MCUs, and transfer-complete handling still remain. |
SPI transfer without DMA
When software has to keep servicing the transfer itself, inter-byte gaps can appear while firmware reloads the next data word.
Conceptual timing sketch: the exact gap shape depends on the SPI FIFO depth, interrupt strategy, and how quickly software can reload the transmit register.
SPI transfer with DMA
Once the transfer is armed, the DMA can keep the data path fed with little or no CPU intervention between words, then software handles the completion event.
The point is continuity during the burst, not literally zero CPU work. Setup and transfer-complete handling still exist around the DMA transaction.
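The timing sketch above can be turned into rough arithmetic. The model below treats a burst as pure wire time plus a per-byte software stall; the stall figure is an illustrative assumption you would measure on your own system, not a datasheet number.

```c
#include <stdint.h>

// Total burst time: bits on the wire plus any per-byte software reload gap.
// gapNsPerByte ~ 0 models a DMA-fed transmit path; a few microseconds per
// byte models an interrupt-driven reload loop.
static uint32_t burst_time_us(uint32_t bytes, uint32_t sckHz, uint32_t gapNsPerByte)
{
    uint64_t wireNs = (uint64_t)bytes * 8u * 1000000000u / sckHz;
    uint64_t gapNs  = (uint64_t)bytes * gapNsPerByte;
    return (uint32_t)((wireNs + gapNs) / 1000u);
}
```

With assumed numbers such as a 10 MHz clock and a 2 us reload gap, a 1024-byte burst takes about 2.9 ms serviced byte-by-byte versus about 0.82 ms when the path stays fed, which is why the gaps, not the clock, usually dominate.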
Advanced DMA topics
Once the basic peripheral-to-memory and memory-to-peripheral flows are solid, the next leap is descriptor-based DMA. Not every MCU has it, but it is worth recognizing because it shows up in networking, storage, display, and higher-end streaming architectures.
Why it matters
Descriptor-based DMA lets hardware chain multiple segments without forcing software to repack everything into one contiguous temporary buffer first.
- Common in Ethernet, SDMMC, display, and higher-end audio pipelines
- Useful when the packet or frame already exists in several memory regions
- Reduces copy overhead but raises the debugging bar
- Needs stronger ownership rules for descriptors and data buffers
The advanced version of the same lesson still applies: DMA moves bytes well, but it does not remove the need for correct buffer lifetime, ordering, and cache handling.
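The shape of a descriptor chain is simple even though real controllers dress it up with flags and ownership bits. This is an illustrative layout, not any vendor's: Ethernet MACs, SDMMC hosts, and display controllers each define their own descriptor formats, but the linked-segment idea is the same.

```c
#include <stddef.h>
#include <stdint.h>

// Illustrative scatter-gather descriptor. Real hardware adds status flags,
// ownership bits, and strict alignment rules on top of this shape.
typedef struct dma_desc {
    const uint8_t   *addr;   // segment start
    uint32_t         len;    // segment length in bytes
    struct dma_desc *next;   // NULL terminates the chain
} dma_desc_t;

// What the engine conceptually does: walk the chain and move each segment,
// so software never repacks them into one contiguous temporary buffer.
static uint32_t chain_total_bytes(const dma_desc_t *d)
{
    uint32_t total = 0;
    for (; d != NULL; d = d->next) {
        total += d->len;
    }
    return total;
}
```

A typical use is a packet whose header, payload, and trailer already live in three different buffers: three descriptors describe it in place, and the "stronger ownership rules" from the list above apply to every node, since the hardware may still be reading a descriptor while software wants to recycle it.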
Bare-metal register view
Vendor frameworks hide the register details, which is fine until you have to debug a setup that almost works. A generic DMA channel usually boils down to the same fields no matter how the vendor names them:
| Register concept | What it controls | Typical question |
|---|---|---|
| PAR / CPAR | Peripheral register address | Am I pointing at the real data register? |
| MAR / CMAR | Memory buffer address | Is this RAM region DMA-visible and aligned? |
| NDTR / CNDTR | Transfer item count | Am I counting items or bytes on this MCU? |
| CR / CCR | Direction, increment, mode, width, priority | Does the configuration match the actual data path? |
| ISR / IFCR | Status flags and flag clearing | Did transfer complete, half-complete, or error really fire? |
If you have to debug at register level, inspect the peripheral request path first, then the DMA channel state, then the memory region and cache story. It is usually one of those three.
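To make the table concrete, here is a minimal register-level setup sketch. The register names and bit positions follow the STM32F1-style DMA channel as an illustration; treat every constant as an assumption to check against your own reference manual. The struct is a plain local type here so the logic is host-testable, whereas real code would point it at the channel's memory-mapped base address.

```c
#include <stdint.h>

// STM32F1-style channel registers (layout shown for illustration only).
typedef struct {
    volatile uint32_t CCR;    // direction, increments, widths, mode, enable
    volatile uint32_t CNDTR;  // number of items left to transfer
    volatile uint32_t CPAR;   // peripheral data register address
    volatile uint32_t CMAR;   // memory buffer address
} dma_channel_t;

#define DMA_CCR_EN        (1u << 0)
#define DMA_CCR_TCIE      (1u << 1)   // transfer-complete interrupt
#define DMA_CCR_CIRC      (1u << 5)
#define DMA_CCR_MINC      (1u << 7)
#define DMA_CCR_PSIZE_16  (1u << 8)   // 16-bit peripheral width
#define DMA_CCR_MSIZE_16  (1u << 10)  // 16-bit memory width

// Peripheral-to-memory (DIR = 0), fixed peripheral address, incrementing
// memory address, 16-bit on both sides, circular, then enable last.
static void dma_setup_adc_stream(dma_channel_t *ch, uint32_t periphAddr,
                                 uint32_t memAddr, uint32_t count)
{
    ch->CCR   = 0;                     // disable while reconfiguring
    ch->CPAR  = periphAddr;            // the real data register, e.g. ADC DR
    ch->CMAR  = memAddr;               // a DMA-visible RAM buffer
    ch->CNDTR = count;                 // items, not bytes, on this family
    ch->CCR   = DMA_CCR_MINC | DMA_CCR_PSIZE_16 | DMA_CCR_MSIZE_16
              | DMA_CCR_CIRC | DMA_CCR_TCIE;
    ch->CCR  |= DMA_CCR_EN;            // enable only after everything else
}
```

Note the ordering: addresses and count are programmed while the channel is disabled, and the enable bit is set last, which mirrors the debug order below in reverse.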
Minimal debug order
- 1. Confirm the peripheral is actually generating a DMA request in the mode you selected.
- 2. Confirm the channel or stream is mapped to that request source.
- 3. Confirm the transfer counter decrements when the event should occur.
- 4. Confirm the destination RAM region is both valid and visible to the DMA engine.
- 5. Only then debug callbacks, framework wrappers, and higher-level application logic.
Failure patterns that burn the most time
| Symptom | Likely cause | Check first |
|---|---|---|
| Nothing moves at all | No effective DMA request path, missing clock, wrong mapping, or inactive peripheral-side request enable. | Clocks, DMAMUX or stream mapping, NVIC path if used, and whether the peripheral is really issuing requests. |
| The buffer fills with nonsense | Width mismatch, wrong increment setting, swapped direction, or incorrect buffer element type. | Peripheral width, memory width, source vs destination, and increment configuration. |
| Transfer works once, then dies | Normal mode was configured for a one-shot transfer, but software expected a continuous stream. | Selected mode, restart logic, and whether the peripheral request is re-armed after completion. |
| The CPU still sees stale data | Cache coherency issue or a buffer placed in RAM that the DMA engine cannot actually access. | RAM bank placement, cache maintenance rules, and whether the buffer is really DMA-visible on that MCU. |
| Circular mode still overruns | The processing side cannot keep up with the producer rate, so the system is overloaded despite DMA. | Time budget per half-buffer, callback execution time, and whether the consumer workload is too heavy. |
| DMA was not the right answer | The transfer is too small or infrequent for the added configuration and debug cost to pay off. | Whether polling or a simple interrupt path would solve the actual problem with less complexity. |
Practical rules of thumb
- Do not start with circular mode if your use case is actually a one-shot burst.
- Put the buffer type and transfer width under explicit review before chasing timing ghosts.
- Think in buffer ownership windows: when DMA owns it, when software owns it, and how that handoff is signaled.
- On cached MCUs, choose DMA-safe RAM and treat cache maintenance as part of the design, not a patch.
- Measure whether the consumer side can keep up with the producer side before declaring the DMA path done.
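The cache rule of thumb above has one mechanical detail worth pinning down: maintenance operations such as the CMSIS `SCB_InvalidateDCache_by_Addr()` on a Cortex-M7 work on whole cache lines, so the span you pass must be rounded to line boundaries. The helper below is just that address arithmetic, assuming a 32-byte line size:

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 32u   // Cortex-M7 D-cache line size

// Round an arbitrary buffer span out to whole cache lines before handing it
// to a clean/invalidate operation. Pure address arithmetic, host-testable.
static void cache_op_span(uintptr_t addr, size_t len,
                          uintptr_t *alignedAddr, size_t *alignedLen)
{
    uintptr_t start = addr & ~(uintptr_t)(CACHE_LINE - 1u);      // round down
    uintptr_t end   = (addr + len + CACHE_LINE - 1u)
                    & ~(uintptr_t)(CACHE_LINE - 1u);             // round up
    *alignedAddr = start;
    *alignedLen  = (size_t)(end - start);
}
```

The rounding also shows why unaligned DMA buffers are dangerous on cached cores: a buffer that shares a cache line with unrelated variables gets those neighbors invalidated too, so the sturdier design is to align DMA buffers to `CACHE_LINE` and pad their size in the first place.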
FAQ
- How does DMA work in a microcontroller?
- When should you use DMA instead of interrupts?
- What are common DMA mistakes in embedded systems?
- Does DMA make the peripheral itself faster?
- When should I use circular mode?
- Why does DMA update RAM but my code still sees old values?
- Is DMA always the best practice for throughput?
Related resources
- UART Serial Communication Explained: pair DMA with robust UART framing, baud budgeting, and receive-side error handling.
- SPI Interface Guide: understand the bus timing that makes SPI plus DMA such a common high-throughput pattern.
- I2S Interface Guide: see where circular DMA and half-buffer callbacks become essential in streaming audio paths.