The Complete Guide to DMA (Direct Memory Access) for Microcontrollers
A practical engineering guide to choosing normal vs circular mode, wiring DMA to ADC, UART, and SPI flows, and debugging the failure patterns that usually cost the most time.
Quick answer
| Prompt | Short answer |
|---|---|
| What DMA is | Hardware that moves data between a peripheral and memory without the CPU copying every sample, byte, or word itself. |
| What it is not | It does not replace protocol logic, fix framing bugs, or make a peripheral clock faster than the rest of the system allows. |
| When it helps most | Repetitive transfers such as ADC capture, UART receive buffers, SPI bursts, audio streaming, and memory shoveling between blocks. |
| First checklist | Direction, request source, increment settings, data width, buffer length, and where the buffer actually lives in memory. |
Why DMA actually matters
Without DMA, the CPU often ends up servicing transfers that are mechanically simple but frequent: read one register, store one sample, write one SPI data word, repeat. That is not where you want your firmware budget to go.
DMA lets the CPU configure the transfer once, then step aside while the hardware moves data in the background. The gain is usually not that the peripheral itself runs faster. The gain is that the CPU stops burning cycles on bookkeeping and can focus on control logic, signal processing, power management, or simply staying asleep longer.
| Question | CPU-driven transfer | DMA-driven transfer |
|---|---|---|
| Who moves each item? | The CPU services every byte, sample, or word. | The DMA engine services the transfer after setup. |
| CPU involvement | High and repetitive. | Front-loaded at setup, then mostly interrupt-driven. |
| Best for | Very small or rare transfers where simplicity wins. | Streams, bursts, and background I/O that would otherwise dominate servicing time. |
| Typical trap | Busy waiting, jitter, and missed deadlines elsewhere. | Misconfigured request mapping, width, or cache handling. |
The mental model: CPU, peripheral, DMA
The cleanest way to reason about DMA is to treat it as a third actor in the system, not as an opaque configuration toggle in CubeMX. The CPU configures the rules, the peripheral generates data or consumes data, and the DMA performs the repetitive bus transactions in between.
Core building blocks you must get right
| Item | Why it matters | Typical mistake |
|---|---|---|
| Transfer direction | Defines whether the peripheral is the source or destination, and whether the buffer is being filled or drained. | Configuring the channel for the opposite direction of the real data path. |
| Address increment | Peripheral registers usually stay fixed while RAM buffers usually increment across elements. | Incrementing the peripheral address or forgetting memory increment on an array transfer. |
| Transfer width | The peripheral data register width and the RAM element width must make sense together. | Reading a 12-bit or 16-bit peripheral into the wrong container size and getting garbage-looking samples. |
| Request routing | The peripheral still needs to trigger the right channel, stream, or DMAMUX slot. | Using the wrong request source and concluding that DMA itself is broken. |
| Completion events | Software needs a clear handoff point such as half-transfer, transfer-complete, error, or protocol framing. | Treating callback selection as cosmetic instead of part of the ownership model. |
| Memory visibility | Not every RAM bank is DMA-visible, and caches can make valid transfers look stale. | Debugging application code first when the real issue is RAM placement or cache maintenance. |
A setup workflow that survives framework differences
Whether you use CubeMX, HAL, LL drivers, Zephyr, or bare-metal code, the bring-up sequence is almost always the same. The API names change. The engineering order does not.
- 1. Define the data path: peripheral to memory, memory to peripheral, or memory to memory. If this is fuzzy, everything downstream gets fuzzy too.
- 2. Choose the buffer shape first: pick the element type, element count, alignment, and RAM region before touching the DMA configuration UI.
- 3. Map the request source: pick the actual channel, stream, or DMAMUX request that belongs to the peripheral instance you are using.
- 4. Set width, increment, and mode: these three settings are where most first-pass DMA setups go wrong. Review them together, not one by one.
- 5. Define the handoff event: decide whether software should react to half-transfer, transfer-complete, IDLE detect, or a higher-level framing rule.
- 6. Verify ownership and cache behavior: make it explicit when DMA owns the buffer and when firmware is allowed to read or overwrite it.
Normal, circular, and double-buffer thinking
Picking the wrong transfer mode is one of the fastest ways to make DMA feel unreliable. The mode must match the lifecycle of the data, not just the peripheral.
| Mode | Best fit | What to watch |
|---|---|---|
| Normal mode | Bounded transfers such as one ADC burst, one SPI block, or one memory copy. | The DMA stops after the programmed count, so software must restart it for the next burst. |
| Circular mode | Continuous streams such as ADC sampling, I2S audio, or long-lived UART RX paths. | The stream keeps wrapping, so the consumer side still has to keep up with the producer. |
| Double-buffer pattern | Pipelines where processing must overlap the next fill window without racing the buffer writer. | You still need a clean ownership rule between the half being processed and the half being filled. |
| If your problem looks like... | Start with... | Why |
|---|---|---|
| Take 128 ADC samples, process once, stop | Normal mode | The transfer has a clear end and software regains ownership after completion. |
| Keep sampling forever and process windows in the background | Circular mode | The stream is continuous, so the transfer should not need manual restart logic. |
| Audio or streaming data where processing must overlap the next fill | Double-buffer pattern | It creates an explicit ownership window for one half while the other half is still moving. |
| UART receive path with unknown frame boundaries | Circular DMA plus framing event | DMA absorbs the byte flow while IDLE, delimiters, or protocol logic decide where frames end. |
Worked patterns you will actually reuse
Pattern 1: ADC burst capture in normal mode
This is the clean starting point: trigger a bounded conversion run, fill a buffer, then process it after transfer complete. It is ideal when you need a snapshot rather than a permanent stream.
```c
volatile uint16_t adcSamples[256];

// Arm a one-shot, 256-sample conversion run (normal mode).
HAL_ADC_Start_DMA(&hadc1, (uint32_t *)adcSamples, 256);

void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef *hadc)
{
    if (hadc != &hadc1) return;

    uint32_t sum = 0;
    for (size_t i = 0; i < 256; ++i) {
        sum += adcSamples[i];
    }
    uint16_t average = (uint16_t)(sum / 256U);
    (void)average;  // hand the result off to the application here
}
```
What must match
- Peripheral-to-memory direction
- Peripheral address fixed, memory increment enabled
- ADC data width mapped into a 16-bit buffer element
- Normal mode if you really want a one-shot burst
Pattern 2: UART receive without losing bytes
UART RX is where DMA often becomes mandatory in practice. Polling or per-byte interrupts are easy to start with, but they become fragile once traffic gets bursty or the firmware has other deadlines.
```c
uint8_t uartRx[256];

// Receive until the line goes idle or the buffer fills.
HAL_UARTEx_ReceiveToIdle_DMA(&huart2, uartRx, sizeof(uartRx));

void HAL_UARTEx_RxEventCallback(UART_HandleTypeDef *huart, uint16_t size)
{
    if (huart != &huart2) return;

    // 'size' bytes are ready in uartRx[0..size-1].
    // Parse the frame here, then re-arm if your stack requires it.
}
```
Design note
API names vary by MCU family and HAL version. The reusable idea is the same: let DMA absorb incoming bytes into RAM while software decides where frame boundaries actually are.
On stacks without a receive-to-idle helper, the common fallback is circular DMA plus an IDLE interrupt or a software read pointer.
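That fallback is mostly ring-buffer bookkeeping, so it is worth seeing the consumer side spelled out. The sketch below models the software read pointer against a circular DMA buffer; in real firmware the write index would be derived from the channel's remaining-count register (for example `RX_BUF_LEN - NDTR` on STM32 parts), but here it is passed in as a parameter so the wrap logic stands on its own. The names `rxBuf`, `rxTail`, and `uart_rx_drain` are illustrative, not from any vendor API.

```c
#include <stddef.h>
#include <stdint.h>

#define RX_BUF_LEN 256u

static uint8_t rxBuf[RX_BUF_LEN];   // filled by circular DMA in a real system
static size_t  rxTail;              // software read pointer

// Copy every byte the DMA has written since the last call into 'out'.
// 'writeIndex' is where the DMA will write next (derived from the
// channel's remaining-count register in firmware; injected here so the
// wrap-around logic can be exercised on a host).
size_t uart_rx_drain(size_t writeIndex, uint8_t *out, size_t outCap)
{
    size_t copied = 0;
    while (rxTail != writeIndex && copied < outCap) {
        out[copied++] = rxBuf[rxTail];
        rxTail = (rxTail + 1u) % RX_BUF_LEN;   // wrap exactly like the DMA does
    }
    return copied;
}
```

Call this from the IDLE interrupt, a periodic tick, or both; because the read pointer only ever chases the write index, the same function works for all three triggers.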
Pattern 3: SPI or I2S streaming with half-buffer processing
Streaming paths need continuity more than they need extra callback churn. This is where circular DMA plus half-transfer and transfer-complete events turns into a dependable pipeline rather than a glorified copy loop.
```c
int16_t audioBuffer[512];

// Circular DMA keeps filling the buffer; callbacks hand off one half at a time.
HAL_I2S_Receive_DMA(&hi2s2, (uint16_t *)audioBuffer, 512);

void HAL_I2S_RxHalfCpltCallback(I2S_HandleTypeDef *hi2s)
{
    processBlock(&audioBuffer[0], 256);    // first half is stable while DMA fills the second
}

void HAL_I2S_RxCpltCallback(I2S_HandleTypeDef *hi2s)
{
    processBlock(&audioBuffer[256], 256);  // second half is stable while DMA wraps to the first
}
```
The real rule
Your processing time for each half-buffer must stay comfortably below the time it takes DMA to fill the other half. If not, the design is already late even if the callbacks keep firing.
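That deadline can be computed before any code runs. A minimal sketch, assuming you know the sample rate and can bound your worst-case processing time (the function names and the margin policy are my own, not from any HAL):

```c
#include <stdint.h>

// Microseconds the DMA takes to fill one half of a circular buffer.
static uint32_t half_fill_budget_us(uint32_t sampleRateHz, uint32_t halfLenSamples)
{
    return (uint32_t)(((uint64_t)halfLenSamples * 1000000u) / sampleRateHz);
}

// Viable only if worst-case processing fits inside the budget with headroom,
// e.g. marginPercent = 20 demands processing stay under 80% of the budget.
static int pipeline_is_viable(uint32_t budgetUs, uint32_t worstCaseProcessingUs,
                              uint32_t marginPercent)
{
    return (uint64_t)worstCaseProcessingUs * 100u
         <= (uint64_t)budgetUs * (100u - marginPercent);
}
```

For the 512-element buffer above at an assumed 48 kHz sample rate, each 256-sample half gives roughly a 5.3 ms budget; if your worst-case `processBlock` time does not fit inside that with margin, no amount of callback tuning will save the design.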
Performance visualization
Saying that the CPU is free is not enough. The interesting question is what work disappears from the foreground path once the transfer becomes hardware-driven.
| Foreground view | CPU-driven transfer | DMA-assisted transfer |
|---|---|---|
| What the CPU keeps doing | Repeatedly services each byte, word, or sample while the transfer is active. | Mostly handles setup and completion events, then spends its budget elsewhere. |
| Timing risk | Inter-byte or inter-word gaps appear if firmware cannot reload the peripheral fast enough. | The burst stays fed as long as DMA, memory, and the peripheral path keep up. |
| What still remains | Protocol logic, framing, and application processing all compete with the transfer service loop. | Application work, cache management on some MCUs, and transfer-complete handling still remain. |
SPI transfer without DMA
When software has to keep servicing the transfer itself, inter-byte gaps can appear while firmware reloads the next data word.
Conceptual timing sketch: the exact gap shape depends on the SPI FIFO depth, interrupt strategy, and how quickly software can reload the transmit register.
SPI transfer with DMA
Once the transfer is armed, the DMA can keep the data path fed with little or no CPU intervention between words, then software handles the completion event.
The point is continuity during the burst, not literally zero CPU work. Setup and transfer-complete handling still exist around the DMA transaction.
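The timing sketch above can be turned into rough arithmetic. The model below treats a burst as pure wire time plus a per-byte software stall; the stall figure is an illustrative assumption you would measure on your own system, not a datasheet number.

```c
#include <stdint.h>

// Total burst time: bits on the wire plus any per-byte software reload gap.
// gapNsPerByte ~ 0 models a DMA-fed transmit path; a few microseconds per
// byte models an interrupt-driven reload loop.
static uint32_t burst_time_us(uint32_t bytes, uint32_t sckHz, uint32_t gapNsPerByte)
{
    uint64_t wireNs = (uint64_t)bytes * 8u * 1000000000u / sckHz;
    uint64_t gapNs  = (uint64_t)bytes * gapNsPerByte;
    return (uint32_t)((wireNs + gapNs) / 1000u);
}
```

With assumed numbers such as a 10 MHz clock and a 2 us reload gap, a 1024-byte burst takes about 2.9 ms serviced byte-by-byte versus about 0.82 ms when the path stays fed, which is why the gaps, not the clock, usually dominate.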
Advanced DMA topics
Once the basic peripheral-to-memory and memory-to-peripheral flows are solid, the next leap is descriptor-based DMA. Not every MCU has it, but it is worth recognizing because it shows up in networking, storage, display, and higher-end streaming architectures.
Why it matters
Descriptor-based DMA lets hardware chain multiple segments without forcing software to repack everything into one contiguous temporary buffer first.
- Common in Ethernet, SDMMC, display, and higher-end audio pipelines
- Useful when the packet or frame already exists in several memory regions
- Reduces copy overhead but raises the debugging bar
- Needs stronger ownership rules for descriptors and data buffers
The advanced version of the same lesson still applies: DMA moves bytes well, but it does not remove the need for correct buffer lifetime, ordering, and cache handling.
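The shape of a descriptor chain is simple even though real controllers dress it up with flags and ownership bits. This is an illustrative layout, not any vendor's: Ethernet MACs, SDMMC hosts, and display controllers each define their own descriptor formats, but the linked-segment idea is the same.

```c
#include <stddef.h>
#include <stdint.h>

// Illustrative scatter-gather descriptor. Real hardware adds status flags,
// ownership bits, and strict alignment rules on top of this shape.
typedef struct dma_desc {
    const uint8_t   *addr;   // segment start
    uint32_t         len;    // segment length in bytes
    struct dma_desc *next;   // NULL terminates the chain
} dma_desc_t;

// What the engine conceptually does: walk the chain and move each segment,
// so software never repacks them into one contiguous temporary buffer.
static uint32_t chain_total_bytes(const dma_desc_t *d)
{
    uint32_t total = 0;
    for (; d != NULL; d = d->next) {
        total += d->len;
    }
    return total;
}
```

A typical use is a packet whose header, payload, and trailer already live in three different buffers: three descriptors describe it in place, and the "stronger ownership rules" from the list above apply to every node, since the hardware may still be reading a descriptor while software wants to recycle it.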
Bare-metal register view
Vendor frameworks hide the register details, which is fine until you have to debug a setup that almost works. A generic DMA channel usually boils down to the same fields no matter how the vendor names them:
| Register concept | What it controls | Typical question |
|---|---|---|
| PAR / CPAR | Peripheral register address | Am I pointing at the real data register? |
| MAR / CMAR | Memory buffer address | Is this RAM region DMA-visible and aligned? |
| NDTR / CNDTR | Transfer item count | Am I counting items or bytes on this MCU? |
| CR / CCR | Direction, increment, mode, width, priority | Does the configuration match the actual data path? |
| ISR / IFCR | Status flags and flag clearing | Did transfer complete, half-complete, or error really fire? |
If you have to debug at register level, inspect the peripheral request path first, then the DMA channel state, then the memory region and cache story. It is usually one of those three.
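To make the table concrete, here is a minimal register-level setup sketch. The register names and bit positions follow the STM32F1-style DMA channel as an illustration; treat every constant as an assumption to check against your own reference manual. The struct is a plain local type here so the logic is host-testable, whereas real code would point it at the channel's memory-mapped base address.

```c
#include <stdint.h>

// STM32F1-style channel registers (layout shown for illustration only).
typedef struct {
    volatile uint32_t CCR;    // direction, increments, widths, mode, enable
    volatile uint32_t CNDTR;  // number of items left to transfer
    volatile uint32_t CPAR;   // peripheral data register address
    volatile uint32_t CMAR;   // memory buffer address
} dma_channel_t;

#define DMA_CCR_EN        (1u << 0)
#define DMA_CCR_TCIE      (1u << 1)   // transfer-complete interrupt
#define DMA_CCR_CIRC      (1u << 5)
#define DMA_CCR_MINC      (1u << 7)
#define DMA_CCR_PSIZE_16  (1u << 8)   // 16-bit peripheral width
#define DMA_CCR_MSIZE_16  (1u << 10)  // 16-bit memory width

// Peripheral-to-memory (DIR = 0), fixed peripheral address, incrementing
// memory address, 16-bit on both sides, circular, then enable last.
static void dma_setup_adc_stream(dma_channel_t *ch, uint32_t periphAddr,
                                 uint32_t memAddr, uint32_t count)
{
    ch->CCR   = 0;                     // disable while reconfiguring
    ch->CPAR  = periphAddr;            // the real data register, e.g. ADC DR
    ch->CMAR  = memAddr;               // a DMA-visible RAM buffer
    ch->CNDTR = count;                 // items, not bytes, on this family
    ch->CCR   = DMA_CCR_MINC | DMA_CCR_PSIZE_16 | DMA_CCR_MSIZE_16
              | DMA_CCR_CIRC | DMA_CCR_TCIE;
    ch->CCR  |= DMA_CCR_EN;            // enable only after everything else
}
```

Note the ordering: addresses and count are programmed while the channel is disabled, and the enable bit is set last, which mirrors the debug order below in reverse.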
Minimal debug order
- 1. Confirm the peripheral is actually generating a DMA request in the mode you selected.
- 2. Confirm the channel or stream is mapped to that request source.
- 3. Confirm the transfer counter decrements when the event should occur.
- 4. Confirm the destination RAM region is both valid and visible to the DMA engine.
- 5. Only then debug callbacks, framework wrappers, and higher-level application logic.
Failure patterns that burn the most time
| Symptom | Likely cause | Check first |
|---|---|---|
| Nothing moves at all | No effective DMA request path, missing clock, wrong mapping, or inactive peripheral-side request enable. | Clocks, DMAMUX or stream mapping, NVIC path if used, and whether the peripheral is really issuing requests. |
| The buffer fills with nonsense | Width mismatch, wrong increment setting, swapped direction, or incorrect buffer element type. | Peripheral width, memory width, source vs destination, and increment configuration. |
| Transfer works once, then dies | Normal mode was configured for a one-shot transfer, but software expected a continuous stream. | Selected mode, restart logic, and whether the peripheral request is re-armed after completion. |
| The CPU still sees stale data | Cache coherency issue or a buffer placed in RAM that the DMA engine cannot actually access. | RAM bank placement, cache maintenance rules, and whether the buffer is really DMA-visible on that MCU. |
| Circular mode still overruns | The processing side cannot keep up with the producer rate, so the system is overloaded despite DMA. | Time budget per half-buffer, callback execution time, and whether the consumer workload is too heavy. |
| DMA was not the right answer | The transfer is too small or infrequent for the added configuration and debug cost to pay off. | Whether polling or a simple interrupt path would solve the actual problem with less complexity. |
Practical rules of thumb
- Do not start with circular mode if your use case is actually a one-shot burst.
- Put the buffer type and transfer width under explicit review before chasing timing ghosts.
- Think in buffer ownership windows: when DMA owns it, when software owns it, and how that handoff is signaled.
- On cached MCUs, choose DMA-safe RAM and treat cache maintenance as part of the design, not a patch.
- Measure whether the consumer side can keep up with the producer side before declaring the DMA path done.
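The cache rule of thumb above has one mechanical detail worth pinning down: maintenance operations such as the CMSIS `SCB_InvalidateDCache_by_Addr()` on a Cortex-M7 work on whole cache lines, so the span you pass must be rounded to line boundaries. The helper below is just that address arithmetic, assuming a 32-byte line size:

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 32u   // Cortex-M7 D-cache line size

// Round an arbitrary buffer span out to whole cache lines before handing it
// to a clean/invalidate operation. Pure address arithmetic, host-testable.
static void cache_op_span(uintptr_t addr, size_t len,
                          uintptr_t *alignedAddr, size_t *alignedLen)
{
    uintptr_t start = addr & ~(uintptr_t)(CACHE_LINE - 1u);      // round down
    uintptr_t end   = (addr + len + CACHE_LINE - 1u)
                    & ~(uintptr_t)(CACHE_LINE - 1u);             // round up
    *alignedAddr = start;
    *alignedLen  = (size_t)(end - start);
}
```

The rounding also shows why unaligned DMA buffers are dangerous on cached cores: a buffer that shares a cache line with unrelated variables gets those neighbors invalidated too, so the sturdier design is to align DMA buffers to `CACHE_LINE` and pad their size in the first place.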
FAQ
- How does DMA work in a microcontroller?
- When should you use DMA instead of interrupts?
- What are common DMA mistakes in embedded systems?
- Does DMA make the peripheral itself faster?
- When should I use circular mode?
- Why does DMA update RAM but my code still sees old values?
- Is DMA always the best practice for throughput?
Related resources
- UART Serial Communication Explained: pair DMA with robust UART framing, baud budgeting, and receive-side error handling.
- SPI Interface Guide: understand the bus timing that makes SPI plus DMA such a common high-throughput pattern.
- I2S Interface Guide: see where circular DMA and half-buffer callbacks become essential in streaming audio paths.