
Eclypse-Z7: Bulk Transfer w/ DMA


connoisseur_de_mimi

Question

Hi,

I have already asked this question over in the Xilinx forums but have not found a solution yet, so I'm reposting it here:

I'm designing a Zynq-7020 based system where I occasionally need to transfer up to 64MB of data from the PL to the PS. Data is fed into a FIFO before being transmitted via DMA:

fifo-dma.png.1224b9ad48bf869eb2cb2d97bbeffc5d.png

s2mm_introut is connected to the Zynq's F2P input. Data is coming into the FIFO at ~1MS/s, the PL is clocked at 100MHz. tlast is asserted on the last sample that is fed into the FIFO:

fifo-tlast.thumb.png.14d1c81fef1213bbbd9993a4975334ce.png

 

I'm using the https://github.com/Xilinx/embeddedsw/tree/master/XilinxProcessorIPLib/drivers/axidma driver and have set up the DMA in simple mode (init and ISR below):

    int dma_init()
    {
    	XAxiDma_Config *Config = XAxiDma_LookupConfig(XPAR_AXIDMA_0_DEVICE_ID);
    	if (!Config)
    		return XST_FAILURE;
     
    	if (XAxiDma_CfgInitialize(&dma_, Config) != XST_SUCCESS)
    		return XST_FAILURE;
     
    //	memset(adcDMAArray, 0, sizeof(adcDMAArray));
    	Xil_DCacheFlushRange((UINTPTR)adcDMAArray, sizeof(adcDMAArray) / sizeof(adcDMAArray[0]));
     
    	if (XAxiDma_SimpleTransfer(&dma_, (UINTPTR)adcDMAArray, 0x3fff, XAXIDMA_DEVICE_TO_DMA))
    		return XST_FAILURE;
     
    	//disable all interrupts
    	XAxiDma_IntrDisable(&dma_, XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DEVICE_TO_DMA);
     
    	//enable Interrupt On Complete
    	XAxiDma_IntrEnable(&dma_, XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DEVICE_TO_DMA);
     
    	return XST_SUCCESS;
    }

 

    void dmaDoneCh1IRQHandler(void *callback)
    {
    	XAxiDma *dma = (XAxiDma *)callback;
     
    	/* Read pending interrupts */
    	uint32_t irqStatus = XAxiDma_IntrGetIrq(dma, XAXIDMA_DEVICE_TO_DMA);
     
    	/* Acknowledge pending interrupts */
    	XAxiDma_IntrAckIrq(dma, irqStatus, XAXIDMA_DEVICE_TO_DMA);
     
    	if (irqStatus & XAXIDMA_IRQ_IOC_MASK)
    	{
    		XAxiDma_SimpleTransfer(dma, (UINTPTR)adcDMAArray, 0x3fff, XAXIDMA_DEVICE_TO_DMA);
    	}
    }

When the PL transmits data, the data is received and the ISR is executed; irqStatus reports 0x00005000 (the IOC, i.e. transfer done, and error interrupt bits). Data reception stops after the first block is received, and subsequent transmissions do not happen at all (the data in adcDMAArray is not overwritten and the ISR is not called), even if I call the dma_init() function again.

 

S2MM registers:

s2mm.png.bba9e1062e0b83e0051b0ac558d04a37.png

which, if I interpret it correctly, decodes to:

  • DMA is running
  • DMA is halted
  • DMA internal error: "When Scatter Gather is disabled, this error is flagged if any error occurs during Memory write or if the incoming packet is bigger than what is specified in the DMA length register."

(from https://docs.xilinx.com/r/en-US/pg021_axi_dma/Stream-to-Memory-Map-Register-Detail)
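For anyone decoding these values by hand: below is a minimal host-side sketch, assuming the S2MM_DMASR bit layout as I read it in PG021 (Halted = bit 0, Idle = bit 1, DMAIntErr = bit 4, DMASlvErr = bit 5, DMADecErr = bit 6, IOC_Irq = bit 12, Dly_Irq = bit 13, Err_Irq = bit 14). The macro, struct and function names are made up for illustration and are not part of the Xilinx driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions assumed from PG021 (S2MM_DMASR); local names, not driver macros. */
#define SR_HALTED      (1u << 0)
#define SR_IDLE        (1u << 1)
#define SR_DMA_INT_ERR (1u << 4)
#define SR_DMA_SLV_ERR (1u << 5)
#define SR_DMA_DEC_ERR (1u << 6)
#define SR_IOC_IRQ     (1u << 12)
#define SR_DLY_IRQ     (1u << 13)
#define SR_ERR_IRQ     (1u << 14)

typedef struct {
    bool halted, idle;
    bool internal_err, slave_err, decode_err;
    bool ioc_irq, delay_irq, err_irq;
} dmasr_flags;

/* Split a raw status word into named flags. */
dmasr_flags decode_dmasr(uint32_t sr)
{
    dmasr_flags f;
    f.halted       = (sr & SR_HALTED) != 0;
    f.idle         = (sr & SR_IDLE) != 0;
    f.internal_err = (sr & SR_DMA_INT_ERR) != 0;
    f.slave_err    = (sr & SR_DMA_SLV_ERR) != 0;
    f.decode_err   = (sr & SR_DMA_DEC_ERR) != 0;
    f.ioc_irq      = (sr & SR_IOC_IRQ) != 0;
    f.delay_irq    = (sr & SR_DLY_IRQ) != 0;
    f.err_irq      = (sr & SR_ERR_IRQ) != 0;
    return f;
}

/* Usage: the interrupt status value reported by the ISR in this post. */
void check_isr_value(void)
{
    dmasr_flags f = decode_dmasr(0x00005000u);
    assert(f.ioc_irq && f.err_irq && !f.delay_irq);
}
```

With this layout, 0x00005000 decodes to "interrupt on complete" plus "error interrupt", which matches what the ISR reported.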

It seems the error occurs because the amount of data I need to transfer (200kB in this case, but may be up to 64MB) is greater than the maximum of ~128kB (0x3fff 4-byte elements) that was specified in the XAxiDma_SimpleTransfer call (0x3fff is the maximum that can be specified there).

 

1) Can anybody see what I am doing wrong?

2) If everything worked correctly, how would I know when / if the DMA has finished transmitting all data?

3) Do I need to use the SG engine for large transfers?

Edited by connoisseur_de_mimi

16 answers to this question


  • 0

Hi @connoisseur_de_mimi,

  1. Yes, going over the DMA buffer length register's size is a problem when doing SimpleTransfers. You can increase the max transfer length by adjusting the Width of Buffer Length Register setting in the IP configuration (see screenshot below). It can be increased up to 26 bits.
    • Transfer length is specified in bytes, not words. A 26-bit register holds up to 2^26 - 1 bytes (~67 MB, i.e. just under 64 MiB), which should barely fit the needed 64 MB. The default 14 bits only allows 16,383 bytes per transfer (about 4,095 four-byte samples).
    • Even if you need more data than fits in one frame, you can get away with repeated SimpleTransfers, given sufficient buffering in FIFOs.
  2. I've only worked with polled mode DMA, but you should be able to wait until XAxiDma_Busy stops returning a busy status.
  3. As noted, probably not if 64 MB is your max, but you would if you wanted to go much larger.
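The numbers in point 1 follow from the register width: a Buffer Length Register of W bits holds at most 2^W - 1 bytes. Here is a minimal sketch of that arithmetic, plus a helper that counts how many SimpleTransfer-sized pieces a larger capture would need. The function names are hypothetical, nothing here calls the real driver, and the 8 to 26 bit range is my reading of the IP configuration options.

```c
#include <assert.h>
#include <stdint.h>

/* Largest byte count a width_bits-wide buffer length register can hold. */
uint32_t max_transfer_bytes(unsigned width_bits)
{
    assert(width_bits >= 8 && width_bits <= 26); /* assumed configurable range */
    return (1u << width_bits) - 1u;
}

/* Same limit, rounded down to a whole number of 4-byte words. */
uint32_t max_word_aligned_bytes(unsigned width_bits)
{
    return max_transfer_bytes(width_bits) & ~3u;
}

/* Number of transfers needed for total_bytes, each at most chunk_bytes long. */
uint32_t transfers_needed(uint64_t total_bytes, uint32_t chunk_bytes)
{
    assert(chunk_bytes > 0);
    return (uint32_t)((total_bytes + chunk_bytes - 1u) / chunk_bytes);
}
```

For example, with the default 14-bit width a 200 KiB capture (204,800 bytes, assuming "200kB" means 200 x 1024) would take 13 word-aligned transfers, while with 26 bits a full 64 MiB is 4 bytes over the single-transfer limit and would need 2.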

A couple of extra notes:

  • Make sure tlast is only generated once per SimpleTransfer, on the right beat; sending one beforehand should stop the transfer partway through. Given this, make sure that the PS and PL are on the same page for the length of any particular transfer. A counter with a rollover value specified by the PS via AXI4-Lite that asserts tlast on rollover is one way to accomplish this. Always using the same frame length is another way.
  • This shouldn't be relevant at 1 MS/s, but more for anyone else reading who is using a faster input stream: Having a FIFO in front of the DMA is good. You should still make sure that the DMA has enough data bandwidth on both the AXI4-stream and AXI4-full master interfaces to sink your incoming stream, including the AXI protocol overhead. Increasing its clock frequencies above the base sample rate is a good idea. So is using a 64-bit wide memory map data width (to match the width of the Zynq PS's HP slave interfaces).
  • There's an under-construction demo for the Eclypse on the Reference site that uses scatter-gather to perform arbitrary-length transfers that might be useful to you. Apologies that the code is still pretty messy; I haven't made the time to clean it up and finish it. There might also still be a bug in one of the trigger detectors in the prerelease Vivado project archive. https://digilent.com/reference/programmable-logic/eclypse-z7/demos/ddr-streaming
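To illustrate the counter idea from the first note, here is a small software model (in C rather than HDL, purely to show the behaviour) of a beat counter that asserts tlast when it reaches a rollover value programmed by the PS. The struct and function names are hypothetical; the real thing would be a few lines of HDL in the PL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t count;     /* beats already sent in the current packet */
    uint32_t rollover;  /* index of the last beat, i.e. beats_per_packet - 1 */
} tlast_counter;

void tlast_counter_init(tlast_counter *c, uint32_t beats_per_packet)
{
    assert(beats_per_packet > 0);
    c->count = 0;
    c->rollover = beats_per_packet - 1u;
}

/* Call once per accepted beat (tvalid && tready).
 * Returns the tlast value that belongs to that beat. */
bool tlast_counter_beat(tlast_counter *c)
{
    bool tlast = (c->count == c->rollover);
    c->count = tlast ? 0u : c->count + 1u;
    return tlast;
}
```

With beats_per_packet set to 4, tlast is asserted on the 4th and 8th beats only, so the PS can request exactly 4 words per SimpleTransfer and both sides agree on the length.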

image.png

Thanks,

Arthur


  • 0

Hi,

I've updated the Width of Buffer Length Register in the AXI DMA block, but that alone has not solved the issue. However, I found out that XAxiDma_IntrGetIrq in the ISR always returns a DMA Internal Error flag if I try to transmit more than 83 elements (= 166 bytes). I suspected Micro DMA being enabled, but it isn't:

vivado_dma.thumb.png.37b30438dd7b6aa4cfd5dafb6436ab37.png

The depth of the FIFO in front of the DMA block is configured like this:

vivado_fifo.thumb.png.32f9677799a8b7e67723d7b932511968.png

 

I've also checked the code of XAxiDma_SimpleTransfer; the maximum transfer length of 0x3FFF seems to be hardcoded somewhere. But the error appearing even on very short transfers looks to me as if the issue may not be within the library.

 

This is the result after attempting to transmit 168 bytes:

grafik.png.0e75b2a6e102440304bb38ef21082af3.png

which decodes to:

MM2S_DMACR:

  • reserved bit set

MM2S_DMASR:

  • DMA Channel Idle
  • DMA Internal Error

MM2S_SA:

  • set to *something*

MM2S_LENGTH:

  • set to 0xFF (= 255), instead of the 168 I requested.

 

edit: on closer inspection, the MM2S_LENGTH register often does not reflect the number of bytes I requested (sometimes more, sometimes less) but if it reads 0xFF, the transfer fails as described above.

 

Edited by connoisseur_de_mimi

  • 0

This ought to generate XST_INVALID_PARAM errors, but you should also make sure that adcDMAArray is aligned to a 4-byte address boundary.

I'm looking more into this and managed to reproduce the issue:

Using a project I had sitting around for 2021.1 (main.c below), I seem to be running into the same issue when requesting larger numbers of bytes: the buffer length and buffer address registers appear to get masked to 8 bits. When requesting 4000 (0xFA0) bytes to a buffer address at 0x10c794, the DMA buffer address register (0x48) updates to 0x94 and the buffer length register (0x58) updates to 0xA0.

image.png

I haven't checked the block design and IP settings for my project again yet, but the width of buffer length is 26 and it has some custom counter hardware to generate test traffic sitting in front of the DMA - still clearly having some issues with the CR run bit being high while idle. Data isn't transferred past 0x102, meaning 1k bytes are transferred before halting. It almost seems like only the bottom 8 bits of each of these registers are wired up in hardware.
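The observed register values are exactly the low bytes of the requested ones, which is what prompts the 8-bit suspicion. A quick check of that arithmetic in plain C (the helper names are made up for this note):

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the lowest 8 bits of a 32-bit register value. */
uint32_t low_byte(uint32_t value)
{
    return value & 0xFFu;
}

/* The two values from this post. */
void check_observed_values(void)
{
    assert(low_byte(4000u) == 0xA0u);      /* requested length, 4000 = 0xFA0 */
    assert(low_byte(0x10c794u) == 0x94u);  /* requested buffer address */
}
```

This only shows that the displayed values are consistent with an 8-bit view of the registers; it says nothing about where the truncation happens.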

Will look into this more tomorrow.

Also, another approach to tackling issues with the DMA simpletransfer function (just doing the register accesses yourself) is detailed here: 

 

    #include "xparameters.h"
    #include "xaxidma.h"
    #include "xil_printf.h"
    #include "gpio.h"

    XAxiDma axi_dma;
    const int axi_dma_id = XPAR_AXI_DMA_0_DEVICE_ID;

    #define num_gpio XPAR_XGPIO_NUM_INSTANCES

    const reg reset_bit         = {XPAR_AXIS_TRAFFIC_GENERATOR_AXI_GPIO0_DEVICE_ID, 1, 0, 1};
    const reg start_bit         = {XPAR_AXIS_TRAFFIC_GENERATOR_AXI_GPIO0_DEVICE_ID, 1, 1, 1};
    const reg busy_bit          = {XPAR_AXIS_TRAFFIC_GENERATOR_AXI_GPIO0_DEVICE_ID, 1, 2, 1};
    const reg high_count        = {XPAR_AXIS_TRAFFIC_GENERATOR_AXI_GPIO0_DEVICE_ID, 2, 0, 32};
    const reg packet_high_count = {XPAR_AXIS_TRAFFIC_GENERATOR_AXI_GPIO1_DEVICE_ID, 1, 0, 32};

    #define RECV_BUFFER_SIZE 1024
    u32 recv_buffer[RECV_BUFFER_SIZE];

    void DmaInitialize () {
        XAxiDma_Config *cfgptr;
        cfgptr = XAxiDma_LookupConfig(axi_dma_id);
        XAxiDma_CfgInitialize(&axi_dma, cfgptr);

        XAxiDma_Reset(&axi_dma);
        XAxiDma_IntrDisable(&axi_dma, XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DEVICE_TO_DMA);
        XAxiDma_IntrDisable(&axi_dma, XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DMA_TO_DEVICE);
    }

    void TrafficGenInitialize () {
        GpioInitialize();
        RegWrite(reset_bit, 1);
        RegWrite(reset_bit, 0);
    //  RegWrite(start_bit, 1);
    //  RegWrite(start_bit, 0);
    //  RegRead(busy_bit, 1);
    }

    u32 TransferDone () {
        return !XAxiDma_Busy(&axi_dma, XAXIDMA_DEVICE_TO_DMA);
    }

    void ReceiveData (u32 words_per_packet, u32 packets) {
        UINTPTR buf_addr = (UINTPTR)recv_buffer;
        RegWrite(high_count, words_per_packet - 1);
        RegWrite(packet_high_count, packets - 1);
        Xil_DCacheFlushRange(buf_addr, RECV_BUFFER_SIZE);
        XAxiDma_SimpleTransfer(&axi_dma, buf_addr, words_per_packet * sizeof(u32), XAXIDMA_DEVICE_TO_DMA);
        RegWrite(start_bit, 1);
        RegWrite(start_bit, 0);
        while (!TransferDone());
        for (int i = 1; i < packets; i++) {
            u32 offset = i * words_per_packet * sizeof(u32);
            XStatus status = XAxiDma_SimpleTransfer(&axi_dma, buf_addr + offset, words_per_packet * sizeof(u32), XAXIDMA_DEVICE_TO_DMA);
            if (status != XST_SUCCESS) {
                xil_printf("simple transfer error\r\n");
            }
            while (!TransferDone());
        }
        Xil_DCacheInvalidateRange(buf_addr, RECV_BUFFER_SIZE);
    }

    void ValidateBuffer () {
        for (u32 i = 0; i < RECV_BUFFER_SIZE; i++) {
            u32 d = recv_buffer[i];
            xil_printf("%08x\r\n", d);
        }
    }

    int main() {
        DmaInitialize();
        TrafficGenInitialize();

        ReceiveData(1000, 2);

        ValidateBuffer();

        xil_printf("hello world\r\n");
    }

 


  • 0
7 hours ago, artvvb said:

It almost seems like only the bottom 8 bits of each of these registers are wired up in hardware.

When I use direct register access

    uint32_t volatile * const ptr_len = (uint32_t volatile *)(dma->RxBdRing[0].ChanBase + XAXIDMA_BUFFLEN_OFFSET);
    *ptr_len = (uint32_t)0x3FFF;

the memory monitor still only shows the lowest byte written:

grafik.png.57be06114c7d1d8e8fc09f5eb8b40018.png

However, if I read the register back and print it

    uint32_t volatile * const ptr_len = (uint32_t volatile *)(dma->RxBdRing[0].ChanBase + XAXIDMA_BUFFLEN_OFFSET);
    uint32_t result = *ptr_len;
    SCPI_ResultUInt32(context, result);

the correct value is returned in both cases (using XAxiDma_SimpleTransfer or direct register access).

Let's see if Vitis 2022.2 has this issue resolved...

edit: no, same issue.

 

edit:

7 hours ago, artvvb said:

you should also make sure that adcDMAArray is aligned to a 4-byte address boundary.

It is, but only by chance. This declaration

    int32_t adcDMAArray[DMA_BUF_SIZE] __attribute__((aligned(4)));

should force correct alignment, right?

Edited by connoisseur_de_mimi

  • 0
Quote

should force correct alignment, right?

Yes, but 32-bit int should be aligned by default, the suggestion was made assuming that you might be using a u16 declaration.

 

After increasing the receive buffer size and validating the data a bit more, two successive transfers of 1,000 four-byte samples each are now working. It also worked with 10,000-sample SimpleTransfers.

main.c

 

Here are my DMA settings:

image.png


  • 0

Can you share your Vivado project? I have integrated your code into my project and changed the configuration of the DMA IP, but I still see my initial problem: the DMA Error bit is set and subsequent transfers are not executed.

I guess I could just clear the DMA error after every transfer but that doesn't feel right... would rather find out how to fix the actual error.


  • 0

Thanks for the code. I played with it a bit and it seems to work most of the time.

If I choose a combination of words_per_packet and packets that results in more than 25000 words being transferred, some buffer items are not written and ValidateBuffer() reports errors. XAxiDma_SimpleTransfer() always returns XST_SUCCESS though, and the status register has no error flag set.

 


  • 0

I changed your project to better reflect my design. It is working correctly, except for the issue with transmissions longer than 25000 words (buffer size = 100000 u32's). If I double the buffer size, the issue appears with transmissions longer than 50k words. I don't quite understand why though; the buffer is much longer than the amount of data transferred.

I can't attach the .zip file (too big), so here's a Google Drive link: https://drive.google.com/file/d/1hyeEnv1iQgdiS3OBkP5wZk4dJ5RZjMwE/view?usp=share_link

I'm still trying to figure out what's different between this project and the one I'm working on. Using an ILA I can see that the DMA is in fact writing data to the AXI bus, incrementing the AWADDR field as it goes, sending 256 elements per burst:

axidma.thumb.png.63441e01476595fc5c400befda844782.png


  • 0

I tried HW design and SW code from @artvvb (big thanks, Arthur, for both!) in Vivado and Vitis 2023.1

I was able to successfully transfer exactly 1,805,843 words (i.e., 6.89 MB of data) by DMA.
After the 1,805,843rd word, 16,248 zeroes follow, and then the correct values continue.

The strange thing is that I observed the same behavior even when splitting the data transfer into multiple calls of XAxiDma_SimpleTransfer(), e.g., by setting words_per_packet=500000 and packets=4 in main().
Calling XAxiDma_SimpleTransfer() once or multiple times always correctly transfers only 1,805,843 words.

I can only speculate that this is a bug in AXI DMA IP.


  • 0

Hi @Viktor Nikolov

500000 should be fine for a packet length - it fits within the 26-bit max of the DMA. I imagine you're also increasing the RECV_BUFFER_SIZE, since it doesn't sound like it's reporting errors - adding a return to the check in main would help debug:

Quote

    if (words_per_packet * packets > RECV_BUFFER_SIZE) {
        xil_printf("error: receive buffer too small\r\n");
        return 1;
    }

With fresh eyes, there's a bug in the code where the cache is handled - the ranges should be RECV_BUFFER_SIZE * sizeof(u32) bytes, rather than RECV_BUFFER_SIZE words. This would be obscured unless large enough values for the packet length and count are tried...
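The byte/word mix-up also lines up with the thresholds reported earlier in the thread: if the element count is passed where a byte count is expected, only the first quarter of a u32 buffer gets cache maintenance. A minimal sketch of that arithmetic, with hypothetical helper names and no calls into the Xilinx cache API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Correct length argument: bytes occupied by n_words 32-bit words. */
size_t cache_range_bytes(size_t n_words)
{
    return n_words * sizeof(uint32_t);
}

/* Words actually covered when n_words is mistakenly passed as a byte length. */
size_t words_covered_by_mistake(size_t n_words)
{
    return n_words / sizeof(uint32_t);
}

/* Buffer sizes mentioned earlier in the thread. */
void check_thread_numbers(void)
{
    assert(words_covered_by_mistake(100000) == 25000);
    assert(words_covered_by_mistake(200000) == 50000);
}
```

A 100000-word buffer is then only covered up to word 25000, and a 200000-word buffer up to word 50000, which are the points where the corrupted data was reported to start.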

Thanks,

Arthur


  • 0

After applying that fix, I'm also seeing the last five bytes of the last packet corrupted and need to look into it some more, at least with RECV_BUFFER_SIZE 200000, words_per_packet 50000, packets 4. Updated main.c is attached.

image.png

main.c


  • 0

I did the test with recv_buffer able to hold up to 4M words (still way below the theoretical limit of the DMA 26-bit max.).

When I tried a single packet of words_per_packet==2,000,000 , only the first 1,805,843 words were transferred correctly. Then 16,248 zeroes followed, and then the correct values continued (although I did not check the very end of the 2M buffer; it's possible it was corrupted the way you experienced).

I was surprised to see the very same behavior with packets==2, words_per_packet==1,000,000 , and with packets==4, words_per_packet==500,000.

With packets==1, words_per_packet==1,805,843 , everything was transferred correctly. 

I did this test as part of my research for a project in which I want to try sampling a long series of data from the Zynq's 1 Msps XADC. It seems 1.8 seconds is the limit when using the Xilinx AXI DMA IP.


  • 0

Below are my current results for two runs using the numbers you suggest. This is with an Eclypse Z7, using the hardware project previously linked (upgraded IP to 2023.1, rebuilt, and reexported hardware), and the main.c file attached below (modified slightly from the previous version to log the sizes used). I used Run As -> Launch Hardware to ensure that debug breakpoints wouldn't affect timing. There's clearly something different between our setups, but I'm unclear on what it could be. I've also exported my workspace in case it shows some kind of difference, or if it's somehow caused by an issue with the physical hardware.

main.c vitis_export_archive.ide.zip

image.png


  • 0

It was all about Data Cache invalidation calls!

In the main.c attached, I demonstrate the successful transfer of 16,777,215 32-bit words (4 bytes short of 64 MB) as a single packet, which is the largest possible DMA transfer by the AXI DMA IP (because the max. width of the Buffer Length Register is 26 bits).
This main.c is intended for the Vitis 2023.1 workspace @artvvb shared in the previous post.

First, I fixed in the code the bug Arthur mentioned on 24 April 2024: Data Cache invalidation calls have to be made with length RECV_BUFFER_SIZE * sizeof(u32).

After this fix, I saw the same behavior as Arthur. The last 5 values in the received buffer were zeroes.
Edit: This happens for DMA transfers of all lengths (I tested it down to 1,000) when RECV_BUFFER_SIZE==words_per_packet. The Xil_DCacheFlushRange() and Xil_DCacheInvalidateRange() apparently "forget" to work on the last 5 words of the memory range.

Then I added an additional 16 bytes to the data length passed to the Data Cache functions in the function ReceiveData():

    Xil_DCacheFlushRange( buf_addr,      RECV_BUFFER_SIZE * sizeof(u32) + 16 );
    ...
    Xil_DCacheInvalidateRange( buf_addr, RECV_BUFFER_SIZE * sizeof(u32) + 16 );

That did the trick. With the cache invalidated with an extra 16 bytes, the DMA transfer works as documented.

Edit: This seems to be an undocumented feature of Xil_DCacheFlushRange() and Xil_DCacheInvalidateRange(). When I read the comments in the source code of xil_cache.c, I get the impression that the problem is caused by the end of the buffer not being aligned with a cache line. This forum post from @asmi confirms it.
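If the cause is indeed a buffer end that does not fall on a cache-line boundary, an alternative to the fixed +16 is to round the maintained range outward to whole cache lines. Below is a minimal sketch of that rounding, assuming the 32-byte L1 data cache line of the Zynq-7000's Cortex-A9; the type and function names are hypothetical and the result would still have to be passed to the Xilinx cache functions by the caller.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 32u  /* assumed Cortex-A9 L1 D-cache line size in bytes */

typedef struct {
    uintptr_t start;  /* buffer start rounded down to a cache-line boundary */
    uint32_t  len;    /* multiple of CACHE_LINE that covers the whole buffer */
} cache_range;

/* Expand [addr, addr + len) outward so it starts and ends on cache lines. */
cache_range cache_align_range(uintptr_t addr, uint32_t len)
{
    assert(len > 0);
    uintptr_t mask  = ~(uintptr_t)(CACHE_LINE - 1u);
    uintptr_t start = addr & mask;
    uintptr_t end   = (addr + len + CACHE_LINE - 1u) & mask;
    cache_range r = { start, (uint32_t)(end - start) };
    return r;
}
```

Note that widening the range this way also touches whatever else shares the first and last cache line, so it is probably safer to additionally align the buffer itself to 32 bytes (e.g. with __attribute__((aligned(32)))) and pad its size to a multiple of 32.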

 

main.c

Edited by Viktor Nikolov
added link to other forum post

  • 0

If anybody is interested:

When the width of the AXI Stream going into AXI DMA IP is reduced to 16-bit, the AXI DMA handles up to 33,554,431 data samples in a single call to XAxiDma_SimpleTransfer().

I tested this using the attached Vivado 2023.1 HW project and the Vitis 2023.1 workspace, which is attached as an export as well.
The HW project is for Cora Z7-7S.

I implemented a very simple 16-bit AXI Stream generator in Verilog (see stream_generator.v).
In my test, it generated an AXI Stream at the data rate of 10 Msps, which means the whole DMA transfer takes 3.36 seconds.
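Both sample counts quoted in this thread follow from the same byte limit: with a 26-bit Buffer Length Register a transfer can be at most 2^26 - 1 bytes, so the maximum sample count depends only on the sample width. A minimal sketch of that arithmetic (the function names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Whole samples that fit into one transfer for a given length register width. */
uint32_t max_samples_per_transfer(unsigned length_bits, unsigned bytes_per_sample)
{
    assert(length_bits >= 8 && length_bits <= 26); /* assumed configurable range */
    assert(bytes_per_sample > 0);
    uint32_t max_bytes = (1u << length_bits) - 1u;
    return max_bytes / bytes_per_sample;
}

/* Time needed to capture that many samples at a constant sample rate. */
double capture_seconds(uint32_t samples, double samples_per_second)
{
    assert(samples_per_second > 0.0);
    return (double)samples / samples_per_second;
}
```

With 4-byte samples this gives 16,777,215 samples per transfer, with 2-byte samples 33,554,431, and at 10 Msps the latter takes about 3.36 seconds.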

Generator_DMA_test_hw.zip vitis_export_archive.ide.zip main.cpp

Edited by Viktor Nikolov
