AXI DMA Help on Cora Z7-10


RyanW

Question

Hello everyone, I have been trying to get a simple AXI DMA transfer from the PL to the PS working on my Cora Z7-10 for a while now. I have followed many tutorials and guides, and for some reason I'm just not getting any results. I'm really hoping someone here can help me out with this, as I have been stuck on it for a long time.

The C program seems to get stuck waiting on XAxiDma_Busy after I call XAxiDma_SimpleTransfer(&AxiDma, (UINTPTR) StreamBuffer, 4, XAXIDMA_DEVICE_TO_DMA). All other calls setting up the PL DMA engine seem to return as successes.

I have an arbitrary stream of data being generated by an AXI stream module that just counts up 1 from 0 every transfer. I'm going to input a lot of pictures here in hopes that it might help anyone who wants to take a stab at helping me here. My data generator has 32 bit output and counts up to 31 from 0. I have wondered if there was a problem with tlast, in how the DMA engine considers packets, so I tried using tlast at the end of the 32 word stream and I also tried tying it high.

 

Block_Diagram.thumb.png.8a16d7a6735b61e4502f99bdca48376c.png

Above is my block diagram for this system. The data generator streams to a FIFO which then streams to the AXI DMA and that's about it. I have the sys_clock coming in at 125MHz which enters the clock wizard and comes out at 100MHz.

 

Here is the data_gen sim with tlast. (This is the module on its own, not connected in the block diagram; however, I have simulated both of these designs hooked up to a stream data FIFO and they passed the data through just fine, so I don't think it's my handshaking, but I'm not ruling out the possibility that I screwed up another part of the streaming protocol.)

data_gen_tlast.thumb.png.993820fc858a9b64dff10d68484966bf.png

I also tried tying tlast high for the whole stream as well in the full implementation.

The configuration I have for the DMA is fairly stripped down and here is the way I configured it in Vivado.

DMA_config.png.ef6a223aa62a96e52b25c7df3b648f4a.png

 

The code I have is fairly straightforward. I look up the config, which returns success, as do all the other cases. It gets stuck in the loop checking whether the DMA is still busy, and from the debugger I can see that no data was ever transferred into the DMA. I also used to have a print statement in the wait loop to see if any of the values changed in the StreamBuffer array.

#include <stdio.h>
#include "platform.h"
#include "xil_printf.h"
#include "xaxidma.h"

#define DMA_DEV_ID  XPAR_AXIDMA_0_DEVICE_ID

int main()
{
    init_platform();
    xil_printf("\n\r");
    xil_printf("AXI DMA Self Test\n\r");


    XAxiDma AxiDma;
    XAxiDma_Config *CfgPtr;
    int Status = XST_SUCCESS;

    CfgPtr = XAxiDma_LookupConfig(DMA_DEV_ID);
    if (!CfgPtr) {
        xil_printf("Case 1: Failure\n\r");
    } else {
        xil_printf("Case 1: Success\n\r");
    }

    Status = XAxiDma_CfgInitialize(&AxiDma, CfgPtr);
    if (Status != XST_SUCCESS) {
        xil_printf("Case 2: Failure\n\r");
    } else {
        xil_printf("Case 2: Success\n\r");
    }

    Status = XAxiDma_Selftest(&AxiDma);
    if (Status != XST_SUCCESS) {
        xil_printf("Case 3: Failure\n\r");
    } else {
        xil_printf("Case 3: Success\n\r");
    }

    XAxiDma_IntrDisable(&AxiDma, XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DEVICE_TO_DMA);
    XAxiDma_IntrDisable(&AxiDma, XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DMA_TO_DEVICE);

    xil_printf( "HasStsCntrlStrm : %u\n\r"
                "HasMm2S         : %u\n\r"
                "HasMm2SDRE      : %u\n\r"
                "Mm2SDataWidth   : %u\n\r"
                "HasS2Mm         : %u\n\r"
                "HasS2MmDRE      : %u\n\r"
                "S2MmDataWidth   : %u\n\r"
                "HasSg           : %u\n\r"
                "Mm2sNumChannels : %u\n\r"
                "S2MmNumChannels : %u\n\r"
                "Mm2SBurstSize   : %u\n\r"
                "S2MmBurstSize   : %u\n\r"
                "MicroDmaMode    : %u\n\r"
                "AddrWidth       : %u\n\r"
                "SgLengthWidth   : %u\n\r",
                CfgPtr->HasStsCntrlStrm,
                CfgPtr->HasMm2S,
                CfgPtr->HasMm2SDRE,
                CfgPtr->Mm2SDataWidth,
                CfgPtr->HasS2Mm,
                CfgPtr->HasS2MmDRE,
                CfgPtr->S2MmDataWidth,
                CfgPtr->HasSg,
                CfgPtr->Mm2sNumChannels,
                CfgPtr->S2MmNumChannels,
                CfgPtr->Mm2SBurstSize,
                CfgPtr->S2MmBurstSize,
                CfgPtr->MicroDmaMode,
                CfgPtr->AddrWidth,
                CfgPtr->SgLengthWidth
                );

    xil_printf("AXIDMA HasSg: 0x%08x\n\r", AxiDma.HasSg);


    //--------------------------------------------------------

    volatile u32 StreamBuffer[256];
    for(int i = 0; i < 256; i++) {
        StreamBuffer[i] = 0;
    }

    while(!XAxiDma_ResetIsDone(&AxiDma)) {}

    Status = XAxiDma_SimpleTransfer(&AxiDma, (UINTPTR) StreamBuffer, 4, XAXIDMA_DEVICE_TO_DMA);
    if (Status != XST_SUCCESS) {
        xil_printf("Case 4: Failure\n\r");
    } else {
        xil_printf("Case 4: Success\n\r");
    }

    int DMA_Busy_DevToDMA = 1;
    int DMA_Busy_DMAToDev = 1;
    while(DMA_Busy_DevToDMA || DMA_Busy_DMAToDev) {
        //Wait
        //xil_printf("Waiting\n\r");
        DMA_Busy_DevToDMA = XAxiDma_Busy(&AxiDma,XAXIDMA_DEVICE_TO_DMA);
        DMA_Busy_DMAToDev = XAxiDma_Busy(&AxiDma,XAXIDMA_DMA_TO_DEVICE);
    }

    for(int i = 0; i < 100000; i ++) { }

    xil_printf("DMA StreamBuffer Test Data\n\r");
    for(int i = 0; i < 16; i++) {
        xil_printf("0x%08x: %d\n\r", &StreamBuffer[i], StreamBuffer[i]);
    }


    xil_printf("Successfully ran AxiDMASelfTest Example\r\n");
    cleanup_platform();
    return 0;
}

Here is the serial output I get showing that it gets stuck waiting forever for data to transfer and never transfers anything.

Case 1 is the success return of CfgPtr = XAxiDma_LookupConfig(DMA_DEV_ID);

Case 2 is the success return of Status = XAxiDma_CfgInitialize(&AxiDma, CfgPtr);

Case 3 is the success return of Status = XAxiDma_Selftest(&AxiDma);

Case 4 is the success return of Status = XAxiDma_SimpleTransfer(&AxiDma, (UINTPTR) StreamBuffer, 4, XAXIDMA_DEVICE_TO_DMA);

I also print out the configuration pointer data here.

serial_output.png.face3e38e9f789adaaf6e39022605462.png

I have determined that the value on the S_AXIS_S2MM_tdata bus has gotten to 31, so it makes me think there is some form of transfer going on there, but I can't figure out why I don't see any values in the stream buffer still.

 

I have tried directly using Xilinx's examples from their website and followed multiple tutorials in the same way the presenter did them. I also imported the examples from the drivers in Vitis and changed the DDR base address in them to fit my board, using the correct address as defined in xparameters.h. One of the more recent tutorials I did was with this video below. The same configuration on my end with the same code still seems to get stuck (this time I can't even tell where, as the debugger jumps around in a way that makes no sense). No matter which avenue I go down, it seems like I just can't get the DMA to work, which seems crazy to me.

 

Is there anyone out there who has experienced these difficulties with the AXI DMA engine before? I just can't seem to figure out what's going wrong here despite a couple of months of trying many, many different things. For anyone who has bothered to read this far down in the post: you're a hero.

Edited by RyanW

18 answers to this question


  • 1

One thing to be careful with is that the S2MM channel (input to the PS) must be set up first, before the MM2S channel. If you try streaming data into the DMA before the channel is set up, it will lock up. I found this mistake in Xilinx's own example code. I would look closely at the DMA driver code, as it could easily have bugs in it. From my experience with both STM and Xilinx, I would say that driver code is often written by inexperienced programmers and not necessarily properly tested. Read the AXI DMA data sheet as well. Using the DMA in basic mode is really simple and requires just 2 register writes per channel. Use this sequence:

1) Set the run bit ( only need to do this once )

2) Wait for DMA idle ( only after the first transfer )

3) write the S2MM address and transfer length (bytes)

4) write the MM2S address and transfer length
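
As a rough sketch of that sequence in C: the register offsets and bit positions below are the direct register mode ones from the AXI DMA product guide (PG021) as I read it, and the `dma_*` helper names are made up for illustration, not part of any Xilinx driver. `regs` is assumed to point at the DMA's AXI-Lite base address; taking it as a parameter means the logic can also be exercised against a plain array off-target.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Byte offsets from the DMA's AXI-Lite base address (direct register mode). */
#define MM2S_DMACR   0x00u
#define MM2S_DMASR   0x04u
#define MM2S_SA      0x18u
#define MM2S_LENGTH  0x28u
#define S2MM_DMACR   0x30u
#define S2MM_DMASR   0x34u
#define S2MM_DA      0x48u
#define S2MM_LENGTH  0x58u

#define DMACR_RS     (1u << 0)   /* run/stop bit in either control register */
#define DMASR_IDLE   (1u << 1)   /* set once the current transfer has completed */

/* Hypothetical helpers: 32-bit register write/read at a byte offset. */
static inline void dma_wr(volatile uint32_t *regs, uint32_t off, uint32_t val)
{
    regs[off / 4u] = val;
}

static inline uint32_t dma_rd(volatile uint32_t *regs, uint32_t off)
{
    return regs[off / 4u];
}

/* Step 1: set the run bit of one channel (cr_off is MM2S_DMACR or S2MM_DMACR). */
static inline void dma_start(volatile uint32_t *regs, uint32_t cr_off)
{
    dma_wr(regs, cr_off, dma_rd(regs, cr_off) | DMACR_RS);
}

/* Step 2: idle check to poll between transfers (sr_off is MM2S_DMASR or S2MM_DMASR). */
static inline bool dma_is_idle(volatile uint32_t *regs, uint32_t sr_off)
{
    return (dma_rd(regs, sr_off) & DMASR_IDLE) != 0u;
}

/* Step 3: arm the receive channel. The length write goes last because it launches the transfer. */
static inline void dma_s2mm_begin(volatile uint32_t *regs, uint32_t dest, uint32_t nbytes)
{
    assert(nbytes > 0u);
    dma_wr(regs, S2MM_DA, dest);
    dma_wr(regs, S2MM_LENGTH, nbytes);
}

/* Step 4: arm the transmit channel, again writing the length last. */
static inline void dma_mm2s_begin(volatile uint32_t *regs, uint32_t src, uint32_t nbytes)
{
    assert(nbytes > 0u);
    dma_wr(regs, MM2S_SA, src);
    dma_wr(regs, MM2S_LENGTH, nbytes);
}
```

On the board, `regs` would be the DMA base address from xparameters.h cast to `volatile uint32_t *`. After the first transfer you would spin on `dma_is_idle(regs, S2MM_DMASR)` before writing the next address and length.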

I use an AXI DMA to transfer 64 32-bit words to 8 different IPs by separately multiplexing tvalid. tready is tied high, as I designed the IP to always be ready. The DMA has internal FIFOs and has no trouble transferring all 64 words in a single burst of 64 clock cycles. I use the OCM, which has been set to non-cacheable, as buffers for the DMA transfer.

Edited by Richm

  • 1

Hi @RyanW

When receiving, the DMA can stall if it is expecting the wrong number of bytes, as you allude to in referring to potential issues with tlast. As such, if you are raising tlast for the 32nd word of the transfer, the transfer length passed to the simple transfer call should be 128. Refer to the Programming Sequence / Direct Register Mode (Simple DMA) section of the AXI DMA Product Guide, p70-72.
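
To spell out the arithmetic, here is a small sketch. The helper names are hypothetical, and the 8 to 26 bit range for the buffer length register width is my reading of the DMA configuration options, so treat it as an assumption. The length passed to the transfer call is in bytes, so 32 beats of 32-bit data come to 128, not the 4 used in the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes in one packet: number of stream beats times the tdata width in bytes. */
static inline uint32_t packet_bytes(uint32_t beats, uint32_t tdata_bits)
{
    assert(tdata_bits % 8u == 0u);
    return beats * (tdata_bits / 8u);
}

/* Largest byte count that fits in a length register with len_width valid bits.
 * The 8..26 range is an assumption taken from the IP configuration dialog. */
static inline uint32_t max_transfer_bytes(uint32_t len_width)
{
    assert(len_width >= 8u && len_width <= 26u);
    return (1u << len_width) - 1u;
}
```

For the generator described above, `packet_bytes(32, 32)` gives 128, and that value should also be checked against `max_transfer_bytes()` for whatever buffer length width the DMA was configured with.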

It could also be helpful to add an ILA to your hardware design, so that the Vivado hardware manager can be used as a logic analyzer, letting you observe the axi stream while the bitstream is running on the board. There's a System ILA IP in Vivado that can be used to do this.

Also, once this problem is resolved, you will probably need to use xil_cache.h functions to flush and invalidate the cache before and after the transfer, respectively.

Thanks,

Arthur


  • 1

Is the "Data Gen" block a known good AXI/DMA block, or something that is under development?

If it is under development, the story below may be relevant.

I have implemented PL-PS on various Digilent boards and did have to use ILA to get the signalling right.

My problems were on the PL side, specifically with the Valid and Ready signals. Essentially, the Slave can deassert Ready at almost any time, including while the Master is clocking data, so the Master needs to test Ready after clocking and may need to re-clock the same data multiple times, or else the transfer will come up short on data. The diagram below is something I found, and similar diagrams are in the Xilinx documents. See cases (3) and (5) below, where D0 and D3 are clocked multiple times:

image.thumb.png.d999513f189b36572908b83f08a08bb9.png

For my work, the Ready being deasserted situation was not a problem with small packets, but became a barrier for larger packets, maybe related to FIFO size.
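
A toy cycle-by-cycle model of that rule, written in C rather than HDL so it can be run anywhere: the master keeps presenting the same word until a clock edge where both valid and ready are high, and only then moves on. The `axis_*` names and the repeating ready pattern are invented for this sketch; it is not the data generator from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Master-side state: which word is currently being driven on tdata. */
typedef struct {
    const uint32_t *words;
    size_t count;
    size_t next;   /* index of the word currently presented */
} axis_master;

static inline bool axis_tvalid(const axis_master *m) { return m->next < m->count; }
static inline uint32_t axis_tdata(const axis_master *m) { return m->words[m->next]; }
static inline bool axis_tlast(const axis_master *m) { return m->next + 1u == m->count; }

/* One clock edge: the master advances only when the beat was accepted. */
static inline void axis_clock(axis_master *m, bool tready)
{
    if (axis_tvalid(m) && tready)
        m->next++;
}

/* Run the master against a slave whose tready follows a repeating pattern.
 * out must have room for m->count words. Returns the number of beats received. */
static inline size_t axis_run(axis_master *m, const bool *ready, size_t ready_len,
                              uint32_t *out, size_t max_cycles)
{
    size_t got = 0;
    assert(ready_len > 0u);
    for (size_t cyc = 0; cyc < max_cycles && axis_tvalid(m); cyc++) {
        bool tready = ready[cyc % ready_len];
        if (tready)
            out[got++] = axis_tdata(m);   /* slave samples when valid && ready */
        axis_clock(m, tready);            /* otherwise the same word is held */
    }
    return got;
}
```

A master that instead incremented on every clock while valid was high would drop words whenever ready was low, which is the "comes up short on data" symptom described above.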

Dave

 


  • 0
On 4/20/2022 at 2:28 PM, aadgl said:

Is the "Data Gen" block a known good AXI/DMA block, or something that is under development?

The data_gen is something that is under development. I just needed to test how to get arbitrary data out of the PL faster than with AXI GPIO. I thought I had already come across this problem a week ago, when I found out that my valid and data outputs didn't line up well if ready was de-asserted. I thought I had fixed that issue and the simulation seemed to show so, but I could be largely misinterpreting how the AXI stream interface works. I am very new to AXI, and it has been giving me lots of trouble ever since I got into it.

I have now created a new data_gen that I think adheres to the AXI Stream rules much better. I have an extensive testbench for a lot of cases including broken up data beats.

 

On 4/20/2022 at 1:07 PM, artvvb said:

When receiving, the DMA can stall if it is expecting the wrong number of bytes, as you allude to in referring to potential issues with tlast. As such, if you are raising tlast for the 32nd word of the transfer, the transfer length passed to the simple transfer call should be 128.

This makes a lot of sense now that you say it. I changed my code to reflect this and I also invalidated the cache before the transfer (would declaring the buffer volatile, as I had it, not already do this? I figured that's what volatile did to some degree, but I went ahead and flushed the cache anyway).

 

Along with better transfer parameters, cache invalidation, and a new data_gen, I was actually able to populate all 32 words from the PL into DDR RAM; however, the DMA engine would still hang and never generate an interrupt, assert the halted bit, or de-assert the idle bit in the S2MM_DMASR. I tried using the ILA to capture what was going on and this is what I found. One thing I also found peculiar is that when I called XAxiDma_Busy even before the transfer, both directions still registered as busy; however, the actual status register shows the channel as halted with run/stop = 0. When the transfer starts these bits are flipped to indicate that it is still running, but it goes on forever.

ILA_results2_begin.thumb.png.6771f2d58b4a742a15bcb8798f987d55.png

ILA_results2.thumb.png.f5e9b76b697ff319f1965ce54404537c.png

The first 4 words come extremely early, so I couldn't capture them in the same waveform and had to re-run the program for the last part. It seems to play out correctly and the same to how I simulated it within the new testbench I made for the data_gen.

Thank you both for the great help already. I feel like I actually made some progress on this for once. What can I do to get the DMA engine to stop hanging on this transfer? I'm guessing this might have to do with tlast again, but it seems like it's signaling at the right time.

Edited by RyanW

  • 0
On 4/23/2022 at 5:50 AM, Richm said:

1) Set the run bit ( only need to do this once )

2) Wait for DMA idle ( only after the first transfer )

3) write the S2MM address and transfer length (bytes)

4) write the MM2S address and transfer length

Thank you. I took the advice and wrote my own drivers for this kind of thing. Perhaps I just didn't understand how to use the Xilinx provided ones, but direct transfer mode is fairly simple when you lay it out like that. I know I had read the programming sequence in the docs, but I figured it was just handled in the simple transfer function which wouldn't allow consecutive transfers as it checks if the DMA has been started before already. I had thought that the DMA would de-assert back to a halted state, but seems this is not the case.
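
To make the halted versus idle distinction concrete, here is a small decoder for a DMASR value. The bit positions are from the PG021 status register description as I read it, and the enum and function names are just for illustration. A channel that has never been started reads Halted = 1 with Idle = 0, so a busy check that only tests the Idle bit (which, as far as I can tell, is what XAxiDma_Busy does) reports it as busy. If that is right, it would explain why both directions looked busy before any transfer, and why a wait loop that also polls a channel that was never started does not exit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DMASR_HALTED      (1u << 0)
#define DMASR_IDLE        (1u << 1)
#define DMASR_DMA_INTERR  (1u << 4)
#define DMASR_DMA_SLVERR  (1u << 5)
#define DMASR_DMA_DECERR  (1u << 6)
#define DMASR_IOC_IRQ     (1u << 12)
#define DMASR_ERR_MASK    (DMASR_DMA_INTERR | DMASR_DMA_SLVERR | DMASR_DMA_DECERR)

/* Hypothetical summary states for one channel in direct register mode. */
typedef enum {
    DMA_CH_HALTED,   /* run/stop is 0: channel not started, or stopped */
    DMA_CH_ERROR,    /* an error bit is set; the channel needs a reset */
    DMA_CH_IDLE,     /* running, and the last transfer has completed */
    DMA_CH_RUNNING   /* running, no completed transfer reported yet */
} dma_ch_state;

static inline dma_ch_state dma_decode_status(uint32_t sr)
{
    if (sr & DMASR_ERR_MASK)
        return DMA_CH_ERROR;
    if (sr & DMASR_HALTED)
        return DMA_CH_HALTED;
    if (sr & DMASR_IDLE)
        return DMA_CH_IDLE;
    return DMA_CH_RUNNING;
}

/* "Busy" defined purely as "Idle bit clear"; a halted channel counts as busy here. */
static inline bool dma_busy_by_idle_bit(uint32_t sr)
{
    return (sr & DMASR_IDLE) == 0u;
}
```

With this, a status of 0x00000001 decodes as halted yet still counts as busy by the idle-bit test, while 0x00000002 decodes as idle and not busy.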

 

Thank you everyone for helping me clear this up. I would like to select everyone as best answer, but I can't, so I'll just go with the last one in the progression.


  • 0
On 4/27/2022 at 6:36 PM, RyanW said:

Thank you. I took the advice and wrote my own drivers for this kind of thing. Perhaps I just didn't understand how to use the Xilinx provided ones, but direct transfer mode is fairly simple when you lay it out like that. I know I had read the programming sequence in the docs, but I figured it was just handled in the simple transfer function which wouldn't allow consecutive transfers as it checks if the DMA has been started before already. I had thought that the DMA would de-assert back to a halted state, but seems this is not the case.

 

Thank you everyone for helping me clear this up. I would like to select everyone as best answer, but I can't, so I'll just go with the last one in the progression.

I'm currently facing a similar challenge, but attempting to establish a DMA transfer from PL to PS. Would you be willing to share the code snippets or more detailed steps you followed to solve your issue? Any additional insights on managing the DMA operations effectively would be greatly appreciated!

Thanks in advance for your assistance!


  • 0

Hi @Julii

The following describes a minimal PL -> PS transfer example. Code was tested on an Eclypse in Vivado/Vitis 2023.1. An AXI stream counter module is used to generate stimulus for the DMA's AXIS_S2MM port and Verilog source code for it is attached. It has a couple of control signals - when start is asserted, it asserts tvalid and counts whenever tready is asserted until it reaches a software-specified limit, at which point it asserts tlast, sends a final beat, and pauses until start is sent again. AXI GPIOs are used to hit the counter's control ports from software.

image.png

Source code:

axis_counter.v

The DMA was configured as follows. Scatter Gather was turned off, the width of buffer length was maximized, and the read channel was disabled. Width of buffer length is an important parameter, as it defines the maximum number of bytes that can be sent in a single transfer (2**26 - 1 in this case, since the length field is 26 bits wide). I didn't touch "allow unaligned transfers" but it would be helpful if you're working with u8 arrays, as it allows the software to be more flexible in where DDR buffers are located.

image.png

The Zynq PS had the HP0 port enabled so that the DMA could use it to push data to DDR.

For software, registers are directly accessed via pointers that are pointed at the corresponding addresses, to showcase how to avoid using the xaxidma and xgpio drivers. It follows the process outlined in the Programming Sequence -> Direct Register Mode section of the DMA product guide: https://docs.xilinx.com/r/en-US/pg021_axi_dma/Direct-Register-Mode-Simple-DMA. The software sets the Runstop bit in the DMA's S2MM control register, sets the destination address and buffer length, then configures the AXI stream counter and starts it. Once the S2MM status register's Idle bit returns to "1", it invalidates the cache and verifies that data has been successfully transferred.

main.c

This system accounts for several potential stumbling blocks:

1. The DMA S2MM interface's tready bit cannot be relied on to prevent data from flowing into the DMA before software initiates a transfer. It comes up as soon as the DMA comes out of reset. This means it's important to manually start upstream IP after setting up the S2MM transfer.

2. The module upstream of the DMA must assert tlast at the right time. If tlast is not asserted before the buffer that the DMA is pointing at would overflow, the DMA will lock up and needs to be manually reset to continue being used.

3. If the PS cache is enabled, data you're trying to write or access from software may not be the same as that seen by hardware. Manually flushing and invalidating relevant ranges of addresses ensures that the two are in sync.

4. Lastly, pay attention to where your buffers are placed in memory. If the memory segment they're placed in is not located in DDR, the DMA may not be able to access them. If the memory segment is too small, you may see issues like stack overflows.

Hope this helps,

Arthur


  • 0

@artvvb

Can you elaborate a bit on the control_0 part of this design? Are these just 2 GPIO blocks? It's impossible to tell what they are connected to; my options for the Z7 are the buttons or LEDs. It seems like the data is streaming from this GPIO.

Otherwise this seems like a nice instructional piece of code, but for this exclusion.  

Sorry it may be obvious to some. 

Thanks, 

 

 


  • 0

I've attached an archive of the project. As mentioned before, it targets Eclypse Z7 and is for 2023.1: basic_dma_pl_to_ps.xpr.zip

It's slightly different from the block design screenshot in the original comment, but I think the only change is the addition of an ILA for debugging some of the AXI interfaces.


  • 0

@artvvb

When I run this the data never flows into the while loop in the software, my serial terminal output is a few "test done" printf statements.  When the ILA is launched it is "waiting for trigger" 0 of 1024 samples. 

Is there an obvious problem? 

 

Thanks, 

 


  • 0
46 minutes ago, Xband said:

my serial terminal output is a few "test done" printf statements. 

This is a good thing - it means that the data returned into the buffer matches the expected incrementing values and no mismatch was printed. To confirm, you could modify the section and add a pass/fail message after it checks the buffer contents as below:

	// verify the transferred data is as expected
	u8 test_good = 1;
	for (u32 i = 0; i < transfer_length; i++) {
		if (i != buffer[i]) {
			xil_printf("buffer[%d]: %d != %d\r\n", i, buffer[i], i);
			test_good = 0;
		}
	}

	if (test_good == 1) {
		xil_printf("test passed\r\n");
	} else {
		xil_printf("test failed\r\n");
	}
	xil_printf("test done\r\n");
50 minutes ago, Xband said:

When the ILA is launched it is "waiting for trigger" 0 of 1024 samples. 

For the ILA, I'm not certain what you mean. To verify that data is flowing:

1. Launch the Vitis debugger (right-click on system project in Vitis, select Debug -> Launch Hardware). After some time putting the bitstream and application onto the board, the trace will be planted at the top of main - you should see one of the lines of code turn green to indicate that this is where the program is currently. If you want, set a breakpoint at the end of the do_transfer function by right-clicking to the left of the line number and selecting "Toggle Breakpoint".

2. Open Vivado Hardware Manager and Open Target -> Auto Connect to the hardware server. You should see the ILA interface appear. You may need to add signals to the Waveform view, if so, click the plus button in the top left, select all, and hit OK.

3. Configure a trigger in the ILA. Trigger Setup pane, click the plus button, search for and select both axis_counter_0_count : TREADY and axis_counter_0_count : TVALID. Set the Value dropdown for both to "1 (logical one)". Operator should stay "==". Make sure the gate button at the top of the Trigger Setup pane is set to 'Global AND'. This will make it so that the logic analyzer will trigger the first time that ready and valid are simultaneously asserted at the counter module's output, which is when data starts moving.

4. Click the Run button in the ILA Status pane to start it. Click the Resume button in Vitis (or F8 hotkey) to launch the program. The ILA should fill with data. You should expand the interface and channel groups of interest to see individual signals. Should look something like this:

image.png


  • 0

@artvvb

Not sure if it's successful or not. It looks like it read 64 values, 0-63, though the added code says test failed. Not sure how to interpret the result.

I was going to move this topic to a new thread but not sure of a smooth process for doing so. 

thanks for insight. 

image.thumb.png.c1e58e5c01ec0afda0d88427089a40d1.png

image.png


  • 0

Ugh, I'm really sorry, there's a bug I didn't catch... The hardware design seems to be fine but the second word in the buffer is getting overwritten at some point, possibly due to insufficient memory in the stack/heap, memory alignment, or some pointer issue. Taking a look.

I also likely should have started a new thread when posting the example code, rather than replying into this one.


  • 0

Running through the debugging process:

I'm using a 20-word (80-byte) transfer so that everything fits on screen. Below is the input stream going into the DMA. You can see that it counts from 0 to 19, and that tlast is asserted on the 19 (the 20th word). If an issue with the counter module were corrupting the first two words, I'd expect to see it here, but everything looks as expected. If the DMA worked perfectly but the counter fed it data that didn't match what I expected, then the data seen in software would also be wrong; that's not the case here, since we see the expected 0, 1, 2, 3, ...

image.png

Below is the output of the DMA, the AXI master interface that writes to DDR. The DMA takes the input stream, breaks it up into bursts, and sends data to DDR by first writing an address and then the data - here you can see the first ten bytes are transferred in one burst, starting with address 0x2062d8. After the DMA transfer is complete, the processor will be able to access that memory and see the new data. Again, if there was an issue with the DMA or the counter, I'd expect to see it here, potentially by seeing a value other than 0 or 1 on the WDATA bus at the start of the transfer.

image.png

Now in the Vitis debugger, this is where we can actually see the bug. I can look at the data in DDR after the transfer by setting a breakpoint after the Xil_DCacheInvalidateRange call, and using the "Memory" view to monitor the address that the buffer is pointing at. In this case, I have two breakpoints set. One at the start of the do_transfer function, so that I can check the buffer address in the "Variables" view. I take that buffer address and add it to the memory view, which shows me that the data is set up as expected - in the code pictured, I'm zeroing everything other than the first word to make sure that the expected value after DMA is different from the known value before the transfer.

image.png

Hitting continue to go to the next breakpoint, we can see that everything except for the first two words is updated correctly - the red delta indicates that there's a difference in the value at that memory address since the last time the processor was halted:

image.png

This is then checked by the program and the pass/fail is printed to console, in this case a fail since the first two words don't match the expectation.

image.png

So, the issue doesn't seem to be with the hardware, at least up to the DMA output. Either the first two words aren't being written by the DMA transfer at all, or they are overwritten behind the scenes after the transfer is completed (overwritten specifically back to the original state).

Moving the buffer declaration out of the do_transfer function so that it's a global variable seems to consistently fix it (it puts the buffer at a different place in memory). That said, I'm still not sure what exactly is going on. I tried increasing both the stack and heap sizes in lscript.ld to 0x200000, which didn't help. I also tried checking the "allow unaligned transfers" box in the DMA settings, since alignment can also matter, which also didn't help.

image.png
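
One hypothesis that fits the symptom, which I haven't confirmed on hardware: the buffer address seen on the AXI write channel above, 0x2062d8, is not aligned to the 32-byte L1 cache line of the Cortex-A9, and exactly 8 bytes (two words) of the buffer sit in a line shared with whatever precedes it on the stack. Cache maintenance works on whole lines, so a partially covered line at the edge of an invalidate may end up written back rather than discarded, which could put stale values back on top of the DMA's data. The helper names below are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u   /* Cortex-A9 L1 data cache line size */

/* Offset of addr within its cache line (0 when the address is line-aligned). */
static inline uint32_t line_offset(uint32_t addr)
{
    return addr & (CACHE_LINE_BYTES - 1u);
}

/* Bytes at the start of a buffer that share a cache line with preceding data. */
static inline uint32_t head_bytes_sharing_line(uint32_t addr)
{
    uint32_t off = line_offset(addr);
    return off ? CACHE_LINE_BYTES - off : 0u;
}

/* Buffer size rounded up to a whole number of cache lines. */
static inline uint32_t round_up_to_line(uint32_t nbytes)
{
    return (nbytes + CACHE_LINE_BYTES - 1u) & ~(CACHE_LINE_BYTES - 1u);
}
```

For 0x2062d8 the offset is 24, leaving 8 bytes (two u32 words) in the shared line. If this is the cause, declaring the buffer with `__attribute__((aligned(32)))` and a size rounded up to a whole number of lines should avoid it, whether the buffer is global or not.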

Hopefully my thought process is relatively clear.


  • 0

@artvvb

Moving that definition fixed it on my side too. Thanks for a small win here and for the great instructions on getting through the process. I found this quite valuable as a training intro to the ILA process. Can't say I would have successfully debugged and found the problem, but it's a good step in the process!

Glossed over the bit about triggering the acquisition twice and seeing the difference in the buffer.  My success was getting all of your steps to line up and work.  Spent 2 days messing around after trying to address a buffer overflow diagnosis from the AMD help site, finally restored the settings and things worked after rebooting the machine.  This platform seems to have an infinite number of issues to deal with.  

 

Thanks again for the help!

image.png

