Need advice on improving processing speed for my program: Efficient data reading from memory block in PS

ale.fdezsuarez · May 28, 2023

I would like to ask for advice on the possibility of increasing the processing speed of my program.

I have a program that reads a series of data from AD converters from the PL and stores them in a memory block with "stand alone" mode and "Simple Dual Port RAM" type. It uses Channel A for writing the data from the PL, and Channel B is supposed to be used for reading the data from the PS.

The problem I'm facing is that the only way I know how to access the data from the memory block in the PS is by creating two AXI GPIO blocks. One block contains the memory address I want to read from, and the other block holds the data once it's loaded into that module with the address. Although this process works, it is highly inefficient. Does anyone know how I can efficiently read the data from the memory block in the PS?

Thank you in advance.

zygot · May 28, 2023

Yes, your current method is highly inefficient.

You can DMA data between the PS external memory and your PL logic in two basic ways. One way that I've done this is similar to yours: implement DPRAM so that your logic has access on one port, and the IPI AXI IP in your block diagram has access to the same memory on the other. You can even use the PS DMA controller to move data to external memory.

Another possibility is to use AXI IP to DMA data between PS memory and the PL where the DMA is in the PL logic. There are a few different approaches to select from here.

The best way to do this is to write your own AXI IP to DMA data between PS resources and PL logic. This is not a trivial undertaking.

All of these methods require some knowledge of how PS/AXI works and the limitations involved. Nothing that's high performance is free or simple.

DougFPGA · May 28, 2023

Note: I haven't personally done this (yet) so others with more experience can correct me.

I'd look into building an AXI interface directly to the block RAM. Xilinx seems to provide one as an IP block in PG078, "LogiCORE IP AXI BRAM Controller v4.0". Then you should be able to read the block RAM directly from the PS with the appropriate memory addresses.
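As a rough sketch of what "read the block RAM directly from the PS" could look like in bare-metal C once an AXI BRAM Controller is in the design: the base address below is a placeholder (the real one comes from the Vivado Address Editor), and the function name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: placeholder base address. The real value is assigned in the
 * Vivado Address Editor and depends on the design. */
#define BRAM_BASE_ADDR_ASSUMED 0x40000000u

/* Copy n_words 32-bit samples out of a memory-mapped block RAM.
 * The source is volatile so the compiler issues every read and does not
 * cache values that the PL may have changed. */
static void bram_read_words(const volatile uint32_t *bram,
                            uint32_t *dst, size_t n_words)
{
    assert(bram != NULL && dst != NULL);
    for (size_t i = 0; i < n_words; ++i) {
        dst[i] = bram[i];   /* plain load, no address/data GPIO pair needed */
    }
}
```

On the target this would be called with something like `bram_read_words((const volatile uint32_t *)BRAM_BASE_ADDR_ASSUMED, buf, 256);`. On a PC the same function can be exercised by passing an ordinary array as the source.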

Cheers,

Doug

artvvb · May 31, 2023

@ale.fdezsuarez

How much faster do accesses need to be? Using two AXI GPIO channels as described should require two AXI4Lite transactions, one for the address write, and one for the data read - at least if you perform address reads and writes manually, the xgpio driver could introduce extra overhead. A custom AXI4Lite peripheral with an address space mapped to it would only require one AXI4Lite read transaction (there's a built-in example of this, and potentially some guides around the web). This isn't exactly the same as being twice as fast, but is probably close. DMA, and/or a custom AXI4 (not lite) controller, like zygot mentioned, would be necessary if you want to go faster than this. Xilinx's AXI DMA IP could also work, but has some catches/limitations and a potentially steep learning curve of its own.
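To make the transaction count concrete, here is a small host-side model (not real driver code; all names are hypothetical) that counts the bus accesses for the two schemes. It only models the number of AXI4-Lite transactions, not their latency.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 16u

/* Host-side model of the PL memory plus a bus-transaction counter. */
typedef struct {
    uint32_t mem[MEM_WORDS];
    uint32_t addr_reg;      /* stands in for the "address" GPIO channel */
    unsigned transactions;  /* number of AXI4-Lite accesses issued */
} bus_model_t;

/* Two-GPIO scheme: write the address, then read the data back. */
static uint32_t read_via_gpio(bus_model_t *bus, uint32_t index)
{
    assert(index < MEM_WORDS);
    bus->addr_reg = index;
    bus->transactions++;                 /* AXI4-Lite write (address) */
    uint32_t value = bus->mem[bus->addr_reg];
    bus->transactions++;                 /* AXI4-Lite read (data) */
    return value;
}

/* Memory-mapped scheme: the read itself carries the address. */
static uint32_t read_via_mapped(bus_model_t *bus, uint32_t index)
{
    assert(index < MEM_WORDS);
    bus->transactions++;                 /* single AXI4-Lite read */
    return bus->mem[index];
}
```

Reading the same word costs two counted transactions with `read_via_gpio` and one with `read_via_mapped`, which is the roughly 2x difference described above.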

DougFPGA wrote:

Xilinx seems to provide one as an IP block in PG078, "LogiCORE IP AXI BRAM Controller v4.0".

There's additional complexity with this if you want to use burst transactions, which are required if you want to significantly beat the performance of a custom AXI4-Lite core, or match DMA: https://support.xilinx.com/s/question/0D52E00007FSWkfSAH/zynq-axi-master-gp-burst-access?language=en_US

Thanks,

Arthur

zygot · May 31, 2023

It should be noted that AXI DPRAM size used with the AXI BRAM controller is very limited (8KB). Yes, you can instantiate lots of controller/BRAM instances but the cost is high.

A further problem to solve with many of these solutions is knowing what the scope of the data in the shared memory is at any point in time. You could implement a mailbox connecting your ARM cores to the PL logic.
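One possible shape for such a mailbox, sketched as a host-testable C model: the register layout (`ready`, `valid_words`) and the function name are assumptions for illustration, not an existing IP. A real design would also need to deal with cache maintenance and ordering between the flag and the data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: hypothetical mailbox registers shared between PL and PS. */
typedef struct {
    volatile uint32_t ready;        /* PL sets to 1 when a block is complete */
    volatile uint32_t valid_words;  /* number of valid words in the buffer */
} mailbox_t;

/* Copy the block only if the producer has marked it complete, then hand
 * the buffer back by clearing `ready`. Returns true if data was copied. */
static bool mailbox_try_read(mailbox_t *mb, const volatile uint32_t *shared,
                             uint32_t *dst, size_t dst_words, size_t *out_words)
{
    assert(mb != NULL && shared != NULL && dst != NULL && out_words != NULL);
    if (mb->ready == 0u) {
        return false;               /* contents not yet valid, do not touch */
    }
    size_t n = mb->valid_words;
    if (n > dst_words) {
        n = dst_words;              /* never overrun the destination */
    }
    for (size_t i = 0; i < n; ++i) {
        dst[i] = shared[i];
    }
    *out_words = n;
    mb->ready = 0u;                 /* tell the PL the buffer is free again */
    return true;
}
```

The point of the flag is that the PS never reads the shared memory while the PL is still filling it, so the software always knows which words are meaningful.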

My point here is that all of these solutions have some non-trivial issues to resolve in your design approach.

Tip: AXI addressing is byte oriented.
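In practice this means a word index in the BRAM has to be scaled by the word size to get the AXI offset. A minimal sketch, assuming a 32-bit data width (the helper names are made up):

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: 32-bit wide memory, so consecutive words are 4 bytes apart. */

/* AXI byte offset of the word_index-th 32-bit word. */
static uint32_t word_to_byte_offset(uint32_t word_index)
{
    return word_index * (uint32_t)sizeof(uint32_t);
}

/* Inverse mapping: word index for a word-aligned AXI byte offset. */
static uint32_t byte_offset_to_word(uint32_t byte_offset)
{
    assert(byte_offset % sizeof(uint32_t) == 0);  /* must be word aligned */
    return byte_offset / (uint32_t)sizeof(uint32_t);
}
```

So word 3 written by the PL shows up at byte offset 12 (0x0C) from the base address seen by the PS, not at offset 3.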

asmi · May 31, 2023

There are a few ways to achieve this I can think of:

1. Implement an AXI4-Lite slave interface with a single address; each read from that address pops one data item from the FIFO and returns it. Then you can set up a DMA to always read from the same address and push the data into some memory buffer. This is how many peripherals in MCUs are implemented (though they use the AHB bus).

2. Use AXI DMA IP to write data from your FIFO (via AXI4-Stream) directly into Zynq's DDR memory via AXI4 Slave HP port.

3. Implement an AXI4-Full master interface which would interact directly with Zynq's AXI4 Slave HP port and push the data that way. This will likely also require implementing an AXI4-Lite interface for control registers so that the CPU can give your module the destination address as well as some other parameters. This is likely going to be the most efficient way in terms of FPGA resources, but it's also the most labor-intensive.
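The access pattern behind option 1 (fixed source address, incrementing destination) can be sketched with a host-side model. The FIFO struct stands in for the PL logic and the copy loop stands in for what a DMA channel would be configured to do; all names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 8u

/* Host-side stand-in for the PL FIFO behind one AXI4-Lite register. */
typedef struct {
    uint32_t data[FIFO_DEPTH];
    size_t head;    /* index of the next item to pop */
    size_t count;   /* items currently stored */
} fifo_model_t;

/* Models a read of the single data register: each read pops one item. */
static uint32_t fifo_reg_read(fifo_model_t *f)
{
    assert(f != NULL && f->count > 0);
    uint32_t value = f->data[f->head];
    f->head = (f->head + 1u) % FIFO_DEPTH;
    f->count--;
    return value;
}

/* Fixed source, incrementing destination. Returns the words copied. */
static size_t fifo_drain(fifo_model_t *f, uint32_t *dst, size_t max_words)
{
    size_t n = 0;
    while (n < max_words && f->count > 0) {
        dst[n++] = fifo_reg_read(f);   /* same "address" on every read */
    }
    return n;
}
```

A real implementation would also expose a status register (or a fill-level count) so the reader knows when the FIFO is empty; the `count` field plays that role in this model.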

Using AXI GPIO to access on-chip resources is insanity. Kind of like talking over a cell phone with someone who is standing right next to you.
