Nexys A7 - DDR2 speed


Michal Hucik

Question

I'm trying out the neorv32 CPU on an AXI4-Lite bus. AXI_ACLK is 100 MHz. I have connected a BRAM memory block through the AXI BRAM Controller and a DDR2 memory interface through the MIG. The input clock sys_clk_i connected to the MIG is 200 MHz.
I wrote a simple test to measure the read speed:

uint64_t start_time = neorv32_mtime_get_time();
uint32_t value = *(volatile uint32_t *)addr;  // volatile so the read isn't optimized away
uint64_t end_time = neorv32_mtime_get_time();

When the program runs from BRAM, the access time to BRAM is approx. 70 ticks and to DDR2 approx. 120 ticks.
When the program runs from DDR2, the access time to BRAM is approx. 995 ticks and to DDR2 approx. 1100 ticks.

Why is there such a big difference, and why is the program running from DDR2 so slow?
Is the DDR2 memory saturated by the random read accesses?
Is it possible to enable some DDR2 cache that would speed things up?
Or is DDR2 simply not suitable for use as CPU instruction and data memory?


3 answers to this question


I like the idea of creating experiments to get a feel for performance. You ask good questions, and perhaps your questions ought to suggest some additional experiments to you. That's how you learn stuff. Of course, understanding how DDR, BRAM, and your controller work helps. Indeed, the danger of asking questions and doing experiments is that you usually end up with more, and ever more complex, questions.

So more questions:

  • Is the test above really measuring just how long it takes to read a 32-bit value from memory, or something else?
  • Looking at the 3 lines of code above, how could you improve the accuracy of your measurement? (One possibility is sketched right after this list.)
  • What would happen if your soft processor ran the code out of BRAM and accessed data in both BRAM and DDR (or vice versa)?
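
For example, here is a minimal sketch of one way to tighten the measurement, assuming the same neorv32 timer API as the original test; NUM_READS and the sink variable are my own placeholders, and addr stands for the address under test:

#include <stdint.h>
#include <neorv32.h>

#define NUM_READS 1024                      // placeholder sample count

volatile uint32_t sink;                     // keeps the reads from being optimized away

uint64_t measure_avg_read_ticks(volatile uint32_t *addr) {
    // first, measure the cost of taking a timestamp itself
    uint64_t t0 = neorv32_mtime_get_time();
    uint64_t t1 = neorv32_mtime_get_time();
    uint64_t timer_overhead = t1 - t0;

    // then time many reads and average out the noise
    uint64_t start = neorv32_mtime_get_time();
    for (int i = 0; i < NUM_READS; i++) {
        sink = *addr;                       // repeated read of the word under test
    }
    uint64_t end = neorv32_mtime_get_time();

    return (end - start - timer_overhead) / NUM_READS;
}

The average still includes the loop's own branch and increment, so treat it as an upper bound on the access time rather than an exact figure.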


I frequently like to get a sense of performance and latency for my projects, and it's common for me to instrument my sources and create experimental side projects to answer questions that come up during the design phase... and there are always questions. Of course, being comfortable with the HDL development flow helps considerably. It's amazing what a simple counter, some logic, and a few trusty UART debugging modules can do.

Be aware that for more complicated systems, and I believe yours qualifies, simple tests can be misleading when applied to real-world performance. There can be lots of little details and behaviors that aren't at all obvious, and they complicate things if you want to extrapolate simple performance numbers into more general expectations.

Sometimes just thought experiments are productive.

  • What's the difference between moving blocks of data and true random R/W access?
  • Does block size matter?
  • Is random R/W operation different from sequential operation?
Edited by zygot

Thank you for the answer. I admit I got a little carried away when I used my simple (misleading) test as a basis for measuring performance. I underestimated the number of operations associated with the measurement itself.
I did a new, somewhat more accurate measurement with the ILA: I measured the time between the S_AXI_ARVALID and S_AXI_RVALID signals on the AXI slave.

BRAM - 3 ticks (nice, but this is not my focus)
DDR2 - 22 ticks = 220 ns

I may be wrong, but I'm going by the information in the Nexys A7 reference manual, which gives a transfer rate of 650 Mbps at clk_period = 3077 ps.

So my assumption was: read_freq = 650 Mbps / 32 bit => (650*1024*1024) / 32 => 21.2992 MHz => approx. 47 ns read access time for one 32-bit word.

The assumption according to the measured value of 220 ns: 1 / 220 ns => approx. 4.545 MHz => 4.545 MHz * 32 bit => approx. 138.7 Mbps (with the same 1024*1024 convention as above).
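
To make the arithmetic explicit, here is the same calculation as a tiny C sketch; the constants are just the numbers above, and 1 Mbps is taken as 1024*1024 bit/s to match the 650*1024*1024 step:

#include <stdio.h>

int main(void) {
    // expected, from the reference manual figure
    double per_pin_bps = 650.0 * 1024 * 1024;           // 650 Mbps
    double exp_hz = per_pin_bps / 32.0;                 // ~21.2992 MHz
    double exp_ns = 1e9 / exp_hz;                       // ~46.9 ns per 32-bit word

    // measured, from the ILA
    double meas_ns = 220.0;
    double meas_hz = 1e9 / meas_ns;                     // ~4.545 MHz
    double meas_mbps = meas_hz * 32.0 / (1024 * 1024);  // ~138.7 Mbps

    printf("expected %.1f ns/word, measured %.1f Mbps\n", exp_ns, meas_mbps);
    return 0;
}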


Well, the ILA idea is better but it can still be misleading. Sometimes even very precise time measurements turn out not to be repeatable under all operating conditions. But the whole purpose of experiments is to test your assumptions, and perhaps to expose notions that we believe are factual but are really poor assumptions. There's nothing wrong with assumptions as long as they are properly tested.



One thing about DDR is that the controllers generally work on cache lines of multiple bytes or words, because of the high data rates and a desire to keep the logic clocked at a reasonable rate. The implementation of external dynamic memory controllers as a whole seriously complicates any mathematical analysis of performance. This is doubly so when a processor is executing opcodes out of DDR, whether or not your soft processor uses instruction and/or data caches.



Perhaps a better test would be measuring large block transfers and working out an average rate in bytes or words per second; a sketch follows below. And perhaps not. It depends on what you are looking for. Sometimes what we are looking for isn't what we should be looking for, as far as getting answers to questions is concerned. Usually, particularly at the beginning of an educational journey, the initial questions need improvement.
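
For instance, a minimal sketch along those lines, again using the neorv32 timer; BLOCK_WORDS, MTIME_HZ, and the base address are placeholders for whatever block size, mtime tick rate, and memory map your design actually uses:

#include <stdint.h>
#include <neorv32.h>

#define BLOCK_WORDS 4096                    // placeholder block size; vary it and watch the trend
#define MTIME_HZ    100000000ULL            // placeholder mtime tick rate of your configuration

uint64_t measure_read_bytes_per_sec(volatile uint32_t *base) {
    uint32_t checksum = 0;                  // consume the data so the reads aren't dropped

    uint64_t start = neorv32_mtime_get_time();
    for (int i = 0; i < BLOCK_WORDS; i++) {
        checksum += base[i];                // sequential reads, friendly to burst/cache-line access
    }
    uint64_t end = neorv32_mtime_get_time();

    (void)checksum;
    return (BLOCK_WORDS * 4ULL * MTIME_HZ) / (end - start);
}

Comparing the result for a small block against a large one hints at how much of the cost is per-transaction overhead versus raw transfer rate.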



You correctly understand that the time to do a measurement, or get a timestamp, can become the predominant part of a measurement. So go with that thought, understanding that there might be other factors you haven't taken into account that affect the quality of your measurements and of your conclusions about those measurements. Don't forget about latency, that is, the time from when your processor requests data to when it gets that data. The ILA approach is good at measuring clocks between signal states, but not necessarily at measuring latency, which might be as important as, or even more important than, the delay between signals. That is, what's happening on the logic side might not be as important as what's happening on the software (opcode) side. Of course, if you can break into your processor you can tweak the measurements to be more accurate for what you want to measure.



Again, even if you get a pretty accurate measurement of the minimum possible time to read or write data, this might not be all that helpful for real-world applications once all of the layers of software are taken into account. This is one reason why DMAing data directly from logic into memory is so attractive. If your data memory is the same as your instruction memory, then even DMA analysis gets pretty complicated.



I'm sure that you've read CPU and GPU performance reports from various testing websites comparing, for instance, AMD and Intel products and the supporting devices on motherboards and memory. Even with standardized synthetic and 'actual application' test suites, making sense of performance numbers as they pertain to a different application that you are interested in is fraught with danger.



So, with all of that in mind, perhaps you can think of a way to construct your experiment and measurements to be a bit more comprehensive and take into account the concepts discussed so far.

[edit] Also, for dynamic memories there are periods during which the memory controller performs refresh and the application doesn't have access. So be suspicious of very consistent measurements. Usually it's better to track minimum and maximum times as well as the average, along the lines of the sketch below.
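
A rough sketch of that bookkeeping, reusing the placeholder names from the earlier sketches (NUM_READS and sink):

#include <stdint.h>
#include <neorv32.h>

void measure_min_max_avg(volatile uint32_t *addr,
                         uint64_t *min_t, uint64_t *max_t, uint64_t *avg_t) {
    uint64_t lo = UINT64_MAX, hi = 0, sum = 0;

    for (int i = 0; i < NUM_READS; i++) {
        uint64_t t0 = neorv32_mtime_get_time();
        sink = *addr;                       // one timed access per iteration
        uint64_t t1 = neorv32_mtime_get_time();
        uint64_t dt = t1 - t0;
        if (dt < lo) lo = dt;
        if (dt > hi) hi = dt;
        sum += dt;
    }
    *min_t = lo;
    *max_t = hi;
    *avg_t = sum / NUM_READS;               // a max well above the min hints at refresh collisions
}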

Edited by zygot