
Arty Microblaze Speed Question


Nystflame

Question

Hello,

Without implementing a timer, I thought I could gauge the MicroBlaze's speed by toggling a GPIO pin and observing the result with a logic analyzer (100 Msample/s). The MicroBlaze input clock comes from the "ui_clk" output of the MIG, which seems to be 83 MHz, but when I observe the pin the toggle frequency is only ~37 kHz. My method for toggling the pin is just an infinite while loop with two Xil_Out32 calls, one turning the pin on and the other turning it off. Are there any debugging methods I should try to find out why the switching frequency is so low?

 

P.S. I've since moved from toggling via Xil_Out32 to writing to the address of the GPIO pins directly. The frequency I'm seeing now is 130.7 kHz, still nowhere near the 80 MHz I had been expecting.

 

P.P.S. I've enabled caches and tried block RAM vs. DDR, and the maximum I've reached is 1.3 MHz.

The following is all of the code in my program for this test:

 

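#include "platform.h"

int main(void)
{
    init_platform();
    /* Address of the GPIO pins in this design */
    volatile unsigned int *pins = (volatile unsigned int *) 0x40000000;
    for(;;){
        /* Read-modify-write: each toggle needs a bus read and a bus write */
        pins[0] ^= 0x1;
    }
    cleanup_platform(); /* never reached */
}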

 

 

Best Regards,

nystflame


@Nystflame,

Wow, quite a fascinating observation, huh?

I was going to post a quick ZipCPU program illustrating how the ZipCPU could toggle a pin at about 16MHz on the Arty, but as I started to do so I realized how many things could be slowing your program down.  Hence, before we start comparing one CPU to another, let's see how fast we can get the MicroBlaze to run.

So let me ask a couple of questions:

  1. How did you configure the MicroBlaze?  Does it have an instruction cache?  Is your program running from cache, or is it running from DDR3 SDRAM memory with no cache?  It can cost about 200ns to retrieve each instruction from the SDRAM memory if there's no cache.
  2. Did you turn optimizations on?  The "-O3" compiler argument?  Without optimization, the compiler will calculate the address of your pins variable before every access.
  3. You might wish to define your pins as "volatile unsigned * const pins = (volatile unsigned *)0x40000000;"  This tells the compiler that the pointer itself is constant (only the data it points to is volatile), and so it won't need to recalculate the pointer's value each time.
  4. Does setting the pin to 1 and then to 0 run any faster than your current program, which must read the value, XOR it with one, and then write the value back?
  5. Can you post the assembly associated with your program above?  (*-objdump -D objectfile.o)  (Adjust the *-objdump to reference your cross-compiler's objdump utility ...)

Incidentally, the reason why I estimated that the ZipCPU would do 16MHz (I didn't time it, I estimated it) was because of the latency associated with accessing the peripheral bus.  Just because the loop takes the ZipCPU three instructions (store, store, and a jump for the loop--about as good as any CPU could get) at an 82MHz clock on the Arty doesn't mean it can toggle the pin at (82MHz/3=) 27MHz.  The loop instruction costs a CPU stall cycle, and the two store instructions require a couple of clock cycles as well.  I would expect the MicroBlaze CPU to have similar limitations.
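The estimate above can be written out as a small calculation. This is only a sketch: the function name loop_freq_mhz and the stall-cycle counts passed to it are illustrative assumptions, not measured ZipCPU figures.

```c
#include <assert.h>

/* Output frequency in MHz of a loop whose body sets and then clears a pin,
 * so that one loop iteration is one full output period.
 *   clk_mhz       CPU clock in MHz
 *   instructions  instructions per iteration (store, store, jump = 3)
 *   stall_cycles  extra clocks lost to bus latency and the taken jump
 *                 (hypothetical value, chosen by the caller) */
static double loop_freq_mhz(double clk_mhz, unsigned instructions,
                            unsigned stall_cycles)
{
    unsigned cycles = instructions + stall_cycles;
    assert(cycles > 0);   /* guard against dividing by zero */
    return clk_mhz / (double)cycles;
}
```

With no stalls, loop_freq_mhz(82.0, 3, 0) gives the ideal 27.3MHz. Assuming just two stall cycles per iteration, loop_freq_mhz(82.0, 3, 2) drops to 16.4MHz, which is in the neighborhood of the 16MHz estimate.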

Realistically, though, the bottom line is that if you want to toggle a pin at high speeds .... do it with logic, not with the CPU.  CPUs just don't run all that fast in comparison to FPGA logic--even when implemented within an FPGA.  (Actually, they run about 10x slower within FPGAs vs dedicated hardware--hence the Zynq.)

Dan


Thank you both for your responses. I'll try my best to answer them individually:

@jpeyron :

Attached is an image of my block diagram.

 

@D@n :

Please see attached for assembly code.

1. I configured the MicroBlaze with 32KB of instruction cache and data cache, and my program is running from the onboard block memory.

2. I hadn't turned on optimizations, but I believe I have that "-O3" option enabled now in the makefile. It increased the speed from 1.333MHz to 1.538MHz.

3. I've made them constants and the speed is still the same.

4. Setting the pin to 1 and then to 0 does actually speed up the program from 1.538MHz to 3.571MHz, so that's a pretty big improvement.

5. I've attached the assembly code (from the .elf file), which contains the program. Before I switched to setting the pin manually to 1 and 0, the XOR version took 6 assembly instructions per loop. With the manual setting of the pin, it also takes 6 assembly instructions. (Both counts include the branch back to the top of the loop.) When stepping through the loop in the SDK, though, it only highlights 3 assembly instructions: two load word immediate instructions (one for setting the pin to 1 and another for setting it back to 0), and a branch immediate (to the top of the loop).

Thank you both very much for the help!

Here is my updated code:

#include "platform.h"
#include "xil_printf.h"
#include "xgpio.h"

XGpio Gpio; /* The Instance of the GPIO Driver */

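int main(void)
{
    int Status;

    /* Initialize the GPIO driver for device ID 0 */
    Status = XGpio_Initialize(&Gpio, 0);
    if (Status != XST_SUCCESS) {
        xil_printf("Gpio Initialization Failed\r\n");
        return XST_FAILURE;
    }

    /* Channel 1: bit 0 is an output (0), all other bits are inputs (1) */
    XGpio_SetDataDirection(&Gpio, 1, ~0x01);

    init_platform();
    volatile unsigned* const pins = (volatile unsigned* const) 0x40000000;

    for(;;){
       //pins[0] = pins[0] ^ 0x01;
       pins[0] = 1;   /* drive the pin high */
       pins[0] = 0;   /* drive the pin low */
    }
    cleanup_platform(); /* never reached */
}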

 

Best Regards,

nystflame

Microblaze_Block_Diagram.PNG

assemblyCode.txt


Hmm ... there's a problem in how pins[0] is being accessed.  It looks like it's calculating the pointer each time.  Since you've got -O3 already, suppose you try accessing *pins instead of pins[0].  I wonder if that will get rid of the extra "lwi" instructions.  I'm also a little disappointed at the addik instruction.  The compiler should be able to remove that from the loop, but it isn't.  Shame on the compiler for that one.  You might be able to, outside of your loop, declare register variables "register unsigned zero, one;" and set them to zero and one respectively--then reference those variables within your loop.  That might help get you a bit faster.  (Actually, r0=0 already, so you might only need to do this for "one")
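A minimal sketch of that suggestion follows. The helper name pulse_pin and its bounded loop are assumptions made so that the code can also be exercised on a host machine; on the target the pointer would be (volatile unsigned *)0x40000000 and the loop would run forever. Whether the compiler actually drops the extra instructions depends on the toolchain.

```c
#include <assert.h>

/* Hypothetical helper: pulse the word at `pins` high then low, n times.
 * The pointer is const (never reloaded), the data it points to is volatile
 * (every store must really happen), and the constant 1 is kept in a
 * register-qualified local, as suggested above. */
static void pulse_pin(volatile unsigned *const pins, unsigned n)
{
    register unsigned one = 1;   /* hint only; the compiler may ignore it */

    assert(pins != 0);
    while (n--) {
        *pins = one;   /* set the pin */
        *pins = 0;     /* clear the pin */
    }
}
```

Passing the address of an ordinary volatile variable instead of the GPIO address lets the loop body be checked without hardware.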

Still, I'm a bit disappointed at 27 clocks for six instructions--more than four clocks per instruction.  Did you tell Xilinx to create a pipelined MicroBlaze CPU?  That might be the difference.  The other possible difference might be the I/O speed.
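The clock counts can be reproduced from the measured frequencies with a short calculation. The function names here are hypothetical, and the result depends on which CPU clock is assumed: taking the 83 MHz ui_clk figure from the original post and the 3.571MHz set/clear measurement gives roughly 23 clocks per loop iteration.

```c
#include <assert.h>

/* CPU clocks spent per loop iteration, given the CPU clock and the
 * frequency observed on the pin (both in MHz). Valid for a loop that
 * sets and then clears the pin, so one iteration is one output period. */
static double clocks_per_iteration(double cpu_clk_mhz, double pin_mhz)
{
    assert(pin_mhz > 0.0);
    return cpu_clk_mhz / pin_mhz;
}

/* Average clocks per instruction for a loop of `instr_count` instructions. */
static double clocks_per_instruction(double clocks, unsigned instr_count)
{
    assert(instr_count > 0);
    return clocks / (double)instr_count;
}
```

For example, clocks_per_instruction(27.0, 6) is 4.5, so each instruction in the six-instruction loop averages between four and five clocks.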

Dan


@D@n ,

I appreciate all of the help. I've changed from pins[0] to *pins, and declared two register variables initialized to 0 and 1, but it doesn't seem to have had any effect on the speed. Maybe it is the I/O speed, as you suggested. Also, the MicroBlaze is currently configured for "performance" mode, which enables the 5-stage pipeline. I've attached the new assembly code, and it looks like it's still the same number of instructions.

 

Best Regards,

nystflame

 

newAsmCode.txt


Hi @Nystflame,

I verified your results with the MicroBlaze using the AXI GPIO core. I had some of my co-workers look at this thread to see if they had any additional input. Unfortunately, there was nothing more we could add or suggest, and we agree with @D@n's earlier statement: "if you want to toggle a pin at high speeds .... do it with logic, not with the CPU".  Have you reached out to Xilinx support to see if they have anything more to suggest?

Thank you,

Jon
