SOC Strategies

zygot · July 12, 2019

Ever the reluctant optimist I thought that I'd start a new thread and see what happens.

There are a number of strategies for creating a platform to do interesting things using FPGA devices. Among them might be:

All HDL with lots of state machines
Programmable soft processor in the FPGA with custom processing modules
ARM PS/PL single chip platform
Multi-board system. For instance UP^2, Delfino, Raspberry Pi, nVidia Jetson, Espressif WiFi SOC, plus some FPGA development board.

I'm sure that there are others but I'm going to kick off the discussion with what I have experience doing. Clearly this discussion depends on wanting to do something that is very difficult or impossible with either the first two choices or the fourth choice without an FPGA.

Your choice, of course, depends on what you want to do. If you can accomplish a project's goals with all HDL then you are wise to use that approach. From there it gets messy, particularly if your FPGA development and HDL skills are minimal. I tend to think of this problem, at least for thoughts directed to the Digilent audience, as a "how do I enhance my HDL with capabilities when needs are better served by other means" question. To be sure there are a lot of problems that are better served using a standalone processor like the Delfino or perhaps one of the processor boards mentioned earlier without an FPGA. I don't see a reason to consider that case here.

To my mind, once you know what needs to be done, and if you choose the 4th option listed previously, how you proceed comes down to a few decisions:

partitioning the design into what is best done using HDL and what is best done using a programmable processor
how flexible the programmable platform is in terms of OS selection
how productive the programmable platform tools are
how useful the exposed programmable platform hardware interfaces are

Some of the programmable platforms that I've mentioned above are very complicated devices with very complicated and often restrictive development flows. Having a system crippled by an OS kernel that can't handle real-time events isn't going to get you to a satisfactory conclusion regardless of how fast the processor(s) are; depending of course on your project requirements. Even with some pretty good on-board high resolution timing hardware there are still things that need to be done for some projects that require programmable digital logic. (let's table IO logic issues and such for the moment)

UP^2, Delfino, Raspberry Pi, nVidia Jetson, Espressif are all reasonable platforms for a lot of projects but let's face it they are very very complicated devices with complicated, sometimes maddeningly difficult to work with kernel development toolchains and support for application development that might not be adequate. Still, they provide capabilities that you will not find in an ARM based FPGA device. The more recent Ultrascale ZYNQ devices from Xilinx are certainly catching up and in some areas besting multi-device systems. And the RTOS options are certainly there, making the Vivado ZYNQ versus multi-board selection more challenging ( ignoring the unknowns involved in using the platform tools ).

For those coming from some competence in Linux, C, Python, and other software platform development but no digital design background, the learning curve for understanding the complexities of FPGA development is huge. Even if you are quite competent doing FPGA development, figuring out the details of using hardware resources on a complex programmable platform can also be a high hurdle.

So, perhaps anyone with experience doing any of this might offer some useful insights to benefit the community.

zygot · July 14, 2019

Holding meetings where you're the speaker and the audience is not that great. Not being great ( at the moment ) doesn't seem to me to be a significant criterion for either doing or not doing something that might be interesting; so onward and forward.

I understand that this post is kind of out of the way for many Digilent users, though probably not for a lot of people wandering into FPGA development. As far as Digilent is concerned Raspberry Pi and its kind don't exist. Other FPGA development board vendors have embraced the Arduino ecosystem with compatible headers. My perspective is that if I can solve a problem just with an FPGA board and some HDL that's what I'd rather be doing... I really don't have much interest in keeping up with Linux kernels, driver development or application coding if I can avoid it. No doubt this puts me in a different camp than most people reading any of this ( I know that I'm not the only one so far...)

Every once in a while I forget the hours of wasted time and energy trying to develop with nVidia's Jetson. Hours and days that needn't be wasted if adequate support, documentation, and care were part of the company's development product mantra. I really don't understand why a company with gobs of cash wants to be embarrassed by a little company making $5 modules like Espressif in terms of supporting its customers' needs. Have some pride in what you do already.

Lately, I've been playing with the Jetson Nano. For $99, a 5W or 10W power budget, 4 A57 cores, and a 128-core CUDA GPU promising to bring enhanced computing and imaging capabilities to an FPGA centric design, why not?

The Jetson, like the Raspberry Pi and any number of SBCs, offers a 40 pin header with a few serial interfaces and GPIOs. This is sometimes a suitable way to connect to your FPGA board using SPI, I2C and UART interfaces. The Nano also has USB 3.0 ports if you can manage to find a useful API. For an x86-64 board like the UP^2 this isn't a problem. For the ARM based boards with an aarch64 variant architecture it's more of an issue. Still, if one considers the FPGA board to be a sensor front-end of sorts there's an allure of possibility that's hard to ignore.

Out of the box the Jetson Nano is different from its siblings in that it uses a micro SD card for its FS and you can get started without having to create an Ubuntu cross-compiler host. I don't want to mention how many times I've had to re-install Ubuntu as a host Jetson development platform with the TX1 and TX2. Anyway, as I was about to mention, you can just flash an SD card, put it into the Jetson ( for under $100 you'd better get used to not getting all of the things required to actually use your board ) and off you go with the Jetson booting an nVidia Ubuntu 18.04 variant. One of the nice things that's happened since I developed for the TXx boards is that Ubuntu really does support the aarch64 Linux architecture a lot better (Can't agree with you about ARM Linus).

The Nano has a 6-pin header that you can connect a TTL USB UART to and use as a console to log into the OS, which is nice. As to the 40-pin header, you get a second UART (the datasheet says up to 12.5 Mbit/s (???) data rates), 2 SPI channels (the datasheet says up to 65 Mbit/s data rates, 32-bit word lengths), an I2C interface and a few spare GPIOs.

Unfortunately, nVidia didn't see fit to allow use of the external SPI channels in the default kernel image. This has always been a big problem with nVidia; not providing a default kernel supporting advertised hardware. Fortunately, there is a solution to the SPI problem, provided by one of its customers in the form of a shell script. This script runs on any Linux ( I used CentOS 6.10 ) to create a new device tree blob and re-flash a block of your Nano's FS so that one of the SPI channels gets exposed. No kernel recompilation, no SDK Manager, no cross-compilation involved. One could, in theory, expose the other SPI channel by modifying his dts but I haven't done that yet. One nice thing about the Jetson SPI, I2C and UART interfaces is that there aren't any USB timing related issues.

The downside for most of these types of platforms, and especially for the nVidia platforms, is that there isn't an RTOS available. Considering how nVidia wants to position the Jetson devices I find this astonishing. Well really I find the logic of most companies' management astonishing but really nVidia???

I need to mention that nVidia has chosen to encrypt its dtb so you can't just download a device tree compiler to your Nano, extract a dts, modify it, create a new dtb and boot into a kernel ready to support hardware changes. This is important to anyone wanting to use the platform. The sales pitch implies that the Nano could be used to do kernel development on the target but this is not the case and evidently will never be for Jetson platforms. I guess that while some vendors want their customers to soar like eagles some want their customers to be more like chickens in a cage....

The reality of devices like the Jetson is that what the datasheet promises and what you can get are not always the same. nVidia provides absolutely no support for the interfaces on the interface header so you have to figure this out for yourself and be content with what's available. I've been able to run the external SPI interface as /dev/spidev0.0 using C but only up to about 25 MHz SCLK rates. Not what I want but not unusable for now. A problem is that for a platform like the Nano clock management is key. Unfortunately, nVidia doesn't seem to be too interested in providing the tools necessary for extracting the full promise of its products to its customers. That's a shame.
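For a rough sense of what those SPI clock rates mean in bytes, here is a minimal sketch of the arithmetic. It assumes one data bit per SCLK cycle on a single data line and ignores chip-select gaps and driver overhead, so these are ceilings, not measurements. The function name is made up for illustration.

```python
def spi_ceiling_mbytes_per_s(sclk_hz: float) -> float:
    """Upper bound on SPI payload rate in MB/s (10^6 bytes/s).

    Assumes one bit moves per SCLK cycle and nothing else costs time,
    which real spidev transfers will not achieve.
    """
    bits_per_byte = 8
    return sclk_hz / bits_per_byte / 1e6


datasheet_ceiling = spi_ceiling_mbytes_per_s(65e6)  # datasheet figure quoted earlier
observed_ceiling = spi_ceiling_mbytes_per_s(25e6)   # highest SCLK that worked via spidev

print(f"datasheet SCLK ceiling: {datasheet_ceiling:.3f} MB/s")
print(f"observed SCLK ceiling:  {observed_ceiling:.3f} MB/s")
```

Even at the datasheet rate a single SPI channel stays in the single-digit MB/s range, which is why it only suits the FPGA-as-sensor-front-end kind of link.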

Users have claimed to install PyCUDA on their Nano devices but I haven't yet been so lucky; so that investigation is on hold.

What I can do is make an HDL project connecting one of my FPGA boards to the Nano as a demo. Does that raise anyone's interest in adding to this thread?

[edit] I forgot to mention that one other FPGA interface potential is the Ethernet port. Later versions of the Raspberry Pi, the UP^2 and the Jetson boards share this. The Jetson TX2 also offers a 4 lane PCIe connector though using it is not trivial and requires some kernel changes.

D@n · July 20, 2019

@zygot,

Ok, it took me a *long* time to get to reading this post (you posted, and posted again while I was out of town), and ... now that I've had a chance to read the rants above, I'm struggling to figure out what the point is.

You are arguing that there are several options for solving problems, some involve a CPU, some involve an embedded CPU, some involve an FPGA, and that you should pick the best choice for the task. Ok, got that. You are preaching to the choir here.

You say that you can interface to a device I've never heard of before. Neat. Go for it. All power to you.

Did I miss a point in here somewhere?

You mentioned interfacing FPGAs with external CPUs. Been there, done that, it's doable but it's got its problems but ... back to the top--you pick the best choice for the task. Still missing the point.

Dan

zygot · July 21, 2019

@D@n

You know that it's a slow month when the only response to a post is "uh.. is there a reason that you wasted my time reading this?". Did I interpret your reply correctly?

I wasn't trying to rant... it worked out that way as I was recalling frustrations with vendors. Not everything I've said about the Jetson is negative; but what I did mention seems to me to be relevant, if not thorough. The second post just mentioned a current investigation.

So far I'm not really trying to make a point; just trying to engage with people to discuss experiences with the topic. The topic is, I think fairly well bounded. The thread is intended to deal with cobbling a system using off the shelf boards to accomplish a goal with a very limited budget. In the few years that I've been reading the Digilent posts this seems to be a general subject that comes up often. FPGA development boards rarely support everything that you need in order to accomplish a task and cheap SBCs rarely have FPGAs closely coupled to the uController. The same goes for the respective tools.

Personally, my experience with the computer boards is that what you can do with them is rarely what is implied by the sales pitch. If there's a better place discussing this please let me know and I'll go there. I'd much rather be informed and not spend money than buy something and find out that it's impossible or very difficult to use. In this case use refers to the hardware interfaces; because none of them has an FPGA on board. Perhaps you'd rather waste time figuring this out for yourself. I'd rather have a place to discuss experiences, save others wasted time, and perhaps have others do the same for me. When an SBC has an easy to use connector exposing hardware interfaces I kind of expect to be able to use it. Sometimes this is too hard because of the tools. Sometimes it's impossible because the vendor simply doesn't intend to support it. Sometimes you can find answers in a community forum, sometimes you can't.

If you're implying that the reason why I've been, until now, the only one posting to this thread is because there just isn't anyone else doing this kind of thing I won't argue. It won't be the first, or last, time that I've asked questions that don't appear to concern anyone within earshot. I see that the thread has only a limited number of views, and so far, no interesting replies. It may well be that 'my people' live somewhere else. I'm fairly certain that I'm not the only one wedding these kinds of SBCs to an FPGA.

I suspect that now that Intel owns Altera we will see inexpensive off the shelf boards that do it all as far as this thread is concerned. Until then I've got to make do with what's available.

D@n · July 21, 2019

@zygot,

I've only done a couple of projects which included both FPGAs and something else on board. There was one with a Pi and one with an older XScale processor. In both cases, I've been disappointed at the throughput between the processor and the FPGA. If you want, therefore, to do some serious data processing--moving the data around from one part of the board to another needs to be ... shall we say carefully and deliberately engineered.

The ease of moving data around from CPU to FPGA fabric is one of the things I like about some of the more modern FPGA designs that include the CPU on the same die as the FPGA. In the case of microblaze, though, the CPU speed gets crippled to slow it down to FPGA speeds. In the case of the ARM, same thing. The ARM tends to be a fairly fast processor, but it has to slow to a crawl to interact with the FPGA.

This is also one of the reasons why I've been studying AXI recently--to be able to do a better job moving data around at high speeds. Of course, when you see how crippled Xilinx's implementations of AXI are, you start wondering how folks even accomplish anything at "high speed" when using it ... but that's another rant for another day.

Dan

zygot · July 21, 2019

13 hours ago, D@n said:

In both cases, I've been disappointed at the throughput between the processor and the FPGA.

You will see a change in that from Intel sometime soon. Whether or not you can afford to 'play' with it remains to be seen. Check out some of Intel's FPGA division's recent announcements.

Having a tightly coupled processor to 'co-processor' ( for this thread let's define co-processor as FPGA ) is one thing. It's another to be able to handle time critical applications. One nice thing about the ARM based FPGA devices is that you can DMA data between processor memory buffers and the FPGA or the other way around. For the kinds of SBCs that I've mentioned 350+ MB/s using USB 3.0 is as good as it might get, assuming that you are using an x86_64 architecture; good luck finding a driver for aarch64. The problem with USB is that it might offer potentially good data rates but poor responsiveness in terms of latency. The Jetson TX2 offers 4-lane PCIe Gen2 (as I recall but don't hold me to it) as another potential interface. I haven't managed to build a kernel that fixes the PCIe master clock ( who'd think that making this a spread-spectrum clock is a good idea?? ). It's also another thing to be running an OS that has a chance of 'real-time' responsiveness. Again, this is where the ZYNQ ecosystem tends to be ahead to my knowledge.

zygot · July 21, 2019

@D@n

So, perhaps to your chagrin, you've inspired me to make a point for this thread.

There's cutting edge technology and there's affordable technology for budget constrained adventurers like me, and I'm guessing, a lot more people. Recent cell phone offerings provide a view of what's possible for power and size constrained embedded systems. Perhaps, for last year's releases, not that affordable or practical but still amazing. Those advancements don't really help my options directly, but do help pull up the lagging technology that gets made available to me.

The biggest problem that I see, in looking at new offerings ( generally as a hobbyist related product ) is that it's almost impossible to find out what's possible to do with them without actually buying them. I'm not technically naive. I can pretty much figure this out for FPGA development boards and add-on products because I know the tools and devices. This is not the case for the SBCs that I've mentioned so far.

I realise that I have a fairly unique perspective but I'm also convinced that there are a lot of other people pondering the possibilities. Most likely have some experience using the SBCs to develop applications that exploit native hardware and interfaces and involve mostly software development, and are interested in what FPGA platforms might offer as a supplement. I'm more interested in viewing the problem the other way around. Everyone can benefit from sharing experiences and frustrations. Learning from your own mistakes or misjudgements is good. Learning from the mistakes and misjudgements of others is better.

Note that one of the options is ARM based FPGA devices and boards. I'm viewing this as the benchmark. Sometimes it's better than the alternatives but not really affordable for either experimentation or small production runs. Sometimes, an alternative is better; if what you are expecting also happens to be realistic or possible.

That, for now, is as good a point as I can think of for starting the post. All thoughts are welcome, even those suggesting that the thread shouldn't exist.

Anyway, I've decided to see if I can develop a 40 MB/s USB 2.0 FPGA<-->SBC interface that works on my x86_64 PC and the Jetson Nano. Perhaps it's an effort of no interest to anyone but me. Perhaps not. Either way, I'm pretty sure that it's going to get done. Perhaps no one wants to know about issues faced and conquered, perhaps some do.

zygot · August 6, 2019

For anyone interested here is an update on the Jetson Nano USB FPGA project.

I connected a CMOD-A7 35T and Adafruit FT232H breakout board using one of Adafruit's proto boards. It was pretty easy to do. The FT232H was re-programmed (the EEPROM) to be usable in Synchronous 245 FIFO mode in order to obtain the highest data throughput possible. On the OS side you have to make sure that the default VCP drivers aren't used because we need to use the D2XX driver for this mode. On my ageing Win7 PC I was able to transfer 32.5 KB at data rates of 15-21 MB/s. For a relatively small amount of data transfer this was in the expected range. 40 MB/s would be considered the high end. On my PC I was able to get to the 40 MB/s target with 512 KB data payloads though not symmetrically or consistently. Currently my FPGA design only has 64 KB for data storage. The FPGA timestamps activity so that I can get a very good idea of elapsed time to do all of the data transfers as the FPGA sees it. Timing on a PC is, of course, a more complicated analysis.

The first unknown about whether or not the Jetson Nano could replicate these results was compiling an application in C++ using the FTDI D2XX driver API. Fortunately, FTDI does provide a number of ARM versions of this driver. After a bit of reworking to suit Linux C++ development I was able to compile a version of my Win7 application and try some runs on the Jetson Nano.

The way my application and FPGA work is that all transfers are in transactions that consist of 1 control sector, n data sectors, and 1 status sector. A sector is 512 bytes. The control sector is always up to the FPGA and the status sector is always down to the PC application. My initial tests were a bit disappointing as 0 data sector transactions up or down always worked. 1 data sector up or down usually worked. When the amount of data exceeded the 1 KB FT232H internal FIFO (1 data sector plus either a control or status sector) the application failed consistently. The first test involved plugging the FT232H module into one of the unused Jetson Nano USB 3.0 ports; a keyboard and mouse also occupied 2 of the 4 ports. One possible explanation for the disappointing results might be the extra power of the USB attached devices on a system with a constrained power budget. I repeated the tests with all USB devices attached to a powered USB 3.0 HUB. Results were the same. My expectation is that I will have better success on the UP^2 SBC and will try that out.
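The sector arithmetic described above can be sketched as follows. This is only an illustration of the sizes involved; the contents of the control and status sectors are not described in the post, so they are not modelled, and the function names are hypothetical.

```python
SECTOR_BYTES = 512        # sector size stated in the post
FT232H_FIFO_BYTES = 1024  # 1 KB FT232H internal FIFO, as stated in the post


def payload_bytes(n_data_sectors: int) -> int:
    """Bytes of user data carried by a transaction with n data sectors."""
    return n_data_sectors * SECTOR_BYTES


def bytes_one_direction(n_data_sectors: int) -> int:
    """Data sectors plus the single control (up) or status (down) sector
    that travels in the same direction as the data."""
    return (n_data_sectors + 1) * SECTOR_BYTES


def fits_in_fifo(n_data_sectors: int) -> bool:
    """True if the data plus its control/status sector fit in the 1 KB FIFO."""
    return bytes_one_direction(n_data_sectors) <= FT232H_FIFO_BYTES


for n in (0, 1, 2, 127):
    print(n, payload_bytes(n), bytes_one_direction(n), fits_in_fifo(n))
```

Read this way, the failures began at exactly the point where a transfer could no longer be absorbed by the FT232H FIFO alone and the host driver had to keep up with the stream.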

zygot · August 17, 2019

My (currently) favourite inexpensive prototyping FPGA board is the DE0 Nano with a Cyclone IV device and a 32 MB SDR SDRAM. It is a little better suited to this project because of the external memory size. I was able to use the same basic code that I used for the CMOD-A7 to test larger transactions.

Performing a transaction uploading 2047 data sectors (4192256 payload bytes) to the DE0 Nano SDRAM and then downloading it resulted in no errors. I only have a measurement for the download; it averaged 42 MB/s.

I hope to try out a test on the UP Squared board later this week and will report. I still haven't figured out why the Jetson Nano couldn't transfer data sectors reliably. It certainly isn't an issue of available memory.

zygot · August 21, 2019

As promised I ported my PC test application to run on Linux Mint 18 64-bit. The platform is an Up Squared SBC with a Pentium N4200 and 8 GB of memory. This platform is still within a 15W power envelope useful for embedded projects. My Windows PC development application was C++ though the only C++ functionality was using streams for file IO. I had a devilish time trying to port it to Mint so I ended up just making it into a C program and having no file IO.

Using the DE0 Nano I was able to send data up and down error free at average data transfer rates > 40 MB/s consistently. Curiously, past 256KB data payloads I observed download rates fall below 15 MB/s though upload rates stayed above 40 MB/s. This is definitely not consistent with what my experience has been with PC platforms. As all application data storage is kept in memory I assume that this is a bottleneck on this board.

My FPGA design counts the 60 MHz clocks that the state machine spends in the data download and upload states so the rates are extremely accurate, from the point of view of the FPGA. Timing and other instrumentation is reported to the host application in the Status Sector for every transaction. From the point of view of the SBC there are a lot of factors diminishing those rates.
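Converting such a clock count into a data rate is simple arithmetic; a minimal sketch follows. The clock count in the example is a made-up value for illustration, not one reported by the FPGA, and MB is taken to mean 10^6 bytes, which is an assumption about the figures quoted in this thread. If the FIFO interface moves at most one byte per 60 MHz clock (also an assumption here), the ceiling is 60 MB/s.

```python
FPGA_CLOCK_HZ = 60_000_000  # clock counted by the FPGA state machine


def rate_mb_per_s(n_bytes: int, clock_count: int) -> float:
    """Data rate in MB/s (10^6 bytes/s) for n_bytes moved in clock_count clocks."""
    seconds = clock_count / FPGA_CLOCK_HZ
    return n_bytes / seconds / 1e6


# Hypothetical example: 127 data sectors (65024 bytes) taking 90,000 clocks.
example_rate = rate_mb_per_s(127 * 512, 90_000)
print(f"example rate: {example_rate:.3f} MB/s")

# One byte on every clock gives the ceiling for a byte-wide interface.
ceiling = rate_mb_per_s(FPGA_CLOCK_HZ, FPGA_CLOCK_HZ)
print(f"ceiling: {ceiling:.1f} MB/s")
```

Because the count covers only the clocks spent in the upload and download states, this rate excludes whatever time the host spends between transactions.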

Anyway, I've shown that it's possible to cheaply combine an FPGA and a cheap SBC on at least one platform with a relatively high speed interface. Overall performance suffers from platform dependent factors; so experimenters should not make assumptions about expected performance for a particular application.

zygot · August 22, 2019

Another update.

I decided to try the DE0 Nano with the Jetson Nano since the code and application have evolved a bit since the CMOD-A7 versions.

                             Win7 PC (MB/s)        Up^2 N4200 (MB/s)       Jetson Nano (MB/s)
Data Sectors  Payload Bytes  Upload     Download   Upload      Download    Upload      Download
127           65024          43.1471    42.1225    42.853889   45.721241   35.106934   38.934151
511           261632         42.1225    42.7433    42.979492   46.239983   35.065121   38.884438
767           392704         42.0514    42.7561    42.898247   20.149193   33.739391   16.723478
1023          523776         41.9084    42.7521    42.931019   12.217526   35.086201   10.197824
2047          1048064        40.4817    42.7381    42.907658   9.832075    35.111324   7.109247
4095          2096640        42.0431    37.316
6143          3145216        41.3916    39.1262

As you can see, the problems previously reported for the Jetson Nano were likely due to compilation issues.

I've made no attempt at a threaded application which might help. The drop-off in data rates for higher payloads might be due to Linux driver behavior. The Jetson is aarch64; the UP Squared is x86_64.

The data rates for the UP^2 at or below 256KB payloads are unexpectedly high. All rates are based on the total time that data was being sent or received by the FPGA and measured in 60 MHz clock periods.
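To pin down where the download rates fall off, the figures from the table above can be scanned for the first large drop. This is a small sketch; the "less than half the previous rate" threshold is an arbitrary choice for illustration, and the helper name is hypothetical.

```python
SECTOR_BYTES = 512

# (data sectors, download MB/s), copied from the table above
DOWNLOAD = {
    "Up^2 N4200": [(127, 45.721241), (511, 46.239983), (767, 20.149193),
                   (1023, 12.217526), (2047, 9.832075)],
    "Jetson Nano": [(127, 38.934151), (511, 38.884438), (767, 16.723478),
                    (1023, 10.197824), (2047, 7.109247)],
}


def first_collapse(rows, factor=0.5):
    """Return (last good sector count, first slow sector count) for the first
    row whose rate is below factor times the previous row's rate, else None."""
    for (prev_n, prev_rate), (n, rate) in zip(rows, rows[1:]):
        if rate < factor * prev_rate:
            return prev_n, n
    return None


for board, rows in DOWNLOAD.items():
    hit = first_collapse(rows)
    if hit:
        good, slow = hit
        print(f"{board}: fine at {good * SECTOR_BYTES} bytes, "
              f"slow at {slow * SECTOR_BYTES} bytes")
```

Both Linux boards change behavior between the same two rows, that is, somewhere between 261632 and 392704 payload bytes, which fits the driver-behavior guess better than a board-specific memory bottleneck.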

zygot · August 27, 2019

I was able to install the FTDI D2xx driver onto the Raspberry Pi 4 running the July 10 release of Raspbian Buster and 4 GB memory.

Data Sectors  Upload MB/s  Download MB/s
127           31.815292    39.139751
511           31.967712    38.767433
767           32.060654    12.824258
1023          31.880133    9.921153
2047          31.963524    9.567891

The compiled application had a few warnings that I ignored and didn't printf the correct payload bytes or times but the data rates look reasonable so I'm reporting them here. Frankly, I wasn't expecting to be able to run a test on the Raspberry Pi. Those A72 ARM cores definitely have more of a kick than the Raspberry Pi 3 cores.
