How can I minimize the usage of LUTs?

House · December 10, 2023

Hi everyone, I´m new to FPGA development and I´ve been trying to implement an complex number divider following the formula

Screenshot2023-12-09234738.png.cc4b3e4ece12aeecb82853404916efdc.png
to do that I´ve done multiple attempts as , using 2 sequential dividers and 2 sequential multipliers, a combinational divider and the two sequential dividers, and for last one sequential divider and a combinational multiplier.
In all the cases referred before I end up using about 1300 LUTs, when I was aiming to 900/1000, where theoretically i would only use about 800 with the modules (180 for seq multiplier, 207 for seq divider , 400 for comb multiplier).
I´ll leave the implementation where i use one combinational multiplier and two sequential dividers so if any of you see something that might be leading to the problems im facing , ill be very thankful.

module cpxdiv_4(
         input        clock,
         input        reset,
         input        run,
         input [15:0] ReA,
         input [15:0] ImA,
         input [15:0] ReB,
         input [15:0] ImB,
         output[31:0] ReY,
         output[31:0] ImY,
         output       busy
             );
wire busy2;

reg [31:0] temp_reg1, temp_reg2, temp_reg3;

reg [31:0] sum;

reg run2;

reg [31:0] dividend_re, dividend_im;

reg [15:0] divisor;


//counter_var
reg state;
reg [7:0] counter;

parameter IDLE = 0,
          RUN  = 1;

psddivide_top
      psddivide_top_1
      (
      .clock(clock),
      .reset(reset),
      .run(run2),
      .busy(busy1),
      .dividend(dividend_re),
      .divisor(divisor),
      .quotient(ReY),
      .rest()
      );

psddivide_top
      psddivide_top_2
      (
      .clock(clock),
      .reset(reset),
      .run(run2),
      .busy(busy2),
      .dividend(dividend_im),
      .divisor(divisor),
      .quotient(ImY),
      .rest()
      );



// Clock cycle counter implementation
always @(posedge clock)
begin
  if ( reset )  // Synchronous reset, active high
  begin
      state <= IDLE;
      counter <= 8'd0;
  end
  else
  begin
    case ( state )
        IDLE: if ( run )
              begin
                    state <= RUN;
                    counter <= 8'd1;  // this is the start value show in the timing diagram
            end
        RUN:  if ( counter == 8'd46 )  // last clock cycle
                begin
                    counter <= 8'd0;
                    state <= IDLE;
              end
            else
              begin
                counter<=counter+1;
              end
     endcase
  end
end


always @(posedge clock)
begin
                      case(counter)
                  
                  9: run2 <= 1;

                  11: run2 <= 0;
                  endcase

end


always @*
begin

      case(counter)
            
      1: temp_reg1<= fxpmult(ReA,ReB);

      2: temp_reg2<= fxpmult(ImA,ImB);
      
      3:begin

         dividend_re <= temp_reg1 + temp_reg2;
         temp_reg3<=fxpmult(ReB,ReB);
      
         end
      
      4:temp_reg1<= fxpmult(ImB,ImB);
      5:begin

         sum <= temp_reg1 + temp_reg3;
         temp_reg2<=fxpmult(ReB,ImA);
      
         end
      
      6: begin
          temp_reg3<=fxpmult(-ReA,ImB);
          divisor<=sum[31:16];
         end
      7:dividend_im<=temp_reg2+temp_reg3;

      endcase

end



assign busy = ( counter >= 8'd1 && counter <= 8'd50 );

function [31:0] fxpmult (
                          input[15:0] A,
                          input[15:0] B
                                    );
reg [15:0] absA;
reg [15:0] absB;
reg [31:0] P;
reg sP;
begin
  sP = A[15] ^ B[15];
  absA = A[15] ? -A : A;
  absB = B[15] ? -B : B;
  P = absA * absB;
  fxpmult = sP ? -P : P;
end
endfunction

endmodule

D@n · December 10, 2023

@House,

Have you considered how much logic a multiple requires? How about a divide? There are no shortcuts to division, and the low-level hardware guys haven't provided many short cuts for multiplication. Both are very computationally intensive components. They'll both use a lot of LUTs, and likely keep you from meeting timing.

The solution to the expensive multiplies is to use a DSP element. Those, however, have requirements to them. You have to use them "right", or ... they'll get turned into LUTs and you won't meet timing (or area) again.

My general rule of thumb when building multiplies using DSP elements is to take no more than one clock per multiply.

always @(posedge clk)
if (CE)
  product <= A * B;

If you use any more logic than that in an always block, you're likely to break Vivado's algorithm that packs multiplication logic into DSPs.

When not using DSP elements, I take (roughly) one clock per bit. You can read about my algorithm here.

While the tricks I used to make the multiply use less logic work for divides, I know of no other tricks to make divides work. You're roughly stuck with one clock per bit, and no hardware acceleration.

Other things can be made cheaper, but those don't appear to be your problem at present.

Dan

BMiller · December 16, 2023

Euphoniously, I just got notified about this paper, which purports to give an algorithm for multipliers that use fewer resources than others. Perhaps it will help?

https://www.nature.com/articles/s41598-023-49913-5

zygot · December 16, 2023

The initial question is incomplete, because it's too general.
For what it's worth here's my reaction to the question.

In most complex logic algorithms implemented in logic there are always at least three competing goals to balance:
1) Resource usage
2) throughput speed
3) accuracy

So, for this problem, we need to define a few things.
- what is the number format? Integer, fixed point? floating point? block floating point? something else?
- do you need to instantiate large numbers of the IP?
- how accurate must the results be?
- for something like a divide algorithm, is the quotient or the remainder ( or both ) what you need?
- is there a maximum compute time?
- are there resource budgetary restrictions?

That list isn't meant to be complete, but you get the idea... you need some project specifications to decide what the best approach to your solution will be.

Hard multipliers ( these are included in DSP resources these days ), offer potential high speed and reasonable dynamic range but can have serious timing issues depending on the connection topology when multiple DSP slices are required.

An iterative algorithm might use few resources, but have insufficient data bandwidth for a particular application.

Often BRAM used as a look-up table can help speed up the algorithm if used cleverly.

Of course the fastest and lowest LUT usage way to do a divide ( by a constant ) is to create a constant. So n/10 is the same as n * 0.1. This approach can't be applied indiscriminately of course, but often a general complex calculation can be reduced into a simpler, but more constrained solution by knowing what's important and what's not for a particular application.

There just isn't going to be one solution to implementing the calculation presented that works for all applications equally well. That's the nature of logic design. Edited December 16, 2023 by zygot

Sign In

How can I minimize the usage of LUTs?

Question

House

Link to comment

Share on other sites

3 answers to this question

Recommended Posts

D@n

Link to comment

Share on other sites

BMiller

Link to comment

Share on other sites

zygot

Link to comment

Share on other sites

Create an account or sign in to comment

Create an account

Sign in

Browse

Activity