

Architecting Speed in FPGA

 Hello Dear Readers, 

Today in this post we will discuss how to architect for speed in an FPGA by writing efficient RTL code.

Sophisticated tool optimizations are often not good enough to meet most design constraints if an arbitrary coding style is used. Here we will discuss the first of the three primary physical characteristics of a digital design, speed, and also discuss methods for architectural optimization in an FPGA.

There are three primary definitions of speed depending on the context of the problem: throughput, latency, and timing. In the context of processing data in an FPGA, throughput refers to the amount of data that is processed per clock cycle. A common metric for throughput is bits per second. Latency refers to the time between data input and processed data output. The typical metric for latency is time or clock cycles. Timing refers to the logic delays between sequential elements. When we say a design does not “meet timing,” we mean that the delay of the critical path, that is, the largest delay between flip-flops (composed of combinatorial delay, clk-to-out delay, routing delay, setup time, clock skew, and so on), is greater than the target clock period. The standard metrics for timing are clock period and frequency.
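To make these three metrics concrete, here is a small arithmetic sketch in Python. The numbers (a 100 MHz clock, one 8-bit word per clock, a 3-clock latency) are purely illustrative assumptions, not taken from any particular device:

```python
# Illustrative numbers only -- assumptions for the sake of example.
clock_freq_hz = 100e6          # 100 MHz system clock
bits_per_clock = 8             # one 8-bit word accepted every clock cycle
latency_clocks = 3             # 3 clock cycles from input to output

throughput_bps = bits_per_clock * clock_freq_hz   # steady-state data rate, bits/s
clock_period_ns = 1e9 / clock_freq_hz             # timing budget per cycle, ns
latency_ns = latency_clocks * clock_period_ns     # input-to-output delay, ns
```

With these assumed numbers, the design moves 800 Mbit/s in steady state, every critical path must fit in a 10 ns clock period, and each datum takes 30 ns to emerge.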

1). HIGH THROUGHPUT:

A high-throughput design is one that is concerned with the steady-state data rate but less concerned about the time any specific piece of data requires to propagate through the design (latency). The idea with a high-throughput design is the same idea Ford came up with to manufacture automobiles in great quantities: an assembly line. In the world of digital design where data is processed, we refer to this under a more abstract term: pipeline.

A pipelined design conceptually works very similarly to an assembly line in that the raw material or data input enters the front end, is passed through various stages of manipulation and processing, and then exits as a finished product or data output. The beauty of a pipelined design is that new data can begin processing before the prior data has finished, much like cars are processed on an assembly line. Pipelines are used in nearly all very high-performance devices, and the variety of specific architectures is unlimited. Examples include CPU instruction sets, network protocol stacks, encryption engines, and so on.

From an algorithmic perspective, an important concept in a pipelined design is that of “unrolling the loop.” As an example, consider the following piece of code that would most likely be used in a software implementation for finding the third power of X. Note that the term “software” here refers to code that is targeted at a set of procedural instructions executed on a microprocessor.

XPower = 1;
for (i = 0; i < 3; i++)
    XPower = X * XPower;

Note that the above code is an iterative algorithm. The same variables and addresses are accessed until the computation is complete. There is no use for parallelism because a microprocessor only executes one instruction at a time (for the purpose of argument, just consider a single-core processor). A similar implementation can be created in hardware. Consider the following Verilog implementation of the same algorithm.

module power3(
    output reg [7:0] XPower,
    output finished,
    input [7:0] X,
    input clk, start); // the duration of start is a single clock

    reg [7:0] ncount;

    assign finished = (ncount == 0);

    always @(posedge clk)
        if (start) begin
            XPower <= X;
            ncount <= 2;
        end
        else if (!finished) begin
            ncount <= ncount - 1;
            XPower <= XPower * X;
        end

endmodule

With this type of iterative implementation, no new computations can begin until the previous computation has been completed. This iterative scheme is very similar to a software implementation. Also, note that certain handshaking signals are required to indicate the beginning and completion of a computation. An external module must also use handshaking to pass new data to the module and receive a completed calculation. The performance of this implementation is 

Throughput = 8/3, or 2.7 bits/clock 

Latency = 3 clocks 

Timing = One multiplier delay in the critical path
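To make the cycle accounting concrete, here is a small clock-by-clock model in Python (a behavioral sketch, not HDL): start loads X and a count of 2, and each subsequent clock performs one registered multiply until finished, just as in the module above.

```python
def iterative_power3(x, width=8):
    """Clock-by-clock model of the iterative power3 module.

    Returns (result, clocks_used). Arithmetic wraps at `width` bits,
    as it would in the 8-bit Verilog registers.
    """
    mask = (1 << width) - 1
    xpower = x & mask        # start clock: XPower <= X, ncount <= 2
    ncount = 2
    clocks = 1
    while ncount != 0:       # while not finished: one multiply per clock
        xpower = (xpower * x) & mask
        ncount -= 1
        clocks += 1
    return xpower, clocks

# One computation occupies 3 clocks, so no new input can be
# accepted sooner than every third clock: 8 bits / 3 clocks.
result, clocks = iterative_power3(3)   # result = 27, clocks = 3
```

The model also shows the wrap-around behavior of the 8-bit registers: any input whose cube exceeds 255 returns the truncated value, exactly as the hardware would.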

Contrast this with a pipelined version of the same algorithm:

module power3(
    output reg [7:0] XPower,
    input clk,
    input [7:0] X);

    reg [7:0] XPower1, XPower2;
    reg [7:0] X1, X2;

    always @(posedge clk) begin
        // Pipeline stage 1
        X1      <= X;
        XPower1 <= X;
        // Pipeline stage 2
        X2      <= X1;
        XPower2 <= XPower1 * X1;
        // Pipeline stage 3
        XPower  <= XPower2 * X2;
    end

endmodule

In the above implementation, the value of X is registered along with each pipeline stage, and independent resources compute the multiply operation for each stage. Note that while one value of X is having its final power of 3 computed in the last stage, the next value of X is already being processed in the earlier stages. The final calculation for one input (the XPower multiplier) and the intermediate calculation for the next input (the XPower2 multiplier) occur simultaneously. The performance of this design is

Throughput = 8/1, or 8 bits/clock 

Latency = 3 clocks 

Timing = One multiplier delay in the critical path

The throughput performance increased by a factor of 3 over the iterative implementation. In general, if an algorithm requiring n iterative loops is “unrolled,” the pipelined implementation will exhibit a throughput performance increase of a factor of n. There was no penalty in terms of latency as the pipelined implementation still required 3 clocks to propagate the final computation. Likewise, there was no timing penalty as the critical path still contained only one multiplier. Unrolling an iterative loop increases throughput. The penalty to pay for unrolling loops such as this is an increase in area. The iterative implementation required a single register and multiplier, whereas the pipelined implementation required a separate register for both X and XPower and a separate multiplier for every pipeline stage.
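The one-result-per-clock behavior can be seen in a clock-by-clock Python model of the pipeline (again a behavioral sketch, not HDL). The register update mimics non-blocking assignment: every next-state value is computed from the current state before any register changes.

```python
def pipelined_power3(inputs, width=8):
    """Simulate the 3-stage pipelined power3 on a stream of inputs.

    One new X enters every clock; after a 3-clock fill, one
    X**3 (modulo 2**width) emerges per clock.
    """
    mask = (1 << width) - 1
    x1 = xp1 = x2 = xp2 = xp = 0       # pipeline registers, reset to 0
    outputs = []
    for x in inputs + [0, 0]:          # 2 extra clocks to flush the pipe
        # Non-blocking semantics: next state from current state.
        xp_n  = (xp2 * x2) & mask      # stage 3
        xp2_n = (xp1 * x1) & mask      # stage 2
        x2_n  = x1
        xp1_n = x & mask               # stage 1
        x1_n  = x & mask
        xp, xp2, x2, xp1, x1 = xp_n, xp2_n, x2_n, xp1_n, x1_n
        outputs.append(xp)
    return outputs[2:]                 # first result registered on clock 3

# Five inputs finish in 5 + 2 clocks instead of 5 * 3:
# pipelined_power3([1, 2, 3, 4, 5]) -> [1, 8, 27, 64, 125]
```

Each result is registered three clock edges after its input is sampled (the 3-clock latency), but a new result appears on every edge thereafter, which is where the factor-of-3 throughput gain comes from.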


2). LOW LATENCY:

A low-latency design is one that passes the data from the input to the output as quickly as possible by minimizing the intermediate processing delays. Oftentimes, a low-latency design will require parallelism, removal of pipelining, and logical shortcuts that may reduce the throughput or the maximum clock speed of a design.

Referring back to our power-of-3 example, there is no obvious latency optimization to be made to the iterative implementation as each successive multiply operation must be registered for the next operation. The pipelined implementation, however, has a clear path to reducing latency. Note that at each pipeline stage, the product of each multiply must wait until the next clock edge before it is propagated to the next stage. By removing the pipeline registers, we can minimize the input to output timing: 

module power3(
    output [7:0] XPower,
    input [7:0] X);

    reg [7:0] XPower1, XPower2;
    reg [7:0] X1, X2;

    assign XPower = XPower2 * X2;

    always @* begin
        // Stage 1 (combinatorial)
        X1      = X;
        XPower1 = X;
    end

    always @* begin
        // Stage 2 (combinatorial)
        X2      = X1;
        XPower2 = XPower1 * X1;
    end

endmodule

In the above example, the registers were stripped out of the pipeline. Each stage is a combinatorial expression of the previous. The performance of this design is 

Throughput = 8 bits/clock (assuming one new input per clock) 

Latency = Between one and two multiplier delays, 0 clocks 

Timing = Two multiplier delays in the critical path 

By removing the pipeline registers, we have reduced the latency of this design to below a single clock cycle. Latency can be reduced by removing pipeline registers. The penalty is clearly in the timing. The previous implementations could theoretically run with a clock period close to the delay of a single multiplier, but in the low-latency implementation the clock period must be at least two multiplier delays (depending on the implementation) plus any external logic delay in the critical path. The penalty for removing pipeline registers is an increase in combinatorial delay between registers.


3). TIMING:

Timing refers to the clock speed of a design. The maximum delay between any two sequential elements in a design determines the maximum clock speed. The idea of clock speed exists at a lower level of abstraction than the speed/area trade-offs discussed above, as clock speed, in general, is not directly related to these topologies, although trade-offs within these architectures will certainly have an impact on timing. For example, one cannot know whether a pipelined topology will run faster than an iterative one without knowing the details of the implementation. The maximum speed, or maximum frequency, can be defined according to the straightforward and well-known maximum-frequency equation (ignoring clock-to-clock jitter):

Fmax = 1 / (Tclk-q + Tlogic + Trouting + Tsetup - Tskew)

where Fmax is maximum allowable frequency for clock; Tclk-q is the time from clock arrival until data arrives at Q; Tlogic is propagation delay through logic between flip-flops; Trouting is routing delay between flip-flops; Tsetup is minimum time data must arrive at D before the next rising edge of clock (setup time); and Tskew is the propagation delay of the clock between the launch flip-flop and the capture flip-flop.
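Plugging some delay values into this equation gives a feel for the timing budget. The numbers below are purely illustrative assumptions, not data for any real device:

```python
# Illustrative delay values in nanoseconds -- assumptions, not device data.
t_clk_q   = 0.5   # clock-to-Q delay of the launching flip-flop
t_logic   = 4.0   # combinatorial logic delay between flip-flops
t_routing = 2.0   # routing delay between flip-flops
t_setup   = 0.5   # setup time of the capturing flip-flop
t_skew    = 0.2   # clock skew between launch and capture flip-flops

# Minimum clock period: Tclk >= Tclk-q + Tlogic + Trouting + Tsetup - Tskew
t_min_ns = t_clk_q + t_logic + t_routing + t_setup - t_skew   # 6.8 ns
fmax_mhz = 1e3 / t_min_ns                                     # about 147 MHz
```

Doubling Tlogic, for example by putting two multiplier delays in the critical path as in the low-latency version above, directly raises the minimum period and lowers the achievable Fmax.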


Connect with me 




