Spring '00 Project: Matrix Multiply Revisited

Objective

You are to implement the matrix multiply defined in Lab #8 in 12 clock cycles or fewer, operating at a clock frequency of not less than 25 MHz. This will illustrate the resource versus time tradeoff that is present in all digital system designs.

To Do

You are to implement the matrix multiply defined in Lab #8 in 12 clock cycles or fewer, operating at a clock frequency of not less than 25 MHz. There are no resource limitations; use as many multipliers, adders, registers, RAMs, etc. as you want. The device must be the EPF10K20RC240-4 (Flex 10K, note the speed grade!). The interface to the design has been altered to define two modes of operation: continuous mode and normal mode.

The new interface for mmult is defined as follows:

Inputs

clk, reset - clock and asynchronous reset
din[8..0] - data bus for both 9-bit matrix coefficients and 8-bit coordinates
cf_load - used to load the 16 matrix coefficients
start - used to input the 4 coordinate values and start a computation

Outputs

dout[7..0] - output bus for transformed coordinates
busy - asserted when either coefficient values are being loaded or a matrix operation is being performed
output_rdy - asserted when either coeffs or dout contains valid data
input_rdy - asserted when ready for new coordinate data

There are two main changes from the previous implementation:

The cf_dump input has been dropped. You do not have to be able to output the matrix coefficients. This was only included for debugging purposes last time. This should help reduce the number of needed states in your FSM.
The operation of the start input has been expanded to define two modes of operation: normal mode and continuous mode. In normal mode, the start line is asserted for one clock cycle, and the din bus is assumed to hold the X coordinate, with the Y, Z, W values following on successive clock cycles. After the computation is finished, the ASIC returns to waiting for either start or cf_load assertion. In continuous mode, the start line is held high. After the first computation, the ASIC asserts the input_rdy line, indicating that it is ready for another coordinate. The X, Y, Z, W values should follow on four successive clock cycles on the din bus after the assertion of input_rdy. To exit continuous mode, the start line must be negated at most two clock cycles after input_rdy has been asserted. The ASIC will finish the matrix multiply for the coordinates just entered, and then return to waiting for assertion of either start or cf_load. While the ASIC is waiting for assertion of either start or cf_load, the input_rdy output should be asserted.

Clock Cycle Constraint

The clock cycle constraint of 12 clocks is measured between falling edges of the input_rdy output while in continuous mode (this number of clocks defines the initiation rate of the design). The smallest possible number of clocks is 4; this represents a continuous stream of X,Y,Z,W values on the DIN bus.
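The measurement described above can be sketched in software. This is only an illustrative Python model, not part of the testbench: it assumes input_rdy has been sampled once per clock into a list of 0/1 values, and the function names and the sample trace are hypothetical.

```python
def falling_edges(samples):
    """Return the clock indices at which a sampled signal goes from 1 to 0."""
    return [i for i in range(1, len(samples))
            if samples[i - 1] == 1 and samples[i] == 0]


def initiation_rates(input_rdy):
    """Number of clocks between successive falling edges of input_rdy."""
    edges = falling_edges(input_rdy)
    return [later - earlier for earlier, later in zip(edges, edges[1:])]


# Hypothetical trace: input_rdy high for 1 clock, then low for 10 clocks,
# repeated three times, followed by one final high sample.
trace = ([1] + [0] * 10) * 3 + [1]

rates = initiation_rates(trace)
print(rates)                          # [11, 11]
print(all(r <= 12 for r in rates))    # True: meets the 12-clock constraint
```

The same count can be read directly off a waveform by placing cursors on two consecutive falling edges of input_rdy and dividing the time difference by the clock period.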

Extra Points

You can earn extra points in two ways:

For every MHz above 25 MHz, you earn 1 extra point. There is no maximum to this bonus.
For every clock cycle below 12 for the initiation rate, you earn 2 extra points. You earn a 10 point bonus for an initiation rate = 4, for a maximum bonus of 20 points.

The bonus points are added directly to the point total of your tests. IF you earn the MAXIMUM bonus for #2, you have the option of keeping all of the bonus points or dropping one test grade (the final exam does not count as a test grade).

Extra points will ONLY be awarded if you meet all functionality specs and timing specs (i.e., a design that fails functionality tests, but fails them fast at 100 MHz, still gets 0 extra points).

Hints

There are 16 multiply-add operations. A single multiply-add block would need at least 16 clocks, so you must have at least 2 multiply-add blocks in order to perform the multiply in 12 clock cycles or fewer.
The more resources you use, the less complex your finite state machine will be.
An alternate way to do the computation is in column order. Once you have X, you can perform the X*T00, X*T04, X*T08, X*T12 calculations in parallel instead of waiting for Y, Z, W and doing the X*T00, Y*T01, etc. calculations.
I doubt that you can get 25 MHz operation without putting a register between your multiplier and saturating adder; you may even have to pipeline your multiplier.
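The first and third hints can be checked with a small software model. The Python sketch below is an illustration under stated assumptions, not the Lab #8 specification: it assumes the 16 coefficients are indexed row-major as T00..T15 (consistent with the X*T00, Y*T01 ordering above), and it uses plain integers, so the 9-bit coefficient format and the saturating adder are NOT modeled. All function names are hypothetical.

```python
import math


def min_mac_blocks(num_macs=16, max_cycles=12):
    """Lower bound on multiply-add blocks, ignoring pipeline latency."""
    return math.ceil(num_macs / max_cycles)


def row_order(T, v):
    """Each output is computed only after all of X, Y, Z, W are available."""
    return [sum(T[4 * r + c] * v[c] for c in range(4)) for r in range(4)]


def column_order(T, v):
    """As each coordinate arrives, update all four accumulators at once."""
    acc = [0, 0, 0, 0]
    for c, coord in enumerate(v):      # coordinates arrive as X, Y, Z, W
        for r in range(4):             # four multiply-adds that can run in parallel
            acc[r] += T[4 * r + c] * coord
    return acc


T = list(range(16))                    # hypothetical coefficients T00..T15
v = [1, 2, 3, 4]                       # hypothetical X, Y, Z, W

print(min_mac_blocks())                # 2
print(row_order(T, v))                 # [20, 60, 100, 140]
print(column_order(T, v) == row_order(T, v))   # True
```

In both orderings each output accumulates its X term first, then Y, Z, W, so the per-output sequence of additions is the same; only the point in time at which each product can start differs.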

Sample Waveforms

This waveform file (mmslow.scf) shows a solution that has an 11 clock initiation rate. The computation at 20 us shows the continuous mode of operation; the computation at 38 us shows the normal mode. This waveform file (mmfast.scf) is a solution for a 4 clock initiation rate. Again, both continuous and normal operation modes are demonstrated.

Testbench

This schematic (tbproj.gdf) is a testbench that MUST BE USED to demonstrate the final checkoff of your design. Your register-to-register timing check using the testbench must exceed 25 MHz; however, I will use the register-to-register timing check on your mmult design (minus the testbench) for assigning bonus points.

The testbench has been designed to run on the Altera UP1 FPGA board. It uses the pushbuttons PB1 and PB2 to control operation: PB1 starts a computation sequence that loads in a coefficient matrix and then computes 64 matrix multiplies using the continuous operation mode, and PB2 resets the FSM to its initial state. A 14-bit Linear Feedback Shift Register (LFSR) provides a pseudo-random number stream that is used for all data inputs. The input switches SW7-SW3 can be used to alter the initial value used by the 14-bit LFSR; this will cause a different pseudo-random number stream and thus a different checksum value.

An 8-bit XOR-checksum register captures all dout data values whenever output_rdy is asserted. Its output is displayed via the two 7-segment displays on the Altera UP1 board. The testbench FSM has been designed such that input values and output values are supplied using the input_rdy and output_rdy handshaking signals. This means that all designs, regardless of the initiation rate, will give the same checksum value after the 64 matrix multiplies have finished.
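To see why the checksum does not depend on the initiation rate, here is a rough Python model of the two testbench building blocks. It is a sketch only: the tap positions, shift direction, and seed handling are ASSUMPTIONS chosen for illustration and are not taken from tbfsm.vhd or tbproj.gdf, so this model will not reproduce the real checksum values.

```python
from functools import reduce

WIDTH = 14
MASK = (1 << WIDTH) - 1
TAPS = (13, 12, 11, 1)   # ASSUMED tap bit positions, not the testbench's


def lfsr_step(state):
    """Shift left by one bit, inserting the XOR of the tapped bits at bit 0."""
    feedback = 0
    for tap in TAPS:
        feedback ^= (state >> tap) & 1
    return ((state << 1) | feedback) & MASK


def lfsr_stream(seed, count):
    """Return the next `count` 14-bit states after a hypothetical seed."""
    state = seed & MASK
    values = []
    for _ in range(count):
        state = lfsr_step(state)
        values.append(state)
    return values


def xor_checksum(values):
    """8-bit XOR accumulation, as done by the checksum register."""
    return reduce(lambda acc, value: (acc ^ value) & 0xFF, values, 0)


def checksum_from_trace(trace):
    """trace holds one (output_rdy, dout) pair per clock; capture only when ready."""
    return xor_checksum(dout for ready, dout in trace if ready)


# Hypothetical fast and slow designs producing the same two results.
fast = [(1, 0x12), (1, 0x34)]
slow = [(1, 0x12), (0, 0xFF), (0, 0xAA), (1, 0x34)]
print(hex(checksum_from_trace(fast)))   # 0x26
print(hex(checksum_from_trace(slow)))   # 0x26
```

Because dout is captured only while output_rdy is asserted, the idle clocks in the slower trace contribute nothing to the checksum.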

The testbench uses approximately 6% of the Flex10K20 resources and cannot be altered.

These two sample waveform files, tslow.scf and tbfast.scf, show testbench simulations for designs with initiation_rate = 11 and initiation_rate = 4. The bus labeled accout is the 8-bit XOR register checksum value. You will note that for the slow design, the first computation cycle (64 matrix multiplies) runs from about 50 us to 320 us, with a final checksum value of E7. The SW inputs are then changed to alter the initial value of the LFSR, and the 2nd computation cycle gives a checksum of 3F. The fast design's first computation cycle runs from 50 us to only 150 us, and gives the same checksum of E7. The 2nd computation cycle for the fast design also gives a checksum of 3F.

The files you need for the testbench are: tbproj.gdf (testbench schematic), tbfsm.vhd, and tbproj.acf (this file gives the pin number definitions needed for the Altera UP1 board).

One of the tests that I will perform on your project submission is to download it to the Altera UP1 board and see if it works. You would be wise to do the same before submission; I will have boards available for checkout and there will also be a board in the Digital Systems workstation area.

Project Submission

I will place a script called "submit_cadproject.pl" in the /home/reese/bin directory on Leto (download the script submit_cadproject.pl if you are at a remote site). You will need to create a directory called "project", and copy all of your files to that directory. You will then need to execute the program "submit_cadproject.pl". This script will remove unnecessary files from that directory, then zip it up, and email it to me. You must make sure that ALL files that I will need to compile and simulate your design are included. If you leave out any files such that I have to contact you after the submission date, then I will deduct 5% from your final project grade.

Academic Dishonesty

I will be grading all of the projects. You MUST do your own work; treat this as a take home test. I will treat any cases of project copying extremely harshly.

Grouping

You may do this project in groups of two or by yourself. If in a group of two, I would like to see some indication in the report as to how the labor was divided up. For a group of two, there only needs to be one report and both will receive the same project grade and same bonus points.

Report

I want this report typed, and in an MSU lab folder. You must have the following:

You must hand in plots of all schematics, and print outs of your VHDL code.
You must have a neatly drawn ASM chart that illustrates your FSM operation.
You must have a neatly drawn datapath diagram of your design.
Compare the number of Logic Cells required for this project with your design of Lab #8 (the matrix multiply that only used one multiplier/adder). Multiply the clock cycles by the number of logic cells for Lab #8, and do the same for this project. Compare these two numbers (this is the area*time product, and is often used as a metric).
You must have a NARRATIVE that describes the operation of the datapath, and includes a scheduling table that shows how your resources (RAMs, registers, adder, multiplier) are used each clock cycle. You must also include a discussion of problems encountered. I expect the narrative to be at least two pages long.
You must have a section labeled EXTRA POINT CLAIMS that has screenshots showing your register-to-register delay for this design, and a zoomed-in view of a testbench waveform that clearly shows the initiation rate.
You must have a section labeled FUNCTIONALITY CLAIMS in which you tell me exactly what does and does not work in your design, your initiation rate, and your final register-to-register delay.
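As an illustration of the area*time comparison asked for in the list above, the arithmetic can be sketched in a few lines of Python. The logic cell counts and clock cycle counts below are made-up placeholders, not expected results; substitute the values reported for your own designs.

```python
def area_time_product(logic_cells, clock_cycles):
    """Area*time metric: logic cells used multiplied by clocks per computation."""
    return logic_cells * clock_cycles


# HYPOTHETICAL numbers for illustration only.
single_mac = area_time_product(logic_cells=400, clock_cycles=20)
multi_mac = area_time_product(logic_cells=900, clock_cycles=8)

print(single_mac)               # 8000
print(multi_mac)                # 7200
print(multi_mac < single_mac)   # True for these placeholder values
```

A lower product means the extra hardware bought a proportionally larger reduction in clock cycles.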

Everything must be typed; nothing handwritten will be accepted.

Submission Dates:

All files must have been submitted via the submission script by Thursday April 27, 8:00 am. All final REPORTs must be in my Simrall Office box or placed under my Simrall office door by 9:00 am, Friday April 28th.

Grading

The project is worth 10% of your total grade. I will use the following guidelines for grading:

a. Project does not work at all (poor report turned in: 0%, average report: 10%, stellar report: 20%)

b. Matrix multiply seems to work somewhat, but testbench fails because handshaking is messed up (poor report: 30%, average report: 40%, stellar report: 50%)

c. Testbench works fine, but fails either the initiation rate or clock rate requirements (poor report: 40%, average report: 50%, stellar report: 60%)

d. Meets all specs (poor report: 80%, average report: 90%, stellar report: 100%)

Obviously, the biggest impact on your final class grade is via the extra points. Start early so that you can get a fully functioning design and be able to earn these extra points.