**The First MEMOCODE
HW/SW Co-design Contest**

**https://memocode07.ece.cmu.edu/contest.html**

**Entries Due Online: March 19, 2007**

**Announcements
and Updates**

- Please revisit this website
from time to time to receive updates.

**1. Overview**

Matrix-matrix multiplication (MMM) is a linear algebra operation familiar to everyone. This design challenge emphasizes your HW/SW co-design methodology’s ability to explore algorithmic alternatives, to fine-tune performance, and to quickly create a HW/SW co-designed and optimized implementation.

The implementation of a basic triple-loop square MMM is given below. The first three arguments A, B and C are pointers to the square N-by-N multiplicand, multiplier and product matrices, respectively. The fourth argument N specifies the number of rows/columns in the square matrices. A, B and C are stored in row-major order. The implementation below assumes C is zero-initialized. This implementation performs N^3 multiplications and N^3 additions.

```c
void mmmKernel(Number* A, Number* B, Number* C, int N) {
    int i, j, k;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            for (k = 0; k < N; k++)
                C[i*N+j] = C[i*N+j] + A[i*N+k]*B[k*N+j];
}
```

The above straightforward implementation has poor data reuse
locality, which becomes a performance concern when the entire data set cannot
be kept in a nearby fast memory. This can occur in a software scenario
when the data set cannot fit in the cache of the processor; this can also occur
in a HW/SW co-execution scenario when the data set cannot fit within the
hardware accelerator. In both scenarios, the data set must be transferred
frequently and redundantly between the nearby fast memory and a slower
backing-store (e.g., DRAM).

To improve data locality, a “blocked” MMM breaks down the computation into a series of small sub-block MMMs. A blocked MMM still performs N^3 multiplications and N^3 additions, but it has better data locality (requiring fewer data transfers between the fast nearby memory and the slower backing store). A blocked MMM implementation is given below with an additional argument, NB, the blocking size. This implementation assumes N is an integer multiple of NB. Again, this implementation assumes C is zero-initialized. It also assumes the existence of a new MMM kernel (which could be a basic triple-loop implementation) to multiply the sub-blocks. For complete details, please see the reference software-only implementation released with the contest.

```c
void mmmBlocked(Number* A, Number* B, Number* C, int N, int NB) {
    int j, i, k;
    for (j = 0; j < N; j += NB)
        for (i = 0; i < N; i += NB)
            for (k = 0; k < N; k += NB)
                /* sub-block kernel: row stride N, block extent NB */
                mmmKernel(&(A[i*N+k]), &(B[k*N+j]), &(C[i*N+j]), N, NB);
}
```
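The sub-block kernel invoked above differs from the basic mmmKernel: it walks an NB-by-NB block whose rows are laid out with the full-matrix stride N. A minimal sketch of such a kernel follows; the function name mmmBlockKernel and the `typedef int Number` stand-in are assumptions (the real Number type is the packed complex format defined by the contest, and the reference implementation defines the actual kernel).

```c
/* Hypothetical sub-block kernel for the blocked MMM above: multiplies an
 * NB-by-NB block, accumulating into C. Rows are addressed with stride N,
 * the width of the enclosing matrix, so the same code works on any
 * sub-block of a larger row-major matrix. */
typedef int Number;  /* stand-in; the contest's Number is packed complex */

void mmmBlockKernel(Number* A, Number* B, Number* C, int N, int NB) {
    int i, j, k;
    for (j = 0; j < NB; j++)
        for (i = 0; i < NB; i++)
            for (k = 0; k < NB; k++)
                C[i*N+j] += A[i*N+k] * B[k*N+j];  /* stride N, extent NB */
}
```

With N == NB this degenerates to the basic triple-loop kernel on a whole matrix.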

**2. Design Task**

Starting from the reference software-only blocked MMM implementation (see submission instructions), you are to develop a HW/SW co-designed, optimized implementation that partitions the computation between a processor running software and a hardware datapath implemented on an FPGA fabric. For starters, mmmKernel() is a good candidate to accelerate in HW, and the choice of NB is a tunable parameter. You might want to explore non-square blocking. You can also consider more effective data management to take advantage of on-FPGA memory capacity. Your implementation should be optimized for 1024-by-1024 and 256-by-256 matrices. Your implementation must support power-of-two matrix sizes N = 64, 128, 256, 512, and 1024 without reprogramming the FPGA fabric.

Like the reference implementation, your design must support 16-bit fixed-point, complex data values. That is, each matrix element is 32 bits: the more significant 16 bits hold the real component and the less significant 16 bits hold the imaginary component. The multiplier matrix (B) and the product matrix (C) use the same 16-bit 2’s-complement fixed-point format in both the real and the imaginary component. The multiplicand matrix (A) uses a 16-bit 2’s-complement fixed-point format with 14 fraction bits. You can assume the multiplicand matrix contains real and imaginary values between -1 and 1 inclusive. You can also assume the test cases will not cause overflows in the reference software-only implementation.
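A single complex multiply-accumulate under this packing might be sketched as follows. This is an illustration only, not the reference arithmetic: the function name `cmac` and the rescaling of each product by 14 bits (to undo A's 14 fraction bits) are assumptions.

```c
#include <stdint.h>

/* Illustrative complex multiply-accumulate under the contest's packing:
 * real component in the upper 16 bits, imaginary in the lower 16 bits.
 * The A-element's 14 fraction bits are assumed to be removed by an
 * arithmetic shift of each product; the reference code may differ. */
static uint32_t cmac(uint32_t acc, uint32_t a, uint32_t b) {
    int16_t ar = (int16_t)(a >> 16),   ai = (int16_t)(a & 0xFFFF);
    int16_t br = (int16_t)(b >> 16),   bi = (int16_t)(b & 0xFFFF);
    int16_t cr = (int16_t)(acc >> 16), ci = (int16_t)(acc & 0xFFFF);

    /* (ar + i*ai) * (br + i*bi), each product rescaled from Q14 */
    int32_t pr = ((int32_t)ar*br - (int32_t)ai*bi) >> 14;
    int32_t pi = ((int32_t)ar*bi + (int32_t)ai*br) >> 14;

    cr = (int16_t)(cr + pr);   /* overflow assumed absent per the rules */
    ci = (int16_t)(ci + pi);
    return ((uint32_t)(uint16_t)cr << 16) | (uint16_t)ci;
}
```

For example, an A-element of real 1.0 in this format is 0x4000 in the upper halfword (0x40000000 as a packed element), and multiplying by it leaves a B-element unchanged.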

**3. Implementation Platform**

You may use any HW and SW design methodology at your disposal. Formal methods are encouraged but not required.

You may use any FPGA development platform at your disposal. The FPGA platform must permit a processor running software (e.g., the embedded PowerPC405 in a Xilinx Virtex-II Pro FPGA or a Xilinx Microblaze soft-core) to communicate with the hardware accelerator datapath on an FPGA fabric. The processor and/or the FPGA should have access to off-chip memory sufficient to hold the multiplicand, multiplier and product matrices, up to 1024-by-1024. In your implementation, the multiplicand matrix and the multiplier matrix must start from the off-chip memory, and the product matrix must complete in the off-chip memory.

The reference platform supported by the contest is the Xilinx XUP development board (http://www.digilentinc.com/Products/Detail.cfm?Prod=XUPV2P&Nav1=Products&Nav2=Programmable) with 512MB of DRAM. For those of you using XUP, we provide a reference EDK project for the Xilinx XUP board that implements a basic interface library to enable communication between the software running on the 300MHz embedded PowerPC405 and the 100MHz FPGA fabric through the DSOCM interface (150+MB/sec bandwidth). The interface is based on two circular memory FIFOs, one for each direction of data communication between the hardware and the software. You may develop or acquire any other interfacing scheme you prefer.

**4. Contest Judging**

The contest will be judged in two parts.

First, an objective element of judging is based on the combined execution time (wall-clock) of a 1024-by-1024 MMM and a 256-by-256 MMM. Second, a subjective element of judging is based on the “elegance” of the solutions.

The execution time will be normalized for platform differences. The effective execution time will be

Time_effective = (Time_measured(N=1024) + 64 * Time_measured(N=256)) * f_CPU-speed * f_FPGA-capacity * f_FPGA-speed

where

f_CPU-speed = min{ 1, Time_SW-only(N=1024, no blocking, 300MHz PPC in XUP) / Time_SW-only(N=1024, no blocking, your CPU) }

f_FPGA-capacity = min{ 1, capacity-in-gates(your FPGA) / capacity-in-gates(XC2VP30) } /* based on vendor data for the part you use */

f_FPGA-speed = min{ 1, operating-frequency(your FPGA) / 100MHz }

Notice that this normalization penalizes platforms that are more capable than the XUP but deliberately does not benefit less capable platforms. Two factors the normalization does not take into account are 1. HW/SW communication bandwidth and latency, and 2. off-chip memory bandwidth and latency. This normalization is not perfect; hence, “gaming the system” is fair game. If an entry is based on an extraordinary FPGA platform, the judging panel reserves the right to determine a fair normalization on a case-by-case basis.
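Written out directly, the normalization arithmetic amounts to three clamped ratios multiplying the combined measured time. The sketch below is only an illustration of the formula; the function name time_effective and its argument order are ours, not part of the contest materials.

```c
/* Illustration of the Time_effective normalization: each factor is the
 * entrant's self-reported number divided by the XUP baseline, clamped
 * at 1 so that less capable platforms are not rewarded. */
static double clamp1(double x) { return x < 1.0 ? x : 1.0; }

double time_effective(double t_1024, double t_256,     /* measured MMM times */
                      double sw_xup, double sw_yours,  /* SW-only N=1024 times */
                      double gates_yours, double gates_xc2vp30,
                      double freq_mhz) {
    double f_cpu  = clamp1(sw_xup / sw_yours);         /* faster CPU penalized */
    double f_cap  = clamp1(gates_yours / gates_xc2vp30);
    double f_freq = clamp1(freq_mhz / 100.0);          /* vs. 100MHz fabric */
    return (t_1024 + 64.0 * t_256) * f_cpu * f_cap * f_freq;
}
```

The 64x weight on the 256-by-256 time reflects that a 1024-by-1024 MMM performs 64 times as many operations as a 256-by-256 one.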

You need to prepare brief documentation describing 1. your FPGA platform, 2. your design methodology, 3. the organization of your design and its theory of operation, and 4. a brief analysis of its performance (e.g., where time is spent and where the bottlenecks are). A panel of judges will subjectively assess the elegance of each solution and rank the entries accordingly. The decisions of the judges are final. (Documentation in the form of PowerPoint slides is perfectly acceptable. Keep in mind, the judges’ subjective sense of elegance is likely to be influenced by the quality of the documentation.)

The entries are ranked overall by the value (Rank_Time-effective + Rank_elegance). In the case of a tie, Time_effective is the tie-breaker. The top three finishers will have the opportunity to present their design and design methodology in MEMOCODE’07’s technical program. All entrants are invited to submit a poster for MEMOCODE’s Poster Session. We will offer a cash prize for the winning design (from the top three) selected at the conference.

We rely on the honor system for initially reporting the performance data required for the Time_effective calculation, i.e.,

- Time_measured(N=1024), Time_measured(N=256)
- Time_SW-only(N=1024, no blocking, your CPU) /* reference implementation compiled with -O and -DNDEBUG */
- capacity-in-gates(your FPGA)
- operating-frequency(your FPGA)

The winning designs must work correctly. If you finish in the top three, we will arrange to verify the correctness of your design and your reported performance data.

**5. Eligibility and Submission Instructions**

This contest is open to industry
and academic institutions (faculty and/or students). A team may include both industry and academic
members. There is no limit on the size
of a team. There is no limit on the
number of teams per institution. Each
person can participate in only one team.

If you are interested, please email forrest@ece.ucsb.edu with your contact information (team member names and affiliations). We will respond by sending you the SW-only MMM reference design (portable C) and the XUP HW/SW interface design (C and Verilog in an EDK project). There are no obligations associated with requesting the reference designs. You have until March 19, 2007 to submit your entry (design source + documentation) in a tar or zip file via http://www.ece.cmu.edu/tools/upload/ (use memocode07@ece.cmu.edu as the recipient address). Although you have until March 19, we estimate the design task to be approximately a one-week effort.

Please email queries to forrest@ece.ucsb.edu. Please revisit
this website from time to time for updates.

Organized
by Forrest Brewer (UCSB) and James C. Hoe (CMU)