In conjunction with the Sixth ACM-IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE'2008) Cryptosorter Overview The objective of the 2008 MEMOCODE Hardware Software Codesign Contest is to sort an encrypted database of records as fast as possible. Sorting an encrypted database requires that each record is decrypted, that a proper index for each decrypted record is determined, and that each record is re-encrypted and stored at the proper index. The record length is 4 words. A record consists for four fields {f1, f2, f3, f4}. Each field is one word long and encrypted. Encryption and decryption transforms the four fields of a record {f1, f2, f3, f4} into four decrypted fields {g1, g2, g3, g4}. Each of g1, g2, g3, and g4 is one word long. Sorting The sorting key of a decrypted record is determined by weighted numerical ordering of the decrypted words of the record. g1 is the most-significant word and g4 is the least significant word. g1 through g4 are unsigned numbers. The database does not contain any duplicate records, so that there is always a unique ordering possible. Some examples of valid sorting key relations: {1, 1, 0, 0} > {1, 0, 0, 0} {28, 10, 30, 30} > {10, 99, 99, 66} {10, 10, 1, 0} > {10, 10, 0, 1} Encryption and Decryption Encryption and decryption is done using four keys {k1, k2, k3, k4}, each one word long. The encryption/decryption is done using XOR operations: {f1, f2, f3, f4} = {g1 xor k1, g2 xor k2, g3 xor k3, g4 xor k4}. Based on this mechanism, encryption and decryption of records are identical operations. The four keys {k1, k2, k3, k4} are determined using AES-128 as follows. The record index is encrypted with AES under a fixed, data-base wide key which is specified below under 'System Testbench and Design Parameters'. The encrypted record index yields 128 bits of key material. These 128 bits are split in four words corresponding to {k1, k2, k3, k4}. The encryption/decryption process is illustrated below in pseudocode. ``` // 1. determine record key using record index and database-wide key unsigned data_in[4]; unsigned data_out[4]; const unsigned key[4]; // database-wide key data_in[0] = index; data_in[1] = 0; data_in[2] = 0; data_in[3] = 0; aes_ecb_encrypt(data_in, key, data_out); // 2. decrypt record using record key unsigned f[4]; unsigned g[4]; unsigned k[4]; k[0] = data_out[0]; k[1] = data_out[1]; k[2] = data_out[2]; k[3] = data_out[3]; g[0] = k[0] ^ f[0]; g[1] = k[1] ^ f[1]; g[2] = k[2] ^ f[2]; g[3] = k[3] ^ f[3]; ``` System Testbench and Design Paramters The cryptosorter reference implementation includes a testbench, that must be used by your optimized implementation. The testbench generates the content of the database using a pseudo-RNG with a 32-bit seed. The generating polynomial for the PRNG is P(x) = 1 + x^2 + x^3 + x^32. The testbench will verify if the result of the sorting is correct. This is done by veryfying if all the records are in the proper order, and by verifying that no record was changed. For your design to be valid, your implementation should show the same behavior as the reference implementation for any other combination of amount of records (a power of two between 64 and 262144) and of database generation seed (in the range 1 to 2^32 - 1). These parameters can be selected through the macro's MAXRECORD in recordio.h, and RNGSEED in recordio.h, respectively. The judges may modify these parameters to evaluate your design for a different configuration. Important parameters for the reference implementation are as follows. A parameter either has the suffix (fixed), in which case it cannot be changed by your optimized implementation, or else it indicates the required supported range. Record length: 4 words (fixed) Number of records in the database of the reference implementation: 65536 (need to support a power of two between 64 and 262144) Database-wide key: 0xB01D FACE 0DEC 0DED 0BA1 1ADE 0EFF EC70 (fixed) Database generation seed: 1 (need to support a any seed in the range 1 to 2^32 - 1). Reference Implementation This reference implementation was designed on a Xilinx XUP board using ISE/EDK 9.1.2. The reference implementation uses off-chip DDR memory which may need to be added on your XUP board. The reference implementation is a software-only design using a 300 MHz PPC processor. The PPC has a 16KByte cache enabled for code and data. The reference implementation takes approximately 20,692,099 microseconds (+- 5 microseconds) to execute on a XUP board (with O2-level optimization in gcc). If you use a different platform, you will need to port the reference implementation to your chosen platform and demonstrate that the reference implementation operates correctly. Also, make sure you follow the Calibration rules provided in the Design Contest Rules. Submission Instructions You have time until midnight Eastern Time, March 9, 2008, to send your solution. Note that the countdown timer at the top of this page relies on the host clock, and may be inaccurate if your host clock is set wrong. Submit your design in a tar or zip file via http://www.ece.cmu.edu/tools/upload/. Use jhoe@ece.cmu.edu as the recipient address. After your submission, you will receive a confirmation email within 24 hours. If you do not receive this confirmation email, please contact the organizers immediately. Your submission should contain the following items in a single zip or tar file. Design Source Files Design Performance Metric as obtained using the testbench described under Test Cases. Design Documentation describing your FPGA platform, your design methodology, the organization of your design and its theory of operation, and a brief analysis of its performance and bottlenecks. Documentation in the form of PowerPoint slides are perfectly acceptable. Keep in mind, the judges' subjective sense of elegance is likely to be influenced by the quality of the documentation. Please note that we will treat your design sources as confidential, and that they will only be shared with the judging panel for the purpose of judging. We will not post your source code unless you give us explicit permission to do so. Test Cases We would like to obtain a uniform performance metric used by all participants. Therefore, we require you to evaluate the performance of the following testbench for your design and include the evaluation in your submission . The overall performance metric for your design is obtained as the geometric mean of the performance of 8 individual test metrics. The 8 testcases are partioned in two groups of 4 testcases each. The 4 testcases in each group differ only in the number of records. The four cases include 64, 1026, 16384, and 262144 records respectively. As a power of two, these cases correspond to 2^6, 2^10, 2^14 and 2^18 records respectively. Random testbench group The first group uses the random testbench, specified in the cryptosorter assignment. The random generator must use a seed of 0x64646464. The random testbench uses encryption with the key defined in the assignment specification. Rotated testbench group The second group uses a pathological testbench, defined by the elements of a matrix M as follows. Given a square matrix T with in-order elements in row-major order: ``` 0 1 2 3 T= 4 5 6 7 8 9 10 11 12 13 14 15 ``` then M is a 90-degree clockwise-rotated version of T: ``` 12 8 4 0 M= 13 9 5 1 14 10 6 2 15 11 7 3 ``` The contents of the plaintext database is the row-major order scan of M, and thus consists 12, 8, 4, 0, 13, 9, 5, 1, 14, 10, 6, 2, 15, 11, 7, 3. For a database of size 2^Q, M becomes a square matrix with 2^(Q-1) elements on each side. The rotated testbench uses encryption with the key defined in the assignment specification. Measurements We require the contestants to follow the testbench code. When you integrate the testbench code in your design, follow the directives below to obained performance numbers on your design. The time must be measured as accurately as possible, using hardware cycle counting. The values for Reference_time are defined in the testbench code as constants. These values reflect cycle counts on a 300 MHz PPC on a chosen hardware configuration and a chosen tool version of Xilinx/EDK. You cannot change them. You are not allowed re-synthesize the design depending on the test case. A single FPGA bitstream configuration must be able to execute all test cases. You are allowed to measure each testcases three consecutive times in order to benefit from running with a "warm" cache. In doing so, you have to follow the testbench API provided by the organizers. The example output of the testbench code on the reference code is shown below. Getting Started Start by studying the reference implementation in combination with the documentation above. On a standard XUP board, the reference implementation takes around 20 seconds to execute, as illustrated by the following screen shot. The testbench also prints '1' if the result is sorted correctly. Note that the Hyperterm terminal does not render newlines correctly. The complexity of this design problem stems from two interacting problems: a sorting problem and a crypto problem. Solving only one problem will yield a suboptimal design - considering the system level perspective is essential for optimal performance. You may make changes to the algorithms, such as changing the sorting strategy, as long the overall functionality of the system is preserved. See above (under 'System Testbench') to find what 'preserving system functionality' means. In practice, this means that you should not modify anything besides the function sortrecord() in main.c of the reference implementation, as well as the underlying call tree of sortrecord(). You can find some examples of AES cores on OpenCores.