Canturk ISCI
&
Zhenghong WANG


In this project, we present an implementation example for the PLX1.0 ISA developed at Princeton University. We describe a 2-way superscalar implementation of the PLX processor, using VHDL as the design entry tool. This implementation can support all instructions described in the PLX1.0 specification and was verified at logic level via VHDL simulators such as SYNOPSYS™. Our design methodology combines structural and behavioral modeling: all blocks that are driven by clocks, such as the registers, are designed at the structural level. This ensures that the whole processor behaves like a "real" hardware implementation, so that once it is verified, its clock timing is verified as well. For complicated combinational logic, we use behavioral descriptions, since combinational logic does not cause timing problems as long as the critical path is not made too long. This approach gives us more flexibility and makes debugging easier. Furthermore, it is not difficult to convert a behavioral description of combinational logic into a synthesizable description, so the whole design can easily be made synthesizable when necessary. The top level block diagram of our design is shown in figure I-6. The lower hierarchy block descriptions are imported from other groups' modules, which are done at RTL level. Both the test vectors (binaries + clock + initialization signals) and the assembly level codes are included in this documentation. To provide flexibility in testing the top level design, an assembler is also written in VHDL to generate the binary instruction sequence. The structural design is explained with block diagram descriptions, and the verification of operation is demonstrated using waveform diagrams and simulation outputs.



 
 
0) TABLE OF CONTENTS:

    0) Table of Contents
    1) Introduction
    2) Design and Implementation
    3) Simulation Results
    4) Appendices
 
 
1) INTRODUCTION:

PLX is a small, general purpose, subword parallel ISA, developed at Princeton University by Prof. Lee and Princeton students. PLX is aimed primarily at multimedia applications, specifically 3GPP applications and protocols. The PLX Project Part-II comprises 3 major subdivisions, depending on the nature of the project work: Algorithms, Software and Hardware. Our project (Hardware) is the VHDL design of a 2-way pipelined superscalar implementation of the PLX processor, with hazard detection and bypassing logic. Predication, a principal feature of the PLX ISA that is used in PLX and IA64 instead of conditional branches, is also implemented in the hardware control logic. As is the case in most cost effective implementations, although the execution units are doubled for superscalarity, there is a single Data Cache in the datapath. As in a general pipeline control path, the control signals are also piped through the pipe stages. The major execution units (ALU, Shift-Permute Unit and Multiplier) are only partially designed, at behavioral level; the complete units are to be provided by other hardware groups working on the lower hierarchy RTL design. In addition to the hardware description of the data and control paths, we also implemented an assembler in VHDL to convert assembly level instructions to binary instructions. The assembler is integrated with the final testbench, to provide both the resulting binary file and instruction information for the processor. The design of the assembler gives us more flexibility in applying different test programs.
 
 
2) DESIGN and IMPLEMENTATION:

In this section, we describe the design of the 2-way superscalar PLX processor in detail. We designed top-down and implemented bottom-up, yet the description proceeds from the lower hierarchy cells to the higher level design. For the sake of flow, the assembler is discussed after the top-level circuit, although it was actually implemented in parallel with the top level design.

In order to explain our design in a clear and logical way, we divide this part into 4 subsections:

Appendix-I lists brief descriptions and some testbenches for the following components.
    1. ALU
    2. Multiplier
    3. Shift - Permute Unit
    4. Predicate Register File
    5. General Purpose Register File
    6. Instruction Decoder
    7. Pipeline Registers
    8. Controller
    9. Instruction Memory
    10. Data Cache
    11. Top Level Design
Figure 1 shows the datapath diagram of our 2-way superscalar PLX processor. The data forwarding paths for predicate bits are shown in Figure 2.

Figure-1, Datapath Diagram of a 2-way superscalar PLX processor
 
 

(a) Use ALU to calculate the results of CMP instructions
 
 

(b) Use specific comparators to calculate the results of CMP instructions

Figure-2, Data forwarding paths for predicate bits



 
 
 
 
 

    1. Hazard detection
    2. Data forwarding logic for general purpose register file
    3. Data forwarding logic for predicate register file
IF/DR0---IF/DR1
 
Number  IF/DR0                  IF/DR1                      Results
1       ALU/PMUL/LDi/LD/ST.upd  ALU/LD/ST/JMP.reg/CMP/PMUL  Stall IF/DR1
2       LD/ST                   LD/ST                       Stall IF/DR1
3       CMP/CHPR                JMP/CHPR                    Stall IF/DR1
4       CHPR                    any                         Stall IF/DR1

DR/E0---IF/DR0
 
Number  DR/E0     IF/DR0                      Results
5       PMUL/LD   ALU/LD/ST/JMP.reg/CMP/PMUL  Stall IF/DR
6       CMP/CHPR  CHPR                        Stall IF/DR

DR/E0---IF/DR1
 
Number  DR/E0     IF/DR1                      Results
7       PMUL/LD   ALU/LD/ST/JMP.reg/CMP/PMUL  Stall IF/DR1
8       CMP/CHPR  CHPR                        Stall IF/DR1

DR/E1---IF/DR0
 
Number  DR/E1     IF/DR0                      Results
9       PMUL/LD   ALU/LD/ST/JMP.reg/CMP/PMUL  Stall IF/DR
10      CMP/CHPR  CHPR                        Stall IF/DR

Note: For DR/E1, one cannot determine whether the instruction is valid from the valid bit read from the predicate register alone, since the CMP result of DR/E0 may influence DR/E1. Thus the validity must be checked before stalling IF/DR.

DR/E1---IF/DR1
 
Number  DR/E1     IF/DR1                      Results
11      PMUL/LD   ALU/LD/ST/JMP.reg/CMP/PMUL  Stall IF/DR1
12      CMP/CHPR  CHPR                        Stall IF/DR1

Note: For DR/E1, one cannot determine whether the instruction is valid from the valid bit read from the predicate register alone, since the CMP result of DR/E0 may influence DR/E1. Thus the validity must be checked before stalling IF/DR.

E/DF0---IF/DR0
 
Number  E/DF0  IF/DR0                                                       Results
13      PMUL   ALU/LD/ST/JMP.reg/CMP/PMUL                                   Stall IF/DR
14      PMUL   any instruction that will write RF except PMUL instructions  Stall IF/DR

E/DF0---IF/DR1
 
Number  E/DF0  IF/DR1                      Results
15      PMUL   ALU/LD/ST/JMP.reg/CMP/PMUL  Stall IF/DR1

E/DF1---IF/DR0
 
Number  E/DF1  IF/DR0                      Results
16      PMUL   ALU/LD/ST/JMP.reg/CMP/PMUL  Stall IF/DR

E/DF1---IF/DR1
 
Number  E/DF1  IF/DR1                                                       Results
17      PMUL   ALU/LD/ST/JMP.reg/CMP/PMUL                                   Stall IF/DR
18      PMUL   any instruction that will write RF except PMUL instructions  Stall IF/DR

For JMP instructions:

IF/DR0---IF/DR1
 
Number  IF/DR0  IF/DR1  Results
1       JMP     any     Cancel IF/DR1 (only if JMP is valid and taken)

DR/E0&1---IF/DR0&1
 
Number  DR/E0&1  IF/DR0&1  Results
2       JMP      any       Cancel IF/DR (only if JMP is valid and taken)

Note: This is therefore effectively a predict-untaken JMP implementation.
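
The intra-pair rules (rows 1 to 4 of the first hazard table, IF/DR0---IF/DR1) can be sketched as a small software model. This is only an illustrative Python sketch, not the VHDL in Ctrl_unit.vhd: the instruction class names are taken from the table, while the function name, the treatment of every JMP variant as "JMP" in row 3, and the explicit register-dependency test in row 1 are assumptions made here.

```python
# Hypothetical software model of rows 1-4 (IF/DR0 --- IF/DR1); not the authors' VHDL.
WRITERS_ROW1 = {"ALU", "PMUL", "LDi", "LD", "ST.upd"}          # IF/DR0 column, row 1
READERS_ROW1 = {"ALU", "LD", "ST", "JMP.reg", "CMP", "PMUL"}   # IF/DR1 column, row 1
MEMORY_OPS = {"LD", "ST"}


def is_memory_op(kind):
    """LD, ST and their .upd variants all use the single data cache."""
    return kind.split(".")[0] in MEMORY_OPS


def stall_if_dr1(kind0, kind1, dest0=None, srcs1=()):
    """Return True when the instruction in IF/DR1 must be stalled."""
    # Row 4: CHPR in slot 0 stalls whatever sits in slot 1.
    if kind0 == "CHPR":
        return True
    # Row 3: CMP in slot 0 followed by a JMP or CHPR in slot 1
    # (assumption: every JMP variant counts as JMP here).
    if kind0 == "CMP" and (kind1.startswith("JMP") or kind1 == "CHPR"):
        return True
    # Row 2: two memory instructions compete for the single data cache.
    if is_memory_op(kind0) and is_memory_op(kind1):
        return True
    # Row 1: assumed to apply only when slot 1 reads the register slot 0 writes.
    if kind0 in WRITERS_ROW1 and kind1 in READERS_ROW1:
        return dest0 is not None and dest0 in srcs1
    return False
```

For example, `stall_if_dr1("ALU", "ALU", dest0=3, srcs1=(3, 4))` reports a stall, while the same pair with `srcs1=(5, 6)` does not.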
 
 

All data forwarding paths are shown in figure 1. Table 1 lists the correct data sources for the Op1 and Op2 MUXs for different instructions. Note that the JMP target address is not calculated by the ALU; a dedicated adder calculates it in the DR stage.
 
Instruction                  MUX for Op1                    MUX for Op2
JMP                          any                            any
JMP.link                     PC                             immediate number: 4
JMP.reg                      any                            any
JMP.reg.link                 PC                             immediate number: 4
Loadi.hi/lo                  any                            IR_imm
trap                         any                            any
CMP                          Rs1 (or from forwarded data)   Rs2 (or from forwarded data)
CMPi                         Rs1 (or from forwarded data)   IR_imm
Testbit                      Rs1 (or from forwarded data)   IR_imm
Changepr                     any                            any
Changepr.load                any                            any
Load                         Rs1 (or from forwarded data)   IR_imm
Load.upd                     Rs1 (or from forwarded data)   IR_imm
Store                        Rs1 (or from forwarded data)   IR_imm
Store.upd                    Rs1 (or from forwarded data)   IR_imm
ALU/LOGIC imm                Rs1 (or from forwarded data)   IR_imm
Extract                      Rs1 (or from forwarded data)   IR_imm
Deposit                      Rs1 (or from forwarded data)   IR_imm
Shift Right Pair             Rs1 (or from forwarded data)   Rs2 (or from forwarded data)
Packed ALU/SHIFT_PERM        Rs1 (or from forwarded data)   Rs2 (or from forwarded data)
Packed MUL                   Rs1 (or from forwarded data)   Rs2 (or from forwarded data)
Packed SHIFT_PERM with imm   Rs1 (or from forwarded data)   IR_imm

Table-1, Data Sources for MUXs of Op1&Op2

For entries marked Rsi (or from forwarded data): whenever the target address of an earlier instruction matches the source register address and that earlier (source) instruction is valid, its data should be forwarded.
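
The forwarding condition above can be sketched as an operand selection function. This is a minimal Python model under stated assumptions, not the authors' controller: the `Producer` record, the function name, and the rule that the youngest matching producer wins are all introduced here for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class Producer(NamedTuple):
    """An in-flight instruction that may supply forwarded data (hypothetical record)."""
    dest: Optional[int]  # target register number, None if nothing is written
    valid: bool          # True when the instruction's predicate makes it valid
    data: int            # result that can be forwarded


def select_operand(rs: int, regfile_value: int, producers: Sequence[Producer]) -> int:
    """Pick the value for source register rs.

    producers is assumed to be ordered youngest first, so the most recent
    valid write to rs wins; otherwise the register file value is used.
    """
    for p in producers:
        if p.valid and p.dest == rs:   # same address and valid source instruction
            return p.data
    return regfile_value
```

An invalid (predicated-off) producer is skipped even when its target address matches, which is the "source instruction is valid" part of the condition.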

According to figure 2, the results of CMP instructions are not available until the E stage. With the data forwarding logic, however, the validity of all instructions can be determined in the DR stage except in one case: when a CMP is in IFDR0, the validity of the instruction in IFDR1 can only be determined in the E stage. Thus only the instructions that complete their "real" operations in the DR stage, such as JMP and Changepr, have to be stalled in this case.

When a Changepr is in IFDR0, any instruction in IFDR1 has to be stalled, since Changepr modifies the bank selection bits of the predicate register file in the DR stage and one therefore cannot know whether the instruction in IFDR1 is valid. Figure 3 shows the circuit for the bank selection bits.

Figure-3, Bank selection unit


To avoid directly manipulating binary streams for instructions, we implemented an assembler in VHDL (assembler.vhd). It is integrated with the testbench to read instructions directly from an assembly file and convert them to binary instructions stored in the instruction memory. The assembler accepts flexible file formats and provides several informative error diagnostics. The input and output of the assembler are shown in figure-4.

Figure-4, Assembler Input and Output files
To demonstrate the flexibility of the assembler, we provide an example file, “testfile.asm”, and demonstrate the cases on this file:
#OUR ASSEMBLY FILE:

P7:cmp.leu R1, R11, P1,P0
  P6 :    loadi.hi R6, -9

p0: PADD.8.u r5 , r3, r2  #comment here

P2:jmp.link -1

# MY ASSEMBLY CODE NEEDS THIS , 0
P0:jmp.reg R18, 0

# wrong instructions
#padd.8.s r5, r16 r7
#pacc.6
#padd.1.s r5, r6, ro

The callouts on this file point out the following:
  • We can have comment lines
  • We can have blank lines
  • We can have spaces between operands
  • We can have indentation and spaces between predicate fields
  • Instructions are case insensitive, and we can have comments after instructions
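
The tolerant line format listed above can be illustrated with a short parsing sketch. This is Python written for this explanation, not the VHDL in assembler.vhd; the function name, the regular expression and the returned tuple layout are assumptions.

```python
import re

# "P<n> : mnemonic operands", with optional spaces around the colon, any letter case.
LINE_RE = re.compile(r"^\s*p(\d+)\s*:\s*(\S+)\s*(.*)$", re.IGNORECASE)


def parse_line(line):
    """Return (predicate, mnemonic, operands), or None for blank/comment lines."""
    code = line.split("#", 1)[0].strip()      # drop a trailing or full-line comment
    if not code:
        return None
    match = LINE_RE.match(code)
    if match is None:                          # e.g. the predicate prefix is missing
        raise ValueError("Invalid instruction -->" + line)
    pred, mnemonic, rest = match.groups()
    operands = [op.strip().lower() for op in rest.split(",") if op.strip()]
    return int(pred), mnemonic.lower(), operands
```

Applied to the lines of testfile.asm, `parse_line("  P6 :    loadi.hi R6, -9")` gives `(6, "loadi.hi", ["r6", "-9"])`. Unlike the actual assembler, this sketch also accepts tabs between fields.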
 
 

The line "P0:jmp.reg R18, 0" demonstrates an actual violation by the assembler for the "jmp.reg" instruction: as the comment in the file notes, the trailing ", 0" is needed.
 
 
 
 

 

Hence, we also demonstrated a bug-like property of the program for the jmp.reg instruction. This is not a bug, as it was done intentionally to make jmp.reg abide by the format of Type-1 instructions, and it can be fixed if needed.

The assembler has several error diagnostics to help identify and correct faulty instructions, as well as several debug options embedded in the code. At present, some of the debug options are left enabled, as they are informative in verifying correct operation. The simulator output for the above file and the generated binary file are shown in table-2.
 

Simulator Output
<<<<<<<<<<<<<<<<<<< STARTS >>>>>>>>>>>>>>>>>>
11100100000001010110000010000111
11000010000110111111111111110111
00011000000011000100010100000111
01000000111111111111111111111111
00000001010010000000000000000000
0 NS
Assertion FAILURE at 0 NS in design unit ASSEMBLER(TESTBENCH) from process /ASSEMBLER/_P0:
    "test finished"
Notes on the simulator output:
  • The STARTS banner tells the start of assembling
  • Each generated binary instruction is also displayed on standard output
  • The "test finished" assertion tells that assembling finished gracefully


Instr_image.ini -> Binary File
11100100000001010110000010000111
11000010000110111111111111110111
00011000000011000100010100000111
01000000111111111111111111111111
00000001010010000000000000000000
Table-2, Simulator Output and Generated Binary file by the assembler
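
As a quick cross-check of Table-2 against testfile.asm, the predicate of each instruction can be read back from the generated words. The sketch below is illustrative Python and assumes that the predicate register number sits in the three most significant bits of each word, which is what the listed source/binary pairs suggest; the helper name is hypothetical.

```python
# Predicates as written in testfile.asm, in order: P7, P6, p0, P2, P0.
SOURCE_PREDICATES = [7, 6, 0, 2, 0]

# Words copied from the generated Instr_image.ini listing.
BINARY_WORDS = [
    "11100100000001010110000010000111",
    "11000010000110111111111111110111",
    "00011000000011000100010100000111",
    "01000000111111111111111111111111",
    "00000001010010000000000000000000",
]


def predicate_field(word):
    """Assumed layout: the top 3 bits of a 32-bit word hold the predicate number."""
    return int(word[:3], 2)


decoded = [predicate_field(w) for w in BINARY_WORDS]
```

Here `decoded` matches `SOURCE_PREDICATES` for all five instructions.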
To demonstrate the diagnostic ability of our assembler, we include a few example error reports, as shown in table 3. The assembler error diagnostics include, but are not limited to, the given examples.
Erroneous Instruction:  PADD.8.u r5 , r3, r2  #comment here
Error Message:          ***ERROR***: in line -> 6
                        "Invalid instruction[non numeric Predicate id] -->PADD.8.u r5 , r3, r2  #comment here "

Erroneous Instruction:  P7:cmp.leu R1, R11 P1,P0
Error Message:          "Invalid instruction -->P7:cmp.leu R1, R11 P1,P0"

Erroneous Instruction:  P3:padd.5.u r5, r4, r3
Error Message:          ***ERROR***: in line -> 7
                        "Invalid instruction [wrong subword size field for ...] -->padd.5.u r5, r4, r3"

Erroneous Instruction:  P8:psub.4.s r5, r4, r3
Error Message:          ***ERROR***: in line -> 7
                        "Invalid instruction[non numeric Predicate id] -->P8:psub.4.s r5, r4, r3"

Erroneous Instruction:  P4:psub.4. r5, r4, r3
Error Message:          ***ERROR***: in line -> 11
                        "Invalid instruction [expected u or s for psub] -->P4:psub.4. r5, r4, r3"

Erroneous Instruction:  P6 :    loadi.hi 6, -9
Error Message:          ***ERROR***: in line -> 5
                        "Invalid instruction[expected Rd register field] -->P6 :    loadi.hi 6, -9"

Table-3, Error diagnostics generated by the assembler
The assembler proved very effective in generating the required binary data for the instruction memory, and is in fact integrated within the final testbench. As a precaution, the assembler code should be run with the
CS_ASSERT_STOP_NEXT_WAIT  = TRUE
statement in the ".synopsys_vss.setup" simulation setup file; otherwise, the assertion statements do not cause the simulation to terminate. Other known bugs in the assembler code are: tabs between fields are not sufficient to separate fields, and immediate values are only accepted in the range -99999 to 99999 decimal, so not all 23 bit immediates are supported (99999 needs only 17 bits). This can be fixed if needed. Finally, although we tried to take care of all cases, a space after the end of each instruction is recommended. We have not encountered any case where this is actually required, but the final field is handled as a special case in the assembler, because the comparison of fields could exceed the available line length, and a trailing space avoids this special case.
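
The gap between the accepted immediate range and a full 23 bit field can be checked with a few lines of arithmetic. The sketch below is Python and assumes that the 23 bit immediate is a signed two's complement value; the constant and function names are illustrative only.

```python
ASSEMBLER_MAX = 99999   # largest magnitude the assembler accepts, per the text
IMM_BITS = 23           # immediate width mentioned in the text


def signed_range(bits):
    """Smallest and largest value of a two's complement field of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


imm_min, imm_max = signed_range(IMM_BITS)
magnitude_bits = ASSEMBLER_MAX.bit_length()   # bits needed for the value 99999
```

Under this assumption a 23 bit field spans -4194304 to 4194303, whereas 99999 already fits in 17 bits, so a large part of the encodable range is not reachable through the assembler.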
3) SIMULATION RESULTS:

The asm files and binary files shown in tables 4 and 5 are examples of our testbenches. Standard data forwarding, predicate register control, compare and jump all appear in these testbenches.
 

3.1) FIBONACCI SEQUENCE:

This test file computes terms of the Fibonacci series in a loop; in the listing below, the loop counter R2 is initialized to 32. The assembly level file, fibo.asm, is input to the assembler, and the resulting binary file, fibo.bin, is used to load the binary instructions into the instruction memory.
 
 

ASM file:

# PROGRAM TO COMPute FIBONACCI
P1: LOADI.LO r1, 1 
P1: loadi.lo R2, 32
# R1 and R2 keep for loop repetition
P1: loadi.lo r3,1
P1: loadi.lo R4,0
# R3 and R4 first 2 terms of fibonacci
P1:padd.8 R5,R3,R0
#R0 contains 0 and R5 is tmp
P1: padd.8 R3, R3,R4
P1: padd.8 R4, R5, R0
P1: psub.8 R2,R2,R1
P1: cmp.gt R2, R0, P1, P2
P1: jmp -20
# jump back 5 instr-s
P2: padd.8 R10, R3, R0
# to see that now P2 is set

Binary file:

00100010100001000000000000000001
00100010100010000000000000100000
00100010100011000000000000000001
00100010100100000000000000000000
00111000000011000000010100000011
00111000000011001000001100000011
00111000000101000000010000000011
00111000000010000010001000010011
00100100000010000000000010100100
00100000011111111111111111101100
01011000000011000000101000000011
Table-4, Sample testbench 1: fibo
 


 Figure-5, Simulation waveform for fibo.asm

3.2) FAXPY SEQUENCE:

This test file computes the fixed-point expression a*x[i]+y[i] for arrays x and y. The assembly level file, faxpy.asm, is input to the assembler, and the resulting binary file, faxpy.bin, is used to load the binary instructions into the instruction memory.
 
 

ASM file:

# PROGRAM TO COMPUTE FAXPY
# P0:mux.2.brcst R3, R3  #copy scalar "a" to all 16-bit subwords in R3
#foo:
 P1:LOAD.8.upd R4, 8(R1) #load (X(i), X(i+1), X(i+2), X(i+3)), update R1
 P1:PMUL.odd R5, R4,R3 #multiply (a*X(i), a*X(i+2))
 P1:PMUL.even R6, R4,R3 #multiply (a*X(i+1), a*X(i+3))
 P1:LOAD.8 R8, 8(R2)       #load (Y(i), Y(i+1), Y(i+2), Y(i+3))
 P1:MIX.2.L R7, R8,R0 #expand 16-bit to 32-bit subwords, padding on right
 P1:MIX.2.R R8, R8,R0 #R7=(Y(i), Y(i+2)) and R8=(Y(i+1), Y(i+3))
 P1:PADD.4 R9, R5,R7       #add (a*X(i)+Y(i), a*X(i+2)+Y(i+2))
 P1:PADD.4 R10, R6,R8      #add (a*X(i+1)+Y(i+1), a*X(i+3)+Y(i+3))
 P1:MIX.2.L R8, R9,R10 #contract 32-bit down to 16-bit subwords
 P1:STORE.8.upd 8(R2), R8 #store new (Y(i), Y(i+1), Y(i+2), Y(i+3)), update R2
 P1:CMP.leu R1,R11, P1,P0 #test if done (result to P0 is discarded)
 P1:JMP  -44  #loop if not done 

Binary file:

00101011100001001000000000001000
00111000100100000110010100000001
00111000100100000110011000000101
00101001100010010000000000001000
00111001001000000000011101000001
00111001001000000000100001000101
00111000000101001110100100000010
00111000000110010000101000000010
00111001001001010100100001000001
00101111100010010000000000001000
00100100000001010110000010000111
00100000011111111111111111010100
 
 
Table-5, Sample testbench 2: FAXPY


Figure-6, simulated waveform for one loop of FAXPY



Figure-7, simulated waveform for 5 loops of FAXPY



 
4) APPENDICES

4.1) APPENDIX-I - BASIC COMPONENT LIST:
 

4.1.1  ALU:

To test our processor, we implemented a behavioral level ALU, included in ALU64_new.vhd. This ALU does not perform all functions, only the basic ones required by our testbenches, and it is used in most of the tests. A representative block diagram of the ALU, showing the input and output pins, is given in figure I-1. A representative simulation of the ALU is done using the test program ALU64_tester.vhd within the higher level testbench file. The simulation stimulus, waveforms and descriptions are shown in Appendix-II.
4.1.2  MULTIPLIER:
To support PMUL instructions, we also implemented a behavioral level fixed-point multiplier, included in MUL.vhd. In our design, the multipliers are pipelined and have a 3-cycle E stage. So far, PMUL.odd and PMUL.even are implemented. The input/output pins are almost the same as those of the ALU, except that MUL also needs clk and reset inputs.
4.1.3  SHIFT-PERMUTE UNIT:
For the shift permute unit, we use the shift permute unit provided by group 1. The behavioral design for the shift permute unit is shown in 1-shift-permute.vhd.


4.1.4  PREDICATE REGISTER FILE

As described in PLX-1.0, we implemented a predicate register file with 16 banks of 8 bits each, included in pred_reg_v2.vhd. The 2-way superscalar PLX requires 2 read ports, 4 bitwise write ports and 1 bytewise write port. Figure I-2 shows its input/output pins.
4.1.5  GENERAL PURPOSE REGISTER FILE
PLX has 32 general purpose registers. In our 2-way superscalar design, the register file has 4 read ports and 3 write ports (one of which serves data from the data cache). The implemented register file is included in regfile.vhd.
4.1.6 INSTRUCTION DECODER
The instruction decoder interprets instructions: it separates the predicate register field, Opcode, register fields, immediate fields and Subop fields, and it sets or clears the write-enable bits for different instructions. The coded instruction decoder is included in Inst_decoder.vhd.
4.1.7 PIPELINE REGISTERS
The pipeline registers carry all necessary information for every fetched instruction through the pipeline, such as all the decoded fields, the PC and even the instruction itself. An example pipeline register for the control path is included in Ctrl_reg.vhd.
4.1.8  CONTROLLER
The controller is purely combinational logic. It checks for all kinds of hazards and either stalls the pipeline or forwards the correct data. It contains standard data forwarding logic for both the general purpose register file and the predicate register file. It also deals with WAW hazard elimination. The VHDL code of the controller is in Ctrl_unit.vhd.


4.1.9  INSTRUCTION MEMORY

The instruction memory, which stores the binary instructions, is designed behaviorally as an array of std_logic_array(31:0). The memory is therefore organized as 4 byte Big-Endian words. The memory size is defined as 1 KB (1024 bytes), with 1024/4 word lines. As there is no standard defined for initializing the Imem, we initialize it by first reading the binary file generated by the assembler. As the I-mem is byte addressed, references must be aligned to 4 bytes. This should be taken into careful consideration when writing the offsets of 'jmp' instructions. The VHDL code for the instruction memory is included in I_Cache.vhd.
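
The alignment rule for 'jmp' offsets can be made concrete with the two test programs in section 3. The Python sketch below converts an instruction count into a byte offset and encodes it as a 23 bit two's complement field; it assumes that the jmp immediate occupies the low 23 bits of the word, which agrees with the listed fibo and FAXPY binaries. All function names are illustrative.

```python
WORD_BYTES = 4    # each instruction is one 4 byte word
IMM_BITS = 23     # assumed width of the jmp immediate field


def jmp_offset(instruction_count):
    """Byte offset for jumping by a number of instructions (negative = backwards)."""
    return instruction_count * WORD_BYTES


def encode_imm(value, bits=IMM_BITS):
    """Two's complement bit string of the given width."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("immediate out of range")
    return format(value & ((1 << bits) - 1), "0{}b".format(bits))


def decode_imm(field):
    """Inverse of encode_imm for a bit string of any width."""
    value = int(field, 2)
    if field[0] == "1":
        value -= 1 << len(field)
    return value


FIBO_JMP = "00100000011111111111111111101100"    # P1: jmp -20 in fibo
FAXPY_JMP = "00100000011111111111111111010100"   # P1:JMP -44 in FAXPY
```

Jumping back 5 instructions gives `jmp_offset(-5) == -20`, and the last 23 bits of `FAXPY_JMP` decode to -44, i.e. 11 instructions back from the JMP to the first instruction of the loop.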


4.1.10 DATA CACHE

The data memory, which serves the loads and stores, is designed behaviorally like the instruction memory, as an array of std_logic_array(31:0). It is also organized as 4 byte Big-Endian words, and its size is 1 KB (1024 bytes), with 1024/4 word lines. In addition to the Waddr and Raddr address lines and the Wdata data line, there is a write enable signal, we, to suppress or enable writes depending on the instruction. The VHDL code for the data cache is in D_Cache.vhd.


4.1.11 TOP LEVEL DESIGN

The top level design is as shown in the following figure, the corresponding VHDL file is in PLX.vhd.
4.2) APPENDIX-II - ALU TEST:

To test the ALU functionality, we wrote a small testbench for the ALU. The test file applies one operation per defined time period, in the following order:
 

  • Period 1: padd.8 15,  32 Expected ALUout: 47
  • Period 2: padd.8 -99, 32 Expected ALUout: -67
  • Period 3: padd.2 x35_24_45_23, x00_00_11_11 Expected ALUout: x35_24_56_34
  • Period 4: cmp.eq 10, -10 Expected ALUout: 20
  • Period 5: cmp.ge -8, 764 Expected ALUout: -772
  • Period 6: cmp.geu -8, 764 Expected ALUout: 1
  • Period 7: cmp.geu 764,-8 Expected ALUout: -1
  • Period 8: cmp.geu -764, -8 Expected ALUout: -756
  • Period 9: cmp.geu 764,8 Expected ALUout: 756
The ALUout values for the padd operations are obvious. For the compare instructions, we only use the sign information for the >, < relations; therefore, for the unsigned comparisons, the ALU output carries only the correct sign information. A positive value means the first operand is bigger, while a negative value means the second operand is bigger, in terms of the applied comparison method. The test file is written for a pattern period of 100 ns, so the whole test set is applied in 900 ns. The resulting simulation waveforms are shown in figures AppII-1 and AppII-2.
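
One simple model that reproduces the expected compare outputs of periods 4 to 9 is sketched below in Python. It is not taken from ALU64_new.vhd: it assumes that the signed compares output the plain difference, and that the unsigned compares output the difference when both operands have the same sign and +1 or -1 otherwise. The function names are hypothetical.

```python
def cmp_signed(a, b):
    """Signed compare: the sign of the difference gives the relation."""
    return a - b


def cmp_unsigned(a, b):
    """Unsigned compare on two's complement operands; only the sign is meaningful.

    When the signs differ, the negative operand is the larger unsigned value,
    so the result is forced to +1 or -1 instead of using the difference.
    """
    if (a < 0) != (b < 0):
        return 1 if a < 0 else -1
    return a - b
```

For example, `cmp_unsigned(-8, 764)` is positive because -8 read as an unsigned value is larger than 764, while `cmp_signed(-8, 764)` is negative.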

Figure AppII-1, Simulation Waveform for ALU - Padd part

Note that figure AppII-1 is actually two merged figures: the first 2 patterns are displayed as signed integers to demonstrate the padd.8 operation, and the 3rd period displays the patterns in HEX to simplify the observation of the padd.2 operation.
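
The padd patterns of periods 1 to 3 can be reproduced with a small subword-parallel add. The Python sketch below assumes 64 bit registers, a subword size given in bytes (so padd.8 adds whole registers and padd.2 adds 16 bit lanes), and wrap-around addition inside each lane; it is an illustration, not the authors' ALU code.

```python
def padd(a, b, subword_bytes, width_bytes=8):
    """Add a and b lane by lane; carries never cross a subword boundary."""
    bits = 8 * subword_bytes
    lane_mask = (1 << bits) - 1
    result = 0
    for lane in range(width_bytes // subword_bytes):
        shift = lane * bits
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        result |= (lane_sum & lane_mask) << shift   # wrap around within the lane
    return result


def to_signed64(x):
    """Interpret a 64 bit pattern as a signed integer."""
    return x - (1 << 64) if x & (1 << 63) else x
```

With these definitions, `padd(0x35244523, 0x00001111, 2)` yields `0x35245634`, and `to_signed64(padd(-99, 32, 8))` yields -67, as in the test patterns.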

Figure AppII-2, Simulation Waveform for ALU - Cmp part

As seen in figure AppII-2, the described compare operations are performed and the results are in accordance with expectations.
 

In these test programs, we also use a batch simulation option provided by the Synopsys software, named comm files. These files automate the simulation by letting the user pre-store the desired simulator actions. The comm file used for the ALU test is included in ALU64test2. Information on how to use these files can be found in the text file about how to use batch files. As an example, for the ALU test, after starting the simulator, typing:
 

# include ./batchfiles/ALU64test2
# ALU64comm2
enables the use of the comm file and produces the above shown simulation results.