Homework 1

CMSC 411 / Olano, Fall 2013

Updates

12 September 2013
5 September 2013

In this homework, you will write an assembler in C or C++ for a simplified subset of the MIPS instruction set. Your program should take two arguments, the name of an assembler input file and the name of the hex output file.

Assembler Format

Instructions are written with an operation followed by a list of operands separated by white space (for ease of parsing, we will NOT have commas between operands):

    operation operand1 [operand2 ...]

Register operands are written $0 to $31, though $0 is hard-wired to 0. Immediate constants are given as a c-style number, for example '100' is decimal 100, while '0x100' is hex 100 or decimal 256, or with a label. In addition, labels can be defined at the beginning of a line with a leading colon (:label), and can appear as an immediate constant without the colon (label). Note that using a label might produce different immediate values in the encoded instruction depending on the type of instruction. Data accesses use the real address byte, while a branch would need to compute the relative word offset to the label from the next instruction.

For ease of parsing, we will write data addresses as "register offset", rather than the more common "offset(register)", for example "$1 label" not "label($1)"

Anything from a '#' character to the end of the line should be discarded as a comment. Any blank or comment only lines should be ignored, and you should allow any amount of white space between elements.

We will also use one pseudo-instruction, int, whose "operands" are a list of word-sized integers to store as sequential data elements

Encoding

All instructions are encoded in 32 bits in one of three formats:

R-type opcode (6b) rs (5b) rt (5b) rd (5b) sh (5b) func (6b)
I-type opcode (6b) rs (5b) rd (5b) immediate (16b)
J-type opcode (6b) address (26b)

For R-type instructions the registers are written in the order "rd rs rt", where rd is the destination register. For I-type instructions they are written in the order "rd rs", where rd is the destination register.

You should support the following opcodes (see the textbook for descriptions)

    low bits
    000 001 010 011 100 101 110 111
high bits 000 *   j jal beq bne blez bgtz
001 addi addiu     andi ori xori lui
010                
011                
100       lw        
101       sw        
110                
111                

Opcode 000000 is used for all R-type instructions, which use the func field to determine the actual operation. The func operations you should support are:

    low bits
    000 001 010 011 100 101 110 111
high bits 000                
001 jr jalr            
010                
011 mult multu div divu        
100 add addu sub subu and or xor nor
101                
110                
111                

Your output should be a list of c-format 32-bit hex integers:

0x12345678,
0x87654321,
...

Example

For example, here is a section of legal assembly code

      addi $1 $0 10    # x = 10
      add  $2 $0 $0    # y = 0
:loop addi $1 $1 -4    # x = x - 4 (1 word)
      lw   $3 $1 data  # z = data[x]
      add  $2 $2 $3    # y = y + z
      bgtz $1 loop     # loop if x>0
:end  j    end         # loop forever to end

:data                  # label can be on its own line
      int  0 1 2 3 4   # data pseudo-instruction
      int  5 6 7 8 9

Here is the same code with instruction addresses shown, labels filled in, and the encoding type and relevant components shown

 0       addi $1 $0 10    # i-type op=0x08, rd=1, rs=0, imm=10
 4       add  $2 $0 $0    # r-type fn=0x20, rd=2, rs=0, rt=0
 8 :loop addi $1 $1 -4    # i-type op=0x08, rd=1, rs=1, imm=-4
12       lw   $3 $1 28    # i-type op=0x23, rd=3, rs=1, imm=28
16       add  $2 $2 $3    # r-type fn=0x20, rd=2, rs=2, rt=3
20       bgtz $1 -4       # i-type op=0x07, rd=0, rs=1, imm=-4
24 :end  j    6           # j-type op=0x02, address=6
28 :data int  0 1 2 3 4   # data
48       int  5 6 7 8 9

and here is what you'd print as output

0x2001000a,
0x00001020,
0x2021fffc,
0x8c23001c,
0x00431020,
0x1c20fffc,
0x08000006,
0x00000000,
0x00000001,
0x00000002,
0x00000003,
0x00000004,
0x00000005,
0x00000006,
0x00000007,
0x00000008,
0x00000009,

Implementation notes

There are a few functions that may prove useful in implementing this project.

First is fgets(line, size, file), which reads a line from a file into a string buffer. This function requires a maximum line size (= size of the buffer). 256 should work fine. It will return NULL when you reach the end of the file.

Second is strtok(line, " \t\v\r\n\f"). The second string there contains a space, tab, vertical tab, carriage return, line feed, and form feed (the standard "white space" characters). This strtok returns the first "token" in the line made entirely of characters other than white space. You will also want to use strtok(NULL, " \t\v\r\n\f"), which returns the next token in the same line. When there are no tokens left in the line, strtok returns NULL.

This particular assembly language is defined so you can tell what any token is by the first character. Feel free to take advantage of that!

sscanf(token, "%i", &val) will convert a string to an integer, including doing any C-style integer conversions (handling signed, decimal, hex with leading 0x, etc.)

fprintf(oFile, "0x%08x,\n", val) will print one integer in the expected output format (hex integer, padded with 0's to eight characters wide)

You'll probably want to use these bitwise logical operators to manipulate the bits of the encoding: a & b is a bitwise and. It may be exceptionally useful for masking (for example, val & 0xFFFF will keep just 16 bits of a signed integer value). a | b is a bitwise or. It may be useful for composing parts of the encoded instruction together. a<<b (or a>>b) shift a left (or right) by b bits. This may be useful for positioning the parts of the encoded instruction to combine.

You may see references to some labels before their location is defined. There are two approaches you can use to deal with that. The first is to take a first pass through the file counting words and collecting label addresses, then use rewind(file) to start over. The second is to store the addresses of unresolved labels in a data structure, then fill them in when you find the definition of the label.

Submitting

All electronic submissions in this class will be done using the CVS version control system. You should look at the class CVS instructions before you start work. To get full credit for your submission, you should (1) check out a copy of the empty hw1 directory before you start, (2) do your work in that checked out copy, (3) submit several intermediate checkins with short but useful messages (e.g. "parsing works!"), and (4) check in your final submission before class starts on the day of the deadline.

Include a short file named "readme.txt" that describes how to build and run your program, as well as a description of any known problems or bugs. If your program is contained in a single C or C++ file, say named "program.c", just running "make program" should suffice to build it, but we need to know the name of your program. Bugs you identify in your readme will lose only half points. Bugs you don't identify that are found during grading will lose full points.