# UNIVERSITY OF LONDON IMPERIAL COLLEGE OF SCIENCE, TECHNOLOGY AND MEDICINE

#### **EXAMINATIONS 2007**

BEng Honours Degree in Computing Part II

MEng Honours Degrees in Computing Part II

MSc in Computing Science

BEng Honours Degree in Information Systems Engineering Part II

MEng Honours Degree in Information Systems Engineering Part II

for Internal Students of the Imperial College of Science, Technology and Medicine

This paper is also taken for the relevant examinations for the Associateship of the City and Guilds of London Institute

PAPER C210=E2.13

#### COMPUTER ARCHITECTURE

Wednesday 2 May 2007, 14:30 Duration: 120 minutes

Answer THREE questions

Corrected Copy

Paper contains 4 questions Calculators required

#### 1a Memory System

Explain the purpose of cache lines. What happens when you increase the cache line size? What happens when you decrease the cache line size? Please explain your answers to *part 1a* in detail.

- b Assume a 1MByte direct mapped cache with cache lines of 512 bytes, and a 64 bit address space. How many bits are needed to address a word in a cache line? How are the remaining address bits used in a cache access?
- c The following programme runs on a machine with a single cache as described in *part 1b*.
  - i. Explain how many main memory accesses (memory bus transactions) the programme below generates.
  - ii. Explain how much data gets transfered back and forth between memory and the cache.

You may assume:

- The cache is empty in the beginning.
- Reasonable register allocation
- Array a and array b start at the beginning of a cache line.
- Array a and array b do not interfere in the cache, and
- A "write back" policy for writes to the cache.

State any other assumptions you make.

```
int i,a[1000000],b[1000000];
init(a); b[0]=0;
for(i=1;i<1000000;i++){
    b[i]=a[i]*a[i-1]+b[i-1];
}</pre>
```

d Assuming that a cache line can be retrieved and stored in main memory in 500 clock cycles, multiplies take 3 clock cycles and additions take 1 clock cycle: How many clock cycle does the processor have to wait due to an access to main memory?

The four parts carry, respectively, 20%, 20%, 30%, 30%, of the marks.

#### 2a Virtual Memory

The processor issues the following instruction: mov(R1), R2 meaning: load R2 with the value stored at the address A stored in R1. Describe in detail the process that starts with address A, involving a single cache and a single TLB, and ending with the data item arriving at the register R2. Make sure to explain the use of virtual and physical addresses, the cache, TLB, and all other structures used in the data access process.

- b On many processors we can choose between a few pre-set virtual page sizes. Suggest three reasons why we would need different page sizes. Explain your three choices in detail.
- c Assume a 32 bit address space, virtual page size of 4 KBytes, a paging file size of 10GBytes, and a physical main memory of 1.5 GBytes. The code below shows a matrix multiply using double precision floating point arithmetic.

```
for(i=0;i<1000000000;i++){ //result rows
  for(j=0;j<1000000000;j++){ // result columns
     for(k=0;k<1000000000;k++){
         result[i][j]=matrix1[i][k]*matrix2[k][j];
     }
}</pre>
```

- i. Explain how you would change the code above to improve performance?
- ii. Explain how the layout of the data could be changed to obtain higher performance?

The three parts carry, respectively, 30%, 30%, 40% of the marks.

- 3a Describe the advantages of compiling programs directly into hardware.
- b Describe, with the use of a circuit diagram, how the parallel execution statement:

can be implemented in hardware using the token-passing method.

- c A DO ... UNTIL (E) loop differs from a WHILE (E) loop in that the first iteration of the loop is always executed, the expression E is only evaluated at the end of the loop and the loop terminates when E is true.
  - i) Provide a circuit diagram showing how a DO . . . UNTIL (E) loop can be implemented in token-passing hardware.
  - ii) Hence, provide the circuit diagram for the following program implemented in token-passing hardware:

```
int_6 X
SEQ
   X := 10
   DO
       X := X - 1
UNTIL (X = 0)
```

The three parts carry, respectively, 20%, 30%, 50% of the marks.

- The executable code for program P has N instructions. A fraction  $\alpha$  of the instructions are type Y, and the rest are type X.
  - i) Machine M1 has a single-cycle data-path and operates at  $f_l$  cycles per second (Hz). How long will M1 take to execute P?
  - ii) Machine M2 has a multi-cycle data-path, and operates at  $f_2$  Hz. Given that each instruction takes x cycles to execute, how long will it take M2 to execute P?
  - iii) Machine M3 also has a multi-cycle data-path, and operates at  $f_3$  Hz. However, although the execution time for type X instructions is the same as for M2 (x cycles), type Y instructions are faster, taking y cycles to execute. Given that  $f_2 > f_3$  determine how many times faster Y instructions need to be executed for M3 to be faster than M2 when running P.
- b The circuit ALUb shown below implements one bit of a simple ALU.



| $\mathbf{d}_1$ | $d_0$ | ck | $\mathbf{z}_{\mathbf{k}}$ | c <sub>k+1</sub> |
|----------------|-------|----|---------------------------|------------------|
| 0              | X     | 0  | $a_k \oplus b_k$          | $a_k \cdot b_k$  |
| 0              | X     | 1  | $a_k \oplus b_k$          | $a_k + b_k$      |
| 1              | 0     | X  | $a_k \cdot b_k$           | X                |
| 1              | 1     | X  | $a_k + b_k$               | X                |

- Provide a circuit diagram to show how ALUb can be built using a fulladder, multiplexors and two-input logic gates. Do not show the internal logic of the fulladder or the multiplexors.
- ii) Provide a circuit diagram to show how four copies of the ALUb circuit can be used to build a four-bit ALU which implements add, AND and OR functions when  $d_1d_0 = 01$ , 10, 11 respectively (ignore overflow). Label all inputs and outputs.
- iii) By adding logic to the carry input show how the ALUb circuit can be used to implement ALUb', which implements all the functions of ALUb as well as an exclusive-OR function on the inputs  $(a_k, b_k)$  when  $d_1d_0 = 00$ .

The two parts carry, respectively, 50%, 50% of the marks.

# EZ.17 Computer Architecture

| Depa                            | rtment of Computing Examinations –                                              | - 2006–2007 Session Con                                                                                                                                 | nfidential |  |
|---------------------------------|---------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|------------|--|
| MODEL ANSWER and MARKING SCHEME |                                                                                 |                                                                                                                                                         |            |  |
| First l                         | Examiner Oskar                                                                  | Paper Code C210 = E C.1.                                                                                                                                | 3          |  |
| Secon                           | nd Examiner NpS                                                                 | Question   Page   out of                                                                                                                                | 9          |  |
| Quest                           | ion labels in left margin                                                       | Mark allocations in rig                                                                                                                                 | ht margin  |  |
| la                              | Exploring spatial                                                               | locality of data accesses                                                                                                                               | , 4        |  |
|                                 | ef DRAM accers the coche line site in Reducing coche line si                    | rite marenes gambrides. For a given system, received accesses conflict misses. The mareness initial misses accesses les efficient                       |            |  |
| ٩                               | onsuring 32 bit wer                                                             | ds = 0 128 woods/line = 7 adds sed to check the tag.                                                                                                    | . 4        |  |
| С                               | 1 carleline = 1 nemo<br>4 MBytes = 8192                                         | y transaction<br>cache Quies                                                                                                                            | 6          |  |
| d                               | 4 cc for unliptyadd<br>For 2 cache lines (1<br>512 by tes=128 w<br>128 ×4cc=512 | For womany bus transactions  Per is 12 M Bytes.  (cc = clock cycle)  write, I read) = \$1000 cc  orals  cc = 0488 cc wait  cc total = 051,2% efficiency | 6          |  |

| Depar                           | tment of Computing Examinations –                           | - 2006-2007 Session Conf              | idential |  |
|---------------------------------|-------------------------------------------------------------|---------------------------------------|----------|--|
| MODEL ANSWER and MARKING SCHEME |                                                             |                                       |          |  |
| First E                         | examiner os kar                                             | Paper Code $C210 = E2.1$              | 3        |  |
| Second                          | d Examiner NPS                                              | Question 2 Page 2 out of              | 7        |  |
| Questi                          | Question labels in left margin Mark allocations in right ma |                                       |          |  |
| 2 a                             | (RI) is a victial ado                                       | ben A. holdren A goes to the          | 6        |  |
|                                 |                                                             | wiss, to the Ostion (alien table)     |          |  |
|                                 | sently in a physical                                        | address, A. A goes to the value       |          |  |
|                                 | and in case ela miss,                                       | ell the way to the main memory.       |          |  |
|                                 | in case the page is not in                                  | inamore, we elso have to wat fe it to |          |  |
|                                 | ording from the pageing for                                 | ile                                   |          |  |
| 6                               | page size : gramatice of                                    | I vistual weeners accen               | 6        |  |
|                                 | Differ loss of the                                          | memory accon                          |          |  |
|                                 | Different page sizes allo                                   | was to adopt to                       |          |  |
|                                 | - eliferent atologe il                                      | baseleristics farthe paging file      |          |  |
|                                 | - ordapt to the appl                                        | ications data occess regornement      |          |  |
|                                 | - odbben a layer                                            | adolen space (e.g. on 26)             |          |  |
|                                 |                                                             | ×                                     |          |  |
|                                 | Glocking undares perform                                    | rance, i.e. portilion la malices      | 8        |  |
|                                 | into maller books up,                                       | in let on to the series of all all    |          |  |
|                                 | your way though each                                        | plack repossely, layout counter       |          |  |
|                                 | trigung en                                                  | I metrix I in now ender               |          |  |
|                                 | and matrix 2 in colu                                        | um order.                             |          |  |
|                                 | 8                                                           |                                       |          |  |
|                                 |                                                             |                                       |          |  |

| Depai   | ertment of Computing Examinations — 2006–2007 Session                                           | Confide                     | ential |
|---------|-------------------------------------------------------------------------------------------------|-----------------------------|--------|
|         | MODEL ANSWER and MARKING SCI                                                                    | HEME                        |        |
| First E | Examiner np S Paper Code                                                                        | C210=E2.13                  |        |
| Secon   | nd Examiner Oskar Question                                                                      | Page 3 out of 9             |        |
| Questi  | tion labels in left margin                                                                      | Mark allocations in right m | argin  |
| 3~      | Programs compiled directly into hardware executed in parallel, which medies                     |                             | 4      |
|         | One can create processor with custom which we hardware more efficiently the purpose processors. |                             |        |
|         |                                                                                                 |                             |        |

### Department of Computing Examinations — 2006–2007 Session Confidential MODEL ANSWER and MARKING SCHEME Paper Code C210 = E2.13First Examiner Mps Page 4 out of Second Examiner Question oskar Question labels in left margin Mark allocations in right margin 6 35 Ster f Q reset reset reset finish The token is split and passed to all thee processes (P,Q,R). Registers latch the tohens as each process finishes. The AND gette medies sure the first token is not generated until all three processes have finished. The fruish tolen resets the register to the unitical state. The OR gates are needed to avoid waiting an

exter cycle after the slowest process completes before

generating the first token.



| Department of Computing Examinations — 2006–2007 Session Confide                        | ential |
|-----------------------------------------------------------------------------------------|--------|
| MODEL ANSWER and MARKING SCHEME                                                         |        |
| First Examiner $\nu \rho s$ Paper Code $C2_{10} = E2.13$                                |        |
| Second Examiner OSKOLI Question 4 Page 6 out of                                         |        |
| Question labels in left margin  Mark allocations in right margin                        | argin  |
| (4a) MIExecution time = CPIx (now. instructions) x - 1 clock rate                       | 10     |
| $= \frac{1}{N}$                                                                         |        |
| ii) we Execution time = $x \cdot N \cdot \frac{1}{f_2}$ $= \frac{x \cdot N}{f_2}$       |        |
| iii) Un3 Execution true : (overage CPI) = N x 1/3                                       |        |
| average $CPI = (I-X)X + XY$                                                             |        |
| $\therefore \text{ execution time} = \left[ (1-\alpha) + \alpha y \right] \frac{N}{43}$ |        |
| For this to be fewler than U12:                                                         |        |
| $[(1-\alpha)x + \alpha y] \frac{N}{4z} < \frac{xN}{4z}$                                 |        |
|                                                                                         |        |

## Department of Computing Examinations - 2006-2007 Session Confidential MODEL ANSWER and MARKING SCHEME C210= E2.13 First Examiner Paper Code Page 7 out of 9 Second Examiner oskar Question Question labels in left margin Mark allocations in right margin 基 4 a $\alpha y < \frac{1}{4} - (1-\alpha)x$ cout. y 2 f3/f2 -1 + ∝ The R.H.S. is how many trues taske y has to be for UTS to be quidu, at executing P. than U1?.



