Index

  1. What is Code Generation?
  2. Issues in the design of a Code Generator
  3. Simple Code Generator
  4. Register Allocation and Assignment

What is Code Generation?

The Code Generation phase in a compiler is responsible for producing the target machine code (or assembly language) from intermediate representation (IR). This phase ensures that the generated code is correct, efficient, and adheres to the target machine’s architecture.


Issues in the design of a Code Generator

https://www.geeksforgeeks.org/issues-in-the-design-of-a-code-generator/

What is a Code Generator?

==Code generator converts the intermediate representation of source code into a form that can be readily executed by the machine==. A code generator is expected to generate the correct code. Designing of the code generator should be done in such a way that it can be easily implemented, tested, and maintained.

The following issues often arise while designing a Code Generator:

1. Issues in input to code generator

The input to the code generator is the intermediate code generated by the front end, along with information in the symbol table that determines the run-time addresses of the data objects denoted by the names in the intermediate representation.

Intermediate codes may be represented mostly in quadruples, triples, indirect triples, Postfix notation, syntax trees, DAGs, etc. The code generation phase just proceeds on an assumption that the input is free from all syntactic and state semantic errors, the necessary type checking has taken place and the type-conversion operators have been inserted wherever necessary.

So in case the input is not free of syntactic and state semantic errors and not properly optimized, there can be a lot of problems in the code generation phase.


2. Target Program and Target Machine Architecture

The target program is the output of the code generator. The output may be absolute machine language, relocatable machine language, or assembly language.

  • Absolute machine language as output has the advantages that it can be placed in a fixed memory location and can be immediately executed. For example, WATFIV is a compiler that produces the absolute machine code as output.

  • Relocatable machine language as an output allows subprograms and subroutines to be compiled separately. Relocatable object modules can be linked together and loaded by a linking loader. But there is added expense of linking and loading.

  • Assembly language as output makes the code generation easier. We can generate symbolic instructions and use the macro-facilities of assemblers in generating code. And we need an additional assembly step after code generation.

So sometimes there could be a problem in designing the code generator if there is too much uncertainty as to what is the target machine architecture.


3. Issues in Memory Management

Mapping the names in the source program to the addresses of data objects is done by the front end and the code generator. A name in the three address statements refers to the symbol table entry for the name. Then from the symbol table entry, a relative address can be determined for the name.


4. Issues in Instruction Selection

Selecting the best instructions will improve the efficiency of the program. It includes the instructions that should be complete and uniform. Instruction speeds and machine idioms also play a major role when efficiency is considered. But if we do not care about the efficiency of the target program then instruction selection is straightforward.

For example, the respective three-address statements would be translated into the latter code sequence as shown below:

P:=Q+R
S:=P+T
 
MOV Q, R0
ADD R, R0
MOV R0, P
MOV P, R0
ADD T, R0
MOV R0, S

Here the fourth statement is redundant as the value of the P is loaded again in that statement that just has been stored in the previous statement. It leads to an inefficient code sequence. A given intermediate representation can be translated into many code sequences, with significant cost differences between the different implementations. Prior knowledge of instruction cost is needed in order to design good sequences, but accurate cost information is difficult to predict.


5. Register Allocation Issues

Use of registers make the computations faster in comparison to that of memory, so efficient utilization of registers is important. The use of registers is subdivided into two subproblems:

  1. During Register allocation – we select only those sets of variables that will reside in the registers at each point in the program.
  2. During a subsequent Register assignment phase, the specific register is picked to access the variable.
 // To understand the concept consider the following three address code sequence
 t:=a+b
 t:=t*c
 t:=t/d
 
 // Their efficient machine code sequence is as follows:
 
 MOV a,R0
 ADD b,R0
 MUL c,R0
 DIV d,R0
 MOV R0,t

  1. Evaluation order – The code generator decides the order in which the instruction will be executed. The order of computations affects the efficiency of the target code. Among many computational orders, some will require only fewer registers to hold the intermediate results. However, picking the best order in the general case is a difficult NP-complete problem.

  2. Approaches to code generation issues: Code generator must always generate the correct code. It is essential because of the number of special cases that a code generator might face. Some of the design goals of code generator are:

    • Correct
    • Easily maintainable
    • Testable
    • Efficient

Disadvantages in the design of a code generator:

  • Limited flexibility: Code generators are typically designed to produce a specific type of code, and as a result, they may not be flexible enough to handle a wide range of inputs or generate code for different target platforms. This can limit the usefulness of the code generator in certain situations.

  • Maintenance overhead: Code generators can add a significant maintenance overhead to a project, as they need to be maintained and updated alongside the code they generate. This can lead to additional complexity and potential errors.

  • Debugging difficulties: Debugging generated code can be more difficult than debugging hand-written code, as the generated code may not always be easy to read or understand. This can make it harder to identify and fix issues that arise during development.

  • Performance issues: Depending on the complexity of the code being generated, a code generator may not be able to generate optimal code that is as performant as hand-written code. This can be a concern in applications where performance is critical.

  • Learning curve: Code generators can have a steep learning curve, as they typically require a deep understanding of the underlying code generation framework and the programming languages being used. This can make it more difficult to onboard new developers onto a project that uses a code generator.

  • Over-reliance: It’s important to ensure that the use of a code generator doesn’t lead to over-reliance on generated code, to the point where developers are no longer able to write code manually when necessary. This can limit the flexibility and creativity of a development team, and may also result in lower quality code overall.


Simple Code Generator

https://www.geeksforgeeks.org/simple-code-generator/

A simple code generator is a basic implementation that translates intermediate code to target machine code without significant optimizations. It serves as a starting point to understand code generation fundamentals.

Steps in a Simple Code Generator

1. Parsing the input.

The first step involves constructing an Abstract Syntax Tree (AST) by traversing all possible paths through your input file(s). This tree will contain information about every bit of data in your program as they are encountered during parsing or execution time; it’s important to note that this can take place both at compile time (as part of compiling) or runtime (in some cases).


2. The Register Descriptor

Register descriptors are data structures that store information about the registers used in the program. This includes the registration number and its name, along with its type. The compiler uses this information when generating machine code for your program, so it’s important to keep it up-to-date while writing code!

The compiler uses the register file to determine what values will be available for use in your program. This is done by walking through each of the registers and determining if they contain valid data or not. If there’s nothing in a register, then it can be used for other purposes!


3. The Address Descriptor

An address descriptor is used to represent the memory locations used by a program. Address descriptors are created by the getReg function, which returns a structure containing information about how to access memory. Address descriptors can be created for any instruction in your program’s code and stored in registers or on the stack; however, only one instance of an address descriptor will exist at any given time (unless another thread is executing).

When the user wants to retrieve data from an arbitrary location within the program’s source code using getReg, call this method with two arguments: The first argument specifies which register contains your desired value (e.g., ‘M’), while the second argument specifies where exactly within this register should it be placed back onto its original storage location on disk/memory before returning it back up into main memory again after successfully accessing its contents via indirect calls like LoadFromBuffer() or StoreToBuffer().


4. The Code Generation Algorithm

The code generation algorithm is the core of the compiler. It sets up register and address descriptors, then generates machine instructions that give you CPU-level control over your program.

The algorithm is split into four parts: register descriptor set-up, basic block generation, instruction generation for operations on registers (e.g., addition), and ending the basic block with a jump statement or return command.

  1. Register Descriptor Set Up: This part sets up an individual register’s value in memory space by taking its index into an array of all possible values for that type of register (i32). It also stores information about what kind of operation was performed on it so that subsequent steps can identify which operation happened if they’re called multiple times during execution.

  2. Basic Block Generation: This step involves creating individual blocks within each basic block as well as lines between them so we can keep track of where things are happening at any given moment during execution.

  3. Instruction Generation For Operations On Registers: This step converts source code statements into machine instructions using information from both our ELF file format files (the ones generated by GCC) as well as other sources such as Bazel’s build system which knows how to generate particular kind of machine code for particular CPUs. This is where we start to see the magic of how compilers work in practice, as they’re able to generate code that’s optimized in various ways based on the type of operation being performed (e.g., addition) and the registers involved (i32). This step can also be thought of as “register allocation” because it’s where we determine which registers will be used for each operation, and how many there are in total. This step uses the information generated in the previous steps as well as other information such as rules about how many registers are needed for certain operations. For example, we might know that 32-bit addition requires two registers: one to hold the value being added, and one for the result of this operation.

  4. Instruction Scheduling: This step reorders instructions so that they’re executed efficiently on a particular CPU architecture. This step uses information about the execution resources available on each CPU architecture to determine the best order for executing operations. It also considers things like whether or not we have enough registers to store values (if some are in use), or if there’s a bottleneck somewhere else in the pipeline.


Example

For the given TAC :

t1 = a * b
t2 = c + d
t3 = t1 + t2

A simple code generator for a hypothetical machine might emit:

LOAD R1, a     // Load 'a' into R1
MUL R1, b      // Multiply R1 by 'b'
STORE R1, t1   // Store the result in t1
 
LOAD R2, c     // Load 'c' into R2
ADD R2, d      // Add 'd' to R2
STORE R2, t2   // Store the result in t2
 
LOAD R3, t1    // Load t1 into R3
ADD R3, t2     // Add t2 to R3
STORE R3, t3   // Store the result in t3

Register Allocation and Assignment

https://www.geeksforgeeks.org/register-allocations-in-code-generation/?ref=asr10

Registers are the fastest locations in the memory hierarchy. But unfortunately, this resource is limited. It comes under the most constrained resources of the target processor. Register allocation is an NP-complete problem. However, this problem can be reduced to graph coloring to achieve allocation and assignment. Therefore a good register allocator computes an effective approximate solution to a hard problem.

The register allocator determines which values will reside in the register and which register will hold each of those values. It takes as its input a program with an arbitrary number of registers and produces a program with a finite register set that can fit into the target machine. (See image)

Register Allocation vs Register Assignment

Allocation vs Assignment: 

Allocation – 
Maps an unlimited namespace onto that register set of the target machine

  • Reg. to Reg. Model: Maps virtual registers to physical registers but spills excess amount to memory.
  • Mem. to Mem. Model: Maps some subset of the memory location to a set of names that models the physical register set.

Allocation ensures that code will fit the target machine’s reg. set at each instruction.

Assignment – 
Maps an allocated name set to the physical register set of the target machine.  

  • Assumes allocation has been done so that code will fit into the set of physical registers.
  • No more than ‘k’ values are designated into the registers, where ‘k’ is the no. of physical registers.