
Compiler

A compiler is a computer program which translates source code written in a high-level programming language into executable machine code. The act of doing this is called compilation.

There are compilers for the more common programming languages available on a wide variety of different computer architectures. This means that a program written in one of these languages may be portable: it can often be run on any of these machines simply by recompiling it with the appropriate compiler. There may, however, be occasional problems with machine dependencies, especially if the machines differ substantially in the size and accuracy of their numeric types.

When first written, many programs contain errors. Some errors can be detected by the compiler, while other errors might only be detected when a program is running.[1] Some errors only occur under very specific circumstances and may never be detected.

Some compilers just perform a fast simple-minded translation from high-level source code to low-level machine language, while other compilers may optionally take (much) longer to perform the compilation by trying to improve or optimise the resulting machine code.

History

The first use of the term 'compiler', in relation to program translation, is credited to Grace Hopper,[2] based on the legal meaning of compilation, since her translator for the A-0 language merely put together various snippets of pre-built code.

The first translator that matched the modern definition of a compiler was the Fortran compiler for the IBM 704 computer. The Fortran manual was published in October 1956, and the compiler itself was made available to users in April 1957.[3]

Computer users nowadays may not realise the severe technical constraints which applied to programming in the mid 1950s. Modern personal computers have the entire central processing unit on a single chip (integrated circuit), have main memories (RAM) containing many hundreds of millions of bytes, and have processing speeds of several billion instructions per second. The IBM 704 computer (fairly advanced for its time) had a central processor made up of many thousands of electronic valves (each the size of a light bulb), had a main memory of only four thousand 36-bit words, and had a processing speed of perhaps forty thousand instructions per second (modern mobile phones are more powerful than this).

Most compilers in the 1950s and 1960s were written in low-level assembly languages; there was little choice at the time, since the computer hardware could not readily support compilers written in high-level languages. The use of high-level languages for writing compilers became more mainstream with the advent of Pascal in 1970[4] and C between 1969 and 1973.[5]

Some compilers are written in the language which they translate (self-hosting), though some sort of bootstrapping process is needed to get the first version of such a compiler working. The first self-hosting compiler was for Lisp in 1962.[6] Both the Pascal and C compilers mentioned above are self-hosting.

Outline of the compilation process

In the freely available online book "Basics of Compiler Design",[7] Professor Torben Mogensen identifies seven phases within a compiler, with the various phases being performed in one or more passes over the program. A diagram illustrating these (and other optional optimisation) phases is shown on the last page of Professor Frank Pfenning's "Lecture Notes on Compiler Design: Overview".[8]

  • Lexical analysis: convert the source code into a sequence of tokens, such as variable names, keywords, numbers, and special symbols such as '+' (a small sketch of this phase follows the list).
  • Syntax analysis or parsing: arrange the tokens into a parse tree or syntax tree which represents the structure of the program; check that the structure is valid.
  • Semantic analysis: annotate the parse tree with semantic actions, and perform various consistency checks; a symbol table is used to keep track of names declared by the programmer.
  • Intermediate code generation: use the annotated parse tree to generate code in some simple machine-independent intermediate language.
  • Register allocation: map the symbolic names used in the intermediate code on to the registers available in the target machine.
  • Assembly code generation: translate the intermediate code into assembly language for the target machine.
  • Assembly and linking: translate the assembly language into executable machine code.
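
As a rough illustration of the first phase, the following sketch breaks a line of source text into tokens. It is written in Python purely for illustration; the choice of language and the token set are assumptions made here, not part of any particular compiler.

  import re

  # Token specification: each entry is (token kind, regular expression).
  TOKEN_SPEC = [
      ("NUMBER", r"\d+"),                    # integer literals
      ("NAME",   r"[A-Za-z_][A-Za-z_0-9]*"), # identifiers and keywords
      ("OP",     r"[+\-*/=]"),               # special symbols
      ("SKIP",   r"\s+"),                    # whitespace is discarded
  ]
  TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

  def tokenise(source):
      """Convert a string of source text into a list of (kind, text) tokens."""
      tokens = []
      pos = 0
      while pos < len(source):
          match = TOKEN_RE.match(source, pos)
          if match is None:
              raise SyntaxError("unexpected character %r at position %d"
                                % (source[pos], pos))
          if match.lastgroup != "SKIP":
              tokens.append((match.lastgroup, match.group()))
          pos = match.end()
      return tokens

  # Example: tokenise("total = count + 42") yields
  # [('NAME', 'total'), ('OP', '='), ('NAME', 'count'),
  #  ('OP', '+'), ('NUMBER', '42')]

A real lexer would also distinguish keywords from variable names and record source positions for error reporting, but the principle of splitting the text into labelled tokens is the same.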

Dealing with errors

Almost all non-trivial programs contain errors. Detecting and fixing all errors is very expensive and time-consuming, so a balance has to be struck between the potential effect of certain errors (e.g. could they be life-threatening?) and the cost of fixing them. The design of a programming language may also affect the likelihood of certain types of errors and how difficult they are to find and fix.

Gerald Weinberg reported in 1983 that the three most expensive programming errors of all time each cost hundreds of millions of dollars and that each was a one-line, coding-level mistake.[9]

During translation compilers can usually detect all errors due to lexical or syntactical mistakes (spelling and grammar); they may also be able to detect a few simple semantic errors (meaning). Even when a compiler does detect and report an error, all it can do is say what the error is and where it was detected. Unfortunately the point of detection can in some cases be many lines away from the actual source of the error, and it may occasionally be difficult to explain the error in terms that the programmer can understand.

A semantic/coding error such as using a plus sign '+' instead of a minus sign '-' is almost impossible for any automated system to detect - thorough human checking of both code and calculated results would be needed in this situation.

There are a wide variety of errors which can occur when a program is running.[10] Some compilers can optionally generate extra code to check for many of these errors, and there are some software packages which can provide the same effect. Unfortunately this extra checking code can slow a program down by a factor of 10 or 20.
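
As a hypothetical sketch of the kind of check such extra code performs, the fragment below guards an indexed load with an explicit bounds test and reports the offending source position. It is written in Python only to make the idea concrete; in practice the compiler emits the equivalent test as extra machine instructions around every access, which is where the slow-down comes from.

  def checked_load(array, index, source_line):
      """Simulate a compiler-inserted run-time check on an indexed load."""
      # A checking compiler emits the equivalent of this test before
      # every array access; the accumulated cost of all such tests is
      # what can slow an instrumented program down so dramatically.
      if index < 0 or index >= len(array):
          raise RuntimeError("run-time error at line %d: index %d outside 0..%d"
                             % (source_line, index, len(array) - 1))
      return array[index]

  # Example: checked_load([10, 20, 30], 3, source_line=17) raises
  # RuntimeError: run-time error at line 17: index 3 outside 0..2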

Research is continuing on the problem of proving that a program is correct. This is an extremely difficult problem - it has been shown that no algorithm can decide, for every program, whether or not that program eventually stops (the halting problem is undecidable). Furthermore, the proofs are often a lot longer than the programs being proved, so there is the additional problem of checking that the proof itself is correct. Bruce Schneier reports[11] that it took six people working for five years to prove the correctness of a 7,500-line program (about 250 lines per man-year). Jonathan Aldrich provides some examples of proof methods applied to tiny programs.[12]
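
As a minimal illustration of what such a proof involves (the notation is standard Hoare logic; the example is not taken from the cited sources), a single assignment statement can be specified and proved as follows:

  \{\, x \ge 0 \,\} \quad y := x + 1 \quad \{\, y \ge 1 \,\}

The precondition x >= 0 is obtained by substituting x + 1 for y in the postcondition y >= 1. Even this one-line step has to be justified and checked, which is one reason why proofs of realistic programs grow so much larger than the programs themselves.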

Optimisation

In a long-running calculation, the choice of a good algorithm is essential. Changing to a better algorithm can often provide far greater performance gains than are possible with compiler optimisation.

Programs often obey the 80-20 rule, where 20% of the code accounts for 80% of the CPU usage. Programmers usually do a poor job of guessing which 20% of the code is causing performance problems, so before doing any hand optimisation it is crucial to (use a tool to) measure the program's performance and identify the hot spots.
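
For example, Python programs can be measured with the standard cProfile module; the tool and the function names below are just one concrete illustration of profiling, and the same principle applies to profilers for any language.

  import cProfile
  import pstats

  def hot_function(n):
      # Deliberately slow code standing in for the program's real hot spot.
      return sum(i * i for i in range(n))

  def main():
      for _ in range(200):
          hot_function(10000)

  # Run the program under the profiler, then print the ten most
  # expensive calls so the hot spots can be identified and optimised.
  cProfile.run("main()", "profile.out")
  pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)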

Many compilers have one or more optional optimisation levels. As the level/amount of optimisation increases the compiler tends to spend more and more time doing the optimisation. The word 'optimisation' is perhaps misleading - perfect optimisation is a very hard problem (NP-complete or even undecidable), so a conceptually better term would be 'improvement'.

Note that the most important criterion for judging a compiler is that it be correct, i.e. it always produces a correct translation of the high-level language, regardless of the amount of optimisation performed.[13] The documentation for some compilers warns that results may be slightly different and/or that certain code may be incorrectly translated at high levels of optimisation.[14]

Such optimisations generally make the resulting machine code 'better', for some definition of 'better':

  • faster: higher speed may be important for frequently-used long-running programs.
  • smaller: less memory usage may be important for embedded computers used within other devices.
  • low power: less power consumption may be important for battery-powered computers (e.g. laptops).

Note that it is often only possible to optimise for a single criterion, since such optimisation may well make other criteria worse; it has been accepted for decades that many programs have a space/time trade-off so that reducing memory usage will increase execution time and vice versa.

The bulk of the machine-independent optimisation takes place after the intermediate code is generated, effectively by re-arranging and modifying this code. Some machine-dependent optimisation may take place during register allocation. The bulk of the machine-dependent optimisation takes place during the generation of assembly language; this often includes extensive peephole optimisation, which replaces a small group of assembler instructions by a shorter or simpler group.
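
The sketch below shows the idea of a peephole optimiser working over a made-up list of assembly-like instructions; the instruction names and the two rewrite patterns are invented for illustration, since real peephole rules are specific to the target machine.

  def peephole(instructions):
      """Remove or simplify obviously redundant instruction sequences.

      A peephole optimiser only ever examines a small 'window' of
      adjacent instructions, which keeps it fast and simple.
      """
      out = []
      i = 0
      while i < len(instructions):
          cur = instructions[i]
          nxt = instructions[i + 1] if i + 1 < len(instructions) else None
          # Pattern 1: a store immediately followed by a load of the same
          # value from the same location - the load is redundant because
          # the value is still in the register.
          if nxt and cur[0] == "STORE" and nxt[0] == "LOAD" and cur[1:] == nxt[1:]:
              out.append(cur)
              i += 2
              continue
          # Pattern 2: adding zero to a register changes nothing.
          if cur[0] == "ADD" and cur[2] == 0:
              i += 1
              continue
          out.append(cur)
          i += 1
      return out

  # Example:
  # peephole([("STORE", "R1", "x"), ("LOAD", "R1", "x"), ("ADD", "R2", 0)])
  # returns [("STORE", "R1", "x")]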

References

  1. Robert I. Pitts. Compiler, Linker and Run-Time Errors. Retrieved on 2012-04-02.
  2. Harold "Bud" Lawson and Howard Bromberg. The World's First COBOL Compilers. Retrieved on 2012-03-24.
  3. J. A. N. Lee. Twenty five years of Fortran (pdf). Retrieved on 2012-03-27.
  4. Pascal-P: The portable Pascal compiler. Retrieved on 2012-03-31.
  5. Dennis M. Ritchie. The Development of the C Language. Retrieved on 2012-03-31.
  6. Tim Hart and Mike Levin. AI Memo 39-The new compiler. Retrieved on 2012-03-29.
  7. Torben Mogensen. Basics of Compiler Design. Retrieved on 2012-03-19.
  8. Frank Pfenning. Lecture Notes on Compiler Design: Overview (pdf). Retrieved on 2012-03-19.
  9. Steve McConnell. Who Cares About Software Construction? Retrieved on 2012-04-02.
  10. Glenn R. Luecke et al. A Survey of Systems for Detecting Serial Run-Time Errors. Retrieved on 2012-04-03.
  11. Bruce Schneier. Proving a Computer Program's Correctness. Retrieved on 2012-04-03.
  12. Jonathan Aldrich. Hoare Logic: Proving Programs Correct. Retrieved on 2012-04-03.
  13. Christian Lindig. Random Testing of C Calling Conventions. Retrieved on 2012-04-05.
  14. Code optimization with the IBM XL Compilers. Retrieved on 2012-04-05.