GCC Front Ends

Here are some front ends I have written for GCC, as well as my notes on writing them. For the uninitiated, a new front end means extending GCC to support a new programming language.

Front ends

gquine
(download gquine-4.8.2.tar.gz)
A front end for "quine", a language I made up where any source file is a quine, a program which outputs its own source code. For example, the "Hello, World!" program in this language is:
Hello, World!
This was my first front end; I therefore used the simplest language I could imagine which did something. (Ignoring the "does something" condition, one could imagine a simpler language where any source file compiled into a program that does nothing or one where any source file was a syntax error.) Many systems include interpreters for the quine language, including cat on Unix-like systems and TYPE on DOS/Windows. (There are also common interpreters for the two simpler languages listed above; the commands true and false on Unix and Unix-like systems.)
gbf
(download gbf-4.8.2.tar.gz)
A front end for brainfuck, one of the best known esoteric programming languages. This implementation uses 8-bit wrapping cells and a fixed array of 30,000 cells with undefined behaviour on out-of-bounds access.
g99
(download g99-4.8.2.tar.gz)
A front end for 99, a joke language invented by Keith S. Thompson. Any syntactically correct program in 99 performs the same task when run: printing the lyrics to "99 Bottles of Beer".
Although an interpreter is available, there was until now no compiler for this language, so here is one. In fact, this was written as a proof-of-concept for writing a "compiler" which (mostly) ignores the input "source code" and effectively contains the program to be produced within itself.
This compiler actually introduces an extension to the 99 language. The language specification allows Perl-style comments: a hash (#) introduces a comment which runs to the end of the line. As I usually write C, C++ and Java, I found this unnecessarily restrictive, I have therefore allowed C-style (slash-star (/*) to star-slash (*/)) and C++-style (slash-slash (//) to end-of-line) comments. Since I used to write Visual Basic, I have also allowed comments between an apostraphe (') and the end of the line. This is a pure extension to the language — it does not affect the behaviour of any program which conforms to the specification. For those worried about their 99 programs being portable, the compiler option -Wforeign-comments will produce a warning for programs which use this extension; -Werror=foreign-comments will issue this warning as an error.[1]

Version compatibility: When compiled as part of

GCC 4.7.x
All of these front ends are incompatible
GCC 4.8.0 - 4.8.4
Compatible and tested!
GCC 4.9.0 - 4.9.1
Incompatible. However, they can be made compatible by adding the following includes to their language1.c files:
Somewhere after "tree.h": "stringpool.h"
Somewhere before "gimple.h": "dumpfile.h", "basic-block.h", "tree-ssa-alias.h", "internal-fn.h", "gimple-expr.h" and "is-a.h"
Somewhere after "gimple.h": "gimplfy.h"

[1]Although this sentence is very much tongue-in-cheek, it does allow the compiler to contain a useful example of compiler options and warnings.

Usage

Compiling and using each of these front ends follows the same pattern. First download the source tarballs for GCC 4.8.2 and the front end(s) that you wish to use. Extracting these should produce a directory gcc-4.8.2 for GCC itself and name-4.8.2 for each front end. Merge these directories together (each front end directory contains a directory name-4.8.2/gcc/language, which can be copied to gcc-4.8.2/gcc/language). Then, follow the instructions for compiling GCC. My preferred method looks something like this (assuming the GCC source code has been extracted to /home/chris/gcc/gcc-4.8.2):

/home/chris/gcc$ mkdir gcc-4.8.2-build
/home/chris/gcc$ cd gcc-4.8.2-build
/home/chris/gcc/gcc-4.8.2-build$ ../gcc-4.8.2/configure
Many lines of output omitted...
/home/chris/gcc/gcc-4.8.2-build$ make -j6
Many lines of output omitted...
/home/chris/gcc/gcc-4.8.2-build$ make DESTDIR=/home/chris/gcc/gcc-4.8.2-stage install
Many lines of output omitted...
/home/chris/gcc/gcc-4.8.2-build$ cd ..
/home/chris/gcc$

(Note: the -j6 parameter specifies to build using 6 jobs in parallel. You should tailor this according to the number of cores in your machine and whether you wish to get anything else done while GCC builds.) If this succeeds, the new compiler can be executed with `gcc-4.8.2-stage/usr/local/bin/gcc source-file'.

Notes

These are some notes about producing these; maybe they will help others with similar projects. (More likely they will be useful to me when I come back to this months later!) While we are here, a shoutout goes to the Pygments syntax highlighter (http://pygments.org/), without which the source code would be much less colourful.

Other References

The most accurate, complete and up-to-date reference is, of course, the code. The code is also well commented (this is pretty much a requirement for a project which has contributors all around the world who cannot discuss the code in person), so if you can find the right part of the code, you should be able to work out what it is doing. Of course, GCC is a very large project, so finding the right place becomes a task in itself.

The most important reference to use for this is the GCC internals manual, available at http://gcc.gnu.org/onlinedocs/gccint/. It is also available in other formats at the bottom of this page.

The GCC wiki (http://gcc.gnu.org/wiki/) is another useful resource, particularly the page GettingStarted. This also contains a useful link to a white paper about developing a GCC front end, although the paper was written for GCC 4.6.0 and several things have changed since then.

General Architecture

First, it is necessary to understand the overall process of building an executable from source code using GCC. It is a misconception that the executable gcc (note: gcc is distinct from GCC) is a compiler; more properly, gcc is a compiler driver. gcc is responsible for invoking, in turn, the compiler, assembler and linker, which looks a little like this:

handy diagram

The compiler itself includes a front end, which parses the source code and produces a neutral form called GENERIC, which it then transforms into GIMPLE. The back end is the machine-specific part, responsible for converting RTL into assembly code. The two are joined by the (oximoronically named) middle end, which performs optimisations in addition to converting GIMPLE to RTL.

another handy diagram

Files

All front ends require the following files.

Code

Now, the fun bit! The actual programming code for the compiler. The main interface between the front end and the rest of the compiler is in the form of language hooks. The header files langhooks.h and langhooks-def.h define the language hooks and default implementations for most of them. A front end usually implements the hooks by including these two headers, redefining the preprocessor macros (of the form LANG_HOOKS_*) for the required hooks and finally including the line of code `struct lang_hooks lang_hooks = LANG_HOOKS_INITIALIZER;'.

The quine front end is simple enough that all of the code is in a single source file, quine1.c. It begins by including some header files:

/* Front-end language hooks for the quine language.
   Copyright (c) 2013 Free Software Foundation, Inc.
   Copyright (c) 2014 Christopher Smith

This file is part of gquine.

gquine is free software: you can redistribute it and/or modify it under the
terms of the GNU General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later
version.

gquine is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
gquine. If not, see <http://www.gnu.org/licenses/>. */

/* Basic stuff. */
#include "config.h"
#include "system.h"
#include "ansidecl.h"
/* Trees. */
#include "coretypes.h"
#include "tree.h"
/* Stuff we do with trees. */
#include "tree-iterator.h"
#include "dumpfile.h"
#include "gimple.h"
#include "cgraph.h"
/* Options. */
#include "opts.h"
#include "options.h"
#include "flags.h"
/* Warnings and errors. */
#include "diagnostic-core.h"
/* Hooks. */
#include "toplev.h"
#include "langhooks.h"
#include "langhooks-def.h"
/* GGC needs this. */
#include "debug.h"
/* Vector template class. */
#include "vec.h"

These structures must be defined by the front end. I haven't touched them myself.

/* Required structures. */

struct GTY(()) lang_decl
{
  char dummy;
};

struct GTY(()) lang_type
{
  char dummy;
};

struct GTY(()) language_function
{
  char dummy;
};

struct GTY(()) lang_identifier
{
  struct tree_identifier common;
};

union GTY((desc ("0"),
           chain_next ("(union lang_tree_node *) TREE_CHAIN (&%h.generic)")))
      lang_tree_node
{
  union tree_node GTY((tag ("0"), desc ("tree_node_structure (&%h)"))) generic;
};

We need a global variable to hold the global declarations (in this case, just the main () function).

/* Global variables. */

tree global_decls = NULL_TREE;

Now we are ready to start defining language hooks. The hooks are all stored in a global variable of type struct lang_hooks called lang_hooks. struct lang_hooks is defined in gcc/langhooks.h; the lang_hooks global must be defined by the front end. The easiest way to do this is with the LANG_HOOKS_INITIALIZER macro from gcc/langhooks-def.h; this macro itself expands to an initaliser list built up of other LANG_HOOKS_* macros, most of which have defaults specified in the same header file. The front end can #undef those it wishes to override and #define them to something else. Nearly all of the language hooks take the form of function pointers.

The hook lang_hooks.name is intialised with LANG_HOOKS_NAME; it specifies the name of the front end.

/* Language hooks. */

#undef LANG_HOOKS_NAME
#define LANG_HOOKS_NAME "GNU Quine"

The first hook function for this compiler is lang_hooks.option_lang_mask, AKA LANG_HOOKS_OPTION_LANG_MASK. All this function has to do is return a constant whose name is CL_ followed by the language's name (as specified in lang.opt). This allows the compiler to check whether that the user has only used options which are supported by the front end. Of course, just implementing this function does not actually process the options; for that you must override LANG_HOOKS_HANDLE_OPTION (see g99 for an example).

#undef LANG_HOOKS_OPTION_LANG_MASK
#define LANG_HOOKS_OPTION_LANG_MASK quine_option_lang_mask
unsigned int
quine_option_lang_mask (void)
{
  return CL_quine;
}

The hook lang_hooks.init is called after the options have been processed to let the front end perform its own initialisation. This is a good time to create commonly used trees, such as integer types.

#undef LANG_HOOKS_INIT
#define LANG_HOOKS_INIT quine_init
bool
quine_init (void)
{
  /* Create default types like int. */
  build_common_tree_nodes (false, false);
  return true;
}

The hook lang_hooks.parse_file reads in and parses the source file. This implementation stores the code (in GENERIC format) in the global_decls variable. This hook also dumps the source file (more properly, the translation unit) in tree format if this has been requested.

/* Language hook to read and parse the source file. */
#undef LANG_HOOKS_PARSE_FILE
#define LANG_HOOKS_PARSE_FILE quine_parse_file
void
quine_parse_file (void)
{
  /* Open the source file. */
  FILE *in = fopen (in_fnames[0], "r");
  if (in == NULL)
    {
      error ("Can't open input file %s", in_fnames[0]);
      return;
    }

  /* Read the file into an array. */
  vec<char> file_data = vNULL;
  int c;
  while ((c = fgetc (in)) != EOF)
    file_data.safe_push (c);
  /* If the file ends in a newline, remove it; puts () will append a newline
   * anyway. If the file does not end in a newline, puts () will add one, but
   * this is a Good Thing when run in a terminal. */
  if (!file_data.is_empty () && file_data.last () == '\n')
    file_data.pop ();
  file_data.safe_push (0);

  fclose (in);

  /* Create the main () function. */
  /* Type for a function which returns int and has no parameters. */
  tree main_type = build_function_type_list (integer_type_node, NULL_TREE);
  /* Function with no source location, named "main" and of the above type. */
  tree main_decl = build_decl (BUILTINS_LOCATION, FUNCTION_DECL,
                               get_identifier ("main"), main_type);

  /* File scope. */
  DECL_CONTEXT (main_decl) = NULL_TREE;
  /* Has a definition. */
  TREE_STATIC (main_decl) = true;
  /* External linkage. */
  TREE_PUBLIC (main_decl) = true;
  /* Don't need anywhere to store argumnets. */
  DECL_ARGUMENTS (main_decl) = NULL_TREE;

  /* Return value for main (). */
  /* Result variable with no name and same type as main () returns. */
  tree main_ret = build_decl (BUILTINS_LOCATION, RESULT_DECL, NULL_TREE,
                              TREE_TYPE (main_type));
  /* Scoped within the main function. */
  DECL_CONTEXT (main_ret) = main_decl;
  /* Generated by the compiler. */
  DECL_ARTIFICIAL (main_ret) = true;
  /* No debugging symbol. */
  DECL_IGNORED_P (main_ret) = true;

  /* The result of the function is the above value. */
  DECL_RESULT (main_decl) = main_ret;

  /* Block to represent the scope of local variables. */
  tree bl = build_block (NULL_TREE, NULL_TREE, main_decl, NULL_TREE);
  DECL_INITIAL (main_decl) = bl;
  TREE_USED (bl) = true;

  /* The bind expression contains the statements to execute. */
  tree bind = build3 (BIND_EXPR, void_type_node, BLOCK_VARS (bl), NULL_TREE,
                      bl);
  /* Don't optimise it away. */
  TREE_SIDE_EFFECTS (bind) = true;

  /* List of statements in the main () function. */
  tree main_stmts = alloc_stmt_list ();

  /* Build the puts () function declaration. */
  /* Function type which returns int with unspecified parameters. */
  tree puts_type = build_function_type (integer_type_node, NULL_TREE);
  /* Function named "puts" of that type. */
  tree puts_decl = build_decl (BUILTINS_LOCATION, FUNCTION_DECL,
                               get_identifier ("puts"), puts_type);
  /* Only a declaration, no definition. */
  DECL_EXTERNAL (puts_decl) = true;
  /* External linkage. */
  TREE_PUBLIC (puts_decl) = true;

  /* puts("..."); */
  /* Expression which calls puts () with 1 argument which is a string literal
   * containing the source file data. */
  tree call_puts = build_call_expr (puts_decl, 1,
                                    build_string_literal (
                                        file_data.length (),
                                        file_data.address ()));
  append_to_statement_list (call_puts, &main_stmts);

  /* return 0; */
  /* Assign 0 to the return value. */
  tree main_set_ret = build2 (MODIFY_EXPR, TREE_TYPE (main_ret), main_ret,
                              build_int_cst (integer_type_node, 0));
  /* Perform the assignment and return from the function. */
  tree main_ret_expr = build1 (RETURN_EXPR, void_type_node, main_set_ret);
  append_to_statement_list (main_ret_expr, &main_stmts);

  /* Make the bind contain the statements. */
  BIND_EXPR_BODY (bind) = main_stmts;
  /* Set main to use the statements in bind. */
  DECL_SAVED_TREE (main_decl) = bind;

  /* Add the main function to the list of global declarations. */
  global_decls = chainon (main_decl, global_decls);

  /* Dump out the translation unit. */
  FILE *tu_stream = dump_begin (TDI_tu, NULL);
  if (tu_stream)
    {
      dump_node (global_decls, 0, tu_stream);
      dump_end (TDI_tu, tu_stream);
    }

  /* Prepare declarations for middle-end. */
  for (tree t = global_decls; t; t = TREE_CHAIN (t))
    {
      gimplify_function_tree (t);
      cgraph_finalize_function (t, false);
    }
}

The next two hooks are not used and I don't know what they are meant to do. They are lang_hooks.decls.global_bindings_p and lang_hooks.decls.pushdecl respectively. They must be implemented by the front end as langhooks-def.h does not provide a usable default, so I simply call the function gcc_unreachable (), which dumps a stack trace and terminates the compilation. (In the comments, ICE stands for internal compiler error.)

/* Unused; terminate with ICE. */
#undef LANG_HOOKS_GLOBAL_BINDINGS_P
#define LANG_HOOKS_GLOBAL_BINDINGS_P quine_global_bindings_p
bool
quine_global_bindings_p (void)
{
  gcc_unreachable ();
}

/* Unused; terminate with ICE. */
#undef LANG_HOOKS_PUSHDECL
#define LANG_HOOKS_PUSHDECL quine_pushdecl
tree
quine_pushdecl (tree decl ATTRIBUTE_UNUSED)
{
  gcc_unreachable ();
}

lang_hooks.decls.getdecls returns the global declarations, chained together with chainon.

/* Return the global declarations. */
#undef LANG_HOOKS_GETDECLS
#define LANG_HOOKS_GETDECLS quine_getdecls
tree
quine_getdecls (void)
{
  return global_decls;
}

Two more unused but required functions; lang_hooks.types.type_for_mode and lang_hooks.types.type_for_mode. They are intended to return a suitable type for a given size and signedness. The only time I have seen either of them called is when using my brainfuck compiler with optimisations; so have a look at gbf for an implementation that does something.

/* Unused; terminate with ICE. */
#undef LANG_HOOKS_TYPE_FOR_MODE
#define LANG_HOOKS_TYPE_FOR_MODE quine_type_for_mode
tree
quine_type_for_mode (enum machine_mode mode ATTRIBUTE_UNUSED,
                     int unsignedp ATTRIBUTE_UNUSED)
{
  gcc_unreachable ();
}

/* Unused; terminate with ICE. */
#undef LANG_HOOKS_TYPE_FOR_SIZE
#define LANG_HOOKS_TYPE_FOR_SIZE quine_type_for_size
tree
quine_type_for_size (unsigned precision ATTRIBUTE_UNUSED,
                     int unsignedp ATTRIBUTE_UNUSED)
{
  gcc_unreachable ();
}

Next we provide the definition of the lang_hooks global variable. Finally we include the two header files that were generated by GGC.

/* Define the lang_hooks structure. */

struct lang_hooks lang_hooks = LANG_HOOKS_INITIALIZER;

/* GGC stuff. */

#include "gt-quine-quine1.h"
#include "gtype-quine.h"