Here are some front ends I have written for GCC, as well as my notes on writing them. For the uninitiated, a new front end means extending GCC to support a new programming language.
Front ends
- gquine
-
(download gquine-4.8.2.tar.gz)
A front end for "quine", a language I made up where any source file is a quine, a program which outputs its own source code. For example, the "Hello, World!" program in this language is:Hello, World!
This was my first front end; I therefore used the simplest language I could imagine which did something. (Ignoring the "does something" condition, one could imagine a simpler language where any source file compiled into a program that does nothing or one where any source file was a syntax error.) Many systems include interpreters for the quine language, includingcat
on Unix-like systems andTYPE
on DOS/Windows. (There are also common interpreters for the two simpler languages listed above; the commandstrue
andfalse
on Unix and Unix-like systems.) - gbf
-
(download gbf-4.8.2.tar.gz)
A front end for brainfuck, one of the best known esoteric programming languages. This implementation uses 8-bit wrapping cells and a fixed array of 30,000 cells with undefined behaviour on out-of-bounds access. - g99
-
(download g99-4.8.2.tar.gz)
A front end for 99, a joke language invented by Keith S. Thompson. Any syntactically correct program in 99 performs the same task when run: printing the lyrics to "99 Bottles of Beer".
Although an interpreter is available, there was until now no compiler for this language, so here is one. In fact, this was written as a proof-of-concept for writing a "compiler" which (mostly) ignores the input "source code" and effectively contains the program to be produced within itself.
This compiler actually introduces an extension to the 99 language. The language specification allows Perl-style comments: a hash (#) introduces a comment which runs to the end of the line. As I usually write C, C++ and Java, I found this unnecessarily restrictive, I have therefore allowed C-style (slash-star (/*) to star-slash (*/)) and C++-style (slash-slash (//) to end-of-line) comments. Since I used to write Visual Basic, I have also allowed comments between an apostraphe (') and the end of the line. This is a pure extension to the language — it does not affect the behaviour of any program which conforms to the specification. For those worried about their 99 programs being portable, the compiler option-Wforeign-comments
will produce a warning for programs which use this extension;-Werror=foreign-comments
will issue this warning as an error.[1]
Version compatibility: When compiled as part of
- GCC 4.7.x
- All of these front ends are incompatible
- GCC 4.8.0 - 4.8.4
- Compatible and tested!
- GCC 4.9.0 - 4.9.1
-
Incompatible. However, they can be made compatible by adding the following includes to their
language1.c
files:
Somewhere after"tree.h"
:"stringpool.h"
Somewhere before"gimple.h"
:"dumpfile.h"
,"basic-block.h"
,"tree-ssa-alias.h"
,"internal-fn.h"
,"gimple-expr.h"
and"is-a.h"
Somewhere after"gimple.h"
:"gimplfy.h"
[1]Although this sentence is very much tongue-in-cheek, it does allow the compiler to contain a useful example of compiler options and warnings.
Usage
Compiling and using each of these front ends follows the same pattern. First download the source tarballs for GCC 4.8.2 and the front end(s) that you wish to use. Extracting these should produce a directory gcc-4.8.2
for GCC itself and name-4.8.2
for each front end. Merge these directories together (each front end directory contains a directory name-4.8.2/gcc/language
, which can be copied to gcc-4.8.2/gcc/language
). Then, follow the instructions for compiling GCC. My preferred method looks something like this (assuming the GCC source code has been extracted to /home/chris/gcc/gcc-4.8.2
):
/home/chris/gcc$ mkdir gcc-4.8.2-build /home/chris/gcc$ cd gcc-4.8.2-build /home/chris/gcc/gcc-4.8.2-build$ ../gcc-4.8.2/configure Many lines of output omitted... /home/chris/gcc/gcc-4.8.2-build$ make -j6 Many lines of output omitted... /home/chris/gcc/gcc-4.8.2-build$ make DESTDIR=/home/chris/gcc/gcc-4.8.2-stage install Many lines of output omitted... /home/chris/gcc/gcc-4.8.2-build$ cd .. /home/chris/gcc$
(Note: the -j6 parameter specifies to build using 6 jobs in parallel. You should tailor this according to the number of cores in your machine and whether you wish to get anything else done while GCC builds.) If this succeeds, the new compiler can be executed with `gcc-4.8.2-stage/usr/local/bin/gcc source-file
'.
Notes
These are some notes about producing these; maybe they will help others with similar projects. (More likely they will be useful to me when I come back to this months later!) While we are here, a shoutout goes to the Pygments syntax highlighter (http://pygments.org/), without which the source code would be much less colourful.
Other References
The most accurate, complete and up-to-date reference is, of course, the code. The code is also well commented (this is pretty much a requirement for a project which has contributors all around the world who cannot discuss the code in person), so if you can find the right part of the code, you should be able to work out what it is doing. Of course, GCC is a very large project, so finding the right place becomes a task in itself.
The most important reference to use for this is the GCC internals manual, available at http://gcc.gnu.org/onlinedocs/gccint/. It is also available in other formats at the bottom of this page.
The GCC wiki (http://gcc.gnu.org/wiki/) is another useful resource, particularly the page GettingStarted. This also contains a useful link to a white paper about developing a GCC front end, although the paper was written for GCC 4.6.0 and several things have changed since then.
General Architecture
First, it is necessary to understand the overall process of building an executable from source code using GCC. It is a misconception that the executable gcc
(note: gcc
is distinct from GCC) is a compiler; more properly, gcc
is a compiler driver. gcc
is responsible for invoking, in turn, the compiler, assembler and linker, which looks a little like this:
The compiler itself includes a front end, which parses the source code and produces a neutral form called GENERIC, which it then transforms into GIMPLE. The back end is the machine-specific part, responsible for converting RTL into assembly code. The two are joined by the (oximoronically named) middle end, which performs optimisations in addition to converting GIMPLE to RTL.
Files
All front ends require the following files.
-
config-lang.in
Configuration file for the front end. It is executed as a shell script as part ofconfigure
. It is expected to set the following variables:- language: the name of the language.
- compilers: the name of the compiler executable.
- gtfiles: a list of source that need to be processed by GGC, the garbage collector used in GCC.
# Front-end configuration file for the quine language. # Copyright (c) 2013 Free Software Foundation, Inc. # Copyright (c) 2014 Christopher Smith # # This file is part of gquine. # # gquine is free software: you can redistribute it and/or modify it under the # terms of the GNU General Public License as published by the Free Software # Foundation, either version 3 of the License, or (at your option) any later # version. # # gquine is distributed in the hope that it will be useful, but WITHOUT ANY # WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR # A PARTICULAR PURPOSE. See the GNU General Public License for more details. # # You should have received a copy of the GNU General Public License along with # gquine. If not, see <http://www.gnu.org/licenses/>. language="quine" compilers="quine1\$(exeext)" gtfiles="\$(srcdir)/quine/quine1.c"
-
lang-specs.h
This file gets included by the compiler driver, in the initialiser for the array `struct compiler default_compilers[]
' . It specifies when and how the compiler driver should invoke the compiler. This example specifies two entries in the array. The first begins".quine"
, so it specifies the rule to use for source files with the extension 'quine'; the rule"@quine"
says to use the rule for the language 'quine'. The second entry begins"@quine"
, so it specifies the rule for the quine language. The rule is treated as a command line; the percent sign allows special substitutions (for the full list see the comment beginning on line 270 ofgcc-4.8.2/gcc/gcc.c
titled "The Specs Language"). Soquine1
is the name of the compiler to execute,%(cc1_options)
includes another specification string which is common to almost all front ends,%i
simply passes the name of the source file and%{!fsyntax-only:%(invoke_as)}
includes a specification string which invokes the assembler, unless -fsyntax-only was specified.
The name following the @ sign can also be specified by the user to override the language, e.g. `gcc -x quine myfile
' will compile myfile as quine regardless of its extension. The three zeros are a way to specify extra properties of the rule, see the definition ofstruct compiler
ingcc-4.8.2/gcc/gcc.c
for more.
Example from gquine:/* Language specifications for the quine language. Copyright (c) 2013 Free Software Foundation, Inc. Copyright (c) 2014 Christopher Smith This file is part of gquine. gquine is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. gquine is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with gquine. If not, see <http://www.gnu.org/licenses/>. */ {".quine", "@quine", 0, 0, 0}, {"@quine", "quine1 %(cc1_options) %i %{!fsyntax-only:%(invoke_as)}", 0, 0, 0},
-
lang.opt
This file specifies what command line options are understood by the compiler. It is pretty well documented in the GCC internals manual, under Options. See my g99 project for an example which actually defines something here.
Example from gquine:; Options file for the quine language. ; Copyright (c) 2013 Free Software Foundation, Inc. ; Copyright (c) 2014 Christopher Smith ; ; This file is part of gquine. ; ; gquine is free software: you can redistribute it and/or modify it under the ; terms of the GNU General Public License as published by the Free Software ; Foundation, either version 3 of the License, or (at your option) any later ; version. ; ; gquine is distributed in the hope that it will be useful, but WITHOUT ANY ; WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR ; A PARTICULAR PURPOSE. See the GNU General Public License for more details. ; ; You should have received a copy of the GNU General Public License along with ; gquine. If not, see <http://www.gnu.org/licenses/>. ; Please try to keep this file in ASCII collating order. Language quine ;
-
Make-lang.in
This file gets used as part of the global makefile for GCC. Most of the required targets do nothing for a simple front end like this; the target that builds the compiler is simply the name of the language. Most of the others don't do anything here, mainly because they are about the documentation and I'm to lazy to write any.
Example from gquine:# Makefile fragement for the quine language. # Copyright (c) 2013 Free Software Foundation, Inc. # Copyright (c) 2014 Christopher Smith # # This file is part of gquine. # # gquine is free software: you can redistribute it and/or modify it under the # terms of the GNU General Public License as published by the Free Software # Foundation, either version 3 of the License, or (at your option) any later # version. # # gquine is distributed in the hope that it will be useful, but WITHOUT ANY # WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR # A PARTICULAR PURPOSE. See the GNU General Public License for more details. # # You should have received a copy of the GNU General Public License along with # gquine. If not, see <http://www.gnu.org/licenses/>. .PHONY: quine quine: quine1$(exeext) quine/quine1.o: quine/quine1.c $(CONFIG_H) $(SYSTEM_H) $(TREE_H) \ tree-iterator.h dumpfile.h $(GIMPLE_H) $(CGRAPH_H) $(OPTS_H) \ $(OPTIONS_H) $(FLAGS_H) $(DIAGNOSTIC_CORE_H) toplev.h \ langhooks.h $(LANGHOOKS_DEF_H) debug.h $(VEC_H) quine1$(exeext): quine/quine1.o $(BACKEND) $(LIBSDEPS) attribs.o +$(LINKER) quine/quine1.o $(ALL_CFLAGS) $(LDFLAGS) $(BACKEND) $(LIBS) \ $(GMPLIBS) $(BACKENDLIBS) attribs.o -o $@ quine.all.cross: quine.start.encap: quine.rest.encap: quine.tags: quine.info: quine.dvi: quine.pdf: quine.html: quine.man: quine.install-common: quine.install-info: quine.install-man: quine.install-plugin: quine.srcextra: quine.srcinfo: quine.srcman: quine.uninstall: quine.mostlyclean: $(RM) quine/quine1.o quine.clean: quine.mostlyclean quine.distclean: quine.clean quine.maintainer-clean: quine.distclean
Code
Now, the fun bit! The actual programming code for the compiler. The main interface between the front end and the rest of the compiler is in the form of language hooks. The header files langhooks.h
and langhooks-def.h
define the language hooks and default implementations for most of them. A front end usually implements the hooks by including these two headers, redefining the preprocessor macros (of the form LANG_HOOKS_*) for the required hooks and finally including the line of code `struct lang_hooks lang_hooks = LANG_HOOKS_INITIALIZER;
'.
The quine front end is simple enough that all of the code is in a single source file, quine1.c. It begins by including some header files:
/* Front-end language hooks for the quine language. Copyright (c) 2013 Free Software Foundation, Inc. Copyright (c) 2014 Christopher Smith This file is part of gquine. gquine is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. gquine is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with gquine. If not, see <http://www.gnu.org/licenses/>. */ /* Basic stuff. */ #include "config.h" #include "system.h" #include "ansidecl.h" /* Trees. */ #include "coretypes.h" #include "tree.h" /* Stuff we do with trees. */ #include "tree-iterator.h" #include "dumpfile.h" #include "gimple.h" #include "cgraph.h" /* Options. */ #include "opts.h" #include "options.h" #include "flags.h" /* Warnings and errors. */ #include "diagnostic-core.h" /* Hooks. */ #include "toplev.h" #include "langhooks.h" #include "langhooks-def.h" /* GGC needs this. */ #include "debug.h" /* Vector template class. */ #include "vec.h"
These structures must be defined by the front end. I haven't touched them myself.
/* Required structures. */ struct GTY(()) lang_decl { char dummy; }; struct GTY(()) lang_type { char dummy; }; struct GTY(()) language_function { char dummy; }; struct GTY(()) lang_identifier { struct tree_identifier common; }; union GTY((desc ("0"), chain_next ("(union lang_tree_node *) TREE_CHAIN (&%h.generic)"))) lang_tree_node { union tree_node GTY((tag ("0"), desc ("tree_node_structure (&%h)"))) generic; };
We need a global variable to hold the global declarations (in this case, just the main ()
function).
/* Global variables. */ tree global_decls = NULL_TREE;
Now we are ready to start defining language hooks. The hooks are all stored in a global variable of type struct lang_hooks
called lang_hooks
. struct lang_hooks
is defined in gcc/langhooks.h
; the lang_hooks
global must be defined by the front end. The easiest way to do this is with the LANG_HOOKS_INITIALIZER
macro from gcc/langhooks-def.h
; this macro itself expands to an initaliser list built up of other LANG_HOOKS_*
macros, most of which have defaults specified in the same header file. The front end can #undef
those it wishes to override and #define
them to something else. Nearly all of the language hooks take the form of function pointers.
The hook lang_hooks.name
is intialised with LANG_HOOKS_NAME
; it specifies the name of the front end.
/* Language hooks. */ #undef LANG_HOOKS_NAME #define LANG_HOOKS_NAME "GNU Quine"
The first hook function for this compiler is lang_hooks.option_lang_mask
, AKA LANG_HOOKS_OPTION_LANG_MASK
. All this function has to do is return a constant whose name is CL_ followed by the language's name (as specified in lang.opt
). This allows the compiler to check whether that the user has only used options which are supported by the front end. Of course, just implementing this function does not actually process the options; for that you must override LANG_HOOKS_HANDLE_OPTION
(see g99 for an example).
#undef LANG_HOOKS_OPTION_LANG_MASK #define LANG_HOOKS_OPTION_LANG_MASK quine_option_lang_mask unsigned int quine_option_lang_mask (void) { return CL_quine; }
The hook lang_hooks.init
is called after the options have been processed to let the front end perform its own initialisation. This is a good time to create commonly used trees, such as integer types.
#undef LANG_HOOKS_INIT #define LANG_HOOKS_INIT quine_init bool quine_init (void) { /* Create default types like int. */ build_common_tree_nodes (false, false); return true; }
The hook lang_hooks.parse_file
reads in and parses the source file. This implementation stores the code (in GENERIC format) in the global_decls
variable. This hook also dumps the source file (more properly, the translation unit) in tree format if this has been requested.
/* Language hook to read and parse the source file. */ #undef LANG_HOOKS_PARSE_FILE #define LANG_HOOKS_PARSE_FILE quine_parse_file void quine_parse_file (void) { /* Open the source file. */ FILE *in = fopen (in_fnames[0], "r"); if (in == NULL) { error ("Can't open input file %s", in_fnames[0]); return; } /* Read the file into an array. */ vec<char> file_data = vNULL; int c; while ((c = fgetc (in)) != EOF) file_data.safe_push (c); /* If the file ends in a newline, remove it; puts () will append a newline * anyway. If the file does not end in a newline, puts () will add one, but * this is a Good Thing when run in a terminal. */ if (!file_data.is_empty () && file_data.last () == '\n') file_data.pop (); file_data.safe_push (0); fclose (in); /* Create the main () function. */ /* Type for a function which returns int and has no parameters. */ tree main_type = build_function_type_list (integer_type_node, NULL_TREE); /* Function with no source location, named "main" and of the above type. */ tree main_decl = build_decl (BUILTINS_LOCATION, FUNCTION_DECL, get_identifier ("main"), main_type); /* File scope. */ DECL_CONTEXT (main_decl) = NULL_TREE; /* Has a definition. */ TREE_STATIC (main_decl) = true; /* External linkage. */ TREE_PUBLIC (main_decl) = true; /* Don't need anywhere to store argumnets. */ DECL_ARGUMENTS (main_decl) = NULL_TREE; /* Return value for main (). */ /* Result variable with no name and same type as main () returns. */ tree main_ret = build_decl (BUILTINS_LOCATION, RESULT_DECL, NULL_TREE, TREE_TYPE (main_type)); /* Scoped within the main function. */ DECL_CONTEXT (main_ret) = main_decl; /* Generated by the compiler. */ DECL_ARTIFICIAL (main_ret) = true; /* No debugging symbol. */ DECL_IGNORED_P (main_ret) = true; /* The result of the function is the above value. */ DECL_RESULT (main_decl) = main_ret; /* Block to represent the scope of local variables. */ tree bl = build_block (NULL_TREE, NULL_TREE, main_decl, NULL_TREE); DECL_INITIAL (main_decl) = bl; TREE_USED (bl) = true; /* The bind expression contains the statements to execute. */ tree bind = build3 (BIND_EXPR, void_type_node, BLOCK_VARS (bl), NULL_TREE, bl); /* Don't optimise it away. */ TREE_SIDE_EFFECTS (bind) = true; /* List of statements in the main () function. */ tree main_stmts = alloc_stmt_list (); /* Build the puts () function declaration. */ /* Function type which returns int with unspecified parameters. */ tree puts_type = build_function_type (integer_type_node, NULL_TREE); /* Function named "puts" of that type. */ tree puts_decl = build_decl (BUILTINS_LOCATION, FUNCTION_DECL, get_identifier ("puts"), puts_type); /* Only a declaration, no definition. */ DECL_EXTERNAL (puts_decl) = true; /* External linkage. */ TREE_PUBLIC (puts_decl) = true; /* puts("..."); */ /* Expression which calls puts () with 1 argument which is a string literal * containing the source file data. */ tree call_puts = build_call_expr (puts_decl, 1, build_string_literal ( file_data.length (), file_data.address ())); append_to_statement_list (call_puts, &main_stmts); /* return 0; */ /* Assign 0 to the return value. */ tree main_set_ret = build2 (MODIFY_EXPR, TREE_TYPE (main_ret), main_ret, build_int_cst (integer_type_node, 0)); /* Perform the assignment and return from the function. */ tree main_ret_expr = build1 (RETURN_EXPR, void_type_node, main_set_ret); append_to_statement_list (main_ret_expr, &main_stmts); /* Make the bind contain the statements. */ BIND_EXPR_BODY (bind) = main_stmts; /* Set main to use the statements in bind. */ DECL_SAVED_TREE (main_decl) = bind; /* Add the main function to the list of global declarations. */ global_decls = chainon (main_decl, global_decls); /* Dump out the translation unit. */ FILE *tu_stream = dump_begin (TDI_tu, NULL); if (tu_stream) { dump_node (global_decls, 0, tu_stream); dump_end (TDI_tu, tu_stream); } /* Prepare declarations for middle-end. */ for (tree t = global_decls; t; t = TREE_CHAIN (t)) { gimplify_function_tree (t); cgraph_finalize_function (t, false); } }
The next two hooks are not used and I don't know what they are meant to do. They are lang_hooks.decls.global_bindings_p
and lang_hooks.decls.pushdecl
respectively. They must be implemented by the front end as langhooks-def.h
does not provide a usable default, so I simply call the function gcc_unreachable ()
, which dumps a stack trace and terminates the compilation. (In the comments, ICE stands for internal compiler error.)
/* Unused; terminate with ICE. */ #undef LANG_HOOKS_GLOBAL_BINDINGS_P #define LANG_HOOKS_GLOBAL_BINDINGS_P quine_global_bindings_p bool quine_global_bindings_p (void) { gcc_unreachable (); } /* Unused; terminate with ICE. */ #undef LANG_HOOKS_PUSHDECL #define LANG_HOOKS_PUSHDECL quine_pushdecl tree quine_pushdecl (tree decl ATTRIBUTE_UNUSED) { gcc_unreachable (); }
lang_hooks.decls.getdecls
returns the global declarations, chained together with chainon
.
/* Return the global declarations. */ #undef LANG_HOOKS_GETDECLS #define LANG_HOOKS_GETDECLS quine_getdecls tree quine_getdecls (void) { return global_decls; }
Two more unused but required functions; lang_hooks.types.type_for_mode
and lang_hooks.types.type_for_mode
. They are intended to return a suitable type for a given size and signedness. The only time I have seen either of them called is when using my brainfuck compiler with optimisations; so have a look at gbf for an implementation that does something.
/* Unused; terminate with ICE. */ #undef LANG_HOOKS_TYPE_FOR_MODE #define LANG_HOOKS_TYPE_FOR_MODE quine_type_for_mode tree quine_type_for_mode (enum machine_mode mode ATTRIBUTE_UNUSED, int unsignedp ATTRIBUTE_UNUSED) { gcc_unreachable (); } /* Unused; terminate with ICE. */ #undef LANG_HOOKS_TYPE_FOR_SIZE #define LANG_HOOKS_TYPE_FOR_SIZE quine_type_for_size tree quine_type_for_size (unsigned precision ATTRIBUTE_UNUSED, int unsignedp ATTRIBUTE_UNUSED) { gcc_unreachable (); }
Next we provide the definition of the lang_hooks
global variable. Finally we include the two header files that were generated by GGC.
/* Define the lang_hooks structure. */ struct lang_hooks lang_hooks = LANG_HOOKS_INITIALIZER; /* GGC stuff. */ #include "gt-quine-quine1.h" #include "gtype-quine.h"