Wednesday, August 18, 2010

What include guards in C++ are, and what they are not

And why they do not solve "multiple definition" errors.

Include guards are heavily used in C++, and you see them in virtually any code base (that has more than two files). Sometimes there is a bit of confusion about what they do and what they cannot do. I decided to explain everything in detail, so I ended up explaining the whole C++ build process of preprocessing, compiling and linking your application or library.

Introduction to include guards

First, it's important to know that include guards are not a language feature. They are a technique that uses standard preprocessor features to solve a common issue: they prevent the content of a header from being pasted into the same implementation file more than once. Let's assume a simple example of two files: test.cpp and the corresponding test.hpp:





test.hpp:
class Test
{
   public:
      void foo();
};
test.cpp:

#include "test.hpp"
/* Some more includes */
#include "test.hpp" // Error! class Test is already defined

void Test::foo()
{
   // Do something here
}

int main()
{
   Test test;
   test.foo();
}

As you can see, test.hpp is included twice in this example. This might not be common within a single implementation file, but when the list of includes is very long, or when headers include other headers, it can easily happen. Remember that an #include is basically the same as copying and pasting the content of the included file at this very position, only that the preprocessor does it for you. The resulting code when you run test.cpp through the preprocessor will look like this:

test.cpp - preprocessed

class Test
{
   public:
      void foo();
};

class Test
{
   public:
      void foo();
};

void Test::foo()
{
   // Do something here
}

int main()
{
   Test test;
   test.foo();
}

It should be obvious now why the error occurs. You can't define two classes with the same name in the same translation unit. And that's where the include guards come to the rescue. The basic idea behind include guards is: do something in the header that stops the preprocessor from "copying & pasting" the content into the including implementation file a second time. Besides getting rid of the duplicate definitions, this even speeds up compiling, since otherwise the compiler would have to process the same content twice. This goal is achieved by checking whether a certain preprocessor macro is defined. If it is not, the macro gets defined, so that a second check will notice that it is already there, which means that the header has already been included. Here's how it is done:

#ifndef TEST_H
#define TEST_H

class Test
{
   public:
      void foo();
};

#endif

This is what happens, from the viewpoint of the preprocessor:

  • 1st Include of test.hpp
    • Is TEST_H defined? - no
    • Define TEST_H
    • Include content of test.hpp
  • 2nd Include of test.hpp
    • Is TEST_H defined? - yes
    • Skip to the #endif directive
    • Include everything in test.hpp after the #endif

      (In 99.9% of all cases, this should be empty)

This leads to the desired result for the preprocessed test.cpp, even when we include test.hpp twice:

test.cpp - preprocessed with include guards
class Test
{
   public:
      void foo();
};

// Here should be the second include, which has been avoided by the include guard

void Test::foo()
{
   // Do something here
}

int main()
{
   Test test;
   test.foo();
}
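
To try this without creating several files, the two includes can be pasted by hand into a single file. The following is only a sketch of that situation: foo() returns a value here purely so the result can be checked, and call_foo() is a hypothetical helper that is not part of the example above.

```cpp
// Sketch: both "includes" of test.hpp pasted into a single file by hand.

// ---- first #include "test.hpp" ----
#ifndef TEST_H          // TEST_H is not defined yet, so the block is kept
#define TEST_H

class Test
{
   public:
      int foo() { return 42; }   // returns a value only so it can be checked
};

#endif

// ---- second #include "test.hpp" ----
#ifndef TEST_H          // TEST_H is defined now: everything up to #endif is skipped
#define TEST_H

class Test
{
   public:
      int foo() { return 42; }
};

#endif

// Hypothetical helper (not in the original example) that uses the class.
int call_foo()
{
   Test test;
   return test.foo();
}
```

If you remove the two #ifndef/#define/#endif triples, the compiler should reject the file with a redefinition error for class Test.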

Compiler-specific directives, #pragma once

Since include guards are so common, some compiler vendors (read: Microsoft) decided to create a special directive for them. You might have come across the statement #pragma once while reading through code. This #pragma once is all three preprocessor directives in one. It replaces the #ifndef, the #define and the #endif (where the #endif is always at the very end of the file). As this is a non-standard, vendor-specific directive, you cannot rely on every compiler supporting it. One reason for this may be that it does not solve a problem that couldn't be solved otherwise. This is why you should abide by the following simple rule:

Don't use #pragma once

Because:

  • It's compiler-specific and is not portable
  • Because of this, not everyone reading your code might know it
  • It does not offer big advantages over the "classic" include guards.
(I said "big" advantages. As noted in the Wikipedia article on #pragma once, Visual C++ includes optimization code that makes headers using #pragma once be skipped faster than classic include guards. While this might be desirable, I can't imagine that in practice it will be of great benefit. And, after all, GCC includes an optimization for classic include guards, too, which might render the speed improvements of #pragma once on Visual Studio minimal.)

The compile process: Preprocess, compile, link

To fully understand the issue at hand, you have to know how the whole compilation procedure for C++ works. It is divided into three steps.

The first step is the preprocessor. Every line that starts with a # sign is a preprocessor directive. The most common ones are #include, #define, #ifdef or #ifndef, #endif and so on. The preprocessor is essential to C++ because it's the only pragmatic way to split up declarations and definitions and make the same function or class usable from multiple implementation files. The preprocessor is run on each implementation file (to make this clear again: these are the files usually named .cpp). The implementation files include header files (and should never include other implementation files). All preprocessor directives in the header files will be processed, too, which means that you can #include other files in the header, #define macros and create include guards. The result of a preprocessed implementation file is one large file that contains every included header, and of course the headers included by those headers, and so on. For code that uses libraries such as the standard library, Boost, Qt or something similar, the result will often be huge.
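
As a small illustration of how the directives are resolved before the compiler ever sees the code, consider the following sketch. BUFFER_SIZE and buffer_size() are hypothetical names chosen for this example only.

```cpp
#define BUFFER_SIZE 16   // hypothetical macro, replaced textually by the preprocessor

#ifdef BUFFER_SIZE
// This branch is kept. After preprocessing, the compiler sees: return 16;
int buffer_size() { return BUFFER_SIZE; }
#else
// This branch is dropped by the preprocessor and never reaches the compiler.
int buffer_size() { return 0; }
#endif
```

With GCC, running g++ -E on a file prints the preprocessed result instead of compiling it, which is a handy way to look at what the compiler actually receives.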

The second step is the compilation. This huge mess - also called the translation unit - will now be run through the compiler. Yes, the compiler only compiles one file. And this is not the .cpp file you passed on the command line or added to the project. It is a file generated by the preprocessor that does not even remotely resemble your original .cpp file (except for the very bottom part). Most notably, you will not find a single preprocessor directive (like #define) in this file, since they have all been processed by the preprocessor already.

The result of the compilation is the object file. The object file is not human-readable and does not contain source code anymore. Instead, it contains so-called symbols and their content, which might be variables, constants, functions and virtual function tables. This is why you usually have one object file for each implementation file in your project (Visual Studio puts them into the Release and Debug directories; GCC puts them into the current working directory if you do not specify an output file explicitly). You don't have object files for headers, because they are not compiled individually. They are only "glued" on top of your implementation files.

At this point it is important to know that when the compiler creates a function call, it does not put the absolute address of the function into the generated code. Instead, it refers to the symbol of that function. In C, this is simply the function's name. In C++, it's the function's name mangled with some meta-information (such as the parameter types) in a way that depends on the compiler. It's very similar with global variables and virtual function tables: they are not referenced by an absolute address either, but via their symbol name. This means that the object files created by the compiler do not contain machine code that is ready to be executed. The symbols have to be replaced with their addresses first.
And this is where the linker jumps in.
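
To make the difference between C and C++ symbol names a bit more tangible, here is a small sketch. The function names are hypothetical, and the mangled names in the comments assume the Itanium C++ ABI as used by GCC and Clang on Linux; other compilers (Visual C++, for example) use a different scheme.

```cpp
// extern "C" switches off name mangling: the symbol is just "add_c".
extern "C" int add_c(int a, int b) { return a + b; }

// Regular C++ functions get mangled names that encode the parameter types,
// which is what makes overloading possible at the symbol level.
int add_cpp(int a, int b) { return a + b; }            // symbol: _Z7add_cppii
double add_cpp(double a, double b) { return a + b; }   // symbol: _Z7add_cppdd
```

On systems that ship the nm tool, running it on the resulting object file lists these symbols.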

The third and last step is the linking. Just like the compiler, the linker does not know about header files. And it does not know about implementation files, either. It only knows the object files, and its job is to link all object files together into one executable or library that can finally be executed by the hardware. It does this by combining all object files into one, and whenever a symbol is referenced (this might be in a function call or when accessing a global variable, for example), it replaces the reference with the symbol's address. That is why you sometimes get errors like "Symbol xyz is already defined in foo.obj". This means that you defined a function or variable in two different translation units, which generates the same symbol twice. This is not allowed by the One Definition Rule, and the linker would not know which of them it should use.
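
The split between "referring to a symbol" and "defining a symbol" can be sketched in a single file. The names answer() and twice_answer() are hypothetical; in a real project the definition at the bottom could just as well live in a different .cpp file, and therefore in a different object file.

```cpp
// Declaration only: enough for the compiler to emit a call to the symbol.
int answer();

// The generated code for this function refers to the symbol of answer(),
// not to a fixed address.
int twice_answer() { return 2 * answer(); }

// Definition: this is where the symbol gets its content. If it lived in
// another object file, the linker would connect the call above to it.
int answer() { return 21; }
```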

Multiple definitions

There are two ways you can cause multiple definitions of the same symbol: You accidentally define the same global variable or function at two distinct places (e.g. one time in the header and a second time in the implementation file) or you include a header that defines a global variable or function in two separate translation units. The first issue is handled by the compiler. As this double definition is in but one translation unit, the compiler notices that the identifier in question is already declared (and probably defined) and bails out. Here is an example of a double definition of a function foo():

test.hpp
void foo() { } // Note the empty function body!

test.cpp
#include "test.hpp"

void foo() // Error! foo() already has a body in test.hpp
{
   /* Do something */
}

This can sometimes happen when you move a function's body into the header and forget to remove the original definition from the implementation file.

The second issue is a bit more tricky.
Remember that a translation unit is the implementation file prepended by all included headers. So we take the same header as before, which defines the function void foo() { } directly in the header (note that it is not declared with the inline keyword). But this time, it is included from two files in the project, one named test.cpp and the other named test2.cpp:

test.hpp
void foo() { }

test.cpp
#include "test.hpp"

// The actual content of the program is irrelevant
int main ()
{
}

test2.cpp
#include "test.hpp"

// Here are some functions defined in test2.cpp

When you compile this project, you will get an error message along the lines of

multiple definition of `foo()'

This is because both translation units define a symbol for foo(), which is prohibited by the holy One Definition Rule.

Include guards to the... rescue?

So, you might think: "If the function may not be defined twice, let's wrap test.hpp in an include guard!" Try it and see for yourself. It will not make a difference. This is because the preprocessor is run for each implementation file separately, leading to two distinct translation units. When test.cpp is preprocessed, TEST_H is not defined and the header's content is included in the resulting translation unit. Then the resulting code is compiled, leading to a test.o or test.obj, without any error. There is only one declaration and definition of the function in this translation unit, after all. When the second implementation file is preprocessed, the preprocessor starts over, which means that #defines made in the first translation unit are not available in the second or any other preprocessing run. This means that TEST_H is still not defined for test2.cpp, so foo()'s definition is included a second time. It is then compiled into test2.o or test2.obj. foo() is now defined in both test.obj and test2.obj.

The final step is the linking, which should lead to an executable or machine-readable library. But the linker notices that foo() is defined two times, prints the aforementioned error message and aborts its task. Because the linker does not even look at the source code that the object files were compiled from, it does not know where the double definitions occurred. In fact, the linker even works without the source code being available.

This means that include guards do not solve multiple definition errors.
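
What does help is making sure that every symbol is defined in exactly one translation unit, or telling the compiler explicitly that a definition may be repeated. The following is a minimal sketch of the usual options, not the only possible layout; bar() and counter are hypothetical names added for illustration, and both "files" are shown in one block.

```cpp
// test.hpp (sketch) ------------------------------------------------
#ifndef TEST_H
#define TEST_H

// Option 1: only declare the function here and define it in exactly
// one .cpp file.
int bar();

// Option 2: keep the body in the header, but mark it inline. An inline
// function may be defined in several translation units, as long as all
// definitions are identical.
inline int foo() { return 1; }

// Global variables: declare them with extern in the header and define
// them in exactly one .cpp file.
extern int counter;

#endif

// test.cpp (sketch) ------------------------------------------------
// In a real project, #include "test.hpp" would go here.
int bar() { return 2; }   // the one and only definition of bar()
int counter = 0;          // the one and only definition of counter
```

The include guard is still worth keeping in such a header: it protects against the header being pasted twice into the same translation unit, which is the problem it was made for.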

6 comments:

  1. thanks for this article. very helpful for beginners, and those that forgot about this whole process.

  2. Where does the standard say that variables can only be defined once in an entire program? The example you give above works fine for functions, but what if foo were a variable definition?

  3. Hey gred,
    try to put "int blubb;" in a header and include this header in two .cpp (implementation)-files, say one.cpp and two.cpp.
    You will get a "multiple definition of blubb" error message by your linker (NOT compiler!).

    That's because global variable definitions have external linkage. This means that the object files that are generated when one.cpp and two.cpp are compiled (let's name them one.o and two.o, that's what GCC does) both contain a symbol "blubb".
    When the linker then tries to link the blubb variable, it will find it in one.o and two.o, which is not allowed.

    See also:
    http://en.wikipedia.org/wiki/One_Definition_Rule

  4. Thank you for the article. But to me it seems the article ended prematurely, with the reader (me) wondering how to prevent multiple definitions in practice. What kind of programming habits and practices would prevent such problems? For example, should I always try to put `#include` in header files rather than `cpp` files (or the other way round)?
