Multi-file Hello, world

Imagine that Jim (after a few years of learning and coding; fortunately, he learns quickly) has evolved Hello, world so that it can produce much cooler greetings. Judge for yourself:

$ hello

********************
** Hello, world ! **
********************

$ _

He was so amazed by the look of this new banner that he wanted to use it in his other programs. He decided to split the banner code out of the rest of the greeting program, and ended up with these two files:

File hello.c:

#include "banner.c"

int main(void)
{
  NewLine();
  GenerateBanner("Hello, world !");
  NewLine();
  return(0);
}

File banner.c:

#include <stdio.h>
#include <string.h>

void NewLine(void)
{
  printf("\n");
}

void GenerateStars(int Count)
{
  char BanStr[11];

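  /* Build a string of ten stars and print it as often as it fits. */
  memset(BanStr,'*',10);
  BanStr[10]='\0';
  while (Count>10) {
    printf("%s",BanStr);
    Count-=10;
  }
  /* Cut the string down to the remaining stars and print them. */
  BanStr[Count]='\0';
  printf("%s",BanStr);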
}

void GenerateBanner(char *Str)
{
  int Len;

  Len=strlen(Str);
  GenerateStars(Len+6);
  printf("\n** %s **\n",Str);
  GenerateStars(Len+6);
  printf("\n");
}

This is the way he splits code all the time. It works ... but after a while he discovered a problem with compilation speed. As the program he was developing grew, so did the time needed to compile it. In our example this is no problem, but the other program he is working on now consists of several tens of files with a total size of 100 KB, and its compilation takes a quarter of an hour. And the last straw: whenever he changes anything in that program, no matter how big or small the change is, he must wait that quarter of an hour again to see his changes in action. GCC is especially sensitive to these speed issues.

What's the matter? It has to do with the way the C compiler handles the files Jim gives it to process. The files do not in fact go directly into the C compiler. They are first processed by something called the C preprocessor. The #include lines in the program are directives interpreted by the preprocessor. In particular, #include tells the preprocessor to insert the content of the specified file at the point where the directive appears. Since Jim's source files include one another this way, the preprocessor combines all the tens of files of the program into one large file, which is then passed to the C compiler itself. So the compiler really compiles all 100 KB of sources every time a small change is made to any of them.
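To make the effect of #include concrete, here is a sketch of what the compiler proper receives once the preprocessor has done its pasting. The two files (square.c and caller.c) and the functions Square() and SumOfSquares() are hypothetical and serve only as an illustration; the assumption is that caller.c starts with #include "square.c", just as hello.c includes banner.c.

```c
/* The single text the compiler proper sees for the hypothetical
 * caller.c, after  #include "square.c"  has been expanded. */

/* ---- text pasted in from square.c ---- */
int Square(int X)
{
  return X * X;
}
/* ---- end of square.c; the rest is caller.c itself ---- */

int SumOfSquares(int A, int B)
{
  /* Square() is compiled again every time caller.c is compiled,
   * because its source text is now part of this file. */
  return Square(A) + Square(B);
}
```

With gcc, the -E option stops after the preprocessing step and prints the combined text, which is a handy way to see how much source the compiler is really being fed.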

Modularization

How can we cope with the speed issue described above? We have to modularize our big program.

The source code of a big program is separated into modules that are compiled separately. The compilation of each module produces a file called a relocatable object module. When every module has been compiled, the linker takes all the relocatable object modules produced by the compiler and assembles the resulting executable file from them.

The relocatable object module files are not removed from the hard disk once the executable is produced, so they can be reused later. This has the advantage that when changes are made to one of the modules, only that module must be recompiled. The remaining, unchanged modules already have up-to-date relocatable object module files, so the programmer does not have to waste time recompiling them. Linking the program takes much less time than compiling it, so this modular approach can greatly reduce the time spent rebuilding after small changes, even when working with very large software systems.

The modules reference each other; each module clearly states which symbols it defines itself and which need to be supplied by other modules.
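In C, this bookkeeping of symbols could look like the following sketch of a single module. The names FrameWidth(), BannerWidth() and BannerHeight() are made up for this illustration and do not appear in Jim's program. A function marked static stays private to the module, a plain definition is offered to other modules, and a bare declaration says that the symbol has to come from somewhere else.

```c
#include <string.h>

/* Private: 'static' gives the function internal linkage, so no other
 * module can reference it and the linker never has to resolve it. */
static int FrameWidth(const char *Str)
{
  return (int)strlen(Str) + 6;   /* "** " + text + " **" */
}

/* Exported: a plain definition has external linkage. Other modules
 * may call it once they have seen a matching declaration. */
int BannerWidth(const char *Str)
{
  return FrameWidth(Str);
}

/* Imported: only a declaration, no body. This module states that the
 * symbol exists; the linker must find its definition in another
 * module as soon as somebody actually uses it. */
extern int BannerHeight(void);
```

For the string "Hello, world !" the hypothetical BannerWidth() returns 20, which matches the twenty stars in the banner shown at the beginning.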

Modularization in C

So Jim is about to modularize his hello program. As said above, in each module he has to state which symbols are imported from other modules. To keep things simple and neatly ordered, he decided to place the declarations of the NewLine() and GenerateBanner() calls provided by the banner module into banner.h, and changed the #include in hello.c so that it no longer feeds the whole banner module into that file but only those small declarations. So he ended up with three files instead of two:

File hello.c:

#include "banner.h"

int main(void)
{
  NewLine();
  GenerateBanner("Hello, world !");
  NewLine();
  return(0);
}

File banner.h:

void NewLine(void);
void GenerateBanner(char *Str);

File banner.c:

#include <stdio.h>
#include <string.h>

void NewLine(void)
{
  printf("\n");
}

void GenerateStars(int Count)
{
  char BanStr[11];

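  /* Build a string of ten stars and print it as often as it fits. */
  memset(BanStr,'*',10);
  BanStr[10]='\0';
  while (Count>10) {
    printf("%s",BanStr);
    Count-=10;
  }
  /* Cut the string down to the remaining stars and print them. */
  BanStr[Count]='\0';
  printf("%s",BanStr);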
}

void GenerateBanner(char *Str)
{
  int Len;

  Len=strlen(Str);
  GenerateStars(Len+6);
  printf("\n** %s **\n",Str);
  GenerateStars(Len+6);
  printf("\n");
}
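Two common refinements, which Jim's version above does not use, are worth a short sketch: an include guard in the header, and having banner.c include its own banner.h. The guard macro name BANNER_H is an assumption (any unique name would do). To keep the sketch in one compilable piece, the header text is simply written above the start of banner.c, which is what the preprocessor would produce for #include "banner.h".

```c
/* ---- banner.h, sketched with an include guard ---- */
#ifndef BANNER_H              /* skip the body if it was already seen */
#define BANNER_H

void NewLine(void);
void GenerateBanner(char *Str);

#endif /* BANNER_H */

/* ---- banner.c would start with:  #include "banner.h"  ----
 * With the declarations above in view, the compiler can report an
 * error if a definition below does not match what the header
 * promises to the other modules. */
#include <stdio.h>

void NewLine(void)
{
  printf("\n");
}
```

The guard makes it harmless to include the header more than once in the same file, and including the header in banner.c keeps the declarations and the definitions from silently drifting apart.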

And now the compilation gets more complex. We cannot compile the code with the same single command anymore:

$ gcc hello.c
/home/jozef/tmp/ccsd3dvE.o(.text+0x12): In function 'main':
: undefined reference to 'NewLine'
/home/jozef/tmp/ccsd3dvE.o(.text+0x1e): In function 'main':
: undefined reference to 'GenerateBanner'
/home/jozef/tmp/ccsd3dvE.o(.text+0x23): In function 'main':
: undefined reference to 'NewLine'
collect2: ld returned 1 exit status
$ _

One way to fix this is to list all the modules on the compiler's command line. With gcc this works:

$ gcc hello.c banner.c
$ _

But not every compiler understands this trick:

C:\PROGS> cc hello.c banner.c
CC: Too many arguments

C:\PROGS> _

And even if it did, the operating system itself may have trouble when there are too many modules. For example, this happens when I try to compile SYSLIB by hand in MS-DOS:

C:\PROGS> gcc pkgsyms.c scanfile.c scandop.c scanc.c scanpkg.c pkgmods.c exparr.
c pkgitems.c pkglist.c pkgtrans.c pkgerrs.c pkghdrs.c pack
(beeeeeep), (beeeeeeep), (beeeeep) ...

Finally, the problem with the compilation time is still there. The compiler still recompiles both modules even if only one is changed.

The other way is to compile everything separately and then link all the relocatable module files together using the linker:

C:\PROGS> cc hello.c

C:\PROGS> cc banner.c

C:\PROGS> link hello.obj banner.obj

C:\PROGS> hello

********************
** Hello, world ! **
********************

C:\PROGS> _

With gcc it is somewhat more complex. First we must tell it, using the -c option, that we only want to compile the modules (unlike in single-file compilation, gcc now names the relocatable object modules after their respective sources, so we don't have to rename the result after each compilation). The second question is how to get the pieces together when there is no linker in GNU/Linux (1). The answer is to pass the object module files to gcc just like source files:

$ gcc -c hello.c
$ gcc -c banner.c
$ gcc hello.o banner.o
$ _

When Jim later changes only the banner.c file, he can omit the command that compiles hello.c and enter just the two remaining commands to produce a correct result.

The conclusion

Now we can see why the extra linking step is not as bad an idea as it seems at first glance. The precompiled relocatable modules can save us a great amount of time when we make only small changes to the program. But now we need to figure out ourselves what was changed and thus requires recompilation. And if we guess wrong, we can end up with "ghosts of bugs": manifestations of bugs that are already fixed in the source but live on in stale object files.

Jim is going to be upset about this chore. Is there a solution to this problem? The answer is "yes", and we will look at it in the next story.