I belong to those people who, in their spare time, look at binary files with text-file viewers such as the one built into Volkov Commander. That is how I noticed that there is something wrong with them: almost all the binaries I looked at were bigger than they really needed to be. So I started a little research to find out for myself which overheads are present in binaries and how much space they take.
This is the most visible type of overhead. Long runs of NULL characters are present in almost any binary that was not made by myself and was not processed by one of the binary compression programs. It seems that many old compilers generate data sections in the middle of the program, followed by further code. These data sections must therefore appear in the program's image in the binary file even though they contain no data that is useful to the program loader. Usually they show up as long sequences of NULLs.
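You can measure this kind of hole yourself. The following is a minimal sketch (not taken from any particular tool) that reports how many NULL bytes a file contains and the longest unbroken run of them; try it on a few executables and compare the numbers with the file size:

/* Minimal sketch: count NULL bytes and the longest run of them in a file.
   The file name is taken from the command line; any executable will do. */
#include <stdio.h>

int main(int argc, char **argv)
{
    FILE *f;
    int c;
    long run = 0, longest = 0, total = 0;

    if (argc < 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    f = fopen(argv[1], "rb");
    if (f == NULL) {
        perror(argv[1]);
        return 1;
    }
    while ((c = getc(f)) != EOF) {
        if (c == 0) {
            run++;
            total++;
            if (run > longest)
                longest = run;
        } else {
            run = 0;
        }
    }
    fclose(f);
    printf("NULL bytes: %ld, longest run: %ld\n", total, longest);
    return 0;
}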
Another type of "hole" is padding of various kinds. Some padding exists purely to ensure that a certain data structure or section in the file sits at a position satisfying a certain condition. These cases can easily be solved by designing the format so that it does not require such conditions. Other padding is required by the code itself: some code needs to be aligned, otherwise it executes slowly or not at all. It is difficult, if not impossible, to remove this kind of padding overhead. But many times the speed loss from unaligned code is outweighed by the speed gain coming from the fact that the padding bytes are gone, especially if there are lots of them.
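The same principle can be observed on a much smaller scale inside C structures, where the compiler silently inserts padding bytes so that each field sits at an offset satisfying its alignment condition. A small sketch (the offsets in the comments assume the common case of a 4-byte int aligned to 4 bytes):

/* Padding inserted to satisfy an alignment condition.  On a typical
   compiler with 4-byte ints, value lands at offset 4 and the struct
   grows to 12 bytes, even though only 6 bytes carry data. */
#include <stdio.h>
#include <stddef.h>

struct padded {
    char tag;      /* offset 0                      */
                   /* 3 padding bytes               */
    int  value;    /* offset 4                      */
    char flag;     /* offset 8                      */
                   /* 3 more padding bytes at the end */
};

int main(void)
{
    printf("offset of value: %lu\n", (unsigned long) offsetof(struct padded, value));
    printf("size of struct:  %lu\n", (unsigned long) sizeof(struct padded));
    return 0;
}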
Many people think all the code present in a binary is important. In fact, very few binaries contain nothing but the bare necessities. You would be surprised how many bytes in a typical program binary could be changed to NULLs (or NOPs) without damaging it.
In 1995 I worked with Borland's Turbo Pascal compiler, which implemented an interesting feature: dead code elimination. It compiled each module of the program into a special object file in which each procedure was recorded as an object of its own. When the program was linked, the linker took only those procedures that were referenced (directly or indirectly) by the main program. The rest was left out.
It is strange that this was the only compiler I met with such dead code elimination. All C compilers I have ever tried (including GCC) were unable to eliminate unreferenced procedures and functions from my source code. Each of these compilers treats a module as one large object: when anything from the object is referenced, the whole object is placed into the resulting binary, and no attention is paid to the fact that other pieces of this large object may be unused. Worse, if some of the unreferenced procedures need another module that is never referenced from the live code, that module must also be linked into the program. And finally, many compilers place all the objects named on the command line into the program, regardless of whether any of their symbols are used.
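You can see this for yourself with two trivial modules (the file and function names below are made up for the example). Only used_helper() is ever called, yet when lib.c is compiled into a single object file, a linker that works on whole objects drags unused_helper() into the binary as well:

/* lib.c - one used and one unused routine */
#include <stdio.h>

void used_helper(void)
{
    puts("called from main");
}

void unused_helper(void)
{
    /* never referenced from main.c, but linked in anyway because the
       whole object file is taken as a single unit */
    puts("dead weight");
}

/* main.c - references only used_helper() */
extern void used_helper(void);

int main(void)
{
    used_helper();
    return 0;
}

Looking at the linker map file (or at the symbol table of the finished binary) typically shows unused_helper() sitting there, taking up space.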
In fact there are two types of dead code in binaries. The first type is code that is really dead: it will never be executed because no references (calls or jumps) from the live code lead to it.
The second type is code that seems to be live but is in fact dead, because its execution is bound to conditions that can never be satisfied in the program it resides in. I call such code "semilive": it seems live (at least to the compiler), but it isn't. Since there is no reasonable way to detect that this code is in fact unused, it is always placed into the program.
An example of such a piece of code is the CRLF translation handling in the STDIO library of any ANSI C compiler. If the library is linked into a program that processes binary data and always uses "rb" as the file open mode parameter of fopen(), this CRLF handling code will never be executed in that particular program, because every fopen() call tells the library that no CRLF translation is wanted. The code is nevertheless included in the program, because there is no easy way to tell the compiler that it can safely be left out.
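A minimal sketch of such a program: every file it opens is opened with "rb", so the text-mode translation branch inside the library can never be reached here, yet it is still part of the linked binary.

/* Sketch of a program that processes binary data only.  Every fopen()
   call uses "rb", so the library's CRLF translation code never runs
   in this program, but it is linked in all the same. */
#include <stdio.h>

int main(int argc, char **argv)
{
    FILE *f;
    int c;
    long size = 0;

    if (argc < 2)
        return 1;
    f = fopen(argv[1], "rb");   /* always binary mode */
    if (f == NULL)
        return 1;
    while ((c = getc(f)) != EOF)
        size++;
    fclose(f);
    printf("%ld bytes\n", size);
    return 0;
}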
The third type of overhead code is code that implements unnecessary features. This code is executed in the program, but the result of its execution is never used. Such code is sometimes called "dead assignment code".
This type of overhead code is especially painful, because it drains speed not only by being unnecessarily loaded but also by being unnecessarily executed (and this second kind of drain cannot be removed by things like disk caches).
An example of such code is the various library initializers. There is no way to tell the compiler that "the Initialize() call only exists to prepare LibFunction()", so the compiler cannot see that in a simple "hello, world" program LibFunction() is never called and there is therefore no need to call Initialize() at the beginning of the program.
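A sketch of the situation, using the hypothetical names from above, might look like this; the initialization work is done even though its only consumer is never called:

/* Hypothetical library illustrating the problem.  Initialize() is
   always called at startup, because nothing tells the compiler that
   it only matters when LibFunction() is actually used. */
static int table[256];          /* state used only by LibFunction() */

void Initialize(void)
{
    int i;
    for (i = 0; i < 256; i++)   /* work done even in "hello, world"  */
        table[i] = i * i;
}

int LibFunction(int x)
{
    return table[x & 255];      /* the only consumer of the table    */
}

int main(void)
{
    Initialize();               /* inserted "just in case"           */
    /* ... a program that never calls LibFunction() ... */
    return 0;
}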
For example, when you write and compile the notoriously famous "Hello, world !" program, the linker places into the resulting binary the code that initializes the memory manager subsystem (the malloc() function family) and the code that prepares the command-line arguments for the program (1). The fact that there are no malloc() calls in our program, and that our program ignores the command-line parameters, is not taken into account by the linker.
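For the record, the whole program is nothing more than this; on a typical compiler you can convince yourself of the above by looking at the linker map file or at the binary's symbol table, where the malloc() family and the argument-handling startup routines show up even though the source never mentions them:

/* The notorious example.  The source uses neither malloc() nor
   argc/argv, yet the startup code linked in front of main() still
   typically initializes the memory manager and builds the argument
   array before main() is entered. */
#include <stdio.h>

int main(void)
{
    printf("Hello, world !\n");
    return 0;
}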