Digital Cabin


Executable C source files

by dweller - 2024-07-18

#programming #dev #c #sh #unix


(on UNIX-like systems)

$ tail -6 magic.c
extern int write(int, const void*, unsigned long long int);
int main(void)
{
    write(1, "Hell world!\n", 12ULL);
    return 0;
}
$ chmod +x magic.c
$ ./magic.c
Hell world!
$

So some time ago I came up with a way to build and run C source files as if they were shell or Python scripts with a shebang (#!). I called it “celf” as in self, as in self-building C files. It also produces ELF executables on modern UNIX-like platforms.

Questionable play on words aside, it might not be immediately obvious how they work to someone not familiar with POSIX sh, the C pre-processor and the peculiarities of Linux/UNIX executable loading. As the README says:

== How does it work?

It abuses C preprocessor, shell and old UNIX heritage to run shell scripts.

But wait. Why, though?

I came up with this scheme while just messing around, so it’s probably not for you.

Don’t you want to just chmod +x main.c and ./main.c? No? Just me? Okay. Well that’s why.

I primarily use this idea for my “playground”/scratch space projects. Sometimes you just want to slap something together to test an idea. Sure, I can keep manually writing cc ... main.c -o ..., or hit the up arrow a bunch. Heck, maybe I’ll even do history | grep cc or something and !xxx. But you know what, that doesn’t scale at all. Some of the playground projects need a lot of flags, not to mention I like to turn on a lot of the warnings, and all those flags add up. “Just use a Makefile”, some would say, but most of the playground projects are just one file (although this scheme does work with multi-file projects, see below). Besides, I hate adding dependencies, even build dependencies; they add up. Say I want to bootstrap my own minimal POSIX system, or just a Linux distro. Now I have to compile GNU Make and all it depends on, or BSD Make, which is better but still. These days I usually just write a ./build.sh shell script for most of my projects, but in the usual case that leaves me with 2 files per project.

$ cd playground
$ ls
proj1.c proj1.sh proj2.c proj2.sh proj3.c proj3.sh proj4_1.c proj4.c proj4.sh proj5/

“Put them in separate dirs”. Uhm, sure I do so for multi-file projects, but that’s just hiding the problem.

The real problem

The real problem is that C does not have a builtin build system. Now I don’t mean something like Rust’s “cargo” or Go’s “go build”. God forbid! Those are, in my humble opinion, horrible things. They often hide from the programmer what is going on with the toolchain and how it is configured, often producing surprising results. And I don’t want to be surprised with a dynamically linked executable when a static one was promised (looking at you, Go).

What I envision is something much simpler: a way to tell the compiler how and what to compile in the programming language itself. For an example of this, see Jonathan Blow’s unnamed language known as “Jai”. Now, I am not a member of the closed beta, and my memory may fail me, so correct me via email if I am wrong, but if I recall correctly, you write some Jai code in your first file that sets the compiler’s options, adds files to the project and can even run arbitrary Jai code at build time.

I am not sure about the last step, although I have nothing against it in principle. But the core idea of not leaving the programming language to describe what to build in it is very appealing to me.

What’s funny is that Microsoft’s MSVC has some non-standard #pragma’s (do note that MS loves to kill its links, so the link might be dead in the future) that let you include a library or set some linker flags straight in the source code. I think that’s a step in the right direction.

This will make more sense after I describe how I build C programs in general, but first…

How does it work?

Let’s examine the whole magic.c file from the example at the beginning of this post:

#if 0
    cc "$0" && exec ./a.out
#endif

extern int write(int, const void*, unsigned long long int);
int main(void)
{
    write(1, "Hell world!\n", 12ULL);
    return 0;
}

The whole magic is in the first 3 lines. Let’s go step by step:

  1. Abuse the fact that # is used in all C pre-processor directives, and is a comment in POSIX sh. So sh will ignore #if 0 and #endif and C will ignore everything between those;
  2. Since we cannot use a shebang, as it is not a valid pre-processor directive, we simply assume we will run in the shell and write a shell script between the #if 0 and #endif. At this point we could just run $ sh magic.c, but we don’t even need to do that;
  3. In ye olden days, before the UNIX kernel supported the #! magic sequence, the execve(2) syscall would fail (with ENOEXEC) when asked to load a file whose format it didn’t “know” how to execute (the formats it did know being the likes of a.out, COFF, and now ELF). People back then wanted to just execute scripts like we do now, as in run $ ./script instead of $ sh script, and so sh developers added a hack. If the exec syscall failed, they tried to interpret the file as a shell script, simply assuming that it was one. I am not sure if later they used some sort of heuristic to determine whether the file is a valid shell script or not, but what matters to us is that this hack became standardized in POSIX. Hence any POSIX-compliant sh implementation like ash, or even those that extend it, like bash or zsh, retains this behaviour. If a file has the x bit set, and you try to run it from a POSIX-compliant shell, it will try to exec it, and upon failure try to interpret it as a script;
  4. Summing up, we write a valid C source file that is at the same time a valid POSIX shell file. We gate the shell script inside C’s #if 0 pre-processor directive, and we keep the shell from going on to interpret the C code (well, trying to) by explicitly terminating the script, either with exit or, as in the example above, by exec-ing into the built executable.

And that’s it! For those who aren’t much into shell scripting, $0 is the 0th argument to the shell script (or rather, to any program); here it is the path the script was invoked with, in our case the C source file. So we run the C compiler cc on our source file $0, and (&&) if it returns with exit code 0 (success) we exec the resulting binary, which by default is “a.out” since we didn’t specify a name with cc source -o out. exec will not fork(2) from the currently executing script’s shell, but directly replaces the executable image of the running shell with the file passed as its argument. So it runs our program and exits, hence no need for a separate exit at the end of our shell script part. (One caveat: if the compilation fails, exec never runs and the shell carries on into the C code, so a more careful script adds an explicit exit.)

And so the mystery is revealed. Not much of a mystery, just a bunch of hacks.

The core “insight” here is that we can run arbitrary shell scripts that are stored in a C file that was set as executable. And so I can use it to put any build script I want there. This does not get me to my desired “one language” system, but it does get me the next best thing: everything is in the same file.

“celf” is one such build script. It’s a small script, and as I say in the README, I do encourage you to read the script itself. It has some basic things built in, like timing the build process, not rebuilding if files didn’t change, passing and setting debug/release flags, and running the resulting executable. Really it’s just an example of the technique, not a “product”. You can call make from there if you really want to, although it somewhat defeats the purpose.

So our magic.c using celf would look like:

#if 0
    CFLAGS="-Wall -Wextra -pedantic"
    . build.sh
#endif

extern int write(int, const void*, unsigned long long int);
int main(void)
{
    write(1, "Hell world!\n", 12ULL);
    return 0;
}

And produce:

$ ./magic.c
 --- cc time: .028220280 sec
 --- debug=yes; static=yes
 --- Program output:

Hell world!
$ ./magic.c
 --- rebuild not necessary
 --- Program output:

Hell world!

For me, the only negative is the non-portable nature of this, as I cannot do something like this on Microsoft Windows. (I would also have to use different flags for cl.exe, the MSVC compiler, anyway.) So I still require different build scripts/systems for non-POSIX platforms.

But that’s stupid.

Yes. But I like stupid.

How I build my C projects

For most small to medium projects, having an incremental build system (hell, any build system) is way overkill. Not only is it an unnecessary build dependency, but it also encourages the complexity demon (see: https://grugbrain.dev/). So for tiny projects you can just call:

$ cc myprog.c -o myprog

For small to medium projects I prefer a Single Compilation Unit build. You might have heard it called a “Unity” build (no relation to the mediocre game engine). If you are unaware of them, the gist is that you collect all your sources into one compilation unit (think one .c file) and just compile that. How is it better? Well, it is usually faster to build and produces better code. It can and will be slower to build for large projects, but see the next paragraph about that. The resulting code is usually smaller and faster because the compiler can see the whole source, and having the full context lets it use more “aggressive” optimizations.

I don’t often do large projects. But they, too, can probably be built using SCU as long as they don’t use excessive source dependencies. Or you can break a large project into logical units and SCU those; you will still need a separate linking stage, but you won’t have to recompile the whole project on every change.

This really is a tooling issue, as the compilers for Jonathan Blow’s “Jai” and Google’s Carbon (or so they claim, I didn’t look into the latter) show incredible speed, sophisticated features and modern optimizations.

So, typically, these days, I have a file called build.c that #include’s all the other *.c and *.h files, #define’s global constants, and, if using it, calls celf. Otherwise I use it as a single input file to compile in a separate build shell or batch script.

#if 0
    OUT=my_program
    CFLAGS="-Wall -Wextra -Wpedantic -Wno-long-long -Wformat=2 -Wfloat-equal -Wshadow \
        -std=c89 -fwrapv -fwhole-program \
        -pipe \
    "
    DBGFLAGS="-g3 -Og -DDEBUG=1 -fsanitize=undefined"
    RELFLAGS="-O2"
    . build.sh
#endif

#define _DEFAULT_SOURCE
#define _POSIX_C_SOURCE 200809L

#define PROGNAME   "my_program"
#define PROGVER_MAJ 1
#define PROGVER_MIN 0
#define PROGVER_FIX 0
#define PROGVER_REL "rel" /* release */

#define  EXTERNAL_LIB_IMPL
#include "lib/external.h"

#include "base/common.c"
#include "base/special.c"

#include "unit/a.c"
#include "unit/b.c"

#include "main.c"

This lets me have everything in one place, which I really like. I just go to build.c and change the things I need. It’s all there, in one place. I don’t need to hunt and decipher Makefiles in each folder (not that I do that when I do use Makefiles.) Nor do I have to, God forbid, deal with cmake, ninjas, yarns, ants and whatever else people came up with to create more problems for the rest of us.

I also switched to (almost) exclusively static builds, because at some point someone has to notice that building containers (especially things like AppImages, snaps and Flatpaks) is just a worse way to do a statically linked executable. Like, I get the idea of bundling the configuration files and maybe resources. But if you claim dynamic libraries are good because you can update them, but then you version them, and then you pack them into a static container… My friend, reexamine your life choices. But this is a separate rant.

You are not forced to do as I do, you can add linker flags and pass -lname to dynamically link with your libraries. And of course, you don’t need to use Unity/SCU build, just gather your sources with find and call the compiler on them. Or just don’t use this at all, a single BSD Makefile is probably fine.

exit

In closing, I hope this was at least interesting. It would be even easier if the C pre-processor ignored #! so I could just write ‘#!/usr/local/bin/celf’ or something. But really we just need a new language that fits the niche that C has, but modernized. I don’t think Rust is that. It’s more of a C++ contender, and don’t get me started on Go. It has GC, and that’s all one needs to know to see that it’s in a different world. Zig, Odin, maybe even Nim are all trying, and I’ve yet to try all of them. But I am not sure if any of them have something like this in mind, except Zig and of course the unreleased Jai.

Perhaps I should jump on the bandwagon and write my own language? No, that’d be potentially useful! (Probably not though.) And I’m all about that useless stuff, like wiring a CPU in Logisim! ;) Although another toylang would be fun to make one day. I’ve been reading about FORTH you know :P

