The last word in static site generators for me

2022-10-19

This article describes how and why I think that my current solution for generating my website will fulfill all my needs.

Motivation

My experience from more than eight years of (irregular) blogging and maintaining a personal web page has led to some structural patterns my sites have in common. For example, most of the time my personal site contains a navigation bar, some (invariant) pages (i.e. about, teaching, contact), and a blog page (aka an index of posts). For a long time, I've been using static site generators according to my preferred programming language at that time. The latest iteration was hugo (feel free to guess the language). However, using hugo for such a simple site structure is kind of overkill IMO. So, let's try to go simple with basic Unix tools.

There are numerous motivations from other people for why they decided to build their sites with their own (homebrewed) solution. The central aspect for me is that you have full control over the workflow from source to deployment. From experience, it doesn't even matter much what 'project' you want to build. With the same approach you're able to build a thesis, a presentation plus handouts, and a blog. How is that possible, you may ask. These kinds of projects share a common pattern that basically describes a transformation from a given source format (markdown, LaTeX, org, reST) to a target format (PDF, HTML, reactjs, beamer).

What is left is to select the tools that fit the job best. For example, karl.berlin decided on a shell-based approach, whereas technomancy is using a make-based approach to cover the overall orchestration of their tools. For the conversion from one format to another there are even more possible solutions. I personally keep throwing pandoc into the mix when it comes to the conversion between different formats. For generating an atom feed or processing images there is also a vast amount of tools to choose from (m4, imagemagick, inkscape, tikz). Which brings us to another important point.

You decide how much software bloat is good for you, which includes both the tools and the content. For example, for some projects even pandoc is too heavyweight or not supported on your platform (unlikely); then you might prefer a smaller (but more limited) tool like lowdown when markdown is the single source file type you want to process. Talking about content, there is no magic boilerplate code buried in some HTML partial from an external theme or framework if you don't want to use it. Of course, you are free to link as many external sources (i.e. javascript) as you like or structure your HTML in a "clever" way, no one will stop you. But there are good reasons, especially for low-bandwidth regions, to strip down your page while keeping the relevant information. If you're looking for advice in that direction, aptivate has you covered, which effectively leads to websites of some 250kb or even text-only websites, for great good.

Even if I'm convinced that everybody should shape their own workflow, I would like to share a combination that works best from my experience (i.e. my projects done so far). Most of the time, I'm using a combination of the following tools in one way or another:

make for the overall orchestration
pandoc for converting between source and target formats
awk for content generated on the fly (post index, atom feed)
plain Unix tools (cp, cat) for everything else

Within the remainder of the article I'll describe some aspects of my workflow. Be aware that every design decision comes with a trade-off. What follows is, therefore, neither a silver bullet nor a dogmatic approach blindly following some greater ideas of software design (i.e. KISS, DRY, suckless etc.). It is meant to be a (practical) starting point for your own approach.

A Workflow for Generating my Personal Website

The most generic operation my workflow foresees is to copy a file as-is from the source to the target directory. This kind of "catch all" rule gets executed whenever there is no more specific rule. As an example, think of a CSS stylesheet, a favicon, or even images that are not meant for further processing.

build/%: source/%
    cp $< $@
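
One practical detail, and an assumption about your directory layout rather than part of my rule above: if build/ mirrors nested source directories, the target directory may not exist yet, so the recipe can create it first.

build/%: source/%
    mkdir -p $(dir $@)
    cp $< $@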

A slightly more complex rule actually deals with the content of files. In the case of HTML output, you want to combine the invariant parts of an HTML file (e.g. header, footer) with a changing body part. Assuming you write plain HTML files, the following example uses cat to combine header, footer, and body into one HTML output file.

build/%.html: source/%.html
    cat header.html $< footer.html > $@

In case you need to process the files before combining them into a final result, the former rule can be extended with your preferred processing tool. The following example uses pandoc in standalone mode to produce an HTML file.

build/%.html: source/%.md
    pandoc --standalone --output $@ $<

So far we've looked at files that are "predefined" in terms of content, which means the files get used as-is. But what if we want to produce the content on the fly? This is, for example, the case when we need an index page for all our posts. That is more complex because the processing includes a dynamic aspect. What works well for me is using a simple database (actually a tsv file). Each entry consists of a line with the following fields:

updated directory   title   summary tags

Each line describes one post by defining when it was updated/published, where to find the content, the title, a summary, and (optional) tags. Attentive readers will recognize some relation to the well-known yaml metadata blocks in markdown files. Indeed, that file is a consolidated list of metadata. In the past I used to store the meta information of each post close to the content itself (i.e. in the same directory), but that introduced another extraction and consolidation step to produce such a database for all my posts. An overhead I'm keen to ditch, even if I now need to maintain the file by hand. That file-based database has a nice bonus: I can use Unix tools for reading and processing the entries. Now, to create a basic listing of all my posts in HTML, I'm using awk like:

BEGIN { FS = "\t"; print "<ul>" }
{
    print "<li>" $1 " <a href=\"./" $2 "/index.html\">" $3 "</a></li>"
}
END { print "</ul>" }
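
To make this concrete, assume the script above is saved as an executable file called ul (with a #!/usr/bin/awk -f shebang) next to the Makefile, and posts.tsv contains a tab-separated line like the following (values invented for illustration):

2022-10-19	last-word-ssg	The last word in static site generators	Why make and awk are enough	make awk

Running ./ul posts.tsv then prints something like:

<ul>
<li>2022-10-19 <a href="./last-word-ssg/index.html">The last word in static site generators</a></li>
</ul>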

That list gets created on the fly and is only relevant until the final index page is generated. An ideal case for an intermediate target, letting make delete the file after it has been used.

.INTERMEDIATE: build/posts.html
build/posts.html: posts.tsv
    ./ul $< > $@
    
build/blog/index.html: build/posts.html
    cat header.html $< footer.html > $@

You could argue that this is overhead: why not write the posts index page by hand? A valid point, and my answer is that I want to use the same information (i.e. a single point of truth) about my posts at another place in the processing, namely for creating an atom feed.

From my experience (i.e. writing my own blog engine in golang), creating an rss feed is one of the most vital parts of a blog, for two reasons. First, RSS is still a way to syndicate information over the wire and let the reader decide how to read it (e.g. terminal newsreader, app, web etc.). Second, it does not play nicely with the tools used so far for processing HTML files, even if the standards (XML and HTML) are close to each other. Let me elaborate a little bit on the latter point with an example. Basically, what you get for a single post are meta infos like title, date, and summary. Wouldn't it be great if we could reuse this information as often as we need during our processing while keeping our set of tools constant? For example, using pandoc to generate HTML files and an atom feed out of a list of posts. In several attempts I collected all the meta information from the sources, polished it, and put it together in one big file that could be "understood" by pandoc to create a (not supported) XML file (i.e. I "convinced" pandoc that the output is HTML). So, I ended up with a hackish approach that utilizes pandoc's metadata capabilities in a way they were not designed for. I'm pretty sure that sooner or later I'll forget about my "clever" trick, and maintenance/debugging becomes a headache.

So my solution tries to build a bridge between maintainability (i.e. I still understand what the thing is doing even if I look at it 5 years from now) and overhead (i.e. do you really need xsltproc). We already have the database with the consolidated meta information for all posts in place. What is missing so far is a way to translate that information to XML. You can probably guess it already: I'm using awk for that step as well, because I already introduced it for the sake of creating an index page for my posts. Similar to the ul application, feed reads posts.tsv and generates an atom.xml file. Here I use an (XML) feed template, rather than HTML like the ul application, and fill in the remaining information from the database. Which brings us to the following target for producing a feed.

build/atom.xml: posts.tsv
    ./feed $< > $@
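
The feed script itself is not shown here; what follows is a much-simplified sketch of what it could look like, assuming the posts.tsv fields from above, with placeholder values for the feed title and base URL (the real script fills an XML template rather than printing everything directly):

#!/usr/bin/awk -f
# Sketch of a "feed" script: read posts.tsv, print an Atom feed to stdout.
# Assumes the file is sorted newest first and has no header row.
# Feed title and base URL are placeholders; a complete Atom feed also
# needs an <author> element and RFC 3339 timestamps.
BEGIN {
    FS = "\t"
    print "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
    print "<feed xmlns=\"http://www.w3.org/2005/Atom\">"
    print "<title>my blog</title>"
    print "<id>https://example.org/</id>"
    print "<link href=\"https://example.org/atom.xml\" rel=\"self\"/>"
}
NR == 1 { print "<updated>" $1 "</updated>" }  # feed-level timestamp from newest post
{
    print "<entry>"
    print "<title>" $3 "</title>"
    print "<id>https://example.org/" $2 "/</id>"
    print "<link href=\"https://example.org/" $2 "/index.html\"/>"
    print "<updated>" $1 "</updated>"
    print "<summary>" $4 "</summary>"
    print "</entry>"
}
END { print "</feed>" }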

With that approach I'm introducing no additional tool for generating an atom feed into my processing workflow. Since awk is a standard Unix tool, data driven, and lightweight, it is IMO a nice deal. Furthermore, I can keep awk and exchange the data depending on the use case. For example, I'm not limited to HTML or XML output; if I wanted to produce an offline version of my posts (e.g. a low-tech book) I could use the same information and let awk produce LaTeX formatted output.
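
As a small illustration of that flexibility (a hypothetical sketch, not something I actually use), the same posts.tsv could be turned into a LaTeX itemize list with a few lines of awk:

# Print one \item per post: date and title, as a LaTeX list.
BEGIN { FS = "\t"; print "\\begin{itemize}" }
{ print "\\item " $1 " -- " $3 }
END { print "\\end{itemize}" }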

There are some further targets in my Makefile that I don't want to explain in great detail because I guess you already get the idea. A roundup of further targets:

Summary

It is ironic that whenever I plan to write an article about my static site generator endeavours, I procrastinate by re-constructing my static site generator. As you know,

the only constant (in software) is change

the question is: is that really the last word in static site generators for me? Yes and no. For my current need of maintaining my little corner of the internet it is enough, and I can finally move forward with other topics. But there are so many other interesting approaches out there, it would be a pity not to try them and see what their authors had in mind. For example, there is pollen lang, org publish, karl.berlin's approach with shell heredocs that also inspires me for my own workflow, chrisman with a kind of m4gic, ssg, and many others. And as a last word about bloat, this is just a starting point. See make as an "entry point" to your project that can be combined with continuous integration, replaced by nix, or deployed via docker. The possibilities are endless, but how much effort do you want to invest in generating your own static website?

That's all for now.