Zygmunt Krynicki

Free software developer. Huawei, ex-Canonical, ex-Linaro, ex-Samsung. I write in C, Go and Python. I usually do system software. Zyga aka Zygoon

Deploying Raspberry Pi as LAVA dispatcher for USB-attached DUTs

Today I will talk about using a Raspberry Pi 4B (8GB) with an SSD as infrastructure for deploying a LAVA dispatcher used for testing USB-attached devices such as the 96Boards Nitrogen board.


This somewhat lengthy post goes into the practical details of setting up a miniature infrastructure and test stack on a single board or a cluster of identical boards. It is separated into two main parts: the physical part with the bare-metal OS (infrastructure) and the software-defined services part built on top (test/payload). You can use it as a story, as a tutorial or (perhaps) as a quick Google search result for a particular problem.

I wanted to approach this as an infrastructure problem, with one layer for managing the hardware and the base OS and another layer for whatever the testing stack needs. This split seems natural in environments with separate testing and infrastructure teams, whose goals differ.

Infrastructure layer

At the infrastructure layer we're using a Raspberry Pi 4B with 8GB of RAM, an up-to-date EEPROM bootloader configured to boot from USB, and a USB-SATA adapter with a low-cost 128GB SATA SSD. SSDs are significantly faster and more robust than micro SD cards.

Ubuntu 20.04 LTS + cloud-init

For the management side operating system I've been using Ubuntu 20.04 LTS. This version is, at the time of this writing, the latest long-term-support release. In the future I plan to upgrade to 22.04 LTS as that may cut one step required from the setup process.

The setup process involves two stages: preparing the hardware itself (assembling everything, updating the boot firmware using Raspbian, wiring everything together) and software setup (putting some initial image on the SSD).

Start with ubuntu-20.04.3-preinstalled-server-arm64+raspi.img.xz. You can copy it to your SSD with dd. For example, assuming that your SSD is at /dev/sdb and the downloaded image is in the current directory: xzcat ubuntu-20.04.3-preinstalled-server-arm64+raspi.img.xz | sudo dd of=/dev/sdb conv=sparse bs=4M. This should take just a moment, since we use conv=sparse to detect and skip writing all-zero blocks.

We don't want to use the default ubuntu user. Instead we want some admin accounts, ssh keys and the like. We can achieve that with cloud-init by preparing a user-data file. Here's an edited (abbreviated) file I've used:

#cloud-config
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Huawei Inc.

users:
  - name: zyga
    gecos: Zygmunt Krynicki
    primary_group: zyga
    groups: users, adm, sudo, lxd
    shell: /bin/bash
    ssh_authorized_keys:
      - "ssh-rsa (edited-out) zyga@hostname"

chpasswd:
  list: [zyga:password]
  expire: true

hostname: pi4-1

ntp:
  enabled: true

timezone: Europe/Warsaw

packages:
  - ssh

package_update: true
package_upgrade: true
package_reboot_if_required: true

This file is extremely basic and has some short-cuts applied. The real file I've used registered the system with our management and our VPN systems. This did bring additional complexity I will talk about later. For the moment you can see that it gives me a single user account, with an authorized key and a fixed password that has to be changed on first boot. There's also a fixed hostname, NTP and time zone configuration, an initial set of packages to install and a request to update everything on first boot. SSH is important since we want to be able to log in and administer the device remotely.

Normally I would not set the password at all but during development it is very useful to be able to login interactively from the console or over the serial port. The cloud-init user-data file can be provided in one of several ways, but one that's low-cost and easy to do is to copy the file to the root directory of the boot partition (which is the first partition in the image). We'll do that shortly but first... some caveats.

In a perfect world that would be all that we need. I sincerely hope that will be true once Ubuntu 22.04 ships, with no more u-boot, a more recent kernel, and improved systemd and cloud-init. For the current Ubuntu 20.04 LTS, there are a few extra steps to cover.

Caveat: mass storage mode vs UAS mode

This doesn't apply to all USB-SATA adapters, but at least with Linux 5.4 and the USB-SATA adapter with vendor:product 152d:0578, I had to add a quirk to force the device to operate in mass-storage mode. UAS (USB Attached SCSI) mode was buggy and the device would never boot.

We can do that by editing the default kernel command line. The boot firmware loads it from cmdline.txt (by default). All I had to do was to append usb-storage.quirks=152d:0578:u to the parameter list there.
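
A minimal sketch of how that append could be scripted, assuming cmdline.txt holds a single line of parameters. The function name append_kernel_arg is made up for illustration; the check for an existing entry makes it safe to run more than once.

```shell
# Append a kernel argument to a single-line cmdline.txt unless it is
# already present. Usage: append_kernel_arg <cmdline.txt> <argument>
append_kernel_arg() {
    file=$1
    arg=$2
    # Read the first line; "|| true" tolerates a missing trailing newline.
    IFS= read -r line < "$file" || true
    case " $line " in
        *" $arg "*) return 0 ;;   # already there, nothing to do
    esac
    # Write the line back with the new argument at the end.
    printf '%s\n' "${line:+$line }$arg" > "$file"
}

# Example, with the boot partition mounted at a hypothetical /mnt/boot:
#   append_kernel_arg /mnt/boot/cmdline.txt usb-storage.quirks=152d:0578:u
```

The trailing u in the quirk asks the usb-storage driver to ignore UAS for that vendor:product pair.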

Caveat: USB boot support

Ubuntu 20.04 uses the following boot chain:

  • System firmware from EEPROM looks for boot media (configurable) and picks SD card (default).
  • The card is searched for a FAT partition with several files, notably start.elf, fixup.dat and config.txt (and several others, those don't matter to us).
  • The config.txt file instructs the firmware to load the kernel from a file specific to the version of the board used, here it would have been uboot_rpi_4.bin.
  • That in turn loads boot.scr and then picks up vmlinuz, de-compresses it in memory, loads initrd.img and the system starts running Linux.

There's only one problem. The version of u-boot used here doesn't support the USB chip used on the board, so we cannot boot from our SSD. Ooops!

Fortunately, the current boot firmware (start.elf and, before it, a copy of bootcode.bin which is read from the on-board EEPROM) is capable of doing that directly. Moreover, it also supports compressed kernels, so another limitation is now lifted.

In general the firmware loads config.txt. On Ubuntu 20.04 that file is set to load two include files (this is processed by start.elf) - those are syscfg.txt and usercfg.txt. I've used the first one to tell the boot firmware to ignore u-boot and load the kernel and initrd directly. Since start.elf has already been loaded from USB mass storage, we have no problems loading all the other files as well. We can use our USB-SATA adapter just fine.

Here's the syscfg.txt I've made. I left the comments so that you can see what various parts are for. You can remove everything but the last three lines, where we instruct the bootloader to load the kernel from vmlinuz. This overrides an earlier definition, in config.txt, which sets kernel= to one of the u-boot images.

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Huawei Inc.


# Enable UART for debugging.
enable_uart=1

# Use debug version of the boot firmware.
start_debug=1

# Enable UART in the 2nd stage bootloader (start.elf).
# This is useful for diagnosing boot problems.
uart_2ndstage=1

# Move bluetooth to slower UART to assist in early boot debugging.
dtoverlay=miniuart-bt

# Ask the firmware to only poll the SD card once. This avoids a bug where udev
# is getting stuck when booting from USB and the SD card is not populated.
dtparam=sd_poll_once

# Load the kernel directly, without looping via u-boot. This is what allows us
# to boot from USB. It is not needed for u-boot assisted SD card boots.
kernel=vmlinuz
initramfs initrd.img followkernel

In my setup I've used a Makefile to program and alter the image. As you can see I'm copying user-data, cmdline.txt and syscfg.txt all in one step. The makefile rule below is edited for brevity.

flash-%: MEDIA_DEVICE ?= $(error set MEDIA_DEVICE= to the path of the block device)
flash-%: BOOT_PART ?= 1
flash-%: IMAGE_XZ ?= ubuntu-20.04.3-preinstalled-server-arm64+raspi.img.xz
flash-%: D := $(shell mktemp -d)
flash-%: %/user-data %/cmdline.txt %/syscfg.txt
	test `id -u` -eq 0
	xzcat $(IMAGE_XZ) | dd of=$(MEDIA_DEVICE) conv=sparse bs=4M
	unshare -m /bin/sh -c "mount $(MEDIA_DEVICE)$(BOOT_PART) $(D) && cp -v $^ $(D)"

Caveat: initial system time

Raspberry Pi 4 is popular and fun but, like all the boards in its lineage, does not come with a battery-powered real-time clock. When you boot the system, it's 1970 all over again. Parts of the OS have been tuned to do better than that by loading timestamps from various places, like a built-in timestamp in systemd (system time must not be earlier than the systemd build time) or support packages that store the time at shutdown and restore it at startup. None of this lets a device that was left in cold storage for a while, and then booted without a network connection, know what time it is. It just doesn't have the hardware to do that.

This makes first boot annoyingly complicated. We want to go to the network and fetch packages. That implies verifying certificates and their associated validity window. I didn't debug this deeply but, at least in Ubuntu 20.04, there's no way to make the first package installation (or the addition of a 3rd party repository) wait until the system has obtained its initial synchronization from network time servers.

What I've found is that, at least the version of systemd used in Ubuntu 20.04 has one way of setting initial time. Systemd will stat /var/lib/systemd/timesync/clock and use the mtime as the earliest valid system time. This is paramount for the case of creating an initial disk image now and quickly booting the system with it, as we can make sure that the system will have so-so time early enough when you start to initiate https connections and need to validate certificates.

This is a kludge. Ideally cloud-init should have a way to force the rest of the commands to wait for NTP time sync, if enabled.

In my helper Makefile I solved it like this:

flash-%: ROOT_PART ?= 2
flash-%: D := $(shell mktemp -d)
    unshare -m /bin/sh -c "mount $(MEDIA_DEVICE)$(ROOT_PART) $(D) && mkdir -v -p $(D)/var/lib/systemd/timesync && touch $(D)/var/lib/systemd/timesync/clock"

The variables don't matter, except for D which, as you can see, is a temporary directory. All that matters is that we touch the aforementioned file on the second partition of the image.
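
Touching the clock file only gives the system a plausible lower bound. A script that runs later on the device could additionally wait for real synchronization before its first HTTPS request. Here is a minimal sketch of such a wait, not something from my actual setup: the function name and the default timeout are assumptions, while timedatectl show -p NTPSynchronized --value is what systemd offers for this query.

```shell
# Poll systemd until it reports NTP synchronization, or give up after a
# timeout. Usage: wait_for_time_sync [timeout-in-seconds]
wait_for_time_sync() {
    timeout=${1:-120}
    waited=0
    while [ "$(timedatectl show -p NTPSynchronized --value)" != "yes" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "time not synchronized after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Example: only touch the network once the clock can be trusted.
#   wait_for_time_sync 300 && apt-get update
```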

First boot

With all of this done and a few goats sacrificed, we can now plug in network and power and boot the system for the first time.

The first boot is critical. Network must be working, stars must align. If things fail here, the device will just sit idle, forever. Nothing will be re-tried. Normally cloud-init is used in the cloud, where network typically works. If something fails in your local environment you have two options:

  • Start over and re-write the disk.
  • Log in interactively, if you can, to run cloud-init clean and reboot.

For debugging you may want to run cloud-init collect-logs and inspect the resulting tarball on your preferred system. It contains crucial information from first boot. Two things to look out for: DNS issues and the exact moment system time jumps from the fake time baked into the image to the actual local time.

Once the system boots this part is done. Good work. You have a solid enough infrastructure to provision the next level of the stack and let your users in.

Optional extras


If you are a Landscape user, you may want to extend your user-data file with a section that looks like this one:

landscape:
  client:
    url: "https://landscape.canonical.com/message-system"
    # NOTE: this cannot use https, don't change it!
    ping_url: "http://landscape.canonical.com/ping"
    data_path: "/var/lib/landscape/client"
    tags: ""
    computer_title: "{{ v1.local-hostname }}"
    account_name: "your landscape account name"

Note the {{ ... }} fragment. This is a jinja template. If you use that you have to put one at the top of the user-data file:

## template: jinja

Use that exact number of hashes and spaces, or the template magic won't kick in.

This will set everything up to register your system with Landscape. Unless you use an on-prem installation the URLs are valid and ready to go. You may want to set an account key as well, if you use one.

You should check if landscape-client.service is enabled. It will work on first boot but after the first reboot, I've seen cases where the service was just inactive. You may want to add systemctl enable landscape-client.service to your runcmd: section.


If you want to use Tailscale to VPN into your infrastructure you will want to add the repository and register with it on the first boot. There are some caveats here as well, sadly. First of all, the Tailscale SSL certificate is from Let's Encrypt and, at least for the Ubuntu 20.04.3 image I've been using, is not accepted as valid until you update your CA certificates. This means that you cannot just apt-get update, as apt will not accept the certificate from Tailscale until you update ca-certificates.

You can use this snippet as a starting point and perhaps manage it via Landscape with package profiles and user scripts. Alternatively you can do it in runcmd directly, if you don't mind running curl | sudo sh style programs.

apt:
  sources:
    tailscale:
      source: deb https://pkgs.tailscale.com/stable/ubuntu focal main
      key: |
        -----BEGIN PGP PUBLIC KEY BLOCK-----

        -----END PGP PUBLIC KEY BLOCK-----

runcmd:
 - tailscale up -authkey your-tailscale-key-here


We have a nice system running but it should be managed as just another server in your fleet. What I've described here is the bare minimum I happen to do. Your site may use other software and stacks to ease remote administration at scale.

Using a Raspberry Pi gives us the advantage of having a low-cost system with various interesting peripherals that often come in handy when doing device testing.

In the next chapter we will look at the next part of the stack, our LAVA dispatcher.

The test/content layer

We now have a nice and mostly vanilla low-cost system. Let's use it for deploying LAVA. After a few iterations with this idea I'm deploying LAVA inside a virtual machine, with USB pass-through offering access to USB-serial adapters and USB-attached 96Boards Nitrogen micro-controller board.

Why like that? Let's break it down:

1) We can destroy and re-provision anything on top of this system without touching the hardware. This is just nice for management, as the hardware may be locked in a lab room somewhere, with controlled access and you may be far away, for instance working from home. This part is non-controversial.

2) We can use the system for other things. In particular it's probably the low-cost aarch64 system you always wanted to try. You can enable it in CI (we'll talk about resource management later) and let your developers compile and test on aarch64 natively.

3) LAVA is tested this way. It likes to have access to a full-system like environment. With a kernel image around and things to use for libguestfs. With udev rules and all the other dirty parts of the plumbing layer that may be needed. This lets your test team set anything up they want and not have to worry about a too-tightly controlled single-process container environment.

4) It works in practice with USB devices. Some requirements are harder, if you need access to GPIO's or other ports you may need additional software to either move the device over entirely or mediate access to a shared resource (e.g. so that only a specific pin can be controlled). That can be done naturally by a privileged and managed service that runs on the host that something in the guest VM or container talks to.

5) Lastly VM vs container. Initially I've used system containers for this but, at least with LAVA and device access, this was cumbersome. While LXD does work admirably well, some of the finer points of being able to talk to udev (which is not present in a system container) are missing. Using a VM is a cheap way to avoid that. At the time of this writing, system containers have better USB hot-plug support but that's only useful if you can use them in the first place. If you have to unplug hardware from your system you may need to reboot the virtual machine for the software to notice that. At least until LXD is improved.

Let's look at how we're going to provide that next layer.


Deploying LXD on Ubuntu is a breeze. It's pre-installed. If you want to create an LXD cluster you can do that too, but in that case it's recommended to set up a snap cohort so that your entire stack sees the same exact version of LXD as it refreshes. You can do that with snap create-cohort lxd on any one system, and then snap refresh lxd --cohort=... with the cohort key printed earlier. Setting up an LXD cluster is well documented and I won't cover it here.
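
The cohort key can be picked out of the create-cohort output instead of being copied by hand. This is a rough sketch under an assumption: that the command prints a YAML-like document with a cohort-key: line, which is what I would expect from snap but which you should verify on your version. The helper name cohort_key is made up.

```shell
# Extract the cohort key from "snap create-cohort" output read on stdin.
# Assumes a line of the form "cohort-key: <key>" somewhere in the output.
cohort_key() {
    awk '$1 == "cohort-key:" { print $2; exit }'
}

# On any one node:
#   key=$(sudo snap create-cohort lxd | cohort_key)
# On every node of the cluster:
#   sudo snap refresh lxd --cohort="$key"
```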

To set up a VM with some sane defaults run this command:

lxc launch ubuntu:20.04 --vm -c limits.memory=4GB -c limits.cpu=2 -c security.secureboot=false

Let's break it down:

  • First we pick ubuntu:20.04 as our guest. You can pick anything you want to use but if you want to use it as a virtual machine, you should really stick to the images: remote where LXD maintainers publish tested images that bundle the LXD management agent. The agent is important for the system to act in a way that is possible to control from the outside.
  • The second argument, --vm, tells LXD to create a virtual machine instead of a container.
  • Next we set the amount of memory and virtual CPUs to present to the guest system.
  • Lastly we ask LXD to disable secure boot. Ubuntu 20.04 aarch64 images apparently don't have the right signatures. I think this is fixed with later versions. If you try 22.04 you may give it a go without that argument.

When experimenting, pass --console as well to instantly attach a virtual console to the system and see what's going on. You can also use that to interact with the virtual EFI boot firmware.

Wait for the system to spin up, lxc list should show the IP address it was assigned. You can jump in with lxc shell (you have to pass the randomly-generated system name as well) and look around. Once you have that working with whatever OS of choice you have, stop the system with lxc stop and let's set up USB pass-through.

USB pass-through

The easiest way to do this interactively is to run lxc config edit. This launches your $EDITOR and lets you just edit whatever you want. In our case we want to edit the empty devices: {} section so that it contains two virtual devices: one for the USB Nitrogen board and one more for the USB FTDI serial adapter. This is what it looks like:

# Device names are free-form labels.
devices:
  ftdi-usb:
    productid: "6001"
    type: usb
    vendorid: "0403"
  nitrogen-usb:
    productid: "0204"
    type: usb
    vendorid: 0d28

Note that I've removed the {} value that devices: was initially set to. You can see that we just tell LXD to pass through two USB devices and specify their product and vendor IDs.

Important: if you have multiple matching devices on your host they will all be forwarded. I've started working on a patch that lets you pick the exact device but it is not yet merged upstream.

Save the file and exit your editor. Re-start the machine with lxc start. Wait for it to boot and run lxc shell to get in. Run lsusb and compare that with the output of the same command running on the host. Success? Almost.
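
The comparison can also be scripted. Below is a small sketch that checks whether lsusb-style output mentions a given vendor:product pair. It assumes the usual "Bus ... Device ...: ID vvvv:pppp ..." line layout, and the function name has_usb_id is invented for this example.

```shell
# Succeed if the lsusb output on stdin lists the given vendor:product ID.
# The ID is the sixth whitespace-separated field of each lsusb line.
has_usb_id() {
    awk -v id="$1" 'tolower($6) == tolower(id) { found = 1 } END { exit !found }'
}

# Example, run on the host for a guest called "$instance":
#   for id in 0403:6001 0d28:0204; do
#       lxc exec "$instance" -- lsusb | has_usb_id "$id" || echo "missing in guest: $id"
#   done
```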

By default, and for good reasons, LXD uses cloud images and those have a kernel tuned for virtual environments. Those don't ship with your USB-to-serial adapter drivers and a lot of other junk not needed when you want to spin those virtual machines up in seconds.

We have to change that. Fortunately, for Ubuntu at least, it's super easy to just install linux-image-generic and reboot. Having done that you can see that uname -a will talk about the generic kernel and that /dev/serial/by-id is populated with nice symbolic links to your devices attached to the host.

Can we automate that? Sure! It's all cloud-init again. This time wrapped in an LXD profile. Let's see what this looks like:

LXD profiles

LXD has a system of profiles which let us extract a piece of configuration from a specific system and apply it to a class of systems. This works with storage, network, devices, limits and pretty much anything else LXD supports.

Let's create a profile for our class of systems. Let's call it lava-dispatcher, since the profile will be applied to all the dispatchers in our fleet. If you've deployed LXD as a cluster, you can define the profile once and spin up many dispatchers, for example one per node. Let's create the profile with lxc profile create lava-dispatcher and define it in a yaml file for our convenience.

While you can use lxc profile edit lava-dispatcher to set things up interactively, the text you will be presented is generated by LXD from the internal database. We want to store our config in git so let's define a file and then load it into the running LXD (cluster).

Here's the file I've prepared:

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Huawei Inc.
config:
  boot.autostart: true
  limits.cpu: 2
  limits.memory: 4GB
  security.secureboot: false
  user.user-data: |
    #cloud-config
    packages:
      # Install the generic kernel for access to various USB drivers.
      - linux-image-generic
    package_update: true
    package_upgrade: true
    package_reboot_if_required: true
# Device names are free-form labels.
devices:
  ftdi-usb:
    productid: "6001"
    type: usb
    vendorid: "0403"
  nitrogen-usb:
    productid: "0204"
    type: usb
    vendorid: 0d28

As you can see it has a few things. At the top we tell LXD to auto-start the instance with this profile. We want to be able to reboot the infrastructure host without having to remember to boot individual payloads. We also work around aarch64 secure-boot problem. We set the limits we've talked about earlier, we also have the devices down at the bottom. In the middle we have another user-data file, embedded inside the LXD configuration (yaml-in-yaml). Note that the | character makes the whole section one large string, without any structure, so you have to be careful as LXD cannot validate the user-data file for you. Here we install the generic kernel and tell the system to update and reboot if necessary.

Let's try it out. Every time you make changes to lava-dispatcher.yaml you can run lxc profile edit lava-dispatcher < lava-dispatcher.yaml to load the profile into the cluster. You cannot do that when there are virtual machines running with that profile enabled, as LXD will refuse to apply changes to this type of systems on the fly.

Let's destroy our initial virtual machine with lxc delete --force instance-name and try to set everything up in one step with lxc launch ubuntu:20.04 --vm --profile default --profile lava-dispatcher. We pass --profile twice since the default profile defines networking and storage and our lava-dispatcher profile defines just the details that make the instance into a valid host for a future LAVA dispatcher.

Wait for the instance to spin up and explore it interactively. It should be ready to go but setup will be considerably longer, since it will update lots of packages and reboot. Just give it time. You can observe the console with lxc console if you want to. Remember that to quit the console you have to press ctrl+a q.

From here on you can finish automating your software deployment, connect to ansible, set up users or do anything else that makes sense.

Docker CE

Since LAVA uses Docker we may want to use a more up-to-date version of the Docker package. Again, cloud-init can help us do that.

We want to modify two parts of the profile: the apt: part will list a new repository to load and the packages: part will just tell cloud-init to install docker-ce. This is what it looks like:

config:
    # ...
    user.user-data: |
        apt:
            sources:
                docker:
                    source: deb https://download.docker.com/linux/ubuntu focal stable
                    keyid: 9DC858229FC7DD38854AE2D88D81803C0EBFCD88
        packages:
                # ...
                - docker-ce

Tailscale (inside)

If you want to let someone into the instance you can set up... another Tailscale tunnel. Again, since the instance runs with a working real-time clock, you don't need to jump through hoops this time.

config:
    # ...
    user.user-data: |
        apt:
            sources:
                tailscale:
                    source: deb https://pkgs.tailscale.com/stable/ubuntu focal main
                    keyid: 2596A99EAAB33821893C0A79458CA832957F5868
        packages:
                # ...
                - tailscale
        runcmd:
            - tailscale up -authkey your-tailscale-key-here


There's a lot of messy wires and hand-holding but also a healthy amount of code and automation in this setup. Cloud-init is not everything but it does help to spin things up the way you want to considerably. From there on you only need to keep holding the steering wheel and drive your infrastructure where you want to.

Introduction to SystemOTA (1/n)

Today I will talk about SystemOTA. An image-based, transactional update system for embedded Linux devices.

Disclaimer. This blog talks about my work but offers my personal opinions and does not speak for my employer or customers.


SystemOTA is one of the first projects started by Huawei Europe Open Source Technology Center (OSTC). It is an Apache 2.0-licensed, no-strings-attached update system for devices such as routers, gateways, smart switches, smart thermostats or any other device that is based on Linux and has a moderate amount of memory and storage.

SystemOTA came out of the desire to build something we called the Transparent Gateway. The transparent gateway is an OSTC Reference Design for a device combining connectivity to the internet with connectivity to a local mesh network. The gateway bridges the gap between small, Zephyr-based devices and the rest of the network. From the point of view of a prospective hardware vendor, the gateway comes with an open source operating system, a set of open source services and applications and a set of open source server-side software that allow the gateway to be kept up to date and secure.

It was very important to us to create a solution that is open at the level of the platform. We didn't want to make something that is technically open source on the device, but in practice requires a proprietary component running in the cloud.

There are several interesting software update projects out there today. Some of them are entirely open, others less so. Some have the right features but use that as the business model, keeping something essential closed. Others still have made technology choices that we would like to avoid. In the end we chose to write a new system that, we hope, can offer value to the market. Combining the right blend of openness, performance and scalability.

Given the scale of the problem, we wanted to be conservative in what we promise: a reliable update system for home devices based on Linux, that any company or enthusiast could deploy and operate, either for themselves or as a service for others.

Before we look at the server side, let's look at the update feature from the point of view of the device and the kind of features we want to offer.

  • We want the device to be able to update itself in the field, without the oversight of a person.
  • We want to either update fully or not at all, avoiding any intermediate state.
  • We want the update process to be resilient to externally induced problems, like power loss.

The first requirement is perhaps somewhat controversial. On the one hand, ownership of a device might imply that all the decisions are taken by the owner. On the other hand, the reality of operating connected devices, with the constant threat of malware and nascent legal requirements, puts the responsibility of offering automatic updates on anyone who wants to produce modern devices. Luckily we are not producing a physical box; our software is a part of a reference architecture. The final decisions and consequences are on the eventual manufacturer. On our side, we want to offer unattended updates as a technical possibility. This implies that certain decisions must be taken by the software, automatically and correctly. The only way to do that is to reduce complexity, to manage choices and to constrain the stack enough that a fully automatic update can leave the device in a state in no way worse than the original.

The second requirement is a direct consequence of the first requirement. We cannot allow ourselves to break the utility offered by the device due to a failed or partial update. While one might argue that a device that is disconnected from network and power supply has ultimate security, our goal is to provide the best possible security which retains function. In technical terms we must design the system to behave like a traditional database transaction. Regardless of how much work was performed during the update, until the transaction is committed, software running on the device sees only the current software and state, not any combination of the two. Once the update is complete, all the programs are updated at once, atomically and without corruption of the data.

The last requirement is a re-affirmation of the first requirement. Given that we have no control over how the devices are used, whatever we do during the update process must survive and not regress on the other two requirements when faced with the sudden loss of power supply. The proverbial user may yank the power cord at any time, and it must not matter.

Given finite resources, we also wanted to put some constraints, so that bounds of the system are clearly set:

  • SystemOTA is not aimed at updating applications that are not a part of the operating system
  • SystemOTA is not aimed at updating firmware embedded into the components that constitute the product
  • SystemOTA is not aimed at absolutely smallest amount of storage and memory that is still available on the market.

The first constraint is actually quite liberating. By not handling updates of the application layer, we are not prescribing what it must look like. There are a number of competing choices here and we believe our design is compatible with all of them. This space also has plenty of innovation and may offer a better experience (e.g. no downtime) compared to handling everything exactly the same way as system updates.

The second constraint is just practical. LVFS is the way to handle firmware updates and we want to align with that, convincing our partners where appropriate. Updating boot firmware is tricky in the sense that it is rarely handled automatically in the way OS updates have started to be. Not all platforms have fallback mechanisms and those that do may come with additional requirements that we cannot require ourselves. We think that for the moment, separating platform firmware from the OS is the right choice.

The last constraint is also practical. We want to use the transparent gateway as a platform for a special type of application that bridges the micro-controller world with both the local network and the cloud. We cannot do that on a system so stripped down that single megabytes matter. We picked a compiled but memory-safe language with automatic memory management instead of the, perhaps historically more traditional, C for the OTA stack. We believe that safety and security are more important than ultimate performance, and that this is the right trade-off. For that reason, the reference implementation is written in Go, with access to a high-quality http stack, ease of use and ease of testing. At the time of this writing, SystemOTA takes about four megabytes of storage and a comparable amount of memory at runtime. It is also designed to exit when idle, avoiding the cost of a constant memory tax.

We believe that the set of requirements and constraints will allow us to create a high-quality, competitive OS update solution. Combined with the design of the state management system, the transparent gateway will serve as a solid foundation for additional reference designs coming in 2022.

In the next installment we will look at the architecture of the transparent gateway and how our choices enable the update process to offer those guarantees.

Unexpected control flow

Today I will talk about unexpected control flow in shell.

I was doing code review of a piece of shell code. It looked good on paper, if some poor soul is printing shell on paper, and was even shellcheck-clean, but we decided to add set -e at the top of the script, to see if anything was failing silently.

It was, obviously.

After a moment of shock and disbelief, it struck us what was broken. The broken part was inside our heads. Shell itself did exactly what we told it to do. Our understanding of the semantics of all the shell features, combined, was the faulty element.

The code we looked at can be reduced to this snippet:

set -e
a=0
((a++))
# surprise, exit 1 here

If you don't immediately see the problem and are unwilling to look at the bash(1) manual page for clues, let me just jump to the conclusion.
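
As a minimal sketch of the behavior: in bash the arithmetic command ((expression)) returns exit status 1 when the expression evaluates to zero. With a=0, the post-increment a++ evaluates to the old value, zero, so the command counts as failed and set -e ends the script. The snippet below is my own illustration, not the reviewed code. It runs each variant in a child bash process and records the exit status, so the demonstration itself keeps running.

```shell
#!/bin/sh
# Each variant runs in its own bash process; "status=N" records how it exited.

# Post-increment: the expression evaluates to the old value (0), so (( )) returns 1
# and set -e terminates the child before printf runs.
post=$(bash -c 'set -e; a=0; ((a++)); printf "reached "'; echo "status=$?")

# Pre-increment: the expression evaluates to 1, so (( )) returns 0.
pre=$(bash -c 'set -e; a=0; ((++a)); printf "reached "'; echo "status=$?")

# Arithmetic expansion inside an assignment: the value does not affect the status.
expansion=$(bash -c 'set -e; a=0; a=$((a + 1)); printf "reached "'; echo "status=$?")

echo "post-increment:       $post"
echo "pre-increment:        $pre"
echo "arithmetic expansion: $expansion"
```

Only the post-increment variant stops early. Either of the other two forms is a drop-in replacement when the counter starts at zero.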


Should shellcheck warn about this issue? It might. I doubt any developer intentionally wrote code with the desire to always exit after the assignment, just because set -e and ((a++)) together require that.

Fixing VMware Workstation 16.1 after kernel upgrade on Ubuntu 20.04

Today I will talk about making VMware Workstation survive a kernel update

This is a continuation of https://listed.zygoon.pl/21339/installing-vmware-workstation-16-pro-on-ubuntu-20-04

When you get a new kernel you need to re-compile out-of-tree modules in order
for them to continue working, if possible. VMware currently uses two kernel
modules: vmmon and vmnet. Surprisingly after all those years VMware did not
adopt DKMS so you, the customer, are responsible for doing the hard work
that ought to be just integrated and never seen.

Enough complaints. Let's make it work for us right now, when we updated and are
grumpy that it broke:

sudo vmware-modconfig --console --install-all
sudo /usr/src/linux-headers-$(uname -r)/scripts/sign-file sha256 /var/lib/shim-signed/mok/MOK.priv /var/lib/shim-signed/mok/MOK.der $(modinfo -n vmmon)
sudo /usr/src/linux-headers-$(uname -r)/scripts/sign-file sha256 /var/lib/shim-signed/mok/MOK.priv /var/lib/shim-signed/mok/MOK.der $(modinfo -n vmnet)
sudo systemctl restart vmware

If anyone at VMware ever reads this, I would love to know the backstory and
understand the technical problems with adopting DKMS.

Integrating PVS Studio and Coverity with make

Today I will talk about integration of PVS Studio and Coverity into a make-based build system.

As a part of my work on Open Harmony, I'm looking into static analysis for C and
C++ projects.

I had prior experience with PVS Studio and Coverity. I've used them in my
personal projects as well as in my work on snap-confine, a privileged part of
snapd responsible for creation of the execution environment for snap application
processes, where security is extremely important.

Let us briefly look at how static analysis tools integrate with the build
system. In general static analysis tools do not run the code but instead
read the code and deduce useful properties this way. Since C includes a
pre-processor, which handles #include and various #if statements, the ideal
input to a static analyzer is pre-processed code. This way the static analysis
tool can be fed standalone information that no longer relies on system headers
or third party libraries that are necessary to understand definitions in the
code. This also makes such pre-processed input convenient for SaaS-like services,
where the analysis tool is not running locally with access to the local system.

In general all the static analysis tools I've tried behave this way. The main
difference, from an integrator's point of view, is whether the pre-processing
step is done implicitly or explicitly. The cheapest way to integrate with a tool
is when this can be done implicitly, namely by following and observing an
existing build process.
The analyzer support tool can trigger an otherwise standard build process,
observe how the compiler is executed, extract the relevant -I, -iquote and -D
flags, find the translation units and eventually pre-process the code internally.

Some tools offer the "build system observer", others do not or, in the special
case of gcc, do not need to, as the analyzer is an integral part of the compiler.

Let's consider two examples, explicit pre-processing with PVS Studio and
implicit pre-processing with Coverity. Examples below use make(1) syntax.

First we need a way to pre-process an arbitrary file. Interestingly this works for
both C and C++ as the pre-processor does not care about the actual source
language. Well, maybe except __cplusplus defines.

%.i: %
    $(CPP) $(CPPFLAGS) $< -E -o $@

For those who don't read make, this will invoke the pre-processor, which by
convention is named by the $(CPP) variable, pass all the options defined by
another variable, $(CPPFLAGS), then the input file $<, then the -E option,
which asks the compiler to stop at the pre-processing, followed by -o $@ to
write the result to the output file, which is named on the left hand side of the
rule header, %.i. The rule header shows how to make a file with the .i
extension out of any file; % behaves like * in globs.

Now we can ask the static analysis tool to do its job. Let's write another rule:

%.PVS-Studio.log: %.i
    pvs-studio --cfg .pvs-studio.cfg --i-file $< --source-file $* --output-file $@

Some bits are omitted for clarity. The real rule has additional dependencies on
the PVS-Studio license file and some directories. Interestingly we need to both
provide the pre-processed file $< and the original source file $*. Here $<
is the first dependency and $* is the stem, the text that was matched against
% in the rule header.

The resulting *.PVS-Studio.log files are opaque. They represent arbitrary
knowledge extracted by the tool from an individual translation unit. This mode
of operation is beneficial for parallelism, as those tasks can be executed
concurrently. As with compilation, in the end we need to link the results
together to get the result of our analysis.

pvs-report: $(wildcard *.PVS-Studio.log)
    plog-converter --settings .pvs-studio.cfg \
        --srcRoot . \
        --projectName foo \
        --projectVersion 1.0 \
        --renderTypes fullhtml \
        --output $@ \
        $^

Here we use another tool, plog-converter to merge the result of the analysis.
There are some additional options that influence the format and contents of the
generated report but those are self-explanatory. One interesting observation is
that this command does not fail in the make check style, by exiting with an
error code. If you intend to block on static analysis results there are some
additional steps you need to take. For PVS Studio I've created the appropriate
rules inside zmk, so that the logs can be processed and displayed in the same
style that a compiler would otherwise produce, so that the output is useful to
editors like vim.

That's it for PVS Studio, now let's look at Coverity. Coverity offers a sizable
(715MB) archive with all kinds of tooling. From our point of view we need just
one tool, the cov-build wrapper. The wrapper invokes an arbitrary build command,
in our case make, and stores the analysis inside a directory. Coverity requires
that directory to be called cov-int so we will follow along for simplicity.

Here is our make rule:

cov-int: $(MAKEFILE_LIST) $(wildcard *.c *.h) # everything!
    cov-build --dir $@ $(MAKE) -B

The rule is simple and could be improved. Ideally, to avoid the -B argument
(aka --always-make), we would perform an out-of-tree build in a temporary
directory. What we are after is a condition where make invokes the compiler,
even for the files we may have built in our tree before, so that cov-build
gets to observe the relevant arguments, as was described before. The more
problematic part is the build-dependency, which technically depends on
everything that the input may need. Here I simplified the real dependency set.
For practical CI systems that's sufficient. For purists it requires some care to
properly describe the recursive dependencies of the implicit target (commonly
called all).

The result is a directory we need to tar and send to Coverity for analysis. That
part is not interesting and details are available inside the zmk library.

Coverity processing is both asynchronous and capped behind a quota. I would not
recommend using it to block builds in CI, except if you have a commercial local
instance that never rejects your uploads. Coverity has a REST API for accessing
analysis results so with some extra integration that API can be queried and
appropriate blocking rules can be used to "break the build". Personally I did
not attempt this, mainly due to quota.

As a last note, Coverity puts additional requirements on valid submissions. At least 80%
of compilation units must be "ready for analysis". This can be checked with ad-hoc rule
that uses some shell to look at the analysis log file. The following shell snippet is not
quoted for correct use inside Make, see zmk for the quoted original:

test "$(tail cov-int/build-log.txt -n 3 | \
        awk -e '/[[:digit:]] C\/C\+\+ compilation units \([[:digit:]]+%) are ready for analysis/ { gsub(/[()%]/, "", $6); print $6; }')" \
     -gt 80

Next time we will look at the perceived value of the various static analysis
tools I've tried.

Footnote: ZMK can be found at https://github.com/zyga/zmk/

Installing VMware® Workstation 16 Pro on Ubuntu 20.04

Today I will talk about installing VMware® Workstation 16 Pro on Ubuntu 20.04 x86_64

For the most part, installation of this program has been streamlined. Compared to earlier versions, you really don't have to do anything more than:

chmod +x VMware-Workstation-Full-16.1.0-17198959.x86_64.bundle
sudo ./VMware-Workstation-Full-16.1.0-17198959.x86_64.bundle

This will give you a working application but won't let you run any virtual machines yet.

For a while now, kernel lockdown has been in effect: the kernel will not load unsigned kernel modules, so that they cannot be used to circumvent secure boot. Details are not interesting for now. Let's focus on the installation.

The following instructions assume you have enabled support for proprietary drivers during your installation process. If you don't remember setting a "secure boot password" and seeing a weird-looking and weirdly named MOK prompt during the first boot following the installation of Ubuntu, you probably did not do this and the following instructions will not work.

If you did, you have all the bits necessary now:

sudo /usr/src/linux-headers-$(uname -r)/scripts/sign-file sha256 /var/lib/shim-signed/mok/MOK.priv /var/lib/shim-signed/mok/MOK.der $(modinfo -n vmmon)
sudo /usr/src/linux-headers-$(uname -r)/scripts/sign-file sha256 /var/lib/shim-signed/mok/MOK.priv /var/lib/shim-signed/mok/MOK.der $(modinfo -n vmnet)
sudo modprobe vmmon
sudo modprobe vmnet

What is going on here is that the kernel lockdown prevents usage of unsigned modules. You can find this message in your journal / syslog and follow the breadcrumbs to the relevant manual page.

kernel: Lockdown: modprobe: unsigned module loading is restricted; see man kernel_lockdown.7

You can read the manual page here: https://man7.org/linux/man-pages/man7/kernel_lockdown.7.html

Remember that this has to be done every time you change your kernel.
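
Since this has to be repeated after every kernel change, the steps can be wrapped in a small script. What follows is only a sketch under assumptions: the names run, resign_vmware_modules and DRY_RUN are hypothetical, the paths are copied from the commands above, and by default nothing is executed, the commands are only printed.

```shell
#!/bin/sh
# Sketch: re-sign and load the VMware modules for the running kernel.
# DRY_RUN=1 (the default here) only prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
MOK=/var/lib/shim-signed/mok

# run CMD...: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

resign_vmware_modules() {
    kernel="$(uname -r)"
    for mod in vmmon vmnet; do
        if [ "$DRY_RUN" = 1 ]; then
            path="<$mod.ko>"                 # placeholder, modinfo is not called
        else
            path="$(modinfo -n "$mod")"      # path of the installed module file
        fi
        run sudo "/usr/src/linux-headers-$kernel/scripts/sign-file" sha256 \
            "$MOK/MOK.priv" "$MOK/MOK.der" "$path"
    done
    for mod in vmmon vmnet; do
        run sudo modprobe "$mod"
    done
}

resign_vmware_modules
```

Running it with DRY_RUN=0 performs the same four commands as listed above, so it needs the same MOK setup.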

zmk 0.4.2 released

I've released zmk 0.4.2 with several small bug-fixes.

ZMK is a Make library for writing makefiles that behave like autotools without having the related baggage. It works out-of-the-box on POSIX systems, including MacOS.

You can find ZMK releases at https://github.com/zyga/zmk/releases

The changelog for zmk 0.4.2 is:

  • The PVS module no longer fails when running the pvs-report target.

  • The Header module no longer clobbers custom InstallDir.

  • The Library.DyLib template no longer creates symlink foo -> foo.dylib when
    the library is not versioned. In addition the amount of code shared between
    Library.So and Library.DyLib has increased.

Raspberry Pi 4B and 4K display at 60Hz

Today I will talk about using Raspberry Pi 4B as a Ubuntu 20.10 desktop, on a 4K TV.

If you are interested in using Ubuntu 20.10 on a Raspberry Pi 4B with 8GB of RAM and want to use a TV as a display, I have a bit of advice that can help you out and save you time.

1) Do not upgrade from 20.04. After upgrading, the essential changes to the boot section won't happen and you won't have hardware acceleration. Unless you know what to change, just install 20.10 from scratch.

2) You need to edit /boot/firmware/config.txt and add hdmi_enable_4kp60=1, preferably to the [pi4] section. This setting is applied on boot. Your Raspberry Pi should have proper cooling.

3) You need to use the micro-HDMI port next to the USB-C power connector.

4) Your TV needs to be set to "PC" mode. Details differ but at least with my TV I was only getting 30Hz in any other mode. It may be related to the color format, I don't know.

5) If you boot and see the login screen just fine but get a blank / no signal screen after logging in then look at your ~/.config/monitors.xml file. I had the 30Hz refresh rate selected there and (again) it was not accepted by my TV in PC mode.

Good luck!

Introduction to bashunit - unit testing for bash scripts

Today I will talk about bashunit - a unit testing library for bash scripts.

All the posts about bash were building up to this. I wanted to be able to test bash scripts, but having found nothing that makes that practical, I decided to roll my own.

Having created bashcov earlier, I needed to connect the dots between discovering tests, running them, reporting errors in a nice way and measuring coverage at the same time. I also wanted to avoid having to source bashunit from test scripts, to avoid complexity related to having two moving parts, the tests and the program running them. I settled on the following design.

bashunit is a standalone bash script. It can enumerate tests by looking for files ending with _test.sh. In isolation, it sources each one and discovers functions starting with test_. In further isolation, it calls each test function, having established tracing in order to compute coverage. All output from the test function is redirected. On success only test function names and test file names are printed. On failure the subset of the trace related to the test function is displayed, as is any output that was collected.

The ultimate success or failure is returned with the exit code, making bashunit suitable for embedding into a larger build process.

In addition, since coverage is enlightening, similar to bashcov the collected coverage data is used to create an annotated script with the extension .coverage, corresponding to each sourced program that was executed through testing. This data is entirely human readable and is meant to pinpoint gaps in testing or unexpected code flow due to the complexity in bash itself.

Let's look at a simple example. We will be working with a pair of files, one called example.sh, housing our production code, and another one called example_test.sh, with our unit tests.

Let's look at example.sh first

#!/bin/bash
hello_world() {
    echo "Hello, World"
}
if [ "${0##*/}" = example.sh ]; then
    hello_world
fi

When executed, the script prints Hello, World. When sourced it defines the hello_world function and does nothing else. Simple enough. Let's look at our unit tests.


. example.sh

test_hello_world() {
    hello_world | grep -qFx 'Hello World'
}

In the UNIX tradition, we use grep to match the output of the hello_world function. The arguments to grep are -q for --quiet, to avoid printing the matching output, -F for --fixed-strings to avoid using regular expressions and finally -x for --line-regexp to consider matches only matching entire lines (avoids matching a substring by accident).
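
To see what each flag contributes, here is a small sketch that can be run as is. The helper name check is hypothetical and exists only for this illustration; it prints whether grep matched.

```shell
#!/bin/sh
# check OPTIONS PATTERN INPUT: report whether grep OPTIONS PATTERN matches INPUT.
check() {
    # $1 is left unquoted on purpose so that it expands to grep options.
    if printf '%s\n' "$3" | grep -q $1 -- "$2"; then
        echo "match"
    else
        echo "no match"
    fi
}

check -Fx 'Hello World'  'Hello, World'   # no match: the pattern lacks the comma
check -Fx 'Hello, World' 'Hello, World'   # match: the entire line is identical
check -F  'Hello'        'Hello, World'   # match: without -x a substring is enough
check -Fx 'Hello'        'Hello, World'   # no match: -x demands the entire line
check -x  'H.llo, World' 'Hello, World'   # match: without -F the dot is a wildcard
check -Fx 'H.llo, World' 'Hello, World'   # no match: -F treats the dot literally
```

The first call is the same comparison the test function performs against the output of hello_world.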

Running bashunit in the same directory yields the following output:

bashunit: sourcing example_test.sh
bashunit: calling test_hello_world
bashunit: calling test_hello_world resulted in exit code 1
bashunit: trace of execution of test_hello_world
bashunit:    ++ . example.sh
bashunit:    + hello_world
bashunit:    + grep -qFx 'Hello World'

What is that? Our tests have failed. Well, I made them fail. If you look carefully, the example code has a comma between Hello and World, while the test code does not.

Correcting that discrepancy produces the following output:

bashunit: sourcing example_test.sh
bashunit: calling test_hello_world

The exit status is zero, indicating that our tests have passed. In addition to this, we have coverage analysis in the file example.sh.coverage. It looks like this:

  -: #!/bin/bash
  -: hello_world() {
  1:    echo "Hello, World"
  -: }
  1: if [ "${0##*/}" = example.sh ]; then
  -:     hello_world
  -: fi

The two 1s indicate that the corresponding line was executed one time. If you add loops or write multiple tests for a single function you will see that number increment accordingly.

Interestingly, not only the test functions are executed. The guard at the bottom, where we either execute hello_world or do nothing more, is also executed. This happens when the example.sh script is sourced by bashunit.

Much more is possible, but this is the initial version of bashunit. It requires a bit more modern bash than what is available on MacOS, so if you want to use it, a good Linux distribution is your best bet.

You can find bashunit, bashcov and the example at https://github.com/zyga/bashcov

Broken composition or the tale of bash and set -e

Today I will talk about a surprising behavior in bash, that may cause issues by hiding bugs.

Bash has rather simple error checking support. There are, in general, two ways one can approach error checking. The first one is entirely unrealistic, the second one has rather poor user experience.

The first way to handle errors is to wrap every single thing in an if-then-else statement, provide a tailored error message, perform cleanup and quit. Nobody is doing that. Shell scripts are sloppy more often than not.

The second way to handle errors is to let bash do it for you, by setting the errexit option with set -e. To illustrate this, look at the following shell program:

set -e

echo "I'm doing stuff"
false
echo "I'm done doing stuff"

As one may expect, false will be the last executed command.

Sadly, things are not always what one would expect. For good reasons set -e is ignored in certain contexts. Consider the following program:

set -e
if ! false; then
   echo "not false is true"
fi

Again, as one would expect, the program executes in its entirety. Execution does not stop immediately at false, as that would prevent anyone from using set -e in any non-trivial script.

This behavior is documented by the bash manual page:

Exit immediately if a pipeline (which may consist of a single simple command), a list, or a compound command (see SHELL GRAMMAR above), exits with a non-zero status. The shell does not exit if the command that fails is part of the command list immediately following a while or until keyword, part of the test following the if or elif reserved words, part of any command executed in a && or || list except the command following the final && or ||, any command in a pipeline but the last, or if the command's return value is being inverted with !. If a compound command other than a subshell returns a non-zero status because a command failed while -e was being ignored, the shell does not exit. A trap on ERR, if set, is executed before the shell exits. This option applies to the shell environment and each subshell environment separately (see COMMAND EXECUTION ENVIRONMENT above), and may cause subshells to exit before executing all the commands in the subshell.

The set of situations where set -e is ignored is much larger. It contains things like pipelines, commands involving && and ||, except for the final element. It is also ignored when the exit status is inverted with !, making something as innocent as ! true silently ignore the exit status.
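
A small sketch of those contexts, assuming nothing beyond a stock bash: the script below runs under set -e inside a child bash, and the numbered messages (my own wording) show how far it gets before a plain false finally stops it.

```shell
#!/bin/sh
# Run a short script under set -e in a child bash and record its exit status.
out=$(bash -c '
set -e
false || echo "1: failure on the left of || is ignored"
false && echo "never printed"   # failure on the left of && is ignored too
echo "2: still running after false &&"
! true
echo "3: still running after ! true"
false | true
echo "4: still running after false | true"
false
echo "never printed"
'; echo "status=$?")
echo "$out"
```

Only the last, bare false is subject to set -e; the four earlier failures all pass unnoticed.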

What may arguably need more emphasizing, is this:

If a compound command or shell function executes in a context where -e is being ignored, none of the commands executed within the compound command or function body will be affected by the -e setting, even if -e is set and a command returns a failure status. If a compound command or shell function sets -e while executing in a context where -e is ignored, that setting will not have any effect until the compound command or the command containing the function call completes.

What this really says is that set -e is disabled for the entire duration of a compound command. This is huge, as it means that code which looks perfectly fine and works perfectly fine in isolation, is automatically broken by other code, which also looks and behaves perfectly fine in isolation.

Consider this function:

foo() {
    set -e # for extra sanity
    echo "doing stuff"
    false
    echo "finished doing stuff"
}

This function looks correct and is correct in isolation. Let's carry on and look at another function:

bar() {
    set -e # for extra sanity
    if ! foo; then
        echo "foo has failed, alert!"
        return 1
    fi
}

This also looks correct. In fact, it is correct as well, as long as foo is an external program. If foo is a function, like the one we defined above, the outcome is different: foo executes all the way to the end, ignoring set -e's normal effect after the non-zero exit code from false. Unlike when invoked in isolation, foo prints the finished doing stuff message. What is worse, because echo succeeds, foo doesn't fail!

Bash breaks composition of correct code. Two correct functions stop being correct, if the correctness was based on the assumption, that set -e stops execution of a list of commands, after the first failing element of that list.
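
The difference can be observed directly. The following is a sketch rather than a test suite: it defines the same foo as above and calls it twice, each time in a fresh bash process, once as a plain command and once as the condition of an if statement. The variable names are my own.

```shell
#!/bin/sh
# Definition of foo, shared by both scenarios below.
defs='
foo() {
    set -e
    echo "doing stuff"
    false
    echo "finished doing stuff"
}
'

# foo called as a plain command: set -e stops it at false.
alone=$(bash -c "$defs"'foo'; echo "status=$?")

# foo called as the condition of if: set -e is ignored inside foo.
composed=$(bash -c "$defs"'if ! foo; then echo "foo has failed, alert!"; fi'; echo "status=$?")

echo "--- alone ---";    echo "$alone"
echo "--- composed ---"; echo "$composed"
```

In the first case the output ends after doing stuff with status 1. In the second case both messages are printed, the alert never fires and the status is 0.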

It's a documented feature, so it's hardly something one can report as a bug on bash. One can argue that shellcheck should warn about that. I will file a bug on shellcheck, but discovering this has strongly weakened my trust in using bash for anything other than an isolated, executed script. Using functions or sourcing other scripts is a huge risk, as the actual semantics is not what one would expect.

Poor man's introspection in bash

Today I wanted to talk about bashunit, a unit testing library for bash scripts. This turned out to be a bigger topic, so you will have to wait a bit longer for the complete story. Instead I will talk about how to do poor man's introspection in bash, so that writing tests is less cumbersome.

In some sense, if you have to know this sort of obscure bash feature, it may be a good indication to stop, take a step back and run away. Still if you are reading this, chances are you run towards things like that, not away.

While writing bashunit, the library for unit testing bash scripts, I wanted to not have to enumerate test functions by hand. Doing that would be annoying, error-prone and just silly. Bash being a kitchen sink, must have a way to discover that instead. Most programming languages, with the notable exception of C, have some sort of reflection or introspection capability, where the program can discover something about itself through a specialized API. Details differ widely but the closer a language is to graphical user interfaces or serialization and wire protocols, the more likely it is to grow such capability. Introspection has to have a cost, as there must be additional meta-data that describes the various types, classes and functions. On the upside, much of this data is required by the garbage collector anyway, so you might as well use it.

Bash is very far away from that world. Bash is rather crude in terms of language design. Still it has enough for us to accomplish this task. The key idea is to use the declare built-in, which is normally used to define variables with specific properties. When used with the -F switch, it can also be used to list function declarations, omitting their body text.

We can couple that with a loop that reads subsequent function declarations, filter out the declaration syntax and end up with just the list of names. From there on all we need is a simple match expression and we can find anything matching a simple pattern. Et voilà, our main function, which discovers all the test functions and runs them.

bashunit_main() {
    local def
    local name
    declare -F | while IFS= read -r def; do
        name="${def##declare -f }"
        case "$name" in
            test_*)
                if ! "$name"; then
                    echo "bashunit: test $name failed"
                    return 1
                fi
                ;;
        esac
    done
}
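
To try the discovery step on its own, here is a minimal sketch. The three function names are hypothetical and the loop only prints the names it finds instead of calling them. It runs in a child bash, since declare -F is a bash feature.

```shell
#!/bin/sh
# List the functions of a child bash whose names start with test_.
names=$(bash -c '
test_alpha() { :; }
test_beta() { :; }
helper() { :; }
declare -F | while IFS= read -r def; do
    name="${def##declare -f }"      # strip the "declare -f " prefix
    case "$name" in
        test_*) echo "$name" ;;     # keep only names matching the pattern
    esac
done
')
echo "$names"
```

The helper function is declared but filtered out by the case pattern, which is all the convention a test file needs to follow.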

Tomorrow we will build on this idea, to create a very simple test runner and coverage analyzer.

Measuring execution coverage of shell scripts

Today I will talk about measuring test coverage of shell scripts

Testing is being honest about our flawed brains that constantly make mistakes regardless of how much we try to avoid it. Modern programming languages make writing test code a first-class concept, with intrinsic support in the language syntax and in the first-party tooling. Next to memory safety and concurrency safety, excellent testing support allows us to craft ever larger applications with an acceptable failure rate.

Shell scripts are as old as UNIX, and are usually devoted to glue logic. Normally testing shell scripts is done the hard way, in production. For more critical scripts there's a tendency to test the end-to-end interaction but, as far as I'm aware, writing unit tests and measuring coverage is unexpected.

In a way that's sensible, as long as shell scripts are small, rarely changed and indeed are battle tested in production. On the other hand nothing is unchanged forever: environments change, code is subtly broken, and programmers across the entire experience spectrum can easily come across a subtly misunderstood, or broken, feature of the shell.

In a way static analysis tools have outpaced the traditional hard way of testing shell programs. The utterly excellent shellcheck program should be a mandatory tool in the arsenal of anyone who routinely works with shell programs. Today we will not look at shellcheck, instead we will look at how we can measure test coverage of a shell program.

I must apologize, at all times when I wrote shell I really meant bash. Not because bash is the best or most featureful shell, merely because it happens to have the right intersection of having enough features and being commonly used enough to warrant an experiment. It's plausible or even likely that zsh or fish have similar capabilities that I have not explored yet.

What capabilities are those? The ability to implement an execution coverage program in bash itself. Much like when using Python, C, Java or Go, we want to see if our test code at least executes a specific portion of the program code.

Bash has two features that make writing such a tool possible. The first one is most likely known to everyone, the set -x option, which enables tracing. Tracing prints the commands, just as they are executed, to standard error. This feels like almost what we want: if only we could easily map the command to a location in a source file, we could construct a crude, line-oriented analysis tool. The second feature is also standard, albeit perhaps less well-known. It is the PS4 variable, which defines the format of the trace output. If only we could put something as simple as $FILENAME:$LINENO there, right? Well, in bash we can, although the first variable has a bash-specific name, $BASH_SOURCE. A third feature, which makes this convenient, is the ability to redirect the trace to a different file descriptor. We can do that by setting BASH_XTRACEFD=... to the number of an open file descriptor.

With those features combined we can easily run a test program, which sources a production program, exercises a specific function and quits. We can write unit tests. We can also run integration tests and check if any of the production code is missing coverage, which indicates that an important test is missing.
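
A minimal sketch of the idea, not the actual bashcov code: the file names prog.sh, run.sh and trace.log are hypothetical, and file descriptor 9 is an arbitrary choice. The runner sends the trace to a separate file, with each traced command prefixed by the source file name and line number.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# A tiny "production" script with one function.
cat > "$tmp/prog.sh" <<'EOF'
greet() {
    echo "hello"
}
greet
EOF

# The runner: open a file for the trace, point xtrace at it, then source the program.
cat > "$tmp/run.sh" <<'EOF'
exec 9> "$1/trace.log"                  # open fd 9 for writing
BASH_XTRACEFD=9                         # send trace output there, not to stderr
PS4='+${BASH_SOURCE##*/}:${LINENO}: '   # prefix each traced command with file:line
set -x
. "$1/prog.sh"
set +x
EOF

bash "$tmp/run.sh" "$tmp"
cat "$tmp/trace.log"
```

Counting how often each file:line pair occurs in trace.log is then enough to build a crude per-line execution count.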

I pieced together a very simple program that uses this idea. It is available at https://github.com/zyga/bashcov and is written in bash itself.

Signal to noise ratio in build systems

Today I will argue why silent rules are a useful feature of good build systems.

Build systems build stuff, mainly by invoking other tools, like compilers, linkers, code generators and file system manipulation tools. Build tools were traditionally printing some indication of progress. Make displays the commands as they are executed. CMake displays a quasi progress bar, including the name of the compiled file and a counter.

Interestingly, it seems that the more vertically oriented the tool, the less output shows up by default. If you need to hand-craft a solution out of parts, like with make, debugging the parts is important to the program you are building. Compare the verbosity of an autotools build system with a go build ./... invocation, that can build many thousands of programs and libraries. The former prints walls of text, the latter prints nothing, unless there's an error.

As an extreme case, this is taken from the build log of Firefox 79. This is the command used to compile a single file. Note that the command is not really verbatim, as the <<PKGBUILDDIR>> parts hide long directory names used internally in the real log (this part is coming from the Debian build system). Also note that despite the length, this is a single line.

/usr/bin/gcc -std=gnu99 -o mpi.o -c -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -fstack-protector-strong -DNDEBUG=1 -DTRIMMED=1 -DNSS_PKCS11_2_0_COMPAT -DMOZ_HAS_MOZGLUE -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -DSTATIC_EXPORTABLE_JS_API -I/<<PKGBUILDDIR>>/third_party/prio -I/<<PKGBUILDDIR>>/build-browser/third_party/prio -I/<<PKGBUILDDIR>>/security/nss/lib/freebl/mpi -I/<<PKGBUILDDIR>>/third_party/msgpack/include -I/<<PKGBUILDDIR>>/third_party/prio/include -I/<<PKGBUILDDIR>>/build-browser/dist/include -I/usr/include/nspr -I/usr/include/nss -I/usr/include/nspr -I/<<PKGBUILDDIR>>/build-browser/dist/include/nss -fPIC -include /<<PKGBUILDDIR>>/build-browser/mozilla-config.h -DMOZILLA_CLIENT -Wdate-time -D_FORTIFY_SOURCE=2 -O2 -fdebug-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -fno-strict-aliasing -ffunction-sections -fdata-sections -fno-math-errno -pthread -pipe -g -freorder-blocks -O2 -fomit-frame-pointer -funwind-tables -Wall -Wempty-body -Wignored-qualifiers -Wpointer-arith -Wsign-compare -Wtype-limits -Wunreachable-code -Wduplicated-cond -Wno-error=maybe-uninitialized -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-error=coverage-mismatch -Wno-error=free-nonheap-object -Wno-multistatement-macros -Wno-error=class-memaccess -Wno-error=deprecated-copy -Wformat -Wformat-overflow=2 -MD -MP -MF .deps/mpi.o.pp /<<PKGBUILDDIR>>/security/nss/lib/freebl/mpi/mpi.c

This particular build log is 299738 lines long. That's about 40MB of text output, for a single build.

Obviously, not all builds are alike. There is value in an overly verbose log, like this one, because when something fails the log may be all you get. It is useful to be able to repeat the exact steps taken to see the failure in order to fix it.

On the other end of the spectrum you can look at incremental builds, performed locally while editing the source. There, some things are notable:

  • The initial build is much like the one quoted above, except that the log file will not be looked at by hand. An IDE may parse it to pick up warnings or errors. Many developers don't use IDEs and just run the build and ignore the wall of text it produces, as long as it doesn't fail entirely.

  • As code is changed, the build system will re-compile the parts that became invalidated by the changes. This can be as little as one .c file or as many as all the .c files that include a common header that was changed. Interestingly, computing the number of files that need changes may take a while and it may be faster to start compiling even before the whole set is known. Having a precise progress bar may be detrimental to the performance.

  • The output of the compiler may be more important than the invocation of the compiler. After all, it's very easy to invoke the build system again. Reading a page-long argument list to gcc is less relevant than the printed error or warning.

That last point is what I want to focus on. The whole idea is to hide or simplify some information, in order to present other kinds of information more prominently. We attenuate the build command to amplify the compiler output.

Compare those two make check output logs from my toy library. I'm working on a few new manual pages and I have a rule which uses man to verify syntax. Note that I specifically used markup that wraps long lines, as this is also something you'd see in a terminal window.

This is what you get out of the box:

zyga@x240 ~/D/libzt (feature/defer)> make check
/usr/bin/shellcheck configure
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/ZT_CMP_BOOL.3 2>&1 >/dev/null | sed -e 's@tbl:@man/ZT_CMP_BOOL.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/ZT_CMP_INT.3 2>&1 >/dev/null | sed -e 's@tbl:@man/ZT_CMP_INT.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/ZT_CMP_PTR.3 2>&1 >/dev/null | sed -e 's@tbl:@man/ZT_CMP_PTR.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/ZT_CMP_RUNE.3 2>&1 >/dev/null | sed -e 's@tbl:@man/ZT_CMP_RUNE.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/ZT_CMP_UINT.3 2>&1 >/dev/null | sed -e 's@tbl:@man/ZT_CMP_UINT.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/ZT_CURRENT_LOCATION.3 2>&1 >/dev/null | sed -e 's@tbl:@man/ZT_CURRENT_LOCATION.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/ZT_FALSE.3 2>&1 >/dev/null | sed -e 's@tbl:@man/ZT_FALSE.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/ZT_NOT_NULL.3 2>&1 >/dev/null | sed -e 's@tbl:@man/ZT_NOT_NULL.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/ZT_NULL.3 2>&1 >/dev/null | sed -e 's@tbl:@man/ZT_NULL.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/ZT_TRUE.3 2>&1 >/dev/null | sed -e 's@tbl:@man/ZT_TRUE.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/libzt-test.1 2>&1 >/dev/null | sed -e 's@tbl:@man/libzt-test.1@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/libzt.3 2>&1 >/dev/null | sed -e 's@tbl:@man/libzt.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_check.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_check.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_claim.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_claim.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_closure.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_closure.3@g'
mdoc warning: A .Bd directive has no matching .Ed (#20)
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_closure_func0.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_closure_func0.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_closure_func1.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_closure_func1.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_defer.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_defer.3@g'
Usage: .Fn function_name [function_arg] ... (#16)
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_location.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_location.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_location_at.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_location_at.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_main.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_main.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_pack_boolean.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_pack_boolean.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_pack_closure0.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_pack_closure0.3@g'
mdoc warning: A .Bd directive has no matching .Ed (#20)
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_pack_closure1.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_pack_closure1.3@g'
mdoc warning: A .Bd directive has no matching .Ed (#21)
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_pack_integer.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_pack_integer.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_pack_nothing.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_pack_nothing.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_pack_pointer.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_pack_pointer.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_pack_rune.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_pack_rune.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_pack_string.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_pack_string.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_pack_unsigned.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_pack_unsigned.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_test.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_test.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_test_case_func.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_test_case_func.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_test_suite_func.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_test_suite_func.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_value.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_value.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_visit_test_case.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_visit_test_case.3@g'
LC_ALL=C MANROFFSEQ= MANWIDTH=80 man --warnings=all --encoding=UTF-8 --troff-device=utf8 --ditroff --local-file man/zt_visitor.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_visitor.3@g'
plog-converter --settings ./.pvs-studio.cfg -d V1042 --srcRoot . --renderTypes errorfile zt.c.PVS-Studio.log zt-test.c.PVS-Studio.log | srcdir=. abssrcdir=/home/zyga/Dokumenty/libzt awk -f /usr/local/include/zmk/pvs-filter.awk
libzt self-test successful

This is what you get when you use ./configure --enable-silent-rules:

zyga@x240 ~/D/libzt (feature/defer)> make check
SHELLCHECK configure
MAN man/libzt-test.1
MAN man/libzt.3
MAN man/zt_check.3
MAN man/zt_claim.3
MAN man/zt_closure.3
mdoc warning: A .Bd directive has no matching .Ed (#20)
MAN man/zt_closure_func0.3
MAN man/zt_closure_func1.3
MAN man/zt_defer.3
Usage: .Fn function_name [function_arg] ... (#16)
MAN man/zt_location.3
MAN man/zt_location_at.3
MAN man/zt_main.3
MAN man/zt_pack_boolean.3
MAN man/zt_pack_closure0.3
mdoc warning: A .Bd directive has no matching .Ed (#20)
MAN man/zt_pack_closure1.3
mdoc warning: A .Bd directive has no matching .Ed (#21)
MAN man/zt_pack_integer.3
MAN man/zt_pack_nothing.3
MAN man/zt_pack_pointer.3
MAN man/zt_pack_rune.3
MAN man/zt_pack_string.3
MAN man/zt_pack_unsigned.3
MAN man/zt_test.3
MAN man/zt_test_case_func.3
MAN man/zt_test_suite_func.3
MAN man/zt_value.3
MAN man/zt_visit_test_case.3
MAN man/zt_visitor.3
PLOG-CONVERTER zt.c.PVS-Studio.log zt-test.c.PVS-Studio.log static-check-pvs
EXEC libzt-test
libzt self-test successful

I will let you decide which output is more readable. Did you spot the mdoc warning lines on your first read? If your build system supports that, consider using silent rules and fix those warnings you now see.
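If a build system has no silent rules, a similar effect can be approximated by filtering the log after the fact. The following is only a rough sketch in Python, not how libzt or automake implement silent rules: the LABELS table, the condense and filter_log names and the "last non-option argument is the target" heuristic are all assumptions made for illustration. Lines that do not start with a known tool, such as the mdoc warnings, pass through untouched.

```python
import re

# Hypothetical mapping from a tool name to the short label shown instead
# of the full command line. Extend it to match your own build.
LABELS = {"gcc": "CC", "cc": "CC", "man": "MAN", "shellcheck": "SHELLCHECK"}

ENV_ASSIGNMENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*=.*")
REDIRECTION = re.compile(r"\d*[<>]")


def condense(line):
    """Return "LABEL target" for a recognised command, else the line as-is."""
    words = line.split()
    # Skip leading VAR=value assignments such as LC_ALL=C.
    while words and ENV_ASSIGNMENT.fullmatch(words[0]):
        words.pop(0)
    if not words:
        return line
    tool = words[0].rsplit("/", 1)[-1]  # /usr/bin/gcc -> gcc
    label = LABELS.get(tool)
    if label is None:
        return line  # warnings, errors and anything we do not recognise
    args = words[1:]
    if "|" in args:
        args = args[: args.index("|")]  # ignore the rest of a pipeline
    # Heuristic: the target is the last word that is neither an option
    # nor a shell redirection like 2>&1 or >/dev/null.
    targets = [
        a for a in args if not a.startswith("-") and not REDIRECTION.match(a)
    ]
    if not targets:
        return line
    return "%s %s" % (label, targets[-1])


def filter_log(lines):
    """Yield the condensed form of each log line."""
    for line in lines:
        yield condense(line.rstrip("\n"))


# Small demonstration using lines shaped like the ones quoted above.
sample = [
    "/usr/bin/shellcheck configure",
    "LC_ALL=C MANWIDTH=80 man --warnings=all --local-file man/zt_defer.3 2>&1 >/dev/null | sed -e 's@tbl:@man/zt_defer.3@g'",
    "Usage: .Fn function_name [function_arg] ... (#16)",
]
for out in filter_log(sample):
    print(out)
```

To use it on a real build, one could feed sys.stdin to filter_log and pipe the output of make through the script. The heuristic is deliberately crude, so keep the full log around for the cases where the exact command matters.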

Build system griefs - autotools

I always had a strong dislike of commonly used build systems. There's always something that would bug me about those I had to use or interact with.

Autoconf/Automake/Libtool are complex, slow and ugly. Custom, weird macro language? Check. Makefile lookalike with different features? Check. Super slow, single-threaded configuration phase? Check. Gigantic, generated scripts and makefiles, full of compatibility code, workarounds for dead platforms and a general feeling of misery when something goes wrong? Check. Practical value for porting between Linux, MacOS and Windows? Well, sort of, if you want to endure the pain of setting up the dependency chain there. It feels very foreign away from GNU.

Autotools was the first build system I encountered. Decades later, it's still remarkably popular, thanks to both broad feature support and inertia. Decades later it still lights up exactly one core, on those multi-core workstations we call laptops. The documentation is complete but arguably cryptic and locked in weird info pages. Most projects I've seen cargo-cult and tweak their scripts and macros from one place to another.

Despite all the criticism, autotools did get some things right, in my opinion. The build-time detection of features, ugly and slow as it is, and abused for checking things that are available everywhere now, is still the killer feature. There would be no portable C software as we know it today without the ability to toggle those ifdefs and enable sections of the code depending on the availability and functionality of an API, dependency or platform feature.
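To make the idea concrete, here is a minimal sketch of what such a check boils down to, written in Python rather than in shell and m4. It is not what autoconf generates: the function names (have_macro, probe_source, check_function, config_line) are made up for this example, and the probe is only modelled on the general approach of trying to compile and link a tiny C program that calls the function in question. The result is a line you could place in a config.h-style header for the C code to test with #ifdef.

```python
import os
import shutil
import subprocess
import tempfile


def have_macro(function):
    """Macro name for a function, following the HAVE_<NAME> convention."""
    return "HAVE_" + function.upper()


def probe_source(function):
    """A tiny C program that only links if the function exists."""
    return (
        "char %s(void);\n"
        "int main(void) { return (int)%s(); }\n" % (function, function)
    )


def check_function(function, compiler="cc"):
    """Return True if the probe compiles and links with the given compiler.

    Returns False when the compiler is missing or the build fails.
    """
    if shutil.which(compiler) is None:
        return False
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "probe.c")
        exe = os.path.join(tmp, "probe")
        with open(src, "w") as stream:
            stream.write(probe_source(function))
        result = subprocess.run(
            [compiler, src, "-o", exe],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0


def config_line(function, available):
    """Format the outcome as a line for a config.h-style header."""
    macro = have_macro(function)
    if available:
        return "#define %s 1" % macro
    return "/* #undef %s */" % macro


# Probe two functions; the outcome depends on the local C library.
for name in ("strlen", "strlcpy"):
    print(config_line(name, check_function(name)))
```

Each probe spawns a compiler, which hints at why a configuration phase made of hundreds of such checks, run one after another, is slow.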

The user-interaction via the configuration script, now commonly used to draw lines in the sand and show how one distribution archetype differs from the other, is ironically still one of the best user interfaces for building that does not involve a full blown menu system.

Then there is the theory that you don't need autotools installed to build a project that uses it. That theoretical portability, albeit to mostly fringe systems, is also a noble goal. Though today I rarely see systems that don't rip out the generated build system and re-generate it from source, mainly to ensure nobody has snuck anything nasty into that huge, un-auditable, generated shell monstrosity that rivals the size of modest projects.

Will autotools eventually be replaced? I don't think so. It seems like one of those things that gets phased out only when a maintainer retires. The benefits of rewriting the whole build system and moving to something more modern must outweigh the pain and cost of doing so. In the end it would help to modernize autotools more than it would help to convince everyone to port their software over.

New Blog

Just testing the whole blogging via-notes-thing, thanks to https://standardnotes.org/.

The idea is that you can blog from a desktop or mobile client, by creating a set of notes that appear as distinct posts. Not all notes are public, in fact, by default they are all encrypted and private.

The effort to set this up is remarkably low. The only downside is that, as with all hosted products, there's a free tier that is not as nice as the paid subscription.

There's a snap or appimage for Linux. The way to get your blog listed on, wait for it, https://listed.to, (har har), is a bit cumbersome, but this is all thanks to privacy so it's not too bad.

I may keep this.

Oh and thanks to https://listed.to/@jasone for the idea.