Introduction to SystemOTA (1/n)

Today I will talk about SystemOTA, an image-based, transactional update system for embedded Linux devices.

Disclaimer. This blog talks about my work but offers my personal opinions and does not speak for my employer or customers.

--

SystemOTA is one of the first projects started by the Huawei Europe Open Source Technology Center (OSTC). It is an Apache 2.0-licensed, no-strings-attached update system for devices such as routers, gateways, smart switches, smart thermostats or any other device that is based on Linux and has a moderate amount of memory and storage.

SystemOTA came out of the desire to build something we called the Transparent Gateway. The transparent gateway is an OSTC Reference Design for a device combining connectivity to the internet with connectivity to a local mesh network. The gateway bridges the gap between small, Zephyr-based devices and the rest of the network. From the point of view of a prospective hardware vendor, the gateway comes with an open source operating system, a set of open source services and applications and a set of open source server-side software that allow the gateway to be kept up to date and secure.

It was very important to us to create a solution that is open at the level of the platform. We didn't want to make something that is technically open source on the device, but in practice requires a proprietary component running in the cloud.

There are several interesting software update projects out there today. Some of them are entirely open, others less so. Some have the right features but use that as the business model, keeping something essential closed. Others still have made technology choices that we would like to avoid. In the end we chose to write a new system that, we hope, can offer value to the market by combining the right blend of openness, performance and scalability.

Given the scale of the problem, we wanted to be conservative in what we promise: a reliable update system for home devices based on Linux, that any company or enthusiast could deploy and operate, either for themselves or as a service for others.

Before we look at the server side, let's look at the update feature from the point of view of the device and the kind of features we want to offer.

  • We want the device to be able to update itself in the field, without the oversight of a person.
  • We want to either update fully or not at all, avoiding any intermediate state.
  • We want the update process to be resilient to externally induced problems, like power loss.

The first requirement is perhaps somewhat controversial. On the one hand, ownership of a device might imply that all the decisions are taken by the owner. On the other hand, the reality of operating connected devices, with the constant threat of malware and nascent legal requirements, puts the responsibility of offering automatic updates on anyone who wants to produce modern devices. Luckily we are not producing a physical box. Our software is a part of a reference architecture, and the final decisions and consequences are on the eventual manufacturer. On our side, we want to offer unattended updates as a technical possibility. This implies that certain decisions must be taken by the software, automatically and correctly. The only way to do that is to reduce complexity, to manage choices and to constrain the stack enough that a fully automatic update can leave the device in a state that is in no way worse than the original.

The second requirement is a direct consequence of the first requirement. We cannot allow ourselves to break the utility offered by the device due to a failed or partial update. While one might argue that a device that is disconnected from network and power supply has ultimate security, our goal is to provide the best possible security that still retains function. In technical terms we must design the system to behave like a traditional database transaction. Regardless of how much work was performed during the update, until the transaction is committed, software running on the device sees only the current software and state, never a mix of old and new. Once the update is complete, all the programs are updated at once, atomically and without corruption of the data.

The last requirement is a re-affirmation of the first requirement. Given that we have no control over how the devices are used, whatever we do during the update process must survive and not regress on the other two requirements when faced with the sudden loss of power supply. The proverbial user may yank the power cord at any time, and it must not matter.

Given finite resources, we also wanted to set some constraints, so that the bounds of the system are clear:

  • SystemOTA is not aimed at updating applications that are not a part of the operating system.
  • SystemOTA is not aimed at updating firmware embedded into the components that constitute the product.
  • SystemOTA is not aimed at the absolute smallest amount of storage and memory that is still available on the market.

The first constraint is actually quite liberating. By not handling updates of the application layer, we are not prescribing what it must look like. There are a number of competing choices here and we believe our design is compatible with all of them. This space also has plenty of innovation and may offer a better experience (e.g. no downtime) than handling everything exactly the same way as system updates.

The second constraint is just practical. LVFS (the Linux Vendor Firmware Service) is the way to handle firmware updates and we want to align with that, convincing our partners where appropriate. Updating boot firmware is tricky in the sense that it is rarely handled automatically, the way OS updates have started to be. Not all platforms have fallback mechanisms, and those that do may come with additional requirements that we cannot impose ourselves. We think that for the moment, separating platform firmware from the OS is the right choice.

The last constraint is also practical. We want to use the transparent gateway as a platform for a special type of application that bridges the micro-controller world with both the local network and the cloud. We cannot do that on a system so stripped down that single megabytes matter. For the OTA stack we picked a compiled but memory-safe language with automatic memory management instead of the perhaps historically more traditional C. We believe that safety and security are more important than ultimate performance, and that this is the right trade-off. For that reason, the reference implementation is written in Go, with access to a high-quality HTTP stack, ease of use and ease of testing. At the time of this writing, SystemOTA takes about four megabytes of storage and a comparable amount of memory at runtime. It is also designed to exit when idle, avoiding the cost of a constant memory tax.

We believe that the set of requirements and constraints will allow us to create a high-quality, competitive OS update solution. Combined with the design of the state management system, the transparent gateway will serve as a solid foundation for additional reference designs coming in 2022.

In the next installment we will look at the architecture of the transparent gateway and how our choices enable the update process to offer those guarantees.
