Key Considerations for Software Updates for Embedded Linux and IoT

Key Criteria for Embedded Software Updaters

Now that you are more familiar with the embedded environment, let's consider the implications this has for an embedded software updater.

Robust and Secure

As you've seen, power and network connectivity can be very unreliable in an embedded environment, and the network can be insecure as well. An embedded updater needs several properties and features to handle these challenges.

Atomic Updates

The database industry is very familiar with the concept of atomic transactions (the "A" in ACID), where a set of operations either all complete or none of them do. The classic example from database theory is a money transfer: when one user transfers money to another, the amount should be deducted from one account only if it is also successfully added to the other.

This same property is very important for embedded updaters, in order to handle intermittent update errors like sudden power loss. For an embedded updater, the atomic property can be defined in two parts:

  • An update is always either completed fully or not at all.

  • No software component (besides the updater) ever sees a partially installed update.

You can see that common ways of deploying software updates in the desktop environment do not meet this atomicity requirement. For example, while you are installing an rpm package, many files are written and modified across the filesystem, and they would be in an inconsistent and potentially non-recoverable state if you suddenly unplugged your desktop during the installation—the application being updated probably wouldn't start at all.
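The same all-or-nothing idea can be illustrated at the scale of a single file. The following is a minimal sketch, not how any particular updater works: it writes the new content to a temporary file, flushes it to disk and only then renames it over the target. On POSIX filesystems the rename is atomic, so a reader sees either the old file or the new one, never a mix. The function name atomic_write is hypothetical.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the file at `path` with `data` so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the final rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to stable storage
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # A failure before the rename leaves the old file untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A full-image updater applies the same pattern at a larger scale: the new software is written somewhere the running system does not use, and a single small switch makes it live.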

Consistent Deployments

An important way to mitigate the risk of bricking devices is to test new software updates extensively before releasing them into production. However, to rely on test results, you need a test environment that is as close to identical to the production environment as possible. It is a classic problem in operations, whether for embedded devices or data centers, that the test environment diverges from production, so that changes work well in testing but cause significant downtime when released. This is one of the reasons why full-image updates are so prevalent in the embedded space: if the entire root filesystem is the same, block by block, in test and production, the software that was tested is guaranteed to be the software that is deployed. Contrast this with a deployment using rpm packages, which may depend on libraries that have different versions or patches in the test and production environments, and perhaps even across production devices. Over time, such a design typically leads to production deployments that fail inconsistently, for reasons that are hard to diagnose.
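One simple way to confirm that two images are the same is to compare cryptographic digests. The sketch below assumes the tested and deployed root filesystem images are available as files; the function names image_digest and same_image are hypothetical.

```python
import hashlib


def image_digest(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of an image file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as image:
        # Read in chunks so large images do not have to fit in memory.
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_image(tested_path: str, deployed_path: str) -> bool:
    """True if both files hash to the same SHA-256 digest."""
    return image_digest(tested_path) == image_digest(deployed_path)
```

With package-based deployments there is no single artifact to hash, which is part of why divergence between environments is harder to detect.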

Authenticity Checks before Updates

From a security perspective, it is very important to know whether software comes from an authorized source or whether an attacker could have injected malicious software into the update. There have been many cases where embedded devices simply broadcast their desire to install an update, and anyone who responded could inject software of their choosing into the device.

A basic approach to ensuring a level of authenticity is to leverage in-transit security protocols like TLS. If done correctly, this will ensure that the update cannot be modified while in transit from an update server to the device.
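As a rough illustration of what "done correctly" involves on the client side, the sketch below uses Python's standard library to refuse plain HTTP and to verify the server's certificate chain and hostname. The function names and the choice of TLS 1.2 as a minimum version are assumptions made for this example.

```python
import ssl
import urllib.request


def make_tls_context() -> ssl.SSLContext:
    """Build a client context that verifies the certificate chain and hostname."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed policy for this sketch
    return context


def download_update(url: str, timeout: float = 30.0) -> bytes:
    """Fetch an update over HTTPS only; reject any other URL scheme."""
    if not url.startswith("https://"):
        raise ValueError("updates must be fetched over HTTPS")
    with urllib.request.urlopen(url, timeout=timeout, context=make_tls_context()) as response:
        return response.read()
```

The most common mistake is disabling certificate or hostname verification to make a connection work, which removes the protection entirely.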

However, a more robust end-to-end approach is to embed cryptographic authenticity metadata as part of the update itself. Typically a form of code signing is employed, where digital signatures are created by an authority and verified at the device.

One of the key advantages of code signing over solely relying on in-transit security is that the authority that signs the update can be decoupled from the server that hosts it. For example, someone in the QA department could sign an update offline. This reduces the attack surface in cases where the update server gets compromised, because an attacker can still deploy only updates that have been signed by the QA department.

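For performance-sensitive devices, cryptographic mechanisms like a Message Authentication Code (MAC) or elliptic curve signatures should be considered. A MAC is much cheaper to verify than a public-key signature, although it requires a secret key shared with the device. Elliptic curve signatures use much smaller keys and signatures than RSA or DSA at the same level of security.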
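A MAC check can be sketched with Python's standard library as follows. This is an illustration under assumptions, not a recommended design: the key is assumed to have been provisioned on the device in advance, and the function names are hypothetical.

```python
import hashlib
import hmac


def make_tag(key: bytes, update: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the update payload."""
    return hmac.new(key, update, hashlib.sha256).digest()


def verify_update(key: bytes, update: bytes, tag: bytes) -> bool:
    """Return True only if `tag` matches the payload under `key`."""
    expected = make_tag(key, update)
    # compare_digest runs in constant time, which avoids leaking
    # how many leading bytes of a forged tag were correct.
    return hmac.compare_digest(expected, tag)
```

Because the same key both creates and verifies the tag, anyone who extracts it from one device can forge updates for every device that shares it. Public-key code signing avoids this, since devices hold only the verification key.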

Sanity Checks after Updates

Embedded devices are typically single-purpose and run only one main application, although in some cases, they could run several. In either instance, it's important to check the health of such applications after deploying an update. Are they running? Do they have network access? Can the user interact with them successfully on the device?

A software update should not be considered successful just because the device boots; there should be a way to integrate custom application sanity checks as well. Finally, a critical check that should be covered by the updater generically is this: Is it possible to deploy another update?

If any of these checks fail, the updater should have the capability to roll back to the previous known-working software, so that downtime is avoided while the issue is being diagnosed and resolved.
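The commit-or-roll-back decision can be sketched as below. The checks, commit and rollback callables are hypothetical hooks supplied by the integrator; this is not the interface of any particular updater.

```python
from typing import Callable, Iterable


def finish_update(
    checks: Iterable[Callable[[], bool]],
    commit: Callable[[], None],
    rollback: Callable[[], None],
) -> bool:
    """Commit the update only if every check passes; otherwise roll back."""
    for check in checks:
        try:
            healthy = check()
        except Exception:
            healthy = False  # a crashing check counts as a failed check
        if not healthy:
            rollback()
            return False
    commit()
    return True
```

Typical checks would answer the questions above: is the application running, does it have network access, and can the update server still be reached so that another update is possible.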

The general workflow for deploying software updates is shown in Figure 4.

Figure 4. General Workflow for Deploying Software Updates

Integration with Existing Development Workflow

If you are one person starting from scratch with an embedded/IoT project, you likely can choose all the tools and processes you like the best. However, once several people are collaborating on the same project, and in particular, if a product was already under development before software updates were taken into account, it is very important that the software update process integrates well with the development workflow.

At first glance, this may look like a strange criterion for an updater, but many approaches to software updates require a full replacement of existing development workflows. Commercial updater tools more often than not are offered as part of a "platform", where the updater is bundled together with a full device OS, a cloud back end and other device management features. For existing products, this can pose a significant challenge, because the device OS needs to be replaced, potentially together with the build system, version control and associated QA processes.

For homegrown updaters, this criterion is typically taken into account implicitly, because teams tend to start with what they have and look for the shortest path to developing and integrating an updater. Since existing build systems tend to output packages like rpm or opkg easily, this is an approach that integrates well and is chosen by many homegrown updaters. However, package-based updates have significant drawbacks in robustness, as I discussed earlier.

Bandwidth

As I mentioned previously, embedded devices typically are connected with some kind of low data rate wireless connection. An update process that requires less bandwidth is preferable to one that requires more, simply because it costs less and takes less time to deploy an update.

Compression is the first feature to look at in order to reduce bandwidth, as this could cut the size of the update in half or more, depending on the type and compressibility of the update. There is also a variety of delta-based update mechanisms that could be employed to reduce bandwidth usage further.
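The effect of compression is easy to measure before choosing a format. The sketch below uses zlib from Python's standard library purely as an example; the helper names are hypothetical, and the actual savings depend entirely on the payload.

```python
import zlib


def compress_update(payload: bytes) -> bytes:
    """Compress an update payload at the highest zlib level."""
    return zlib.compress(payload, level=9)


def decompress_update(blob: bytes) -> bytes:
    """Restore the original payload on the device."""
    return zlib.decompress(blob)


def savings(payload: bytes) -> float:
    """Fraction of bandwidth saved by compression (may be negative)."""
    if not payload:
        return 0.0
    return 1.0 - len(compress_update(payload)) / len(payload)
```

Content that is already compressed, such as media files or compressed kernels, will shrink very little, so the measurement is worth running on a real image.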

Downtime during Update

While an update is being deployed, it is desirable to have as little downtime on the device as possible. How much downtime is acceptable is clearly dependent on the use case of the embedded device. Is it part of the power grid that must function 24x7, or is it a consumer audio system that isn't used at night?

The method for deploying updates impacts the required downtime the most. For example, for full image updates, it's possible to deploy the update from a maintenance mode or use a dual-A/B rootfs approach. The maintenance-mode approach works by rebooting into a maintenance partition, installing the update to the root filesystem partition and then rebooting into the root filesystem partition again; the device is unusable for all of this period. In a dual-A/B rootfs approach, the update is installed to the inactive root filesystem while the device can continue to be used. The downtime in this case is only during the reboot into the updated (previously inactive) partition. The dual-A/B rootfs partition update design is shown in Figure 5.

Figure 5. Dual-A/B rootfs Partition Update Design
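The dual-A/B flow can be modeled in a few lines. The following is a toy model only: the Device class, its slot names and its version strings are hypothetical, and a real implementation would write to block devices and set a bootloader flag instead of updating a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Toy model of a device with two root filesystem slots."""
    slots: dict = field(default_factory=lambda: {"A": "v1", "B": None})
    active: str = "A"

    @property
    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    @property
    def running(self):
        return self.slots[self.active]

    def install(self, version: str) -> None:
        # Written while the device keeps running from the active slot.
        self.slots[self.inactive] = version

    def reboot_into_update(self) -> None:
        # The only downtime: switch the boot target and reboot.
        if self.slots[self.inactive] is None:
            raise RuntimeError("inactive slot has no installed image")
        self.active = self.inactive
```

After the switch, the previously active slot still holds the old software, which is what makes a rollback possible without downloading anything again.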

Deployment Management

As you can see, many design choices and trade-offs need to be made on the device side for an embedded updater client.

However, once an updater client is installed and working on embedded devices, the problem of managing all those clients becomes apparent. How can a new update be installed on 1,000 of these embedded devices? Which version of the software are they running? How do you know if an update is installed successfully everywhere, and is there a log for failed updates?

These use cases typically are handled with an update management server, so that updates can be managed across a device fleet.
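The questions above amount to simple queries over device reports. The sketch below assumes each device reports a dictionary with hypothetical device_id, version and status fields; a real management server would define its own schema and storage.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def summarize(reports: Iterable[Dict[str, str]]) -> Tuple[Counter, List[str]]:
    """Return (devices per software version, IDs of devices whose update failed)."""
    reports = list(reports)  # allow a generator to be scanned twice
    versions = Counter(report["version"] for report in reports)
    failed = [r["device_id"] for r in reports if r["status"] == "failed"]
    return versions, failed
```

For example, passing the reports of a three-device fleet where one device failed to update returns the version counts alongside a one-element list naming that device.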

Conclusion

Many design trade-offs need to be considered in order to deploy software updates to IoT devices. Although historically most teams have decided to implement their own homegrown updaters, the recent appearance of several open-source software updaters for embedded Linux means that we should be able to stop re-inventing the wheel.

Resources

SWUpdate is a very flexible client-side embedded updater for full image updates, licensed GPL version 2.0+.

Mender, the project the author of this article is involved in, focuses on ease of use. It consists of a client updater and a management server with a UI, and is licensed under the Apache License 2.0.

______________________

Eystein Stenberg has more than seven years of experience in security and systems management software and has spoken at various conferences. You can reach him at eystein@mender.io.