Re-thinking system and distro development

Keynote at FOSDEM 2012, 2012-02-04.

Introduction

I think the way current, mainstream Linux distributions are developed is okay-ish, and they do work pretty well, but I want everything to be awesome.

For us to reach awesome, we will need to change some fundamental things in the development workflow and the resulting systems. Most importantly, we need to reduce the complexity of our systems, and adopt at least continuous integration from agile development.

In 1991, my computer became the first computer onto which Linux got installed. Linus had grown Linux on top of his existing Minix installation. He was going to be making the first Linux release and needed to have a way for other people to install Linux. He used my computer to test the installation method. He called it the boot floppy.

Installation with the boot floppy was quite smooth. After you had edited the master boot record with a text editor, in hex mode if you were lucky, and figured out how to install software from upstream release tarballs, without having a network connection of any sort, you were able to do exciting things like compile more programs.

Modern Linux distributions make things much easier. Indeed, it is almost ridiculously easy to install Linux on a PC, and then to install any of tens of thousands of programs on the machine. Upgrading to newer versions is also simple.

All the mainstream Linux distributions are about equal when it comes to these things. Some are better at one thing, others at something else, but there's no clear overall winner. They're all using approximately identical technology to solve the same problems, so it's no surprise the end result is approximately identical.

I wish to suggest some things that would, I believe, make things better. Apart from that first ever Linux install, I've been a Debian developer since 1996, and I've worked for Canonical to develop Ubuntu, so I have some background in the area of distribution development. I currently work for Codethink, a British company, where I develop a better way to develop embedded Linux systems, which is very similar to developing distributions.

What's the point of a distribution

Before we get started, I will describe what the point of a distribution is.

The purpose of a Linux distribution is to provide tools for installing a system on some hardware, and to make managing additional software on the system easier than having to compile everything from upstream tarballs directly. Ideally, the distribution provides tools for upgrading the system as well.

Another important job for the distribution is to choose what software to include. It's not possible to include everything. The choice may be done using various criteria, such as license, purpose, quality, popularity, usability, technology, or suitability for particular kinds of users.

Further, it's the distribution's job to integrate all the software it includes so that it all works together. For example, the distribution might have a policy on how a web application should be installed so that the web server will run it with a minimum of fuss, and so that the web application can access a database on the same host, or a different host, without requiring much configuration by the sysadmin.

I believe that we, the people who develop Linux distributions, are doing a pretty good job. The purpose of this talk is not to complain, but to inspire the distro development community to think about our development processes, and hopefully come up with better ways to do things.

A pretty good job is not good enough, if we can be awesome. And we can.

Mainstream distributions are too complex

The core problem is that the current mainstream distros are too big and complex for the development methodologies and tools and abstractions we're currently using. As an example, Debian currently has about 17 thousand source packages, which build about 35 thousand binary packages.

Imagine you're not intimately familiar with how a Linux system works. You want to learn a new programming language. Which packages should you choose from? If the language is at all popular, searching the package list will result in a lot of hits, and some of them conflict with each other. Most people will experience something called decision paralysis from information overload caused by having too much to choose from. It is not helpful.

But let's ignore end users for now. This talk is about making distro developers' lives better. Instead of choosing packages to install, let's document package dependencies correctly. We don't have the tools or the technology to do this correctly for every aspect. We can do it well for shared libraries, but not for, say, choosing whether to depend on a minimal POSIX shell or bash for the hashbang line of a shell script. Indeed, it's not possible to do that fully automatically, because of stupid things like self-modifying scripts and the halting problem.

The above image shows the dependencies between the set of packages belonging to the build essential set in Debian. These are the packages required to build simple C or C++ programs with a typical build system. It is only a handful of packages, but even so the dependency graph is quite complicated.

Managing dependencies manually, when you have thousands of source files per package, and tens of thousands of binary packages to choose from, is hard. And anything we have to do manually is troublesome.

The current way of managing dependencies works just well enough that we're not suffering too much yet, but we're not going to be able to grow much more. We need to reduce the number of packages or we'll drown in the combinatorial explosion of their relationships.

It's not just about getting dependencies right, of course. With 35 thousand packages, almost everyone will have an almost unique set of packages installed. With backported packages and third party packages, even if two machines have the same packages installed, they're likely to have different versions of the packages. Who can test all those combinations? If we don't test them, how do we know they work?
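To put a rough number on this, here is a small back-of-the-envelope sketch in Python. It counts only pairs of packages, which understates the real problem, since versions and larger combinations are ignored. The figure of 35 thousand binary packages is the one quoted above; the figure of 100 collections is purely a made-up number for comparison.

```python
from math import comb

BINARY_PACKAGES = 35_000  # figure quoted above for Debian
COLLECTIONS = 100         # hypothetical count, chosen only for comparison


def pairwise(n: int) -> int:
    """Return the number of unordered pairs among n items."""
    return comb(n, 2)


package_pairs = pairwise(BINARY_PACKAGES)
collection_pairs = pairwise(COLLECTIONS)

print(f"pairs of packages:    {package_pairs:,}")
print(f"pairs of collections: {collection_pairs:,}")
print(f"ratio:                {package_pairs // collection_pairs:,}")
```

Even this crude count shows hundreds of millions of package pairs against a few thousand collection pairs, and nobody tests pairs only.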

The interconnectedness of all packages has other consequences. Take transitions to new versions of software that has many reverse dependencies. Suppose your distro supports all three of GNOME, Xfce, and KDE. GNOME comes out with a new version, which means you need to rebuild everything that depends on GNOME, or any of the GTK libraries.

This will inevitably result in some errors. There will always be errors when large changes happen. No worries, you can fix them. Unfortunately, fixing them is manual work, and takes a while to do. While you're doing that, there's a new release of Xfce. The Xfce packagers in your distro start their own transition and start fixing problems caused by that. Some of the same packages are involved as with your transition.

Now you have two groups of people making changes to the same code. Unless they co-ordinate very carefully, they both keep making each other's lives worse.

One way to deal with this is to do only one transition at a time. That means the overall pace of development of the distribution becomes slower. You can do some things in parallel if they don't touch the same packages, but it's still slow, and requires careful co-ordination.

This is essentially the same problem as version control systems have. RCS required you to lock a file before editing it, preventing anyone else from working on it. What we need is something more like a modern distributed version control system, where lots of things can happen in parallel, even if they touch the same parts of the code, and smart merging algorithms can handle things in almost all cases.

Packages are the wrong abstraction today

All of this leads me to conclude that the package is the wrong unit of abstraction today. Dpkg and RPM were great innovations in the mid-1990s, but back then, a distribution would contain maybe a few hundred packages at most. Now they contain a hundred times more. A minor irritation for a small system often becomes a roadblock in a big system. If you walk a hundred meters, a small pebble in your shoe does not really matter. If you walk ten kilometers, your foot will be a bleeding mess.

It's not acceptable to kick out most software in a modern distro. Instead, we can combine multiple upstream projects into bigger collections. For example, collect all the basic development tools, such as gcc and binutils and make, into a development collection. Collect all of GNOME into one, or maybe split it into a platform and desktop environment.

This would dramatically reduce the number of packages, or collections if you prefer to call them that, and shrink the dependency graphs so much that they become manageable again.
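As a toy illustration of the effect, the following Python sketch collapses a package-level dependency graph into a collection-level one. The package names, their dependencies, and the assignment to collections are all invented for the example and do not describe any real distribution.

```python
# Hypothetical package-level dependencies: package -> packages it depends on.
PACKAGE_DEPS = {
    "gcc": {"binutils", "libc6"},
    "make": {"libc6"},
    "binutils": {"libc6"},
    "libc6": set(),
    "gnome-shell": {"gtk", "libc6"},
    "gtk": {"libc6"},
}

# Hypothetical assignment of each package to a collection.
COLLECTION_OF = {
    "gcc": "devel",
    "make": "devel",
    "binutils": "devel",
    "libc6": "base",
    "gnome-shell": "gnome",
    "gtk": "gnome",
}


def collapse(deps, collection_of):
    """Build collection-level dependencies, dropping edges inside a collection."""
    result = {}
    for package, targets in deps.items():
        source = collection_of[package]
        result.setdefault(source, set())
        for target in targets:
            destination = collection_of[target]
            if destination != source:
                result[source].add(destination)
    return result


def edge_count(graph):
    """Count the dependency edges in a graph."""
    return sum(len(targets) for targets in graph.values())


collection_deps = collapse(PACKAGE_DEPS, COLLECTION_OF)
print("package edges:   ", edge_count(PACKAGE_DEPS))
print("collection edges:", edge_count(collection_deps))
```

Dependencies between packages inside the same collection disappear from the graph entirely, and several edges to the same collection merge into one.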

It also reduces the flexibility to install just the desired software on a host. For a laptop, desktop, or server, this often does not matter, since the unused software is inert on disk, and disk space is not a bottleneck. The bottleneck is, instead, our ability to deal with the complexity of the package graph.

For embedded systems, and other systems where precise choice matters, it's possible to allow using the same upstream projects in multiple collections. We could have one collection for an embedded base system, with busybox, and another for bigger systems, with bash and coreutils and so on. This is still massively simpler than the current necessity to pick and choose for every system.

This new simplicity makes testing and general quality assurance not just easier, but meaningful. What's the point of testing an essentially random set of package versions which nobody will really be running for real? If everyone runs the same big collections of software, it makes sense to test that thoroughly.

Support and debugging become easier too. The support person can easily run the same versions of the same software as the user, rather than different versions of a different set of software. Sometimes this matters a lot.

When more effort can be spent on a specific set of software versions, it's more likely that testing will be able to find problems.

These concepts need work. Perhaps some other abstraction than a collection of related software works better? Should the collections be implemented as system images? I do not have definitive answers yet, but I urge everyone to think about these, and write up your thoughts. Let's get a discussion going.

Automated testing is really quite helpful, you know

Automated testing of software is not a new idea, but it has taken off in a big way in the past decade and a half. New software is now expected to have unit tests for individual methods and classes, integration tests for the whole software, deployment tests for the installed software, and so on.

That's the reality of individual upstream projects. Some automated testing exists for Linux distros too, but we need more of it, and we need it running all the time. If we do this right, it will greatly reduce the number of problems that reach people, even if those people are developers and testers.

In 2005 I wrote a testing tool for Debian, called piuparts. It tests individual packages and checks that they can be installed, upgraded, and removed without causing problems. It does not check that the software in the package actually works, just that the software can be installed and removed safely. This is really a very simple check, but it was new at the time.

I was told in 2010 by someone who makes a live CD based on Debian that he'd noticed a remarkable improvement in the quality of packages when I started running piuparts and reporting bugs. Even a simple test often finds lots of problems if it tests for things that developers and heavy users of a package do not actually use. For example, it is common in Debian that developers package software they use themselves. If you use the software, you probably never actually remove it, so that code path gets no testing from the developer. An automatic tool, however, doesn't get attached to the packages.
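To make the idea concrete, here is a toy model in Python of one such check: install a package, remove it again, and see whether the system is back where it started. This is not how piuparts is implemented. The "file system" is just a set of paths, and the two packages and their install and remove functions are invented for the example.

```python
BASE_SYSTEM = frozenset({"/bin/sh", "/etc/passwd"})  # hypothetical minimal system


def leftover_after_cycle(system, install, remove):
    """Install then remove on a copy of the system; return the paths that differ."""
    state = set(system)
    install(state)
    remove(state)
    return state ^ set(system)  # paths left behind or wrongly deleted


# A hypothetical well-behaved package.
def tidy_install(fs):
    fs.update({"/usr/bin/tidy", "/etc/tidy.conf"})


def tidy_remove(fs):
    fs.difference_update({"/usr/bin/tidy", "/etc/tidy.conf"})


# A hypothetical package whose removal forgets its cache directory.
def sloppy_install(fs):
    fs.update({"/usr/bin/sloppy", "/var/cache/sloppy"})


def sloppy_remove(fs):
    fs.discard("/usr/bin/sloppy")


print("tidy leaves:  ", leftover_after_cycle(BASE_SYSTEM, tidy_install, tidy_remove))
print("sloppy leaves:", leftover_after_cycle(BASE_SYSTEM, sloppy_install, sloppy_remove))
```

A developer who never removes the sloppy package would not notice the leftover, but a tool that always runs the full cycle would.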

Here's what I want to see. We assume that upstreams make sure their software basically works. We, the distro developers, don't need to make sure that, say, Apache works well. Instead, we'll check that the version of Apache we've compiled and installed onto our system image works. A smoke test, if you wish, plus more specific testing about aspects of the configuration and setup that we are most worried about.

These tests need to be run automatically, whenever there have been changes. Ideally, for every package upload, though that's probably too much work. This will give a much quicker feedback loop to the uploaders about whether they have broken something. The longer it takes to get that feedback, the harder it becomes to fix a bug.

Ideally, a failed test will prevent an upload from reaching testers: after all, if the automatic test failed, then something, somewhere is wrong, and needs to be investigated and fixed, before it's worth bothering human testers.
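The gating idea can be sketched in a few lines of Python. This is only a sketch under assumptions: the `Upload` and `Pipeline` names are hypothetical, and the two smoke tests are stand-ins that look at the version string instead of building and booting a real system image.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Upload:
    name: str
    version: str


@dataclass
class Pipeline:
    smoke_tests: list[Callable[[Upload], bool]]
    for_testers: list[Upload] = field(default_factory=list)
    rejected: list[tuple[Upload, list[str]]] = field(default_factory=list)

    def submit(self, upload: Upload) -> bool:
        """Run every smoke test; pass the upload on only if all of them succeed."""
        failures = [test.__name__ for test in self.smoke_tests if not test(upload)]
        if failures:
            self.rejected.append((upload, failures))  # feedback for the uploader
            return False
        self.for_testers.append(upload)
        return True


# Stand-in tests: a real setup would install the upload and exercise it.
def installs_cleanly(upload: Upload) -> bool:
    return True


def service_answers(upload: Upload) -> bool:
    return "broken" not in upload.version


pipeline = Pipeline([installs_cleanly, service_answers])
good = pipeline.submit(Upload("apache2", "2.4-1"))
bad = pipeline.submit(Upload("apache2", "2.4-2-broken"))
print("good upload accepted:", good)
print("bad upload accepted: ", bad)
print("failed tests:", pipeline.rejected[0][1])
```

The point of the design is that the rejection, along with the names of the failed tests, goes back to the uploader right away, and human testers only ever see uploads that passed.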

Automated testing is possible without abandoning the package as an abstraction. I suspect it will be harder to do, since the overall complexity of the system is greater, but the two concepts are orthogonal and should be worked on independently.

After basic functional tests, we can add more. We can add tests for boot speed, or system size, or stress testing of a web application, or security checks, etc. With a good continuous integration infrastructure in place, adding tests and improving coverage gets fairly easy, and has a clear and immediate positive impact on the quality of the final system.

Conclusion

Let me wrap things up with an executive summary, of sorts. Lots of relatively tiny packages cause a combinatorial explosion that makes everything more difficult. By combining related software into bigger collections we can make the package graphs manageable again.

An important consequence is that users are more likely to actually run a known set of software and their versions, making it much more meaningful to do automated testing. By adopting continuous integration we can greatly raise the confidence level we have in the end result, and also its quality.

What next? Think about what you think could be done better, and talk about your ideas with your peers. Let's start the next revolution in operating systems development now.