Continuity of Linux distributions

I know people who imagine distribution development as the process of piling up code in a git repository for six months and then building it all in one go at the end, so that it can finally be shipped. This is very far from reality. And it is impossible to explain things like CentOS Stream without addressing this confusion.

Linux distributions are not just developed continuously, they are built continuously.

With a typical application, you have your sources. You contribute changes to the sources, integrating them into the main branch. Then you decide to build the tip of the branch, and you get an artifact: a binary. When you later make changes to the sources, you throw away the previous binary and build yourself a new one.

Linux distributions are different.

When you change a distribution, you apply a change to one part of the sources, then you build those sources into a package and add the package to a shared pool of the latest packages. And then this shared pool of packages (we call it the buildroot) is used to build the next change to the distribution.

A Linux distribution is “self-hosted”: it grows by updating its buildroot and using the new state to build its next updates. And updates are applied individually, per package.

Why am I focusing on this? Because it has practical consequences.

A packaged Linux distribution is not a single binary; it is a “compound” artifact, where different parts of it (packages) are built at different times using different states of the buildroot.

Imagine we have two packages, A and B, in the distribution. Package A is more static and doesn’t get many updates, so over four weeks it got updated once. Package B is more actively developed and gets an update every week.

It will look something like this:

Time      | Week 1 | Week 2 | Week 3 | Week 4 |
Package A | v1.0.0 | v1.0.1 |        |        |  --->  A-v1.0.1, B-v18
Package B | v15    | v16    | v17    | v18    |

So if you look at the result of four weeks of development, you see that package A has been updated from v1.0.0 to v1.0.1 and package B has been updated from v15 to v18. Yet package A was not built on week 4. It was built on week 2, using the state of the buildroot available in week 2. If package A has a build dependency on package B, it was built against the v16 version of that dependency. And it was not rebuilt after B changed to v17 or v18.
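The timeline above can be sketched as a toy model. This is only an illustration (the `build` function, the `latest` pool, and the `built_with` record are all hypothetical names, not real tooling): each build reads dependency versions from the current buildroot and freezes them into the resulting binary.

```python
# Toy model of the four-week timeline (all names are hypothetical).
latest = {}        # the buildroot: package name -> latest built version
built_with = {}    # (package, version) -> dependency versions seen at build time

def build(name, version, deps=()):
    # Pin the versions of the build dependencies as the buildroot has them now,
    # then add the freshly built package back into the shared pool.
    built_with[(name, version)] = {d: latest[d] for d in deps}
    latest[name] = version

build("B", "v15")                 # week 1
build("A", "v1.0.0", deps=["B"])  # week 1: A built against B v15
build("B", "v16")                 # week 2
build("A", "v1.0.1", deps=["B"])  # week 2: A built against B v16
build("B", "v17")                 # week 3
build("B", "v18")                 # week 4

print(latest)                       # {'B': 'v18', 'A': 'v1.0.1'}
print(built_with[("A", "v1.0.1")])  # {'B': 'v16'} -- not v18
```

The final pool advertises A v1.0.1 and B v18, but A’s binary still carries B v16 inside it, exactly as in the table.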

There are several points to take from here.

First point: packaged distributions do not appear out of nowhere.

Linux distributions are either continuously developed from some origin: Fedora Rawhide, for example, has been developed for 20 years from its original Fedora 1 state (I'll ask people more knowledgeable than me to explain how that original state was created). Or they are branched (aka forked) from another distribution. Or they are bootstrapped (forked, but with much more work) using another distribution.

Second point: a packaged distribution is not fully defined by a static snapshot of its sources.

If on week 4 you take the latest git sources of the distribution, check them out, and build packages from them, you will get package A version v1.0.1 built against package B v18. Which may or may not lead to a different result.
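In the same toy model (hypothetical names, not real tooling), a from-scratch rebuild of the week-4 source snapshot records a different build dependency than the distribution actually shipped:

```python
# Toy model: rebuild everything from a week-4 source snapshot, in dependency
# order, instead of replaying the buildroot history. All names are hypothetical.
snapshot = {"A": "v1.0.1", "B": "v18"}   # latest sources of each package

latest = {}        # freshly bootstrapped buildroot, starts empty
built_with = {}    # (package, version) -> dependency versions seen at build time

def build(name, version, deps=()):
    built_with[(name, version)] = {d: latest[d] for d in deps}
    latest[name] = version

build("B", snapshot["B"])                 # B goes in first
build("A", snapshot["A"], deps=["B"])     # A now sees B v18

print(built_with[("A", "v1.0.1")])  # {'B': 'v18'} -- the shipped A carried v16
```

Same sources, same versions, different compound artifact: the history of the buildroot is gone.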

And if this doesn’t scare you, please read again.

You cannot reproduce the exact state of a Linux distribution from a snapshot of its sources. And not because something is hidden from the sources (the dist-git repository state holds more information about a package than the SRPM of that package does, and we'll talk about that another day). The issue is that a distribution is not just its sources. A distribution is its sources plus its buildroot, with all its complex history.

In my opinion, it is a big fail for the entire industry to think about a Linux distribution as a fixed set of RPMs and SRPMs on a DVD (or in an ISO image). I understand where it comes from, but it is a fail anyway.

When I think about a distribution, I think about its buildroot as a sort of git repository: it has history, it has merge requests. When we update a package, we add a new binary package (RPM) to the buildroot. In other words, we make a new commit to the buildroot state.
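The analogy can be sketched in a few lines. This is a toy model only (the `commit` function and `history` list are hypothetical, not an existing tool): each package update produces a new, fully preserved buildroot state, like a commit on top of the previous one.

```python
# Toy model of "buildroot as a git repository": every package update is a
# commit that yields a new buildroot state. All names are hypothetical.
history = [{}]   # list of buildroot states; history[-1] is the current one

def commit(name, version):
    # Copy the current state, apply one package update, append as a new state.
    state = dict(history[-1])
    state[name] = version
    history.append(state)

commit("B", "v15")
commit("A", "v1.0.0")
commit("B", "v16")

print(len(history) - 1)  # 3 commits so far
print(history[-1])       # {'B': 'v16', 'A': 'v1.0.0'} -- the current buildroot
print(history[2])        # {'B': 'v15', 'A': 'v1.0.0'} -- an earlier state, preserved
```

A real implementation would store package lists in git and drive updates through merge requests, but the shape of the history is the same.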

And it is not just an abstraction: I think we can literally implement changes to the distribution buildroot as merge requests to a git repository holding the list of packages (and if you are interested and want to help make it happen, let's talk about it).

Third point: branching a Linux distribution is not just branching its sources, it is also branching its binaries.

Again, this is something application developers won't expect. When we create a branch of the distribution, we don't just create a branch in the git sources of every package included in the distribution. We also create a “branch” of the buildroot. All binary packages built up to a certain day in the mainline of the distribution are copied into the branch. They form the buildroot of the branch, which can then be updated via the standard update procedure.

We do not rebuild a package after branching, unless there is a new change which we want to land in this specific package.
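In the same toy-model style (hypothetical names, not real tooling), branching looks like copying the binary pool, after which the two lines diverge independently:

```python
# Toy model of branching a distribution: the branch starts from a copy of the
# mainline's binary pool, not from a rebuild of its sources. Hypothetical names.
mainline = {"A": "v1.0.1", "B": "v18"}

# Branch day: all binaries built so far are copied into the branch's buildroot.
branch = dict(mainline)

# Both lines now evolve via the normal per-package update procedure.
mainline["B"] = "v19"          # mainline keeps moving
branch["A"] = "v1.0.1-fix1"    # the branch rebuilds only what it changes

print(branch["B"])    # 'v18' -- inherited binary, never rebuilt at branch time
print(mainline["B"])  # 'v19'
```

Note that B in the branch is the exact binary built in the mainline; nothing was rebuilt just because a branch was created.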

This is a heavyweight article, and thank you for getting this far.

But let me reiterate the main message:

A packaged Linux distribution is not built from scratch from a snapshot of its sources.

We accumulate changes as they happen in different packages, and we inherit, merge and branch the pool of binary packages the same way we inherit, merge and branch their sources.

We will look into how this applies to the RHEL and CentOS conversation in the next articles.

As folks pointed out, the process described in this article applies to packaged Linux distributions like Fedora, Debian, or RHEL. There are other ways to build a distribution, and you can check out the article by Colin Walters where he discusses the alternatives.