dist-git and exploded SRPMS – demystified

In this article we address another topic that has come up in several recent discussions: the difference between the SRPM and the so-called dist-git repository of a package, and why we prefer dist-git.

How do RPM packages work?

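In simple words, an RPM package needs three things: the upstream sources, the patches applied on top of them, and a spec file that describes how to build and install the result.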

When developing an RPM package, you treat the upstream sources as a read-only object. You cannot change them: they must match exactly what upstream has released.

To diverge from upstream, for example to backport a fix or to integrate the software better into the system, you create and maintain patches as separate files next to your upstream sources.

Then, to build a package, the build system fetches the archive with the original sources, unpacks it, applies the patches as described in the spec, runs the build scripts (again as described in the spec), arranges the resulting files in a specific layout, and packs them into an archive together with the installation recipe.

This archive is the final “binary RPM”, which you can install on your system using the rpm or dnf commands.

As we build software for multiple architectures, we can produce several binary RPMs from the same source data by building them on different workers with different architectures (one for x86_64, one for aarch64 and so on).
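
The fan-out from one set of sources to several binary packages shows up in the file names. Here is a minimal sketch, assuming the usual name-version-release.arch.rpm naming convention; the package my-app-1.0-1 is hypothetical and only used for illustration:

```python
def rpm_filenames(nvr, arches):
    """Return the SRPM name plus one binary RPM name per architecture."""
    names = [f"{nvr}.src.rpm"]  # one source RPM, shared by all architectures
    names += [f"{nvr}.{arch}.rpm" for arch in arches]  # one binary RPM per arch
    return names


# Hypothetical package, two of the architectures mentioned above
for name in rpm_filenames("my-app-1.0-1", ["x86_64", "aarch64"]):
    print(name)
```

A real build usually produces several subpackages per architecture as well; the sketch ignores that to keep the idea visible.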

What is dist-git

dist-git is a git repository with a specific layout, which Fedora, CentOS Stream and RHEL use to develop RPM packages.

The very minimal dist-git repo would look like this:

├── my-app.spec                      // spec file
├── sources                          // reference to the sources
└── patch-for-some-feature.patch     // patch to apply to the sources

The important feature of the dist-git is that it doesn't store the unpacked sources of the application. It only stores a reference to the tarball of original upstream sources in a so-called lookaside cache.

This reference is stored in a file called ./sources in the root of the git repository. See, for example, the sources file of the glibc package in Fedora Rawhide.
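
To give an idea of what such a reference looks like, here is a small sketch of a parser for the ./sources file. It assumes the one-entry-per-line format "HASHTYPE (filename) = hexdigest" used in current Fedora dist-git repositories; older repositories may use a different layout, and the file name and digest below are made up:

```python
import re

# Assumed line format: "SHA512 (filename) = hexdigest"
SOURCES_LINE = re.compile(
    r"^(?P<hashtype>\w+) \((?P<filename>.+)\) = (?P<digest>[0-9a-fA-F]+)$"
)


def parse_sources(text):
    """Parse the content of a ./sources file into a list of dicts."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        match = SOURCES_LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized sources line: {line!r}")
        entries.append(match.groupdict())
    return entries


# Made-up entry with a shortened digest, for illustration only
example = "SHA512 (my-app-1.0.tar.gz) = 0123abcd\n"
print(parse_sources(example))
```

The point is that the git repository carries only the file name and a checksum, never the tarball itself.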

The lookaside cache of Fedora and CentOS (Stream or not Stream) is public and you can download any of its content.
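
Since the cache is public, a download URL can be derived from the package name, the file name and its checksum. The sketch below assumes a path layout of package/filename/hashtype/digest/filename, which is modeled on what the Fedora packaging tools use; both the layout and the base URL are assumptions here, so check the configuration of your packaging tool for the real values:

```python
def lookaside_url(base, package, filename, hashtype, digest):
    """Build a download URL for a source file in a lookaside cache.

    The path layout is an assumption, not a documented contract.
    """
    return (
        f"{base.rstrip('/')}/{package}/{filename}/"
        f"{hashtype.lower()}/{digest}/{filename}"
    )


# Placeholder base URL and made-up digest, for illustration only
print(
    lookaside_url(
        "https://lookaside.example.org/pkgs",
        "my-app",
        "my-app-1.0.tar.gz",
        "SHA512",
        "0123abcd",
    )
)
```

Because the checksum is part of the path, a given URL always points to the same content.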

Now, since dist-git is the main repository where package development happens, package maintainers often use it to store all sorts of additional things (scripts, readme files, infra configurations and so on) that help them do the work.

There is also a recommended way to write tests in dist-git (see TMT). These integration tests are not part of the RPM package, but they are used in CI workflows, and we recommend putting them in the dist-git repository so that people can contribute to the package and to the test development via the same interface.

Example – keepalived dist-git

Let's take a random package build, for example keepalived-2.2.4-6.el9.
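
The build identifier follows the name-version-release (NVR) pattern. Neither the version nor the release may contain a hyphen, so the string can be split from the right. A small sketch:

```python
def split_nvr(nvr):
    """Split a name-version-release string into its three parts.

    The name itself may contain hyphens, so we split from the right.
    """
    name, version, release = nvr.rsplit("-", 2)
    return name, version, release


print(split_nvr("keepalived-2.2.4-6.el9"))
# ('keepalived', '2.2.4', '6.el9')
```

Here 2.2.4 is the upstream version, and 6.el9 is the release: the sixth packaging iteration, with the el9 dist tag.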

dist-git for the package has the following structure:

├── bz2028351-fix-dbus-policy-restrictions.patch  // patches
├── bz2102493-fix-variable-substitution.patch
├── bz2134749-fix-memory-leak-https-checks.patch
├── gating.yaml         // * CI configuration
├── .gitignore          // * standard gitignore
├── keepalived.init     // additional sources
├── keepalived.service  // additional sources
├── keepalived.spec     // spec file
├── rpminspect.yaml     // * rpminspect checks configuration  
├── sources             // reference to the lookaside cache 
└── tests               // * dist-git test scenarios, run on every merge request
    ├── keepalived.conf.in
    ├── run_tests.sh
    └── tests.yml

Here I marked with an asterisk the files that are not relevant to the RPM package build.

What is SRPM

As explained above, an RPM package build requires multiple inputs. These inputs are stored in dist-git and in the lookaside cache, so the build system has to fetch them and deliver them to the build workers.

Instead of fetching data from the internet during the build process (no build system should ever do this!), we fetch all of the sources at the beginning, pack them into a single archive (the SRPM file) and then use that self-contained archive to run the builds in an isolated build environment.

The SRPM then serves as a record of what the build system received as input to produce the binary files.

Example – keepalived SRPM

SRPM for the package contains the following data:

bz2028351-fix-dbus-policy-restrictions.patch	1.58 KB
bz2102493-fix-variable-substitution.patch	929.00 B
bz2134749-fix-memory-leak-https-checks.patch	1.87 KB
keepalived-2.2.4.tar.gz	1.10 MB
keepalived.service	392.00 B
keepalived.spec	20.47 KB

You can see how the build system produced the SRPM together with the binary RPMs in the Koji build task: https://kojihub.stream.centos.org/koji/buildinfo?buildID=27965

The build task used a dist-git commit as its input:

Source:  git+https://gitlab.com/redhat/centos-stream/rpms/keepalived#fc07f81c047dca49df2fc9d20513a7f52005a54d
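
The Source value packs two things into one string: the repository URL and, after the #, the exact commit that was built. A minimal sketch of how to take it apart (the helper name split_source is just for illustration):

```python
def split_source(source):
    """Split a 'git+<url>#<commit>' source string into (url, commit)."""
    url, _, commit = source.partition("#")
    if url.startswith("git+"):
        url = url[len("git+"):]  # drop the "git+" transport prefix
    return url, commit


source = (
    "git+https://gitlab.com/redhat/centos-stream/rpms/keepalived"
    "#fc07f81c047dca49df2fc9d20513a7f52005a54d"
)
repo, commit = split_source(source)
print(repo)
print(commit)
```

With the repository URL and the commit hash you can check out exactly the dist-git state that went into the build.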

Note how the SRPM contains the full tarball of the original upstream sources (1.10 MB of it). This tarball was fetched from the dist-git lookaside cache during the SRPM build step.
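
Comparing the two keepalived listings from this article makes the difference concrete. The sketch below uses only the file names shown above and computes what is present on one side but not on the other:

```python
# Patches appear in both the dist-git repository and the SRPM
patches = {
    "bz2028351-fix-dbus-policy-restrictions.patch",
    "bz2102493-fix-variable-substitution.patch",
    "bz2134749-fix-memory-leak-https-checks.patch",
}

# Files listed in the dist-git repository
dist_git = patches | {
    "gating.yaml",
    ".gitignore",
    "keepalived.init",
    "keepalived.service",
    "keepalived.spec",
    "rpminspect.yaml",
    "sources",
    "tests/keepalived.conf.in",
    "tests/run_tests.sh",
    "tests/tests.yml",
}

# Files listed in the SRPM
srpm = patches | {
    "keepalived-2.2.4.tar.gz",
    "keepalived.service",
    "keepalived.spec",
}

only_in_dist_git = sorted(dist_git - srpm)
only_in_srpm = sorted(srpm - dist_git)

print("only in dist-git:", only_in_dist_git)
print("only in SRPM:", only_in_srpm)
```

The only file unique to the SRPM is the upstream tarball, while the CI configuration, the tests and the sources reference exist only in dist-git. In these listings keepalived.init also appears only in dist-git.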

What is exploded SRPM

Fedora and RHEL have used dist-git repositories for a very long time. Fedora dist-git has always been public, while RHEL dist-git repositories were internal and not available to people outside of Red Hat.

So the only way for the CentOS Project to rebuild RHEL code was to take the SRPM files and use them as the source of the rebuild.

Since the CentOS Project needed to rebrand or adjust certain packages, they didn't take the RHEL SRPMs as is. Instead they unpacked them and put the unpacked sources in a git repository. This way they got access to at least some history of the changes, were able to apply their own patches, and generally increased the visibility of the content.

Example – keepalived exploded SRPM

The “exploded SRPM” at git.centos.org for this package looks like this:

├── .gitignore
├── .keepalived.metadata  // same as ./sources in dist-git
├── SOURCES
│   ├── bz2028351-fix-dbus-policy-restrictions.patch  // patches
│   ├── bz2102493-fix-variable-substitution.patch
│   ├── bz2134749-fix-memory-leak-https-checks.patch
│   └── keepalived.service  // additional sources
└── SPECS
    └── keepalived.spec  // spec file

The exploded SRPM repository again doesn't store the upstream tarball; it references the lookaside cache via the .keepalived.metadata file.

You can see the same files as included in the SRPM, though they are put into a different directory structure. And none of the additional files (tests, scripts, configs) are available.

Takeaway

The dist-git repository is the original source of an RPM package build. Fedora, CentOS Stream and RHEL packages are all built directly from dist-git repositories.

The SRPM is an artifact of the build process. It is produced from a commit in dist-git and then stored alongside the binary RPMs.

The exploded SRPM is an attempt to recover the original git structure from the SRPM when there is no access to the dist-git repository. It does contain the same source files and spec as dist-git, but it cannot recover additional non-packaged data such as configuration files, tests and so on.

We recommend using dist-git for any collaboration and development purposes.

P.S. You can also take a look at the Source Git initiative which aims to change the approach to RPM sources to make upstream source code more accessible.