HealthTrio has adopted CentOS Stream 9 for several use cases within our system. We run a few dozen servers supporting operations such as sandbox environments, build infrastructure, monitoring, analytics, automation, and mirroring. Because all of these servers support corporate activities, they require a safe patch management process.
Internal mirror repositories are a traditional solution that gives greater control over the updates applied to Linux servers. A system pulls the updates from the official repositories and makes them available to the servers once certain criteria are met, then applies them to the servers at a specific cadence. At HealthTrio, we have implemented this traditional solution and added a CI step, built with the TMT tool, for additional early validation.
The Overall Process
The overall steps are as follows:
1. The updates are pulled from the official repositories and placed into a staging internal repository.
2. The updates are applied to a set of CI instances covering different EL flavors and major releases, including CentOS Stream 9.
3. The TMT tests run. If they all pass, the process proceeds to the next step.
4. The updates are promoted to the stable repository.
5. A week later, AWX applies the updates to a set of canary machines. This serves as another validation mechanism by exposing the updates to servers used in daily operations.
6. A week after that, if everything is OK, AWX applies the updates to the rest of the servers.
Steps 1 to 4 are executed in a Gitlab pipeline; steps 5 and 6 are executed by AWX.
If a test fails or a canary instance reveals a bug, the process is stopped and engineers step in to analyze the issue. If the fault is caused by the update, they report the issue and decide whether to wait for a fixed update and start the process all over, or to freeze the affected packages at the last stable version and let the process continue while skipping the failing tests. This process as it stands has room for improvement, which is already a work in progress.
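For illustration, steps 1 to 4 can be modeled as a small Gitlab CI pipeline. The following is only a minimal sketch with hypothetical stage names and helper scripts, not our actual pipeline definition:
stages:
  - sync
  - test
  - promote

sync-staging:
  stage: sync
  script:
    - ./scripts/pull-updates-to-staging.sh    # hypothetical helper

tmt-smoke-tests:
  stage: test
  script:
    - tmt -vv run                             # run the TMT plans against the CI instances

promote-to-stable:
  stage: promote
  script:
    - ./scripts/promote-staging-to-stable.sh  # hypothetical helper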
The CI step with TMT
One of the steps that increases our confidence in the stability of updates is the CI part implemented with TMT. This step contains smoke tests that exercise the updates against the way we use the OS and the programs. Since it is impossible to achieve 100% coverage of these complex systems, the CI tests step is complemented by the canary instances step in our process.
TMT stands for Test Management Tool, and it is exactly that: a command-line tool to manage tests. The tool handles much of the complexity of dealing with test suites; for example, test provisioning, execution, results reporting, test stdout/stderr handling, timing and time constraints, enabling/disabling, selection, documentation, modularization, and even importing/exporting remotely hosted tests. TMT is mature and already adopted in Fedora and CentOS Stream.
TMT can run shell scripts, but it also integrates with Beakerlib. Beakerlib is another test project that provides a set of functions that make writing tests easier. It has functions to organize test units into setup, run, and cleanup phases, as well as many other handy ones, for example to back up and restore config files during tests, or to wait for a command to exit with 0 with configurable timeout, retries, and retry intervals, which saves you from writing all that code from scratch. TMT runs the tests and parses the Beakerlib output to detect test failures. Furthermore, TMT is not limited to shell scripts or Beakerlib. You can run tests written in other test frameworks like pytest, ginkgo, behave, or anything else by simply letting TMT interpret their exit codes or by producing custom result files TMT can understand. You can also go further and write a TMT plugin for your favorite test framework.
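For instance, wrapping an existing pytest suite only takes pointing the test metadata at the command that runs it, since a non-zero exit code is reported as a failure. A minimal sketch, with a hypothetical path and duration:
summary: run the existing pytest suite
test: python3 -m pytest -v ./suite
framework: shell
duration: 10m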
As of the time of this post, HealthTrio has used both tools to write a total of 124 tests, all of them with Beakerlib. The tests verify the basic and not-so-basic functionality of packages from the official repos like bpf tools, podman, selinux policies, or perf; from EPEL like node-exporter or prometheus; from third-party sources like fluent-bit, tenable nessus agent, grafana, k3s, or gitlab runner; as well as internally built programs. Some tests are basic smoke tests, while others are more specific to the way we use the programs and their configurations.
The CentOS Stream updates have been fairly stable for us; even so, on two occasions last year our current process spotted breaking updates before they reached our systems, preventing interruptions to operations.
We are currently organizing our TMT tests in the following way:
.
├── plans
│   ├── el8.fmf
│   ├── el9.fmf
│   └── main.fmf
└── tests
    ├── main.fmf
    └── program
        ├── features_a
        │   ├── main.fmf
        │   └── test.sh
        └── features_b
            ├── main.fmf
            └── test.sh
The additional sublevel for features is optional and only used when the complexity justifies it.
The plans directory
Since we test different flavors of Enterprise Linux and also the major releases 8 and 9, we group the CI servers by major release using the role multi-host feature. We have the el8 role and the el9 role. Then we list the tests targeted at the different releases. It looks like this:
summary: full smoke test for el9
discover:
    how: fmf
    where: [el9]
    test:
        - /fluent-bit
        - /node-exporter
        - /podman
        # more tests
There are other ways to do this, for example, using adjust and letting each test specify where it can run. The main.fmf specifies the actual provisioning and the assignment of servers to roles.
Something important to take into account is that we do not access the CI instances as root; instead, each instance has a user we access with a key, and this user has sudo privileges without a password prompt. Therefore, for the provisioning, we leverage the become feature introduced in TMT 1.29. We could run these tests as a non-root user and write sudo everywhere, but that creates unnecessary noise in the tests themselves and hinders our ability to import or donate tests as-is from or to other repositories (e.g., Fedora or CentOS tests), which are written assuming a root user.
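Putting this together, a plans/main.fmf could look roughly like the sketch below; the guest names, user, and key path are hypothetical, while the important parts are the role assignment and become:
provision:
    - name: ci-el8
      role: el8
      how: connect
      guest: el8-ci.internal.example
      user: ec2-user
      key: /path/to/ci_key.pem
      become: true
    - name: ci-el9
      role: el9
      how: connect
      guest: el9-ci.internal.example
      user: ec2-user
      key: /path/to/ci_key.pem
      become: true
execute:
    how: tmt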
The tests directory
The tests themselves go inside this directory, grouped by program. We have a main.fmf where we specify that all tests underneath the directory structure are Beakerlib tests and that each test lives in the ./test.sh file relative to its fmf metadata file.
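Because fmf metadata is inherited by child nodes, that top-level tests/main.fmf can be as small as this sketch:
framework: beakerlib
test: ./test.sh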
Each program would then have its own main.fmf file specifying the metadata and a test.sh file with the test written using the Beakerlib framework.
Example: Creating Tests for podman
Let’s go through a few examples with podman that can illustrate writing tests.
Basic Smoke Tests
First, we create the directory and file structure:
tests/
├── main.fmf
└── podman
    ├── main.fmf
    └── test.sh
Then, we describe it in the main.fmf:
summary: podman tests
description: basic podman smoke tests
Now, let’s write some basic smoke tests with Beakerlib:
#!/bin/bash
. /usr/share/beakerlib/beakerlib.sh || exit 1
rlJournalStart
    rlPhaseStartSetup
        rlRun "dnf -y install podman"
        if ! grep /etc/subuid -q -e "${USER}"; then
            rlRun "usermod --add-subuids 100000-165537 ${USER}"
        fi
        if ! grep /etc/subgid -q -e "${USER}"; then
            rlRun "usermod --add-subgids 100000-165537 ${USER}"
        fi
    rlPhaseEnd
    rlPhaseStartTest "test get version"
        rlRun "podman version"
    rlPhaseEnd
    rlPhaseStartTest "test get info"
        rlRun "podman info"
    rlPhaseEnd
    rlPhaseStartTest "test pulling image"
        rlRun "podman image pull quay.io/fedora/fedora:latest"
        rlRun "podman image rm quay.io/fedora/fedora:latest"
    rlPhaseEnd
    rlPhaseStartTest "test listing image"
        rlRun "podman image pull quay.io/fedora/fedora:latest"
        rlRun -s "podman image ls quay.io/fedora/fedora:latest"
        rlAssertGrep "quay.io/fedora/fedora" $rlRun_LOG
        rlRun "podman image rm quay.io/fedora/fedora:latest"
    rlPhaseEnd
    rlPhaseStartTest "test container run"
        rlRun "podman run --rm quay.io/fedora/fedora:latest bash -c 'echo HELLO' | grep -q -e 'HELLO'"
    rlPhaseEnd
    rlPhaseStartTest "test system service"
        rlRun "podman system service -t 1"
    rlPhaseEnd
    rlPhaseStartTest "test mounting file"
        rlRun "touch test.txt"
        rlRun "podman run --rm --privileged -v $(pwd)/test.txt:/test.txt quay.io/fedora/fedora:latest bash -c 'echo HELLO > /test.txt'"
        rlRun "grep -q -e 'HELLO' test.txt"
        rlRun "rm -f test.txt"
    rlPhaseEnd
    rlPhaseStartCleanup
        rlRun "dnf remove -y podman"
    rlPhaseEnd
rlJournalEnd
So far, these are very basic smoke tests that ensure that a critical component of our operation works fine.
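While developing a test like this one, it can be handy to run it on a disposable guest instead of a real CI instance. As a sketch, assuming the tmt virtualization plugins are installed and a centos-stream-9 image is available to the virtual provisioner:
tmt run -a provision --how=virtual --image=centos-stream-9 test --name /podman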
There is one more basic smoke test that we would like to add, which is building a container. For this one, we will add a new file Containerfile with our image specification:
FROM quay.io/fedora/fedora:latest
CMD echo 'HELLO'
We place the file in the same podman directory and reference it as ./Containerfile. Remember that the test's working directory is the directory of its metadata, i.e., where the main.fmf file lives.
tests
├── main.fmf
└── podman
    ├── Containerfile
    ├── main.fmf
    └── test.sh
Then, we add the test to our test.sh file:
rlPhaseStartTest "test building"
    rlRun "podman build -t test:latest -f ./Containerfile"
    rlRun "podman image rm localhost/test:latest"
rlPhaseEnd
A Targeted Test
One of the ways we use podman is through a container that starts, lists, and stops other containers on the host, leveraging what is called systemd socket-based activation. We do this with a non-root user. There is a podman socket systemd unit which specifies that when something connects to that socket, systemd will start the podman service. We have seen this break before because of a systemd bug, so we created a test for it.
In this case, we write the test setup and execution in the test.sh, but the actual scenario goes into a separate shell script that we run as a non-root user (which is how we actually run it).
The separate script looks like this:
set -e
export XDG_RUNTIME_DIR=/run/user/$(id -u)
# Enable the user's podman socket; connecting to it starts podman.service on demand
systemctl --user enable --now podman.socket
# Start a container through the socket, which activates podman.service
podman --url unix://run/user/$(id -u)/podman/podman.sock run --name simple-test-with-port-mapping -d -p 8080:80 docker.io/nginx:latest
# Wait for the socket-activated podman.service instance to exit
pid=$(systemctl --user show --property MainPID --value podman.service)
while [ "${pid}" -ne 0 ] && [ -d /proc/${pid} ]; do sleep 1; echo "Waiting for podman to exit"; done
echo "Continuing"
# A new connection must re-activate the service and still list the running container
podman --url unix://run/user/$(id -u)/podman/podman.sock ps | grep -q -e simple-test-with-port-mapping
# Clean up the container and disable the socket
podman --url unix://run/user/$(id -u)/podman/podman.sock container rm -f simple-test-with-port-mapping
systemctl --user disable --now podman.socket
At this point, it makes sense to place this socket activation test in a separate directory: it keeps the other tests readable, lets us separate its setup and cleanup steps from theirs, and separates podman core features from systemd socket activation. The new structure looks like this:
tests/podman/
├── core
│   ├── Containerfile
│   ├── main.fmf
│   └── test.sh
└── socket-activation
    ├── main.fmf
    ├── remote-socket-test.sh
    └── test.sh
The new test.sh file is:
#!/bin/bash
. /usr/share/beakerlib/beakerlib.sh || exit 1
rlJournalStart
    rlPhaseStartSetup
        rlRun "dnf -y install podman"
        rlRun "useradd podman-remote-test"
        if ! grep /etc/subuid -q -e "podman-remote-test"; then
            rlRun "usermod --add-subuids 100000-165537 podman-remote-test"
        fi
        if ! grep /etc/subgid -q -e "podman-remote-test"; then
            rlRun "usermod --add-subgids 100000-165537 podman-remote-test"
        fi
        rlRun "loginctl enable-linger podman-remote-test"
        rlWaitForCmd "loginctl show-user podman-remote-test" -t 10
    rlPhaseEnd
    rlPhaseStartTest "test remote socket"
        rlRun "sudo -i -u podman-remote-test < ./remote-socket-test.sh"
    rlPhaseEnd
    rlPhaseStartCleanup
        rlRun "loginctl terminate-user podman-remote-test"
        rlRun "loginctl disable-linger podman-remote-test"
        rlWaitForCmd "loginctl show-user podman-remote-test" -t 10 -r 1
        rlRun "userdel -r podman-remote-test"
        rlRun "dnf remove -y podman"
    rlPhaseEnd
rlJournalEnd
Another Targeted Test Involving Third-Party Programs
There is yet another way we use podman that we need to ensure keeps working after an update: as a gitlab runner executor.
For this test, we will set up a repository on gitlab.com with a simple podman scenario, and we will also set up the corresponding tokens to access it.
.
├── Containerfile
└── .gitlab-ci.yml
And the .gitlab-ci.yml is:
image:
  name: quay.io/containers/podman:v4.6

test-build:
  stage: build
  tags:
    - ci
    - gitlab-runner-test
  script:
    - podman build -t gitlab-runner-test .
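The Containerfile in that repository only needs to give the pipeline something to build; a hypothetical minimal example:
FROM quay.io/fedora/fedora:latest
RUN echo 'gitlab-runner podman executor test'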
We will register a gitlab runner with the podman executor and with the specific tags ci and gitlab-runner-test. These are arbitrary tags that ensure the proper gitlab-runner is selected. Then we will use the gitlab API to trigger a pipeline for that repository and wait for the pipeline to succeed (or not). As a result, the only job in that pipeline will be executed within our CI machine with podman, and if everything is successful, the test will pass.
Since this test involves another program, we will put it in a separate directory named gitlab-runner, under the feature podman-exec. While we are at it, we also add smoke tests for the gitlab-runner command as a separate test, so that if the podman-exec test fails, we can tell whether the failure lies in the gitlab-runner and podman integration or in the basics.
The directory structure looks like this:
tests/gitlab-runner/
├── cmd
│   ├── main.fmf
│   └── test.sh
└── podman-exec
    ├── main.fmf
    └── test.sh
We will skip the details of the cmd test to focus on the podman-exec one.
Our main.fmf is:
summary: gitlab runner with podman tests
description: checks that gitlab runners can still run podman executor
require:
- "jq" # required to parse the gitlab api output
- "firewalld" # required by podman and gitlab-runner to setup the networking
And our more complicated test looks like this:
#!/bin/bash
. /usr/share/beakerlib/beakerlib.sh || exit 1
rlJournalStart
    rlPhaseStartSetup
        rlRun "rpmkeys --import \
            https://packages.gitlab.com/runner/gitlab-runner/gpgkey/runner-gitlab-runner-4C80FB51394521E9.pub.gpg"
        rlRun "dnf install -y \
            https://gitlab-runner-downloads.s3.amazonaws.com/${GITLAB_RUNNER_VERSION}/rpm/gitlab-runner_amd64.rpm"
        rlRun "dnf install -y podman"
        if ! grep /etc/subuid -q -e "gitlab-runner"; then
            rlRun "usermod --add-subuids 100000-165537 gitlab-runner"
        fi
        if ! grep /etc/subgid -q -e "gitlab-runner"; then
            rlRun "usermod --add-subgids 100000-165537 gitlab-runner"
        fi
        rlRun "loginctl enable-linger gitlab-runner"
        rlWaitForCmd "loginctl show-user gitlab-runner" -t 10
        rlWaitForCmd 'sudo -u gitlab-runner \
            XDG_RUNTIME_DIR=/run/user/$(id -u gitlab-runner) \
            systemctl --user enable podman.socket' -t 60
        rlWaitForFile "/run/user/$(id -u gitlab-runner)/podman/podman.sock" -t 60
        rlRun 'gitlab-runner register -n \
            --url="https://gitlab.com/" \
            --executor="docker" \
            --env="FF_NETWORK_PER_BUILD=1" \
            --tag-list="gitlab-runner-test,ci" \
            --registration-token="${GITLAB_REGISTRATION_TOKEN}" \
            --docker-host="unix:///run/user/$(id -u gitlab-runner)/podman/podman.sock" \
            --docker-privileged="true" \
            --docker-tlsverify="false" \
            --docker-image="quay.io/podman/stable"'
        rlRun "gitlab-runner verify"
    rlPhaseEnd
    rlPhaseStartTest "test podman build in gitlab runner"
        rlRun -s 'curl -s --request POST \
            --form token=${GITLAB_PIPELINE_TOKEN} \
            --form ref=main \
            "https://gitlab.com/api/v4/projects/${GITLAB_PROJECT_ID}/trigger/pipeline"'
        rlRun "jq .status $rlRun_LOG | grep -q created" 0 "Verify that the pipeline was created"
        rlRun 'pipeline_id=$(jq .id $rlRun_LOG)'
        rlWaitForCmd 'curl -s --request GET \
            --header "PRIVATE-TOKEN: ${GITLAB_API_TOKEN}" \
            "https://gitlab.com/api/v4/projects/${GITLAB_PROJECT_ID}/pipelines/${pipeline_id}" | jq .status | grep -q success' -t 60
        rlRun 'curl -s --request GET \
            --header "PRIVATE-TOKEN: ${GITLAB_API_TOKEN}" \
            "https://gitlab.com/api/v4/projects/${GITLAB_PROJECT_ID}/pipelines/${pipeline_id}" | jq .status | grep -q success'
        rlRun 'curl -s --request POST \
            --header "PRIVATE-TOKEN: ${GITLAB_API_TOKEN}" \
            "https://gitlab.com/api/v4/projects/${GITLAB_PROJECT_ID}/pipelines/${pipeline_id}/cancel"' 0 'Cancel pipeline'
    rlPhaseEnd
    rlPhaseStartCleanup
        rlRun "gitlab-runner unregister --all-runners"
        rlRun "loginctl terminate-user gitlab-runner"
        rlRun "loginctl disable-linger gitlab-runner"
        rlWaitForCmd "loginctl show-user gitlab-runner" -t 10 -r 1
        rlRun "dnf remove -y gitlab-runner podman"
    rlPhaseEnd
rlJournalEnd
The test involves installing gitlab-runner, setting up podman, registering the runner, triggering a pipeline, and finally tearing everything down.
The final directory structure of our exercise is:
tests
├── main.fmf
├── gitlab-runner
│   ├── cmd
│   │   ├── main.fmf
│   │   └── test.sh
│   └── podman-exec
│       ├── main.fmf
│       └── test.sh
└── podman
    ├── core
    │   ├── Containerfile
    │   ├── main.fmf
    │   └── test.sh
    └── socket-activation
        ├── main.fmf
        ├── remote-socket-test.sh
        └── test.sh
Plans and Tests Linting
After writing the plans, the tests, and their metadata, we can use tmt lint to validate the fmf files, as well as tmt plans ls and tmt tests ls to ensure the plans and the tests are found. For example, if we misspelled description as descripton in the podman/core test, the tmt lint output would show this:
/tests/podman/core
warn C000 key "descripton" not recognized by schema, and does not match "^extra-" pattern
warn C000 fmf node failed schema validation
pass C001 summary key is set and is reasonably long
fail T001 unknown key "descripton" is used
pass T002 test script is defined
pass T003 directory path is absolute
pass T004 test path '/home/****/****/tests/podman/core' does exist
skip T005 legacy relevancy not detected
skip T006 legacy 'coverage' field not detected
skip T007 not a manual test
skip T008 not a manual test
skip T009 library/file requirements not used
To lint the shell files, we use the ShellCheck tool.
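One simple way to run it over every shell script under the tests directory, as a sketch:
find tests -name '*.sh' -print0 | xargs -0 shellcheck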
Plan Execution
The plan execution is simply a command line in the Gitlab pipeline. We pass down with the -e option any environment variables that are not already defined in the plans with the environment yaml item. We also run it with the -vv flag to get enough verbosity in the pipelines, and in addition we keep the /var/tmp/tmt directory for further investigation if a test fails.
The command line to execute it is like this:
tmt -vv run -e "FOO=1" -e "BAR=1"
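Keeping /var/tmp/tmt around when something fails can be done with job artifacts in the pipeline. This is only a sketch; the job name, stage, and paths are hypothetical:
smoke-tests:
  stage: test
  script:
    - tmt -vv run -e "FOO=1" -e "BAR=1"
  after_script:
    - cp -r /var/tmp/tmt tmt-workdir || true
  artifacts:
    when: on_failure
    paths:
      - tmt-workdir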
The standard output gives us a summary of every stage of the run, including provisioning, preparation, execution, finishing, and a summary report.
This is how the execute and report sections of the output look for our four example tests, showing their pass or fail status and the time they took to run:
execute
queued execute task #1: default-0 on default-0
execute task #1: default-0 on default-0
how: tmt
00:00:13 pass /tests/gitlab-runner/cmd (on default-0) [1/4]
00:02:45 pass /tests/gitlab-runner/podman-exec (on default-0) [2/4]
00:00:33 pass /tests/podman/core (on default-0) [3/4]
00:00:10 pass /tests/podman/socket-activation (on default-0) [4/4]
summary: 4 tests executed
report
how: display
pass /tests/gitlab-runner/cmd
pass /tests/gitlab-runner/podman-exec
pass /tests/podman/core
pass /tests/podman/socket-activation
summary: 4 tests passed
An engineer can run individual tests with greater verbosity. For example, to run only the podman/core test, the engineer could set up a clone of a CI instance, use the connect provision method to access it, and increase the verbosity level by adding more v's:
tmt -vvv run -a provision --how=connect --guest=THE_GUEST_IP --become --user=ec2-user --key=~/.pem/the_guest_key.pem test --name /podman/core
To run the gitlab-runner/podman-exec one, the command would be similar, but passing all the required environment variables with -e options.
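For example, something along these lines, where the variable values are placeholders:
tmt -vvv run -a -e GITLAB_RUNNER_VERSION=latest \
    -e GITLAB_REGISTRATION_TOKEN=REDACTED \
    -e GITLAB_PIPELINE_TOKEN=REDACTED \
    -e GITLAB_API_TOKEN=REDACTED \
    -e GITLAB_PROJECT_ID=12345678 \
    provision --how=connect --guest=THE_GUEST_IP --become --user=ec2-user --key=~/.pem/the_guest_key.pem \
    test --name /gitlab-runner/podman-exec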
You may have noticed that we used rlWaitForCmd in several places, for example in gitlab-runner/podman-exec, setting the -t option to specify the timeout. TMT also allows us to set a timeout for the entire test; the default is 5 minutes. We can change it inside the main.fmf of the specific test, or in a parent main.fmf to set it for every test underneath. For example, we found after several runs that gitlab-runner/podman-exec should run in no more than 3 minutes and 30 seconds. So, in order to set the timeout on that specific test, we change its main.fmf this way:
summary: gitlab runner with podman tests
description: checks that gitlab runners can still run podman executor
duration: 3m 30s
require:
- "jq" # required to parse the gitlab api output
- "firewalld" # required by podman and gitlab-runner to setup the networking
If the test times out, we would get a report like this:
report
how: display
errr /tests/gitlab-runner/podman-exec (timeout)
summary: 1 error
Yet another thing we can do with a test is disable it, for example temporarily. To do so, we can simply set enabled to false in the main.fmf:
summary: gitlab runner with podman tests
description: checks that gitlab runners can still run podman executor
enabled: false
duration: 3m 30s
require:
- "jq" # required to parse the gitlab api output
- "firewalld" # required by podman and gitlab-runner to setup the networking
Then TMT skips this test while executing the plan. The enabled flag can also be combined with the adjust yaml item to disable the test only in a certain context (e.g., a specific distro, arch, etc.).
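For example, a rule like the following sketch would disable the test only on CentOS Stream 8 while keeping it enabled everywhere else (the because note is just illustrative):
enabled: true
adjust:
  - enabled: false
    when: distro == centos-stream-8
    because: the scenario is only relevant on el9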
Test Outputs
TMT stores the test outputs and metadata in /var/tmp/tmt by default. For our last example, we can find the results at this path:
/var/tmp/tmt/run-001/plans/test/execute/data/guest/default-0/tests/gitlab-runner/podman-exec-2
├── ASSERT_STATUSES
├── clbuff
├── cleanup.sh
├── data
├── journal_colored.txt
├── journal.meta
├── journal.txt
├── journal.xml
├── metadata.yaml
├── output.txt
├── PersistentData
├── PHASE_STATUSES
├── TestResults
├── tmt-test-topology.sh
└── tmt-test-topology.yaml
A very helpful file here is output.txt, which contains the stdout and stderr of the test. The entire /var/tmp/tmt directory can be stored for further investigation when tests do fail.
The /var/tmp/tmt directory contains other important files like results.yaml, which gives us a summary of the test executions and whether they passed or failed.
Conclusion
In this post, we have presented the overall process we follow and detailed, with a few examples, how we write and manage specific tests. This illustrates both the value and the expected complexity of using TMT in combination with Beakerlib to build a solid CI testing solution.
TMT with Beakerlib has made it easier to write and manage tests that verify updates against the way we use programs, whether they come from the official repositories, EPEL, or third-party sources. Together with the canary instances technique, this allows us to detect issues as early as possible.