HealthTrio has adopted CentOS Stream 9 for several use cases within our system. We run a few dozen servers supporting operations such as sandbox environments, build infrastructure, monitoring, analytics, automation, and mirroring. Because all of these servers support corporate activities, they require a safe patch management process.
Internal mirror repositories are a traditional solution that gives greater control over the updates applied to Linux servers. A system pulls the updates from the official repositories and makes them available to the servers once certain criteria are met, then applies them to the servers at a specific cadence. At HealthTrio, we have implemented this traditional solution and added a CI step, built with the TMT tool, for additional early validation.
The Overall Process
The overall steps are as follows:
1. The updates are pulled from the official repositories and placed into a staging internal repository.
2. The updates are applied to a set of CI instances covering different EL flavors and major releases, including CentOS Stream 9.
3. The TMT tests run. If they all pass, the process proceeds to the next step.
4. The updates are promoted to the stable repository.
5. A week later, AWX applies the updates to a set of canary machines. This serves as another validation mechanism by exposing the updates to servers used in daily operations.
6. A week after that, if everything is OK, AWX applies the updates to the rest of the servers.
Steps 1 to 4 are executed in a Gitlab pipeline; steps 5 and 6 are executed by AWX.
If a test fails or a canary instance reveals a bug, the process is stopped and engineers step in to analyze the issue. If the fault is caused by the update, they report the issue and decide whether to wait for a fixed update and start the process all over, or to freeze the affected packages at the last stable version and let the process continue while skipping the failing tests. This process as it stands has room for improvement, which is already a work in progress.
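For illustration, steps 1 to 4 can be modeled as a small Gitlab CI pipeline. The following is only a minimal sketch with hypothetical stage names and helper scripts, not our actual pipeline definition:
stages:
  - sync
  - test
  - promote

sync-staging:
  stage: sync
  script:
    - ./scripts/pull-updates-to-staging.sh    # hypothetical helper

tmt-smoke-tests:
  stage: test
  script:
    - tmt -vv run                             # run the TMT plans against the CI instances

promote-to-stable:
  stage: promote
  script:
    - ./scripts/promote-staging-to-stable.sh  # hypothetical helper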
The CI step with TMT
One of the steps that increases our confidence in the stability of updates is the CI part implemented with TMT. This step contains smoke tests that exercise the updates against the way we use the OS and the programs. Since it is impossible to achieve 100% coverage of these complex systems, the CI tests step is complemented by the canary instances step in our process.
TMT stands for Test Management Tool, and it is exactly that: a command-line tool to manage tests. The tool handles much of the complexity of dealing with test suites; for example, test provisioning, execution, results reporting, test stdout/stderr handling, timing and time constraints, enabling/disabling, selection, documentation, modularization, and even importing/exporting remotely hosted tests. TMT is mature and already adopted in Fedora and CentOS Stream.
TMT can run shell scripts, but it also integrates with Beakerlib. Beakerlib is another test project that provides a set of functions that make writing tests easier. It has functions to organize test units into setup, run, and cleanup phases, as well as many other handy ones, for example to back up and restore config files during tests, or to wait for a command to exit with 0 with configurable timeout, retries, and retry intervals, which saves you from writing all that code from scratch. TMT runs the tests and parses the Beakerlib output to detect test failures. Furthermore, TMT is not limited to shell scripts or Beakerlib. You can run tests written in other test frameworks like pytest, ginkgo, behave, or anything else by simply letting TMT interpret their exit codes or by producing custom result files TMT can understand. You can also go further and write a TMT plugin for your favorite test framework.
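For instance, wrapping an existing pytest suite only takes pointing the test metadata at the command that runs it, since a non-zero exit code is reported as a failure. A minimal sketch, with a hypothetical path and duration:
summary: run the existing pytest suite
test: python3 -m pytest -v ./suite
framework: shell
duration: 10m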
As of the time of this post, HealthTrio has used both tools to write a total of 124 tests, all of them with Beakerlib. The tests verify the basic and not-so-basic functionality of packages from the official repos like bpf tools, podman, selinux policies, or perf; from EPEL like node-exporter or prometheus; from third-party sources like fluent-bit, tenable nessus agent, grafana, k3s, or gitlab runner; as well as internally built programs. Some tests are basic smoke tests, while others are more specific to the way we use the programs and their configurations.
The CentOS Stream updates have been fairly stable for us; even so, on two occasions last year our current process spotted breaking updates before they reached our systems, preventing interruptions to operations.
We are currently organizing our TMT tests in the following way:
.
├── plans
│   ├── el8.fmf
│   ├── el9.fmf
│   └── main.fmf
└── tests
    ├── main.fmf
    └── program
        ├── features_a
        │   ├── main.fmf
        │   └── test.sh
        └── features_b
            ├── main.fmf
            └── test.sh
The additional sublevel for features is optional and only used when the complexity justifies it.
The plans directory
Since we test different flavors of Enterprise Linux and also the major releases 8 and 9, we group the CI servers by major release using the role multi-host feature. We have the el8 role and the el9 role. Then we list the tests targeted at the different releases. It looks like this:
summary: full smoke test for el9
discover:
    how: fmf
    where: [el9]
    test:
        - /fluent-bit
        - /node-exporter
        - /podman
        # more tests
There are other ways to do this, for example, using adjust and letting each test specify where it can run. The main.fmf specifies the actual provisioning and the assignment of servers to roles.
Something important to take into account is that we do not access the CI instances as root; instead, each instance has a user we access with a key, and this user has sudo privileges without a password prompt. Therefore, for the provisioning, we leverage the become feature introduced in TMT 1.29. We could run these tests as a non-root user and write sudo everywhere, but that creates unnecessary noise in the tests themselves and hinders our ability to import or donate tests as-is from or to other repositories (e.g., Fedora or CentOS tests), which are written assuming a root user.
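Putting this together, a plans/main.fmf could look roughly like the sketch below; the guest names, user, and key path are hypothetical, while the important parts are the role assignment and become:
provision:
    - name: ci-el8
      role: el8
      how: connect
      guest: el8-ci.internal.example
      user: ec2-user
      key: /path/to/ci_key.pem
      become: true
    - name: ci-el9
      role: el9
      how: connect
      guest: el9-ci.internal.example
      user: ec2-user
      key: /path/to/ci_key.pem
      become: true
execute:
    how: tmt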
The tests directory
The tests themselves go inside this directory, grouped by program. We have a main.fmf where we specify that all tests underneath the directory structure are Beakerlib tests and that each test lives in the ./test.sh file relative to its fmf metadata file.
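Because fmf metadata is inherited by child nodes, that top-level tests/main.fmf can be as small as this sketch:
framework: beakerlib
test: ./test.sh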
Each program would then have its own main.fmf file specifying the metadata and a test.sh file with the test written using the Beakerlib framework.
Example: Creating Tests for podman
Let’s go through a few examples with podman that can illustrate writing tests.
Basic Smoke Tests
First, we create the directory and file structure:
tests/
├── main.fmf
└── podman
    ├── main.fmf
    └── test.sh
Then, we describe it in the main.fmf:
summary: podman tests
description: basic podman smoke tests
Now, let’s write some basic smoke tests with Beakerlib:
#!/bin/bash
. /usr/share/beakerlib/beakerlib.sh || exit 1
rlJournalStart
    rlPhaseStartSetup
        rlRun "dnf -y install podman"
        if ! grep /etc/subuid -q -e "${USER}"; then
            rlRun "usermod --add-subuids 100000-165537 ${USER}"
        fi
        if ! grep /etc/subgid -q -e "${USER}"; then
            rlRun "usermod --add-subgids 100000-165537 ${USER}"
        fi
    rlPhaseEnd
    rlPhaseStartTest "test get version"
        rlRun "podman version"
    rlPhaseEnd
    rlPhaseStartTest "test get info"
        rlRun "podman info"
    rlPhaseEnd
    rlPhaseStartTest "test pulling image"
        rlRun "podman image pull quay.io/fedora/fedora:latest"
        rlRun "podman image rm quay.io/fedora/fedora:latest"
    rlPhaseEnd
    rlPhaseStartTest "test listing image"
        rlRun "podman image pull quay.io/fedora/fedora:latest"
        rlRun -s "podman image ls quay.io/fedora/fedora:latest"
        rlAssertGrep "quay.io/fedora/fedora" $rlRun_LOG
        rlRun "podman image rm quay.io/fedora/fedora:latest"
    rlPhaseEnd
    rlPhaseStartTest "test container run"
        rlRun "podman run --rm quay.io/fedora/fedora:latest bash -c 'echo HELLO' | grep -q -e 'HELLO'"
    rlPhaseEnd
    rlPhaseStartTest "test system service"
        rlRun "podman system service -t 1"
    rlPhaseEnd
    rlPhaseStartTest "test mounting file"
        rlRun "touch test.txt"
        rlRun "podman run --rm --privileged -v $(pwd)/test.txt:/test.txt quay.io/fedora/fedora:latest bash -c 'echo HELLO > /test.txt'"
        rlRun "grep -q -e 'HELLO' test.txt"
        rlRun "rm -f test.txt"
    rlPhaseEnd
    rlPhaseStartCleanup
        rlRun "dnf remove -y podman"
    rlPhaseEnd
rlJournalEnd
So far, these are very basic smoke tests that ensure that a critical component of our operation works fine.
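While developing a test like this one, it can be handy to run it on a disposable guest instead of a real CI instance. As a sketch, assuming the tmt virtualization plugins are installed and a centos-stream-9 image is available to the virtual provisioner:
tmt run -a provision --how=virtual --image=centos-stream-9 test --name /podman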
There is one more basic smoke test that we would like to add, which is building a container. For this one, we will add a new file Containerfile with our image specification:
FROM quay.io/fedora/fedora:latest
CMD echo 'HELLO'
We place the file in the same podman directory and reference it as ./Containerfile. Remember that the test's working directory is the directory of its metadata, i.e., where the main.fmf file lives.
tests
├── main.fmf
└── podman
    ├── Containerfile
    ├── main.fmf
    └── test.sh
Then, we add the test to our test.sh file:
rlPhaseStartTest "test building"
    rlRun "podman build -t test:latest -f ./Containerfile"
    rlRun "podman image rm localhost/test:latest"
rlPhaseEnd
A Targeted Test
One of the ways we use podman is through a container that starts, lists, and stops other containers on the host, leveraging what is called systemd socket-based activation. We do this with a non-root user. There is a podman socket systemd unit which specifies that when something connects to that socket, systemd will start the podman service. We have seen this break before because of a systemd bug, so we created a test for it.
In this case, we write the test setup and execution in the test.sh, but the actual scenario goes into a separate shell script that we run as a non-root user (which is how we actually run it).
The separate script looks like this:
set -e
export XDG_RUNTIME_DIR=/run/user/$(id -u)
# Enable the user's podman socket; connecting to it starts podman.service on demand
systemctl --user enable --now podman.socket
# Start a container through the socket, which activates podman.service
podman --url unix://run/user/$(id -u)/podman/podman.sock run --name simple-test-with-port-mapping -d -p 8080:80 docker.io/nginx:latest
# Wait for the socket-activated podman.service instance to exit
pid=$(systemctl --user show --property MainPID --value podman.service)
while [ "${pid}" -ne 0 ] && [ -d /proc/${pid} ]; do sleep 1; echo "Waiting for podman to exit"; done
echo "Continuing"
# A new connection must re-activate the service and still list the running container
podman --url unix://run/user/$(id -u)/podman/podman.sock ps | grep -q -e simple-test-with-port-mapping
# Clean up the container and disable the socket
podman --url unix://run/user/$(id -u)/podman/podman.sock container rm -f simple-test-with-port-mapping
systemctl --user disable --now podman.socket
At this point, it makes sense to place this socket activation test in a separate directory: it keeps the other tests readable, lets us separate its setup and cleanup steps from theirs, and separates podman core features from systemd socket activation. The new structure looks like this:
tests/podman/
├── core
│   ├── Containerfile
│   ├── main.fmf
│   └── test.sh
└── socket-activation
    ├── main.fmf
    ├── remote-socket-test.sh
    └── test.sh
The new test.sh file is:
#!/bin/bash
. /usr/share/beakerlib/beakerlib.sh || exit 1
rlJournalStart
    rlPhaseStartSetup
        rlRun "dnf -y install podman"
        rlRun "useradd podman-remote-test"
        if ! grep /etc/subuid -q -e "podman-remote-test"; then
            rlRun "usermod --add-subuids 100000-165537 podman-remote-test"
        fi
        if ! grep /etc/subgid -q -e "podman-remote-test"; then
            rlRun "usermod --add-subgids 100000-165537 podman-remote-test"
        fi
        rlRun "loginctl enable-linger podman-remote-test"
        rlWaitForCmd "loginctl show-user podman-remote-test" -t 10
    rlPhaseEnd
    rlPhaseStartTest "test remote socket"
        rlRun "sudo -i -u podman-remote-test < ./remote-socket-test.sh"
    rlPhaseEnd
    rlPhaseStartCleanup
        rlRun "loginctl terminate-user podman-remote-test"
        rlRun "loginctl disable-linger podman-remote-test"
        rlWaitForCmd "loginctl show-user podman-remote-test" -t 10 -r 1
        rlRun "userdel -r podman-remote-test"
        rlRun "dnf remove -y podman"
    rlPhaseEnd
rlJournalEnd
Another Targeted Test Involving Third-Party Programs
There is yet another way we use podman that we need to ensure keeps working after an update: as a gitlab runner executor.
For this test, we will set up a repository on gitlab.com with a simple podman scenario, and we will also set up the corresponding tokens to access it.
.
├── Containerfile
└── .gitlab-ci.yml
And the .gitlab-ci.yml is:
image:
  name: quay.io/containers/podman:v4.6

test-build:
  stage: build
  tags:
    - ci
    - gitlab-runner-test
  script:
    - podman build -t gitlab-runner-test .
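The Containerfile in that repository only needs to give the pipeline something to build; a hypothetical minimal example:
FROM quay.io/fedora/fedora:latest
RUN echo 'gitlab-runner podman executor test'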
We will register a gitlab runner with the podman executor and with the specific tags ci and gitlab-runner-test. These are arbitrary tags that ensure the proper gitlab-runner is selected. Then we will use the gitlab API to trigger a pipeline for that repository and wait for the pipeline to succeed (or not). As a result, the only job in that pipeline will be executed within our CI machine with podman, and if everything is successful, the test will pass.
Since this test involves another program, we will put it in a separate directory named gitlab-runner, under the feature podman-exec. While we are at it, we also add smoke tests for the gitlab-runner command as a separate test, so that if the podman-exec test fails, we can tell whether the failure lies in the gitlab-runner and podman integration or in the basics.
The directory structure looks like this:
tests/gitlab-runner/
├── cmd
│   ├── main.fmf
│   └── test.sh
└── podman-exec
    ├── main.fmf
    └── test.sh
We will skip the details of the cmd test to focus on the podman-exec one.
Our main.fmf is:
summary: gitlab runner with podman tests
description: checks that gitlab runners can still run podman executor
require:
- "jq" # required to parse the gitlab api output
- "firewalld" # required by podman and gitlab-runner to setup the networking
And our more complicated test looks like this:
#!/bin/bash
. /usr/share/beakerlib/beakerlib.sh || exit 1
rlJournalStart
    rlPhaseStartSetup
        rlRun "rpmkeys --import \
            https://packages.gitlab.com/runner/gitlab-runner/gpgkey/runner-gitlab-runner-4C80FB51394521E9.pub.gpg"
        rlRun "dnf install -y \
            https://gitlab-runner-downloads.s3.amazonaws.com/${GITLAB_RUNNER_VERSION}/rpm/gitlab-runner_amd64.rpm"
        rlRun "dnf install -y podman"
        if ! grep /etc/subuid -q -e "gitlab-runner"; then
            rlRun "usermod --add-subuids 100000-165537 gitlab-runner"
        fi
        if ! grep /etc/subgid -q -e "gitlab-runner"; then
            rlRun "usermod --add-subgids 100000-165537 gitlab-runner"
        fi
        rlRun "loginctl enable-linger gitlab-runner"
        rlWaitForCmd "loginctl show-user gitlab-runner" -t 10
        rlWaitForCmd 'sudo -u gitlab-runner \
            XDG_RUNTIME_DIR=/run/user/$(id -u gitlab-runner) \
            systemctl --user enable podman.socket' -t 60
        rlWaitForFile "/run/user/$(id -u gitlab-runner)/podman/podman.sock" -t 60
        rlRun 'gitlab-runner register -n \
            --url="https://gitlab.com/" \
            --executor="docker" \
            --env="FF_NETWORK_PER_BUILD=1" \
            --tag-list="gitlab-runner-test,ci" \
            --registration-token="${GITLAB_REGISTRATION_TOKEN}" \
            --docker-host="unix:///run/user/$(id -u gitlab-runner)/podman/podman.sock" \
            --docker-privileged="true" \
            --docker-tlsverify="false" \
            --docker-image="quay.io/podman/stable"'
        rlRun "gitlab-runner verify"
    rlPhaseEnd
    rlPhaseStartTest "test podman build in gitlab runner"
        rlRun -s 'curl -s --request POST \
            --form token=${GITLAB_PIPELINE_TOKEN} \
            --form ref=main \
            "https://gitlab.com/api/v4/projects/${GITLAB_PROJECT_ID}/trigger/pipeline"'
        rlRun "jq .status $rlRun_LOG | grep -q created" 0 "Verify that the pipeline was created"
        rlRun 'pipeline_id=$(jq .id $rlRun_LOG)'
        rlWaitForCmd 'curl -s --request GET \
            --header "PRIVATE-TOKEN: ${GITLAB_API_TOKEN}" \
            "https://gitlab.com/api/v4/projects/${GITLAB_PROJECT_ID}/pipelines/${pipeline_id}" | jq .status | grep -q success' -t 60
        rlRun 'curl -s --request GET \
            --header "PRIVATE-TOKEN: ${GITLAB_API_TOKEN}" \
            "https://gitlab.com/api/v4/projects/${GITLAB_PROJECT_ID}/pipelines/${pipeline_id}" | jq .status | grep -q success'
        rlRun 'curl -s --request POST \
            --header "PRIVATE-TOKEN: ${GITLAB_API_TOKEN}" \
            "https://gitlab.com/api/v4/projects/${GITLAB_PROJECT_ID}/pipelines/${pipeline_id}/cancel"' 0 'Cancel pipeline'
    rlPhaseEnd
    rlPhaseStartCleanup
        rlRun "gitlab-runner unregister --all-runners"
        rlRun "loginctl terminate-user gitlab-runner"
        rlRun "loginctl disable-linger gitlab-runner"
        rlWaitForCmd "loginctl show-user gitlab-runner" -t 10 -r 1
        rlRun "dnf remove -y gitlab-runner podman"
    rlPhaseEnd
rlJournalEnd
The test involves installing gitlab-runner, setting up podman, registering the runner, triggering a pipeline, and finally tearing everything down.
The final directory structure of our exercise is:
tests
├── main.fmf
├── gitlab-runner
│   ├── cmd
│   │   ├── main.fmf
│   │   └── test.sh
│   └── podman-exec
│       ├── main.fmf
│       └── test.sh
└── podman
    ├── core
    │   ├── Containerfile
    │   ├── main.fmf
    │   └── test.sh
    └── socket-activation
        ├── main.fmf
        ├── remote-socket-test.sh
        └── test.sh
Plans and Tests Linting
After writing the plans, the tests, and their metadata, we can use tmt lint to validate the fmf files, as well as tmt plans ls and tmt tests ls to ensure the plans and the tests are found. For example, if we misspelled description as descripton in the podman/core test, the tmt lint output would show this:
/tests/podman/core
warn C000 key "descripton" not recognized by schema, and does not match "^extra-" pattern
warn C000 fmf node failed schema validation
pass C001 summary key is set and is reasonably long
fail T001 unknown key "descripton" is used
pass T002 test script is defined
pass T003 directory path is absolute
pass T004 test path '/home/****/****/tests/podman/core' does exist
skip T005 legacy relevancy not detected
skip T006 legacy 'coverage' field not detected
skip T007 not a manual test
skip T008 not a manual test
skip T009 library/file requirements not used
To lint the shell files, we use the ShellCheck tool.
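One simple way to run it over every shell script under the tests directory, as a sketch:
find tests -name '*.sh' -print0 | xargs -0 shellcheck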
Plan Execution
The plan execution is simply a command line in the Gitlab pipeline. We pass down with the -e option any environment variables that are not already defined in the plans with the environment yaml item. We also run it with the -vv flag to get enough verbosity in the pipelines, and in addition we keep the /var/tmp/tmt directory for further investigation if a test fails.
The command line to execute it is like this:
tmt -vv run -e "FOO=1" -e "BAR=1"
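Keeping /var/tmp/tmt around when something fails can be done with job artifacts in the pipeline. This is only a sketch; the job name, stage, and paths are hypothetical:
smoke-tests:
  stage: test
  script:
    - tmt -vv run -e "FOO=1" -e "BAR=1"
  after_script:
    - cp -r /var/tmp/tmt tmt-workdir || true
  artifacts:
    when: on_failure
    paths:
      - tmt-workdir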
The standard output gives us a summary of every stage of the run, including provisioning, preparation, execution, finishing, and a summary report.
This is how the execute and report sections of the output look for our four example tests, showing their pass or fail status and the time they took to run:
execute
queued execute task #1: default-0 on default-0
execute task #1: default-0 on default-0
how: tmt
00:00:13 pass /tests/gitlab-runner/cmd (on default-0) [1/4]
00:02:45 pass /tests/gitlab-runner/podman-exec (on default-0) [2/4]
00:00:33 pass /tests/podman/core (on default-0) [3/4]
00:00:10 pass /tests/podman/socket-activation (on default-0) [4/4]
summary: 4 tests executed
report
how: display
pass /tests/gitlab-runner/cmd
pass /tests/gitlab-runner/podman-exec
pass /tests/podman/core
pass /tests/podman/socket-activation
summary: 4 tests passed
An engineer can run individual tests with greater verbosity. For example, to run only the podman/core test, the engineer could set up a clone of a CI instance, use the connect provision method to access it, and increase the verbosity level by adding more v's:
tmt -vvv run -a provision --how=connect --guest=THE_GUEST_IP --become --user=ec2-user --key=~/.pem/the_guest_key.pem test --name /podman/core
To run the gitlab-runner/podman-exec one, the command would be similar, but passing all the required environment variables with -e options.
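For example, something along these lines, where the variable values are placeholders:
tmt -vvv run -a -e GITLAB_RUNNER_VERSION=latest \
    -e GITLAB_REGISTRATION_TOKEN=REDACTED \
    -e GITLAB_PIPELINE_TOKEN=REDACTED \
    -e GITLAB_API_TOKEN=REDACTED \
    -e GITLAB_PROJECT_ID=12345678 \
    provision --how=connect --guest=THE_GUEST_IP --become --user=ec2-user --key=~/.pem/the_guest_key.pem \
    test --name /gitlab-runner/podman-exec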
You may have noticed that we used rlWaitForCmd in several places, for example in gitlab-runner/podman-exec, setting the -t option to specify the timeout. TMT also allows us to set a timeout for the entire test; the default is 5 minutes. We can change it inside the main.fmf of the specific test, or in a parent main.fmf to set it for every test underneath. For example, we found after several runs that gitlab-runner/podman-exec should run in no more than 3 minutes and 30 seconds. So, in order to set the timeout on that specific test, we change its main.fmf this way:
summary: gitlab runner with podman tests
description: checks that gitlab runners can still run podman executor
duration: 3m 30s
require:
- "jq" # required to parse the gitlab api output
- "firewalld" # required by podman and gitlab-runner to setup the networking
If the test times out, we would get a report like this:
report
how: display
errr /tests/gitlab-runner/podman-exec (timeout)
summary: 1 error
Yet another thing we can do with a test is disable it, for example temporarily. To do so, we can simply set enabled to false in the main.fmf:
summary: gitlab runner with podman tests
description: checks that gitlab runners can still run podman executor
enabled: false
duration: 3m 30s
require:
- "jq" # required to parse the gitlab api output
- "firewalld" # required by podman and gitlab-runner to setup the networking
Then TMT skips this test while executing the plan. The enabled flag can also be combined with the adjust yaml item to disable the test only in a certain context (e.g., a specific distro, arch, etc.).
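For example, a rule like the following sketch would disable the test only on CentOS Stream 8 while keeping it enabled everywhere else (the because note is just illustrative):
enabled: true
adjust:
  - enabled: false
    when: distro == centos-stream-8
    because: the scenario is only relevant on el9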
Test Outputs
TMT stores the test outputs and metadata in /var/tmp/tmt by default. For our last example, we can find the results at this path:
/var/tmp/tmt/run-001/plans/test/execute/data/guest/default-0/tests/gitlab-runner/podman-exec-2
├── ASSERT_STATUSES
├── clbuff
├── cleanup.sh
├── data
├── journal_colored.txt
├── journal.meta
├── journal.txt
├── journal.xml
├── metadata.yaml
├── output.txt
├── PersistentData
├── PHASE_STATUSES
├── TestResults
├── tmt-test-topology.sh
└── tmt-test-topology.yaml
A very helpful file here is output.txt, which contains the stdout and stderr of the test. The entire /var/tmp/tmt directory can be stored for further investigation when tests do fail.
The /var/tmp/tmt directory contains other important files like results.yaml, which gives us a summary of the test executions and whether they passed or failed.
Conclusion
In this post, we have presented the overall process we follow and detailed, with a few examples, how we write and manage specific tests. This illustrates both the value and the expected complexity of using TMT in combination with Beakerlib to build a solid CI testing solution.
TMT with Beakerlib has made it easier to write and manage tests that verify updates against the way we use programs, whether they come from the official repositories, EPEL, or third-party sources. Together with the canary instances technique, this allows us to detect issues as early as possible.