Latest posts for tag sw

Staticsite redesign

These are some notes about my redesign work in staticsite 2.x.

Mapping constraints and invariants

I started keeping notes of constraints and invariants, and this helped a lot in keeping bounds on the cognitive effort of the design.

I particularly liked how mapping the set of constraints added during site generation helped break down processing into a series of well defined steps. The code that handles each step now has a specific task, and can rely on clear assumptions.

Declarative page metadata

I designed page metadata as declarative fields added to the Page class.

I used typed descriptors for the fields, so that metadata fields can now have logic and validation, and are self-documenting!

This is the core of the Field implementation.
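
To give an idea of the mechanism, here is a minimal sketch of a descriptor-based field (illustrative only, not staticsite's actual Field code):

from typing import Any


class Field:
    """
    Declarative metadata field for Page classes, carrying its own
    documentation and optional normalization logic.
    """
    def __init__(self, default: Any = None, doc: str = ""):
        self.default = default
        self.doc = doc

    def __set_name__(self, owner, name: str):
        # Remember the attribute name this field was assigned to
        self.name = name

    def __get__(self, obj, objtype=None) -> Any:
        if obj is None:
            return self
        return obj.__dict__.get(self.name, self.default)

    def __set__(self, obj, value: Any):
        # Subclasses can override clean() to add validation
        obj.__dict__[self.name] = self.clean(value)

    def clean(self, value: Any) -> Any:
        return value


class Page:
    title = Field(doc="page title")
    date = Field(doc="publication date")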

Lean core

I tried to implement as much as possible in feature plugins, leaving to the staticsite core only what is essential to create the structure for plugins to build on.

The core provides a tree structure, an abstract Page object that can render to a file and resolve references to other pages, a Site that holds settings and controls the various loading steps, and little else.

The only type of content supported by the core is static asset files; Markdown, reStructuredText, images, taxonomies, feeds, directory indices, and so on are all provided via feature plugins.

Feature plugins

Feature plugins work by providing functions to be called at the various loading steps, and mixins to be added to site pages.

Mixins provided by feature plugins can add new declarative metadata fields, and extend Page methods: this ends up being very clean and powerful, and plays decently well with mypy's static type checking, too!

See for example the code of the alias feature, which allows a page to declare aliases that redirect to it: useful, for example, when moving content around.

It has a mixin (AliasPageMixin) that adds an aliases field that holds a list of page paths.

During the "generate" step, when autogenerated pages can be created, the aliases feature iterates through all pages that define aliases metadata, and generates the corresponding redirection pages.
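
In outline, and reusing the hypothetical Field sketch from above, the feature might look roughly like this (Feature, RedirectPage, and the site hooks are made-up names, not staticsite's actual API):

class AliasPageMixin:
    # Declarative metadata field added to pages by this feature
    aliases = Field(default=(), doc="site paths that should redirect to this page")


class AliasesFeature(Feature):
    def generate(self):
        # "generate" step hook: create a redirect page for each declared alias
        for page in self.site.iter_pages():
            for path in page.aliases:
                self.site.add_generated_page(RedirectPage(target=page, path=path))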

Self-documenting code

Staticsite can list loaded features, features can list the page subclasses that they use, and pages can list metadata fields.

As a result, each feature, each type of page, and each field of each page can generate documentation about itself: the staticsite reference is autogenerated in that way, mostly from Feature, Page, and Field docstrings.
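
A sketch of the idea, again with the hypothetical Field from above rather than staticsite's real classes:

def document_page_class(page_cls: type) -> str:
    """
    Autogenerate reference documentation for a Page subclass from its
    docstring and the doc strings of its metadata fields.
    """
    lines = [page_cls.__doc__ or "", ""]
    # (inherited fields would need a walk of the MRO)
    for name, value in vars(page_cls).items():
        if isinstance(value, Field):
            lines.append(f"{name}: {value.doc}")
    return "\n".join(lines)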

Understand the language, stay close to the language

Python has matured massively in recent years, and I like to stay on top of the release notes of the language and standard library for each release.

I like how what used to be dirty hacks have now found a clean way into the language:

  • what one would implement with metaclass magic one can now mostly do with descriptors, and get language support for it, including static type checking
  • understanding the inheritance system and method resolution order makes it possible to write type-checkable mixins
  • runtime-accessible docstrings help a lot with autogenerating documentation
  • os.scandir and the os functions that accept directory file descriptors make filesystem exploration pleasantly fast, for an interpreted language! (see the sketch below)
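
As a quick sketch of that last point (my example, not staticsite code), scanning a tree with os.scandir over directory file descriptors avoids repeated full-path lookups:

import os
from typing import Iterator


def scan_tree(dir_fd: int, relpath: str = ".") -> Iterator[str]:
    """
    Yield the relative paths of all files under the directory open as dir_fd.
    """
    with os.scandir(dir_fd) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                # Open the subdirectory relative to dir_fd, not via a full path
                sub_fd = os.open(entry.name, os.O_RDONLY, dir_fd=dir_fd)
                try:
                    yield from scan_tree(sub_fd, os.path.join(relpath, entry.name))
                finally:
                    os.close(sub_fd)
            else:
                yield os.path.join(relpath, entry.name)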

Released staticsite 2.x

In theory I wanted to announce the release of staticsite 2.0, but then I found bugs that prevented me from writing this post, so I'm also releasing 2.1, 2.2, and 2.3 :grin:

staticsite is the static site generator that I ended up writing after giving other generators a try.

I did a big round of cleanup of the code, which among other things allowed me to implement incremental builds.

It turned out that staticsite is fast enough that incremental builds are not really needed; however, a bug in caching rendered markdown made me forget about that. Now I fixed that bug, too, and I can choose between running staticsite fast and ridiculously fast.

My favourite bit of this work is the internal cleanup: I found a way to simplify the core design massively, and now the core and plugin system is simple enough that I can explain it. I'll probably write a blog post or two about it in the coming days.

On top of that, staticsite is basically clean with mypy running in strict mode! Getting there was a great ride which prompted a lot of thinking about designing code properly, as mypy is pretty good at flagging clumsy hacks.

If you want to give it a try, check out the small tutorial A new blog in under one minute.

Updating cbqt for bullseye

Back in 2017 I did work to set up a cross-building toolchain for Qt Creator, which takes advantage of Debian's packaging for the whole dependency ecosystem.

It ended with cbqt, a little script that sets up a chroot to hold cross-build dependencies, to avoid conflicting with packages on the host system, and generates a qmake wrapper that makes use of them.

Today I'm dusting off that work, to ensure it works on Debian bullseye.

Resetting Qt Creator

To make things reproducible, I wanted to reset Qt Creator's configuration.

Besides purging and reinstalling the package, one needs to manually remove:

  • ~/.config/QtProject
  • ~/.cache/QtProject/
  • /usr/share/qtcreator/QtProject, which is where configuration is stored if you used sdktool to programmatically configure Qt Creator (see for example this post, and Debian bug #1012561)

Updating cbqt

Easy start, change the distribution for the chroot:

-DIST_CODENAME = "stretch"
+DIST_CODENAME = "bullseye"

Adding LIBDIR

Something else does not work:

Test$ qmake-armhf -makefile
Info: creating stash file /Test/.qmake.stash
Test$ make
[...]
/usr/bin/arm-linux-gnueabihf-g++ -Wl,-O1 -Wl,-rpath-link,…/armhf/lib/arm-linux-gnueabihf -Wl,-rpath-link,…/armhf/usr/lib/arm-linux-gnueabihf -Wl,-rpath-link,…/armhf/usr/lib/ -o Test main.o mainwindow.o moc_mainwindow.o   /armhf/usr/lib/arm-linux-gnueabihf/libQt5Widgets.so /armhf/usr/lib/arm-linux-gnueabihf/libQt5Gui.so /armhf/usr/lib/arm-linux-gnueabihf/libQt5Core.so -lGLESv2 -lpthread
/usr/lib/gcc-cross/arm-linux-gnueabihf/10/../../../../arm-linux-gnueabihf/bin/ld: cannot find -lGLESv2
collect2: error: ld returned 1 exit status
make: *** [Makefile:146: Test] Error 1

I figured that now I also need to set QMAKE_LIBDIR and not just QMAKE_RPATHLINKDIR:

--- a/cbqt
+++ b/cbqt
@@ -241,18 +241,21 @@ include(../common/linux.conf)
 include(../common/gcc-base-unix.conf)
 include(../common/g++-unix.conf)

+QMAKE_LIBDIR += {chroot.abspath}/lib/arm-linux-gnueabihf
+QMAKE_LIBDIR += {chroot.abspath}/usr/lib/arm-linux-gnueabihf
+QMAKE_LIBDIR += {chroot.abspath}/usr/lib/
 QMAKE_RPATHLINKDIR += {chroot.abspath}/lib/arm-linux-gnueabihf
 QMAKE_RPATHLINKDIR += {chroot.abspath}/usr/lib/arm-linux-gnueabihf
 QMAKE_RPATHLINKDIR += {chroot.abspath}/usr/lib/

Now it links again:

Test$ qmake-armhf -makefile
Test$ make
/usr/bin/arm-linux-gnueabihf-g++ -Wl,-O1 -Wl,-rpath-link,…/armhf/lib/arm-linux-gnueabihf -Wl,-rpath-link,…/armhf/usr/lib/arm-linux-gnueabihf -Wl,-rpath-link,…/armhf/usr/lib/ -o Test main.o mainwindow.o moc_mainwindow.o   -L…/armhf/lib/arm-linux-gnueabihf -L…/armhf/usr/lib/arm-linux-gnueabihf -L…/armhf/usr/lib/ …/armhf/usr/lib/arm-linux-gnueabihf/libQt5Widgets.so …/armhf/usr/lib/arm-linux-gnueabihf/libQt5Gui.so …/armhf/usr/lib/arm-linux-gnueabihf/libQt5Core.so -lGLESv2 -lpthread

Making it work in Qt Creator

Time to try it in Qt Creator, and sadly it fails:

/armhf/usr/lib/arm-linux-gnueabihf/qt5/mkspecs/features/toolchain.prf:76: Variable QMAKE_CXX.COMPILER_MACROS is not defined.

QMAKE_CXX.COMPILER_MACROS is not defined

I traced it to this bit in armhf/usr/lib/arm-linux-gnueabihf/qt5/mkspecs/features/toolchain.prf (irrelevant bits omitted):

isEmpty($${target_prefix}.COMPILER_MACROS) {
    msvc {
        # …
    } else: gcc|ghs {
        vars = $$qtVariablesFromGCC($$QMAKE_CXX)
    }
    for (v, vars) {
        # …
        $${target_prefix}.COMPILER_MACROS += $$v
    }
    cache($${target_prefix}.COMPILER_MACROS, set stash)
} else {
    # …
}

It turns out that qmake is not able to realise that the compiler is gcc, so vars does not get set, nothing is set in COMPILER_MACROS, and qmake fails.

Reproducing it on the command line

When run manually, however, qmake-armhf worked, so it would be good to know how Qt Creator actually runs qmake. Since it frustratingly does not show the commands it runs, I'll have to strace it:

strace -e trace=execve --string-limit=123456 -o qtcreator.trace -f qtcreator

And there it is:

$ grep qmake- qtcreator.trace
1015841 execve("/usr/local/bin/qmake-armhf", ["/usr/local/bin/qmake-armhf", "-query"], 0x56096e923040 /* 54 vars */) = 0
1015865 execve("/usr/local/bin/qmake-armhf", ["/usr/local/bin/qmake-armhf", "…/Test/Test.pro", "-spec", "arm-linux-gnueabihf", "CONFIG+=debug", "CONFIG+=qml_debug"], 0x7f5cb4023e20 /* 55 vars */) = 0

I run the command manually and indeed I reproduce the problem:

$ /usr/local/bin/qmake-armhf Test.pro -spec arm-linux-gnueabihf CONFIG+=debug CONFIG+=qml_debug
…/armhf/usr/lib/arm-linux-gnueabihf/qt5/mkspecs/features/toolchain.prf:76: Variable QMAKE_CXX.COMPILER_MACROS is not defined.

I try removing options until I find the one that breaks it and... now it's always broken! Even manually running qmake-armhf, like I did earlier, stopped working:

$ rm .qmake.stash
$ qmake-armhf -makefile
…/armhf/usr/lib/arm-linux-gnueabihf/qt5/mkspecs/features/toolchain.prf:76: Variable QMAKE_CXX.COMPILER_MACROS is not defined.

Debugging toolchain.prf

I tried purging and reinstalling qtcreator, and recreating the chroot, but qmake-armhf is staying broken. I'll let that be, and try to debug toolchain.prf.

By grepping for gcc in the mkspecs directory, I managed to figure out that:

  • The } else: gcc|ghs { test is matching the value(s) of QMAKE_COMPILER
  • QMAKE_COMPILER can have multiple values, separated by space
  • If in armhf/usr/lib/arm-linux-gnueabihf/qt5/mkspecs/arm-linux-gnueabihf/qmake.conf I set QMAKE_COMPILER = gcc arm-linux-gnueabihf-gcc, then things work again.

Sadly, I failed to find reference documentation for QMAKE_COMPILER's syntax and behaviour. I also failed to find out why qmake-armhf worked earlier, and I am failing to restore the system to a situation where it works again. Maybe I dreamt that it worked? Did I have some manual change lying around from previous fiddling?

Anyway at least now I have the fix:

--- a/cbqt
+++ b/cbqt
@@ -248,7 +248,7 @@ QMAKE_RPATHLINKDIR += {chroot.abspath}/lib/arm-linux-gnueabihf
 QMAKE_RPATHLINKDIR += {chroot.abspath}/usr/lib/arm-linux-gnueabihf
 QMAKE_RPATHLINKDIR += {chroot.abspath}/usr/lib/

-QMAKE_COMPILER          = {chroot.arch_triplet}-gcc
+QMAKE_COMPILER          = gcc {chroot.arch_triplet}-gcc

 QMAKE_CC                = /usr/bin/{chroot.arch_triplet}-gcc

Fixing a compiler mismatch warning

In setting up the kit, Qt Creator also complained that the compiler from qmake did not match the one configured in the kit. That was easy to fix, by pointing at the host system cross-compiler in qmake.conf:

 QMAKE_COMPILER          = {chroot.arch_triplet}-gcc

-QMAKE_CC                = {chroot.arch_triplet}-gcc
+QMAKE_CC                = /usr/bin/{chroot.arch_triplet}-gcc

 QMAKE_LINK_C            = $$QMAKE_CC
 QMAKE_LINK_C_SHLIB      = $$QMAKE_CC

-QMAKE_CXX               = {chroot.arch_triplet}-g++
+QMAKE_CXX               = /usr/bin/{chroot.arch_triplet}-g++

 QMAKE_LINK              = $$QMAKE_CXX
 QMAKE_LINK_SHLIB        = $$QMAKE_CXX

Updated setup instructions

Create an armhf environment:

sudo ./cbqt ./armhf --create --verbose

Create a qmake wrapper that builds with this environment:

sudo ./cbqt ./armhf --qmake -o /usr/local/bin/qmake-armhf

Install the build-dependencies that you need:

# Note: :arch is added automatically to package names if no arch is explicitly specified
sudo ./cbqt ./armhf --install libqt5svg5-dev libmosquittopp-dev qtwebengine5-dev

Build with qmake

Use qmake-armhf instead of qmake and it works perfectly:

qmake-armhf -makefile
make

Set up Qt Creator

Configure a new Kit in Qt Creator:

  1. Tools/Options, then Kits, then Add
  2. Name: armhf (or anything you like)
  3. In the Qt Versions tab, click Add then set the path of the new Qt to /usr/local/bin/qmake-armhf. Click Apply.
  4. Back in the Kits, select the Qt version you just created in the Qt version field
  5. In Compilers, select the ARM versions of GCC. If they do not appear, install crossbuild-essential-armhf, then in the Compilers tab click Re-detect and then Apply to make them available for selection
  6. Dismiss the dialog with "OK": the new kit is ready

Now you can choose the default kit to build and run locally, and the armhf kit for remote cross-development.

I tried looking at sdktool to automate this step, but it requires a nontrivial amount of work to do reliably, so these manual instructions will have to do.

Credits

This has been done as part of my work with Truelite.

Systemd containers with unittest

This is part of a series of posts on ideas for an ansible-like provisioning system, implemented in Transilience.

Unit testing some parts of Transilience, like the apt and systemd actions, or remote Mitogen connections, can really use a containerized system for testing.

To have that, I reused my work on nspawn-runner to build a simple and very fast system of ephemeral containers, with minimal dependencies, based on systemd-nspawn and btrfs snapshots.

Setup

To be able to use systemd-nspawn --ephemeral, the chroots need to be btrfs subvolumes. If you are not running on a btrfs filesystem, you can create one just for running the tests, even on a file:

fallocate -l 1.5G testfile
/usr/sbin/mkfs.btrfs testfile
sudo mount -o loop testfile test_chroots/

I created a script to set up the test environment; here is an extract:

mkdir -p test_chroots

cat << EOF > "test_chroots/CACHEDIR.TAG"
Signature: 8a477f597d28d172789f06886806bc55
# chroots used for testing transilience, can be regenerated with make-test-chroot
EOF

btrfs subvolume create test_chroots/buster
eatmydata debootstrap --variant=minbase --include=python3,dbus,systemd buster test_chroots/buster

CACHEDIR.TAG is a nice trick to tell backup software not to bother backing up the contents of this directory, since it can be easily regenerated.

eatmydata is optional, and it speeds up debootstrap quite a bit.

Running unittest with sudo

Here's a simple helper to drop root as soon as possible, and regain it only when needed. Note that it needs $SUDO_UID and $SUDO_GID, which sudo sets, to know which user to drop into:

import contextlib
import os


class ProcessPrivs:
    """
    Drop root privileges and regain them only when needed
    """
    def __init__(self):
        self.orig_uid, self.orig_euid, self.orig_suid = os.getresuid()
        self.orig_gid, self.orig_egid, self.orig_sgid = os.getresgid()

        if "SUDO_UID" not in os.environ:
            raise RuntimeError("Tests need to be run under sudo")

        self.user_uid = int(os.environ["SUDO_UID"])
        self.user_gid = int(os.environ["SUDO_GID"])

        self.dropped = False

    def drop(self):
        """
        Drop root privileges
        """
        if self.dropped:
            return
        os.setresgid(self.user_gid, self.user_gid, 0)
        os.setresuid(self.user_uid, self.user_uid, 0)
        self.dropped = True

    def regain(self):
        """
        Regain root privileges
        """
        if not self.dropped:
            return
        os.setresuid(self.orig_suid, self.orig_suid, self.user_uid)
        os.setresgid(self.orig_sgid, self.orig_sgid, self.user_gid)
        self.dropped = False

    @contextlib.contextmanager
    def root(self):
        """
        Regain root privileges for the duration of this context manager
        """
        if not self.dropped:
            yield
        else:
            self.regain()
            try:
                yield
            finally:
                self.drop()

    @contextlib.contextmanager
    def user(self):
        """
        Drop root privileges for the duration of this context manager
        """
        if self.dropped:
            yield
        else:
            self.drop()
            try:
                yield
            finally:
                self.regain()


privs = ProcessPrivs()
privs.drop()

As soon as this module is loaded, root privileges are dropped, and can be regained for as little as possible using a handy context manager:

   with privs.root():
       subprocess.run(["systemd-run", ...], check=True, capture_output=True)

Using the chroot from test cases

The infrastructure to set up and tear down an ephemeral machine is relatively simple, once one has worked out the nspawn incantations:

import atexit
import logging
import os
import shlex
import subprocess
import uuid
from typing import Dict, Optional

log = logging.getLogger(__name__)


class Chroot:
    """
    Manage an ephemeral chroot
    """
    running_chroots: Dict[str, "Chroot"] = {}

    def __init__(self, name: str, chroot_dir: Optional[str] = None):
        self.name = name
        if chroot_dir is None:
            self.chroot_dir = self.get_chroot_dir(name)
        else:
            self.chroot_dir = chroot_dir
        self.machine_name = f"transilience-{uuid.uuid4()}"

    def start(self):
        """
        Start nspawn on this given chroot.

        The systemd-nspawn command is run contained into its own unit using
        systemd-run
        """
        unit_config = [
            'KillMode=mixed',
            'Type=notify',
            'RestartForceExitStatus=133',
            'SuccessExitStatus=133',
            'Slice=machine.slice',
            'Delegate=yes',
            'TasksMax=16384',
            'WatchdogSec=3min',
        ]

        cmd = ["systemd-run"]
        for c in unit_config:
            cmd.append(f"--property={c}")

        cmd.extend((
            "systemd-nspawn",
            "--quiet",
            "--ephemeral",
            f"--directory={self.chroot_dir}",
            f"--machine={self.machine_name}",
            "--boot",
            "--notify-ready=yes"))

        log.info("%s: starting machine using image %s", self.machine_name, self.chroot_dir)

        log.debug("%s: running %s", self.machine_name, " ".join(shlex.quote(c) for c in cmd))
        with privs.root():
            subprocess.run(cmd, check=True, capture_output=True)
        log.debug("%s: started", self.machine_name)
        self.running_chroots[self.machine_name] = self

    def stop(self):
        """
        Stop the running ephemeral containers
        """
        cmd = ["machinectl", "terminate", self.machine_name]
        log.debug("%s: running %s", self.machine_name, " ".join(shlex.quote(c) for c in cmd))
        with privs.root():
            subprocess.run(cmd, check=True, capture_output=True)
        log.debug("%s: stopped", self.machine_name)
        del self.running_chroots[self.machine_name]

    @classmethod
    def create(cls, chroot_name: str) -> "Chroot":
        """
        Start an ephemeral machine from the given master chroot
        """
        res = cls(chroot_name)
        res.start()
        return res

    @classmethod
    def get_chroot_dir(cls, chroot_name: str):
        """
        Locate a master chroot under test_chroots/
        """
        chroot_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "test_chroots", chroot_name))
        if not os.path.isdir(chroot_dir):
            raise RuntimeError(f"{chroot_dir} does not exist or is not a chroot directory")
        return chroot_dir


# We need to use atexit, because unittest won't run
# tearDown/tearDownClass/tearDownModule methods in case of KeyboardInterrupt,
# and we need to make sure to terminate the nspawn containers at exit
@atexit.register
def cleanup():
    # Use a list to prevent changing running_chroots during iteration
    for chroot in list(Chroot.running_chroots.values()):
        chroot.stop()

And here's a TestCase mixin that starts a containerized system and opens a Mitogen connection to it:

class ChrootTestMixin:
    """
    Mixin to run tests over a setns connection to an ephemeral systemd-nspawn
    container running one of the test chroots
    """
    chroot_name = "buster"

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        import mitogen.master
        from transilience.system import Mitogen
        cls.broker = mitogen.master.Broker()
        cls.router = mitogen.master.Router(cls.broker)
        cls.chroot = Chroot.create(cls.chroot_name)
        with privs.root():
            cls.system = Mitogen(
                    cls.chroot.name, "setns", kind="machinectl",
                    python_path="/usr/bin/python3",
                    container=cls.chroot.machine_name, router=cls.router)

    @classmethod
    def tearDownClass(cls):
        super().tearDownClass()
        cls.system.close()
        cls.broker.shutdown()
        cls.chroot.stop()

Running tests

Once the tests are set up, everything goes on as normal, except that one needs to run nose2 with sudo:

sudo nose2-3

Spin-up time for containers is pretty fast, and the tests drop root as soon as possible, regaining it only for as little as needed.

Also, the dependencies for all this are minimal and available on most systems, and the setup instructions seem pretty straightforward.

Building a Transilience playbook in a zipapp

This is part of a series of posts on ideas for an ansible-like provisioning system, implemented in Transilience.

Mitogen is a great library, but scarily complicated, and I've been wondering how hard it would be to make alternative connection methods for Transilience.

Here's a wild idea: can I package a whole Transilience playbook, plus dependencies, in a zipapp, then send the zipapp to the machine to be provisioned, and run it locally?

It turns out I can.

Creating the zipapp

This is somewhat hackish, but until I can rely on Python 3.9's improved importlib.resources module, I cannot think of a better way:

    def zipapp(self, target: str, interpreter=None):
        """
        Bundle this playbook into a self-contained zipapp
        """
        import zipapp
        import jinja2
        import transilience
        if interpreter is None:
            interpreter = sys.executable

        if getattr(transilience.__loader__, "archive", None):
            # Recursively iterating module directories requires Python 3.9+
            raise NotImplementedError("Cannot currently create a zipapp from a zipapp")

        with tempfile.TemporaryDirectory() as workdir:
            # Copy transilience
            shutil.copytree(os.path.dirname(__file__), os.path.join(workdir, "transilience"))
            # Copy jinja2
            shutil.copytree(os.path.dirname(jinja2.__file__), os.path.join(workdir, "jinja2"))
            # Copy argv[0] as __main__.py
            shutil.copy(sys.argv[0], os.path.join(workdir, "__main__.py"))
            # Copy argv[0]/roles
            role_dir = os.path.join(os.path.dirname(sys.argv[0]), "roles")
            if os.path.isdir(role_dir):
                shutil.copytree(role_dir, os.path.join(workdir, "roles"))
            # Turn everything into a zipapp
            zipapp.create_archive(workdir, target, interpreter=interpreter, compressed=True)

Since the zipapp contains not just the playbook, the roles, and the roles' assets, but also Transilience and Jinja2, it can run on any system that has a Python 3.7+ interpreter, and nothing else!

I added it to the standard set of playbook command line options, so any Transilience playbook can turn itself into a self-contained zipapp:

$ ./provision --help
usage: provision [-h] [-v] [--debug] [-C] [--local LOCAL]
                 [--ansible-to-python role | --ansible-to-ast role | --zipapp file.pyz]
[...]
  --zipapp file.pyz     bundle this playbook in a self-contained executable
                        python zipapp

Loading assets from the zipapp

I had to create ZipFile varieties of some bits of infrastructure in Transilience, to load templates, files, and Ansible YAML files from zip files.

You can see above a way to detect if a module is loaded from a zipfile: check if the module's __loader__ attribute has an archive attribute.

Here's a Jinja2 template loader that looks into a zip:

import os
import zipfile

import jinja2


class ZipLoader(jinja2.BaseLoader):
    def __init__(self, archive: zipfile.ZipFile, root: str):
        self.zipfile = archive
        self.root = root

    def get_source(self, environment: jinja2.Environment, template: str):
        path = os.path.join(self.root, template)
        with self.zipfile.open(path, "r") as fd:
            source = fd.read().decode()
        return source, None, lambda: True
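
Hypothetically, it can be used like this (the archive layout and template names are made up for the example):

import sys
import zipfile

import jinja2

# A zipapp can be opened as a regular zip file, interpreter line and all
archive = zipfile.ZipFile(sys.argv[0], "r")
env = jinja2.Environment(loader=ZipLoader(archive, "roles/mailserver/templates"))
template = env.get_template("dovecot.conf")
print(template.render(postmaster="enrico"))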

I also created a FileAsset abstract interface to represent a local file, and had Role.lookup_file return an appropriate instance:

    def lookup_file(self, path: str) -> "FileAsset":
        """
        Resolve a pathname inside the place where the role assets are stored.
        Returns a FileAsset instance for the file
        """
        if self.role_assets_zipfile is not None:
            return ZipFileAsset(self.role_assets_zipfile, os.path.join(self.role_assets_root, path))
        else:
            return LocalFileAsset(os.path.join(self.role_assets_root, path))
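
The interface might look roughly like this, as a sketch based on the description rather than Transilience's exact code:

import zipfile
from abc import ABC, abstractmethod


class FileAsset(ABC):
    """
    Abstract access to a local file, whether it lives on disk or inside
    a zipfile
    """
    @abstractmethod
    def read(self) -> bytes:
        """Return the contents of the file"""


class LocalFileAsset(FileAsset):
    def __init__(self, path: str):
        self.path = path

    def read(self) -> bytes:
        with open(self.path, "rb") as fd:
            return fd.read()


class ZipFileAsset(FileAsset):
    def __init__(self, archive: zipfile.ZipFile, path: str):
        self.archive = archive
        self.path = path

    def read(self) -> bytes:
        return self.archive.read(self.path)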

An interesting side effect of having smarter local file accessors is that I can cache the contents of small files and transmit them to the remote host together with the other action parameters, saving a potential network round trip for each builtin.copy action that has a small source.

The result

The result is kind of fun:

$ time ./provision --zipapp test.pyz

real    0m0.203s
user    0m0.174s
sys 0m0.029s

$ time scp test.pyz root@test:
test.pyz                                                                                                         100%  528KB 388.9KB/s   00:01

real    0m1.576s
user    0m0.010s
sys 0m0.007s

And on the remote:

# time ./test.pyz --local=test
2021-06-29 18:05:41,546 test: [connected 0.000s]
[...]
2021-06-29 18:12:31,555 test: 88 total actions in 0.00ms: 87 unchanged, 0 changed, 1 skipped, 0 failed, 0 not executed.

real    0m0.979s
user    0m0.783s
sys 0m0.172s

Compare with a Mitogen run:

$ time PYTHONPATH=../transilience/ ./provision
2021-06-29 18:13:44 test: [connected 0.427s]
[...]
2021-06-29 18:13:46 test: 88 total actions in 2.50s: 87 unchanged, 0 changed, 1 skipped, 0 failed, 0 not executed.

real    0m2.697s
user    0m0.856s
sys 0m0.042s

From a single test run, not a good benchmark, it's 0.203 + 1.576 + 0.979 = 2.758s with the zipapp and 2.697s with Mitogen. Even if I've been lucky, it's a similar order of magnitude.

What can I use this for?

This was mostly a fun hack.

It could however be the basis for a Fabric-based connector, or a clusterssh-based connector, or for bundling a Transilience playbook into an installation image, or to add a provisioning script to the boot partition of a Raspberry Pi. It looks like an interesting trick to have up one's sleeve.

One could even build an Ansible-based connector(!) in which a simple Ansible playbook, with no facts gathering, is used to build the zipapp, push it to remote systems and run it. That would be the wackiest way of speeding up Ansible, ever!

Next: using Systemd containers with unittest, for Transilience's test suite.

Ansible conditionals in Transilience

This is part of a series of posts on ideas for an ansible-like provisioning system, implemented in Transilience.

I thought a lot of what I managed to do so far with Transilience would be impossible, but then here I am. How about Ansible conditionals? Those must be impossible, right?

Let's give it a try.

A quick recon of Ansible sources

Looking into Ansible's sources, I found that when: expressions are lists of strings that get AND-ed together.

The expressions are Jinja2 expressions that Ansible pastes into a mini-template, renders, and checks the string that comes out.

A quick recon of Jinja2

Jinja2 has a convenient function (jinja2.Environment.compile_expression) that compiles a template snippet into a Python function.

It can also parse a template into an AST that can be inspected in various ways.

Evaluating Ansible conditionals in Python

Environment.compile_expression seems to do precisely what we need here, straight out of the box.

There is an issue with the concept of "defined": for Ansible it seems to mean "the variable is present in the template context". In Transilience instead, all variables are fields in the Role dataclass, and can be None when not set.

This means that we need to remove variables that are set to None before passing the parameters to the compiled Jinja2 expression:

from typing import Any, Callable, Dict


class Conditional:
    """
    An Ansible conditional expression
    """
    def __init__(self, engine: template.Engine, body: str):
        # Original unparsed expression
        self.body: str = body
        # Expression compiled to a callable
        self.expression: Callable = engine.env.compile_expression(body)

    def evaluate(self, ctx: Dict[str, Any]):
        ctx = {name: val for name, val in ctx.items() if val is not None}
        return self.expression(**ctx)
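
Used, for example, like this (assuming an engine that wraps a jinja2.Environment):

cond = Conditional(engine, "is_test is defined and is_test")
cond.evaluate({"is_test": True, "debug": None})   # True
cond.evaluate({"is_test": None, "debug": None})   # False: is_test counts as undefined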

Generating Python code

Transilience does not only support running Ansible roles, but also converting them to Python code. I can keep this up by traversing the Jinja2 AST and generating Python expressions.

The code is straightforward enough that I can throw in a bit of pattern matching to make some expressions more idiomatic for Python:

import jinja2.parser
from jinja2 import nodes


class Conditional:
    def __init__(self, engine: template.Engine, body: str):
        ...
        parser = jinja2.parser.Parser(engine.env, body, state='variable')
        self.jinja2_ast: nodes.Node = parser.parse_expression()

    def get_python_code(self) -> str:
        return to_python_code(self.jinja2_ast)


def to_python_code(node: nodes.Node) -> str:
    if isinstance(node, nodes.Name):
        if node.ctx == "load":
            return f"self.{node.name}"
        else:
            raise NotImplementedError(f"jinja2 Name nodes with ctx={node.ctx!r} are not supported: {node!r}")
    elif isinstance(node, nodes.Test):
        if node.name == "defined":
            return f"{to_python_code(node.node)} is not None"
        elif node.name == "undefined":
            return f"{to_python_code(node.node)} is None"
        else:
            raise NotImplementedError(f"jinja2 Test nodes with name={node.name!r} are not supported: {node!r}")
    elif isinstance(node, nodes.Not):
        if isinstance(node.node, nodes.Test):
            # Special case match well-known structures for more idiomatic Python
            if node.node.name == "defined":
                return f"{to_python_code(node.node.node)} is None"
            elif node.node.name == "undefined":
                return f"{to_python_code(node.node.node)} is not None"
        elif isinstance(node.node, nodes.Name):
            return f"not {to_python_code(node.node)}"
        return f"not ({to_python_code(node.node)})"
    elif isinstance(node, nodes.Or):
        return f"({to_python_code(node.left)} or {to_python_code(node.right)})"
    elif isinstance(node, nodes.And):
        return f"({to_python_code(node.left)} and {to_python_code(node.right)})"
    else:
        raise NotImplementedError(f"jinja2 {node.__class__} nodes are not supported: {node!r}")

Scanning for variables

Lastly, I can implement scanning conditionals for variable references to add as fields to the Role dataclass:

import jinja2.visitor
from typing import Sequence, Set


class FindVars(jinja2.visitor.NodeVisitor):
    def __init__(self):
        self.found: Set[str] = set()

    def visit_Name(self, node):
        if node.ctx == "load":
            self.found.add(node.name)


class Conditional:
    ...
    def list_role_vars(self) -> Sequence[str]:
        fv = FindVars()
        fv.visit(self.jinja2_ast)
        return fv.found
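
So that, hypothetically:

cond = Conditional(engine, "(is_test is defined and is_test) or debug is defined")
sorted(cond.list_role_vars())  # ['debug', 'is_test']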

The result in action

Take this simple Ansible task:

---
 - name: Example task
   file:
      state: touch
      path: /tmp/test
   when: (is_test is defined and is_test) or debug is defined

Run it through ./provision --ansible-to-python test and you get:

from __future__ import annotations
from typing import Any
from transilience import role
from transilience.actions import builtin, facts


@role.with_facts([facts.Platform])
class Role(role.Role):
    # Role variables used by templates
    debug: Any = None
    is_test: Any = None

    def all_facts_available(self):
        if ((self.is_test is not None and self.is_test)
                or self.debug is not None):
            self.add(
                builtin.file(path='/tmp/test', state='touch'),
                name='Example task')

Besides one harmless extra set of parentheses, what I wasn't sure would be possible is there, right there, staring at me with a mischievous grin.

Next: Building a Transilience playbook in a zipapp.

Parsing YAML

This is part of a series of posts on ideas for an Ansible-like provisioning system, implemented in Transilience.

The time has come for me to try and prototype if it's possible to load some Transilience roles from Ansible's YAML instead of Python.

The data models of Transilience and Ansible are not exactly the same. Some of the differences that come to mind:

  • Ansible has a big pot of global variables; Transilience has a well defined set of role-specific variables.
  • Roles in Ansible are little more than a chunk of playbook that one includes; Roles in Transilience are self-contained and isolated, support pipelined batches of tasks, and can use full Python logic.
  • Transilience does not have a template action: the equivalent is a copy action that uses the Role's rendering engine to render the template.
  • Handlers in Ansible are tasks identified by a name in a global namespace; handlers in Transilience are Roles, identified by their Python classes.

To simplify the work, I'll start from loading a single role out of Ansible, not an entire playbook.

TL;DR: scroll to the bottom of the post for the conclusion!

Loading tasks

The first problem in loading an Ansible task is to figure out which of the keys is the module name. I have so far failed to find precise reference documentation about which keywords are used to define a task, so I'm going by guesswork and, if needed, a look at Ansible's sources.

My first attempt goes by excluding all known non-module keywords:

        candidates = []
        for key in task_info.keys():
            if key in ("name", "args", "notify"):
                continue
            candidates.append(key)

        if len(candidates) != 1:
            raise RoleNotLoadedError(f"could not find a known module in task {task_info!r}")

        modname = candidates[0]
        if modname.startswith("ansible.builtin."):
            name = modname[16:]
        else:
            name = modname

This means that Ansible keywords like when or with will break the parsing, and that's fine, since they are not supported yet.

args seems to carry arguments to the module, when the module's main argument is not a dict, as may happen at least with the command module.

Task parameters

One can do all sorts of chaotic things to pass parameters to Ansible tasks: for example, string lists can be lists of strings or strings with comma-separated values; they can be preprocessed via Jinja2 templating; and they can be complex data structures that might contain strings that need Jinja2 preprocessing.

I ended up mapping the behaviours I encountered in an AST-like class hierarchy which includes recursive complex structures.

Variables

Variables look hard: Ansible has a big free messy cauldron of global variables, and Transilience needs a predefined list of per-role variables.

However, variables are mainly used inside Jinja2 templates, and Jinja2 can parse templates into an Abstract Syntax Tree and has useful methods to examine it.

Using that, I managed with reasonable effort to scan an Ansible role and generate a list of all the variables it uses! I can then use that list, filter out facts-specific names like ansible_domain, and use the rest to add variable definitions to the Transilience roles. That is exciting!
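
For the basic case, Jinja2 even ships a ready-made helper, jinja2.meta.find_undeclared_variables:

import jinja2
from jinja2 import meta

env = jinja2.Environment()
ast = env.parse("root: {{postmaster}}\n"
                "{% for name, dest in aliases.items() %}{{name}}: {{dest}}{% endfor %}")
# Loop-local names like 'name' and 'dest' are not reported
meta.find_undeclared_variables(ast)  # {'postmaster', 'aliases'}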

Handlers

Before loading tasks, I load handlers as one-action roles, and index them by name. When an Ansible task notifies a handler, I can then look up the roles I generated in the earlier pass by name, and I have all that I need.

Parsed Abstract Syntax Tree

Most of the results of all this parsing started looking like an AST, so I changed the rest of the prototype to generate an AST.

This means that, for a well defined subset of Ansible's YAML, there now exists a tool that is able to parse it into an AST and reason with it.

Transilience's playbooks gained a --ansible-to-ast option to parse an Ansible role and dump the resulting AST as JSON:

$ ./provision --help
usage: provision [-h] [-v] [--debug] [-C] [--ansible-to-python role]
                 [--ansible-to-ast role]

Provision my VPS

optional arguments:
[...]
  -C, --check           do not perform changes, but check if changes would be
                        needed
  --ansible-to-ast role
                        print the AST of the given Ansible role as understood
                        by Transilience

The result is extremely verbose, since every parameter is itself a node in the tree, but I find it interesting.

Here is, for example, a node for an Ansible task which has a templated parameter:

    {
      "node": "task",
      "action": "builtin.blockinfile",
      "parameters": {
        "path": {
          "node": "parameter",
          "type": "scalar",
          "value": "/etc/aliases"
        },
        "block": {
          "node": "parameter",
          "type": "template_string",
          "value": "root: {{postmaster}}\n{% for name, dest in aliases.items() %}\n{{name}}: {{dest}}\n{% endfor %}\n"
        }
      },
      "ansible_yaml": {
        "name": "configure /etc/aliases",
        "blockinfile": {},
        "notify": "reread /etc/aliases"
      },
      "notify": [
        "RereadEtcAliases"
      ]
    },

Here's a node for an Ansible template task converted to Transilience's model:

    {
      "node": "task",
      "action": "builtin.copy",
      "parameters": {
        "dest": {
          "node": "parameter",
          "type": "scalar",
          "value": "/etc/dovecot/local.conf"
        },
        "src": {
          "node": "parameter",
          "type": "template_path",
          "value": "dovecot.conf"
        }
      },
      "ansible_yaml": {
        "name": "configure dovecot",
        "template": {},
        "notify": "restart dovecot"
      },
      "notify": [
        "RestartDovecot"
      ]
    },

Executing

The first iteration of prototype code for executing parsed Ansible roles is a little exercise in closures and dynamically generated types:

    def get_role_class(self) -> Type[Role]:
        # If we have handlers, instantiate role classes for them
        handler_classes = {}
        for name, ansible_role in self.handlers.items():
            handler_classes[name] = ansible_role.get_role_class()

        # Create all the functions to start actions in the role
        start_funcs = []
        for task in self.tasks:
            start_funcs.append(task.get_start_func(handlers=handler_classes))

        # Function that calls all the 'Action start' functions
        def role_main(self):
            for func in start_funcs:
                func(self)

        if self.uses_facts:
            role_cls = type(self.name, (Role,), {
                "start": lambda host: None,
                "all_facts_available": role_main
            })
            role_cls = dataclass(role_cls)
            role_cls = with_facts(facts.Platform)(role_cls)
        else:
            role_cls = type(self.name, (Role,), {
                "start": role_main
            })
            role_cls = dataclass(role_cls)

        return role_cls

Now that the parsed Ansible role is a proper AST, I'm considering redesigning that using a generic Role class that works as an AST interpreter.

Generating Python

I maintain a library that can turn an invoice into Python code, and I have a convenient AST. I can't not generate Python code out of an Ansible role!

$ ./provision --help
usage: provision [-h] [-v] [--debug] [-C] [--ansible-to-python role]
                 [--ansible-to-ast role]

Provision my VPS

optional arguments:
[...]
  --ansible-to-python role
                        print the given Ansible role as Transilience Python
                        code
  --ansible-to-ast role
                        print the AST of the given Ansible role as understood
                        by Transilience

And will you look at this annotated extract:

$ ./provision --ansible-to-python mailserver
from __future__ import annotations
from typing import Any
from transilience import role
from transilience.actions import builtin, facts

# Role classes generated from Ansible handlers!
class ReloadPostfix(role.Role):
    def start(self):
        self.add(
            builtin.systemd(unit='postfix', state='reloaded'),
            name='reload postfix')


class RestartDovecot(role.Role):
    def start(self):
        self.add(
            builtin.systemd(unit='dovecot', state='restarted'),
            name='restart dovecot')


# The role, including a standard set of facts
@role.with_facts([facts.Platform])
class Role(role.Role):
    # These are the variables used by Jinja2 template files and strings. I need
    # to use Any, since Ansible variables are not typed
    aliases: Any = None
    myhostname: Any = None
    postmaster: Any = None
    virtual_domains: Any = None

    def all_facts_available(self):
        ...
        # A Jinja2 string inside a string list!
        self.add(
            builtin.command(
                argv=[
                    'certbot', 'certonly', '-d',
                    self.render_string('mail.{{ansible_domain}}'), '-n',
                    '--apache'
                ],
                creates=self.render_string(
                    '/etc/letsencrypt/live/mail.{{ansible_domain}}/fullchain.pem'
                )),
            name='obtain mail.* letsencrypt certificate')

        # A converted template task!
        self.add(
            builtin.copy(
                dest='/etc/dovecot/local.conf',
                src=self.render_file('templates/dovecot.conf')),
            name='configure dovecot',
            # Notify referring to the corresponding Role class!
            notify=RestartDovecot)

        # Referencing a variable collected from a fact!
        self.add(
            builtin.copy(dest='/etc/mailname', content=self.ansible_domain),
            name='configure /etc/mailname',
            notify=ReloadPostfix)
        ...

Conclusion

Transilience can load a (growing) subset of Ansible syntax, one role at a time, which contains:

  • All actions defined in Transilience's builtin.* namespace
  • Ansible's template module (without block_start_string, block_end_string, lstrip_blocks, newline_sequence, output_encoding, trim_blocks, validate, variable_end_string, variable_start_string)
  • Jinja2 templates in string parameters, even when present inside lists and dicts and nested lists and dicts
  • Variables from facts provided by transilience.actions.facts.Platform
  • Variables used in Jinja2 templates, both in strings and in files, provided by host vars, group vars, role parameters, and facts
  • Notify using handlers defined within the role. Notifying handlers from other roles is not supported, since roles in Transilience are self-contained

The role loader in Transilience now looks for YAML when it does not find a Python module, and runs it pipelined and fast!

There is code to generate Python code from an Ansible role: you can take an Ansible role, convert it to Python, and then work on it to add more complex logic, or clean it up for adding it to a library of reusable roles!

Next: Ansible conditionals

Transilience check mode

This is part of a series of posts on ideas for an ansible-like provisioning system, implemented in Transilience.

I added check mode to Transilience, to do everything except perform changes, like Ansible does:

$ ./provision --help
usage: provision [-h] [-v] [--debug] [-C] [--to-python role]

Provision my VPS

optional arguments:
  -h, --help        show this help message and exit
  -v, --verbose     verbose output
  --debug           verbose output
  -C, --check       do not perform changes, but check if changes would be   NEW!
                    needed                                                  NEW!

It was quite straightforward to add a new field to the base Action class, and tweak the implementations not to perform changes if it is True:

from dataclasses import dataclass, field
from typing import Any


# Shortcut function to annotate dataclass fields with documentation metadata
def doc(default: Any, doc: str, **kw):
    return field(default=default, metadata={"doc": doc}, **kw)


@dataclass
class Action:
    ...
    check: bool = doc(False, "when True, check if the action would perform changes, but do nothing")
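
An action implementation can then guard its changes on that field. A hypothetical sketch (set_changed is a made-up helper here, not Transilience's actual API):

import os


@dataclass
class Touch(Action):
    path: str = doc(None, "path of the file to create if missing")

    def run(self, system):
        if os.path.exists(self.path):
            return
        self.set_changed()  # record that a change is, or would be, needed
        if self.check:
            # Check mode: report the change without performing it
            return
        with open(self.path, "wb"):
            pass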

Like with Ansible, check mode takes about the same time as a normal run which does not perform changes.

Unlike Ansible, with Transilience this is actually pretty fast! ;)

Next step: parsing YAML!

Playbooks, host vars, group vars

This is part of a series of posts on ideas for an ansible-like provisioning system, implemented in Transilience.

Host variables

Ansible allows specifying per-host variables, and I like that. Let's try to model a host as a dataclass:

from dataclasses import dataclass, field
from typing import Any, Dict

import transilience.system
from transilience.system import System


@dataclass
class Host:
    """
    A host to be provisioned.
    """
    name: str
    type: str = "Mitogen"
    args: Dict[str, Any] = field(default_factory=dict)

    def _make_system(self) -> System:
        cls = getattr(transilience.system, self.type)
        return cls(self.name, **self.args)

This should have enough information to create a connection to the host, and can be subclassed to add host-specific dataclass fields.

Host variables can then be provided as default constructor arguments when instantiating Roles:

    # Add host/group variables to role constructor args
    host_fields = {f.name: f for f in fields(host)}
    for field in fields(role_cls):
        if field.name in host_fields:
            role_kwargs.setdefault(field.name, getattr(host, field.name))

    role = role_cls(**role_kwargs)

Group variables

It looks like I can model groups and group variables by using dataclasses as mixins:

@dataclass
class Webserver:
    server_name: str = "www.example.org"

@dataclass
class Srv1(Webserver):
    ...

Doing things like filtering all hosts that are members of a given group can be done with a simple isinstance or issubclass test.
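
For example, a hypothetical helper along these lines:

from typing import List, Type


def hosts_in_group(hosts: List[Host], group: Type) -> List[Host]:
    """Group membership is just an isinstance test"""
    return [h for h in hosts if isinstance(h, group)]


# e.g.: webservers = hosts_in_group(inventory, Webserver)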

Playbooks

So far Transilience executes on one host at a time, while Ansible can execute on a whole host inventory.

Since running a playbook is mostly I/O bound, we can parallelize hosts using threads, without worrying too much about the performance impact of the GIL.

Let's introduce a Playbook class as the main entry point for a playbook:

class Playbook:
    def setup_logging(self):
        ...

    def make_argparser(self):
        description = inspect.getdoc(self)
        if not description:
            description = "Provision systems"

        parser = argparse.ArgumentParser(description=description)
        parser.add_argument("-v", "--verbose", action="store_true",
                            help="verbose output")
        parser.add_argument("--debug", action="store_true",
                            help="verbose output")
        return parser

    def hosts(self) -> Sequence[Host]:
        """
        Generate a sequence with all the systems on which the playbook needs to run
        """
        return ()

    def start(self, runner: Runner):
        """
        Start the playbook on the given runner.

        This method is called once for each host returned by hosts()
        """
        raise NotImplementedError(f"{self.__class__.__name__}.start is not implemented")

    def main(self):
        parser = self.make_argparser()
        self.args = parser.parse_args()
        self.setup_logging()

        # Start all the runners in separate threads
        threads = []
        for host in self.hosts():
            runner = Runner(host)
            self.start(runner)
            t = threading.Thread(target=runner.main)
            threads.append(t)
            t.start()

        # Wait for all threads to complete
        for t in threads:
            t.join()

And an actual playbook will now look like something like this:

from dataclasses import dataclass
import sys
from transilience import Playbook, Host


@dataclass
class MyServer(Host):
    srv_root: str = "/srv"
    site_admin: str = "enrico@enricozini.org"


class VPS(Playbook):
    """
    Provision my VPS
    """

    def hosts(self):
        yield MyServer(name="server", args={
            "method": "ssh",
            "hostname": "host.example.org",
            "username": "root",
        })

    def start(self, runner):
        runner.add_role("fail2ban")
        runner.add_role("prosody")
        runner.add_role(
                "mailserver",
                postmaster="enrico",
                myhostname="mail.example.org",
                aliases={...})


if __name__ == "__main__":
    sys.exit(VPS().main())

It looks quite straightforward to me, works on any number of hosts, and has a proper command line interface:

./provision --help
usage: provision [-h] [-v] [--debug]

Provision my VPS

optional arguments:
  -h, --help     show this help message and exit
  -v, --verbose  verbose output
  --debug        verbose output

Next step: check mode!

Reimagining Ansible variables

This is part of a series of posts on ideas for an ansible-like provisioning system, implemented in Transilience.

While experimenting with Transilience, I've been giving some thought about Ansible variables.

My gripes

I like the possibility to define host and group variables, and I like to have a set of variables that are autodiscovered on the target systems.

I do not like to have everything all blended in a big bucket of global variables.

Let's try some more prototyping.

My fiddlings

First, Role classes could become dataclasses, too, and declare the variables and facts that they intend to use (typed, even!):

class Role(role.Role):
    """
    Postfix mail server configuration
    """
    # Postmaster username
    postmaster: str = None
    # Public name of the mail server
    myhostname: str = None
    # Email aliases defined on this mail server
    aliases: Dict[str, str] = field(default_factory=dict)

Using dataclasses.asdict() I immediately gain context variables for rendering Jinja2 templates:

class Role:
    # [...]
    def render_file(self, path: str, **kwargs):
        """
        Render a Jinja2 template from a file, using as context all Role fields,
        plus the given kwargs.
        """
        ctx = asdict(self)
        ctx.update(kwargs)
        return self.template_engine.render_file(path, ctx)

    def render_string(self, template: str, **kwargs):
        """
        Render a Jinja2 template from a string, using as context all Role fields,
        plus the given kwargs.
        """
        ctx = asdict(self)
        ctx.update(kwargs)
        return self.template_engine.render_string(template, ctx)

I can also model results from fact gathering into dataclass members:

# From ansible/module_utils/facts/system/platform.py
@dataclass
class Platform(Facts):
    """
    Facts from the platform module
    """
    ansible_system: Optional[str] = None
    ansible_kernel: Optional[str] = None
    ansible_kernel_version: Optional[str] = None
    ansible_machine: Optional[str] = None
    # [...]
    ansible_userspace_architecture: Optional[str] = None
    ansible_machine_id: Optional[str] = None

    def summary(self):
        return "gather platform facts"

    def run(self, system: transilience.system.System):
        super().run(system)
        # ... collect facts

I like that this way, one can explicitly declare what variables a Facts action will collect, and what variables a Role needs.

At this point, I can add machinery to allow a Role to declare what Facts it needs, and automatically have the fields from the Facts class added to the Role class. Then, when facts are gathered, I can make sure that their fields get copied over to the Role classes that use them.

In a way, variables become role-scoped, and Facts subclasses can be used like some kind of Role mixin, that contributes only field members:

# Postfix mail server configuration
@role.with_facts([actions.facts.Platform])
class Role(role.Role):
    # Postmaster username
    postmaster: str = None
    # Public name of the mail server
    myhostname: str = None
    # Email aliases defined on this mail server
    aliases: Dict[str, str] = field(default_factory=dict)
    # All fields from actions.facts.Platform are inherited here!

    def have_facts(self, facts):
        # self.ansible_domain comes from actions.facts.Platform
        self.add(builtin.command(
            argv=["certbot", "certonly", "-d", f"mail.{self.ansible_domain}", "-n", "--apache"],
            creates=f"/etc/letsencrypt/live/mail.{self.ansible_domain}/fullchain.pem"
        ), name="obtain mail.* certificate")

        # the template context will have the Role variables, plus the variables
        # of all the Facts the Role uses
        with self.notify(ReloadPostfix):
            self.add(builtin.copy(
                dest="/etc/postfix/main.cf",
                content=self.render_file("roles/mailserver/templates/main.cf"),
            ), name="configure /etc/postfix/main.cf")

One can also fill in variables when instantiating Roles, making parameterized generic Roles possible and easy:

    runner.add_role(
            "mailserver",
            postmaster="enrico",
            myhostname="mail.enricozini.org",
            aliases={
                "me": "enrico",
            },
    )

Outcomes

I like where this is going: having well defined variables for facts and roles means that the variables that come into play can be explicitly defined, well known, and documented.

I think this design lends itself quite well to role reuse:

  • Roles can use variables without risking interfering with each other.
  • Variables from facts can have well defined meanings across roles.
  • Roles are classes, and can easily be made inheritable.

I have a feeling that, this way, it may be much easier to create generic libraries of Roles that one can reuse to compose complex playbooks.

Since roles are just Python modules, we even already know how to package and distribute them!

Next step: Playbooks, host vars, group vars.
