GitOps controllers: a design and a pattern

I’ve talked before about how Kubernetes is a kind of equational system. In a Kubernetes system, you alter the object declarations in the database, and Kubernetes takes action to make the running objects match the declarations, maintaining an equivalence between the declarations and the system.

Using Flux, this equivalence is extended to source control – you put the declarations in files in git, and Flux, along with Kubernetes, acts to make the running objects match what the files say. Flux is just a mechanism for maintaining the extra leg of the equivalence:

system == declarations == git

You could regard that as the fundamental equation of gitops.

In Kubernetes, there are types and processes that deal with higher-level declarations, and it’s possible to add your own higher-level types and controllers. Is there an analogue in gitops to these controllers?

What changes when you use git

A regular Kubernetes controller observes some kinds of objects, and takes action by updating those or other objects. The natural extension to gitops is this simple formulation:

A gitops controller commits changes to git according to observations of the cluster state.

Most of the time, a Kubernetes controller takes some high-level declaration and implements it in terms of lower-level objects. For example, the Deployment controller observes Deployment objects, and updates ReplicaSet objects to keep the right number of pods running, do rolling updates, and so on.

In those cases, there’s no work for the gitops controller to do – you can just commit the high-level declaration, and let the usual controllers do their work.

The question is really about extending Kubernetes. I can think of three reasons to add types and controllers:

  1. You want to alter the system based on higher-order observations, e.g., the load on the cluster (something like what the HorizontalPodAutoscaler does);
  2. You want to affect external systems based on observations of the objects in the cluster – this is more or less the (original, narrow) definition of an operator;
  3. You want to affect the cluster based on observations of external systems.

Of these, the first can be tricky to map into the gitops world. In some cases it is similar to the third item, discussed below, with higher-order observations taking the place of external systems, and the techniques will surely be similar. In other cases though, such as the HPA, it’s more like a special case of equivalence where writing all changes to git isn’t appropriate, and some other mechanism is needed (I have seen a decent suggestion though).

The second is already well-served in gitops, because it amounts to adding another type of declaration, and dealing with arbitrary types of declaration doesn’t go outside the mechanism already described.

That last kind of extension is demonstrated by Flux itself, with its image update automation. This feature observes which images are being used in the cluster, scans image registries (the external systems), and updates git so that those images are at their most recent versions. Abstractly, it observes resources within the cluster, consults external systems, and takes action by changing declarations in git.

For a controller that works like that, but still follows the formulation given above, you need an extra ingredient: something to reflect the external system as objects in the cluster (a “reflection controller”). Flux doesn’t do this; it maintains a database disjoint from Kubernetes' database. I will show how it would play out if it did work this way, below.

Image update automation

Here is a design sketch of a component that does the same things as Flux’s update automation, but fits the “gitops controller” definition.

The ImageRepository type declares that a particular image repository – say, docker.io/fluxcd/flux – should be scanned.

There can be thousands of individual images in a repository, and it doesn’t make sense to try to record them all in Kubernetes' database (either as individual objects, or in a data field in a Kubernetes object). So these objects will just record the scanning status, so that it can be examined and monitored; the full image data will be made available by other means (e.g., via the controller’s own HTTP API).
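As a sketch (the API group, version, and field names here are all invented for illustration, not a published API), an ImageRepository declaration and its status might look like:

```yaml
# Hypothetical API group and fields -- illustrative only
apiVersion: image.example.com/v1alpha1
kind: ImageRepository
metadata:
  name: flux
spec:
  image: docker.io/fluxcd/flux   # the repository to scan
  scanInterval: 10m              # how often to rescan for new tags
status:
  lastScanTime: "2020-06-26T10:00:00Z"
  tagCount: 1843                 # a summary only; the full tag data
                                 # is served by other means
```

Note that the status carries only a summary of the scan, in line with keeping the bulk of the data out of Kubernetes' database.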

The important piece of data for the update automation is the most recent image, according to some policy. Since workloads might refer to the same image but use different policies, another type, ImagePolicy, declares a specific (update) policy for an image repository – semver, or filtering out certain tags, for example – and refers to the ImageRepository in question.

A reflection controller uses the above declarations to keep each ImagePolicy current with the latest image that matches the policy. How it actually does this might depend on the policy, and may require the controller to keep a cache off to the side (as Flux’s automation does).
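Sketching this too (again with invented names), an ImagePolicy might carry the policy in its spec, while the reflection controller writes the winning image into its status:

```yaml
# Hypothetical API group and fields -- illustrative only
apiVersion: image.example.com/v1alpha1
kind: ImagePolicy
metadata:
  name: flux-stable
spec:
  imageRepositoryRef:
    name: flux                    # the ImageRepository to draw from
  policy:
    semver:
      range: ">=1.18.0"           # e.g., follow stable releases
status:
  latestImage: docker.io/fluxcd/flux:1.19.0   # kept current by the
                                              # reflection controller
```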

Lastly, the place where the action happens. To enrol a workload in automation, the ImageUpdateAutomation type ties a workload to one or more policies (in each instance giving the particular container, or path to an image field, to be updated).

A gitops controller reconciles the git repository with the declarations above, by examining each ImageUpdateAutomation, finding its targets amongst the files in git, and updating them to the most recent image as given by the ImagePolicy.
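A sketch of this last piece, with invented field names: each target names a file in the git repository, a path to the image field within it, and the policy that supplies the image:

```yaml
# Hypothetical API group and fields -- illustrative only
apiVersion: image.example.com/v1alpha1
kind: ImageUpdateAutomation
metadata:
  name: app-automation
spec:
  targets:
    - file: deploy/app.yaml      # a file within the git repository
      fieldPath: spec.template.spec.containers[0].image
      imagePolicyRef:
        name: flux-stable        # supplies the most recent image
```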

As mentioned this is a sketch of a design, and not intended to be backward-compatible with Flux. There are many things present in Flux’s image update feature that are missing here:

  • the set of images used by workloads is discovered automatically
  • the list of images, ordered according to policy, can be requested for a workload (e.g., the ten most recent images for each container in such and such a deployment)
  • the policies are declared in a workload definition using annotations
  • there’s a command-line tool for selecting workloads and images, and doing an update ad-hoc
  • each update, either automated or requested, also records its particulars as a git note tied to the commit it makes, which is used to send a notification when the commit is applied.

Most of those can be covered off with compatibility-bridging components that interpret the annotations given, and can look at the ImageRepository cache to answer queries or do impromptu updates. An ImageUpdateJob would be a way to bring the ad-hoc releases into the controller’s purview.

Some might be deprecated in favour of more modern mechanisms (I am thinking of the notifications).

The general pattern

The design above arrives back at the central equation of gitops: update the declarations given in git in order to effect changes. Speculatively, I think there is a general pattern in how it’s arranged.

The ImageRepository and ImagePolicy types and controller reflect an external system into the cluster. The ImageUpdateAutomation type specifies a particular job to do with that information. Its controller runs a reconciliation loop similar to that in Kubernetes' own controllers, with the reconciling actions being enacted on a git repository rather than Kubernetes' database.

The general pattern is:

  • reflect data about external systems into the cluster
  • create a view on the data, with a policy object
  • use the policy to calculate updates and apply them to git

Why keep these separate? For instance, why not provide the policy in the same object as the automation?

The reason is that separate objects can be remixed to do other tasks – for example, ImagePolicy objects could be used as the basis for a user interface, or to inform another kind of automation not anticipated by the design (updating the values of a Helm chart, say). Similarly, ImagePolicy objects are separate from the reflected ImageRepository objects, because the latter can be used in their own right; for example, as the access point for ad-hoc querying of image repository data.

Open questions

How does the gitops controller get access to the git repository?

It could just be given the URL and credentials, as part of the ImageUpdateAutomation object. Following the pattern given though, it would use a GitRepository [1] object as the access point to the external system (the git repository). In this case, there’s no need for a policy object, since it doesn’t need a view onto a git repo, just access.
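As a sketch (names invented, as before), the automation would refer to a GitRepository object that bundles the URL and credentials:

```yaml
# Hypothetical API groups and fields -- illustrative only
apiVersion: source.example.com/v1alpha1
kind: GitRepository
metadata:
  name: app-config
spec:
  url: ssh://git@example.com/org/app-config
  secretRef:
    name: git-credentials        # SSH key or token, for push access
---
apiVersion: image.example.com/v1alpha1
kind: ImageUpdateAutomation
metadata:
  name: app-automation
spec:
  gitRepositoryRef:
    name: app-config             # just an access point; no policy needed
```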

The ImageUpdateAutomation objects refer to things in the git repo; shouldn’t they be in the repo?

Yes, arguably. Since they refer to making updates in files, rather than resources in the cluster, you might expect them to live with the files. On the other hand, the controller is driven by resources in the cluster, and the secondary resources ImageRepository and ImagePolicy rightly belong in the cluster, where they can be accessible to cluster processes too.

A compromise might be to declare the basic fact of automation as an object, and leave the particulars (e.g., the targets) to be specified amongst, or in, the files.

A related concern is that an automation can be left hanging if its targets are removed from the git repository. Specifying the targets in the files themselves gets around this problem, since the specification goes away when the target goes away (or if in a separate file, at least it’s in the same place).
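One way this might look (the marker syntax is invented for illustration): the workload file in git carries its own marker tying an image field to a policy, so the specification disappears along with the target:

```yaml
# deploy/app.yaml in the git repository -- marker syntax is hypothetical
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  selector:
    matchLabels: {app: app}
  template:
    metadata:
      labels: {app: app}
    spec:
      containers:
        - name: flux
          # the comment marker tells the gitops controller which
          # ImagePolicy governs this field
          image: docker.io/fluxcd/flux:1.19.0 # {"$imagepolicy": "flux-stable"}
```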

How do the ImageRepository and ImagePolicy objects get created?

The ImageRepository and ImagePolicy objects stand on their own, but are also related to automation – you can’t run the automation without having scanned the images used in the workloads in question.

This suggests that the image update automation controller create its own ImageRepository and ImagePolicy objects, based on the automation it needs to run.

1. This is similar in spirit to GitRepository here, but separates the concerns of access and policy.



2020-06-26 00:00 +0000