This is a talk I gave at Oliver-Wyman Labs in London, to a largely software engineer audience (some of whom I worked with a long time ago at LShift).
This is the script I worked from – I diverged a little from it, but this is probably more fluent.
Doing operations more like programming
This talk is about GitOps. That’s the last time I will say that word in the talk. It’s also about how cloud computing took the joy from our lives and how we can get it back.
To get the conclusion out in front of you: cloud computing makes systems more mediated and complex and operational, and you can make your life easier by treating operations more like programming.
It’s already time for the first diagram. This shows the working life of a software engineer around 2005, that’s before the relaunch of AWS, which I’m taking, a little arbitrarily, as the “cloud computing epoch”.
+-----------------------------------------------------+
| Gathering requirements |        Programming         |
+-----------------------------------------------------+
It’s not to scale, exactly. But you can see what a simple time it was.
Here is the working life of a software engineer in 2019:
+-----------------------------------------------------+
|    Soft skills     |       Clicking buttons       |P|
+-----------------------------------------------------+
That “P” is the bit that was “Programming” in the previous diagram. I will zoom in on that:
- - ---+------------------------------------------ - -
       | Programming
- - ---+---------------------------------+-------- - -
       | YAMLs | Why isn’t CI passing?   | API docs
- - ---+---------------------------------+-------- - -
What I am trying to get across here is that much less of your working life is taken up with programming, which implicitly I take to be the enjoyable bit.
Not to be too down on clicking buttons.
Partly this is because lots of things you might program have already been programmed – like distributed databases, which are fun to program but also very difficult, so it’s probably best that we don’t do it too much.
That bit I call “progress”.
Partly it is because there is an awful lot more going on; that part I call, perhaps unfairly, “entropy”.
Here’s an example of “entropy”: I used to be able to type make test in a terminal and it would build a program, then run its tests. If the tests were successful, I’d commit the change to git. I could give you the executable or you could build it yourself.
You run it, and are thereby made happy.
Nowadays, that’s just the beginning.
That executable has to be put in a container image, and that container image has
- some infrastructural prerequisites that must accompany it when it’s run
- plus a description of how it is supposed to be run
- and how to hook it up with a bunch of other executables all with their own adjuncts
- and a dashboard and alerts so that I can find out when the behaviour of the whole assembly isn’t what I expected.
Which I used to be able to tell by running the program and seeing it crash, by the way.
There are two forces at play:
- cloud computing systems are remote, therefore mediated, and they are complexifying
- in reaction, the reach of developers is extended towards operations
Or: I have to run my tests in the cluster now, how does that happen huh?
You may have noticed that we don’t really talk about correctness any more so much as efficacy, since there’s no way to reason conclusively about the behaviour of systems – we just have to try things and observe what happens.
“Soak it and see”.
There are plenty of things we gain by having all this complication. I am not here to harp on about the good old days. But it is a lot of complication, and I am arguing that it has run well ahead of our tried and trusted practices as software developers.
I would like to make up some of that ground.
Another diagram:
Write code (main.go) and commit -->
CI pipeline (.travis-ci.yml) =
[ run tests -->
build image (Dockerfile) -->
update in cluster ]
This is a pretty typical CI/CD pipeline, connecting a code change with updating a running system.
As a developer your creative input pretty much ends at “Write code”; the rest is automation.
In this case, the automation is doing a bunch of what’s essentially operations – OK the CI system runs your tests – but the rest of it is about packaging your code and running it somewhere.
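To make that concrete, the configuration for a pipeline like this might look roughly as follows – a sketch only, with a placeholder image name (example.com/app) and deployment, not anyone’s real setup:

# Travis-style CI configuration (sketch); image and deployment names are placeholders.
language: go
services:
  - docker
script:
  - go test ./...                                        # run tests
after_success:
  - docker build -t example.com/app:$TRAVIS_COMMIT .     # build image (Dockerfile)
  - docker push example.com/app:$TRAVIS_COMMIT           # put an artefact on the table
  - kubectl set image deployment/app app=example.com/app:$TRAVIS_COMMIT   # update in cluster

Everything up to and including the push just puts artefacts on the table; only the last line changes the running system.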
In fact this is a massive simplification: a lot of the testing is done after that CI pipeline. I’ll come back to that later.
An important thing to realise is that the CI pipeline is doing double duty here: it’s a venue for builds, and it’s driving some actions which deploy your code.
Up to the point it deploys the code, you have “plausible deniability” – nothing has officially taken place yet, you’ve just put some new artefacts on the table.
Diagram update:
Write code (main.go) and commit -->
CI pipeline (.travis-ci.yml) =
[ run tests -->
build image (Dockerfile) ] -->
Deployment =
[ update in cluster ]
I’m dividing it into a repeatable bit – “repeatable” as in, you can run it as many times as you want – and an unrepeatable effect – “_un_repeatable” as in, you can’t unfire a missile.
The repeatable bit has some nice properties, chief among them: no-one has to see your mistakes. You can bash away at it until you get it right.
Since the CI system is usually at arm’s length, and usually the only way to exercise it is to change a file and commit it, that particular verb – commit – is often what you end up doing, over and over.
Especially if it’s the CI configuration itself that you’re doing battle with.
Obviously at some point you have to have an effect outside your CI system, otherwise why bother. But there are some things we can do to extend the boundary of repeatability as far as we can.
And the less you do outside that boundary, the lower the risk that it will cause a problem. And the more opportunity you have for checking that it won’t be a problem ahead of time. You can think of that as a “low stakes zone”, and we want to make it as big as possible.
To review, now I have two goals:
- expand the low stakes zone
- bring joy back into our lives
I’m going to show how these are really the same.
To start with the first, I’m going to propose this technique: describe an intended state, then have an automated system take actions to maintain that state.
If you think about it, that’s what make does, or what it does with a well-constructed Makefile anyway – you give it the recipe for some artefact, and it does all the necessary things to construct that artefact when something has changed.
In fact, in a CI pipeline, you are aiming at an outcome something like:
There is always an executable built from
the most recent code that passes tests
The CI system automatically takes action to keep that statement true (as much as possible).
The operating principle of Kubernetes is a similar kind of statement:
The running system will converge on the
description given by the resources
The resources – Deployments, HorizontalPodAutoscalers and such – are a complex mix of definitions and constraints, operating at different levels, but nonetheless, that’s what Kubernetes is doing – interpreting an intended state, and constantly taking action to bring it about.
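To make that a bit more concrete: a Deployment is not a sequence of steps, it is a statement of intended state that the cluster keeps working towards. A minimal sketch, with illustrative names and a placeholder image:

# A record of intended state: "keep three replicas of this image running".
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                          # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: example.com/app:1.2.3   # placeholder image reference

If a pod dies, or a node disappears, Kubernetes notices the divergence from this description and takes action to converge on it again.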
You’ll notice these statements have a timeless property to them – “always”, “converge”. That is a signal that they are some kind of invariant.
When you see an invariant, or something approaching an invariant, you know you have one fewer thing to worry about – it’s like an equation from which you can pick the easier side to deal with.
This is a huge help for the first goal above, because it is a lot easier to figure out what a correct (or effective) state looks like, and let an automated system work out how to get there and how to stay there, than it is to script every step yourself.
It expands the kinds of change you can make with confidence.
Let’s revisit the CI pipeline from before. If we’re using the Kubernetes API to deploy our executable artefact, then we are recording an intended state, and letting Kubernetes make it so.
+----record in git----+   +-record in Kubernetes--+
|                     |   |                       |
| run tests           |   |                       |
| build image         |   |                       |
|   ----update in Kubernetes API--->              |
|                     |   | update in runtime     |
|                     |   |                       |
+---------------------+   +-----------------------+
This diagram shows the pipeline from before, but in terms of “intended states”. We can see there are two records, and two systems each maintaining their own invariant. The deployment action at the end of the pipeline crosses from one system into the other.
Would it be better to have one record and, effectively, one system?
Flux takes this position. It uses git as the record of intended state, and extends it to be Kubernetes’ record too. You can see it as eliding those two boxes into one:
+----record in git----------------------------------+
|                      :    :                       |
| run tests           :    :                       |
| build image         :    :                       |
| update in manifests :    :                       |
|   ----update in Kubernetes API--->               |
|                      :    : update in runtime    |
|                      :    :                       |
+---------------------------------------------------+
Flux is that middle bit – it takes what is in git, and updates it in Kubernetes. This is an incredibly simple idea. I am surprised it took me so long to build up to it in this talk.
It’s important that git is at the centre of it. This is to serve the second goal: using git is a daily practice for software developers.
Making the runtime system obey git makes the system amenable to software engineering practice.
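Concretely – and this is a trimmed sketch rather than a complete, current configuration, with a placeholder repository and image tag – Flux runs in the cluster as a deployment pointed at the git repository and path that hold the manifests:

# Sketch of how the Flux daemon is told where the intended state lives.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flux
  namespace: flux
spec:
  replicas: 1
  selector:
    matchLabels:
      name: flux
  template:
    metadata:
      labels:
        name: flux
    spec:
      containers:
        - name: flux
          image: docker.io/fluxcd/flux:1.x                # placeholder tag
          args:
            - --git-url=git@github.com:example/config     # the record of intended state
            - --git-branch=master
            - --git-path=k8s/prod                         # which directory to apply

It syncs on an interval, so merging a change to that repository is what deploys it.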
This is all post-rationalisation, of course. Flux really came about because we had a specific itch to scratch.
Its history goes like this:
- Updating the image used in a ReplicationController and rolling it out, by hand, is really error prone, let’s automate it
- Committing back to git after the fact seems to have some problems
- I wonder what happens if we just apply everything that’s in git
So we arrived at the core idea by … backing into it accidentally while trying to get a clear shot of something in the other direction.
This is elsewhere called “serendipity”, but as a technology startup we are obliged to call it “design”.
Anyway this was the point at which Flux became something other people might want to use, and the rest is history, as they say. But it’s not the end of the itching.
This is how the configuration for running https://cloud.weave.works/ is organised:
:
+- k8s/
   +- dev/
      :
      +- flux-ns.yaml
      +- flux/
         +- nats-dep.yaml
         +- nats-svc.yaml
      :
   +- prod/
      :
      +- flux-ns.yaml
      +- flux/
         +- nats-dep.yaml
         +- nats-svc.yaml
      :
   +- local/
      :
      +- flux-ns.yaml
      +- flux/
         +- nats-dep.yaml
         +- nats-svc.yaml
      :
The majority of it is YAML files. The directories dev/ and prod/ are configuration for the development (or staging) environment and the production environment, respectively.
These have close to identical contents, mainly differing in references to external services, and versions of container images. There’s a local/ which is a bit more distinct, but still shares an awful lot with the other two.
There are some bits of program:
+- jobs
   :
   +- flux.rules
   +- prometheus-config.yaml.py
+- dashboards
   :
   +- flux-services.dashboard.py
So yes, there’s some code there as well. Those programs generate YAMLs, either by aggregating a bunch of files into the content of a resource, or by constructing a resource and its content programmatically.
Typically, the programs are run whenever they change, and their output is committed to the repo; CI then verifies the invariant that the committed output is exactly what the programs generate.
The reason for having the program and the output is that Flux only knows how to apply YAMLs. That makes it sound like a trivial lack, but believe me it was not obvious how to do anything else.
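The verification itself is cheap to sketch as a CI step – assuming, hypothetically, a wrapper script called generate.sh that runs all the generator programs and rewrites their output in place:

# CI sketch: regenerate the YAMLs, then fail the build if the committed copies are stale.
script:
  - ./generate.sh            # hypothetical wrapper around the generator programs
  - git diff --exit-code     # exits non-zero if the regenerated output differs from git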
You can no doubt identify a slew of problems here, which I’ll selectively sum up:
- there’s duplication between the programs and the generated YAMLs
- there’s duplication between the environments, dev and prod
- YAML is a terrible programming language
Of these, the first is adequately managed by the scheme just mentioned: generate and verify.
The latter two are related: everything must be realised as YAMLs before presenting to Kubernetes. YAMLs are hard to write, especially YAMLs that are Kubernetes manifests.
In particular, you don’t have the power of abstraction, and with that goes an awful lot of reuse and most of the other things you find in a programmer’s intellectual pencil case.
I don’t think I need to hammer on that point for this audience. It’s more interesting to look at why it’s difficult to have the nice things.
Could we do the same trick with the things that are just YAMLs as with the things that start out as programs? Absolutely.
Here is the first obstacle, which I will call the “getting engineers to agree on basically anything” problem. If we generated Kubernetes manifests, what should we use?
- {j,k}sonnet
- Helm charts
- Kustomize
- a Python program
- ..
OK, if someone used any one of those, and it was even a modest improvement, no-one would complain too much. Still, there are faults to find in each of the alternatives, some distasteful, and all (evidently) discouraging.
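To give a flavour of what any of them buys you, here is roughly how the dev/prod duplication could be factored into a shared base plus a thin per-environment overlay – a Kustomize-flavoured sketch with a placeholder image name, not how the repository above is actually laid out:

# prod/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base                  # everything the environments share lives in one place
images:
  - name: example.com/nats   # placeholder image name
    newTag: "2.1.0"          # the per-environment difference: which version runs here

The environment directories shrink to the handful of values that genuinely differ.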
The second, technical, obstacle is that Flux also writes configuration changes back to the git repository, to do automated upgrades. If your configuration is generated by some program, how will Flux know where to make changes?
The answer should not be surprising: just write another program.
Look what has happened already – we went from this, to paraphrase an earlier diagram:
+------programming-------+   +--------operations-------+
|                        |   |                         |
| write some code        |   |                         |
|       ----------CI/CD-------->                       |
|                        |   | update running system  |
|                        |   |                         |
+------------------------+   +-------------------------+
To something a bit more like this:
+----programming-----------------------------------------+
|                        :      :                        |
| write some code        :      :                        |
| generate system desc.  <----->  update running system  |
|                        :      :                        |
+--------------------------------------------------------+
Continuous integration and continuous deployment – automation – let developers reach over into the realm of operations. We can go further than that, and drag stuff back towards us.
The double-headed arrow there is in recognition of this fact of cloud computing, mentioned earlier, which I’ll now come back to: the proof of an effective program is Trial by Deployment.
You may have seen this before:
O ------> O
^         |
|         v
A <------ D
It’s the OODA loop! Here is the cloud computing OODA loop in full:
Observe ------> Opine
   ^              |
   |              v
Apologise <---- Deploy
That’s just my little joke. The point is, we are now dealing with a feedback loop, or a cybernetic system, if you prefer.
As one consequence, to be confident in the changes you make in the normal course of development, you must attempt them in medias res.
We are now in the business of regulating a living system. Testing is ongoing – continuous, but continuous continuous. So the arrow has to go back the other way!
But, look, there’s some extra stuff we can do now:
+----programming-----------------------------------------+
|                        :      :                        |
| write some code        :      :                        |
| generate system desc.  :      :                        |
| verify system desc.    :      :                        |
| update system desc.      ===      update running system |
|                        :      :                        |
+--------------------------------------------------------+
So this is also expanding the low stakes zone, by letting us apply tooling to the description of the system.
- typed programming languages
- IDEs
- linting
- code review
- libraries
- versioning
We can’t guarantee perfection, but we can lower the stakes and catch more problems early, by doing more programming.
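As a small example of the verification step: a schema check over the manifests can run in CI alongside the tests – here assuming the kubeval checker, though any manifest linter fills the same slot:

# CI sketch: validate the system description before anything touches a cluster.
script:
  - kubeval $(find k8s -name '*.yaml')   # schema-check every manifest in the repo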
I will finish here, while there are still untended points suggesting themselves.
I haven’t talked much about actually running Flux, or about other projects at Weaveworks that fit in here, like jk and flagger, but I’m very happy to answer questions on those things.
Bonus section: anticipated question
- But we already do things like blue/green deployments to mitigate deployment risks; are you suggesting we retreat from that?
No; you can see blue/green deployments as extending testing into the live system, but we can still think about some or all of it as being “intended state”. It does get quite fuzzy, though, what you consider persistent state and what is ephemeral.
For example: say we are deploying a new version of a service for “user acceptance” testing. I’m using user acceptance because the testing is mediated by humans, or at least signed off by humans, so it operates on human time scales. You probably want to record the state of the experiment – that might just be “run the new version” – so that you can come back to it, if it’s interrupted.
If you were rolling out a new version of a service which had some modest bug fix though, you might use a canary, and just run a small amount of traffic through the new version until you’re confident it’s not a step backwards. It doesn’t matter as much if you have to repeat that; you could record the fact of the update, and let automated systems with only runtime state take care of it.
Tooling like Flagger and service meshes (in combination) will let you do this sort of thing.
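For a flavour of what that looks like: with Flagger, the canary policy is itself declared as intended state – a sketch with illustrative names and thresholds, not a complete resource:

# Sketch of a Flagger canary: the rollout policy is part of the recorded state.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: app                        # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app                      # the workload being rolled out
  service:
    port: 80
  analysis:
    interval: 1m                   # how often to evaluate the canary
    stepWeight: 10                 # shift traffic over in 10% increments
    maxWeight: 50
    threshold: 5                   # give up (and roll back) after this many failed checks
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99                  # roll back if the success rate drops below 99%
        interval: 1m

The declarative part – roll it out gradually, back off if the success rate drops – stays in the record; the moment-to-moment traffic weights are the ephemeral runtime state.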