A not-so-great way to start a new week is to find out that the certificate of your API server expired over the weekend. Fixing an expired OpenShift certificate should be straightforward, but it wasn’t. Here is what happened, or you can directly scroll down for the solution.
I have been running my own OpenShift cluster for a while now, playing with IoT stuff, and using Let’s Encrypt certificates to secure the API server endpoints and the application domain. Most of that is automated, only the certificate renewal is not. So every 60 days, I refresh the certificates, which gives me a buffer of 30 days, should something go wrong. And while I have that scripted as well, I need to trigger it manually.
So when you start a new week, try to log in, and see the following output, you know that you forgot about something important:
error: x509: certificate has expired or is not yet valid: current time 2021-01-18T11:21:55+01:00 is after 2021-01-17T09:35:54Z
Then again, I do recall refreshing the certificates at the end of 2020. So what went wrong?
I only found out about the root cause later, once I fixed the issue. However, it is an important piece of the puzzle, because my case was a bit different from most cases of an expired OpenShift certificate.
OpenShift allows you to manage the certificates using a custom resource, and has an operator to roll out those certificates. In a nutshell, you need to provide two Secrets, containing a signed certificate and key each. OpenShift will do the rest for you. One combination is for the API server, and the other one is for the application domain, the default ingress mechanism.
I did make a change to my script, and that introduced a bug. The result was that the API server certificate was renewed, but the application domain certificate was not.
What needed to be done?
All I needed to do in order to fix this was to update the Secret which contains the key/cert combination. Providing a newer, non-expired version would trigger a re-rollout, and all things would be back to normal.
But how can you update a Secret if you no longer have access to your cluster? oc login failed with the certificate error.
Why not skip TLS validation?
True, the oc command, as well as the kubectl command, lets you pass --insecure-skip-tls-verify=true and just skip TLS validation. And that would have worked, if the issue was with the API server certificate.
However, the situation here was different. I didn’t have a valid access token anymore. In order to get a new one, you simply do oc login. That didn’t work out, reporting the same X.509 certificate error. In the background, the oc command tries to get a new token, not from the API server, but from an OAuth endpoint, which is hosted on the standard ingress endpoints rather than the API server endpoints. Unfortunately, --insecure-skip-tls-verify only works for the API server endpoints. I would call that a bit inconsistent, but hey.
Other cases of expired OpenShift certificates
Searching the internet for a solution, all kinds of “expired certificate” cases with Kubernetes showed up. Many of them concerned the control plane, but that wasn’t what I was looking for. Also, some of the API server certificate fixes sounded rather invasive. I am glad I didn’t give any of them a try, as that might have actually caused more harm in the end.
Remember, all I needed was a way to replace some Secrets.
After some debugging, I found the root cause, mentioned above. The API endpoint certificate was working, the OAuth one was expired. That made oc login fail, despite adding --insecure-skip-tls-verify=true to the calls.
Fortunately, when creating an OpenShift cluster you will also get a cluster certificate, which you can use to access the API server as an admin user. You are supposed to keep this key/cert combination, for cases like this.
Setting the KUBECONFIG environment variable to the generated, original configuration gives you direct access to your cluster, without the need to log in:
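As a minimal sketch, assuming the asset directory of the installer is still around: openshift-install leaves the admin configuration in auth/kubeconfig below that directory. The INSTALL_DIR path is a placeholder, not the path from my setup.

```shell
# Assumption: INSTALL_DIR is the asset directory that openshift-install
# wrote to when the cluster was created. Adjust the path to your setup.
INSTALL_DIR="${INSTALL_DIR:-$HOME/openshift-install}"

# Point oc and kubectl at the generated admin kubeconfig. It carries a
# client certificate, so no token and no `oc login` is involved.
export KUBECONFIG="$INSTALL_DIR/auth/kubeconfig"

echo "Using admin kubeconfig: $KUBECONFIG"
# A quick check that API access works again (not run here):
#   oc whoami
```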
After that, I was automatically “logged in”, and could re-run my custom scripts for replacing the Secrets necessary to fix the expired OpenShift certificates. The operator rolled them out, and the cluster was operational again.
The actual commands for re-creating the ingress certificates might be different in your case, depending on your settings and environment. Here is what works for me:
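A rough sketch of such a replacement, following the documented procedure for the default ingress certificate. The secret name and the certificate file names are assumptions for illustration, not the values from my cluster, and --dry-run=client needs a reasonably recent oc.

```shell
# Sketch only: the secret name and the file names are assumptions.
# Usage: replace_ingress_cert <secret-name> <fullchain.pem> <privkey.pem>
replace_ingress_cert() {
  secret="$1"; cert="$2"; key="$3"

  # Render the TLS secret locally and apply it, which updates the
  # secret in place if it already exists.
  oc -n openshift-ingress create secret tls "$secret" \
    --cert="$cert" --key="$key" \
    --dry-run=client -o yaml | oc -n openshift-ingress apply -f -

  # Make sure the default ingress controller references that secret.
  oc -n openshift-ingress-operator patch ingresscontroller.operator default \
    --type=merge \
    -p "{\"spec\":{\"defaultCertificate\":{\"name\":\"$secret\"}}}"
}

# Example call (needs cluster access, so it is not run here):
#   replace_ingress_cert router-certs fullchain.pem privkey.pem
```

Applying the rendered secret instead of deleting and re-creating it keeps the name stable, so the operator only sees changed content and rolls it out.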
Since OpenShift 4, updates are rather trivial. You wait for the new update to appear, press the button (or use the CLI), wait a bit, and the update is done. True, in production you might want to complicate that process a bit, for good reason.
Having run an OpenShift 4 cluster for a while myself, and developing apps on top of Kubernetes in my day job, I am curious about the next release. Is it GA already? Can I deploy it? Is there an upgrade for my current version? Is that in “candidate”, “fast”, or “stable”? Checking that turned out to be not as easy as it should be.
Cincinnati is an update protocol designed to facilitate automatic updates. It describes a particular method for representing transitions between releases of a project and allowing a client to perform automatic updates between these releases.
Which means, that this tool has all the information available, to show which upgrade is available. And from which version I can upgrade. It also has the information about all the different channels (like “fast”, “candidate”, or “stable”). And it is written in Rust, I love it already.
There is a little bit of information on the data format, but what about the data itself? It is available from the endpoint https://api.openshift.com/api/upgrades_info. So now we have everything that we need.
To be honest, there is nothing too difficult about it. You have all the data. There already is a nice example using graphviz in the repository of OpenShift Cincinnati. The problem with the example is that you need to run it every time an update gets posted, for every channel. It will generate a static graph representation, so you have no ability to zoom or re-arrange. It would be nice to have a bit more interactive visualization of that graph.
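Fetching the graph for a single channel could look like the following sketch. The /v1/graph path, the channel query parameter, and the channel name in the example are assumptions on top of the base URL mentioned above; the Accept header is one of the headers the service expects.

```shell
# Build the graph URL for one channel. The /v1/graph path and the
# "channel" query parameter are assumptions about the public service.
graph_url() {
  echo "https://api.openshift.com/api/upgrades_info/v1/graph?channel=$1"
}

# Fetch the raw update graph (JSON with "nodes" and "edges") for a channel.
fetch_graph() {
  curl -sSf -H "Accept: application/json" "$(graph_url "$1")"
}

# Example (needs network access, so it is not run here):
#   fetch_graph stable-4.6 > stable-4.6.json
```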
As always, it should be simple. But in real life, nothing is. Of course I encountered a few obstacles to work around …
Hosting a web page
Hosting a simple, static webpage used to be so simple. Put a file into a directory that Apache publishes, and you are done … Today, I also could set up a Tekton pipeline, build a new image, and “run” the webpage in a container, on a cluster, proxied by one or two more HTTP reverse proxies.
The truth is, there are now so many options when it comes to hosting a simple web page that it can become a tough decision. I wanted to keep things together closely. Most likely I will build it, use it, and then not want to actively maintain it (forget about it). Git and GitHub are obvious choices for me, so why not simply host this with GitHub Pages, using the same repository I use for coding? As it turned out later, that was the right choice and helped with a few other obstacles as well.
CORS & headers
Unfortunately, the update API is rather picky when it comes to doing requests. It’s missing CORS headers, and using a general purpose CORS proxy turned out to lack the correct HTTP headers that the API requires. I wanted to focus on the visualization and wanted to stick to the plan of simply hosting a static web page, not running a CORS proxy myself for this.
Then again, the only thing I do with the API is to perform an HTTP GET request. As there is nothing dynamic about it, I could as well host a JSON file, and fetch that. I would only need a process to update the JSON data.
Now I was glad that I chose GitHub for all of this. Setting up a GitHub Actions workflow for updating the data is rather simple. The command was already part of the original example, with the only difference, that I don’t need to run graphviz. The workflow will fetch the data, and when git detects a change in the data, the workflow will commit and push the changes to the same repository. Great plan, but …
… the data format is not stable. Doing multiple GET operations on the endpoint gives you back different content. True, the information is the same, but the “byte content” is different. The data format describes updates as nodes and edges, very simple and a perfect match for our purpose. However, the edges reference the nodes by their position in the list of nodes, and not by some stable identifier. Assume the following two examples:
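A hypothetical pair of such responses, reduced to the fields that matter here. The version names A, B and C are placeholders; real nodes carry more fields.

```shell
# Hypothetical data: three releases A, B and C. Each edge is a pair of
# positions in the "nodes" list: [from, to].
workdir="$(mktemp -d)"

cat > "$workdir/example-1.json" <<'EOF'
{"nodes":[{"version":"A"},{"version":"B"},{"version":"C"}],"edges":[[0,1],[1,2]]}
EOF

cat > "$workdir/example-2.json" <<'EOF'
{"nodes":[{"version":"C"},{"version":"A"},{"version":"B"}],"edges":[[1,2],[2,0]]}
EOF

# Compare the files byte by byte, the same level a plain diff works on.
if cmp -s "$workdir/example-1.json" "$workdir/example-2.json"; then
  echo "identical"
else
  echo "different bytes"
fi
# prints: different bytes
```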
Both examples contain the same information (A → B → C), however the raw bytes are different (can you spot the difference?). And git diff only works with the bytes, and not the actual information conveyed by those.
So my plan of periodically fetching the data, and letting git diff check for differences, wouldn’t work. Unless I would create a small script that normalizes the data. Running that as part of the update job isn’t complicated at all. And now the diff can check if the normalized data changed, and only act on that.
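One way to sketch such a normalization, assuming jq is available: sort the nodes by version, rewrite every edge to the new positions, and sort the edges too. The function name and the sample data are made up for this illustration, and the plain string sort only has to be deterministic, not semver-aware.

```shell
# Sketch of a normalization step; requires jq. Reads a graph on stdin and
# writes a canonical form: nodes sorted by version, edges re-indexed to
# match the new node order and then sorted as well.
normalize_graph() {
  jq -S -c '
    . as $g
    | ($g.nodes | sort_by(.version)) as $sorted
    # Map each version string to its position in the sorted node list.
    | (reduce range(0; $sorted | length) as $i
         ({}; .[$sorted[$i].version] = $i)) as $index
    | {
        nodes: $sorted,
        edges: ($g.edges
                | map(.[0] as $from | .[1] as $to
                      | [ $index[$g.nodes[$from].version],
                          $index[$g.nodes[$to].version] ])
                | sort)
      }
  '
}

# Two byte-wise different encodings of A -> B -> C (hypothetical data).
a='{"nodes":[{"version":"A"},{"version":"B"},{"version":"C"}],"edges":[[0,1],[1,2]]}'
b='{"nodes":[{"version":"C"},{"version":"A"},{"version":"B"}],"edges":[[1,2],[2,0]]}'

out_a="$(printf '%s' "$a" | normalize_graph)"
out_b="$(printf '%s' "$b" | normalize_graph)"
if [ -n "$out_a" ] && [ "$out_a" = "$out_b" ]; then
  echo "same graph"
fi
```

With the normalized file committed next to the raw one, git only sees a change when the graph itself changed.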
Why do I keep the non-normalized data? Yes, I could let the visualizer use the normalized data. However, I would like to use the original data format. In the hope that some day, I would be able to use the API endpoint directly.
No API for channels
I wanted to visualize all channels, with the different versions. Turns out that there is no API for that: openshift/cincinnati#171. But I also didn’t want to maintain an up-to-date list of channels myself. A thing that I will forget about sooner or later. Nothing a little shell script can’t fix. In GitHub Actions, performing a checkout is simply yet another action, and you can check out multiple repositories. Of course, you don’t get triggered from those repositories. But we are running a periodic workflow anyway, so why not check out the graph data repository, and check for the channels in there.
Why didn’t I use the data contained in the channel directory of that repository? I wanted to really stick to the original API. There is all kinds of processing done with the data in there, and I simply didn’t want to replicate that. Finding all the channels as a workaround seemed fine though.
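A sketch of that little shell script, assuming the graph data repository is checked out locally and keeps one YAML file per channel in a channels directory. The function name and the directory layout are assumptions for illustration.

```shell
# Assumption: the first argument is a checkout of the graph data
# repository, with one YAML file per channel in "channels".
list_channels() {
  for file in "$1"/channels/*.yaml; do
    [ -e "$file" ] || continue   # no match: the glob stays literal
    basename "$file" .yaml
  done | sort
}

# Example (not run here): list_channels ./graph-data
```

Each channel name can then be fed into the request against the update API, so only the names come from the repository and the graph data still comes from the original endpoint.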
Once the graphs grow a bit, they can get rather complex. vis.js has a “physics” model built in, which helps to balance the layout when you drag around nodes. However, every now and then the layouter and the physics model seemed to produce funny, but useless results. Depending on the model, a simple reload, re-starting the layouting algorithm with a different seed, fixes the problem. But that is a bad experience.
Luckily, you can configure all kinds of settings in the physics model, and playing around a bit with them led to a configuration that seems to be fun, but stable enough, even for the bigger graphs.
What is next?
Not much really. It is a tool for me that just works, showing OpenShift updates with a click. The itch is scratched, and I learned a few things in the process. And I hope that by sharing, it becomes useful for someone other than just me.
Of course, if you think that something is missing, broken or could be done in a better way: Open Source is all about contributing ;-) → ctron/openshift-update-graph
Quarkus is advertised as a “Kubernetes Native Java stack, …”, so we took it to a test, and checked what benefits we can get, by replacing an existing service from the IoT components of EnMasse, the cloud-native, self-service messaging system.
For quite a while, I wanted to try out Quarkus. I wanted to see what benefits it brings us in the context of EnMasse. The IoT functionality of EnMasse is provided by Eclipse Hono™, which is a micro-service based IoT connectivity platform. Hono is written in Java, makes heavy use of Vert.x, and the application startup and configuration is being orchestrated by Spring Boot.
EnMasse provides the scalable messaging back-end, based on AMQP 1.0. It also takes care of the Eclipse Hono deployment, alongside EnMasse. Wiring up the different services, based on an infrastructure custom resource. In a nutshell, you create a snippet of YAML, and EnMasse takes care and deploys a messaging system for you, with first-class support for IoT.
This system requires a service called the “tenant service”. That service is responsible for looking up an IoT tenant, whenever the system needs to validate that a tenant exists or when its configuration is required. Like all the other services in Hono, this service is implemented using the default stack, based on Java, Vert.x, and Spring Boot. Most of the implementation is based on Vert.x alone, using its reactive and asynchronous programming model. Spring Boot is only used for wiring up the application, using dependency injection and configuration management. So this isn’t a typical Spring Boot application, it uses neither Spring Web nor any of the Spring Messaging components. And the reason for choosing Vert.x over Spring in the past was performance. Vert.x provides excellent performance, which we tested a while back in our IoT scale test with Hono.
The goal was simple: make it use fewer resources, having the same functionality. We didn’t want to re-implement the whole service from scratch. And while the tenant service is specific to EnMasse, it still uses quite a lot of the base functionality coming from Hono. And we wanted to re-use all of that, as we did with Spring Boot. So this wasn’t one of those nice “greenfield” projects, where you can start from scratch, with a nice and clean “Hello World”. This code is embedded in two bigger projects, passes system tests, and has a history of its own.
So, change as little as possible and get out as much as we can. What else could it be?! And just to understand from where we started, here is a screenshot of the metrics of the tenant service instance on my test cluster:
Around 200MiB of RAM, a little bit of CPU, and not much to do. As mentioned before, the tenant service only gets queries to verify the existence of a tenant, and the system will cache this information for a bit.
Step #1 – Migrate to Quarkus
To use Quarkus, we started to tweak our existing project, to adopt the different APIs that Quarkus uses for dependency injection and configuration. And to be fair, that mostly meant saying good-bye to Spring Boot specific APIs, going for something more open. Dependency Injection in Quarkus comes in the form of CDI. And Quarkus’ configuration is based on Eclipse MicroProfile Config. In a way, we didn’t migrate to Quarkus, but away from Spring Boot specific APIs.
Starting with adding the Quarkus Maven plugin and some basic dependencies to our Maven build, and off we go.
And while replacing dependency injection was a rather smooth process, the configuration part was a bit more tricky. Both Hono and MicroProfile Config have a rather opinionated view on the configuration. Which made it problematic to enhance the Hono configuration in a way that MicroProfile was happy with. So for the first iteration, we ended up wrapping the Hono configuration classes to make them play nice with MicroProfile. However, this is something that we intend to improve in Hono in the future.
Packaging the JAR into a container was no different than with the existing version. We only had to adapt the EnMasse operator to provide application arguments in the form Quarkus expected them.
From a user perspective, nothing has changed. The tenant service still works the way it is expected to work and provides all the APIs as it did before. Just running with the Quarkus runtime, and the same JVM as before:
We can directly see a drop of 50MiB from 200MiB to 150MiB of RAM, that isn’t bad. CPU isn’t really different, though. There also is a slight improvement of the startup time, from ~2.5 seconds down to ~2 seconds. But that isn’t a real game-changer, I would say. Considering that ~2.5 seconds startup time, for a Spring Boot application, is actually not too bad, other services take much longer.
Step #2 – The native image
Everyone wants to do Java “native compilation”. I guess the expectation is that native compilation makes everything go much faster. There are different tests by different people, comparing native compilation and JVM mode, and the outcomes vary a lot. I don’t think that “native images” are a silver bullet to performance issues, but still, we have been curious to give it a try and see what happens.
Native image with Quarkus
Enabling native image mode in Quarkus is trivial. You need to add a Maven profile, set a few properties and you have native image generation enabled. With setting a single property in the Maven POM file, you can also instruct the Quarkus plugin to perform the native compilation step in a container. With that, you don’t need to worry about the GraalVM installation on your local machine.
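On the command line, that could look like the following sketch. It assumes the POM defines a profile named native, as the Quarkus project templates do; the wrapper function is only there for illustration.

```shell
# Assumption: the POM defines a profile named "native".
build_native() {
  # container-build=true runs the GraalVM native-image step inside a
  # container, so no local GraalVM installation is needed.
  mvn package -Pnative -Dquarkus.native.container-build=true "$@"
}

# Example (needs Maven, a container runtime and the project, not run here):
#   build_native -DskipTests
```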
Native image generation can be tricky, we knew that. However, we didn’t expect this to be as complex as being “Step #2”. In a nutshell, creating a native image compiles your code to CPU instructions, rather than JVM bytecode. In order to do that, it traces the call graph, and it fails to do so when it comes to reflection in Java. GraalVM supports reflection, but you need to provide the information about types, classes, and methods that should participate in the reflection system, from the outside. Luckily Quarkus provides tooling to generate this information during the build. Quarkus knows about constructs like de-serialization in Jackson and can generate the required information for GraalVM to compile this correctly.
However, the magic only works in areas that Quarkus is aware of. So we did run into some weird issues, strange behavior that was hard to track down. Things that worked in JVM mode all of a sudden were broken in native image mode. Not all the hints are in the documentation. And we also didn’t read (or understand) all of the hints that are there. It takes a bit of time to learn, and with a lot of help from some colleagues (many thanks to Georgios, Martin, and of course Dejan for all the support), we got it running.
What is the benefit?
After all the struggle, what did it give us?
So, we are down another 50MiB of RAM. Starting from ~200MiB, down to ~100MiB. That is only half the RAM! Also, this time, we see a reduction in CPU load. While in JVM mode (both Quarkus and Spring Boot), the CPU load was around 2 millicores, now the CPU is always below that, even during application startup. Startup time is down from ~2.5 seconds with Spring Boot, to ~2 seconds with Quarkus in JVM mode, to ~0.4 seconds for Quarkus in native image mode. Definitely an improvement, but still, none of those times is really bad.
Pros and cons of Quarkus
Switching to Quarkus was no problem at all. We found a few areas in the Hono configuration classes to improve. But in the end, we can keep the original Spring Boot setup and have Quarkus at the same time. Possibly other MicroProfile compatible frameworks as well, though we didn’t test that. Everything worked as before, just using less memory. And except for the configuration classes, we could pretty much keep the whole application as it was.
Native image generation was more complex than expected. However, we also saw some real benefits. And while we didn’t do any performance tests on that, here is a thought: if the service has the same performance as before, the fact that it requires only half of the memory, and half the CPU cycles, allows us to run twice the amount of instances now. Doubling throughput, as we can scale horizontally. I am really looking forward to another scale test, since we did all kinds of other optimizations as well.
You should also consider that the process of building a native image takes quite an amount of time. For this rather simple service, it takes around 3 minutes on an above-average machine, just to build the native image. I did notice some decent improvement when trying out GraalVM 20.0 over 19.3, so I would expect some more improvements on the toolchain over time. Things like hot code replacement while debugging are not possible with the native image profile though. It is a different workflow, and that may take a bit to adapt to. However, you don’t need to commit to either way. You can still have both at the same time. You can work with JVM mode and the Quarkus development mode, and then enable the native image profile, whenever you are ready.
Taking a look at the size of the container images, I noticed that the native image isn’t smaller (~85 MiB), compared to the uber-JAR file (~45 MiB). Then again, our “java base” image alone is around ~435 MiB. And it only adds the JVM on top of the Fedora minimal image. As you don’t need the JVM when you have the native image, you can go directly with the Fedora minimal image, which is around ~165 MiB, and end up with a much smaller overall image.
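Putting those numbers together, under the simplifying assumption that the layer sizes just add up:

```shell
# Sizes in MiB, taken from the figures above.
jvm_image=$(( 435 + 45 ))      # Java base image + uber-JAR
native_image=$(( 165 + 85 ))   # Fedora minimal + native executable

echo "JVM image:    ${jvm_image} MiB"
echo "native image: ${native_image} MiB"
echo "ratio:        $(( native_image * 100 / jvm_image ))%"
# prints 480 MiB, 250 MiB and a ratio of 52%
```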
Switching our existing Java project to Quarkus wasn’t a big deal. It required some changes, yes. But those changes also mean, using some more open APIs, governed by the Eclipse Foundation’s development process, compared to using Spring Boot specific APIs. And while you can still use Spring Boot, changing the configuration to Eclipse MicroProfile opens up other possibilities as well. Not only Quarkus.
Just by taking a quick look at the numbers, comparing the figures from Spring Boot to Quarkus with native image compilation: RAM consumption was down to 50% of the original, CPU usage also was down to at least 50% of original usage, and the container image shrank to ~50% of the original size. And as mentioned in the beginning, we have been using Vert.x for all the core processing. Users that make use of the other Spring components should see more considerable improvement.
Going forward, I hope we can bring the changes we made to the next versions of EnMasse and Eclipse Hono. There is a real benefit here, and it provides you with some awesome additional choices. And in case you don’t like to choose, the EnMasse operator has some reasonable defaults for you 😉
This work is based on the work of others. Many thanks to:
When you want to containerize your Rust application, you might be using a prepared Rust image. But maybe you are a bit more paranoid when it comes to trusting base layers and you want to create your own Rust base image. Or maybe you are just curious and want to try it yourself.
Getting cargo, the Rust build tool, into the image is probably one of the first tasks in your Dockerfile. And it is rather easy on an interactive command line:
curl https://sh.rustup.rs -sSf | sh
Automated container build
However, running inside a container build, you will be greeted by the nice little helper script, asking you for some input:
In a terminal window this is no problem. But in an automated build, you want the script to proceed without the need for manual input.
The solution is rather simple. If you take a look at the script, then you will figure out that it actually allows you to pass an argument -y, assuming defaults without the need to input any more details.
And you can still keep the “one liner” for installing:
curl https://sh.rustup.rs -sSf | sh -s -- -y
The -s will instruct the shell to process the script from “standard input”, rather than reading the script from a file. In the original command it already did that, but implicitly, because there was no other argument to the shell.
The double dash (--) indicates to the shell that everything which comes after is not an argument to the shell, but to the shell script instead.
And finally -y is passed to the script, which is the cargo installer.
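To see the argument passing in isolation, the installer can be swapped for a one-line stand-in that only prints what it receives. The stand-in script is made up for this illustration; only the sh -s -- -y part matches the real command.

```shell
# A stand-in for the downloaded installer: it only echoes its arguments.
# sh reads the script from standard input (-s), stops parsing its own
# options at --, and hands -y to the script as its first argument.
printf 'echo "script received: $*"\n' | sh -s -- -y
# prints: script received: -y
```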
I hope this comes in handy to you. It took me a bit to figure it out. Of course, not only in the context of containers, but for any headless/silent installation of Rust.
The Eclipse IoT ecosystem consists of around 40 different projects, ranging from embedded devices, to IoT gateways and up to cloud scale solutions. Many of those projects stand alone as “building blocks”, rather than ready to run solutions. And there is a good reason for that: you can take these building blocks, and incorporate them into your own solution, rather than adopting a complete, pre-built solution.
This approach however comes with a downside. Most people will understand the purpose of building blocks, like “Paho” (an MQTT protocol library) and “Milo” (an OPC UA protocol library) and can easily integrate them into their solution. But on the cloud side of things, building blocks become much more complex to integrate, and harder to understand.
Of course, the “getting started” experience is extremely important. You can simply download an Eclipse IDE package, tailored towards your context (Java, Modelling, Rust, …), and are up and running within minutes. We don’t want you to design your deployment descriptors first, and then let you figure out how to start up your distributed cluster. Otherwise “getting started” will become a week long task. And a rather frustrating one.
Getting started. Quickly!
During the Eclipse IoT face-to-face meeting in Berlin, early this year, the Eclipse IoT working group discussed various ideas. How can we enable interested parties to get started, with as little effort as possible? And still, give you full control. Not only with a single component, which doesn’t provide much benefit on its own. But get you started with a complete solution, which solves actual IoT related problems.
The goal was simple. Take an IoT use case, which is easy to understand by IoT related people. And provide some form of deployment, which gets people up and running in less than 15 minutes. With as little as possible external requirements. At best, run everything on your local laptop. Still, create everything in a close-to-production style of deployment. Not something completely stripped down. But use a way of deployment, that you could actually use as a basis for extending it further.
Kubernetes & Helm
We quickly agreed on Kubernetes as the runtime platform, and Helm as the way to perform the actual deployments. With Kubernetes being available even on a local machine (using minikube on Linux, Windows and Mac) and being available, at the same time, in several enterprise ready environments, it seemed like an ideal choice. Helm charts seemed like an ideal choice as well. Helm is designed directly for Kubernetes. And it also allows you to generate YAML files from the Helm charts. So that the deployment only requires you to deploy a bunch of YAML files. Maintaining the charts is still way easier than directly authoring YAML files.
Challenges, moving towards an IoT solution
A much tougher question was: how do we structure this, from a project perspective? During the meeting, it soon turned out that there would be two good initial candidates for “stacks” or “groups of projects”, which we would like to create.
It also turned out that we would need some “glue” components for a package like that. Even though it may only be a script here, or a “readme” file there. Some artifacts just don’t fit into either of those projects. And what about “in development” versions of the projects? How can you point people towards a stable deployment, only using a stable (released) group of projects, when scripts and readme’s are spread all over the place, in different branches.
A combination of “Hono, Ditto & Hawkbit” seemed like an ideal IoT solution to start with. People from various companies already work across those three projects, using them in combination for their own purpose. So, why not build on that?
But in addition to all those technical challenges, the governance of this effort is an aspect to consider. We did not want to exclude other Eclipse IoT projects, simply by starting out with “Hono, Ditto, and Hawkbit”. We only wanted to create “an” Eclipse IoT solution, and not “the” Eclipse IoT solution. The whole Eclipse IoT ecosystem is much too diverse, to force our initial idea on everyone else. So what if someone comes up with an additional group of Eclipse IoT projects? Or what if someone would like to add a new project to an existing deployment?
A home for everyone
Luckily, creating an Eclipse Foundation project solves all those issues. And the Eclipse Packaging project already proves that this approach works. We played with the idea to create some kind of a “meta” project. Not a real project in the sense of having a huge code base. But more a project which makes use of the Eclipse Foundation’s governance framework. Allowing multiple, even competing companies, to work upstream in a joint effort. And giving all the bits and pieces, which are specific to the integration of the projects, a dedicated home.
A home, not only for the package of “Hono, Ditto and Hawkbit”, but hopefully for other packages as well. If other projects would like to present their IoT solution, by combining multiple Eclipse IoT projects, this is their chance. You can easily become a contributor to this new project, and publish your scripts, documentation and walk-throughs, alongside the other packages.
Of course everything will be open source, licensed under the EPL. So go ahead and fork it, add your custom application on top of it. Or replace an existing component with something you think is even better than what we put in. We want to enable you to deploy what we provide in a few minutes. Offer you an explanation of what to expect from it, and what this IoT solution can do for you. And encourage you to play around with it. And enable you to extend it, and build something bigger.
Let’s get started
We created a new project proposal for the Eclipse IoT packages project. The project is currently in the community review phase. Once we pass the creation review, we will start publishing the content for the first package we have.
Today I wanted to change the owner of an OpenShift project. It actually is rather trivial. However, finding out how wasn’t so easy. Googling didn’t help much, and the documentation also has room for improvement. So I took a few minutes to document how it works.
A while back I wrote a blog post about OPC UA, using Milo and added a bunch of examples, in order to get you started. Time passed by and now Milo 0.3.x is released, with a changed API and so those examples no longer work. Not too much has changed, but the experience of running into compile errors isn’t a good one. Finally I found some time to update the examples.
Red Hat AMQ Online 1.1 was recently announced, and I am excited about it because it contains a tech preview of our Internet of Things (IoT) support. AMQ Online is the “messaging as service solution” from Red Hat AMQ. Leveraging the work we did on Eclipse Hono allows us to integrate a scalable, cloud-native IoT personality into this general-purpose messaging layer. And the whole reason why you need an IoT messaging layer is so you can focus on connecting your cloud-side application with the millions of devices that you have out there.
I have been working with ESPs, for playing around in the space of IoT, for a while now. Mostly using the ESP8266 and Espressif, through platform.io. In recent times, I have also started to really like Rust as a programming language. And I really believe that all Rust has to offer would be a great match for embedded development. So when I had a bit of time, I wanted to give it a try. And here is what came out of it …