How to set up Kubernetes service discovery in Prometheus

The ability to monitor production services is fundamental to understanding how a system behaves, troubleshooting bugs and incidents, and preventing them from happening. In this post we’re going to cover how Prometheus can be installed and configured in a Kubernetes cluster so that it automatically discovers services using custom annotations. We’ll start with an overview of Prometheus and its capabilities. After that, we’ll deep-dive into how to configure auto-discovery, and we’ll conclude by monitoring a sample application in action.

To test what follows you need a Kubernetes cluster. For local testing, I’d recommend minikube, which lets you run a Kubernetes cluster on your local machine. Otherwise, feel free to use any other Kubernetes provider out there; I personally use GKE.

The content of this post has been tested on the following environments:

Locally:
- minikube v1.25.1
- kubectl:
  - client: v1.23.3
  - server: v1.23.1
GKE:
- kubectl:
  - client: v1.21.9
  - server: v1.21.9

All the code shown here is available on GitHub.

Let’s get started!

What is Prometheus?

Prometheus is an open-source software for monitoring your services by pulling metrics from them through HTTP calls. These metrics are exposed using a specific format, and there are many client libraries written in many programming languages that allow exposing these metrics easily. On top of that, there are also many existing so-called exporters that allow third-party data to be plumbed into Prometheus.

Prometheus is an end-to-end monitoring solution, as it also provides configurable alerting, a powerful query language, and a dashboard for data visualization.

Data model and Query Language

The data model implemented by Prometheus is highly dimensional. All the data are stored as time series identified by a metric name and a set of key-value pairs called labels:

<metric name>{<label_1 name>=<label_1 value>, ..., <label_n name>=<label_n value>} <value>

Each sample of a time series consists of a float64 value and a millisecond-precision timestamp.

As an example borrowed from the official documentation, a metric that counts HTTP requests can be expressed as:

api_http_requests_total{method=<method>, handler=<handler>} <value>

In the case of POST requests to a /messages endpoint that has been called 3 times, it becomes:

api_http_requests_total{method="POST", handler="/messages"} 3.0
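To make the structure of such a line concrete, here is a minimal Python sketch that splits it into metric name, labels, and value. It is a simplification written for this post, not the parser Prometheus uses: it assumes the line always has a label block and ignores timestamps, escaping, and comments. The names parse_sample, SAMPLE_RE, and LABEL_RE are made up for the example.

```python
import re

# Simplified patterns: name{key="value", ...} value
# Assumption: no escaped quotes inside label values and no trailing timestamp.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>[^"]*)"')


def parse_sample(line):
    """Split one sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = {
        m.group("key"): m.group("val")
        for m in LABEL_RE.finditer(match.group("labels"))
    }
    return match.group("name"), labels, float(match.group("value"))


name, labels, value = parse_sample(
    'api_http_requests_total{method="POST", handler="/messages"} 3.0'
)
print(name, labels, value)
```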

If we assume that we only track the methods "POST" and "GET" and the handlers "/messages" and the root "/", this corresponds to four distinct time series in Prometheus:

api_http_requests_total with method="GET" and handler="/"
api_http_requests_total with method="GET" and handler="/messages"
api_http_requests_total with method="POST" and handler="/"
api_http_requests_total with method="POST" and handler="/messages"
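The four series are simply every combination of the label values. A small sketch of that idea, using only the standard library (the helper series_identifiers is a name invented for this example):

```python
from itertools import product

METRIC = "api_http_requests_total"
LABEL_VALUES = {"method": ["GET", "POST"], "handler": ["/", "/messages"]}


def series_identifiers(metric, label_values):
    """Return one identifier per combination of label values."""
    names = list(label_values)
    identifiers = []
    for combo in product(*(label_values[n] for n in names)):
        rendered = ", ".join(f'{n}="{v}"' for n, v in zip(names, combo))
        identifiers.append(f"{metric}{{{rendered}}}")
    return identifiers


SERIES = series_identifiers(METRIC, LABEL_VALUES)
for identifier in SERIES:
    print(identifier)
```

This is also why labels with many possible values should be used with care: every extra value multiplies the number of time series.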

Here’s a screenshot of the metrics exposed by the sample application that we’ll see in more detail in the next sections:

As you can see there’s a requests_count_total that counts the number of requests for two endpoints, /foo and /bar, along with some other default metrics automatically exported by the client.

Architecture

Prometheus consists of multiple components, as you can see from the documentation. Many of them are optional; the ones relevant to this post are:

the main Prometheus server that scrapes and stores the time series data by pulling them,
the service discovery that discovers the Kubernetes targets,
the Prometheus web UI for querying and visualizing the data.

Prometheus in Kubernetes

Let’s now have a look at how to install Prometheus in Kubernetes! Doing so requires several Kubernetes objects, and we’ll have a look at each of them to understand exactly what we’re shipping.

As an alternative to the “manual” installation that we’re showing here, you can also install Prometheus with Helm. Helm is like a package manager, but for Kubernetes components. It makes things easier, but it hides the details of what is being shipped under the hood, and the goal of this post is to dissect exactly what lies underneath.

Namespace

Nothing fancy here, we’re just going to have our Prometheus application running in the monitoring namespace.

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring

Service account and RBAC

For Prometheus to discover the pods, services, and endpoints in our Kubernetes cluster, it needs some permissions to be granted. By default, a Kubernetes pod is assigned with a default service account in the same namespace. A service account is simply a mechanism to provide an identity for processes that run in a Pod. In our case, we’re going to create a prometheus service account that will be granted some extra permissions.

apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: monitoring
  name: prometheus

One of the ways Kubernetes has for regulating permissions is through RBAC Authorization. The RBAC API provides a set of Kubernetes objects that are used for defining roles with corresponding permissions and how those rules are associated with a set of subjects. In our specific case, we want to grant to the prometheus service account (the subject) the permissions for querying the Kubernetes API to get information about its resources (the rules). This is done by creating a ClusterRole Kubernetes object.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: discoverer
rules:
- apiGroups: [""]
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
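# Ingresses are served from networking.k8s.io on recent clusters;
# extensions is kept for older ones.
- apiGroups:
  - extensions
  - networking.k8s.io
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]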

Check the documentation if you want to know about the difference between Role and ClusterRole.

As you might have guessed from the yaml definition, we’re creating a role called discoverer that has the permissions to perform the operations:

get,
list,
watch,

over the resources:

nodes,
services,
endpoints,
pods,
ingresses.

These are enough for what is covered by this post, but depending on what you’re going to monitor you might need to extend the permissions granted to the discoverer role by adding other rules. Here you can see the possible verbs that can be used.

To recap: now we have the prometheus ServiceAccount as our subject and we have our discoverer ClusterRole carrying the rules. The only missing thing is to bind the discoverer role to the prometheus service account, and this can easily be done using a ClusterRoleBinding.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-discoverer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: discoverer
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring

The yaml is very straightforward: we’re assigning the role defined in roleRef, which in this case is the ClusterRole named discoverer that we previously defined, to the subjects listed in subjects which for us is only the prometheus ServiceAccount.
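To see how subject, role, and binding fit together, here is a deliberately simplified Python model of the check. It is only a sketch for intuition, not how the Kubernetes API server implements authorization: it ignores wildcards, resourceNames, and namespaced Roles, and it only includes the core API group rule of the discoverer role. The names ROLES, BINDINGS, and is_allowed are invented for this example.

```python
# Rules of the discoverer ClusterRole (core API group only, for brevity).
ROLES = {
    "discoverer": [
        {
            "apiGroups": [""],
            "resources": ["nodes", "services", "endpoints", "pods"],
            "verbs": ["get", "list", "watch"],
        },
    ],
}

# The prometheus-discoverer ClusterRoleBinding: role name -> bound subjects.
BINDINGS = {
    "discoverer": [("ServiceAccount", "monitoring", "prometheus")],
}


def is_allowed(subject, api_group, resource, verb):
    """Return True if any role bound to `subject` has a rule covering the request."""
    for role_name, subjects in BINDINGS.items():
        if subject not in subjects:
            continue
        for rule in ROLES[role_name]:
            if (api_group in rule["apiGroups"]
                    and resource in rule["resources"]
                    and verb in rule["verbs"]):
                return True
    return False


PROMETHEUS = ("ServiceAccount", "monitoring", "prometheus")
print(is_allowed(PROMETHEUS, "", "pods", "list"))    # allowed by the rule
print(is_allowed(PROMETHEUS, "", "pods", "delete"))  # verb not granted
```

On a real cluster, kubectl auth can-i with the --as flag is the way to verify what a service account is actually allowed to do.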

All permissions are set!

Deployment and Service

Now the easy part: the Deployment and the Service objects!

They’re both very straightforward. The deployment has a single container running Prometheus v2.33.4 and mounts a ConfigMap that we’ll see in detail in the next section. It also assigns the previously defined service account and exposes port 9090, which will be used to reach the Prometheus web application.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: prom/prometheus:v2.33.4
        ports:
        - containerPort: 9090
        volumeMounts:
          - name: config
            mountPath: /etc/prometheus
      volumes:
        - name: config
          configMap:
            name: prometheus-server-conf

The Service object doesn’t have anything special either: it simply exposes port 9090.

apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  selector:
    app: prometheus
  ports:
  - port: 9090

ConfigMap

Here’s the hard part!

In Prometheus, the configuration file defines everything related to scraping jobs and their instances. In the context of Prometheus, an instance is an endpoint that can be scraped, and a job is a collection of instances with the same purpose. For example, if you have an API server running with 3 instances, then you could have a job called api-server whose instances are the host:port endpoints.

Let’s now deep dive into how to write the configuration file to automatically discover Kubernetes targets!

Prometheus has a lot of configuration options, but for target discovery we’re interested in the scrape_configs section. This section describes the set of targets and the parameters that define how to scrape them. In this post we’re only interested in automatically scraping the Kubernetes services, so we’ll create a single entry for scrape_configs.

Let’s see what the final ConfigMap looks like and then we’ll go through each relevant part.

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  namespace: monitoring
  labels:
    name: prometheus-server-conf
data:
  prometheus.yml: |-
    scrape_configs:
      - job_name: 'kubernetes-service-endpoints'

        scrape_interval: 1s
        scrape_timeout: 1s

        kubernetes_sd_configs:
        - role: endpoints

        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_se7entyse7en_prometheus_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_se7entyse7en_prometheus_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_se7entyse7en_prometheus_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_se7entyse7en_prometheus_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_service
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod

As anticipated, there’s only one job in scrape_configs, called kubernetes-service-endpoints. Along with the job_name, we have scrape_interval, which defines how often a target for this job is scraped, and scrape_timeout, which defines the timeout for each scrape request.

The interesting part comes now. The kubernetes_sd_configs configuration describes how to retrieve the list of targets to scrape using the Kubernetes REST API. Kubernetes has a component called API server that exposes a REST API that lets end-users, different parts of your cluster, and external components communicate with one another.

To discover targets, multiple roles can be chosen. In our case, we chose the role endpoints as that covers the majority of the cases. There are many other roles such as node, service, pod, etc., and each of them discovers targets differently, for example (quoting the documentation):

the node role discovers one target per cluster node,
the service role discovers a target for each service port for each service,
the pod role discovers all pods and exposes their containers as targets,
the endpoints role discovers targets from listed endpoints of a service. If the endpoint is backed by a pod, all additional container ports of the pod, not bound to an endpoint port, are discovered as targets as well,
etc.

As you can see, the endpoints role should cover the majority of the cases. I’d invite you to have a deeper look at the documentation here.

The difference between roles is not only in how targets are discovered but also in which labels are automatically attached to those targets. For example, with role node each target has a label called __meta_kubernetes_node_name that contains the name of the node object, which is not available with role pod. With role pod each target has a label called __meta_kubernetes_pod_name that contains the name of the pod object, which is not available with role node. If you think about it, it’s obvious, because if your target is a node it doesn’t have any pod name simply because it’s not a pod and vice versa.

The nice thing about the endpoints role is that Prometheus attaches different labels depending on what is behind the target: for targets backed by a pod, the labels of the pod role are attached, and if the endpoints belong to a service, the labels of the service role are attached as well. In addition, there’s also a set of extra labels that are available independently of the target.

After this introduction, we’re ready to tackle the relabel_config configuration.

`relabel_config`

Let’s recall for a moment what’s our final goal: we’d like Kubernetes services to be discovered automatically by Prometheus using custom annotations. We want something that will look like this:

  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/scheme: "https"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "9191"

This would mean that the corresponding Kubernetes object is scraped thanks to the prometheus.io/scrape annotation having the value true, and that the metrics can be reached over https on port 9191 at the path /metrics. It is worth noting that the name of the annotation can be anything you want. To showcase this, we’ll use the following:

  annotations:
    se7entyse7en.prometheus/scrape: "true"
    se7entyse7en.prometheus/scheme: "https"
    se7entyse7en.prometheus/path: "/metrics"
    se7entyse7en.prometheus/port: "9191"

As previously mentioned, each target that is scraped comes with some default labels depending on the role and on the type of target. The relabel_config provides the ability to rewrite the set of labels of a target before it gets scraped.

What does this mean? Let’s say for example that thanks to our kubernetes-service-endpoints scraping job configured with role: endpoints Prometheus discovers a Service object by using the Kubernetes API. For each target, the list of rules in relabel_config is applied to that target.

Let’s consider a service as follows:

apiVersion: v1
kind: Service
metadata:
  name: app
  annotations:
    se7entyse7en.prometheus/scrape: "true"
    se7entyse7en.prometheus/scheme: "https"
    se7entyse7en.prometheus/path: "/metrics"
    se7entyse7en.prometheus/port: "9191"
spec:
  selector:
    app: app
  ports:
  - port: 9191

Let’s now apply each relabelling rule one by one to really understand what Prometheus does. Again, when the relabelling rules are applied, Prometheus has just discovered the target but hasn’t scraped its metrics yet. Indeed, we’ll now see that the way the metrics are scraped depends on the relabelling rules. For a full reference of relabel_config see here.

But enough talk, let's see it in practice!

To scrape or not to scrape

The first rule controls whether the target has to be scraped at all or not:

        - source_labels: [__meta_kubernetes_service_annotation_se7entyse7en_prometheus_scrape]
          action: keep
          regex: true

As you can see, source_labels is a list of labels. The values of these labels are first concatenated using a configurable separator, which is ; by default. Given that this rule has only one item, there’s no concatenation happening.
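The concatenation step can be pictured with a couple of lines of Python. This is only an illustration of the behaviour described above; the function name concat_source_labels and the sample label values are made up. A label that is not set on the target contributes an empty string.

```python
def concat_source_labels(labels, source_labels, separator=";"):
    """Join the values of `source_labels`, in order, with `separator`."""
    # A label missing from the target contributes an empty string.
    return separator.join(labels.get(name, "") for name in source_labels)


target = {"__address__": "10.0.0.5:1234", "port_annotation": "9191"}

single = concat_source_labels(target, ["port_annotation"])
joined = concat_source_labels(target, ["__address__", "port_annotation"])
print(single)
print(joined)
```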

By reading the documentation, we can see that for the role: service there’s a meta label called __meta_kubernetes_service_annotation_<annotationname> that maps to the corresponding (slugified) annotation in the service object. In our example then, the concatenated source_labels is simply equal to the string true thanks to se7entyse7en.prometheus/scrape: "true".
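The mapping from annotation name to meta label name can be sketched as follows: characters that are not valid in a label name are replaced with underscores. The function name below is hypothetical and the snippet only mimics the naming, it doesn't talk to Kubernetes or Prometheus.

```python
import re

PREFIX = "__meta_kubernetes_service_annotation_"


def service_annotation_meta_label(annotation_name):
    """Return the meta label name for a service annotation."""
    # Anything outside [a-zA-Z0-9_] (such as '.' and '/') becomes '_'.
    return PREFIX + re.sub(r"[^a-zA-Z0-9_]", "_", annotation_name)


META_LABEL = service_annotation_meta_label("se7entyse7en.prometheus/scrape")
print(META_LABEL)
```

One consequence is that two annotations differing only in such characters, for example a.b/c and a/b.c, would end up with the same meta label name.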

The action: keep makes Prometheus ignore all the targets whose concatenated source_labels don’t match the regex, which in our case is equal to true. Since in our example the regex true matches the value true, the target is not ignored. Don't mistake true for a boolean here: it's just a string matched by the regex, and you could even decide to use a regex that matches an annotation value of "yes, please scrape me".
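Here is a rough Python equivalent of what keep does to a list of discovered targets. It assumes that the regex must match the whole value, which is how Prometheus anchors relabelling regexes, and it uses Python's re module, whose syntax is close to but not identical to the RE2 syntax Prometheus uses. The function name keep and the sample targets are invented for the example.

```python
import re

SCRAPE = "__meta_kubernetes_service_annotation_se7entyse7en_prometheus_scrape"


def keep(targets, source_label, regex):
    """Keep only the targets whose label value fully matches `regex`."""
    return [
        target for target in targets
        if re.fullmatch(regex, target.get(source_label, "")) is not None
    ]


targets = [
    {"name": "annotated", SCRAPE: "true"},
    {"name": "opted-out", SCRAPE: "false"},
    {"name": "not-annotated"},
    {"name": "partial-match", SCRAPE: "true-ish"},
]

kept = keep(targets, SCRAPE, "true")
print([target["name"] for target in kept])
```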

Where are the metrics?

The second rule controls which scheme to use when scraping the metrics from the target:

        - source_labels: [__meta_kubernetes_service_annotation_se7entyse7en_prometheus_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)

According to previous logic, the concatenated source_labels is equal to https thanks to the se7entyse7en.prometheus/scheme: "https" annotation. The action: replace replaces the label in target_label with the concatenated source_labels if the concatenated source_labels matches the regex. In our case, the regex (https?) matches the concatenated source_labels that is https. The outcome is that the label __scheme__ now has the value of https.
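The replace action can be sketched in Python as well. This is a simplified model under a few assumptions: the regex must match the whole concatenated value, the default regex is (.*) and the default replacement is $1, and when the regex doesn't match the labels are left untouched. Details such as removing a label whose new value is empty are not modelled. The function name replace is made up for the example.

```python
import re

SCHEME = "__meta_kubernetes_service_annotation_se7entyse7en_prometheus_scheme"


def replace(labels, source_labels, target_label,
            regex="(.*)", replacement="$1", separator=";"):
    """Return a copy of `labels` with `target_label` rewritten on a match."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)
    if match is None:
        return dict(labels)  # no match: nothing changes
    # Translate $1-style references into Python's \g<1> form.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


# __scheme__ starts out as http, the default scheme of a scrape job.
https_target = replace({"__scheme__": "http", SCHEME: "https"},
                       [SCHEME], "__scheme__", regex="(https?)")
ftp_target = replace({"__scheme__": "http", SCHEME: "ftp"},
                     [SCHEME], "__scheme__", regex="(https?)")
print(https_target["__scheme__"], ftp_target["__scheme__"])
```

The second call shows why the regex matters: an unsupported value such as ftp doesn't match (https?), so __scheme__ keeps its previous value.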

But what is the label __scheme__? It is a special label that tells Prometheus which URL scheme to use when scraping the target's metrics. After the relabelling, the target's metrics will be scraped at __scheme__://__address____metrics_path__, where __address__ and __metrics_path__ are two other special labels, similar to __scheme__. The next rules deal with these.

The third rule controls which path exposes the metrics:

        - source_labels: [__meta_kubernetes_service_annotation_se7entyse7en_prometheus_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)

This rule works exactly like the previous one; the only difference is the regex. With this rule, we replace __metrics_path__ with whatever is in our custom Kubernetes annotation. In our case, it will be equal to /metrics thanks to the se7entyse7en.prometheus/path: "/metrics" annotation.

The fourth rule finally controls the value of __address__ that is the missing part to have the final URL to scrape:

        - source_labels: [__address__, __meta_kubernetes_service_annotation_se7entyse7en_prometheus_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2

This rule is very similar to the previous ones; the differences are that it also has a replacement key and that there are multiple source_labels. Let’s start with the source_labels. As previously explained, the values are concatenated using the separator ;. By default the label __address__ has the form <host>:<port>; with the endpoints role it is the address of the discovered endpoint. The exact port isn't important for our goal, so let’s just assume that it is 1234 and that the host is something like se7entyse7en_app_service. Thanks to the se7entyse7en.prometheus/port: "9191" annotation, the concatenated source_labels is equal to: se7entyse7en_app_service:1234;9191. From this string, we want to keep the host but use the port coming from the annotation. The regex and the replacement configurations are meant exactly for this: the regex uses 2 capturing groups, one for the host and one for the annotated port (the original port is matched by an optional non-capturing group and thus discarded), and the replacement $1:$2 outputs the captured host and port separated by :.
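The regex is easier to trust after trying it out. The sketch below runs the same pattern with Python's re module (this particular pattern behaves the same way in both syntaxes) on the example values used above; rewrite_address is a helper name invented for this example.

```python
import re

# Group 1: host, optional non-capturing group: original port, group 2: annotated port.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address, annotated_port):
    """Return host:annotated_port, or the original address if nothing matches."""
    joined = f"{address};{annotated_port}"  # what source_labels concatenates to
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        return address
    return f"{match.group(1)}:{match.group(2)}"  # the $1:$2 replacement


with_port = rewrite_address("se7entyse7en_app_service:1234", "9191")
without_port = rewrite_address("se7entyse7en_app_service", "9191")
no_annotation = rewrite_address("se7entyse7en_app_service:1234", "")
print(with_port, without_port, no_annotation)
```

Note that when the port annotation is missing the regex doesn't match, so __address__ is left as it was discovered.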

So now we finally have __scheme__, __address__ and __metrics_path__! We said that the target URL that will be used for scraping the metrics is given by:

__scheme__://__address____metrics_path__

If we replace each part we have:

https://se7entyse7en_app_service:9191/metrics
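Expressed as code, assembling the URL from the three special labels is a one-liner. The function name scrape_url is made up, and the label values are the ones from our example:

```python
def scrape_url(labels):
    """Build the scrape URL from the three special labels."""
    return f'{labels["__scheme__"]}://{labels["__address__"]}{labels["__metrics_path__"]}'


relabelled = {
    "__scheme__": "https",
    "__address__": "se7entyse7en_app_service:9191",
    "__metrics_path__": "/metrics",
}
URL = scrape_url(relabelled)
print(URL)
```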

Extra labels

The remaining rules are simply adding some default labels to the metrics when they'll be stored:

        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_service
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod

In this case, we're adding the labels kubernetes_namespace, kubernetes_service, and kubernetes_pod from the corresponding meta labels.

To recap, these are the steps to automatically discover the targets to scrape with the configured labels:

Prometheus discovers the targets using the Kubernetes API according to the kubernetes_sd_config configuration,
Relabelling is applied according to relabel_config,
Targets are scraped according to special labels __address__, __scheme__, __metrics_path__,
Metrics are stored with the labels according to relabel_config and all the labels starting with __ are stripped.
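The last step of the recap can be illustrated with a short sketch: after relabelling, only the labels that don't start with __ end up on the stored series. The function name stored_labels and the sample values (including the pod name) are hypothetical, and labels that Prometheus adds on its own, such as job and instance, are left out.

```python
def stored_labels(labels):
    """Drop every label whose name starts with a double underscore."""
    return {name: value for name, value in labels.items()
            if not name.startswith("__")}


after_relabelling = {
    "__scheme__": "https",
    "__address__": "se7entyse7en_app_service:9191",
    "__metrics_path__": "/metrics",
    "__meta_kubernetes_namespace": "default",
    "kubernetes_namespace": "default",
    "kubernetes_service": "app",
    "kubernetes_pod": "app-example-pod",  # hypothetical pod name
}

FINAL_LABELS = stored_labels(after_relabelling)
print(FINAL_LABELS)
```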

Sample Application

To see Kubernetes service discovery of Prometheus in action we need of course a sample application that we want to be automatically discovered.

The sample application is very simple. It’s a Python API server that exposes a /foo and a /bar endpoint. The /metrics endpoint then exposes the metrics that will be pulled and collected by Prometheus.

To keep things simple, there’s only one metric, which counts the number of requests and has an endpoint label.

Code

I wrote a very simple Python server that does exactly what was mentioned above:

/foo: returns status code 200, and shows the current counter value for /foo,
/bar: returns status code 200, and shows the current counter value for /bar,
/metrics: returns the metrics in the format suitable for Prometheus,
everything else returns 404.

import http.server

from prometheus_client import Counter, exposition

# The client appends the `_total` suffix when exposing a counter,
# so this metric is exported as `requests_count_total`.
COUNTER_NAME = 'requests_count'
REQUESTS_COUNT = Counter(COUNTER_NAME, 'Count', ['endpoint'])

class RequestsHandler(exposition.MetricsHandler):

    def do_GET(self):
        # Anything other than the three known paths is a 404.
        if self.path not in ('/foo', '/bar', '/metrics'):
            self.send_response(404)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write('404: Not found'.encode())
            return

        if self.path in ('/foo', '/bar'):
            # Count the request, then report the updated value.
            REQUESTS_COUNT.labels(endpoint=self.path).inc()
            current_count = self._get_current_count()
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write(
                f'Current count: [{self.path}] {current_count}'.encode())
        else:
            # /metrics: let MetricsHandler render the exposition format.
            return super().do_GET()

    def _get_current_count(self):
        # Pick the `_total` sample whose endpoint label matches this path.
        sample = [s for s in REQUESTS_COUNT.collect()[0].samples
                  if s.name == f'{COUNTER_NAME}_total' and
                  s.labels['endpoint'] == self.path][0]
        return sample.value


if __name__ == '__main__':
    # Initialize both label values so they're exported at 0 from the start.
    for path in ('/foo', '/bar'):
        REQUESTS_COUNT.labels(endpoint=path)

    server_address = ('', 9191)
    httpd = http.server.HTTPServer(server_address, RequestsHandler)
    httpd.serve_forever()

The only dependency is prometheus_client which provides the utilities for tracking the metrics, and exposing them in the format required. Prometheus provides clients for many programming languages, and you can have a look here.

Dockerfile

The Dockerfile is very straightforward as well: it installs Poetry as the dependency manager, installs the dependencies, and sets the CMD to execute the server.

FROM python:3.9

WORKDIR /app

RUN pip install poetry
COPY pyproject.toml .
RUN poetry config virtualenvs.create false && poetry install
COPY ./sample_app ./sample_app

CMD [ "python", "./sample_app/__init__.py" ]

Deployment and Service

The deployment object for the sample app is straightforward: it has a single container and exposes port 9191.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  labels:
    app: app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
      - name: app
        image: sample-app:latest
        imagePullPolicy: Never
        ports:
        - containerPort: 9191

You might have noticed that imagePullPolicy is set to Never. This is simply because, when testing locally with minikube, I’m building the image directly inside the cluster, and this setting prevents Kubernetes from trying to pull the image from a remote registry. When running this on a remote Kubernetes cluster, you'll need to remove it and make the image available from a registry.

There are many ways to provide a Docker image to the local minikube cluster. In our case you can do it as follows:

cd sample-app && minikube image build --tag sample-app .

The service object is straightforward as well, but here we can see the annotations in action, as explained in the previous sections.

apiVersion: v1
kind: Service
metadata:
  name: app
  annotations:
    se7entyse7en.prometheus/scrape: "true"
    se7entyse7en.prometheus/path: "/metrics"
    se7entyse7en.prometheus/port: "9191"
spec:
  selector:
    app: app
  ports:
  - port: 9191

Action!

Let's move to the fun part now!

I'm assuming that you have a Kubernetes cluster up and running either locally, or in any other way you prefer. Let's start by deploying the namespace first for Prometheus and then by deploying everything else:

kubectl apply -f kubernetes/prometheus/namespace.yaml && \
  kubectl apply -f kubernetes/prometheus/

At some point you should be able to see your Prometheus deployment up and running:

kubectl get deployment --namespace monitoring

We should now be able to access the Prometheus web UI. In order to access it, you can port-forward the port:

kubectl port-forward --namespace monitoring deployment/prometheus 9090:9090

If you go to localhost:9090 you should be able to see it and play with it! But until you deploy the sample application there's not much to do: if you check the targets section at http://localhost:9090/targets you'll see that there are no targets.

Ok, let's fix this! Let's now deploy the sample application:

kubectl apply -f kubernetes/app/

As previously mentioned, if you're using minikube, you need to provide the image to your local cluster, unless you pushed it to a registry as you'd usually do in a production environment.

Now you should be able to see the deployment of the sample application up and running as well:

kubectl get deployment

If you now go to the list of Prometheus targets you should see that your sample app has been automatically discovered!

We can now query the metrics from the graph section at http://localhost:9090/graph and ask for the requests_count_total metric. You should be able to see two time series: one for endpoint /foo and the other one for endpoint /bar, both at 0.

Let's try to call those endpoints then and see what happens! Let's first set up the port-forward to the sample app:

kubectl port-forward deployment/app 9191:9191

Now let's call /foo 3 times and /bar 5 times, which you can do simply by curl-ing them:

curl localhost:9191/foo  # 3 times
curl localhost:9191/bar  # 5 times

Let's check the graph again:

Hurray! The metrics are ingested into Prometheus as expected, with the proper labels!

Conclusions

In this post, we covered how to automatically discover Prometheus targets to scrape with Kubernetes service discovery. We went in depth through all the Kubernetes objects, paying particular attention to RBAC authorization and to the semantics of the Prometheus configuration. Finally, we saw everything in action by deploying a sample application that was discovered automatically.

Having a properly monitored infrastructure in a production environment is fundamental to ensure both the availability and the quality of a service. It's easy to see that being able to automatically discover new targets as new services are added makes this task easier.

In addition, being able to expose metrics to Prometheus is also useful for other tasks such as alerting, and also to enable autoscaling through Kubernetes HPA using custom metrics, but these are all topics that would need a separate post.

I hope that this post has been useful and if you have any comments, suggestions, or need any clarification feel free to comment!