DevOps | Global Monitoring using Prometheus and Thanos

Prometheus

Prometheus was originally conceived at SoundCloud. Since its inception in 2012, many companies and organisations have adopted it.

Prometheus has become the standard tool for monitoring and alerting in the cloud and container world.

Prometheus uses a time-series data model for metrics and events. The key features of Prometheus are listed below, followed by a short example:

  • a multi-dimensional data model (time series defined by metric name and set of key/value dimensions)
  • a flexible query language to leverage this dimensionality
  • no dependency on distributed storage; single server nodes are autonomous
  • time series collection happens via a pull model over HTTP
  • pushing time series is supported via an intermediary gateway
  • targets are discovered via service discovery or static configuration
  • multiple modes of graphing and dashboarding support
  • support for hierarchical and horizontal federation
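As a small illustration of the data model and query language (the metric and labels below are hypothetical), a scraped target might expose a counter such as:

http_requests_total{method="POST", handler="/api/pets"} 1027

and a PromQL query can then aggregate across those dimensions:

sum by (handler) (rate(http_requests_total[5m]))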
Prometheus Architecture overview

Thanos

Thanos is a set of components that can be composed into a highly available metric system with unlimited storage capacity, which can be added seamlessly on top of existing Prometheus deployments.

Thanos leverages the Prometheus 2.0 storage format to cost-efficiently store historical metric data in any object storage while retaining fast query latencies. Additionally, it provides a global query view across all Prometheus installations and can merge data from Prometheus HA pairs on the fly.

The key features of Thanos are:

  1. Global query view of metrics.
  2. Unlimited retention of metrics.
  3. High availability of components, including Prometheus.
Thanos Architecture overview

Thanos components 

Thanos is made up of a set of components, each filling a specific role.

  • Sidecar: connects to Prometheus, reads its data for queries and/or uploads it to cloud storage
  • Store Gateway: exposes the content of a cloud storage bucket
  • Compactor: compacts and downsamples data stored in remote storage
  • Receiver: receives data from Prometheus’ remote-write WAL, exposes it and/or uploads it to cloud storage
  • Ruler: evaluates recording and alerting rules against data in Thanos for exposition and/or upload
  • Query Gateway: implements Prometheus’ v1 API to aggregate data from the underlying components (see the sketch after this list)
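For example, a Query Gateway instance fanning out to a Sidecar and a Store Gateway might be started roughly as follows (the store addresses are hypothetical and flag names can differ between Thanos versions):

thanos query \
  --http-address=0.0.0.0:10902 \
  --store=thanos-sidecar-0.default.svc:10901 \
  --store=thanos-store-gateway.default.svc:10901 \
  --query.replica-label=prometheus_replica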

Sidecar

Thanos integrates with existing Prometheus servers through a sidecar process which runs in the same pod as the Prometheus server.

The purpose of the Sidecar is to back up Prometheus data into an object storage bucket and to give other Thanos components access, via a gRPC API, to the Prometheus instance the Sidecar is attached to.
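For illustration, a Sidecar invocation looks roughly like this (paths and addresses are typical defaults; flag names may vary between Thanos versions):

thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --tsdb.path=/prometheus \
  --objstore.config-file=/etc/thanos/thanos.yaml \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902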

Application Kubernetes Clusters

To get a global view of all the different environments, ranging from dev through to prod, we configure and install the Prometheus Operator, the Prometheus components, the Thanos Sidecar and an ingress into each cluster.

Prometheus Operator

Configuring Thanos Object Storage

Thanos expects a Kubernetes Secret containing the Thanos configuration. Inside this secret you configure how to run Thanos with your object storage.

Once you have written your configuration, save it to a file called thanos-storage-config.yaml.


Here are a few examples for the major cloud providers:

AWS S3:

type: s3
config:
  bucket: thanos
  endpoint: aws.polarpoint.io
  access_key: XXX
  secret_key: XXX

Google Cloud Storage:

type: GCS
config:
  bucket: ""
  service_account: ""

Azure Blob Storage:

type: AZURE
config:
  storage_account: "XXX"
  storage_account_key: "XXX"
  container: "thanos"

Create the Kubernetes Secret from this file:

kubectl create secret generic thanos-storage-config --from-file=thanos.yaml=thanos-storage-config.yaml --namespace default

As well as the blob storage configuration, we want to ensure all communication is secured using mTLS, so we create a TLS secret signed with the same CA certificate as will be used for the ingress controller, and another secret for the CA certificate itself.

kubectl create secret tls -n default thanos-ingress-secret --key dev-client.key --cert dev-client.cert

kubectl create secret generic -n default thanos-ca-secret --from-file=ca.crt=cacerts.cer

Install the Prometheus Operator Helm chart using the following values file (prometheus-operator-thanos-values.yaml):

prometheus:
  prometheusSpec:
    replicas: 2
    retention: 12h   # we only need a few hours of retention, since the rest is uploaded to blob storage
    image:
      tag: v2.10.0
    serviceMonitorNamespaceSelector:  # find target config from multiple namespaces
      any: true
    thanos:           # add the Thanos Sidecar
      tag: v0.5.0
      objectStorageConfig:  # blob storage to upload metrics
        key: thanos.yaml
        name: thanos-storage-config
grafana:
  enabled: false

helm install --name dev-prom stable/prometheus-operator -f prometheus-operator-thanos-values.yaml --tiller-namespace=default

We now have the Prometheus Operator installed in the application cluster.

kubectl get svc -n default -o wide

NAME                                        TYPE        CLUSTER-IP   PORT(S)     AGE   SELECTOR
dev-prom-kube-state-metrics                 ClusterIP   xxxx         8080/TCP    29d   app=kube-state-metrics,release=int-prom
dev-prom-prometheus-node-exporter           ClusterIP   xxxx         9100/TCP    29d   app=prometheus-node-exporter,release=dev-prom
dev-prom-prometheus-operat-alertmanager     ClusterIP   xxxx         9093/TCP    29d   alertmanager=dev-prom-prometheus-operat-alertmanager,app=alertmanager
dev-prom-prometheus-operat-operator         ClusterIP   xxxx         8080/TCP    29d   app=prometheus-operator-operator,release=dev-prom
dev-prom-prometheus-operat-prometheus       ClusterIP   xxxx         9090/TCP    29d   app=prometheus,prometheus=dev-prom-prometheus-operat-prometheus
kubernetes                                  ClusterIP   xxxx         443/TCP     29d
prometheus-operated                         ClusterIP   None         9090/TCP    11d   app=prometheus
thanos-sidecar-0                            ClusterIP   xxxx         10901/TCP   29d   statefulset.kubernetes.io/pod-name=prometheus-dev-prom-prometheus-operat-prometheus-0

To enable Thanos to query the application cluster metrics, the Thanos Query component needs access to the Thanos Sidecars running in each application cluster, so we expose them via an Ingress secured with mTLS using the TLS and CA secrets defined above.
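A minimal sketch of what thanos-ingress-rules.yaml could contain, assuming the NGINX ingress controller (the annotations shown are illustrative and differ between controllers):

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: thanos-sidecar-0
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"        # the Sidecar speaks gRPC
    nginx.ingress.kubernetes.io/auth-tls-verify-client: "on"    # require client certificates (mTLS)
    nginx.ingress.kubernetes.io/auth-tls-secret: "default/thanos-ca-secret"
spec:
  tls:
    - hosts:
        - prom.dev-polarpoint.local
      secretName: thanos-ingress-secret
  rules:
    - host: prom.dev-polarpoint.local
      http:
        paths:
          - backend:
              serviceName: thanos-sidecar-0
              servicePort: 10901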

kubectl apply -f thanos-ingress-rules.yaml -n default
kubectl get ingress

NAME               HOSTS                       ADDRESS   PORTS     AGE
thanos-sidecar-0   prom.dev-polarpoint.local             80, 443   29d

DevOps | Continuous Integration, Continuous Delivery and Continuous Deployment…

Historically, the Maven repository format provided a mechanism to easily resolve dependencies and store artefacts for Java applications, supporting users of Apache Maven, Apache Ant/Ivy, Aether, Gradle, and others. However, it can also serve as a robust store for other artefacts we wish to keep, such as zip files, as part of our Continuous Delivery pipelines.

Maven Repository Format

The Maven repository version policy can be configured for each repository as follows:

Snapshot

Continuous development is typically performed with snapshot versions, supported by the Snapshot version policy. These version values have to end with -SNAPSHOT. This allows repeated uploads, where the actual version stored is composed of a date/timestamp and a build number, while retrieval can still use the -SNAPSHOT version string.

e.g. springBootApp-1.12.0-SNAPSHOT.jar

Release

A Maven repository can be configured for final release artefacts, with only one artefact allowed per version (the -RELEASE part isn’t originally part of the Maven2 repository format).

springBootApp-1.12.0-RELEASE.jar

Any attempt to update this version results in an error:

Failed to deploy artifacts: Could not transfer artifact io.polarpoint.spring:springBootApp:jar:1.12.0-RELEASE 
from/to releases (https://xxxxx/repository/releases): Failed to transfer file: 
https://xxxx/repository/releases/io/polarpoint/spring/springBootApp/1.12.0-RELEASE/springBootApp-1.12.0-RELEASE.jar.
Return code is: 400, ReasonPhrase:Repository does not allow updating assets: releases.

Maven in Continuous Integration and Deployment

Using the Maven repository format to store our artefacts, and a modified version of git flow, we can ensure we have only one stable build (-RELEASE) when the development branch is merged into the master branch.
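As a sketch of how a build might route artefacts to the right repository, assuming a Gradle build with the maven-publish plugin and hypothetical Nexus URLs (the actual pipeline may differ):

// build.gradle (sketch) -- repository URLs and property names are hypothetical
plugins {
    id 'java'
    id 'maven-publish'
}

publishing {
    publications {
        mavenJava(MavenPublication) {
            from components.java   // publish the built jar
        }
    }
    repositories {
        maven {
            name = 'nexus'
            def releasesUrl  = 'https://nexus.polarpoint.io/repository/releases/'
            def snapshotsUrl = 'https://nexus.polarpoint.io/repository/snapshots/'
            // -SNAPSHOT builds go to the snapshot repository, everything else to releases
            url = version.toString().endsWith('SNAPSHOT') ? snapshotsUrl : releasesUrl
            credentials {
                username = project.findProperty('nexusUser')
                password = project.findProperty('nexusPassword')
            }
        }
    }
}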

Semantic Versioning

Semantic Versioning is a convention used to give meaning to versions. The concept is simple: every version has three parts, x.y.z (major.minor.patch).

Our global shared library maintains the version numbers for all artefacts according to a set of rules (a small sketch follows the list):

  • increment the major version when you make breaking changes
  • increment the minor version when you add functionality in a backward-compatible manner
  • increment the patch version when you make backward-compatible bug fixes
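A minimal sketch of how such a bump could be expressed as a shared-library step (the helper name and its location are hypothetical, not the actual library):

// vars/bumpVersion.groovy -- illustrative only
def call(String current, String change) {
    def (major, minor, patch) = current.tokenize('.').collect { it as int }
    if (change == 'major') {
        return "${major + 1}.0.0"
    } else if (change == 'minor') {
        return "${major}.${minor + 1}.0"
    }
    return "${major}.${minor}.${patch + 1}"   // default: patch bump
}

// e.g. bumpVersion('1.12.0', 'minor') would yield '1.13.0'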

Jenkins Global Libraries

The requirement for a common pipeline that can be used across multiple projects does not only emerge in microservice architectures.

It is advantageous in ensuring projects and workflows follow a set blueprint, maintaining a consistent and repeatable process and eliminating duplicate code.

Using libraries

To access shared libraries, the Jenkinsfile needs to use the @Library annotation, specifying the library’s name:

#!/usr/bin/env groovy

@Library('pipeline-library@development')

import io.polarpoint.workflow.*

/* use the global workflow shared library */

With the library loaded we can specify the pipeline we wish to use.

properties([

 	buildDiscarder(logRotator(artifactDaysToKeepStr: '30', artifactNumToKeepStr: '10', daysToKeepStr: '30', numToKeepStr: '10')),

 	[$class: 'RebuildSettings', autoRebuild: true, rebuildDisabled: false]

 ])


def configuration = "pipelines/conf/configuration.json"
invokePipeline('springBootApp', configuration )

invokePipeline() is loaded from the global shared library, and the following branch logic selects which workflow stages are run:

…

if (env.BRANCH_NAME =~ /^master/) {
    echo("[Pipeline] master branch being built: " + application)
    versioningWorkflow(configurationContext, 'master', scmVars)
} else if (env.BRANCH_NAME =~ /^(development$|hotfix\/)/) {
    echo("[Pipeline] development or hotfix branch being built and tagged: " + application)
    versioningWorkflow(configurationContext, env.BRANCH_NAME, scmVars)
} else if (env.BRANCH_NAME =~ /(SB|NT)-\d*/) {
    echo("[Pipeline] feature branch being built: " + application)
    buildWorkflow(configurationContext, env.BRANCH_NAME, scmVars)

…

Feature branch builds call buildWorkflow(), which contains stages such as unit tests, code coverage, code quality and quality gates.

A pull request is raised, requiring a peer review before the branch is merged into development. The merge to the development branch triggers Jenkins to execute versioningWorkflow(), which runs through the tests before tagging the artefacts in git and updating the artefact version number.

14:43:48 + git add pipelines/conf/configuration.json
[Pipeline] sh
14:43:51 + git commit -m [jenkins-versioned] v1.12.0 + jenkins-development-257
[Pipeline] sh
14:43:53 + git tag -a v1.12.0 -m v1.12.0
[Pipeline] sh
14:43:55 + git push https://****:****@xxxxx/springBootApp.git
14:44:02 To https://xxxxxx/springBootApp.git
14:44:02    06c8098..203930e  development -> development
14:44:02 + git push https://****:****@xxxx/springBootApp.git --tags
14:44:09 To https://****:****@xxxxx/springBootApp.git
14:44:09  * [new tag]         v1.12.0 -> v1.12.0

A snapshot release of the build artefact is stored as a Maven2 snapshot.

Uploading artifact springBootApp-1.12.0-SNAPSHOT.jar started....

GroupId: io.polarpoint.spring

ArtifactId: springBootApp

Classifier: null

Type: jar

Version: 1.12.0-SNAPSHOT

File: springBootApp-1.12.0-SNAPSHOT.jar

Repository:packages

Once all tests have completed, we can confidently merge development into master to create a final release.

git checkout development

Merge the master branch into it using the "ours" strategy.

git merge -s ours master

Check out the master branch.

git checkout master

Merge the development branch into master and push.

git merge development 
git push

The merge to the master branch triggers Jenkins to execute versioningWorkflow(), which runs through the tests before uploading the stable version of our code.

Although we run versioningWorkflow() on the master branch, we only apply versioning for development or hotfix branches.

Using Jenkins, it is possible to create commonly used pipelines for some of the more mundane but essential stages of Continuous Integration.

References

Shared Libraries

https://jenkins.io/doc/book/pipeline/shared-libraries/index.html

Nexus Repository

https://www.sonatype.com/nexus-repository-oss

Semantic Versioning

https://github.com/vdurmont/semver4j

DevOps | Continuous Integration, Continuous Delivery and Continuous Deployment…

Docker

Docker provides a user-space runtime environment, allowing applications and services to be executed with a smaller operating-system footprint via containers. With containers it becomes easier for teams across different units, such as Development, UAT and Operations, to work seamlessly across applications. Docker containers are lightweight and easily scalable.

Dockerfile

This simple Dockerfile encapsulates a Java Spring Boot microservice.

FROM azul/zulu-openjdk:11.0.2 

EXPOSE 8080

RUN mkdir -p /app/

ADD build/libs/springBootApp*.jar /app/springBootApp.jar

ENTRYPOINT exec java -jar /app/springBootApp.jar

Nexus Private Registry

Having our images pushed to Docker Hub is satisfactory for publicly available images; however, we want to use our own private registry. Private Docker registries are supported in Nexus 3, and we can ensure only one version of each image is published. We have avoided the use of ‘latest’ when working with Docker images and applied the same semantic versioning strategy we apply to other artefacts (https://www.linkedin.com/pulse/devops-part-2-continuous-integration-delivery-deployment-surjit-bains/).

Create a Repository in Nexus


We disable redeploy for this repository so that each version of a Docker image is immutable. We then create another Docker repository (docker-stage), with redeploy enabled, to stage our Docker images and apply updates. DNS entries and reverse proxies for image.polarpoint.io and stage-image.polarpoint.io route to the repository and port of our two new Docker repositories.
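As a minimal sketch only, a reverse-proxy entry for the staging registry could look roughly like this, assuming NGINX and a hypothetical Nexus docker-stage connector listening on port 5001:

server {
    listen 443 ssl;
    server_name stage-image.polarpoint.io;            # hostname routed to the docker-stage repository

    ssl_certificate     /etc/nginx/certs/stage-image.crt;
    ssl_certificate_key /etc/nginx/certs/stage-image.key;

    location / {
        proxy_pass http://nexus.polarpoint.io:5001;   # hypothetical Nexus HTTP connector for docker-stage
        proxy_set_header Host $host;
        client_max_body_size 0;                       # allow large image layer uploads
    }
}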

Building Docker images 

Building a Docker image is started as follows:

docker build -t static-image.polarpoint.io/springBootApp:0.13.0-SNAPSHOT .
Sending build context to Docker daemon  768.3MB
Step 1/5 : FROM azul/zulu-openjdk:11.0.2
11.0.2: Pulling from azul/zulu-openjdk
6cf436f81810: Already exists
987088a85b96: Already exists
b4624b3efe06: Already exists
d42beb8ded59: Already exists
d15285a66a34: Pull complete
9734df7bad40: Pull complete
Digest: sha256:3e6fac72ac772ffcc78044dc2088805b6324213cc7624c5ae5b4fa7b1fcd5e47
Status: Downloaded newer image for azul/zulu-openjdk:11.0.2
 ---> 2426b4e13c8c
Step 2/5 : EXPOSE 8080
 ---> Running in ef64a8b668ff
Removing intermediate container ef64a8b668ff
 ---> d30a49157890
Step 3/5 : RUN mkdir -p /app/
 ---> Running in c641ac156f73
Removing intermediate container c641ac156f73
 ---> 9b094a4c5a8e
Step 4/5 : ADD build/libs/springBootApp-*.jar /app/springBootApp.jar
 ---> 07c12d8ff1c6
Step 5/5 : ENTRYPOINT exec java -jar /app/springBootApp.jar
 ---> Running in 09ab672c55fb
Removing intermediate container 09ab672c55fb
 ---> 21d04fa1f1f5
Successfully built 21d04fa1f1f5
Successfully tagged static-image.polarpoint.io/springBootApp:0.13.0-SNAPSHOT

Helm

Helm is the package manager for Kubernetes. Applications are packaged in charts: collections of templated Kubernetes components, including the definition and configuration of the resources to be deployed to a Kubernetes cluster. Helm is made up of client-side and server-side elements: helm refers to the client-side command, while Tiller is the server-side component that runs in Kubernetes and handles the Helm packages.

Installing Anchore Engine

Anchore is an open-source container compliance platform that ensures the security and stability of production container deployments. It can be installed with default values in one line:

helm install --name pol-io-comp-anchore stable/anchore-engine

After a few minutes, Anchore Engine is installed into our cluster and has updated its CVE vulnerability data:

helm ls
NAME            	REVISION	UPDATED                 	STATUS  	CHART                	NAMESPACE

pol-io-comp-anchore	1       	Sun Mar 17 18:51:58 2019	DEPLOYED	anchore-engine-0.11.0	jenkins

We can list the services running in our cluster; the anchore-engine-api service is listening on NodePort 32199:

kubectl get services 
NAME                                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)          
pol-io-comp-anchore-anchore-engine-api           NodePort    xxxx            <none>        8228:32199/TCP  
pol-io-comp-anchore-anchore-engine-catalog       ClusterIP   xxxx            <none>        8082/TCP        
pol-io-comp-anchore-anchore-engine-policy        ClusterIP   xxxx            <none>        8087/TCP        
pol-io-comp-anchore-anchore-engine-simplequeue   ClusterIP   xxxx            <none>        8083/TCP        
pol-io-comp-anchore-postgresql                   ClusterIP   xxxx            <none>        5432/TCP        

Configuring the Anchore Plugin in Jenkins

Install the Anchore plugin into Jenkins and configure it to use the Anchore API service ClusterIP and port:
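Assuming the service listed above runs in the jenkins namespace, the engine URL configured in the plugin would typically be along the lines of http://pol-io-comp-anchore-anchore-engine-api.jenkins.svc.cluster.local:8228/v1, together with the engine credentials chosen at install time.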


Adding additional stages to our Jenkins pipeline

With the Anchore engine installed, we add a number of extra stages to our Jenkins pipeline.

Build Docker image using the staging Docker registry:

...

def imageLine = "${stagingDockerRegistry}/${dockerName}"
sh """
         docker login -u ${USER} -p ${PASSWORD} ${stagingDockerRegistry}
   """
appContainer = docker.build("${imageLine}")

...

Push the created image to our staging Docker registry:

...
def imageLine = "${stagingDockerRegistry}/${dockerName}"
echo "Build: about to call docker publish to staging repository ${imageLine}"
appContainer.push()
...

Scan the newly created image using our Anchore engine running in the cluster:

...
def imageLine = "${stagingDockerRegistry}/${dockerName}"
writeFile file: 'anchore_images', text: imageLine
anchore name: 'anchore_images'
...

This generates a report within Jenkins that summarises the common vulnerabilities and exposures (CVE) findings:


As we have passed the security gate, we pull the Docker image from our staging repository, retag it and push it to our stable private registry:

...
    sh """
              docker login -u ${USER} -p ${PASSWORD} ${stagingDockerRegistry}
              docker pull  ${stagingImageLine}
              docker tag  ${stagingImageLine}  ${imageLine}

              docker login -u ${PRIV_USER} -p ${PRIV_PASSWORD} ${dockerRegistry}
              docker push ${imageLine}

       """
...
}

Nightly Scanning

In addition to running scans against our images as they are built, we want to scan all the -FINAL images stored in our private Docker registry each night. This allows us to create a strategy for patching Docker images that have been deployed to production, ensuring we can identify new vulnerabilities in our production-deployed images.

We add a Jenkins pipeline that runs every night, builds the list of images and pushes it to the Anchore engine to be scanned:

try {
    sh """
    curl https://image.polarpoint.io:443/v2/_catalog | jq '.repositories[]' | sort | xargs -I _ curl -s -k -X GET https://image.polarpoint.io:443/v2/_/tags/list | jq -M '.["name"] + ":" + .["tags"][]' | grep 'FINAL' >> anchore_images
    """
    anchore name: 'anchore_images'
} catch (err) {
    // failure handling elided
    throw err
}
 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1531  100  1531    0     0   2161      0 --:--:-- --:--:-- --:--:--  2159

...
"terraform:0.6.0-FINAL"
"visualiser:0.8.0-FINAL"
"visualiser:0.9.0-FINAL"
"springBootApp:0.15.0-RELEASE"
...
02:03:16  2019-03-31T02:03:16.229 INFO   AnchoreWorker   Querying vulnerability listing for image.mycnets.com/terraform:0.1.0-RELEASE
02:03:16  2019-03-31T02:03:16.334 INFO   AnchoreWorker   Querying vulnerability listing for image.mycnets.com/terraform:0.2.0-RELEASE
02:03:16  2019-03-31T02:03:16.443 INFO   AnchoreWorker   Querying vulnerability listing for image.mycnets.com/terraform:0.3.0-RELEASE
02:03:16  2019-03-31T02:03:16.565 INFO   AnchoreWorker   Querying vulnerability listing for image.mycnets.com/terraform:0.4.0-RELEASE
02:03:16  2019-03-31T02:03:16.691 INFO   AnchoreWorker   Querying vulnerability listing for image.mycnets.com/terraform:0.5.0-RELEASE
02:03:16  2019-03-31T02:03:16.799 INFO   AnchoreWorker   Querying vulnerability listing for image.mycnets.com/terraform:0.6.0-RELEASE
02:03:17  2019-03-31T02:03:17.541 INFO   AnchoreWorker   Querying vulnerability listing for image.mycnets.com/visualiser:0.1.0-RELEASE
02:03:17  2019-03-31T02:03:17.679 INFO   AnchoreWorker   Querying vulnerability listing for image.mycnets.com/visualiser:0.2.0-RELEASE
02:03:17  2019-03-31T02:03:17.805 INFO   AnchoreWorker   Querying vulnerability listing for image.mycnets.com/visualiser:0.3.0-RELEASE
02:03:17  2019-03-31T02:03:17.920 INFO   AnchoreWorker   Querying vulnerability listing for image.mycnets.com/visualiser:0.4.0-RELEASE
02:03:18  2019-03-31T02:03:18.035 INFO   AnchoreWorker   Querying vulnerability listing for image.mycnets.com/visualiser:0.5.0-RELEASE
02:03:18  2019-03-31T02:03:18.149 INFO   AnchoreWorker   Querying vulnerability listing for image.mycnets.com/visualiser:0.6.0-RELEASE
02:03:18  2019-03-31T02:03:18.273 INFO   AnchoreWorker   Querying vulnerability listing for image.mycnets.com/visualiser:0.7.0-RELEASE
02:03:18  2019-03-31T02:03:18.389 INFO   AnchoreWorker   Querying vulnerability listing for image.mycnets.com/visualiser:0.8.0-RELEASE
02:03:18  2019-03-31T02:03:18.506 INFO   AnchoreWorker   Querying vulnerability listing for image.mycnets.com/visualiser:0.9.0-RELEASE
02:03:18  Archiving artifacts

References

Nexus:

https://www.sonatype.com/nexus-repository-oss

Helm:

https://helm.sh/

Anchore container scanner plugin:

https://wiki.jenkins.io/display/JENKINS/Anchore+Container+Image+Scanner+Plugin

Anchore engine

DevOps | Continuous Integration, Continuous Delivery and Continuous Deployment…

At the centre of most modern software development is a branching strategy that fits comfortably, is adaptable and can easily be enforced. Although git flow lends itself to projects that have a scheduled release, it is still pivotal when working with microservices with shorter cycle times and quicker releases.

Master and Development branches

This workflow uses two branches to record the history of the project. The master branch stores the official stable release history, and the development branch serves as an integration branch for features and bug fixes.
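A minimal sketch of creating the long-lived development branch alongside master (assuming an origin remote already exists):

$ git checkout -b development master
$ git push --set-upstream origin development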

Bugfix and Feature Branches

Bugfix and Feature branches are created from the development branch. When a bugfix or feature is complete, it gets merged back into the development branch.

$ git checkout development
$ git checkout -b feature/SB-123

Once the local development of the code has been completed, commit and push the changes:

$ git pull
$ git add *
$ git commit -m "#SB-123 New dog enum for Pets"
$ git push --set-upstream origin feature/SB-123

Production Hotfix


A production hotfix is very similar to a feature branch release, except that you do your work in a branch taken directly off the master branch.

Create a hotfix branch based off the master branch that is currently deployed to production:

$ git checkout master
$ git checkout -b hotfix/SB-1121 
$ git push --set-upstream origin hotfix/SB-1121

Fix the bug, and commit.

$ git add *
$ git commit -m "Add missing cat Enum"
$ git push

Following a workflow based around git flow, the relationship between branches can be summarised in the table below.
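Based on the workflow described above (hotfix merge targets follow standard git flow conventions):

Branch        Created from    Merged back into
feature/*     development     development
bugfix/*      development     development
hotfix/*      master          master (and development)
development   master          master (for a release)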

References

git flow

http://nvie.com/posts/a-successful-git-branching-model/