#+TITLE: Monitoring Concept

* Framing

This concept will focus on monitoring a microservice in a Kubernetes cluster.

* Why do we want to monitor?

Monitoring service metrics enables us to

- track and estimate long-term trends, e.g. demand grows x%
- compare if and how changes affect the service
- know what usual behaviour is
- alert on unusual behaviour that might lead to a total failure of the service and needs immediate maintenance
- track if and how often Service Level Agreements (SLAs) are not met
- understand why a failure or defect of the service happened by correlating metrics

* What do we want to monitor?

#+BEGIN_QUOTE
Four golden signals of monitoring are latency, traffic, errors, and
saturation. --- [[https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals][SRE Book - Google]]
#+END_QUOTE

** Application metrics

*** Latency

How long does a service take to serve a request? The monitoring has to
distinguish between different routes and parameters, as the workload might
differ.

It's also important to distinguish between successful and failed requests, as
the response time might differ heavily.

**** Metrics

- 50, 90 and 95 percentile of the response time of successful requests per path

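As an illustration of how such percentiles could be derived from raw samples (a sketch, not part of the concept; the =(path, status, duration_ms)= layout and the nearest-rank method are assumptions):

```python
# Sketch: per-path latency percentiles over successful (2xx) requests only.
# The (path, status, duration_ms) sample layout is a hypothetical example.
from collections import defaultdict


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, round(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def latency_percentiles(samples, percentiles=(50, 90, 95)):
    """Group successful requests by path and report the requested percentiles."""
    by_path = defaultdict(list)
    for path, status, duration_ms in samples:
        if 200 <= status < 300:  # failed requests are tracked separately
            by_path[path].append(duration_ms)
    return {
        path: {p: percentile(sorted(durations), p) for p in percentiles}
        for path, durations in by_path.items()
    }


samples = [("/users", 200, d) for d in range(1, 101)] + [("/users", 500, 9000)]
print(latency_percentiles(samples))  # {'/users': {50: 50, 90: 90, 95: 95}}
```

Note that the failed request (status 500) does not distort the successful-request percentiles.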
*** Traffic

How much demand is the service receiving?

**** Metrics

- number of requests per path (or nature of request, like simple database read
  actions and database write actions)

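A small sketch of how both counts could be kept; the method-based read/write split and the request tuples are assumptions for illustration:

```python
# Sketch: request counts per path and per "nature of request" (read vs. write).
from collections import Counter

# Hypothetical classification: treat mutating HTTP methods as writes.
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def traffic_counters(requests):
    """requests: iterable of (method, path) pairs."""
    per_path = Counter(path for _, path in requests)
    per_nature = Counter(
        "write" if method in WRITE_METHODS else "read" for method, _ in requests
    )
    return per_path, per_nature


per_path, per_nature = traffic_counters(
    [("GET", "/users"), ("POST", "/users"), ("GET", "/orders")]
)
print(per_path)    # Counter({'/users': 2, '/orders': 1})
print(per_nature)  # Counter({'read': 2, 'write': 1})
```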
*** Errors

Which and how many errors does a service emit? This can help in tracking down
internal problems of the service, misconfigured clients, or too-slow responses.

**** Metrics

- HTTP status 5xx (internal errors)
- HTTP status 4xx (client errors)
- HTTP status 2xx violating a service level agreement (e.g. answered too slowly)

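The classification above can be sketched as a small function; the =slo_ms= limit is a hypothetical value standing in for the SLA's response-time bound:

```python
# Sketch: bucket each response into one of the error metrics above.
# slo_ms is a hypothetical response-time limit taken from the SLA.
def error_bucket(status, duration_ms, slo_ms=500):
    if 500 <= status < 600:
        return "5xx"
    if 400 <= status < 500:
        return "4xx"
    if 200 <= status < 300 and duration_ms > slo_ms:
        return "2xx_slo_violation"
    return "ok"


print(error_bucket(503, 10))   # 5xx
print(error_bucket(200, 900))  # 2xx_slo_violation
```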
*** Saturation

How occupied is your service? This can be one or multiple of the following,
depending on the nature of your service. These could be derived by tracking
the most demanded resources.

**** Metrics

- used memory vs. available memory
- free disk space
- free database connections when a pool is used
- free threads when serving requests in a threaded system

In complex systems, the request time of the 99th percentile can be a good
metric to show how saturated the system currently is.

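One way to make these heterogeneous resources comparable is a 0..1 usage ratio per resource, so a single alert threshold applies to all of them; the figures below are made-up example values:

```python
# Sketch: express saturation as a used/capacity ratio per resource.
# All figures below are made-up example values.
def saturation(used, capacity):
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return used / capacity


resources = {
    "memory": saturation(used=6.5, capacity=8.0),       # GiB
    "db_connections": saturation(used=18, capacity=20),
    "threads": saturation(used=40, capacity=200),
}
overloaded = {name for name, ratio in resources.items() if ratio > 0.8}
print(sorted(overloaded))  # ['db_connections', 'memory']
```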
** Service metrics

*** Traffic

**** Metrics

- Network Traffic inbound and outbound

*** Saturation

**** Metrics

- number of pods, so that scaling by the cloud environment is transparent

*** Errors

**** Metrics

- service availability

*** Infrastructure

**** Metrics

- deployment in progress
- configuration change applied

** Business metrics

Depending on the service, it might be worth monitoring the business goals.

- KPIs, e.g. number of crashes reported
- images uploaded

* Alerting

Alerting can be done via ticket systems, team chat channels, SMS, or systems
like PagerDuty.

It's important to find the right balance between too many and too few alerts.
This needs to be constantly refined based on changes to and demand on the
service.
** Who is watching the watchers?

If the monitoring or alerting infrastructure itself has an issue, a failover
system needs to be able to detect that and raise an alert about the
monitoring/alerting being down (so-called meta monitoring).
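The core of such a "dead man's switch" can be sketched in a few lines: the monitoring system emits a periodic heartbeat, and an independent failover system alerts when heartbeats stop. Timestamps and the timeout below are hypothetical example values:

```python
# Sketch: meta-monitoring heartbeat check run by an independent failover system.
# Timestamps are in seconds; the timeout is a hypothetical example value.
def monitoring_is_down(last_heartbeat, now, timeout_seconds=120):
    """True if the monitoring system has not reported within the timeout."""
    return now - last_heartbeat > timeout_seconds


print(monitoring_is_down(last_heartbeat=1000, now=1090))  # False
print(monitoring_is_down(last_heartbeat=1000, now=1300))  # True
```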