Monitoring Concept
Framing
This concept will focus on monitoring a microservice in a Kubernetes cluster.
Why do we want to monitor?
Monitoring service metrics enables us to
- track and estimate long-term trends, e.g. demand grows x%
- compare if and how changes affect the service
- know what usual behaviour is
- alert on unusual behaviour that might lead to a total failure of the service and needs maintenance now
- track if and how often Service Level Agreements (SLAs) are not met
- understand why a failure or defect of the service happened by correlating metrics
What do we want to monitor?
The four golden signals of monitoring are latency, traffic, errors, and saturation. — Google SRE Book
Application metrics
Latency
How long does a service take to serve a request? Monitoring has to distinguish between different routes and parameters, as the workload might differ.
It's also important to distinguish between successful and failed requests, as the response time might differ heavily.
Metrics
- 50th, 90th and 95th percentile of the response time of successful requests per path
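As a minimal sketch, the percentiles above can be computed from raw response times grouped by path; the paths and latency values here are illustrative assumptions, not real measurements:

```python
from collections import defaultdict

def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# response times (ms) of successful requests, grouped by path (assumed data)
latencies = defaultdict(list)
for path, ms in [("/items", 40), ("/items", 55), ("/items", 200),
                 ("/items", 48), ("/search", 120), ("/search", 310)]:
    latencies[path].append(ms)

for path, samples in sorted(latencies.items()):
    p50, p90, p95 = (percentile(samples, p) for p in (50, 90, 95))
    print(path, p50, p90, p95)
```

In practice a metrics library (e.g. a Prometheus client) would maintain these as histograms rather than keeping raw samples.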
Traffic
How much demand is the service receiving?
Metrics
- number of requests per path (or by nature of request, e.g. simple database reads vs. database writes)
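A sketch of counting traffic per path and splitting it by the nature of the request; the method/path pairs are assumed example data, and treating GET as a read is a simplification:

```python
from collections import Counter

# requests observed as (method, path) pairs (assumed data)
requests = [("GET", "/items"), ("GET", "/items"),
            ("POST", "/items"), ("GET", "/search")]
traffic = Counter(requests)

# split by nature: reads (GET) vs. writes (everything else)
reads = sum(n for (method, _), n in traffic.items() if method == "GET")
writes = sum(n for (method, _), n in traffic.items() if method != "GET")
print(traffic[("GET", "/items")], reads, writes)
```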
Errors
Which and how many errors does a service emit? This can help in tracking down internal problems of the service, misconfigured clients, or responses that are too slow.
Metrics
- HTTP status 5xx (internal errors)
- HTTP status 4xx (client errors)
- HTTP status 2xx violating a service level agreement (e.g. answered too slowly)
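The three error buckets above can be sketched as a small classifier; the 500 ms SLA limit is an assumed value for illustration:

```python
SLA_MS = 500  # assumed response-time limit from the SLA

def classify(status, duration_ms):
    """Map a response to one of the error buckets, or 'ok'."""
    if 500 <= status < 600:
        return "5xx"            # internal errors
    if 400 <= status < 500:
        return "4xx"            # client errors
    if 200 <= status < 300 and duration_ms > SLA_MS:
        return "sla_violation"  # succeeded, but too slow
    return "ok"

print(classify(503, 20), classify(404, 20),
      classify(200, 900), classify(200, 120))
```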
Saturation
How occupied is your service? Depending on the nature of the service, this can be one or more of the following, derived by tracking the most demanded resources.
Metrics
- used memory vs. available memory
- free disk space
- free database connections when a pool is used
- free threads when serving requests in a threaded system
In complex systems, the 99th-percentile request time can be a good indicator of how saturated the system currently is.
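Each of the saturation metrics above is a used-vs-capacity ratio, and the most saturated resource is often the one worth alerting on. A sketch with assumed pool sizes and usage numbers:

```python
def saturation(used, capacity):
    """Fraction of a resource currently in use (0.0 .. 1.0)."""
    return used / capacity

# assumed example values for the resources listed above
mem = saturation(used=3.2, capacity=4.0)    # memory, GiB
db = saturation(used=18, capacity=20)       # database connections
threads = saturation(used=45, capacity=50)  # worker threads

# the bottleneck is the most saturated resource
print(max(mem, db, threads))
```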
Service metrics
Traffic
Metrics
- Network Traffic inbound and outbound
Saturation
Metrics
- number of pods, so that scaling by the cloud environment becomes visible
Errors
Metrics
- service availability
Infrastructure
Metrics
- deployment in progress
- configuration change applied
Business metrics
Depending on the service, it might be worthwhile to monitor business goals.
- KPIs, e.g. number of crashes reported
- images uploaded
Alerting
Alerting can be done via ticket systems, team chat channels, SMS or systems like Pager Duty.
It's important to find the right balance between too many and too few alerts. This balance needs to be refined constantly based on changes to and demand on the service.
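One common way to reduce noisy alerts is to fire only when a signal stays above its threshold for several consecutive checks, instead of on every spike. A sketch of that idea; the threshold, streak length, and sample values are all assumptions:

```python
def should_alert(samples, threshold, consecutive=3):
    """Alert only if `samples` exceeds `threshold` for
    `consecutive` checks in a row (suppresses short spikes)."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= consecutive:
            return True
    return False

# a brief spike does not alert; a sustained breach does
print(should_alert([0.7, 0.95, 0.6, 0.97], threshold=0.9))
print(should_alert([0.95, 0.97, 0.99, 0.6], threshold=0.9))
```

Alerting systems such as Prometheus Alertmanager express the same idea declaratively (e.g. a `for:` duration on an alert rule).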
Who is watching the watchers?
If the monitoring or alerting infrastructure itself has an issue, a failover system needs to detect this and raise an alert about the monitoring/alerting being down (so-called meta-monitoring).