#+TITLE: Monitoring Concept

* Framing

This concept will focus on monitoring a microservice in a Kubernetes cluster.

* Why do we want to monitor?

Monitoring service metrics enables us to

- track and estimate long-term trends, e.g. demand grows x%
- compare if and how changes affect the service
- know what usual behaviour is
- alert on unusual behaviour that might lead to a total failure of the service and needs immediate maintenance
- track if and how often Service Level Agreements (SLAs) are not met
- understand why a failure or defect of the service happened by correlating metrics

* What do we want to monitor?

#+BEGIN_QUOTE
Four golden signals of monitoring are latency, traffic, errors, and
saturation. --- [[https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals][SRE Book - Google]]
#+END_QUOTE

** Application metrics

*** Latency

How long does a service take to serve a request? The monitoring has to
distinguish between different routes and parameters, as the workload might
differ.

It's also important to distinguish between successful and failed requests, as
the response time might differ heavily.

**** Metrics

- 50, 90 and 95 percentile of the response time of successful requests per path

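As an illustration of how such percentiles could be derived from raw samples (a sketch, not part of the concept; the =(path, status, duration_ms)= layout and the nearest-rank method are assumptions):

```python
# Sketch: per-path latency percentiles over successful (2xx) requests only.
# The (path, status, duration_ms) sample layout is a hypothetical example.
from collections import defaultdict


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, round(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def latency_percentiles(samples, percentiles=(50, 90, 95)):
    """Group successful requests by path and report the requested percentiles."""
    by_path = defaultdict(list)
    for path, status, duration_ms in samples:
        if 200 <= status < 300:  # failed requests are tracked separately
            by_path[path].append(duration_ms)
    return {
        path: {p: percentile(sorted(durations), p) for p in percentiles}
        for path, durations in by_path.items()
    }


samples = [("/users", 200, d) for d in range(1, 101)] + [("/users", 500, 9000)]
print(latency_percentiles(samples))  # {'/users': {50: 50, 90: 90, 95: 95}}
```

Note that the failed request (status 500) does not distort the successful-request percentiles.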
*** Traffic

How much demand is the service receiving?

**** Metrics

- number of requests per path (or nature of request, like simple database read
  actions and database write actions)

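A small sketch of how both counts could be kept; the method-based read/write split and the request tuples are assumptions for illustration:

```python
# Sketch: request counts per path and per "nature of request" (read vs. write).
from collections import Counter

# Hypothetical classification: treat mutating HTTP methods as writes.
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def traffic_counters(requests):
    """requests: iterable of (method, path) pairs."""
    per_path = Counter(path for _, path in requests)
    per_nature = Counter(
        "write" if method in WRITE_METHODS else "read" for method, _ in requests
    )
    return per_path, per_nature


per_path, per_nature = traffic_counters(
    [("GET", "/users"), ("POST", "/users"), ("GET", "/orders")]
)
print(per_path)    # Counter({'/users': 2, '/orders': 1})
print(per_nature)  # Counter({'read': 2, 'write': 1})
```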
*** Errors

Which and how many errors does a service emit? This can help in tracking down
internal problems of the service, misconfigured clients, or too-slow responses.

**** Metrics

- HTTP status 5xx (internal errors)
- HTTP status 4xx (client errors)
- HTTP status 2xx violating a service level agreement (e.g. answered too slowly)

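The classification above can be sketched as a small function; the =slo_ms= limit is a hypothetical value standing in for the SLA's response-time bound:

```python
# Sketch: bucket each response into one of the error metrics above.
# slo_ms is a hypothetical response-time limit taken from the SLA.
def error_bucket(status, duration_ms, slo_ms=500):
    if 500 <= status < 600:
        return "5xx"
    if 400 <= status < 500:
        return "4xx"
    if 200 <= status < 300 and duration_ms > slo_ms:
        return "2xx_slo_violation"
    return "ok"


print(error_bucket(503, 10))   # 5xx
print(error_bucket(200, 900))  # 2xx_slo_violation
```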
*** Saturation

How occupied is your service? This can be one or multiple of the following,
depending on the nature of your service. These could be derived by tracking
the most demanded resources.

**** Metrics

- used memory vs. available memory
- free disk space
- free database connections when a pool is used
- free threads when serving requests in a threaded system

In complex systems, the request time of the 99th percentile can be a good
metric to show how saturated the system currently is.

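One way to make these heterogeneous resources comparable is a 0..1 usage ratio per resource, so a single alert threshold applies to all of them; the figures below are made-up example values:

```python
# Sketch: express saturation as a used/capacity ratio per resource.
# All figures below are made-up example values.
def saturation(used, capacity):
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return used / capacity


resources = {
    "memory": saturation(used=6.5, capacity=8.0),       # GiB
    "db_connections": saturation(used=18, capacity=20),
    "threads": saturation(used=40, capacity=200),
}
overloaded = {name for name, ratio in resources.items() if ratio > 0.8}
print(sorted(overloaded))  # ['db_connections', 'memory']
```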
** Service metrics

*** Traffic

**** Metrics

- Network Traffic inbound and outbound

*** Saturation

**** Metrics

- number of pods, so that scaling by the cloud environment is transparent

*** Errors

**** Metrics

- service availability

*** Infrastructure

**** Metrics

- deployment in progress
- configuration change applied

** Business metrics

Depending on the service, it might be worth monitoring the business goals.

- KPIs, e.g. number of crashes reported
- images uploaded

* Alerting

Alerting can be done via ticket systems, team chat channels, SMS, or systems
like PagerDuty.

It's important to find the right balance between too many and too few alerts.
This needs to be constantly refined based on changes to and demand on the
service.
** Who is watching the watchers?

If the monitoring or alerting infrastructure itself has an issue, a failover
system needs to be able to detect that and raise an alert about the
monitoring/alerting being down (so-called meta monitoring).
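The core of such a "dead man's switch" can be sketched in a few lines: the monitoring system emits a periodic heartbeat, and an independent failover system alerts when heartbeats stop. Timestamps and the timeout below are hypothetical example values:

```python
# Sketch: meta-monitoring heartbeat check run by an independent failover system.
# Timestamps are in seconds; the timeout is a hypothetical example value.
def monitoring_is_down(last_heartbeat, now, timeout_seconds=120):
    """True if the monitoring system has not reported within the timeout."""
    return now - last_heartbeat > timeout_seconds


print(monitoring_is_down(last_heartbeat=1000, now=1090))  # False
print(monitoring_is_down(last_heartbeat=1000, now=1300))  # True
```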