mirror of
https://github.com/tomru/org.git
synced 2026-03-03 14:37:26 +01:00
well, moster commit
116
monitoring.org
Normal file
@@ -0,0 +1,116 @@
#+TITLE: Monitoring Concept

* Framing

This concept will focus on monitoring a microservice in a Kubernetes cluster.

* Why do we want to monitor?

Monitoring service metrics enables us to

- track and estimate long-term trends, e.g. demand grows by x%
- compare whether and how changes affect the service
- know what usual behaviour looks like
- alert on unusual behaviour that might lead to a total failure of the service and needs maintenance now
- track whether and how often Service Level Agreements (SLAs) are not met
- understand why a failure or defect of the service happened by correlating metrics

* What do we want to monitor?

#+BEGIN_QUOTE
The four golden signals of monitoring are latency, traffic, errors, and
saturation. --- [[https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals][SRE Book - Google]]
#+END_QUOTE

** Application metrics

*** Latency

How long does the service take to serve a request? The monitoring has to
distinguish between different routes and parameters, as the workload might
differ.

It's also important to distinguish between successful and failed requests, as
the response time might differ heavily.

**** Metrics

- 50th, 90th and 95th percentile of the response time of successful requests per path
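
As a rough illustration of the metric above, the following sketch computes per-path percentiles over successful requests only. The record layout =(path, status, seconds)=, the function names and the nearest-rank method are assumptions made for this example; a real setup would more likely rely on the histogram support of the monitoring system in use.

```python
from collections import defaultdict


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    # Integer ceiling of pct * n / 100, to avoid floating point surprises.
    rank = max(1, (pct * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]


def latency_percentiles(requests, pcts=(50, 90, 95)):
    """requests: iterable of (path, status, seconds); hypothetical record layout.

    Only 2xx responses are considered, so failed requests do not skew the result.
    """
    by_path = defaultdict(list)
    for path, status, seconds in requests:
        if 200 <= status < 300:
            by_path[path].append(seconds)
    return {
        path: {p: nearest_rank(sorted(values), p) for p in pcts}
        for path, values in by_path.items()
    }


# 20 successful requests taking 10 ms .. 200 ms, plus one slow failure.
samples = [("/users", 200, ms / 1000) for ms in range(10, 210, 10)]
samples.append(("/users", 500, 3.0))
result = latency_percentiles(samples)
```

The failed request with its 3 second response time is ignored, which is the reason for separating successful from failed requests in the first place.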

*** Traffic

How much demand is the service receiving?

**** Metrics

- number of requests per path (or per nature of request, like simple database read
  actions and database write actions)
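
A minimal sketch of such a counter, assuming requests are available as =(method, path)= pairs and that the HTTP method is a good enough proxy for "read" versus "write" (both are assumptions for illustration, not something this concept prescribes):

```python
from collections import Counter

# Assumption: these methods only read from the database.
READ_METHODS = {"GET", "HEAD"}


def count_requests(requests):
    """Count requests per (path, kind), where kind is "read" or "write"."""
    counts = Counter()
    for method, path in requests:
        kind = "read" if method.upper() in READ_METHODS else "write"
        counts[(path, kind)] += 1
    return counts


traffic = count_requests([
    ("GET", "/images"),
    ("GET", "/images"),
    ("POST", "/images"),
    ("get", "/users"),
])
```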

*** Errors

Which errors does the service emit, and how many? This can help in tracking down
internal problems of the service, misconfigured clients or responses that are too slow.

**** Metrics

- HTTP status 5xx (internal errors)
- HTTP status 4xx (client errors)
- HTTP status 2xx violating a service level agreement (e.g. answered too slowly)
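
The three categories could be told apart roughly as follows. The function name, the category labels and the 0.5 second limit are hypothetical; the actual limit would come from the agreed SLA.

```python
def classify(status, seconds, sla_seconds=0.5):
    """Map one response to an error category.

    sla_seconds is an assumed response time limit, not a value from this concept.
    """
    if 500 <= status < 600:
        return "internal_error"
    if 400 <= status < 500:
        return "client_error"
    if 200 <= status < 300 and seconds > sla_seconds:
        # Technically successful, but slower than agreed.
        return "sla_violation"
    return "ok"
```

Counting the returned categories per path gives the error metrics listed above.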

*** Saturation

How occupied is your service? This can be one or more of the following,
depending on the nature of your service. These can be derived by tracking
the resources in highest demand.

**** Metrics

- used memory vs. available memory
- free disk space
- free database connections when a pool is used
- free threads when serving requests in a threaded system

In complex systems the 99th percentile of the request time can be a good
indicator of how saturated the system currently is.
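
One way to make the resource metrics above comparable is to express each as a used/capacity ratio and watch the highest one. The sketch below assumes the readings are already collected as =(used, capacity)= pairs; the resource names and numbers are made up.

```python
def saturation(used, capacity):
    """Fraction of a resource that is in use, between 0.0 and 1.0."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return used / capacity


def most_saturated(resources):
    """resources: dict mapping a resource name to (used, capacity).

    Returns the name and ratio of the resource closest to exhaustion.
    """
    name = max(resources, key=lambda n: saturation(*resources[n]))
    return name, saturation(*resources[name])


# Hypothetical readings: GiB of memory, pooled connections, worker threads.
readings = {
    "memory": (6, 8),
    "db_connections": (19, 20),
    "threads": (10, 40),
}
bottleneck = most_saturated(readings)
```

Metrics reported as "free" (disk space, connections, threads) fit the same shape with used = capacity - free.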

** Service metrics

*** Traffic
**** Metrics

- network traffic inbound and outbound

*** Saturation
**** Metrics

- number of pods, so that scaling by the cloud environment is visible

*** Errors
**** Metrics

- service availability

*** Infrastructure
**** Metrics

- deployment in progress
- configuration change applied

** Business metrics

Depending on the service, it might be worth monitoring the business goals.

- KPIs, e.g. number of crashes reported
- images uploaded

* Alerting

Alerting can be done via ticket systems, team chat channels, SMS or systems like
PagerDuty.

It's important to find the right balance between too many and too few alerts.
This needs to be constantly refined based on the changes to and demand on the
service.
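
One common knob for tuning that balance is to alert only when a threshold is exceeded for several consecutive samples, so that a single spike does not page anyone. This is just one possible approach; the function name and the default of three samples are assumptions for the sake of the example.

```python
def should_alert(samples, threshold, min_consecutive=3):
    """True if the most recent min_consecutive samples all exceed the threshold."""
    if len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


# Error rates per minute: one spike versus a sustained problem.
spike = should_alert([0.01, 0.20, 0.02], threshold=0.05)
sustained = should_alert([0.01, 0.20, 0.30, 0.40], threshold=0.05)
```

Raising =min_consecutive= trades fewer false alarms for slower detection, which is exactly the balance that needs regular review.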

** Who is watching the watchers?

If the monitoring or the alerting infrastructure has an issue, a failover
system needs to be able to detect that and raise an alert about the monitoring/alerting
being down (so-called meta-monitoring).
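
A simple way to sketch this is a heartbeat check: the monitoring system regularly reports that it is alive, and an independent watcher raises an alert when the reports stop. The function name and the 120 second limit are hypothetical, and timestamps are passed in explicitly to keep the example self-contained.

```python
def monitoring_is_down(last_heartbeat, now, max_silence=120.0):
    """True if no heartbeat has arrived within max_silence seconds.

    last_heartbeat and now are timestamps in seconds, e.g. from time.time().
    """
    return now - last_heartbeat > max_silence


# The last heartbeat arrived at t=1000 s.
healthy = monitoring_is_down(last_heartbeat=1000.0, now=1060.0)
silent = monitoring_is_down(last_heartbeat=1000.0, now=1300.0)
```

The watcher has to run outside the monitored infrastructure, otherwise the same outage could silence both.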