well, monster commit

2026-03-03 14:37:26 +01:00 · 2020-02-02 22:28:00 +01:00
parent c1bb7336e3
commit 09a6e5e848
8 changed files with 178 additions and 103 deletions
--- a/monitoring.org
+++ b/monitoring.org
@@ -0,0 +1,116 @@
+#+TITLE: Monitoring Concept
+
+* Framing
+
+This concept will focus on monitoring a microservice in a Kubernetes cluster.
+
+* Why do we want to monitor?
+
+Monitoring service metrics enables us to
+
+- track and estimate long-term trends, e.g. demand grows by x%
+- compare if and how changes affect the service
+- know what usual behaviour is
+- alert on unusual behaviour that might lead to a total failure of the service and needs maintenance now
+- track if and how often Service Level Agreements (SLAs) are not met
+- understand why a failure or defect of the service happened by correlating metrics
+
+* What do we want to monitor?
+
+#+BEGIN_QUOTE
+Four golden signals of monitoring are latency, traffic, errors, and
+saturation. --- [[https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals][SRE Book - Google]]
+#+END_QUOTE
+
+** Application metrics
+
+*** Latency
+
+How long does the service take to serve a request? The monitoring has to
+distinguish between different routes and parameters, as the workload might
+differ.
+
+It's also important to distinguish between successful and failed requests, as
+the response time might differ heavily.
+
+**** Metrics
+ - 50th, 90th and 95th percentile of the response time of successful requests per path
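+
+A minimal sketch of how these percentiles could be derived from raw request
+samples, using only the Python standard library. The sample layout
+=(path, status, millis)= and the function name =latency_percentiles= are
+assumptions made for illustration; a real setup would more likely rely on the
+histogram support of the monitoring system in use.
+
```python
from collections import defaultdict
from statistics import quantiles


def latency_percentiles(samples, wanted=(50, 90, 95)):
    """Return {path: {percentile: response time}} for successful requests.

    `samples` is an iterable of (path, http_status, response_time_ms)
    tuples. This layout is a hypothetical one chosen for the sketch.
    """
    by_path = defaultdict(list)
    for path, status, millis in samples:
        # Only successful (2xx) requests count; failed requests often have
        # very different response times and would skew the result.
        if 200 <= status < 300:
            by_path[path].append(millis)

    result = {}
    for path, times in by_path.items():
        if len(times) < 2:
            continue  # quantiles() needs at least two data points
        # n=100 yields 99 cut points; cut point p-1 is the p-th percentile.
        cuts = quantiles(times, n=100, method="inclusive")
        result[path] = {p: cuts[p - 1] for p in wanted}
    return result
```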
+
+*** Traffic
+How much demand is the service receiving?
+
+**** Metrics
+
+ - number of requests per path (or nature of request, like simple database read
+   actions and database write actions)
+
+*** Errors
+Which and how many errors does the service emit? This can help in tracking down
+internal problems of the service, misconfigured clients or too slow responses.
+
+**** Metrics
+
+- HTTP status 5xx (internal errors)
+- HTTP status 4xx (client errors)
+- HTTP status 2xx violating a service level agreement (e.g. answered too slowly)
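+
+The three categories above could be counted roughly as follows. This is only a
+sketch: the 500 ms threshold, the category labels and the function names are
+assumptions for illustration, not part of any agreed SLA.
+
```python
from collections import Counter

# Hypothetical SLA threshold; the real value depends on the agreement.
SLA_MAX_MILLIS = 500.0


def classify(status, millis, sla_max_millis=SLA_MAX_MILLIS):
    """Map one response to an error category, or "ok"."""
    if 500 <= status < 600:
        return "5xx"  # internal errors
    if 400 <= status < 500:
        return "4xx"  # client errors
    if 200 <= status < 300 and millis > sla_max_millis:
        return "2xx_sla_violation"  # successful, but answered too slowly
    return "ok"


def count_errors(responses):
    """Count categories over an iterable of (status, response_time_ms)."""
    return Counter(classify(status, millis) for status, millis in responses)
```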
+
+*** Saturation
+How occupied is the service? This can be one or more of the following,
+depending on the nature of the service. Candidates can be derived by tracking
+the resources most in demand.
+
+**** Metrics
+
+- used memory vs. available memory
+- free disk space
+- free database connections when a pool is used
+- free threads when serving requests in a threaded system
+
+In complex systems the 99th percentile of the response time can be a good
+indicator of the current system saturation.
+
+
+** Service metrics
+
+*** Traffic
+**** Metrics
+
+- Network Traffic inbound and outbound
+
+*** Saturation
+**** Metrics
+
+- number of pods, so that scaling by the cloud environment is visible
+
+*** Errors
+**** Metrics
+
+- service availability
+
+*** Infrastructure
+**** Metrics
+
+- deployment in progress
+- configuration change applied
+
+** Business metrics
+
+Depending on the service it might be worth monitoring the business goals.
+
+- KPIs, e.g. number of crashes reported
+- images uploaded
+
+* Alerting
+
+Alerting can be done via ticket systems, team chat channels, SMS or systems like
+PagerDuty.
+
+It's important to find the right balance between too many and too few alerts.
+This needs to be constantly refined based on the changes to and demand on the
+service.
+
+** Who is watching the watchers?
+
+If the monitoring or the alerting infrastructure has an issue, a failover
+system needs to be able to detect that and raise an alert about the
+monitoring/alerting being down (so-called meta monitoring).
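+
+One common way to approach meta monitoring is a heartbeat check: the
+monitoring system reports in at regular intervals and an independent watchdog
+raises an alert when the reports stop. The sketch below illustrates the idea
+only; the class name =HeartbeatWatchdog= and its parameters are hypothetical,
+and the clock is injectable so that the behaviour can be tested without
+waiting.
+
```python
import time


class HeartbeatWatchdog:
    """Detects that a monitored system has stopped sending heartbeats."""

    def __init__(self, max_silence_seconds, clock=time.monotonic):
        self._max_silence = max_silence_seconds
        self._clock = clock
        # Treat construction time as the first heartbeat.
        self._last_beat = clock()

    def beat(self):
        """Called by the monitoring system to signal that it is alive."""
        self._last_beat = self._clock()

    def monitoring_is_down(self):
        """True if no heartbeat arrived within the allowed silence window."""
        return self._clock() - self._last_beat > self._max_silence
```
+
+The watchdog has to run on infrastructure that is independent of the
+monitoring it observes, otherwise both can fail together.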