Monitoring Concept
Framing
This concept will focus on monitoring a microservice in a Kubernetes cluster.
Why do we want to monitor?
Monitoring service metrics enables us to
- track and estimate long-term trends, e.g. demand grows x%
- compare if and how changes affect the service
- know what usual behaviour is
- alert on unusual behaviour that might lead to a total failure of the service and needs maintenance now
- track if and how often Service Level Agreements (SLAs) are not met
- understand why a failure or defect of the service happened by correlating metrics
What do we want to monitor?
The four golden signals of monitoring are latency, traffic, errors, and saturation. — Google SRE Book
Application metrics
Latency
How long does a service take to serve a request? Monitoring has to distinguish between different routes and parameters, as the workload might differ.
It's also important to distinguish between successful and failed requests, as the response time might differ heavily.
Metrics
- 50th, 90th and 95th percentile of the response time of successful requests per path
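As a minimal sketch, the percentiles above can be computed from raw response times grouped by path; the paths and latency values here are illustrative assumptions, not real measurements:

```python
from collections import defaultdict

def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# response times (ms) of successful requests, grouped by path (assumed data)
latencies = defaultdict(list)
for path, ms in [("/items", 40), ("/items", 55), ("/items", 200),
                 ("/items", 48), ("/search", 120), ("/search", 310)]:
    latencies[path].append(ms)

for path, samples in sorted(latencies.items()):
    p50, p90, p95 = (percentile(samples, p) for p in (50, 90, 95))
    print(path, p50, p90, p95)
```

In practice a metrics library (e.g. a Prometheus client) would maintain these as histograms rather than keeping raw samples.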
Traffic
How much demand is the service receiving?
Metrics
- number of requests per path (or by nature of request, e.g. simple database reads vs. database writes)
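A sketch of counting traffic per path and splitting it by the nature of the request; the method/path pairs are assumed example data, and treating GET as a read is a simplification:

```python
from collections import Counter

# requests observed as (method, path) pairs (assumed data)
requests = [("GET", "/items"), ("GET", "/items"),
            ("POST", "/items"), ("GET", "/search")]
traffic = Counter(requests)

# split by nature: reads (GET) vs. writes (everything else)
reads = sum(n for (method, _), n in traffic.items() if method == "GET")
writes = sum(n for (method, _), n in traffic.items() if method != "GET")
print(traffic[("GET", "/items")], reads, writes)
```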
Errors
Which and how many errors does a service emit? This can help in tracking down internal problems of the service, misconfigured clients, or responses that are too slow.
Metrics
- HTTP status 5xx (internal errors)
- HTTP status 4xx (client errors)
- HTTP status 2xx violating a service level agreement (e.g. answered too slowly)
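The three error buckets above can be sketched as a small classifier; the 500 ms SLA limit is an assumed value for illustration:

```python
SLA_MS = 500  # assumed response-time limit from the SLA

def classify(status, duration_ms):
    """Map a response to one of the error buckets, or 'ok'."""
    if 500 <= status < 600:
        return "5xx"            # internal errors
    if 400 <= status < 500:
        return "4xx"            # client errors
    if 200 <= status < 300 and duration_ms > SLA_MS:
        return "sla_violation"  # succeeded, but too slow
    return "ok"

print(classify(503, 20), classify(404, 20),
      classify(200, 900), classify(200, 120))
```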
Saturation
How occupied is your service? Depending on the nature of the service, this can be one or more of the following, derived by tracking the most demanded resources.
Metrics
- used memory vs. available memory
- free disk space
- free database connections when a pool is used
- free threads when serving requests in a threaded system
In complex systems, the 99th-percentile request time can be a good indicator of how saturated the system currently is.
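Each of the saturation metrics above is a used-vs-capacity ratio, and the most saturated resource is often the one worth alerting on. A sketch with assumed pool sizes and usage numbers:

```python
def saturation(used, capacity):
    """Fraction of a resource currently in use (0.0 .. 1.0)."""
    return used / capacity

# assumed example values for the resources listed above
mem = saturation(used=3.2, capacity=4.0)    # memory, GiB
db = saturation(used=18, capacity=20)       # database connections
threads = saturation(used=45, capacity=50)  # worker threads

# the bottleneck is the most saturated resource
print(max(mem, db, threads))
```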
Service metrics
Traffic
Metrics
- Network Traffic inbound and outbound
Saturation
Metrics
- number of pods, so that scaling by the cloud environment becomes visible
Errors
Metrics
- service availability
Infrastructure
Metrics
- deployment in progress
- configuration change applied
Business metrics
Depending on the service, it might be worthwhile to monitor business goals.
- KPIs, e.g. number of crashes reported
- images uploaded
Alerting
Alerting can be done via ticket systems, team chat channels, SMS or systems like Pager Duty.
It's important to find the right balance between too many and too few alerts. This balance needs to be refined constantly based on changes to and demand on the service.
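One common way to reduce noisy alerts is to fire only when a signal stays above its threshold for several consecutive checks, instead of on every spike. A sketch of that idea; the threshold, streak length, and sample values are all assumptions:

```python
def should_alert(samples, threshold, consecutive=3):
    """Alert only if `samples` exceeds `threshold` for
    `consecutive` checks in a row (suppresses short spikes)."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= consecutive:
            return True
    return False

# a brief spike does not alert; a sustained breach does
print(should_alert([0.7, 0.95, 0.6, 0.97], threshold=0.9))
print(should_alert([0.95, 0.97, 0.99, 0.6], threshold=0.9))
```

Alerting systems such as Prometheus Alertmanager express the same idea declaratively (e.g. a `for:` duration on an alert rule).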
Who is watching the watchers?
If the monitoring or alerting infrastructure itself has an issue, a failover system needs to detect this and raise an alert about the monitoring/alerting being down (so-called meta-monitoring).