We monitor to produce:
- Alerts: A human needs to be paged to take action now
- Tickets: A human needs to take action eventually, but not immediately
- Logging: No need to look at this except for diagnostic purposes
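The three outputs above can be read as a small routing decision. The following is a minimal sketch, not a prescribed design: the names `Output` and `route`, and the two boolean inputs, are assumptions introduced here for illustration.

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # page a human to act now
    TICKET = "ticket"  # a human must act, but not immediately
    LOG = "log"        # kept for diagnostic purposes only


def route(needs_human: bool, urgent: bool) -> Output:
    """Pick the monitoring output for an event (hypothetical helper)."""
    if needs_human and urgent:
        return Output.ALERT
    if needs_human:
        return Output.TICKET
    # No human action needed: record it and move on.
    return Output.LOG
```

For example, `route(needs_human=True, urgent=False)` yields `Output.TICKET`.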
For Performance
Includes:
Challenges
- Observing a system perturbs it: monitoring itself incurs performance overhead
- The more you monitor, the greater the volume and complexity of the monitoring data you have to handle
For DevOps
- Health checks
- Can our services do useful work?
- Ideally, run the check in a way that also reveals performance problems (for example, by timing it)
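One way to sketch a health check that both verifies useful work and surfaces performance problems is to time a probe and report a "degraded" state when it succeeds slowly. The function name `health_check`, the `probe` callable, and the 0.5 second default are assumptions for illustration, not part of any particular framework.

```python
import time
from typing import Callable


def health_check(probe: Callable[[], bool], slow_after_s: float = 0.5) -> dict:
    """Run a probe that does real work and report status plus latency."""
    start = time.perf_counter()
    try:
        ok = bool(probe())
    except Exception:
        # A probe that raises counts as a failed health check.
        ok = False
    elapsed = time.perf_counter() - start

    if not ok:
        status = "unhealthy"
    elif elapsed > slow_after_s:
        # The service works, but slowly: a performance problem.
        status = "degraded"
    else:
        status = "healthy"
    return {"status": status, "elapsed_s": elapsed}
```

In practice the probe would exercise a real code path, such as a trivial database query, rather than only confirming that the process is running.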
Things to Monitor
- CPU Load
- Memory Utilization
- Disk Space
- Disk I/O
- Network Traffic
- Clock Skew
- Queue lengths
- Application Response Times
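A few of these metrics can be sampled with the Python standard library alone. The sketch below covers only CPU load and disk space. The function name `sample_host_metrics` and the dictionary keys are assumptions, and `os.getloadavg` is only available on Unix-like systems, so the code treats it as optional.

```python
import os
import shutil
import time


def sample_host_metrics(path: str = "/") -> dict:
    """Take one sample of disk usage and, where supported, CPU load."""
    usage = shutil.disk_usage(path)
    metrics = {
        "timestamp": time.time(),
        # Fraction of the filesystem holding `path` that is in use.
        "disk_used_fraction": usage.used / usage.total if usage.total else 0.0,
    }
    if hasattr(os, "getloadavg"):
        try:
            # 1-minute load average (Unix only).
            metrics["load_1m"] = os.getloadavg()[0]
        except OSError:
            pass  # load average unobtainable on this host
    return metrics
```

Memory, disk I/O, network traffic, and the remaining items generally need platform-specific sources or a third-party library.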
Alert Conditions
- CPU usage exceeding threshold for a certain period of time
- Increased rate of error logs over a period of time
- A service has restarted many times recently
- Queue length very long
- Taking too long to complete a workflow
- Setting minimum thresholds might also help identify errors (things are too quiet…)
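The "exceeding a threshold for a certain period of time" and "too quiet" conditions can share one check. This is a minimal sketch under stated assumptions: samples are `(timestamp_seconds, value)` pairs sorted by time, and the name `sustained_breach` and its parameters are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


def sustained_breach(
    samples: Sequence[Sample],
    window_s: float,
    upper: Optional[float] = None,
    lower: Optional[float] = None,
) -> Optional[str]:
    """Return "high" or "low" if every sample in the last window_s
    seconds is beyond the matching threshold, otherwise None."""
    if not samples:
        return None
    latest = samples[-1][0]
    # Require history covering the whole window, so that a single
    # spike right after startup cannot fire the alert.
    if latest - samples[0][0] < window_s:
        return None
    recent = [value for ts, value in samples if ts >= latest - window_s]
    if upper is not None and all(v > upper for v in recent):
        return "high"
    if lower is not None and all(v < lower for v in recent):
        return "low"  # suspiciously quiet
    return None
```

Requiring every sample in the window to breach the threshold keeps short spikes from paging anyone. A gentler variant could require only a given fraction of samples to breach.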
Note
Monitoring is especially important if you offer a free service, since you want to make sure that the service isn’t being taken advantage of.