We monitor to produce:

  • Alerts: A human needs to be paged to take action now
  • Tickets: A human needs to take action eventually, but not immediately
  • Logging: No need to look at this except for diagnostic purposes
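
As a minimal sketch of the three outputs above, a monitoring event could be routed by asking two questions: does a human need to act, and does it have to happen now? The names `Output` and `route` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # page a human to act now
    TICKET = "ticket"  # a human must act, but not immediately
    LOG = "log"        # kept for diagnostic purposes only


def route(needs_human: bool, urgent: bool) -> Output:
    """Pick the monitoring output for an event (illustrative rule only)."""
    if needs_human and urgent:
        return Output.ALERT
    if needs_human:
        return Output.TICKET
    return Output.LOG
```

For example, `route(needs_human=True, urgent=False)` yields `Output.TICKET`.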

For Performance

Includes:

Challenges

  • When you observe a system, you perturb it: monitoring itself incurs performance overhead
  • The more you monitor, the more the volume and complexity of monitoring data grow

For DevOps

  • Health checks
  • Can our services do useful work?
  • Ideally, design the check so that it also reveals performance problems
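
One way to read the points above is a health check that exercises a real piece of work and times it, so a slow but functioning service is also visible. The sketch below is an assumption-laden illustration: `check_health`, `HealthResult`, the `probe` callable, and the 0.5 second default are all hypothetical.

```python
import time
from typing import Callable, NamedTuple


class HealthResult(NamedTuple):
    healthy: bool     # the probe completed and reported success
    slow: bool        # the probe succeeded but took longer than allowed
    elapsed_s: float  # how long the probe took, in seconds


def check_health(
    probe: Callable[[], bool],
    slow_after_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
) -> HealthResult:
    """Run a probe that performs useful work and record how long it took."""
    start = clock()
    try:
        ok = bool(probe())
    except Exception:
        # Any failure inside the probe counts as unhealthy.
        ok = False
    elapsed = clock() - start
    return HealthResult(ok, ok and elapsed > slow_after_s, elapsed)
```

Here `probe` might, for instance, write and read back a small record. The `clock` parameter exists only so the timing can be replaced in tests.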

Things to Monitor

  • CPU Load
  • Memory Utilization
  • Disk Space
  • Disk I/O
  • Network Traffic
  • Clock Skew
  • Queue lengths
  • Application Response Times
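
A few of these metrics can be sampled with the Python standard library alone. The following is a rough sketch, not a full collector: `sample_host_metrics` and its dictionary keys are hypothetical names, and load average is only reported on platforms where `os.getloadavg` is available.

```python
import os
import shutil
import time


def sample_host_metrics(path: str = ".") -> dict:
    """Take one sample of disk usage and, where supported, CPU load."""
    usage = shutil.disk_usage(path)
    metrics = {
        "timestamp": time.time(),
        # Fraction of the filesystem holding `path` that is in use.
        "disk_used_fraction": usage.used / usage.total if usage.total else 0.0,
    }
    if hasattr(os, "getloadavg"):  # not available on every platform
        try:
            metrics["load_1m"] = os.getloadavg()[0]
        except OSError:
            pass  # load average could not be obtained
    return metrics
```

A real collector would sample periodically and ship the results elsewhere, and each sample adds some of the overhead noted under Challenges.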

Alert Conditions

  • CPU usage exceeding threshold for a certain period of time
  • Increased rate of error logs over a period of time
  • A service has restarted many times recently
  • Queue length very long
  • Taking too long to complete a workflow
  • Setting minimum thresholds might also help identify errors (things are too quiet…)
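
Several of the conditions above share one shape: a value stays beyond a limit for a period of time. A minimal sketch of that rule, assuming evenly spaced samples so that "a period of time" becomes a count of consecutive readings, might look like this. `SustainedThreshold` and the labels `"too_high"` and `"too_quiet"` are hypothetical.

```python
from collections import deque
from typing import Optional


class SustainedThreshold:
    """Fire only when every one of the last `window` samples breaches a limit."""

    def __init__(self, window: int,
                 upper: Optional[float] = None,
                 lower: Optional[float] = None) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self.window = window
        self.upper = upper  # maximum threshold, e.g. CPU usage
        self.lower = lower  # minimum threshold, e.g. requests per minute
        self.samples = deque(maxlen=window)

    def observe(self, value: float) -> Optional[str]:
        """Record a sample; return a condition name, or None if all is well."""
        self.samples.append(value)
        if len(self.samples) < self.window:
            return None  # not enough history yet
        if self.upper is not None and all(v > self.upper for v in self.samples):
            return "too_high"
        if self.lower is not None and all(v < self.lower for v in self.samples):
            return "too_quiet"
        return None
```

With `SustainedThreshold(window=3, upper=0.9)`, a single spike above 0.9 does not fire; three consecutive readings above 0.9 do. Requiring a sustained breach is one way to avoid paging a human for short-lived noise.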

Note

Monitoring is especially important if you offer a free service, since you want to make sure that the service isn’t being taken advantage of.