Cronic Disease

There’s no denying cron is one of the most useful tools in any *nix OS, but every so often a cron job goes haywire. Jobs fail to run and nobody notices, the sterile environment (no sane default PATH) trips users up, and from time to time there is a really bizarre critical issue - like cron filling up the disk with its mail output because a job is looping and spewing an endless stream of error messages.

Cron’s common usage is an omnipresent pain point in my production systems. In particular, its major shortcomings are:

  • No job history. When did a job run? Did it succeed? Where is the output? (And no, mail or the local syslog are not acceptable.)
  • Output goes to a spool directory and will fill up your disk space if the job is misbehaving
  • Job run environment is very different from the user’s and not trivial to recreate
  • No triggers other than schedule. I see many jobs that wake up every minute just to check a trigger/condition.
  • No support for “test run” or manual trigger
  • Hard to monitor
  • No built-in locks
  • No support for distributed operations; many cron jobs today need to run periodically on exactly one node in a cluster.
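
To make the environment point concrete, here is a minimal sketch for test-running a command line under something close to cron's sterile environment. The variable values are an assumption based on the defaults documented for Vixie-style cron (SHELL=/bin/sh, PATH=/usr/bin:/bin); your cron may differ, and `cron_like_env` and `run_like_cron` are hypothetical helper names, not part of any tool.

```python
import os
import subprocess


def cron_like_env(home=None, user=None):
    """Build a minimal environment resembling what cron hands to a job.

    Assumption: values mirror the defaults documented for Vixie-style cron.
    """
    return {
        "SHELL": "/bin/sh",
        "PATH": "/usr/bin:/bin",
        "HOME": home or os.path.expanduser("~"),
        "LOGNAME": user or os.environ.get("LOGNAME", "nobody"),
    }


def run_like_cron(command_line):
    """Run a crontab-style command line through /bin/sh with the minimal env."""
    return subprocess.run(
        ["/bin/sh", "-c", command_line],
        env=cron_like_env(),  # replaces the caller's environment entirely
        capture_output=True,
        text=True,
    )
```

Running `run_like_cron('echo "$PATH"')` from an interactive shell shows how little of your own PATH the job actually gets. Note that this sketch does not change the working directory or detach from the terminal, so it only approximates a real cron run.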

The main reason for these issues is that, like so many other *nix tools, cron was designed for a single instance, both from the technical and usage perspectives. It makes sense to have job results sent by mail when you have 3 servers, but if you get emails from 100 servers... good luck with that. The same goes for logging the output; it’s sane (although inconvenient) to look up a job’s history in a log on one server, but not on a cluster.

The cron use case demonstrates an often forgotten truth about scaling: scaling includes the human component. Does cron scale computation-wise? Absolutely. Is it still maintainable? Not so much.

I would like a tool that runs maintenance jobs on a cluster and its nodes and tends to itself as much as possible while still being pluggable. It’s rare (and wrong) for a tool to be able to do everything, so pluggability and a good API are essential.

What can be done

Use a build server

The easiest resolution is not to use cron for anything other than simple internal node maintenance tasks. As a replacement, use a build server (like Jenkins) - build servers support distributed jobs that run on a schedule or on a trigger, maintain a job’s history and output, and have many plugins and triggers.

Submit passive checks to Nagios

Every cron job you care about should do this. By using Nagios you get instant limited job history, failure monitoring and the chance to respond automatically if a job failed to complete.
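
As a rough sketch of what submitting a passive check involves: Nagios accepts passive service results as a PROCESS_SERVICE_CHECK_RESULT line written to its external command file. The function names, and the host and service names in the usage note, are hypothetical; the command file path varies by installation, so it is passed in rather than assumed.

```python
import time


def format_passive_check(host, service, return_code, output, now=None):
    """Format one Nagios external command reporting a passive service result.

    return_code follows the plugin convention:
    0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    """
    timestamp = int(time.time() if now is None else now)
    # External commands are single-line, so collapse all whitespace runs.
    summary = " ".join(output.split()) or "no output"
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{summary}\n"
    )


def submit_passive_check(command_file, host, service, return_code, output):
    """Write the result to the Nagios command file (normally a named pipe)."""
    line = format_passive_check(host, service, return_code, output)
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

A job would call something like `submit_passive_check(cmd_file, "web01", "nightly-backup", exit_code, last_output)` as its final step, where `cmd_file` is wherever your Nagios keeps its command pipe. The service must already be defined in Nagios with passive checks enabled, and this only works on the Nagios host itself; remote nodes would need a relay such as NSCA, which is outside this sketch.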

Use a generic wrapper

All but one of cron’s major issues can be easily fixed with a single generic wrapper. The code for sending passive checks to Nagios, handling output and saving it somewhere for later, writing a history entry, checking for lock files, etc. - it’s all very generic. Write a decent wrapper now and thank yourself later.
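
A minimal sketch of such a wrapper, to show how little code the generic parts need. It assumes a POSIX system (it uses `fcntl.flock` for the lock) and a writable state directory of your choosing; `run_job` and the file naming scheme are hypothetical, not an existing tool.

```python
import fcntl
import json
import os
import subprocess
import time


def run_job(name, command, state_dir):
    """Run `command` under a lock, save its output, append a history record.

    Returns the history record, or None if another run still holds the lock.
    """
    os.makedirs(state_dir, exist_ok=True)
    lock_path = os.path.join(state_dir, f"{name}.lock")
    history_path = os.path.join(state_dir, f"{name}.history.jsonl")
    started = time.time()
    output_path = os.path.join(state_dir, f"{name}.{time.time_ns()}.log")

    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fail fast instead of piling up runs.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        # Output goes to a per-run file instead of cron's mail spool.
        with open(output_path, "w") as out:
            result = subprocess.run(
                command, stdout=out, stderr=subprocess.STDOUT
            )
        # The lock is released when lock_file is closed at the end of this block.

    record = {
        "job": name,
        "started": started,
        "duration": time.time() - started,
        "exit_code": result.returncode,
        "output": output_path,
    }
    # One JSON object per line keeps the history easy to append to and grep.
    with open(history_path, "a") as history:
        history.write(json.dumps(record) + "\n")
    return record
```

The crontab entry would then call a small script that invokes `run_job` with the job's name and argument list, and the `exit_code` in the returned record is what you would report to Nagios. A real wrapper would also want log rotation and a timeout, both omitted here for brevity.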