The Problem With Configurations

Originally published in Netocratic

Configuration is one of those things we usually take for granted. Yet as more and more configuration parameters are added, working with them becomes hard and complicated. With enough parameters, keeping track of what the current configuration “is” can be quite problematic. People often misconfigure systems for reasons such as poor documentation or unexpected interactions between parameters. Sometimes the configuration itself is complex enough to require a language to express it, raising the bar of knowledge required to define it. Despite their “boring” nature, configurations can and do obliterate entire companies in a matter of hours.

Yet another config format

If you take a quick look in your /etc/ directory you will find a pile of configuration files in a large variety of formats: key/value shell include files, ini, toml, json, yaml, xml. And those are the relatively standard ones; if you care to look at sudoers, apt.conf, apache or postfix, for example, you will find even stranger-looking formats.

Unfortunately, we humans need to remember the syntax of each format, not to mention the other peculiarities of each program’s configuration. For example, there is no standard for config file validation, and some programs have no validation at all!

The vast variety of config formats makes system administration harder. It is hard to keep track of them all and stay proficient in each; some formats are very hard to generate from configuration management (CM) tools and scripts, so we must resort to error-prone text templates.

Configuration drop bomb

The conf.d pattern emerged many years ago as a way to allow different modules and packages to inject configuration snippets into other programs. In the absence of CM tools, it also helped humans split large configurations to make management easier.

This pattern is generally considered good practice, yet it causes a surprising amount of difficulty in the maintenance and automation of systems. Since configurations are merged, values can be overridden by other files and are sometimes merged in unexpected ways. You change a configuration parameter, yet the program behaves as if you changed nothing; you grep through the directory and find the value has been set in a different file, one which you do not control. You must reason about the merge order and move your file to a higher priority, possibly overriding other parameters… and now you need to debug again.
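To make the debugging problem concrete, here is a minimal sketch of a merge that records which files set each key. It assumes simple key = value files with a .conf suffix, merged in lexical filename order with later files winning; real programs differ in both format and merge rules, and the function name merge_conf_d is made up for this illustration.

```python
from pathlib import Path


def merge_conf_d(directory):
    """Merge key=value files in lexical order; later files win.

    Returns {key: (effective_value, [names of files that set it, in order])}.
    """
    merged = {}
    for path in sorted(Path(directory).glob("*.conf")):
        for raw in path.read_text().splitlines():
            line = raw.strip()
            # Skip blanks, comments and anything that is not key=value.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = (part.strip() for part in line.split("=", 1))
            _, history = merged.get(key, (None, []))
            merged[key] = (value, history + [path.name])
    return merged
```

Any key whose history lists more than one file is a candidate for the "I changed it and nothing happened" surprise described above.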

When using CM tools this pattern is particularly annoying: if your CM tool manages files in the conf.d directory and you later remove these file resources from its definitions, the configuration files are not removed from disk; they are simply no longer managed. Moreover, the pattern allows anyone to override the automatic configuration by dropping files into conf.d manually, circumventing our effort to standardise configuration across servers.

Dude, did you restart just now?

Traditionally, config reload is done by sending a POSIX signal to the process, usually USR1 or HUP. The problem is that signals have no output, so there is no obvious feedback telling us whether our reload succeeded. Perhaps the config file is malformed or we have set some parameter to an invalid value; perhaps the signal was blocked or the program couldn’t handle it. We simply don’t know until we start digging through log files, and even then we can’t be sure, since the absence of a “reload” message in the log file doesn’t actually mean the reload didn’t happen. On top of that, it’s extremely difficult to automate such checks, so in most cases we simply give up and assume the configuration made it to our program, or take the brutal approach and restart our process needlessly.

When in doubt, nuke from orbit

Some parameters require a restart to change and some require only a reload. CM tools have no way to identify which parameters changed in your configuration file and whether a reload is sufficient to activate the changes. As a result, we are forced to always use the nuclear option: restart. Unfortunately, sometimes even the nukes are not enough, such as when changing the MySQL InnoDB ibdata file size, which requires a stop, maintenance, start cycle. This forces us to resort to “nuke from orbit” methods of tearing down compute instances to support a configuration change.
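If the knowledge of which parameters need what is supplied by hand, the decision itself is simple. The sketch below assumes the configuration is already parsed into dictionaries; the parameter names and the two sets are hypothetical and would have to come from the program’s own documentation.

```python
# Hypothetical classification; a real one must come from the program's docs.
RESTART_REQUIRED = {"listen_port", "data_dir"}
RELOAD_SUFFICIENT = {"log_level", "max_clients"}


def required_action(old, new):
    """Return "none", "reload" or "restart" for a config change."""
    changed = {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}
    if not changed:
        return "none"
    if changed <= RELOAD_SUFFICIENT:
        return "reload"
    # Restart-only or unclassified parameter changed: be conservative.
    return "restart"
```

Note that any parameter missing from both sets falls through to a restart, which is the same conservative default CM tools are forced into today.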

I have no idea what’s going on

Just because the config file contains value X doesn’t actually mean that’s what is loaded in the process. Perhaps the file was changed without reloading the configuration, or the configuration path is wrong; this is actually very common. So how do you know what configuration is loaded in your server? How do you validate that all servers are configured properly? Most programs do not provide a good mechanism for this.

A better way

Like most operational aspects of programs, configuration issues can and should be resolved by grassroots engineering work rather than after-the-fact makeshift solutions. A good example of an attempt to tackle this at the core is Netflix’s Archaius project, and many others have followed suit.

There are several simple design principles that can help make the configuration of your program much easier to work with. To some degree, you can even apply these principles to 3rd party programs using CM tools:

  • Separate configuration into 2-3 files based on the impact of changing a parameter: the 1st file contains parameters that require a restart to change, the 2nd contains parameters that require a reload, and so on.
  • Avoid using the conf.d pattern. Instead, have your CM tools merge values and create a small number of config files, making debugging and validation easier.
  • Create a configuration API. If using REST, the GET method should return a complete dump of configuration parameters with an ETag header, and HEAD should return the ETag header without a body. The ETag should be a checksum of the configuration in canonical order, allowing easy comparison between the in-RAM configuration of all nodes and the reference version in CM.
  • If possible, use the REST configuration API to reload configurations using POST or PUT requests. This allows your CM tools to easily validate that the configuration was successfully loaded and whether any values were updated (200/201/202 response). If not possible, write a small reload wrapper that verifies the configuration was reloaded using whatever feedback the program provides.
  • When writing new programs, choose a serialized format for configuration, like json, yaml, edn, etc. Although these are not particularly comfortable for users to work with directly, remember that with CM tools and simple utilities people can work in whatever format they feel comfortable with, as long as a conversion utility exists.
  • Some programs require advanced configuration employing logic (e.g. logstash) which doesn’t easily map to serialized formats; for these, treat the configuration as a plugin and extract variables to an external configuration file.
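The checksum idea from the configuration API point can be sketched as follows. This is only an illustration under stated assumptions: the configuration is JSON-serializable, “canonical order” is taken to mean JSON with sorted keys, and SHA-256 is the checksum; the function names config_etag and find_drift are made up for this example.

```python
import hashlib
import json


def config_etag(config):
    """Checksum of a config dict that does not depend on key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(reference, etags_by_node):
    """Return the nodes whose reported checksum differs from the reference."""
    expected = config_etag(reference)
    return sorted(node for node, etag in etags_by_node.items() if etag != expected)
```

A CM tool could compute config_etag over its reference version, collect the value each node reports in response to a HEAD request, and pass both to find_drift to list the nodes whose in-RAM configuration is out of date. In an actual HTTP response the ETag value is sent as a quoted string.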