Better logging for helping to resolve production issues


#1

Over the past few days the OFN UK site has had a critical issue that @lin_d_hop has been doing an awesome job of investigating. However, to dig into this kind of thing requires accessing the server directly and parsing various logs to deduce the cause. @lin_d_hop and myself would like to see if we can improve this.

Here’s some ideas we could look at:

Improve logging flexibility

We could:

  • Make the log level configurable via an environment variable, with logical defaults, so the log level can be increased for a time easily on a server to get more information.
  • Add rails_semantic_logger gem that has a number of additional benefits over Rails standard logging, including structured logging - see http://rocketjob.github.io/semantic_logger/index.html for more info. (there are some other gem candidates that do similar things, but this one is particularly well documented)
  • Consider if existing custom log messages could have more structured attributes added.

Cleanup the log output where possible

One small example is to prevent ActionController::RoutingErrors - I’ve raised #1362 to capture it.

A review of other common log messages that dont add much value to see if they could be reduced/removed might be useful.

Support aggregating logs externally

A flexible configuration for rails_semantic_logger that allows different log appenders could be built, including ways to enable centralized logging to another system. Ideally this would be done in a way that either a hosted log aggregator, such as Loggly or Logentries could be used or we might look at building an install of the Elastic (ELK) stack.

Beats could be added to servers via ofn-install and configured to also send logs for nginx, pg and other systems to the same aggregator.

The log scanning could be looked at in a couple of ways:

  • Direct all logs to STDOUT and capture that with Beats. This would fit with 12-factor logging
  • Review file logs and setup proper logrotate config to manage them better. Ensure all tools have a logging and capturing strategy.

Bugsnag

@maikal has a couple of suggestions for improvement in this issue about Bugsnag “creeping into the app”.

Are there exceptions you can throw that are not causing a server for the user, but are reported by error tracking services?

Could this be handled by configuring the Bugsnag appender in semantic_logger to report error log statements to Bugsnag? Then the RuntimeErrors wrapped in a call to Bugsnag.notify were converted to error log statements.

Alternatively, we could introduce another layer

This is also covered quite well by using semantic_logger. An alternative could be to add the errbase gem to abstract the reporting tool.


Replace error tracking with a way to monitor cache misses
#2

Ping @maikel @RohanM, any thoughts?


#3

Nice research. It would be great to cleanup the logs so that we get only important information. Another example are the cache misses that are coming up all the time.


#4

@maikel any thoughts on how I might break this down into a set of tasks? Is it worth creating issues for specific items for further discussion? Or just create PRs for some bits?

As the server log aggregation stuff is a bigger change to server config probably needing changes to ofn-install, are you in favour of this?


#5

any thoughts on how I might break this down into a set of tasks?

You structured it quite well already.

Log level change

Reading the log-level from the environment needs restarting the app. Is it really much more convenient to hook into starting the app with a different environment? Currently you change one line in application.yml and call service unicorn_ofn restart, right?

Add rails_semantic_logger

That could be a pull request. The desired format (old, structured, …) can be specified in application.yml.

Support aggregating logs externally

This can be another pull request. It should keep it optional if centralised logging is used or not. Ideally we could even choose which service to use.

Direct all logs to STDOUT

This needs to be optional, because old instances would need to change the startup scripts to capture that output and write it into a log file if desired. But that could be another pull request.

Logrotate

This would be a pull request for ofn-install. And there are two approaches.

  • Trigger logrotate on deploy. This is good for staging servers. It can be used for production as well (forced rotate) to separate logs of different code versions.
  • Trigger logrotate with cron. This should definitely be done on production servers. The Ansible vars can have a little section about logging.

Bugsnag

A pull request to replace the direct calls to the Bugsnag API would be good.

Is it worth creating issues for specific items for further discussion?

Depends on the person doing the tasks. If you are doing it, you can probably work of this list. If you would like to delegate, issues could be nice to organise that. But I would just care about the pull request as long as they reference to the section that describes the purpose.

As the server log aggregation stuff is a bigger change to server config probably needing changes to ofn-install, are you in favour of this?

All changes that need ofn-install changes, need to be backwards compatible. So they need to be optional. And we can’t assume that people want to use log aggregation services. If a change has one pull request for ofn and one for ofn-install, it’s probably good to get them ready at the same time, but review the ofn pr first.