Over the past few days the OFN UK site has had a critical issue that @lin_d_hop has been doing an awesome job of investigating. However, digging into this kind of thing requires accessing the server directly and parsing various logs to deduce the cause. @lin_d_hop and I would like to see if we can improve this.
Here are some ideas we could look at:
Improve logging flexibility
We could:
- Make the log level configurable via an environment variable, with logical defaults, so the log level can be increased for a time easily on a server to get more information.
- Add the rails_semantic_logger gem, which has a number of additional benefits over standard Rails logging, including structured logging. See http://rocketjob.github.io/semantic_logger/index.html for more info. (There are some other gem candidates that do similar things, but this one is particularly well documented.)
- Consider whether existing custom log messages could have more structured attributes added.
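To make the first point a bit more concrete, here is a rough sketch of reading the log level from the environment with a safe default. It uses only Ruby's standard Logger so it runs on its own; the LOG_LEVEL variable name, the "info" default and the log_level_from_env helper are all assumptions for illustration, not existing OFN settings.

```ruby
require "logger"

# Hypothetical helper: LOG_LEVEL and the "info" default are assumptions
# made for this sketch, not existing OFN configuration.
VALID_LOG_LEVELS = %w[debug info warn error fatal].freeze

def log_level_from_env(env = ENV, default = "info")
  level = env.fetch("LOG_LEVEL", default).to_s.downcase
  # Fall back to the default when the variable holds an unknown value.
  VALID_LOG_LEVELS.include?(level) ? level.to_sym : default.to_sym
end

# A plain stdlib logger standing in for the Rails logger.
logger = Logger.new($stdout)
logger.level = Logger.const_get(log_level_from_env.to_s.upcase)

# In a Rails environment file the same helper could feed the standard setting:
#   config.log_level = log_level_from_env
```

With something like this in place, raising the level on a server for a while would only need the variable set (for example to debug) and the app restarted, rather than a code change.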
Clean up the log output where possible
One small example is to prevent ActionController::RoutingErrors. I’ve raised #1362 to capture it.
A review of other common log messages that don’t add much value, to see if they could be reduced or removed, might be useful.
Support aggregating logs externally
A flexible configuration for rails_semantic_logger that allows different log appenders could be built, including ways to enable centralized logging to another system. Ideally this would be done in such a way that either a hosted log aggregator, such as Loggly or Logentries, could be used, or we might look at building an install of the Elastic (ELK) stack.
Beats could be added to servers via ofn-install and configured to also send logs for nginx, pg and other systems to the same aggregator.
The log scanning could be looked at in a couple of ways:
- Direct all logs to STDOUT and capture that with Beats. This would fit with 12-factor logging.
- Review file logs and set up a proper logrotate config to manage them better. Ensure all tools have a logging and capturing strategy.
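As an illustration of the STDOUT option, here is a minimal sketch of a logger that writes one JSON object per line, which is the kind of output a shipper such as Beats can pick up and forward. It deliberately uses a stand-in formatter built on Ruby's standard library rather than rails_semantic_logger's own JSON output, and the build_stdout_logger name, the "ofn" program name and the example attributes are all made up for the sketch.

```ruby
require "json"
require "logger"
require "time"

# Stand-in formatter using only the standard library; this is NOT the
# rails_semantic_logger JSON formatter, just an illustration of the idea.
JSON_LINE_FORMATTER = proc do |severity, time, progname, msg|
  entry = { level: severity, time: time.getutc.iso8601(3), app: progname }
  # Accept either a plain string or a hash of structured attributes.
  entry.merge!(msg.is_a?(Hash) ? msg : { message: msg.to_s })
  "#{JSON.generate(entry)}\n"
end

# Hypothetical helper: builds a logger that emits JSON lines to the given IO.
def build_stdout_logger(io = $stdout)
  logger = Logger.new(io, progname: "ofn")
  logger.formatter = JSON_LINE_FORMATTER
  logger
end

logger = build_stdout_logger
# Example attributes are invented for illustration.
logger.info({ message: "Order completed", order_id: 123 })
```

Because every line is self-describing JSON, the aggregator can index fields like order_id directly instead of relying on regex parsing of free-text messages.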
Bugsnag
@maikal has a couple of suggestions for improvement in this issue about Bugsnag “creeping into the app”.
Are there exceptions you can throw that are not causing a server error for the user, but are reported by error tracking services?
Could this be handled by configuring the Bugsnag appender in semantic_logger to report error log statements to Bugsnag? The RuntimeErrors wrapped in a call to Bugsnag.notify could then be converted to error log statements.
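To show the shape of that idea without depending on either gem, here is a small stand-in sketch: a wrapper that logs error statements as usual and also hands them to a notifier, without raising anything at the user. The ErrorReportingLogger class is hypothetical and only mimics what an error-tracking appender would do; in the real setup the notifier would be the Bugsnag appender rather than a lambda.

```ruby
require "logger"

# Hypothetical wrapper for illustration only; semantic_logger's Bugsnag
# appender would play this role in a real setup.
class ErrorReportingLogger
  def initialize(logger, notifier)
    @logger = logger
    @notifier = notifier # any callable, e.g. ->(e) { Bugsnag.notify(e) }
  end

  def info(message)
    @logger.info(message)
  end

  # Error statements are logged and also passed to the error tracker.
  # Nothing is raised, so the user's request carries on normally.
  def error(message_or_exception)
    @logger.error(message_or_exception)
    exception = if message_or_exception.is_a?(Exception)
                  message_or_exception
                else
                  RuntimeError.new(message_or_exception.to_s)
                end
    @notifier.call(exception)
  end
end

reported = []
logger = ErrorReportingLogger.new(Logger.new($stdout), ->(e) { reported << e })
# The message is invented for illustration.
logger.error("Shipping method missing for order")
```

Call sites would then only know about the logger, so the explicit Bugsnag.notify calls scattered through the app could become plain error log statements.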
Alternatively, we could introduce another layer
This is also covered quite well by using semantic_logger. An alternative could be to add the errbase gem to abstract the reporting tool.