We’re researching all kinds of tools as we set up our new cloud environment, so I figure I may as well share for the benefit of the public…
Most recently, we’re looking at log management. That is, a tool to aggregate and analyze your log files from across your infrastructure. We love Splunk and it’s been our tool of choice in the past, but it has two major drawbacks. One, it’s quite expensive. In our new environment where we’re using a lot of open source and other new-format vendors, Splunk is a comparatively big line item for a comparatively small part of an overall systems management portfolio.
Two, which is somewhat related, it’s licensed by the amount of log data it processes per day. Which is a problem, because when something goes wrong in our systems, it tends to cause logging levels to spike. In our old environment, we kept having to play this game where an app would get rolled to production with debug on (accidentally or deliberately), or just be logging too much, or be having a problem causing it to log too much, and then we’d have to blacklist it in Splunk so it didn’t run us over our license and cause the whole damn installation to shut off. It took an annoying amount of micromanagement for this reason.
Other than that, Splunk is the gold standard; it pulls anything in, graphs it, has Google-like search, dashboards, reports, alerts, and even crazier capabilities.
Now on the “low end” there are really simple log watchers like swatch or logwatch. But we’d really like something that will aggregate ALL our logs (not just syslog stuff via syslog-ng – app server logs, application logs, etc.), ideally from both UNIX and Windows systems, and make them usefully searchable. Trying to make everything and everyone log via syslog is an ever-receding goal. It’s a fool’s errand.
There are the big appliance vendors on the “high end” like LogLogic and LogRhythm, but we looked at them when we looked at Splunk, and they are not only expensive but also seem to be “write-only solutions” – they aggregate your logs to meet compliance requirements and do some limited pattern matching, but they don’t put your logs to work to help you in your actual work of application administration the dozen ways Splunk does. At best they are SIEMs – security information and event managers – that alert on naughty intruders. But with Splunk I can do everything from generating a report of 404s to send to our designers so they can fix their bad links/missing images, to graphing site traffic, to making dashboards for specific applications for their developers to review. Plus, as we’re doing this in the cloud, appliances need not apply. (Ooo, that’s a catchy phrase, I’ll have to use that for a separate post!)
I came across three other tools that seem promising:
- Logscape from Liquidlabs – does graphing and dashboards like Splunk does. And “live tail” – Splunk mysteriously took this out when they revved from version 3 to 4! Internet rumor is that it’s a lot cheaper. Seems like a smaller, less expensive Splunk, which is a nice thing to be, all things considered.
- Octopussy – open source and Perl based (might work on Windows but I wouldn’t put money on it). Does alerting and reporting. Much more basic, but you can’t beat the price. Don’t think it’ll meet our needs though.
- Xpolog – seems nice and kinda like Splunk. Most of the info I can find on it, though, is “What about xpolog, is good!” comments appended to every forum thread/blog post about Splunk I can find, which is usually a warning sign – that kind of guerrilla marketing gets old quick IMO. One article mentions looking into it and finding it more expensive, but with some nice features like autodiscovery, though not as open as Splunk.
Anyone have anything to add? Used any of these? We’ve gotten kind of addicted to having our logs be immediately accessible, converted into metrics, etc. I probably wouldn’t even begrudge Splunk the money if it weren’t for all the micromanagement you have to put into running it. It’s like telling the fire department “you’re licensed for a maximum of three fires at a time” – it verges on irresponsible.
21 responses to “Log Management Tools”
I found this same frustration with all of the tools that I found, so I wrote a Python-based open source tool called petit (http://crunchtools.com/petit). It just got released in Fedora 13 and I am hoping to eventually get it into EPEL and Ubuntu. It also runs in Cygwin under Windows.
I generally use it in conjunction with a small bash script to give me a report each morning, and I do a more in-depth analysis of our logs monthly. Also, petit is great for spot analysis: it does hashing (“artificial ignorance”), graphing, and word counts on the command line and is nice for scripting custom reports. It really integrates with the Unix power tools and doesn’t break the way a sysadmin thinks. Finally, it helps with word discovery. This lets you identify all of the negative words to alert on in swatch, such as can’t, don’t, won’t, fail, error, assert, etc. Since logs are composed of natural language, you must do some kind of word discovery to get real value out of alerting; petit helps with this.
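The “artificial ignorance” hashing mentioned above boils down to normalizing away the variable parts of each log line and counting the stable patterns that remain – the rare patterns are the ones worth a human’s attention. A rough Python sketch of the idea (this is not petit’s actual code; the regexes and sample lines are purely illustrative):

```python
import re
from collections import Counter

def fingerprint(line):
    # Replace hex values first, then any remaining runs of digits,
    # so lines differing only in PIDs/IPs/counters hash together.
    line = re.sub(r'\b0x[0-9a-fA-F]+\b', '#', line)
    line = re.sub(r'\d+', '#', line)
    return line

def hash_report(lines):
    # Count each normalized pattern; the low-count patterns are
    # usually the interesting ones -- the "artificial ignorance" idea.
    return Counter(fingerprint(line) for line in lines)

logs = [
    "sshd[1234]: Accepted password for bob from 10.0.0.5",
    "sshd[1301]: Accepted password for bob from 10.0.0.9",
    "kernel: Out of memory: kill process 4422",
]
report = hash_report(logs)
```

Here the two sshd lines collapse into one pattern with a count of 2, while the out-of-memory line stands alone – exactly the kind of outlier you’d want a morning report to surface.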
I am working on a blog post to show more advanced usage, but the download page has a brief video and wiki-style usage notes which, hopefully, are pretty easy to follow.
I quite like SEC, but it’s more on the swatch end of the spectrum (although far more featured, IMHO). I’ve taken a look at ArcSight (which is a bit more in the security space) and been quite impressed with what it can do, but again it tends to be on the pricier side. I’ll be interested to hear what you end up choosing.
We’ll post updates as we evaluate and choose! I was hoping we wouldn’t have to go to this much trouble, but Splunk is wanting 2x as much $/GB as we paid them last year; apparently they’re deeply in love with themselves. I can’t spend most of my overall systems management and tools budget just on log management; that’s a relatively small piece of the puzzle.
Out of curiosity, what is your daily log volume typically? You have to have 5 license violations in a month in order to have Splunk search shut off (but indexing keeps working). That doesn’t seem terribly unreasonable. I’m sure other licensing options (per user, per core, flat fee, etc) would be more beneficial to some customers, but licensing by volume is pretty fair to a majority of clients.
Also, if you have certain sources or sourcetypes that are common offenders, just set up an alert to warn you when the indexed volume of that source/sourcetype are over what you’d normally see.
Of the alternatives you mentioned above, it sounds like Logscape is the only one you’re even remotely considering based on your requirements. I’d definitely like to see a follow-up post if you find other alternatives or make a final selection one way or another.
On our new system, we don’t know yet. On our old system, I think it was around 25 GB/day. The problem is that some issues can’t be fixed in 5 days. Here’s a common scenario.
1. New software release
2. New version of the app has extreme log diarrhea
3. Can’t roll back app just because of logging; app dev didn’t bother to use log4j and proper log levels because he’s a twerp.
4. We have to arrange an emergency release and/or blacklist the app in Splunk within 5 days – or less, if someone else pulled the same shenanigans that month already
Or similar. “Bit9 decides their crawler should DoS our site.” “New data in the database causes existing app to start throwing 100 lines of errors every time it’s hit.” Et cetera. In a decent sized environment, fixing these things isn’t always a couple hour deal.
I’ll also note that Splunk doesn’t supply a lot of good tools to manage your log volumes and levels and license. The licensing dashboard they have for download now? One of my guys wrote that!
Some other tools mentioned by devops-toolchain folks (thanks all):
Clarity – http://github.com/tobi/clarity – Clarity is a Splunk-like web interface for server log files. It supports searching (using grep) as well as tailing log files in realtime (using tail). Doesn’t do the aggregation itself; you’d do that with something else.
Logstash – http://code.google.com/p/logstash/ – Supposed to do centralized log storage, indexing, and searching; in beta. Written in Ruby and needs a message queue.
petit – http://crunchtools.com/software/petit/, mentioned above – hey, it has a Web site that actually shows what it looks like, so I already like it better than the two above. Command line based log cruncher that kinda does lexical analysis and generates cute ASCII charts. Definitely an evolution towards splunkness above the usual swatchy kinds of “we grep! No really!” tools. Turning logs into metrics is the bomb.
http://www.loggly.com/ is nice, too, although a bit young. Hosted Splunk, essentially. On the one hand, “Hmmm, sending really sensitive data to a third party” – anyone who hacks Loggly owns your ass hard, since they get all your internal system details. On the other hand, why should you have to fiddle around with your own GB/TB of logs when you could “have people for that”?
More limited “key metrics” kinds of tools
If you are looking at a log management solution in the cloud, you should check out Loggly. From the cloud, for the cloud. It’s in private beta right now, but hopefully soon, it will be available to the broader public.
I would also suggest Motadata, which correlates log data with network flow data and metrics.
Splunk brought “live tail” back in 4.1. It’s now called Real Time Search and is available via the real-time windows in the time selector. Live Tail in 3.x was a hack; Real Time gives you almost the whole search language, and graphing dashboards as well. Much better than it used to be.
There has been an explosion in open source logging tools leveraging the various noSQL engines out there. Here’s a quick list I collected from a recent LinkedIn DevOps thread:
splunk (many use it for core samples only)
syslog-ng + petit
scribe (by facebook, https://github.com/facebook/scribe, uses thrift and optionally hadoop)
flume (cloudera, https://github.com/cloudera/flume)
chukwa (hadoop based, http://incubator.apache.org/chukwa/docs/r0.3.0/admin.html)
logstash (sponsored by loggly, http://code.google.com/p/logstash/)
newrelic (SaaS APM)
graylog2 (store in mongo, http://www.graylog2.org/)
bunyan (store in mongo, https://github.com/ajsharp/bunyan)
nagios and check_mk (http://mathias-kettner.de/check_mk.html)
Just to update: we went ahead with Splunk – it’s expensive, but has so much power out of the box. Our ops team is rolling it out to our Amazon and Azure systems right now and going through the process of refining the log interpretation. The *nix and Windows apps are nice; they give better information on where CPU/memory/etc. is going than most monitoring tools do (breakdown by process, etc.).
In a much later update: we’re migrating over to Sumo Logic – it has less functionality, but at our scale it’s much less expensive. Maybe later we’ll go with an open source solution, but for now it strikes a balance between expensive engineer time and beaucoup logs.
In an even later update: we went from Splunk to Sumo, but are starting to look at logstash+Kibana for higher-volume stuff. Also, at CloudAustin we had a logging roundup with plenty of demos; we recorded video of the meetup: http://cloudaust.in/2013/06/26/july-meetup-log-like-you-mean-it/
I still find this topic an issue. There are just too many options, and most of the time there is no clear vision defined for the various solutions – no rationale for why you need the various combinations of libraries.
I found this recommendation: http://12factor.net/logs
It says: just log to standard output from your apps and let the execution environment handle routing the messages.
This means the application doesn’t need to pull in any extra software dependency for a specific logging solution.
This indeed sounds very clean. Heroku does something like this, but there is no clear recipe for how to build such an infrastructure from open source tools.
Bunyan + (upstart + /usr/bin/logger + rsyslog + Redis) + Logstash + Elasticsearch + Kibana keeps coming up in the context of nodejs applications, but there are multitudes of problems with it:
– it’s brutally complicated
– it’s not clear how the Redis config deals with outages
– logstash is extremely heavy (especially compared to our little nodejs services)
– the syslog structure is very limited
– its log levels are in the opposite order from most loggers’ levels
– if the log is written to standard output, then the message should encode the log level somehow, which is not obvious to massage back into the syslog message structure
– using syslog to buffer log messages locally in case of a collector connectivity outage should work, but:
1. it’s just complicated and not very well documented
2. there are 2 config file formats for rsyslog, and tons of examples use the old format; very confusing…
3. I had to upgrade “manually” to the latest stable rsyslog, because even the latest Ubuntu release ships with a very old version
4. I hit a few-hours-old issue while setting a cluster up for the 1st time
5. the rsyslog site is down as we speak (at least from Hong Kong it’s not visible)
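For what it’s worth, the disk-assisted local buffering that points 1 and 2 complain about looks roughly like this in the newer RainerScript config format (a sketch only – the collector hostname is made up, and the queue parameters should be double-checked against your rsyslog version’s docs):

```
# forward everything to a central collector over TCP,
# spilling the queue to disk while the collector is unreachable
action(type="omfwd"
       target="logs.example.com" port="514" protocol="tcp"
       queue.type="LinkedList"         # in-memory queue by default...
       queue.filename="fwdq"           # ...with disk spillover enabled
       queue.maxDiskSpace="1g"
       queue.saveOnShutdown="on"
       action.resumeRetryCount="-1")   # retry forever rather than drop
```

The confusion is real, though: most examples on the web show the old `*.* @@host` syntax with `$ActionQueue…` directives instead, and mixing the two styles is where a lot of the pain comes from.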
I can’t believe there is no sane, lightweight solution for this problem….
I can live with the need for Elasticsearch.
Kibana is a client side web app, so any static file server can serve it, that’s also fine.
But how do you bridge an application
a. which logs structured data
b. onto stdout
c. and uses different log levels
d. with a log collector
e. without losing messages during a network outage to the log collector?
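The level-mapping part of this (points a and c) is at least mechanical once you pick a convention. A minimal Python sketch, assuming bunyan-style numeric levels in JSON lines on stdout (the level table and field names are assumptions based on bunyan’s conventions, not any standard):

```python
import json

# bunyan-style numeric levels -> syslog severities; note the two
# scales run in opposite directions, as complained about above
BUNYAN_TO_SYSLOG = {
    10: 7,  # trace -> debug
    20: 7,  # debug -> debug
    30: 6,  # info  -> info
    40: 4,  # warn  -> warning
    50: 3,  # error -> err
    60: 2,  # fatal -> crit
}

def to_syslog(record_line):
    """Parse one JSON log line from an app's stdout and return
    (syslog severity, message) for handing to a forwarder."""
    record = json.loads(record_line)
    severity = BUNYAN_TO_SYSLOG.get(record.get("level", 30), 6)
    return severity, record.get("msg", "")
```

The loss-tolerance part (point e) is the genuinely hard bit, and this sketch punts on it entirely – that’s what the rsyslog/Redis buffering layers in the stack above are trying to solve.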
Hey, this is Julian from Logentries. If you are looking for an easy-to-use service for centralizing, managing and analyzing your log data, check out our free account at http://logentries.com. We have built the service for the cloud so that you can get to the important data you need, in seconds, at a very cost-effective price. Let us know your feedback or technical questions at firstname.lastname@example.org.
Let me vouch for these guys – I saw them for the first time at Velocity 2014 and was impressed; it’s a legit alternative to Sumo etc. and should probably be in your log-SaaS eval roundups. Splunk Storm never went anywhere and Loggly is dead-wait-I-hear-maybe-they’re-alive, so it’s not a big field at the moment!
The new CloudWatch log functionality is pretty nice (https://aws.amazon.com/blogs/aws/cloudwatch-log-service/) – it’ll aggregate your logs and dump them into S3 for you. Later searching is your problem, but you can run filters/detection on them as they come in and alert/graph on errors and other items of interest.
And does anyone know if Loggly is still working and moving forward or no? I hear competing rumors.
Hi Ernest and company,
Linda from Loggly here. I see your question is old, but I wanted to close the loop on this thread. Loggly is indeed alive, well, and growing! Latest news on our revenue here: