My team has a problem with log management: we manage more than one hundred machines running heterogeneous systems, with several hundred applications.
The platforms differ: Windows, Linux, Documentum, Kofax, WebSphere, IIS, etc. Each has its own log format and log location; some log to the Event Viewer, but most write to separate log files.
Sometimes it's hard to figure out which machines each system is installed on, sometimes machines run out of free disk space, and sometimes there is no easy way to find where the logs are located.
Ideally, logs should be accessible very quickly so we can collaborate on troubleshooting immediately and reduce the downtime caused by anomalies. We should also keep them for some time so that non-obvious problems can be detected a posteriori. And free disk space must be assured: systems in the production environment should never stop.
Do you know a solution and/or product that can help in a situation like this?
If you can make the data you want to log available via SNMP, a monitoring tool like Zenoss Core or Nagios/Cacti can retrieve that SNMP data from each system, log and graph it, and generate alerts when thresholds are exceeded. The good thing about SNMP is that it's freely available and cross-platform. Zenoss Core is also free and easy to set up. We use it to monitor only half a dozen servers, but it scales to many hundreds. Some features require the non-free enterprise version.
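To make the "alert when a threshold is exceeded" idea concrete for the disk-space worry in the question, here is a minimal sketch. It is not SNMP and not how Zenoss or Nagios do it internally; it only checks local paths with Python's standard library. The function name `disk_alerts` and the 10% default threshold are my own assumptions for illustration.

```python
import shutil


def disk_alerts(paths, min_free_fraction=0.10):
    """Return (path, free_fraction) for each path whose free space is below the threshold.

    min_free_fraction is a hypothetical default; pick whatever suits your servers.
    """
    alerts = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        if usage.total == 0:
            # Some pseudo filesystems report zero size; skip them.
            continue
        free_fraction = usage.free / usage.total
        if free_fraction < min_free_fraction:
            alerts.append((path, free_fraction))
    return alerts
```

You could run something like `disk_alerts(["/var/log", "/"])` from a scheduled job and mail the result, but a real monitoring tool gives you history, graphs and escalation on top of that.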
A tool like Splunk is another option. This simply collects the raw log data (you tell your servers to send their logs to your Splunk server), indexes it and makes it searchable. You can create reports, dashboards and alerts. It requires more setup and isn't free, but is powerful because it's very free-form, and will allow you to correlate events across many servers. Check out their demonstration video.
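For applications you control, "send the logs to a central server" can be as simple as adding a syslog handler. The sketch below uses Python's standard `logging.handlers.SysLogHandler` and assumes your collector accepts syslog over UDP on the host and port you give it; that is an assumption about your setup, not something specific to Splunk. The helper name `make_forwarding_logger` is hypothetical.

```python
import logging
import logging.handlers


def make_forwarding_logger(name, host, port=514):
    """Return a logger that forwards records to a syslog collector over UDP.

    host and port are assumptions about where your central collector listens.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # SysLogHandler defaults to UDP when given a (host, port) tuple.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    # Prefix each message with the application name so it is searchable centrally.
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Usage would look like `log = make_forwarding_logger("billing-app", "logs.example.internal")` followed by `log.warning("disk almost full")`, where the host name is a placeholder. UDP is fire-and-forget, so messages can be lost; that is a trade-off to weigh against a TCP or agent-based forwarder.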
I'm pretty sure Nagios is the way you want to go here. We have it set up on our network and it works great.
We use NFS mounts from NetApp 2020s as central logging points. You've still got to write some code to look for issues, but at least the logs are all in fewer places to get to.
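The "code to look for issues" can start very small. Here is a rough sketch, assuming the logs end up as `*.log` text files somewhere under one mounted directory; the error pattern, the `.log` suffix and the function name `find_issues` are all assumptions you would adapt to your own applications.

```python
import re
from pathlib import Path

# Hypothetical pattern; adjust it to what your applications actually write.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|Exception)\b")


def find_issues(root, pattern=ERROR_PATTERN, suffix=".log"):
    """Yield (path, line_number, line) for every matching line under root."""
    for path in sorted(Path(root).rglob(f"*{suffix}")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going when a log has odd encodings.
        with path.open("r", encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if pattern.search(line):
                    yield path, line_number, line.rstrip("\n")
```

Something like `for path, n, line in find_issues("/mnt/logs"): print(f"{path}:{n}: {line}")` would then list the hits, with `/mnt/logs` standing in for wherever the share is mounted. It rereads every file on each run, so for large volumes you would want to remember offsets or use a proper indexing tool.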