Yamon: Micro-HOWTO
==================

   1. cp sample.yam yourhost.yam && vim yourhost.yam
   2. perl -c yourhost.yam && yamon.pl -v -a yourhost.yam
   3. crontab -e
   4. Profit!!


Yamon: Slightly longer HOWTO
============================

Yamon, as of version 0.92, consists of two Perl programs.

One, yamon.pl, takes care of monitoring and alerting.  Alone, yamon.pl
does "black box" monitoring, that is, probing over the Internet whether
services are up and running and diagnosing problems much the way a
technically savvy user might.  Yamon.pl may be all you need.

The other, yamon.cgi, adds the ability to perform "white box" monitoring
as well.  If installed on the hosts you want to monitor, it will expose
internal system details, such as free disk space and running processes,
to yamon.pl.  White box monitoring has the potential to warn you about
problems (such as disks filling up or logs growing too fast) before they
actually lead to user-visible problems.


Getting yamon.pl up and running
-------------------------------

To make Yamon monitor something, the quickest way is probably to make a
copy of the sample.yam file named "yourhost.yam", and edit that.

To verify that the syntax of your new .yam file is correct, run:

   perl -c yourhost.yam

It should say "yourhost.yam syntax OK".  You can then try running it
manually in verbose mode with:

   ./yamon.pl -v yourhost.yam

Or, to force it to run all the tests (no dependencies):

   ./yamon.pl -v -a yourhost.yam

You can simulate a failure by creating files in the current working
directory named "fail-TEST", where TEST is the name of the test you want
to fail, as defined in "yourhost.yam".  Success can be simulated by
creating similar files named "succeed-TEST".

Play with this a bit and have a look at the contents of the stats file
the script creates.  Note that you will probably need to run yamon.pl a
few times to cause it to actually send an alert, depending on what you've
set the alert_threshold to.

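The fail-TEST / succeed-TEST marker files described above can be toggled
with a couple of small shell helpers.  This is only a sketch: the helper
names (simulate, clear_simulation) and the test name "http" are made up
for illustration, and it assumes that an empty marker file is enough.

```shell
#!/bin/sh
# Sketch: create and remove Yamon's simulation marker files.
# "http" in the example session is a hypothetical test name; use a
# test name that is actually defined in your yourhost.yam.

simulate() {
    # usage: simulate fail|succeed TEST
    case "$1" in
        fail|succeed)
            # Create an empty marker file in the current directory
            # (assumption: the file's contents do not matter).
            : > "$1-$2"
            ;;
        *)
            echo "usage: simulate fail|succeed TEST" >&2
            return 1
            ;;
    esac
}

clear_simulation() {
    # usage: clear_simulation TEST
    rm -f "fail-$1" "succeed-$1"
}

# Example session, run from the directory yamon.pl is started in:
#   simulate fail http
#   ./yamon.pl -v -a yourhost.yam   # repeat until alert_threshold is hit
#   clear_simulation http
```

Remember to remove the marker files afterwards, or the simulated result
may stick around for later runs.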
See also http://yamon.klaki.net/ for further suggestions on how/what to
monitor.

If all looks well, you finish up by adding Yamon to your Unix crontab.
Run "crontab -e" from the shell, and add a line like this to the file:

   */5 * * * * /path/to/yamon.pl /path/to/yourhost.yam

The "*/5" bit means "run this every five minutes".  Feel free to choose
some other frequency, whatever you think makes sense for your service.
See "man 5 crontab" for other scheduling techniques.

What you put in the crontab file sets the maximum frequency of Yamon
tests.  Depending on the definitions in the .yam file, it may end up
testing things far less frequently (until things break, that is: failing
tests run at the maximum frequency until the problems go away).


System health checks and yamon.cgi
----------------------------------

Yamon 0.92 adds the "syshealth" check and the yamon.cgi program.

The syshealth check lets Yamon poll a web server for a simple list of
variables and values.  Yamon can be configured to check whether some (or
all) of the exposed values are within an expected range, and alert if
not.  See the sample config for details.

You can add functionality to your own web apps to expose critical details
to Yamon, or just use yamon.cgi.  Or both!

Yamon.cgi exposes common system variables in the format expected by the
syshealth check.  Most of the common system variables people like to
monitor, such as disk utilization and running processes, are exposed.

Yamon.cgi can be installed as a normal CGI program (consult your web
server's manual for details) or it can be added to (x)inetd as a
standalone service on a port of its own.

The CGI installation is usually simpler (just copy yamon.cgi to your
/cgi-bin/ directory), but I personally prefer to use xinetd, as this
allows yamon.cgi to remain useful even if the system's main web server
goes down for some reason.  A sample config for xinetd is included in the
samples directory.

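If you would rather expose a few values of your own than install
yamon.cgi, a reporter only has to print plain-text "name: value" lines,
one per variable.  The following is a minimal sketch, not part of Yamon:
it assumes installation as a CGI program (hence the Content-Type header),
and the variable names node_name and disk_root_percent are invented for
the example.

```shell
#!/bin/sh
# Hypothetical syshealth-style reporter; not shipped with Yamon.
# The variable names below are illustrative assumptions.

emit() {
    # Print one "name: value" line.
    printf '%s: %s\n' "$1" "$2"
}

report() {
    emit node_name "$(uname -n)"
    # Percentage of the root filesystem in use, with the "%" sign
    # stripped (whether the check wants the sign is an assumption).
    emit disk_root_percent \
        "$(df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')"
}

# CGI header, a blank line, then the report itself.
printf 'Content-Type: text/plain\r\n\r\n'
report
```

Keeping the values as bare numbers should make them easier to compare
against an expected range, but check the sample config for what the
syshealth check actually accepts.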
Yamon.cgi currently exposes the following information:

   - The system hostname and uname -a output
   - Disk utilization (blocks and inodes) as a percentage per filesystem
   - Load average
   - The sizes of log files in /var/log/...
   - The forecasted relative size of this time-frame's log compared with
     the last.  A trend of 2.0 means the log has suddenly become twice
     as busy, possibly a sign of trouble.
   - The size of the mail queue, as reported by mailq (only if run as a
     user with the required permissions)
   - Memory usage statistics (Linux only, sorry)
   - Process statistics:
      - The total number of running processes
      - Counts for all running binaries (how many of each are active)
      - The binary using the most CPU
      - The binary with the most cumulative CPU time
      - Which binary is running the most instances

If yamon.cgi doesn't expose all the information you need, you can easily
write your own similar programs and instruct yamon.pl to monitor them as
well; the format expected by syshealth is plain text output like this:

   variable_name: VALUE
   variable_name_2: VALUE2

If the values are numbers or percentages, you can use yamon.pl's built-in
range checking to verify that they are neither too high nor too low.

WARNING: many people would consider it a security risk to expose all that
information to the Internet at large.  Consider restricting access to
your system health reporters to only the IP addresses of the hosts doing
the monitoring.


Keeping historic data using rrdtool
-----------------------------------

Yamon 0.94 adds support for automatically exporting collected data to
rrdtool round-robin databases.  This is primarily useful for generating
graphs, but the data could potentially be extracted for other uses as
well.

For normal tests, you can record two things: the time it took to run the
test, and whether the test failed or not.  For syshealth style checks,
you can also record any of the individual variables collected.  See
sample.yam for examples (grep for "rrd").

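Since the export depends on an external rrdtool program, it may be worth
checking that the binary can actually be found before turning the rrd
settings on.  A small sketch; have_cmd is a made-up helper name, and the
check only tells you about the PATH of the shell it runs in:

```shell
#!/bin/sh
# Sketch: report whether an rrdtool binary is reachable via PATH.

have_cmd() {
    # Succeeds if the named command can be found by the shell.
    command -v "$1" >/dev/null 2>&1
}

if have_cmd rrdtool; then
    echo "rrdtool found: $(command -v rrdtool)"
else
    echo "rrdtool not found in PATH ($PATH)" >&2
fi
```

Jobs started from cron often get a shorter PATH than an interactive
shell, so a check that passes at the prompt can still fail under cron.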
Note that you will need to have the rrdtool package installed and in your
path.  Aside from auto-creating (and updating) RRD databases for you,
Yamon does little to simplify/obscure the rrdtool command-line interface.
You'll need to read the rrdtool documentation:

   http://www.mrtg.org/rrdtool/

...

Happy monitoring!

 - Bjarni R. Einarsson
   http://bre.klaki.net/