Monitoring free inodes on Linux with Nagios/Adagios

This howto assumes:

  • nrpe is installed and working on the client
  • CentOS 6/7 on both sides
  • Nagios/Adagios server with pynag installed and working

On the server you want to monitor:

Install the check_disk plugin for nrpe:

yum install nagios-plugins-disk

Add the following to /etc/nrpe.d/check_disk_inodes.cfg:

command[check_disk_inodes]=/usr/lib64/nagios/plugins/check_disk -W "$ARG1$" -C "$ARG2$" "$ARG3$"

Restart NRPE (on CentOS 7, use systemctl restart nrpe.service instead):

service nrpe restart
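
Because the command definition above takes $ARG1$ through $ARG3$, NRPE has to be configured to accept arguments from the server, which means dont_blame_nrpe=1 in nrpe.cfg (and a build with command arguments enabled). A small sketch for checking that setting follows. The function name nrpe_args_enabled is made up for illustration, and the /etc/nagios/nrpe.cfg path in the usage line is an assumption about your install.

```shell
#!/bin/bash
# Succeeds if the given NRPE config file sets dont_blame_nrpe=1,
# i.e. NRPE accepts command arguments sent with check_nrpe -a.
nrpe_args_enabled() {
    local cfg=$1
    grep -Eq '^[[:space:]]*dont_blame_nrpe[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$cfg"
}

# Usage (path is an assumption, adjust to your install):
#   nrpe_args_enabled /etc/nagios/nrpe.cfg || echo "NRPE will reject -a arguments"
```

If the function fails, the check on the Nagios server will typically return an NRPE error instead of the disk status.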

On the Nagios server:

Add a check command:

pynag add command command_name="2ks-check_nrpe_disk_inodes" command_line='$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_disk_inodes -a "$_SERVICE_WARNING$" "$_SERVICE_CRITICAL$" "$_SERVICE_OPTIONAL_ARGUMENTS$"'
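
The single quotes around command_line matter: Nagios macros such as $HOSTADDRESS$ look like shell variables, so inside double quotes the shell would expand them before pynag ever sees them. A minimal illustration, using throwaway variable names (literal, expanded) that are not part of the howto:

```shell
#!/bin/bash
# Single quotes keep Nagios macros literal, so they are stored unchanged.
literal='check_nrpe -H $HOSTADDRESS$ -a "$_SERVICE_WARNING$"'
printf '%s\n' "$literal"

# In double quotes the shell expands $HOSTADDRESS and $_SERVICE_WARNING.
# They are set to empty here, which is what an interactive shell would
# normally substitute for these unset names.
HOSTADDRESS=""
_SERVICE_WARNING=""
expanded="check_nrpe -H $HOSTADDRESS$ -a \"$_SERVICE_WARNING$\""
printf '%s\n' "$expanded"
```

The first line printed still contains the macros. The second has lost them, which is the form that would silently break the check.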

NOTE: In my case pynag placed the cfg file in /etc/nagios/commands/, but it was not included as a cfg_dir in nagios.cfg. To fix that, run:

pynag config --append cfg_dir=/etc/nagios/commands/

Add the service to the host:

pynag add service service_description="Disk inodes" use="generic-service" host_name="host.domain.com" check_command="2ks-check_nrpe_disk_inodes" __CRITICAL="5%" __WARNING="10%"

Reload Nagios (on CentOS 7, use the equivalent systemctl command):

service nagios reload
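
Since pynag has just written new objects, it can be worth verifying the configuration before reloading. Nagios can check a config file with nagios -v. The wrapper below is only a sketch: safe_reload is a hypothetical helper, and the /etc/nagios/nagios.cfg path in the usage lines is assumed.

```shell
#!/bin/bash
# Run a verification command and only run the reload command if it succeeds.
# Both commands are passed as strings so the helper works on CentOS 6 and 7.
safe_reload() {
    local verify_cmd=$1 reload_cmd=$2
    if $verify_cmd >/dev/null 2>&1; then
        $reload_cmd
    else
        echo "configuration check failed, not reloading" >&2
        return 1
    fi
}

# Usage (CentOS 6):
#   safe_reload "nagios -v /etc/nagios/nagios.cfg" "service nagios reload"
# Usage (CentOS 7):
#   safe_reload "nagios -v /etc/nagios/nagios.cfg" "systemctl reload nagios"
```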

The check output should now show something like:

DISK OK - free space: / 1613 MB (35% inode=95%): /boot 53 MB (57% inode=99%): /dev/shm 1004 MB (100% inode=99%): /var/spool 8682 MB (53% inode=11%):
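
To pick out the inode figures from that line, for example while choosing thresholds, the output can be split per mount point. This is a rough sketch that assumes the output format shown above. low_inodes is a hypothetical helper, not part of check_disk.

```shell
#!/bin/bash
# Read check_disk output on stdin and print every mount point whose
# free-inode percentage is below the given limit.
low_inodes() {
    local limit=$1
    # Drop the status prefix, then put each "mount ... (x% inode=y%)" on its own line.
    sed 's/^.*free space: //' | tr ':' '\n' |
    awk -v limit="$limit" '
        match($0, /inode=[0-9]+%/) {
            # Extract the digits between "inode=" and "%".
            pct = substr($0, RSTART + 6, RLENGTH - 7)
            if (pct + 0 < limit) print $1, pct "%"
        }'
}

sample='DISK OK - free space: / 1613 MB (35% inode=95%): /boot 53 MB (57% inode=99%): /dev/shm 1004 MB (100% inode=99%): /var/spool 8682 MB (53% inode=11%):'
echo "$sample" | low_inodes 20
# prints: /var/spool 11%
```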

PostgreSQL 9.2 monitoring with Adagios on CentOS 7

On the PostgreSQL server:

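Note: If SELinux is enforcing, it may block NRPE from running sudo or the plugin. Check /var/log/audit/audit.log for denials and adjust the policy as needed.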

Install the required Perl modules, download the check script, and make it executable:

yum install perl-Data-Dumper perl-Digest-MD5 perl-Getopt-Long perl-File-Temp perl-Time-HiRes perl-TimeDate
cd /usr/lib64/nagios/plugins
wget https://raw.githubusercontent.com/bucardo/check_postgres/master/check_postgres.pl
chmod +x check_postgres.pl

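Add the following to /usr/lib64/nagios/plugins/check_postgres_stats.sh and make it executable with chmod +x:

#!/bin/bash
# Collect dbstats for the database given as the first argument.
DB="$1"
# dbstats prints name:value pairs; Nagios performance data needs name=value.
STATS=$(/usr/lib64/nagios/plugins/check_postgres.pl --datadir /var/lib/pgsql/data/ -db "$DB" --action dbstats | sed 's/:/=/g')
echo "OK: Postgres stats collected | $STATS"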
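
The sed call is what turns the dbstats output into Nagios performance data. A minimal illustration of that one step follows. The sample line is made up (real dbstats output has more fields), and to_perfdata is a hypothetical name.

```shell
#!/bin/bash
# Convert "name:value" pairs on stdin into "name=value" pairs.
to_perfdata() {
    sed 's/:/=/g'
}

# Hypothetical, shortened dbstats-style output.
sample='backends:3 commits:1200 rollbacks:4'
echo "OK: Postgres stats collected | $(echo "$sample" | to_perfdata)"
# prints: OK: Postgres stats collected | backends=3 commits=1200 rollbacks=4
```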

Add the following to /etc/nrpe.d/check_postgres.cfg:

command[check_postgres]=/usr/bin/sudo -u postgres /usr/lib64/nagios/plugins/check_postgres.pl --datadir /var/lib/pgsql/data/ -db '$ARG1$' --action '$ARG2$'
command[check_postgres_w]=/usr/bin/sudo -u postgres "/usr/lib64/nagios/plugins/check_postgres.pl" --datadir /var/lib/pgsql/data/ -db '$ARG1$' --action '$ARG2$' --warning '$ARG3$'
command[check_postgres_wc]=/usr/bin/sudo -u postgres "/usr/lib64/nagios/plugins/check_postgres.pl" --datadir /var/lib/pgsql/data/ -db '$ARG1$' --action '$ARG2$' --warning '$ARG3$' --critical '$ARG4$'
command[check_postgres_stats]=/usr/bin/sudo -u postgres /usr/lib64/nagios/plugins/check_postgres_stats.sh '$ARG1$'

Open /etc/sudoers.d/nrpe with visudo:

visudo -f /etc/sudoers.d/nrpe

and add the following:

Defaults:nrpe !requiretty
nrpe ALL=(postgres) NOPASSWD: /usr/lib64/nagios/plugins/check_postgres.pl
nrpe ALL=(postgres) NOPASSWD: /usr/lib64/nagios/plugins/check_postgres_stats.sh

On the Nagios server:

Create the check commands:

pynag add command command_name="2ks-check_nrpe_postgres" command_line='$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_postgres -a "$_SERVICE_DATABASE$" "$_SERVICE_ACTION$"'
pynag add command command_name="2ks-check_nrpe_postgres_w" command_line='$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_postgres_w -a "$_SERVICE_DATABASE$" "$_SERVICE_ACTION$" "$_SERVICE_WARNING$"'
pynag add command command_name="2ks-check_nrpe_postgres_wc" command_line='$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_postgres_wc -a "$_SERVICE_DATABASE$" "$_SERVICE_ACTION$" "$_SERVICE_WARNING$" "$_SERVICE_CRITICAL$"'
pynag add command command_name="2ks-check_nrpe_postgres_stats" command_line='$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_postgres_stats -a "$_SERVICE_DATABASE$"'

Create the okconfig template /etc/nagios/okconfig/examples/postgres.cfg-example:

define service {
    use                            okc-linux-check_proc
    __WARNING                      1:
    __NAME                         postgres
    host_name                      HOSTNAME
    service_description            Process postgres
    __CRITICAL                     :20
    check_command                 okc-check_nrpe!check_procs -a $_SERVICE_WARNING$ $_SERVICE_CRITICAL$ $_SERVICE_NAME$
}

define service {
    use                            generic-service
    host_name                      HOSTNAME
    service_description            PostgreSQL Database connection
    check_command                  2ks-check_nrpe_postgres
    __DATABASE                     database_1
    __ACTION                       connection
    notes                          Simply connects and returns version number.
}

define service {
    use                            generic-service
    host_name                      HOSTNAME
    service_description            PostgreSQL Database statistics
    check_command                  2ks-check_nrpe_postgres_stats
    __DATABASE                     database_1
    notes                          Reports information from the pg_stat_database view, and outputs as performance data.
}

define service {
    use                            generic-service
    host_name                      HOSTNAME
    service_description            PostgreSQL Database bloat
    check_command                  2ks-check_nrpe_postgres_wc
    __DATABASE                     database_1
    __ACTION                       bloat
    __WARNING                      25%
    __CRITICAL                     50%
    notes                          Checks the amount of bloat in tables and indexes. Bloat is generally the amount of dead unused space taken up in a table or index. This space is usually reclaimed by use of the VACUUM command.
}

define service {
    use                            generic-service
    host_name                      HOSTNAME
    service_description            PostgreSQL Database locks
    check_command                  2ks-check_nrpe_postgres_wc
    __DATABASE                     database_1
    __ACTION                       locks
    __WARNING                      150
    __CRITICAL                     300
    notes                          Check the total number of locks on one or more databases.
}

define service {
    use                            generic-service
    host_name                      HOSTNAME
    service_description            PostgreSQL Database timesync
    check_command                  2ks-check_nrpe_postgres_wc
    __DATABASE                     database_1
    __ACTION                       timesync
    __WARNING                      2
    __CRITICAL                     5
    notes                          Compares the local system time with the time reported by one or more databases.
}

define service {
    use                            generic-service
    host_name                      HOSTNAME
    service_description            PostgreSQL Database last vacuum
    check_command                  2ks-check_nrpe_postgres_wc
    __DATABASE                     database_1
    __ACTION                       last_vacuum
    __WARNING                      3d
    __CRITICAL                     7d
    notes                          Checks how long it has been since vacuum (or analyze) was last run on each table in one or more databases.
}

define service {
    use                            generic-service
    host_name                      HOSTNAME
    service_description            PostgreSQL Database backends
    check_command                  2ks-check_nrpe_postgres_wc
    __DATABASE                     database_1
    __ACTION                       backends
    __WARNING                      80
    __CRITICAL                     95
    notes                          Checks the current number of connections for one or more databases.
}

define service {
    use                            generic-service
    host_name                      HOSTNAME
    service_description            PostgreSQL Database hitratio
    check_command                  2ks-check_nrpe_postgres_wc
    __DATABASE                     database_1
    __ACTION                       hitratio
    __WARNING                      90%
    __CRITICAL                     80%
    notes                          Checks the hit ratio of all databases and complains when they are too low.
}

define service {
    use                            generic-service
    host_name                      HOSTNAME
    service_description            PostgreSQL Database query time
    check_command                  2ks-check_nrpe_postgres_wc
    __DATABASE                     database_1
    __ACTION                       query_time
    __WARNING                      5
    __CRITICAL                     10
    notes                          Checks the length of running queries on one or more databases.
}

define service {
    use                            generic-service
    host_name                      HOSTNAME
    service_description            PostgreSQL Database connections idle in transaction
    check_command                  2ks-check_nrpe_postgres_wc
    __DATABASE                     database_1
    __ACTION                       txn_idle
    __WARNING                      2 for 5 seconds
    __CRITICAL                     5 for 10 seconds
    notes                          Checks the number and duration of "idle in transaction" queries on one or more databases.
}

define service {
    use                            generic-service
    host_name                      HOSTNAME
    service_description            PostgreSQL Database disabled triggers
    check_command                  2ks-check_nrpe_postgres_w
    __DATABASE                     database_1
    __ACTION                       disabled_triggers
    __WARNING                      1
    notes                          Checks on the number of disabled triggers inside the database. In normal usage having disabled triggers is a dangerous event.
}

define service {
    use                            generic-service
    host_name                      HOSTNAME
    service_description            PostgreSQL Database last checkpoint
    check_command                  2ks-check_nrpe_postgres_wc
    __DATABASE                     database_1
    __ACTION                       checkpoint
    __WARNING                      400
    __CRITICAL                     600
    notes                          Determines how long since the last checkpoint has been run.
}

define service {
    use                            generic-service
    host_name                      HOSTNAME
    service_description            PostgreSQL Database settings checksum
    check_command                  2ks-check_nrpe_postgres_w
    __DATABASE                     database_1
    __ACTION                       settings_checksum
    __WARNING                      c6358648f0d06757a8311709be307f24
    notes                          Checks that all the Postgres settings are the same as last time you checked.
}

define service {
    use                            generic-service
    host_name                      HOSTNAME
    service_description            PostgreSQL Database size
    check_command                  2ks-check_nrpe_postgres_wc
    __DATABASE                     database_1
    __ACTION                       database_size
    __WARNING                      15GB
    __CRITICAL                     30GB
    notes                          Checks the size of all databases and complains when they are too big.
}
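
Most of these definitions differ only in description, action, and thresholds, so the repetitive ones can also be generated. The following is a sketch under assumptions, not part of okconfig: pg_service is a hypothetical helper, and it only covers services that use the 2ks-check_nrpe_postgres_wc command with both thresholds.

```shell
#!/bin/bash
# Print one service definition for a check_postgres action.
# Arguments: description suffix, action, warning, critical.
pg_service() {
    local desc=$1 action=$2 warn=$3 crit=$4
    cat <<EOF
define service {
    use                            generic-service
    host_name                      HOSTNAME
    service_description            PostgreSQL Database $desc
    check_command                  2ks-check_nrpe_postgres_wc
    __DATABASE                     database_1
    __ACTION                       $action
    __WARNING                      $warn
    __CRITICAL                     $crit
}

EOF
}

pg_service "locks"    locks    150 300
pg_service "backends" backends 80  95
```

Appending the output to the template file gives the same kind of definitions as the hand-written ones, minus the notes lines.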

Add the template to a host:

okconfig addtemplate db-01.domain.com --template postgres

The values provided in the above configuration are examples. You should change them according to your needs.
Source: https://bucardo.org/check_postgres/check_postgres.pl.html


Varnish 4 monitoring with Adagios on CentOS 7

On the Varnish server:

Install prerequisites:

yum install git automake libtool varnish-libs-devel

Clone the varnish-nagios repo, autogen, configure, and make:

git clone https://github.com/varnish/varnish-nagios.git
cd varnish-nagios
./autogen.sh
./configure
make

Move the check_varnish binary to /usr/lib64/nagios/plugins/ and restore SELinux context:

mv check_varnish /usr/lib64/nagios/plugins/
restorecon /usr/lib64/nagios/plugins/check_varnish

Create the nrpe command and restart nrpe:

echo 'command[check_varnish]=/usr/lib64/nagios/plugins/check_varnish -p "$ARG1$" -w "$ARG2$" -c "$ARG3$"' > /etc/nrpe.d/check_varnish.cfg
systemctl restart nrpe.service

To see if the check works, run:

/usr/lib64/nagios/plugins/check_varnish -p MAIN.sess_dropped -w 0 -c 5
/usr/lib64/nagios/plugins/check_varnish -p MGT.child_panic -w 0 -c 2
/usr/lib64/nagios/plugins/check_varnish -p SMA.Transient.c_fail -c 0
/usr/lib64/nagios/plugins/check_varnish -p ratio -w 20:90 -c 10:98

It should return:

[root@varnish-host ~]# /usr/lib64/nagios/plugins/check_varnish -p MAIN.sess_dropped -w 0 -c 5
VARNISH OK: Sessions dropped for thread (0)|MAIN.sess_dropped=0
[root@varnish-host ~]# /usr/lib64/nagios/plugins/check_varnish -p MGT.child_panic -w 0 -c 2
VARNISH OK: Child process panic (0)|MGT.child_panic=0
[root@varnish-host ~]# /usr/lib64/nagios/plugins/check_varnish -p SMA.Transient.c_fail -c 0
VARNISH OK: Allocator failures (0)|SMA.Transient.c_fail=0
[root@varnish-host ~]# /usr/lib64/nagios/plugins/check_varnish -p ratio -w 20:90 -c 10:98
VARNISH OK: Cache hit ratio (26)|ratio=26
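
The -w and -c values such as 20:90 read as ranges: the check alerts when the value falls outside them. The sketch below assumes check_varnish follows the standard Nagios plugin range syntax, which the output above is consistent with. It handles only the simple lo:hi and bare hi forms with integer values, and outside_range and status_for are hypothetical names.

```shell
#!/bin/bash
# Return 0 (alert) if value lies outside the inclusive range "lo:hi".
# A bare "hi" is treated as "0:hi". Other range forms are not handled.
outside_range() {
    local value=$1 range=$2 lo hi
    if [[ $range == *:* ]]; then
        lo=${range%%:*}
        hi=${range#*:}
    else
        lo=0
        hi=$range
    fi
    (( value < lo || value > hi ))
}

# Map a value plus warning and critical ranges to a Nagios status word.
status_for() {
    local value=$1 warn=$2 crit=$3
    if outside_range "$value" "$crit"; then
        echo CRITICAL
    elif outside_range "$value" "$warn"; then
        echo WARNING
    else
        echo OK
    fi
}

status_for 26 20:90 10:98   # OK, the hit ratio from the output above
status_for 95 20:90 10:98   # WARNING, above 90 but still within 10:98
status_for 3 0 5            # WARNING, above 0 but not above 5
```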

On the Nagios server:

Create a check command:

pynag add command command_name="2ks-check_nrpe_varnish_status" command_line='$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_varnish -a "$_SERVICE_PARAMETER$" "$_SERVICE_WARNING$" "$_SERVICE_CRITICAL$"'

NOTE: In my case pynag placed the cfg file in /etc/nagios/commands/, but it was not included as a cfg_dir in nagios.cfg. To fix that, run:

pynag config --append cfg_dir=/etc/nagios/commands/

Create an okconfig template:

echo 'define service {
    service_description            Varnish: Sessions dropped
    use                            generic-service
    host_name                      HOSTNAME
    __PARAMETER                   MAIN.sess_dropped
    check_command                 2ks-check_nrpe_varnish_status
    __CRITICAL                    5
    __WARNING                     0
    notes                         This counter will show the number of requests that have to be dropped because no more threads were available to handle them.
}
define service {
    service_description            Varnish: Child process panic
    use                            generic-service
    host_name                      HOSTNAME
    __PARAMETER                   MGT.child_panic
    check_command                 2ks-check_nrpe_varnish_status
    __CRITICAL                    2
    __WARNING                     0
    notes                         This counter will count the number of times the child has panicked. The master process will restart the child immediately when it happens, and the cache will be flushed.
}
define service {
    service_description            Varnish: Allocator failures
    use                            generic-service
    host_name                      HOSTNAME
    __PARAMETER                   SMA.Transient.c_fail
    check_command                 2ks-check_nrpe_varnish_status
    __CRITICAL                    0
    __WARNING                     0
    notes                         This counter indicates that the operating system is unable to allocate memory as requested.
}
define service {
    service_description            Varnish: Cache hit ratio
    use                            generic-service
    host_name                      HOSTNAME
    __PARAMETER                   ratio
    check_command                 2ks-check_nrpe_varnish_status
    __CRITICAL                    10:98
    __WARNING                     20:90
}
define service {
    use                            okc-linux-check_proc
    __WARNING                      1:
    __NAME                         varnishd
    host_name                      HOSTNAME
    service_description            Process varnishd
    __CRITICAL                     :10
    check_command                 okc-check_nrpe!check_procs -a $_SERVICE_WARNING$ $_SERVICE_CRITICAL$ $_SERVICE_NAME$

}' > /etc/nagios/okconfig/examples/varnish.cfg-example

Add the template to a host:

okconfig addtemplate www-01.domain.com --template varnish

Reload nagios and run the service checks from the Adagios web interface. They should be green.

Sources:
https://www.varnish-software.com/blog/blog-sysadmin-monitoring-health-varnish-cache


Nginx 1.6.3 status monitoring with Adagios on CentOS 7

On the Nagios server, download check_nginx_status.pl and make it executable:

cd /usr/lib64/nagios/plugins/
wget https://raw.githubusercontent.com/regilero/check_nginx_status/master/check_nginx_status.pl
chmod +x check_nginx_status.pl

Install prerequisites:

yum install perl-libwww-perl nagios-plugins-perl

Create a check command:

pynag add command command_name="2ks-check_nginx_status" command_line='$USER1$/check_nginx_status.pl -H $HOSTADDRESS$ -p $_SERVICE_PORT$ -s $_SERVICE_SERVER_NAME$ $_SERVICE_OPTIONAL_ARGUMENTS$'
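
check_nginx_status.pl reads nginx's stub_status page, so that page has to be enabled on the web server and reachable from the Nagios server. The sketch below parses a stub_status response, which is handy for confirming the page works before blaming the plugin. The sample text follows the stub_status format with made-up numbers, parse_stub_status is a hypothetical helper, and the /nginx_status URL in the usage line is an assumption about your nginx config.

```shell
#!/bin/bash
# Read a stub_status response on stdin and print the counters as key=value.
parse_stub_status() {
    awk '
        /^Active connections:/ { print "active=" $3 }
        /^Reading:/ {
            print "reading=" $2
            print "writing=" $4
            print "waiting=" $6
        }'
}

# Made-up sample in the stub_status layout.
sample='Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106'

printf '%s\n' "$sample" | parse_stub_status

# Usage against a live server (URL is an assumption):
#   curl -s http://www-01.domain.com/nginx_status | parse_stub_status
```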

Create an okconfig template:

cd /etc/nagios/okconfig/examples
echo 'define service {
    service_description            Nginx status
    use                            generic-service
    host_name                      HOSTNAME
    check_command                  2ks-check_nginx_status
    __PORT                         80
    __SERVER_NAME                  HOSTNAME
}
define service {
    use                            okc-linux-check_proc
    __WARNING                      1:
    __NAME                         nginx
    host_name                      HOSTNAME
    service_description            Process nginx
    __CRITICAL                     :10
    check_command                 okc-check_nrpe!check_procs -a $_SERVICE_WARNING$ $_SERVICE_CRITICAL$ $_SERVICE_NAME$
}' > nginx.cfg-example

Add the template to a host:

okconfig addtemplate www-01.domain.com --template nginx

Reload nagios and run the service checks from the Adagios web interface. They should be green.

Troubleshooting

Problem: Running the plugin in nagios causes this error:

NGINX UNKNOWN - unable to write temporary data in :/tmp/123.123.123._check_nginx_status...

Probable causes:

  • You ran the plugin as root, so it saved the *check_nginx_status file in /tmp with root ownership, and nagios can’t overwrite it.
  • SELinux or other ACLs are preventing nagios from writing to /tmp.

Solutions:

Delete the file:

rm -f /tmp/*check_nginx_status*

Or reset permissions on /tmp (mode 1777 keeps the sticky bit that /tmp normally has):

chmod 1777 /tmp/
chown root:root /tmp/
restorecon -R /tmp/

NOTE: SELinux should be in permissive mode when using Adagios, unless you can create rules for it.
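
To see which of the two causes applies before changing anything, you can inspect /tmp first. On a standard system /tmp is world-writable with the sticky bit set. A small sketch follows. check_tmp_state is a hypothetical helper, and the nagios user name in the usage line is assumed to be the account your checks run as.

```shell
#!/bin/bash
# Report whether a directory has the sticky bit, then list plugin state
# files in it that are NOT owned by the given user.
check_tmp_state() {
    local dir=$1 user=$2
    if [ -k "$dir" ]; then
        echo "sticky bit set on $dir"
    else
        echo "sticky bit missing on $dir"
    fi
    find "$dir" -maxdepth 1 -name '*check_nginx_status*' ! -user "$user" -print
}

# Usage (user name is an assumption):
#   check_tmp_state /tmp nagios
```

Any file listed after the first line is a candidate for the rm command above.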