Keeping Naemon or Nagios running at all times with a systemd drop-in unit

Due to a silly bug in Naemon 1.0.4, I looked into ways to make sure it always restarts if it dies or is killed. Turns out it’s rather easy thanks to systemd.

Create a naemon.service.d directory in /etc/systemd/system/:

cd /etc/systemd/system/
mkdir naemon.service.d
cd naemon.service.d

Create the file 10-restart.conf with the following contents:

[Service]
RestartSec=10s
Restart=always
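
If you would rather script the directory and file creation, something like the following should do. It is only a sketch: UNIT_ROOT is a variable I made up so the snippet can be tried in a scratch directory first. On a real host you would set it to /etc/systemd/system and run it as root.

```shell
# UNIT_ROOT is a hypothetical override; without it the sketch writes to a
# throwaway directory instead of /etc/systemd/system.
UNIT_ROOT="${UNIT_ROOT:-$(mktemp -d)}"
dropin_dir="$UNIT_ROOT/naemon.service.d"

# Create the drop-in directory and the override file in one go.
mkdir -p "$dropin_dir"
cat > "$dropin_dir/10-restart.conf" <<'EOF'
[Service]
RestartSec=10s
Restart=always
EOF

# Show what was written.
echo "wrote $dropin_dir/10-restart.conf"
cat "$dropin_dir/10-restart.conf"
```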

Now reload systemd:

systemctl daemon-reload

And make sure the unit is overridden:

[root@manage naemon.service.d]# systemd-delta | grep naemon
[EXTENDED] /usr/lib/systemd/system/naemon.service -> /etc/systemd/system/naemon.service.d/10-restart.conf

Then try killing naemon and watch it restart:

killall naemon
watch systemctl status naemon

Regex process check with Nagios/Adagios

Check Command:

define command {
 command_name check_nrpe_procs_regex
 command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_procs_regex -a $_SERVICE_WARNING$ $_SERVICE_CRITICAL$ $_SERVICE_USER$ $_SERVICE_EREG_ARG_ARRAY$
}

NRPE command:

command[check_procs_regex]=/usr/lib64/nagios/plugins/check_procs -w $ARG1$ -c $ARG2$ -u $ARG3$ --ereg-argument-array "$ARG4$"

Nagios service:

define service {
 use okc-linux-check_proc
 host_name hostname.domain.com
 __NAME apache2
 __WARNING 1:100
 __CRITICAL 0:200
 service_description Process apache2
 check_command check_nrpe_procs_regex
 __EREG_ARG_ARRAY '/usr/sbin/apache2'
 __USER www-data
}
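
To see how the three pieces fit together: custom variables such as __WARNING in the service become the $_SERVICE_WARNING$ macros in the check command, check_nrpe sends them to the client, and NRPE fills them into $ARG1$ to $ARG4$. This assumes NRPE is set up to accept command arguments (dont_blame_nrpe=1). The sketch below only imitates that last substitution with sed so you can preview the command that ends up running on the client. expand_nrpe_args is a made-up helper, not part of NRPE, and I pass the pattern without its surrounding quotes.

```shell
# NRPE command line as defined on the client.
nrpe_template='/usr/lib64/nagios/plugins/check_procs -w $ARG1$ -c $ARG2$ -u $ARG3$ --ereg-argument-array "$ARG4$"'

# expand_nrpe_args TEMPLATE VALUE...
# Hypothetical helper: replaces $ARG1$, $ARG2$, ... with the given values.
# Values must not contain "|", "&" or "\" because they are pasted into sed.
expand_nrpe_args() {
    line=$1
    shift
    n=1
    for value in "$@"; do
        line=$(printf '%s\n' "$line" | sed "s|\\\$ARG${n}\\\$|${value}|g")
        n=$((n + 1))
    done
    printf '%s\n' "$line"
}

# Values from the apache2 service definition above.
expand_nrpe_args "$nrpe_template" '1:100' '0:200' 'www-data' '/usr/sbin/apache2'
```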

Monitoring free inodes on Linux with Nagios/Adagios

This howto assumes:

  • nrpe is installed and working on the client
  • CentOS 6/7 on both sides
  • Nagios/Adagios server with pynag installed and working

On the server you want to monitor:

Install the check_disk plugin for nrpe:

yum install nagios-plugins-disk

Add the following to /etc/nrpe.d/check_disk_inodes.cfg:

command[check_disk_inodes]=/usr/lib64/nagios/plugins/check_disk -W "$ARG1$" -C "$ARG2$" "$ARG3$"

Restart NRPE (NOTE: Use systemctl if using CentOS 7):

service nrpe restart

On the Nagios server:

Add a check command:

pynag add command command_name="2ks-check_nrpe_disk_inodes" command_line='$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_disk_inodes -a "$_SERVICE_WARNING$" "$_SERVICE_CRITICAL$" "$_SERVICE_OPTIONAL_ARGUMENTS$"'

NOTE: In my case pynag placed the cfg file in /etc/nagios/commands/, but it was not included as a cfg_dir in nagios.cfg. To fix that, run:

pynag config --append cfg_dir=/etc/nagios/commands/

Add the service to the host:

pynag add service service_description="Disk inodes" use="generic-service" host_name="host.domain.com" check_command="2ks-check_nrpe_disk_inodes" __CRITICAL="5%" __WARNING="10%"

Reload nagios (NOTE: Use systemctl if using CentOS 7):

service nagios reload

The check output should now show something like:

DISK OK - free space: / 1613 MB (35% inode=95%): /boot 53 MB (57% inode=99%): /dev/shm 1004 MB (100% inode=99%): /var/spool 8682 MB (53% inode=11%):
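
The inode= figure is the percentage of inodes still free, so /var/spool above is the one to keep an eye on. If you want to pick such mounts out of the plugin output by hand, a small filter like this works on the sample line. low_inode_mounts is a helper I made up for illustration, and it assumes the output format shown above.

```shell
# Sample plugin output, copied from above.
output='DISK OK - free space: / 1613 MB (35% inode=95%): /boot 53 MB (57% inode=99%): /dev/shm 1004 MB (100% inode=99%): /var/spool 8682 MB (53% inode=11%):'

# low_inode_mounts OUTPUT THRESHOLD
# Hypothetical helper: prints mount points whose free-inode percentage
# is below THRESHOLD.
low_inode_mounts() {
    printf '%s\n' "$1" |
        sed 's/^[^:]*: //' |   # drop the "DISK OK - free space: " prefix
        tr ':' '\n' |          # one mount point per line
        awk -v t="$2" '
            match($0, /inode=[0-9]+%/) {
                # extract the digits between "inode=" and "%"
                pct = substr($0, RSTART + 6, RLENGTH - 7) + 0
                if (pct < t) print $1, pct "%"
            }'
}

low_inode_mounts "$output" 20
```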

Creating custom okconfig templates

For this example I have a host (google.com) with HTTP, HTTPS, DNS, and Ping checks.

I’ve customized some of the service checks and want to create a template called “Google Server” from this host and its services.

To do this I will have to combine the config files for all services into a template. Templates are by default stored in /etc/nagios/okconfig/examples/, and have the file extension .cfg-example.

Locate all config files for this host. This can be done in a few ways, but the easiest is probably with pynag:

[root@adagios ~]# pynag list --quiet filename where host_name=google.com and register=1 | sort | uniq
/etc/nagios/okconfig/hosts/default/google.com-dns.cfg
/etc/nagios/okconfig/hosts/default/google.com-host.cfg
/etc/nagios/okconfig/hosts/default/google.com-http.cfg
/etc/nagios/okconfig/hosts/default/google.com-https.cfg
[root@adagios ~]#

I can do this in the Adagios web interface too, by searching for the host, selecting all services and clicking the bulk edit button. I won’t actually be editing anything; bulk edit will just show me all file names.

Next I want to combine all services into a template file: /etc/nagios/okconfig/examples/google-server.cfg-example

[root@adagios ~]# cat /etc/nagios/okconfig/hosts/default/google.com-dns.cfg > /etc/nagios/okconfig/examples/google-server.cfg-example
[root@adagios ~]# cat /etc/nagios/okconfig/hosts/default/google.com-http.cfg >> /etc/nagios/okconfig/examples/google-server.cfg-example
[root@adagios ~]# cat /etc/nagios/okconfig/hosts/default/google.com-https.cfg >> /etc/nagios/okconfig/examples/google-server.cfg-example

NOTE: I didn’t include the host config because I haven’t defined any custom services in it. If there are any services in the host config you would like in the template (chances are there will be), add them to the template file yourself. To see what services are defined in the host config, use: pynag list where object_type=service and register=1 and filename=<location of host config>

[root@adagios ~]# pynag list where object_type=service and register=1 and filename=/etc/nagios/okconfig/hosts/default/google.com-host.cfg
object_type          shortname            filename
--------------------------------------------------------------------------------
service              google.com/Ping      /etc/nagios/okconfig/hosts/default/google.com-host.cfg
service              google.com/test      /etc/nagios/okconfig/hosts/default/google.com-host.cfg
----------2 objects matches search condition------------------------------------
[root@adagios ~]#

As seen above, I added a “test” service to the host through the Adagios web interface, and it was saved in the host config file. The define service {…} part is what you would add to the template config.

[root@adagios ~]# cat /etc/nagios/okconfig/hosts/default/google.com-host.cfg
...
define service {
         service_description            test
         use                            generic-service
         host_name                      google.com
        check_command                 okc-check_dummy
        __EXIT_CODE                   0
        __MESSAGE                     Cool!
}

To prepare the template so it can be applied to other hosts, replace the host name and group name with HOSTNAME and GROUP. Okconfig will substitute HOSTNAME and GROUP with the host (and if any, group) name you specify when adding the template. In my case the hostname was google.com and group was default.

sed -i 's/google\.com/HOSTNAME/g;s/default/GROUP/g' /etc/nagios/okconfig/examples/google-server.cfg-example

Example of how a service definition should change:

Before:

define service {
        host_name               google.com
        contact_groups          default
        service_description     HTTPS google.com
        check_command           okc-check_https
        use                     okc-check_https
        __URI                   /
        __SEARCH_STRING
        __RESPONSE_WARNING      2
        __RESPONSE_CRITICAL     10
        __VIRTUAL_HOST          google.com
        __PORT                  443
}

After:

define service {
        host_name               HOSTNAME
        contact_groups          GROUP
        service_description     HTTPS HOSTNAME
        check_command           okc-check_https
        use                     okc-check_https
        __URI                   /
        __SEARCH_STRING
        __RESPONSE_WARNING      2
        __RESPONSE_CRITICAL     10
        __VIRTUAL_HOST          HOSTNAME
        __PORT                  443
}
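
Before running sed against the real template, it can be worth trying the substitution on a small copy and checking that nothing host-specific is left behind. The sketch below does that in a scratch directory; the file names are only stand-ins. Keep in mind that replacing every "default" is a blunt tool, so review the result if that word appears anywhere other than the group name.

```shell
# Scratch directory so nothing under /etc/nagios is touched.
work=$(mktemp -d)

# A cut-down copy of the HTTPS service from above.
cat > "$work/google.com-https.cfg" <<'EOF'
define service {
        host_name               google.com
        contact_groups          default
        service_description     HTTPS google.com
        __VIRTUAL_HOST          google.com
}
EOF

# Same substitution as above, written to a new file instead of in place.
sed 's/google\.com/HOSTNAME/g;s/default/GROUP/g' \
    "$work/google.com-https.cfg" > "$work/google-server.cfg-example"

# Report any host or group name that survived the substitution.
if grep -n -e 'google\.com' -e 'default' "$work/google-server.cfg-example"; then
    echo "leftover host or group names found" >&2
else
    echo "template is clean"
fi
```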

Next, add the host and select the newly created Google Server template.

I deleted the google.com host, removed all config files, and added it again with only the template I created.

All this could probably be done in one (or a few) pynag copy commands, but I haven’t tested that.

PostgreSQL 9.2 monitoring with Adagios on CentOS 7

On the PostgreSQL server:

Note: You may need to deal with SELinux.

Install some needed perl modules, download the check script and make it executable:

yum install perl-Data-Dumper perl-Digest-MD5 perl-Getopt-Long perl-File-Temp perl-Time-HiRes perl-TimeDate
cd /usr/lib64/nagios/plugins
wget https://raw.githubusercontent.com/bucardo/check_postgres/master/check_postgres.pl
chmod +x check_postgres.pl

Add the following to /usr/lib64/nagios/plugins/check_postgres_stats.sh and make it executable (chmod +x):

#!/bin/bash
DB="$1"
STATS=$(/usr/lib64/nagios/plugins/check_postgres.pl --datadir /var/lib/pgsql/data/ -db "$DB" --action dbstats | sed 's/:/=/g')
echo "OK: Postgres stats collected | $STATS"
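
The sed call is the whole trick here: it turns name:value pairs into name=value, which is the form Nagios expects for performance data after the pipe character. A quick way to see it without a database is to feed it a sample line. The numbers below are made up and the real dbstats output has more fields.

```shell
# Made-up sample in the name:value form the wrapper expects from dbstats.
sample='backends:3 commits:1042 rollbacks:7'

# Same conversion as in the wrapper script: every ":" becomes "=".
stats=$(printf '%s\n' "$sample" | sed 's/:/=/g')

# Everything after " | " is treated as performance data by Nagios.
echo "OK: Postgres stats collected | $stats"
```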

Add the following to /etc/nrpe.d/check_postgres.cfg:

command[check_postgres]=/usr/bin/sudo -u postgres /usr/lib64/nagios/plugins/check_postgres.pl --datadir /var/lib/pgsql/data/ -db '$ARG1$' --action '$ARG2$'
command[check_postgres_w]=/usr/bin/sudo -u postgres "/usr/lib64/nagios/plugins/check_postgres.pl" --datadir /var/lib/pgsql/data/ -db '$ARG1$' --action '$ARG2$' --warning '$ARG3$'
command[check_postgres_wc]=/usr/bin/sudo -u postgres "/usr/lib64/nagios/plugins/check_postgres.pl" --datadir /var/lib/pgsql/data/ -db '$ARG1$' --action '$ARG2$' --warning '$ARG3$' --critical '$ARG4$'
command[check_postgres_stats]=/usr/bin/sudo -u postgres /usr/lib64/nagios/plugins/check_postgres_stats.sh '$ARG1$'

Add the following to /etc/sudoers.d/nrpe using visudo:

visudo -f /etc/sudoers.d/nrpe

Defaults:nrpe !requiretty
nrpe ALL=(postgres) NOPASSWD: /usr/lib64/nagios/plugins/check_postgres.pl
nrpe ALL=(postgres) NOPASSWD: /usr/lib64/nagios/plugins/check_postgres_stats.sh

On the Nagios server:

Create the check commands:

pynag add command command_name="2ks-check_nrpe_postgres" command_line='$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_postgres -a "$_SERVICE_DATABASE$" "$_SERVICE_ACTION$"'
pynag add command command_name="2ks-check_nrpe_postgres_w" command_line='$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_postgres_w -a "$_SERVICE_DATABASE$" "$_SERVICE_ACTION$" "$_SERVICE_WARNING$"'
pynag add command command_name="2ks-check_nrpe_postgres_wc" command_line='$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_postgres_wc -a "$_SERVICE_DATABASE$" "$_SERVICE_ACTION$" "$_SERVICE_WARNING$" "$_SERVICE_CRITICAL$"'
pynag add command command_name="2ks-check_nrpe_postgres_stats" command_line='$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_postgres_stats -a "$_SERVICE_DATABASE$"'
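
A word on quoting, since it is easy to get wrong: Nagios macros like $_SERVICE_DATABASE$ look like shell variables, so the whole command_line value has to sit inside one pair of single quotes, with double quotes around the individual macros. If the single quotes are closed and reopened around a macro, the shell expands it before pynag ever sees it. The sketch below shows the difference with plain printf (no pynag needed); the variable is set to an empty string on purpose to mimic a shell where it has no value.

```shell
# In a normal shell this variable has no value; set it empty explicitly.
_SERVICE_DATABASE=''

# One pair of single quotes around the whole value: the macro stays intact.
good='check_nrpe -a "$_SERVICE_DATABASE$"'

# Single quotes closed and reopened around the macro: the shell sees an
# unquoted $_SERVICE_DATABASE and expands it, so the macro name is lost
# (the exact leftover differs between shells).
bad='check_nrpe -a '$_SERVICE_DATABASE$' end'

printf 'good: %s\n' "$good"
printf 'bad:  %s\n' "$bad"
```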

Create the okconfig template /etc/nagios/okconfig/examples/postgres.cfg-example:

define service {
    use                            okc-linux-check_proc
    __WARNING                      1:
    __NAME                         postgres
    host_name                      HOSTNAME
    service_description            Process postgres
    __CRITICAL                     :20
    check_command                 okc-check_nrpe!check_procs -a $_SERVICE_WARNING$ $_SERVICE_CRITICAL$ $_SERVICE_NAME$
}

define service {
        service_description           PostgreSQL Database connection
         use                            generic-service
         host_name                      HOSTNAME
        check_command                 2ks-check_nrpe_postgres
        __DATABASE                    database_1
        __ACTION                      connection
        notes                         Simply connects and returns version number.
}

define service {
    use                            generic-service
    __DATABASE                     database_1
    check_command                 2ks-check_nrpe_postgres_stats
    host_name                      HOSTNAME
        service_description           PostgreSQL Database statistics
        notes                         Reports information from the pg_stat_database view, and outputs as performance data.
}

define service {
         use                            generic-service
         __DATABASE                     database_1
        check_command                 2ks-check_nrpe_postgres_wc
        __ACTION                      bloat
         host_name                      HOSTNAME
        service_description           PostgreSQL Database bloat
        __CRITICAL                    50%
        __WARNING                     25%
        notes                         Checks the amount of bloat in tables and indexes. Bloat is generally the amount of dead unused space taken up in a table or index. This space is usually reclaimed by use of the VACUUM command.
}

define service {
         use                            generic-service
         __DATABASE                     database_1
        check_command                 2ks-check_nrpe_postgres_wc
        __ACTION                      locks
         host_name                      HOSTNAME
        service_description           PostgreSQL Database locks
        __CRITICAL                    300
        __WARNING                     150
        notes                         Check the total number of locks on one or more databases.
}

define service {
         use                            generic-service
         __DATABASE                     database_1
        check_command                 2ks-check_nrpe_postgres_wc
        __ACTION                      timesync
         host_name                      HOSTNAME
        service_description           PostgreSQL Database timesync
        __CRITICAL                    5
        __WARNING                     2
        notes                         Compares the local system time with the time reported by one or more databases.
}

define service {
         use                            generic-service
         __DATABASE                     database_1
        check_command                 2ks-check_nrpe_postgres_wc
        __ACTION                      last_vacuum
         host_name                      HOSTNAME
        service_description           PostgreSQL Database last vacuum
        __CRITICAL                    7d
        __WARNING                     3d
        notes                         Checks how long it has been since vacuum (or analyze) was last run on each table in one or more databases.
}

define service {
         use                            generic-service
         __DATABASE                     database_1
        check_command                 2ks-check_nrpe_postgres_wc
        __ACTION                      backends
         host_name                      HOSTNAME
        service_description           PostgreSQL Database backends
        __CRITICAL                    95
        __WARNING                     80
        notes                         Checks the current number of connections for one or more databases.
}

define service {
         use                            generic-service
         __DATABASE                     database_1
        check_command                 2ks-check_nrpe_postgres_wc
        __ACTION                      hitratio
         host_name                      HOSTNAME
        service_description           PostgreSQL Database hitratio
        __CRITICAL                    80%
        __WARNING                     90%
        notes                         Checks the hit ratio of all databases and complains when they are too low.
}

define service {
         use                            generic-service
         __DATABASE                     database_1
        check_command                 2ks-check_nrpe_postgres_wc
        __ACTION                      query_time
         host_name                      HOSTNAME
        service_description           PostgreSQL Database query time
        __CRITICAL                    10
        __WARNING                     5
        notes                         Checks the length of running queries on one or more databases.
}

define service {
         use                            generic-service
         __DATABASE                     database_1
        check_command                 2ks-check_nrpe_postgres_wc
        __ACTION                      txn_idle
         host_name                      HOSTNAME
         service_description            PostgreSQL Database connections idle in transaction
        __CRITICAL                    5 for 10 seconds
        __WARNING                     2 for 5 seconds
        notes                         Checks the number and duration of "idle in transaction" queries on one or more databases.
}

define service {
         use                            generic-service
         __DATABASE                     database_1
        check_command                 2ks-check_nrpe_postgres_w
        __ACTION                      disabled_triggers
         host_name                      HOSTNAME
         service_description            PostgreSQL Database disabled triggers
        __WARNING                     1
        notes                         Checks on the number of disabled triggers inside the database. In normal usage having disabled triggers is a dangerous event.
}

define service {
         use                            generic-service
         __DATABASE                     database_1
        check_command                 2ks-check_nrpe_postgres_wc
        __ACTION                      checkpoint
         host_name                      HOSTNAME
        service_description           PostgreSQL Database last checkpoint
        __CRITICAL                    600
        __WARNING                     400
        notes                         Determines how long since the last checkpoint has been run.
}

define service {
         use                            generic-service
         __DATABASE                     database_1
        check_command                 2ks-check_nrpe_postgres_w
        __ACTION                      settings_checksum
         host_name                      HOSTNAME
         service_description            PostgreSQL Database settings checksum
        __WARNING                     c6358648f0d06757a8311709be307f24
        notes                         Checks that all the Postgres settings are the same as last time you checked.
}

define service {
         use                            generic-service
         __DATABASE                     database_1
        __WARNING                     15GB
         check_command                  2ks-check_nrpe_postgres_wc
        __ACTION                      database_size
         host_name                      HOSTNAME
         service_description            PostgreSQL Database size
        __CRITICAL                    30GB
        notes                         Checks the size of all databases and complains when they are too big.
}
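
Most of the blocks above differ only in the action name and the thresholds, so if you maintain several of these templates you could generate them instead of copying and pasting. This is just a sketch of that idea: emit_service is a made-up helper, it only covers the checks that use the _wc command, and the layout is simplified.

```shell
# emit_service ACTION DESCRIPTION WARNING CRITICAL
# Hypothetical helper: prints one service block for the _wc check command.
emit_service() {
    cat <<EOF
define service {
    use                   generic-service
    host_name             HOSTNAME
    service_description   PostgreSQL Database $2
    check_command         2ks-check_nrpe_postgres_wc
    __DATABASE            database_1
    __ACTION              $1
    __WARNING             $3
    __CRITICAL            $4
}

EOF
}

# Two of the checks from the template above.
emit_service locks locks 150 300
emit_service backends backends 80 95
```

Redirect the output to a .cfg-example file under /etc/nagios/okconfig/examples/ to use it as a template.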

Add the template to a host:

okconfig addtemplate db-01.domain.com --template postgres

The values provided in the above configuration are examples. You should change them according to your needs.
Source: https://bucardo.org/check_postgres/check_postgres.pl.html

Varnish 4 monitoring with Adagios on CentOS 7

On the Varnish server:

Install prerequisites:

yum install git automake libtool varnish-libs-devel

Clone the varnish-nagios repo, autogen, configure, and make:

git clone https://github.com/varnish/varnish-nagios.git
cd varnish-nagios
./autogen.sh
./configure
make

Move the check_varnish binary to /usr/lib64/nagios/plugins/ and restore SELinux context:

mv check_varnish /usr/lib64/nagios/plugins/
restorecon /usr/lib64/nagios/plugins/check_varnish

Create the nrpe command and restart nrpe:

echo 'command[check_varnish]=/usr/lib64/nagios/plugins/check_varnish -p "$ARG1$" -w "$ARG2$" -c "$ARG3$"' > /etc/nrpe.d/check_varnish.cfg
systemctl restart nrpe.service

To see if the check works, run:

/usr/lib64/nagios/plugins/check_varnish -p MAIN.sess_dropped -w 0 -c 5
/usr/lib64/nagios/plugins/check_varnish -p MGT.child_panic -w 0 -c 2
/usr/lib64/nagios/plugins/check_varnish -p SMA.Transient.c_fail -c 0
/usr/lib64/nagios/plugins/check_varnish -p ratio -w 20:90 -c 10:98

It should return:

[root@varnish-host ~]# /usr/lib64/nagios/plugins/check_varnish -p MAIN.sess_dropped -w 0 -c 5
VARNISH OK: Sessions dropped for thread (0)|MAIN.sess_dropped=0
[root@varnish-host ~]# /usr/lib64/nagios/plugins/check_varnish -p MGT.child_panic -w 0 -c 2
VARNISH OK: Child process panic (0)|MGT.child_panic=0
[root@varnish-host ~]# /usr/lib64/nagios/plugins/check_varnish -p SMA.Transient.c_fail -c 0
VARNISH OK: Allocator failures (0)|SMA.Transient.c_fail=0
[root@varnish-host ~]# /usr/lib64/nagios/plugins/check_varnish -p ratio -w 20:90 -c 10:98
VARNISH OK: Cache hit ratio (26)|ratio=26
[root@varnish-host ~]#
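
The ratio check uses ranges rather than single numbers: with the usual Nagios range syntax, 20:90 means "warn when the value is outside 20 to 90". The sketch below pulls the value out of the sample output and applies the two ranges by hand, which shows why a hit ratio of 26 comes back OK. It is a simplification that only handles plain min:max ranges with whole numbers, and in_range and classify are names I made up.

```shell
# Sample plugin output, copied from above.
line='VARNISH OK: Cache hit ratio (26)|ratio=26'

# Everything after the pipe is performance data; take the number after "=".
perfdata=${line#*'|'}
value=${perfdata#*=}

# in_range VALUE MIN MAX: succeeds when MIN <= VALUE <= MAX.
in_range() {
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# classify VALUE: applies -c 10:98 first, then -w 20:90.
classify() {
    if ! in_range "$1" 10 98; then
        echo CRITICAL
    elif ! in_range "$1" 20 90; then
        echo WARNING
    else
        echo OK
    fi
}

classify "$value"   # 26 is inside both ranges
classify 95         # inside 10:98 but outside 20:90
classify 5          # outside 10:98
```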

On the Nagios server:

Create a check command:

pynag add command command_name="2ks-check_nrpe_varnish_status" command_line='$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_varnish -a "$_SERVICE_PARAMETER$" "$_SERVICE_WARNING$" "$_SERVICE_CRITICAL$"'

NOTE: In my case pynag placed the cfg file in /etc/nagios/commands/, but it was not included as a cfg_dir in nagios.cfg. To fix that, run:

pynag config --append cfg_dir=/etc/nagios/commands/

Create an okconfig template:

echo 'define service {
    service_description            Varnish: Sessions dropped
    use                            generic-service
    host_name                      HOSTNAME
    __PARAMETER                   MAIN.sess_dropped
    check_command                 2ks-check_nrpe_varnish_status
    __CRITICAL                    5
    __WARNING                     0
    notes                         This counter will show the number of requests that have to be dropped because no more threads were available to handle them.
}
define service {
    service_description            Varnish: Child process panic
    use                            generic-service
    host_name                      HOSTNAME
    __PARAMETER                   MGT.child_panic
    check_command                 2ks-check_nrpe_varnish_status
    __CRITICAL                    2
    __WARNING                     0
    notes                         This counter will count the number of times the child has panicked. The master process will restart the child immediately when it happens, and the cache will be flushed.
}
define service {
    service_description            Varnish: Allocator failures
    use                            generic-service
    host_name                      HOSTNAME
    __PARAMETER                   SMA.Transient.c_fail
    check_command                 2ks-check_nrpe_varnish_status
    __CRITICAL                    0
    __WARNING                     0
    notes                         This counter indicates that the operating system is unable to allocate memory as requested.
}
define service {
    service_description            Varnish: Cache hit ratio
    use                            generic-service
    host_name                      HOSTNAME
    __PARAMETER                   ratio
    check_command                 2ks-check_nrpe_varnish_status
    __CRITICAL                    10:98
    __WARNING                     20:90
}
define service {
    use                            okc-linux-check_proc
    __WARNING                      1:
    __NAME                         varnishd
    host_name                      HOSTNAME
    service_description            Process varnishd
    __CRITICAL                     :10
    check_command                 okc-check_nrpe!check_procs -a $_SERVICE_WARNING$ $_SERVICE_CRITICAL$ $_SERVICE_NAME$

}' > /etc/nagios/okconfig/examples/varnish.cfg-example

Add the template to a host:

okconfig addtemplate www-01.domain.com --template varnish

Reload nagios and run the service checks from the Adagios web interface, and they should be green.

Sources:
https://www.varnish-software.com/blog/blog-sysadmin-monitoring-health-varnish-cache

Nginx 1.6.3 status monitoring with Adagios on CentOS 7

Download check_nginx_status.pl:

cd /usr/lib64/nagios/plugins/
wget https://raw.githubusercontent.com/regilero/check_nginx_status/master/check_nginx_status.pl
chmod +x check_nginx_status.pl

Install prerequisites:

yum install perl-libwww-perl nagios-plugins-perl

Create a check command:

pynag add command command_name="2ks-check_nginx_status" command_line='$USER1$/check_nginx_status.pl -H $HOSTADDRESS$ -p $_SERVICE_PORT$ -s $_SERVICE_SERVER_NAME$ $_SERVICE_OPTIONAL_ARGUMENTS$'

Create an okconfig template:

cd /etc/nagios/okconfig/examples
echo 'define service {
    service_description            Nginx status
    use                            generic-service
    host_name                      HOSTNAME
    check_command                  2ks-check_nginx_status
    __PORT                         80
    __SERVER_NAME                  HOSTNAME
}
define service {
    use                            okc-linux-check_proc
    __WARNING                      1:
    __NAME                         nginx
    host_name                      HOSTNAME
    service_description            Process nginx
    __CRITICAL                     :10
    check_command                 okc-check_nrpe!check_procs -a $_SERVICE_WARNING$ $_SERVICE_CRITICAL$ $_SERVICE_NAME$
}' > nginx.cfg-example

Add the template to a host:

okconfig addtemplate www-01.domain.com --template nginx

Reload nagios and run the service checks from the Adagios web interface, and they should be green.

Troubleshooting

Problem: Running the plugin in nagios causes this error:

NGINX UNKNOWN - unable to write temporary data in :/tmp/123.123.123._check_nginx_status...

Probable causes:

  • You ran the plugin as root, so it saved the *check_nginx_status file in /tmp with root ownership, and nagios can’t overwrite it.
  • SELinux or other ACLs are preventing nagios from writing to /tmp.

Solutions:

Delete the file:

rm -f /tmp/*check_nginx_status

Or reset permissions on /tmp:

chmod 1777 /tmp/
chown root:root /tmp/
restorecon -R /tmp/ #NOTE: SELinux should be in permissive mode when using Adagios, unless you can create rules for it.
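
To find out which of the two causes you are hitting, it helps to look at who owns the plugin's state file before deleting anything. A small sketch, assuming the file name contains check_nginx_status as in the error above. list_state_files is a made-up helper, and the demo runs against a scratch directory rather than /tmp.

```shell
# list_state_files DIR
# Hypothetical helper: shows owner and permissions of the plugin's state
# files in DIR (use /tmp on the Nagios server).
list_state_files() {
    find "$1" -maxdepth 1 -type f -name '*check_nginx_status*' -exec ls -l {} +
}

# Demo with a stand-in file in a scratch directory.
demo=$(mktemp -d)
touch "$demo/example_check_nginx_status"
list_state_files "$demo"
```

If the owner shown is root, the first cause applies and deleting the file is enough.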