Varnish 4 monitoring with Adagios on CentOS 7

On the Varnish server:

Install prerequisites:

yum install git automake libtool varnish-libs-devel
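
Before building, it can help to confirm that the build tools actually ended up on the PATH. This is a minimal sketch: missing_tools is a hypothetical helper name, and the list of commands is an assumption about what autogen.sh and make need.

```shell
# Print each command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed tool list; libtoolize ships with the libtool package.
missing=$(missing_tools git automake libtoolize make)
if [ -n "$missing" ]; then
    echo "missing build tools:" $missing
else
    echo "all build tools found"
fi
```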

Clone the varnish-nagios repository, then generate the build files, configure, and compile:

git clone https://github.com/varnish/varnish-nagios.git
cd varnish-nagios
./autogen.sh
./configure
make

Move the check_varnish binary to /usr/lib64/nagios/plugins/ and restore SELinux context:

mv check_varnish /usr/lib64/nagios/plugins/
restorecon /usr/lib64/nagios/plugins/check_varnish

Create the NRPE command definition and restart NRPE:

echo 'command[check_varnish]=/usr/lib64/nagios/plugins/check_varnish -p "$ARG1$" -w "$ARG2$" -c "$ARG3$"' > /etc/nrpe.d/check_varnish.cfg
systemctl restart nrpe.service
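
The command definition relies on NRPE passing the arguments sent by the Nagios server through as $ARG1$ to $ARG3$. NRPE only does that when dont_blame_nrpe=1 is set in its configuration and the daemon was built with command-argument support. A minimal check, assuming the main config lives at /etc/nagios/nrpe.cfg (both the path and the helper name nrpe_args_enabled are assumptions, adjust them for your install):

```shell
# nrpe_args_enabled FILE: succeed if FILE contains a line setting
# dont_blame_nrpe to 1 (comment lines do not match).
nrpe_args_enabled() {
    grep -Eq '^[[:space:]]*dont_blame_nrpe[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$1" 2>/dev/null
}

cfg=${NRPE_CFG:-/etc/nagios/nrpe.cfg}   # assumed default path
if nrpe_args_enabled "$cfg"; then
    echo "NRPE accepts command arguments"
else
    echo "set dont_blame_nrpe=1 in $cfg and restart nrpe"
fi
```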

To see if the check works, run:

/usr/lib64/nagios/plugins/check_varnish -p MAIN.sess_dropped -w 0 -c 5
/usr/lib64/nagios/plugins/check_varnish -p MGT.child_panic -w 0 -c 2
/usr/lib64/nagios/plugins/check_varnish -p SMA.Transient.c_fail -c 0
/usr/lib64/nagios/plugins/check_varnish -p ratio -w 20:90 -c 10:98

The output should look like this (the values will differ on your system):

[root@varnish-host ~]# /usr/lib64/nagios/plugins/check_varnish -p MAIN.sess_dropped -w 0 -c 5
VARNISH OK: Sessions dropped for thread (0)|MAIN.sess_dropped=0
[root@varnish-host ~]# /usr/lib64/nagios/plugins/check_varnish -p MGT.child_panic -w 0 -c 2
VARNISH OK: Child process panic (0)|MGT.child_panic=0
[root@varnish-host ~]# /usr/lib64/nagios/plugins/check_varnish -p SMA.Transient.c_fail -c 0
VARNISH OK: Allocator failures (0)|SMA.Transient.c_fail=0
[root@varnish-host ~]# /usr/lib64/nagios/plugins/check_varnish -p ratio -w 20:90 -c 10:98
VARNISH OK: Cache hit ratio (26)|ratio=26
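
The ratio check takes min:max ranges: a value outside the range raises the corresponding state. The sketch below mimics that logic for the simple min:max form only, assuming the plugin follows the usual Nagios range convention. in_range and classify are hypothetical helpers, not part of check_varnish.

```shell
# in_range VALUE RANGE: succeed if VALUE lies inside "min:max" (inclusive).
in_range() {
    lo=${2%%:*}
    hi=${2##*:}
    [ "$1" -ge "$lo" ] && [ "$1" -le "$hi" ]
}

# classify VALUE WARN CRIT: print the resulting state.
# The critical range is checked first, so it wins over the warning range.
classify() {
    if ! in_range "$1" "$3"; then
        echo CRITICAL
    elif ! in_range "$1" "$2"; then
        echo WARNING
    else
        echo OK
    fi
}

classify 26 20:90 10:98   # the hit ratio from the sample output
classify 15 20:90 10:98
classify 99 20:90 10:98
```

Under these assumptions a hit ratio of 26 is OK, 15 is a WARNING, and 99 is CRITICAL.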

On the Nagios server:

Create a check command:

pynag add command command_name="2ks-check_nrpe_varnish_status" command_line='$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_varnish -a "$_SERVICE_PARAMETER$" "$_SERVICE_WARNING$" "$_SERVICE_CRITICAL$"'

NOTE: In my case pynag placed the cfg file in /etc/nagios/commands/, but it was not included as a cfg_dir in nagios.cfg. To fix that, run:

pynag config --append cfg_dir=/etc/nagios/commands/
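
To find out whether the directory is already included before appending it, a check along these lines may help. It is only a sketch: has_cfg_dir is a hypothetical helper, and /etc/nagios/nagios.cfg is an assumed location for the main config file.

```shell
# has_cfg_dir FILE DIR: succeed if FILE has a cfg_dir line for DIR,
# with or without a trailing slash.
has_cfg_dir() {
    dir=${2%/}
    grep -Fxq -e "cfg_dir=$dir" -e "cfg_dir=$dir/" "$1" 2>/dev/null
}

cfg=${NAGIOS_CFG:-/etc/nagios/nagios.cfg}   # assumed location
if has_cfg_dir "$cfg" /etc/nagios/commands/; then
    echo "cfg_dir already present"
else
    echo "cfg_dir missing, append it with pynag"
fi
```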

Create an okconfig template:

echo 'define service {
    service_description            Varnish: Sessions dropped
    use                            generic-service
    host_name                      HOSTNAME
    __PARAMETER                   MAIN.sess_dropped
    check_command                 2ks-check_nrpe_varnish_status
    __CRITICAL                    5
    __WARNING                     0
    notes                         This counter shows the number of requests that had to be dropped because no threads were available to handle them.
}
define service {
    service_description            Varnish: Child process panic
    use                            generic-service
    host_name                      HOSTNAME
    __PARAMETER                   MGT.child_panic
    check_command                 2ks-check_nrpe_varnish_status
    __CRITICAL                    2
    __WARNING                     0
    notes                         This counter shows the number of times the child process has panicked. The master process restarts the child immediately when that happens, and the cache is flushed.
}
define service {
    service_description            Varnish: Allocator failures
    use                            generic-service
    host_name                      HOSTNAME
    __PARAMETER                   SMA.Transient.c_fail
    check_command                 2ks-check_nrpe_varnish_status
    __CRITICAL                    0
    __WARNING                     0
    notes                         This counter shows how often the operating system was unable to allocate memory as requested.
}
define service {
    service_description            Varnish: Cache hit ratio
    use                            generic-service
    host_name                      HOSTNAME
    __PARAMETER                   ratio
    check_command                 2ks-check_nrpe_varnish_status
    __CRITICAL                    10:98
    __WARNING                     20:90
}
define service {
    use                            okc-linux-check_proc
    __WARNING                      1:
    __NAME                         varnishd
    host_name                      HOSTNAME
    service_description            Process varnishd
    __CRITICAL                     :10
    check_command                 okc-check_nrpe!check_procs -a $_SERVICE_WARNING$ $_SERVICE_CRITICAL$ $_SERVICE_NAME$

}' > /etc/nagios/okconfig/examples/varnish.cfg-example
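
Each service's custom variables (__PARAMETER, __WARNING, __CRITICAL) fill the $_SERVICE_PARAMETER$, $_SERVICE_WARNING$ and $_SERVICE_CRITICAL$ macros in the check command, and NRPE then maps the three arguments to $ARG1$ through $ARG3$. The sketch below imitates that substitution with sed purely for illustration. expand_macros is a hypothetical helper, the command line is shortened, and Nagios performs the real expansion internally.

```shell
# expand_macros LINE PARAMETER WARNING CRITICAL
# Replace the three custom-variable macros in LINE with the given values.
expand_macros() {
    printf '%s\n' "$1" | sed \
        -e 's|\$_SERVICE_PARAMETER\$|'"$2"'|' \
        -e 's|\$_SERVICE_WARNING\$|'"$3"'|' \
        -e 's|\$_SERVICE_CRITICAL\$|'"$4"'|'
}

# Shortened form of the check command; single quotes keep the macros literal.
line='check_nrpe -c check_varnish -a "$_SERVICE_PARAMETER$" "$_SERVICE_WARNING$" "$_SERVICE_CRITICAL$"'

# Values taken from the "Varnish: Sessions dropped" service definition.
expand_macros "$line" MAIN.sess_dropped 0 5
```

For the "Sessions dropped" service this prints the command with "MAIN.sess_dropped" "0" "5" as its three arguments, which is what the NRPE command on the Varnish server receives.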

Add the template to a host:

okconfig addtemplate www-01.domain.com --template varnish

Reload Nagios and run the service checks from the Adagios web interface; they should all be green.

Sources:
https://www.varnish-software.com/blog/blog-sysadmin-monitoring-health-varnish-cache
