12.4. Monitoring

Monitoring is a generic term, and the activities it covers have several goals. On the one hand, tracking the usage of the resources provided by a machine makes it possible to anticipate saturation and the upgrades that will then be required. On the other hand, alerting the administrator as soon as a service becomes unavailable or stops working properly means that problems can be fixed sooner when an incident occurs.

Munin covers the first area by displaying graphical charts of the values of many parameters (used memory, occupied disk space, processor load, network traffic, web server/database load, and so on). Nagios covers the second area by regularly checking that services are working and available, and by sending alerts to the administrator through the appropriate channels (e-mail, text messages, and so on). Both tools have a modular design, which makes it easy to extend them to monitor specific parameters or services.

ALTERNATIVE Zabbix, an integrated monitoring tool

Although Munin and Nagios are in very common use, they are not the only players in the monitoring field, and each of them only handles half of the task (graphing on one side, alerting on the other). Zabbix, on the other hand, integrates both parts of monitoring; it also has a web interface for configuring the most common aspects. It has grown by leaps and bounds during the last few years, and can now be considered a viable contender. On the monitoring server, you would install zabbix-server-pgsql (or zabbix-server-mysql), possibly together with zabbix-frontend-php to have a web interface. On the hosts to monitor you would install zabbix-agent feeding data back to the server.

→ https://zabbix.com/

ALTERNATIVE Icinga, a Nagios fork

Following a disagreement over the development model chosen for Nagios (which is controlled by a company), a number of developers forked it and named the result Icinga. Icinga remains as compatible as possible with Nagios configurations and plugins, but it also adds extra features.

→ https://icinga.com/

12.4.1. Setting Up Munin

The purpose of Munin is to monitor many machines; it therefore quite naturally uses a client/server architecture. The central host, the grapher, collects data from all the monitored hosts and generates graphs.

12.4.1.1. Configuring Hosts To Monitor

The first step is to install the munin-node package. The daemon installed by this package listens on port 4949 and sends back the data collected by all the active plugins. Each plugin is a simple program that returns a description of the collected data as well as the latest measured value. Plugins are stored in /usr/share/munin/plugins/, but only those with a symbolic link in /etc/munin/plugins/ are actually used.

When the package is installed, a set of active plugins is determined based on the available software and the current configuration of the host. However, this auto-configuration depends on a feature that each plugin must provide, and it is usually a good idea to review and tweak the results by hand. Browsing the Plugin Gallery can be interesting even though not all plugins have comprehensive documentation.

→ https://gallery.munin-monitoring.org

However, all plugins are scripts and most are rather simple and well-commented. Browsing /etc/munin/plugins/ is therefore a good way of getting an idea of what each plugin is about and determining which should be removed. Similarly, enabling an interesting plugin found in /usr/share/munin/plugins/ is a simple matter of setting up a symbolic link with ln -sf /usr/share/munin/plugins/plugin /etc/munin/plugins/. Note that when a plugin name ends with an underscore “_”, the plugin requires a parameter. This parameter must be stored in the name of the symbolic link; for instance, the “if_” plugin must be enabled with an if_eth0 symbolic link, and it will monitor network traffic on the eth0 interface.
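The linking convention described above can be wrapped in a small helper. This is only a sketch: enable_plugin is a hypothetical function, not a tool shipped with Munin, and the two directory variables are overridable so that it can be tried in a scratch directory before touching /etc/munin/plugins/.

```shell
#!/bin/sh
# Directories used by munin-node; overridable for experimentation.
share_dir="${share_dir:-/usr/share/munin/plugins}"
etc_dir="${etc_dir:-/etc/munin/plugins}"

# enable_plugin NAME [PARAMETER]   (hypothetical helper)
# Links NAME into the active directory. For plugins whose name ends
# with "_", PARAMETER is appended to the link name (if_ + eth0 = if_eth0).
enable_plugin() {
    ln -sf "$share_dir/$1" "$etc_dir/$1${2:-}"
}

# Typical calls, to be run as root:
#   enable_plugin df
#   enable_plugin if_ eth0
```

As with links created by hand, munin-node only takes the new plugins into account once it has been restarted.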

Once all plugins are correctly set up, the daemon configuration must be updated to describe access control for the collected data. This involves allow directives in the /etc/munin/munin-node.conf file. The default configuration is allow ^127\.0\.0\.1$, and only allows access to the local host. An administrator will usually add a similar line containing the IP address of the grapher host, then restart the daemon with systemctl restart munin-node.
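As an illustration, the relevant part of /etc/munin/munin-node.conf could look like the following fragment. The address 192.168.0.2 is an assumption standing in for the grapher host; it does not come from the Falcot setup described in this chapter.

```
# /etc/munin/munin-node.conf (excerpt)
# Each allow line is a regular expression matched against the client address.
allow ^127\.0\.0\.1$
# Hypothetical address of the grapher host:
allow ^192\.168\.0\.2$
```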

GOING FURTHER Creating local plugins

Munin does include detailed documentation on how plugins should behave, and how to develop new plugins.

→ https://guide.munin-monitoring.org/en/latest/plugin/writing.html

A plugin is best tested in the same conditions as when it is triggered by munin-node; this can be simulated by running munin-run plugin as root. An optional second parameter given to this command (such as config) is passed to the plugin as a parameter.

When a plugin is invoked with the config parameter, it must describe itself by returning a set of fields:

# munin-run load config
graph_title Load average
graph_args --base 1000 -l 0
graph_vlabel load
graph_scale no
graph_category system
load.label load
graph_info The load average of the machine describes how many processes are in the run-queue (scheduled to run "immediately").
load.info 5 minute load average

The various available fields are described by the “Plugin reference” available as part of the “Munin guide”.

→ https://munin.readthedocs.org/en/latest/reference/plugin.html

When invoked without a parameter, the plugin simply returns the last measured value; for instance, executing sudo munin-run load could return load.value 0.12.

Finally, when a plugin is invoked with the autoconf parameter, it should return “yes” (with a 0 exit status) or “no” (with a 1 exit status) according to whether the plugin should be enabled on this host.
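The three invocation modes described above can be put together in a short shell plugin. This is a minimal sketch, not a plugin shipped with Munin: the falcot_queue name, the queue field and the /var/spool/falcot-queue directory are all assumptions made for illustration, and the exit statuses for autoconf follow the convention stated in this section.

```shell
#!/bin/sh
# Hypothetical local plugin: counts the files waiting in a spool directory.
# The directory is an assumption; override it with the queue_dir variable.
queue_dir="${queue_dir:-/var/spool/falcot-queue}"

falcot_queue() {
    case "${1:-}" in
        autoconf)
            # Tell munin-node whether the plugin makes sense on this host.
            if [ -d "$queue_dir" ]; then
                echo yes
                return 0
            fi
            echo "no (missing $queue_dir)"
            return 1
            ;;
        config)
            # Describe the graph and its single data series.
            echo "graph_title Files in queue"
            echo "graph_vlabel files"
            echo "graph_category system"
            echo "queue.label queued files"
            return 0
            ;;
    esac
    # No parameter: print the latest measured value.
    count=$(find "$queue_dir" -type f 2>/dev/null | wc -l)
    echo "queue.value $((count))"
}

falcot_queue "$@"
```

Once saved as an executable file and linked into /etc/munin/plugins/, such a script can be exercised with munin-run in the same way as the load plugin shown earlier.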

12.4.1.2. Configuring the Grapher

The “grapher” is simply the computer that aggregates the data and generates the corresponding graphs. The required software is in the munin package. The standard configuration runs munin-cron once every 5 minutes. It gathers data from all the hosts listed in /etc/munin/munin.conf (only the local host is listed by default), saves the data in RRD files (Round Robin Database, a file format designed to store data that varies over time) under /var/lib/munin/, and finally generates an HTML page with the graphs in /var/cache/munin/www/.

All monitored machines must therefore be listed in the /etc/munin/munin.conf configuration file. Each machine is listed as a full section with its name and at least an address entry giving the IP address of the machine.

[ftp.falcot.com]
    address 192.168.0.12
    use_node_name yes

Sections can be more complex, and describe extra graphs created by combining data coming from several machines. The samples provided in the configuration file are good starting points for customization.

The last step is to publish the generated pages; this involves configuring a web server so that the contents of /var/cache/munin/www/ are made available on a website. Access to this website can be restricted, using either an authentication mechanism or IP-based access control. See Section 11.2, “Web Server (HTTP)” for the relevant details.
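A minimal sketch of such a publication for Apache 2.4 follows. The /munin URL path and the 192.168.0.0/24 range are assumptions chosen to match the addresses used in this chapter; an existing site may call for a different layout or for password-based authentication instead.

```
# Hypothetical Apache 2.4 fragment publishing the Munin pages
Alias /munin /var/cache/munin/www
<Directory /var/cache/munin/www>
    # IP-based access control: only the local network may read the graphs
    Require ip 192.168.0.0/24
</Directory>
```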

12.4.2. Setting Up Nagios

Unlike Munin, Nagios does not necessarily require installing anything on the monitored hosts; most of the time, Nagios is used to check the availability of network services. For instance, Nagios can connect to a web server and check that a given web page can be obtained within a given time.

12.4.2.1. Installing

The first step in setting up Nagios is to install the nagios4 and monitoring-plugins packages. Installing the packages configures the web interface and the Apache server. The authz_groupfile and auth_digest Apache modules must also be enabled; to do so, run:

# a2enmod authz_groupfile
Considering dependency authz_core for authz_groupfile:
Module authz_core already enabled
Module authz_core already enabled
Enabling module authz_groupfile.
To activate the new configuration, you need to run:
  systemctl restart apache2
# a2enmod auth_digest
Considering dependency authn_core for auth_digest:
Module authn_core already enabled
Enabling module auth_digest.
To activate the new configuration, you need to run:
  systemctl restart apache2
# systemctl restart apache2

Adding other users is a simple matter of inserting them in the /etc/nagios4/hdigest.users file.

Pointing a browser at http://server/nagios4/ displays the web interface; in particular, note that Nagios already monitors some parameters of the machine where it runs. However, some interactive features such as adding comments to a host do not work. These features are disabled in the default configuration for Nagios, which is very restrictive for security reasons.

Enabling some of these features involves editing /etc/nagios4/nagios.cfg, typically by setting its check_external_commands parameter to “1”. Write permissions must also be set up for the directory used by Nagios, with commands such as the following:

# systemctl stop nagios4
# dpkg-statoverride --update --add nagios www-data 2710 /var/lib/nagios4/rw
# dpkg-statoverride --update --add nagios nagios 751 /var/lib/nagios4
# systemctl start nagios4

12.4.2.2. Configuring

The Nagios web interface is rather nice, but it does not allow configuration, nor can it be used to add monitored hosts and services. The whole configuration is managed via files referenced in the central configuration file, /etc/nagios4/nagios.cfg.

These files should not be edited without some understanding of the Nagios concepts. The configuration lists objects of the following types:

a host is a machine to be monitored;
a hostgroup is a set of hosts that should be grouped together for display, or to factor out some common elements of configuration;
a service is a testable element related to a host or a host group. It will most often be a check for a network service, but it can also involve checking that some parameters are within an acceptable range (for instance, free disk space or processor load);
a servicegroup is a set of services that should be grouped together for display;
a contact is a person who can receive alerts;
a contactgroup is a set of such contacts;
a timeperiod is a range of time during which some services have to be checked;
a command is the command line invoked to check a given service.

Depending on its type, each object has a number of properties that can be customized. A full list would be too long to include here, but the most important properties are the relations between the objects.

A service uses a command to check the state of a feature on a host (or a hostgroup) within a timeperiod. In case of a problem, Nagios sends an alert to all members of the contactgroup linked to the service. Each member receives the alert through the channel described in the matching contact object.

An inheritance system allows easy sharing of a set of properties across many objects without duplicating information. Moreover, the initial configuration includes a number of standard objects; in many cases, defining new hosts, services and contacts is a simple matter of deriving from the provided generic objects. The files in /etc/nagios4/conf.d/ are a good source of information on how they work.

The Falcot Corp administrators use the following configuration:

Example 12.5. /etc/nagios4/conf.d/falcot.cfg file

define contact{
    name                            generic-contact
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    w,u,c,r
    host_notification_options       d,u,r
    service_notification_commands   notify-service-by-email
    host_notification_commands      notify-host-by-email
    register                        0 ; Template only
}
define contact{
    use             generic-contact
    contact_name    rhertzog
    alias           Raphael Hertzog
    email           hertzog@debian.org
}
define contact{
    use             generic-contact
    contact_name    rmas
    alias           Roland Mas
    email           lolando@debian.org
}

define contactgroup{
    contactgroup_name     falcot-admins
    alias                 Falcot Administrators
    members               rhertzog,rmas
}

define host{
    use                   generic-host ; Name of host template to use
    host_name             www-host
    alias                 www.falcot.com
    address               192.168.0.5
    contact_groups        falcot-admins
    hostgroups            debian-servers,ssh-servers
}
define host{
    use                   generic-host ; Name of host template to use
    host_name             ftp-host
    alias                 ftp.falcot.com
    address               192.168.0.12
    contact_groups        falcot-admins
    hostgroups            debian-servers,ssh-servers
}

# 'check_ftp' command with custom parameters
define command{
    command_name          check_ftp2
    command_line          /usr/lib/nagios/plugins/check_ftp -H $HOSTADDRESS$ -w 20 -c 30 -t 35
}

# Generic Falcot service
define service{
    name                  falcot-service
    use                   generic-service
    contact_groups        falcot-admins
    register              0
}

# Services to check on www-host
define service{
    use                   falcot-service
    host_name             www-host
    service_description   HTTP
    check_command         check_http
}
define service{
    use                   falcot-service
    host_name             www-host
    service_description   HTTPS
    check_command         check_https
}
define service{
    use                   falcot-service
    host_name             www-host
    service_description   SMTP
    check_command         check_smtp
}

# Services to check on ftp-host
define service{
    use                   falcot-service
    host_name             ftp-host
    service_description   FTP
    check_command         check_ftp2
}

This configuration file describes two monitored hosts. The first one is the web server, and the checks are made on the HTTP (80) and secure-HTTP (443) ports. Nagios also checks that an SMTP server runs on port 25. The second host is the FTP server, and the check includes making sure that a reply comes within 20 seconds. Beyond this delay, a warning is emitted; beyond 30 seconds, the alert is deemed critical. The Nagios web interface also shows that the SSH service is monitored: this comes from the hosts belonging to the ssh-servers hostgroup.
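The two thresholds map onto the exit statuses that Nagios plugins conventionally use to report their result: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The sketch below only illustrates that mapping for a measured delay; classify_delay is a hypothetical helper, not the way check_ftp is actually implemented.

```shell
#!/bin/sh
# Classify a measured response time (in whole seconds) the way a Nagios
# plugin reports it: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
# classify_delay is a hypothetical helper, not part of check_ftp.
classify_delay() {
    delay=$1
    warn=$2
    crit=$3
    case "$delay" in
        ''|*[!0-9]*)
            echo "UNKNOWN - invalid delay '$delay'"
            return 3
            ;;
    esac
    if [ "$delay" -gt "$crit" ]; then
        echo "CRITICAL - reply after ${delay}s"
        return 2
    elif [ "$delay" -gt "$warn" ]; then
        echo "WARNING - reply after ${delay}s"
        return 1
    fi
    echo "OK - reply after ${delay}s"
    return 0
}

# Thresholds taken from the check_ftp2 command above: -w 20 -c 30.
classify_delay 12 20 30
```

Nagios reads the exit status to decide on the state of the service, and displays the first line of output in its web interface and notifications.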

Note the use of inheritance: an object is made to inherit from another object with “use parent-name”. The parent object must be identifiable, which requires giving it a “name identifier” property. If the parent object is not meant to be a real object, but only to serve as a parent, giving it a “register 0” property tells Nagios not to consider it as a real object, so that the lack of some otherwise required parameters does not cause a problem.
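The same mechanism could be used to factor out what the two host definitions have in common. The fragment below is only a sketch: the falcot-host template and the mail-host machine, with its address, are hypothetical additions and not part of the Falcot configuration shown above.

```
# Hypothetical template gathering the settings shared by Falcot hosts
define host{
    name                  falcot-host   ; identifier referenced by "use"
    use                   generic-host
    contact_groups        falcot-admins
    hostgroups            debian-servers,ssh-servers
    register              0             ; template only, not a real host
}

# A real host only needs to state what is specific to it
define host{
    use                   falcot-host
    host_name             mail-host
    alias                 mail.falcot.com
    address               192.168.0.20
}
```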

GOING FURTHER Remote tests with NRPE

Many Nagios plugins allow checking some parameters local to a host; if many machines need these checks while a central installation gathers them, the NRPE (Nagios Remote Plugin Executor) plugin needs to be deployed. The nagios-nrpe-plugin package needs to be installed on the Nagios server, and nagios-nrpe-server on the hosts where local tests need to run. The latter gets its configuration from /etc/nagios/nrpe.cfg. This file should list the tests that can be started remotely, and the IP addresses of the machines allowed to trigger them. On the Nagios side, enabling these remote tests is a simple matter of adding matching services using the new check_nrpe command.
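As a rough sketch, the two sides could be configured as follows. The server address 192.168.0.2, the check_load thresholds and the falcot_nrpe command name are assumptions made for illustration; the packaged configuration may already provide a suitable check_nrpe command definition, in which case the custom one is unnecessary.

```
# On the monitored host: /etc/nagios/nrpe.cfg (excerpt)
# Machines allowed to trigger tests (hypothetical Nagios server address)
allowed_hosts=127.0.0.1,192.168.0.2
# Tests that can be started remotely, identified by name
command[check_load]=/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6

# On the Nagios server: additions to the Falcot configuration
define command{
    command_name          falcot_nrpe
    command_line          /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
define service{
    use                   falcot-service
    host_name             ftp-host
    service_description   Load
    check_command         falcot_nrpe!check_load
}
```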