System Monitoring with Xymon/Administration Guide

From Wikibooks, open books for an open world

Everything related to system administration is documented here.

Design Overview

The following is copied from http://www.hswn.dk/hobbiton/2006/11/msg00315.html

I don't have any formal design documents; the Hobbit design has evolved from a few basic principles and some ideas I had. I'll try to give you a quick summary.

Hobbit should be portable across Unix architectures. This is obviously important for the Hobbit client code, but I've done my best to use only "standard" Unix and C for both the server- and client-code. This has turned out to be easier than I thought; there are some really old Unix systems that cannot run the Hobbit server, but any recent version of a Unix-like system (released within the past 5-10 years) runs Hobbit without problems.

Hobbit also has to be backwards compatible with Big Brother version 1.9c, in the sense that you can use BB clients with a Hobbit server. I had over 1000 servers with a BB client on them (Unix and Windows), so doing a "Big Bang" switch changing the servers and clients at once would be impossible.

Hobbit must scale well. When I started using Big Brother I had 40 servers to monitor. When I got to 100, I had to re-implement parts of BB and this became the "bbgen" toolkit. When I got to 500, my BB server was getting overloaded, and I started to work on a replacement. Today I have 2500 servers, and before Christmas I will be monitoring nearly 4000 servers. That's a 100x increase in just 5 years.

Hobbit should not rely on a lot of other infrastructure to work. E.g. it doesn't require a huge database backend; you can add one if you like, but it is not needed. Keeping Hobbit simple makes it robust - and you really, REALLY want your monitoring to work when everything else is crashing. We once had a major power outage at our datacenter; the Hobbit server came up quickly, but there were a lot of systems that needed some manual intervention to get back online. It was quite interesting to see how the activity on the Hobbit server just sky-rocketed, because everyone was looking at Hobbit to see which systems were running, and which were down.

Analyzing data

For me, a key principle of handling the data that is poured into Hobbit is that as much as possible of the data analysis should take place on the Hobbit server. I believe it is a huge benefit to keep configuration settings in one place (the Hobbit server); also, by having access to the raw data you can perform types of analysis that you didn't think of when a script was initially created to collect some data.

E.g. the Hobbit client reports data about who is logged on to a system. This information is not currently used by Hobbit, but I know that someone wrote a custom backend utility to check this data and alert him if there was someone logged in as "root". I had not thought of this, but by making the raw data from the client available, it was very easy for him to implement this check on all of his servers.
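
As an illustration of this kind of server-side check, here is a minimal Python sketch that scans raw `who` output for root logins. It is not the utility mentioned above; the function names, the sample data, and the choice of "red" for a root login are all assumptions made for this example.

```python
def root_sessions(who_output):
    """Return the lines of raw `who` output that belong to user root."""
    hits = []
    for line in who_output.splitlines():
        fields = line.split()
        # The first column of `who` output is the login name.
        if fields and fields[0] == "root":
            hits.append(line.strip())
    return hits


def login_status(who_output):
    """Map the raw data to a status colour: red if root is logged in."""
    return "red" if root_sessions(who_output) else "green"


# Hypothetical sample of what a client might report.
sample_who = """\
alice    pts/0        2006-11-20 09:12 (10.0.0.5)
root     pts/1        2006-11-20 09:30 (10.0.0.9)
"""
```

Because the check runs against the raw client data on the server, it can be added for every monitored host without touching any client.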

This is quite different from Big Brother, where the data is pre-processed into status messages. The BB client can check if a process is running, but then it just reports "process foo is OK". When process "foo" is NOT running you get an error status - but you cannot easily see if process "foo" has stopped because your backup is running at the same time (and they cannot coexist) because you only get part of the information, not the full process listing.

This may seem like a trivial example, but I realized early on that there are far more ways of using these data than I could possibly imagine. So instead of forcing my ideas of how to use the data upon others, it should be possible to just get the raw data and perform your own analysis of it.

Another example is that in Hobbit 4.2, I added a module which saves a copy of the client data if a status goes red on a host. This has turned out to be extremely helpful in diagnosing those "why did the webserver crash at 4 AM last Tuesday" questions ... because you have access to a lot of raw data collected by the client just before the crash happened, including all of the data that Hobbit didn't analyze by itself but which humans can use to put the whole picture together.

This is not implemented completely yet. The network test utility - which was also carried over from the bbgen toolkit - works the "Big Brother" way. One thing on my agenda is to change that, so the network tester just reports that the ping of host "foo" responded in 12.7 ms, the ping of host "bar" failed and so on. Then a module on the Hobbit server can decide if these should result in a red or yellow status, perhaps based on other information it has (e.g. that the response time shouldn't exceed 10 ms during working hours, unless the primary network connection was down so we were running on a backup line with less capacity).
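
The decision described here could be sketched as follows in Python. This is only an illustration of the idea, not code from Hobbit: the function name, the 08:00-17:00 definition of working hours, and the colour chosen for each case are assumptions; the 10 ms limit and the backup-line exception come from the text.

```python
from datetime import time


def ping_colour(rtt_ms, now, backup_line=False, limit_ms=10.0):
    """Decide a status colour from a raw ping result.

    rtt_ms is the round-trip time in milliseconds, or None if the ping failed.
    now is a datetime.time giving the local time of the test.
    """
    if rtt_ms is None:
        return "red"                      # no response at all
    # Assumption for this sketch: working hours are 08:00 to 17:00.
    working_hours = time(8, 0) <= now < time(17, 0)
    if working_hours and not backup_line and rtt_ms > limit_ms:
        return "yellow"                   # too slow for normal operation
    return "green"
```

With these assumptions, the 12.7 ms response from host "foo" is yellow at 10:00 on the primary line, but green at night or while the backup line is in use.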


The core daemons

I wanted to have a network daemon holding all of the "current state" information. This information changes all the time as new status reports arrive, so it has to keep this in memory - writing it to disk would be too slow (BB did this, and it doesn't scale). So a core component of Hobbit would be this central daemon (hobbitd). The daemon NEVER does any disk I/O; this would slow it down and I don't want that, because Hobbit must support monitoring of thousands of servers. All communication between hobbitd and the outside world goes via a network connection; this is used both for in-band data (status updates and data messages), but also for out-of-band data like control messages (drop a host, disable a server and so on). Tools that need to fetch the entire status of all servers, or just the detailed status of a host also do this through a network connection to hobbitd.

However, some things must be stored on disk - RRD (graph) files, for instance, or historical eventlogs. So this is handled by a bunch of independent "worker" modules - hobbitd_rrd (RRD updates), hobbitd_history (history logs), hobbitd_alert (sending out alerts). These obviously have to be fed information about the data that flows into the hobbitd daemon - e.g. hobbitd_rrd needs the full status message to extract the data it puts into the RRD files, and hobbitd_history needs information about the status changes from green->red and so on.

So I needed a fast inter-process communication mechanism between hobbitd and the worker modules. Also, I wanted to be able to start/stop/restart worker modules on-the-fly; this is extremely nice for testing and makes the system much more robust. Finally, I wanted an interface that was simple to use so that end-users can hook into the data stream if they need to write some custom back-end script.

The solution for this was a mechanism that uses the System V "shared memory" IPC mechanism, combined with a group of semaphores to control access to the shared memory area. So hobbitd copies a message into the shared memory area and ups a semaphore telling the workers that there is a new message. The workers then pick up the message and down another semaphore once they have secured their copy of the message; hobbitd then knows when it is safe to overwrite the shared-memory area with a new message.

I call this IPC mechanism a "channel", and there are in fact several of these: one for each type of message. So there is a channel which receives all of the raw "status" messages; another channel for the raw "data" messages; a channel that receives messages about status changes (for history logging); a channel that receives messages about critical red/yellow statuses (for alerts) and so on. Recently a new channel was added for the "client" messages that come from the Hobbit client.
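
The handshake between hobbitd and the workers can be pictured with a small Python sketch. It is a simplified model, not Hobbit's implementation: threads stand in for separate processes, an ordinary attribute stands in for the System V shared-memory area, and the acknowledgement is counted upwards because Python semaphores cannot wait for zero. All names in it (`Channel`, `publish`, `receive`, `run_demo`) are made up for this example.

```python
import threading


class Channel:
    """One-slot message area guarded by semaphores."""

    def __init__(self, n_workers):
        self.n_workers = n_workers
        self.slot = None  # stands in for the shared-memory area
        # One "new message" semaphore per worker, so each worker is
        # woken exactly once per message.
        self.ready = [threading.Semaphore(0) for _ in range(n_workers)]
        # Counts how many workers have secured their copy.
        self.copied = threading.Semaphore(0)

    def publish(self, message):
        """Producer side: fill the slot, wake the workers, wait for all copies."""
        self.slot = message
        for sem in self.ready:
            sem.release()              # signal: a new message is available
        for _ in range(self.n_workers):
            self.copied.acquire()      # returns once every worker has copied

    def receive(self, worker_id):
        """Worker side: wait for a message, copy it, acknowledge."""
        self.ready[worker_id].acquire()
        message = self.slot            # secure a private copy
        self.copied.release()          # the slot may now be overwritten
        return message


def run_demo(messages, n_workers=2):
    """Push messages through a channel and return what each worker received."""
    channel = Channel(n_workers)
    received = [[] for _ in range(n_workers)]

    def worker(worker_id):
        while True:
            msg = channel.receive(worker_id)
            if msg is None:            # None is this sketch's shutdown marker
                break
            received[worker_id].append(msg)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for msg in messages:
        channel.publish(msg)
    channel.publish(None)
    for t in threads:
        t.join()
    return received
```

The point of the sketch is the ordering guarantee: `publish` does not return, and so the slot is never overwritten, until every worker has taken its copy of the current message.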

There are some early notes about this mechanism in the hobbitd/new-daemon.txt file in the hobbit sources. Not all of the ideas there have been implemented, e.g. the "streaming" protocol turned out not to be particularly important.

To make sure that the semaphore stuff is handled correctly, I decided to put a "buffer" module between hobbitd and the actual workers. This is the hobbitd_channel module; it serves only one purpose, which is to grab the messages that hobbitd sends out through the IPC mechanism, and queue them for the real worker module (hobbitd_rrd, hobbitd_history etc). The fact that hobbitd_channel acts as a message queue is useful to accommodate spikes in the activity; the alert module, for instance, sometimes gets a huge spike of messages when a network switch dies. hobbitd_channel also makes it easy to build your own backend modules, because it forwards the messages via a simple text-based pipe; so your custom backend modules can just read them from stdin.
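
A custom backend module of this kind could look roughly like the following Python sketch. The framing it parses is an assumption made for illustration: a header line starting with "@@" whose fields are separated by "|", followed by the message text, and closed by a line containing only "@@". The field layout in the sample is invented; consult the hobbitd_channel documentation for the real format before relying on it.

```python
def read_messages(lines):
    """Yield (header_fields, body_lines) for each framed message.

    `lines` is any iterable of text lines, e.g. sys.stdin in a real worker.
    Assumed framing (see the note above): "@@header|fields..." opens a
    message and a line holding just "@@" closes it.
    """
    header, body = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "@@":
            if header is not None:
                yield header, body
            header, body = None, []
        elif line.startswith("@@"):
            # Start of a new message; split the header into its fields.
            header, body = line[2:].split("|"), []
        elif header is not None:
            body.append(line)


# Hypothetical sample input; the field order here is illustrative only.
sample_stream = [
    "@@status#1|1164000000.0|10.0.0.5||www.test.com|http|green\n",
    "status www,test,com.http green OK\n",
    "@@\n",
]
```

A real worker would pass `sys.stdin` to `read_messages` and act on each message as it arrives, leaving all of the semaphore handling to hobbitd_channel.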

Another benefit of having hobbitd_channel between hobbitd and the worker modules showed up recently; I am currently working on a new version of hobbitd_channel which can distribute the incoming messages between multiple worker "clones" running on different servers, to perform some load balancing of the heavy tasks (primarily RRD file updates). This has been implemented almost exclusively by changing hobbitd_channel, instead of having to modify all of the worker modules.

So the core design looks like this:


   Network tests --\
                    \
                     \         TCP:1984            IPC (shared memory)                     stdin
                      Clients ----------> hobbitd --------------------> hobbitd_channel ----------> worker modules
                     /
                    /
   Custom tests ---/

Xymon Protocol

The Web interface

The web interface is mostly carried over from Hobbit's predecessor, the "bbgen" toolkit. I wrote this for Big Brother, to speed up the generation of the Big Brother webpages, and by re-using this in Hobbit I would quickly get a working web interface - all I had to do was to change the programs to grab their data from the hobbitd daemon, instead of reading through the status logfiles that Big Brother uses.

This also means that the web interface is not tied in with the core daemons. Sure, they need to communicate and there are some things in the core daemons that are closely related to how the web interface works - e.g. disabling a host. But it should be possible for an adventurous programmer to use the core Hobbit daemons with their own web front-end tools and come up with a completely different user-interface.

So the web interface is probably the part of Hobbit that has evolved the least from its origins in Big Brother. Some new CGI programs have been added, but nothing revolutionary - it just picks up bits of information from hobbitd and the configuration files and displays them.

One design criterion for the web interface is that it should be as dynamic as possible; it must reflect the current status and configuration as much as possible. That is why most of the web interface is done with CGI programs; the only static webpages in Hobbit are the overview pages generated by bbgen - and I hope to eliminate those soon.

The clients

So with this background, it is obvious that the Hobbit client is really, really dumb. It is basically just a shell script that runs some normal OS commands - df, ps, who and so on - and then it's up to the Hobbit server to analyze them and generate some status columns. Client data is sent to hobbitd, which feeds it through a channel to the hobbitd_client module. hobbitd_client has some parsers for each of the operating systems it knows about, and uses those to grab the interesting data and compare it to the client configuration rules. Then hobbitd_client generates some "status" messages and sends them to hobbitd. The major challenge with this design is logfiles; you cannot realistically send entire logfiles - some of them are several GB of data - over to Hobbit for analysis every 5 minutes. So some filtering must be done on the client side; to keep all of the configuration data on the Hobbit server this meant that the client has to pick up its filter-configuration from the Hobbit server.
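
To make the division of labour concrete, here is a minimal Python sketch of the server-side half for disk usage: the client only ships raw `df` output, and the server turns it into status colours. It is not hobbitd_client's parser; the function name, the default thresholds, and the sample output are assumptions, and it only handles the common Linux `df` column layout.

```python
def disk_statuses(df_output, warn=90, panic=95):
    """Turn raw `df` output into {mount point: colour}."""
    statuses = {}
    for line in df_output.splitlines()[1:]:       # skip the header row
        fields = line.split()
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue                              # not a line this sketch understands
        used = int(fields[4].rstrip("%"))
        mount = fields[5]
        if used >= panic:
            statuses[mount] = "red"
        elif used >= warn:
            statuses[mount] = "yellow"
        else:
            statuses[mount] = "green"
    return statuses


# Hypothetical raw data as a Linux client might send it.
sample_df = """\
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/sda1       10321208  4128484   5668436  43% /
/dev/sda2       20642428 18990000   1000000  92% /var
/dev/sda3        5160576  5000000     50000  99% /tmp
"""
```

Changing a threshold then means editing one rule on the server rather than touching the script on every client.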


I hope this is enough of an overview for you. Good luck with your thesis.


Regards, Henrik

Architecture of a Hobbit System Monitoring Environment

TBC

Picking an OS for Hobbit Server

These are some notes and advice from Hobbit users.

Linux

Oracle Solaris 10

Pros

Cons

  • Minus 1: The open source software that Hobbit depends on does not come with Oracle Solaris by default. The following three sources provide the software in binary or source code format:
  1. http://www.blastwave.org
  2. http://www.sunfreeware.com has lots of open source.
  3. http://www.thewrittenword.com

List of software required to meet all dependencies, in order of installation:

  1. common-1.4.5-SunOS5.8-sparc-CSW.pkg.gz
  2. pcre-4.5-SunOS5.8-sparc-CSW.pkg.gz
  3. fping-2.4,REV=2004.10.12_rev=b2_to_ipv6-SunOS5.8-sparc-CSW.pkg.gz
  4. zlib-1.2.3,REV=2007.05.12-SunOS5.8-sparc-CSW.pkg.gz
  5. png-1.2.18-SunOS5.8-sparc-CSW.pkg.gz
  6. libiconv-1.9.2-SunOS5.8-sparc-CSW.pkg.gz
  7. expat-1.95.7-SunOS5.8-sparc-CSW.pkg.gz
  8. ggettext-0.14.1,REV=2005.06.29-SunOS5.8-sparc-CSW.pkg.gz
  9. libpopt-1.7,REV=2004.05.15-SunOS5.8-sparc-CSW.pkg.gz
  10. chkconfig-1.2.24h,REV=2006.12.12-SunOS5.8-sparc-CSW.pkg.gz
  11. libpopt-1.7,REV=2004.05.15-SunOS5.8-sparc-CSW.pkg.gz
  12. openssl-0.9.8,REV=2007.05.10_rev=e-SunOS5.8-sparc-CSW.pkg.gz
  13. imaprt-2004,REV=2006.09.02_rev=g-SunOS5.8-sparc-CSW.pkg.gz
  14. freetype2-2.1.10,REV=2005.12.11-SunOS5.8-sparc-CSW.pkg.gz
  15. libart-2.3.16-SunOS5.8-sparc-CSW.pkg.gz
  16. berkeleydb44-4.4.20,REV=2007.01.27-SunOS5.8-sparc-CSW.pkg.gz
  17. ncurses-5.5,REV=2006.02.10-SunOS5.8-sparc-CSW.pkg.gz
  18. readline-5.0,REV=2005.06.07-SunOS5.8-sparc-CSW.pkg.gz
  19. gbc-1.06-SunOS5.8-sparc-CSW.pkg.gz
  20. gdbm-1.8.3,REV=2006.01.01-SunOS5.8-sparc-CSW.pkg.gz
  21. perl-5.8.8,REV=2007.03.16-SunOS5.8-sparc-CSW.pkg.gz
  22. cvs-1.11.22-sol10-sparc-local.gz
  23. rrdtool-1.2.19,REV=2007.02.07-SunOS5.8-sparc-CSW.pkg.gz
  24. libnet-1.0.2,REV=2004.04.08_rev=a-SunOS5.8-sparc-CSW.pkg.gz
  25. berkeleydb4-4.2.52,REV=2005.04.28_rev=p4-SunOS5.8-sparc-CSW.pkg.gz
  26. sasl-2.1.22,REV=2007.06.19-SunOS5.8-sparc-CSW.pkg.gz
  27. openldap_rt-2.3.35,REV=2007.04.14-SunOS5.8-sparc-CSW.pkg.gz
  28. hobbit-4.2.0,REV=2007.04.12-SunOS5.8-sparc-CSW.pkg.gz
  29. hobbit_client-4.2.0,REV=2007.04.12-SunOS5.8-sparc-CSW.pkg.gz

Notes

  1. To avoid the "hobbitd status-board not available" error message on the bbgen web pages, add "set ip:do_tcp_fusion = 0x0" to /etc/system to disable TCP fusion.
    1. Reference: http://www.hswn.dk/hobbiton/2007/04/msg00187.html
    2. Solaris 5.10 kernel patch 120011-14-1 fixes this bug: "6449337 kmem exhaustion caused by tcp fusion flow control logic error".

Hobbit Server: Solaris Intel 11/06 U3 VMware appliance on a 2GB flash pen drive

The main steps for building this to-go Hobbit server are as follows.

  • Use VMware Server 1.0.1 to create a Solaris 10 VMware session.
  • Create a 1.9G partition and select the custom install.
  • Modify the partition table to take out /export/home, leaving only /swap and /.
  • Decrease the default 512M swap size to 300M.
  • Select "Core group" (about 573M in size).
  • Install the httpd server.
  • Install the Hobbit server.

Hobbit Server and Development: Solaris Intel 11/06 U3 VMware appliance on a 4GB flash pen drive

  • Use VMware Server 1.0.1 to create a Solaris 10 VMware session.
    • Use VMware Player 1.0.3 so that DHCP works.

Hobbit Server Test site

  • Solaris Intel 11/06 U3 VMware appliance on a 4GB flash pen drive

Operational differences between Hobbit and BB BTF

Servers

This table compares how administration tasks are performed on a Hobbit server and on a BB server.

Operation         | Hobbit 4.2.0 and above                    | Big Brother BTF (Better Than Free, 1.9c and above)
start/stop server | ~/hobbit.sh start/stop                    | ~/runbb.sh start/stop
Delete a host     | ~/bin/bb 127.0.0.1 "drop HOSTNAME [test]" | $BBHOME/bin/bbrm
Add a host        | 1. add hostnames into bb-hosts            | 1. add hostnames into bb-hosts
Log data path     | 1.                                        | 1.

Clients

This table compares how administration tasks are performed on Hobbit clients and on BB clients.

Operation              | Hobbit 4.2.0 and above              | Big Brother BTF (Better Than Free, 1.9c and above)
Add an external module | ~hobbit/client/etc/hobbitclient.cfg | $BBHOME/etc/bb-extab

References

Capacity Planning

As a rule of thumb, allow 5 MB of disk space on the Xymon server for each machine being monitored.
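
Applied as arithmetic, the rule of thumb looks like this. The helper name is made up for the example, and the result is only as good as the 5 MB estimate itself.

```python
MB_PER_HOST = 5  # rule-of-thumb figure from the text


def disk_needed_mb(hosts, mb_per_host=MB_PER_HOST):
    """Estimate the disk space (in MB) needed to monitor `hosts` machines."""
    return hosts * mb_per_host
```

For the 4000 servers mentioned in the design overview, `disk_needed_mb(4000)` gives 20000 MB, i.e. roughly 20 GB.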

Installation

Windows

Client

  • Run the BBWin 0.12 installer.
  • Under HKEY_LOCAL_MACHINE\SOFTWARE\BBWin in the registry, set the computer name (as it appears in the bb-hosts file).
  • Make the top of the config file in C:\Program Files\BBWin\etc (or C:\Program Files (x86)\BBWin\etc on Windows x64 systems) look like this:
<setting name="bbdisplay" value="xymon-server" />

<!-- bbwin mode local or central -->
<setting name="mode" value="central" />
<setting name="configclass" value="win32" />
  • Start the service.
  • On the Xymon server, edit /home/xymon/server/etc/hobbit-clients.cfg and add:
#Hostname entries from bbwin clients.
#
HOST=[[new host name, as it appears in the bb-hosts file]]
        LOAD 65 75       # Load thresholds are in %
        DISK C 80 90
        DISK D 90 95
        MEMPHYS 75 101
        MEMSWAP 75 85
        MEMACT  75 85
        PROC BBWin.exe 1 1

Server

  • /hobbit/server/etc/client-local.cfg:
[win32]
eventlog:Security
ignore Success
eventlog:System
ignore Information
eventlog:Application
ignore Information
  • filtering in: /hobbit/server/etc/hobbit-clients.cfg
CLASS=win32
        LOAD 80 90 # Load thresholds are in %
        PROC BBWin.exe 1 1
        PORT STATE=LISTENING MIN=0 TRACK=Listen TEXT=Listen
        LOG %.*  %error -.* COLOR=yellow
        LOG eventlog:Security  %failure.* COLOR=yellow
        LOG eventlog:Application  %warning.* COLOR=yellow
        LOG eventlog:System  %error.* COLOR=yellow
  • Alternatively, you can use the following, but then every update to the event log is sent to the Xymon server instead of being filtered locally first.
CLASS=win32
        LOAD 80 90 # Load thresholds are in %
        PROC BBWin.exe 1 1
        PORT STATE=LISTENING MIN=0 TRACK=Listen TEXT=Listen
        LOG %.*  %^error.* COLOR=red #IGNORE=TermServDevices \(
        LOG %.*  %^warning.* COLOR=yellow IGNORE=%.*TermServDevices.*
        LOG %.*  %^failure.* COLOR=yellow
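
The effect of LOG rules like these can be approximated with a short Python sketch. It is only an illustration, not Xymon's matching code: it assumes that each rule pairs a regular expression with a colour and an optional ignore pattern, that the worst matching colour wins, and that matching is case-sensitive. The rule table and sample lines are invented for the example.

```python
import re

SEVERITY = {"green": 0, "yellow": 1, "red": 2}

# (pattern, colour, ignore pattern or None), modelled on the rules above.
RULES = [
    (r"^error.*", "red", None),
    (r"^warning.*", "yellow", r".*TermServDevices.*"),
    (r"^failure.*", "yellow", None),
]


def classify(line, rules=RULES):
    """Return the worst colour that any rule assigns to one log line."""
    worst = "green"
    for pattern, colour, ignore in rules:
        if ignore and re.search(ignore, line):
            continue                      # rule is suppressed for this line
        if re.search(pattern, line) and SEVERITY[colour] > SEVERITY[worst]:
            worst = colour
    return worst
```

Under these assumptions a line such as "warning - TermServDevices (printer)" stays green, while "error - disk controller" turns the status red.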
 

Unix-like

Client

bash-2.05b# ls -lrt
-r-xr-xr-x    1 root     administ     2891 Aug  9  2006 hobbitclient.sh
-r-xr-xr-x    1 root     administ     3033 Aug  9  2006 hobbitclient-sunos.sh
-r-xr-xr-x    1 root     administ     1841 Aug  9  2006 hobbitclient-sco_sv.sh
-r-xr-xr-x    1 root     administ     1701 Aug  9  2006 hobbitclient-osf1.sh
-r-xr-xr-x    1 root     administ     1904 Aug  9  2006 hobbitclient-openbsd.sh
-r-xr-xr-x    1 root     administ     1907 Aug  9  2006 hobbitclient-netbsd.sh
-r-xr-xr-x    1 root     administ     2512 Aug  9  2006 hobbitclient-linux.sh
-r-xr-xr-x    1 root     administ     1834 Aug  9  2006 hobbitclient-irix.sh
-r-xr-xr-x    1 root     administ     2070 Aug  9  2006 hobbitclient-hp-ux.sh
-r-xr-xr-x    1 root     administ     2039 Aug  9  2006 hobbitclient-freebsd.sh
-r-xr-xr-x    1 root     administ     1554 Aug  9  2006 hobbitclient-darwin.sh
-r-xr-xr-x    1 root     administ     1971 Aug  9  2006 hobbitclient-aix.sh
-rwxr-xr-x    1 root     root       832531 Feb 16 16:51 bb
-rwxr-xr-x    1 root     root       695294 Feb 16 16:51 hobbitlaunch
-rwxr-xr-x    1 root     root       676992 Feb 16 16:52 bbcmd
-rwxr-xr-x    1 root     root       842123 Feb 16 16:52 bbhostgrep
-rwxr-xr-x    1 root     root       670898 Feb 16 16:52 bbhostshow
-rwxr-xr-x    1 root     root       716800 Feb 16 16:52 bbdigest
-rwxr-xr-x    1 root     root       944795 Feb 16 16:53 logfetch
-rwxr-xr-x    1 root     root       839071 Feb 16 16:53 clientupdate
-rwxr-xr-x    1 root     root       830390 Feb 16 16:53 orcahobbit
-rwxr-xr-x    1 root     root       698541 Feb 16 16:53 msgcache
bash-2.05b# ./bb
Hobbit version 4.2.0
Usage: ./bb [--debug] [--proxy=http://ip.of.the.proxy:port/] RECIPIENT DATA
  RECIPIENT: IP-address, hostname or URL
  DATA: Message to send, or "-" to read from stdin
bash-2.05b# uname -a
Linux LKG7BFA96 2.4.22-xfs #1 Sun Jun 12 21:17:17 PDT 2005 armv5b unknown
bash-2.05b# date
Sat Feb 17 11:45:50 CST 2007
bash-2.05b#

Server

Building from package source using TWW HPMS

The TWW Hyper Package Management System helps software developers and system administrators create native packages in different formats for different operating systems. The package sources for compiling and packaging the Hobbit client and server software are in XML format, so the build can be repeated reliably with TWW's sb and pb tools.

The GPL-licensed package sources for the Hobbit server and Hobbit client are available on TWW's support FTP server.

Building from src RPM

Sometimes it's better to build your own RPMs specifically for your environment. If you are using RH Enterprise or CentOS, the Fedora Core or generic RPM may not install correctly. You could also run into this problem if you have versions of dependent libraries that are not compatible with the system that the RPM was built on.

In order to build the src RPM, you'll need several packages:

  1. openssl-devel, openldap-devel, and pcre-devel from the CentOS CDs.
    • You may also have to make a link from /usr/include/pcre/pcre.h to /usr/include/pcre.h
  2. rrdtool-devel
  3. fping

RPMs from a matching version of RH EL usually work on CentOS with no problem (for example, RPMs for EL 4 work fine on CentOS 4).

Once you have all the dependencies installed, download the src RPM from SourceForge. Once you have that, just run rpmbuild --rebuild hobbit-xxxx.src.rpm. For example:

rpmbuild --rebuild hobbit-4.1.0-1.src.rpm

The rpmbuild command should compile and build the RPM for you. You can watch the compiler output for any problems. After it is done, you should have new RPMs in the /usr/src/redhat/RPMS/i386 directory (assuming your architecture is i386). This process will build both server and client RPMs for your system. The server RPM also includes the client, so it is not necessary to install both of them.

Ubuntu

With Synaptic, install the PCRE and RRDtool libraries[1]. Then, download xymon and unpack it.

Launch a terminal (Ctrl + Alt + T) and enter the commands below to install the software in your HTTP directory. Example with Apache:

$ sudo adduser xymon
$ cd /home/Desktop/xymon
$ ./configure.server
[...]
Where do you want the Xymon installation [/home/xymon] ? /var/www/xymon
[...]
What group-ID does your webserver use [nobody] ? xymon
[...]
$ make
[...]
Now run 'make install' as root
$ sudo make install
[...]
Installation complete.
 
You must configure your webserver for the Xymon webpages and CGI-scripts.
A sample Apache configuration is in /var/www/xymon/server/etc/xymon-apache.conf
If you have your Administration CGI scripts in a separate directory,
then you must also setup the password-file with the htpasswd command.
 
To start Xymon, as the xymon user run '/var/www/xymon/server/bin/xymon.sh start'
To view the Xymon webpages, go to http://localhost/xymon

If this has not already been done, configure Apache to execute the CGI programs:

$ sudo vim /etc/apache2/httpd.conf
# Add the following lines (without a leading "#") and save:
<Directory /var/www/*>
Options +ExecCGI
AddHandler cgi-script .cgi
</Directory>
$ sudo /etc/init.d/apache2 restart
$ su xymon -c "/var/www/xymon/server/bin/xymon.sh start"
Xymon started

Finally, test the software: http://localhost/xymon/server/bin/confreport.cgi

Hobbit in HA

There are two approaches to implementing High Availability for Xymon servers: HA-LAN and HA-WAN. Pick one of them according to your network structure.

HA-LAN approach

This approach uses clustering software to fail over within a set of Xymon servers. Each OS has its own clustering software: on Linux we can use Linux-HA plus DRBD, and on Solaris we have Sun Cluster software.

The drawback of this approach is that High Availability is provided at the LAN level, not the WAN level. The servers in the cluster need to reside on the same LAN subnet. If the whole cluster site goes down, Xymon clients have nowhere to send their messages.

HA-LAN using LinuxHA and DRBD

HA-LAN using Solaris Sun Cluster software plus TrueCopy

HA-WAN approach

For networks that span states or countries, failing over a primary Xymon server to a standby server across the WAN is not an easy networking task.

The following HA-WAN architecture can fail over without involving the network team in DNS or routing changes.

                
          hobbit.test.com                     hobbit2.test.com
                   | Primary                         | Standby Xymon server
                   |  <-----  heart beat ----->      | 
      LAN1         |                                 |     LAN2                        
     --------------------------             -------------------------
     ^           ^           ^                ^   ^          ^
     |           |           |                |   |          |
     |  ---------------------------------------   |          |
     |  |        |     ----------------------------          |
     |  |        |     |     |--------------------------     |
     |  |        |     |                                |    |
 hobbitc A     hobbitc B                              hobbitc C 
    LAN 3         LAN 4                                LAN 5 

LAN1: California
LAN2: Brazil
LAN3: Argentina
LAN4: Mexico
LAN5: Japan                     


Requirements
  • A script that can detect failure of the hobbit.test.com services.
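
One possible shape for such a detection script is sketched below in Python. It only checks that a TCP connection to the server port can be opened, which is a weaker test than verifying that the services actually respond; the function names and the "three failed probes in a row" policy are assumptions for this example, while port 1984 is the one shown in the design overview.

```python
import socket


def xymon_reachable(host, port=1984, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def should_take_over(history, needed=3):
    """Return True when the last `needed` probe results were all failures.

    `history` is a list of booleans, oldest first, as returned by
    xymon_reachable(). Requiring several failures in a row avoids
    reacting to a single lost connection.
    """
    return len(history) >= needed and not any(history[-needed:])
```

The standby server could run the probe periodically, append each result to a list, and enable its own alerting once `should_take_over` returns True.
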
Notes
  • The pager module on hobbit2.test.com is disabled.
  • hobbit2.test.com and hobbit.test.com reside on different sites connected by WAN.
  • Hobbit clients do not lock on to hobbit.test.com alone.
  • Each Hobbit client sends messages to both hobbit.test.com and hobbit2.test.com.
  • hobbit2.test.com has everything hobbit.test.com has, and becomes active under its own name (hobbit2.test.com) to send out alerts on behalf of hobbit.test.com.
  • There is no need to do IP failover from hobbit.test.com to hobbit2.test.com.
Pros
  • No need to alter the existing network configuration.
Cons
  • Increases network bandwidth usage, because the same message is sent to two different servers.
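
The "send every message to both servers" behaviour in the notes above can be illustrated with a small Python sketch. In a real installation this is a matter of client configuration rather than custom code; the sketch only shows the idea. The function names are made up, and the plain-text message sent to port 1984 is assumed to be accepted as-is by the server.

```python
import socket


def send_message(server, message, port=1984, timeout=5.0):
    """Open a TCP connection, send one plain-text message, and close."""
    with socket.create_connection((server, port), timeout=timeout) as conn:
        conn.sendall(message.encode("utf-8"))


def send_to_all(servers, message, port=1984):
    """Send the same message to every server.

    Returns {server: True/False} so that a failure on one server
    does not stop delivery to the others.
    """
    results = {}
    for server in servers:
        try:
            send_message(server, message, port)
            results[server] = True
        except OSError:
            results[server] = False
    return results
```

Calling `send_to_all(["hobbit.test.com", "hobbit2.test.com"], ...)` delivers to the standby even while the primary is unreachable, which is what lets the standby take over without any DNS or routing change, at the cost of doubling the traffic.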


HA-WAN 2 approach

From Patrick: we have 3 data centres and each data centre contains a xymon server. All clients in a data centre only report to their local xymon server. However the xymon servers can communicate with each other using BBDISPLAYS (it's a little more complicated than that as we utilise a bbproxy in each DC to take the messages and spray them to all 3 xymons).


                       
          hobbit1.test.com                     hobbit2.test.com
                   | Primary                         | Standby Xymon server
                   |  <-----  bbproxy    ----->      | 
      LAN1         |                                 |     LAN2                        
     --------------------------             -------------------------
     ^          ^     ^                                ^
     |          |     |                                |
     |          |     |                                |
     |          |     |                                |
     |          |     |                                |    
 hobbitc A     hobbitc B                              hobbitc C 

LAN1= has hobbitc A,B
LAN2= has hobbitc C                   


HA-WAN3 approach

This is a two-node, loosely coupled Hobbit cluster across a WAN. The following challenges need to be resolved:

  • The hobbit.test.com DNS name needs to fail over from hobbit1 to hobbit2 when hobbit1 is down.
  • The web pages on hobbit1 and hobbit2 are not in sync.
  • Maintenance records are not in sync between the two servers.
  • The RRD databases on the two Hobbit servers are out of sync after either server has been down for a while.



                           hobbit.test.com
                                 -> hobbitdynamic.test.com (using CISCO DD software).
                                      -> hobbit1.test.com
                                      -> hobbit2.test.com

                
          hobbit1.test.com                     hobbit2.test.com
                   | Primary                             | Standby Xymon server                              
                   |  <----- 1985 heart beat ----->      | 
                   |  <----- 1986 history    ----->      | 
                   |  <----- 1987 heart beat ----->      | 
      LAN1         |                                     |     LAN2                        
     --------------------------             -------------------------
     ^           ^           ^                ^   ^          ^
     |           |           |                |   |          |
     |  ---------------------------------------   |          |
     |  |        |     ----------------------------          |
     |  |        |     |     |--------------------------     |
     |  |        |     |                                |    |
 hobbitc A     hobbitc B                              hobbitc C 
    LAN 3         LAN 4                                LAN 5 

LAN1: California
LAN2: Brazil
LAN3: Argentina
LAN4: Mexico
LAN5: Japan                     


Requirements
  • A script that can detect failure of the hobbit.test.com services.
Notes
  • The pager module on hobbit2.test.com is disabled.
  • hobbit2.test.com and hobbit.test.com reside on different sites connected by WAN.
  • Hobbit clients do not lock on to hobbit.test.com alone.
  • Each Hobbit client sends messages to both hobbit.test.com and hobbit2.test.com.
  • hobbit2.test.com has everything hobbit.test.com has, and becomes active under its own name (hobbit2.test.com) to send out alerts on behalf of hobbit.test.com.
  • There is no need to do IP failover from hobbit.test.com to hobbit2.test.com.
Pros
  • No need to alter the existing network configuration.
Cons
  • Increases network bandwidth usage, because the same message is sent to two different servers.

Hobbit HA on LAN

           
          hobbit.test.com                       hobbit2.test.com
                   |       HA Software                 |
                   |    <-  heart beat ->              | 
                   |                                   | LAN1: 192.168.1.0
  ----------------------------------------------------------------
     ^          ^    ^
     |          |    |
     |          |    ---------------------------
     |          |                              |
     |          |                              |
     |          |     
 hobbitc A     hobbitc B                   hobbitc C 
 LAN 2          LAN 3                        LAN4

LAN1: California
LAN2: Brazil
LAN3: Argentina
LAN4: Mexico     

Notes
  • HA Software = Sun Cluster 3.2 + Sun AVS
  • hobbit2.test.com and hobbit.test.com reside on the same subnet (same site).
  • Cluster software (Sun Cluster 3.2) is used to fail over hobbit.test.com.
  • Each Hobbit client sends messages to hobbit.test.com only.
  • hobbit2.test.com has everything hobbit.test.com has.
  • hobbit2.test.com monitors hobbit.test.com and will assume hobbit.test.com's identity.
  • Identity: the MAC address and IP address of hobbit.test.com.
Pros
  • Close to real-time fail-over.
Cons
  • Fail-over happens only on the LAN, not across the WAN.

SunCluster

Free, open-source clustering software from Sun. Commercial technical support is available.

  • Using two sol-nv-b68-x86 VMware sessions with Sun Cluster express 07/07.

References

FST HA

An open-source clustering solution specifically for Solaris.

[edit] Hobbit Configuration and tuning

[edit] Hobbit(bb) port 1984 encryption

Plain-text BB messages are an obstacle to deploying Hobbit as an enterprise solution where high security standards are required. The following is an attempt to make your CIO smile on a Hobbit solution, by tunneling BB traffic through stunnel.

  1. Machine A: runs both the HB (Hobbit) server and the stunnel server.
  2. Machine B: is a BB client.
  3. Machine C: is a Hobbit client with the stunnel client enabled. The HB client sends BB messages via encrypted port 1999.
  4. Machine D: is an HB client.
  5. Note: the old BB protocol is one-way; HB's BB protocol is bidirectional.
      Machine A (192.168.1.111)                                          

    ---------------------------
     HB Server process         |   <---------port 1984 <---------  BB client (Machine B)
         |                     |
         |1984                 |   <---------port 1984 --------->  HB client (Machine D)
         |                     |                                   
   Stunnel Server process 1999 |   <-------- port 1999 ----------> 1999 Stunnel Client
   ----------------------------                                    |            (Machine C 192.168.1.141)
                                                                   |
                                                                   --1984 ---HB client 
                                                                   

[edit] Configure stunnel server to run in hobbit server

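  1. Edit the stunnel config file on the server to redirect port 1999 to the local port 1984.
accept = 1999: accept any incoming BB message on port 1999.
connect = 127.0.0.1:1984: redirect port 1999 to port 1984 on the HB server itself. The config below writes this as connect = 1984; stunnel defaults to the local host when no host is given.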
 
bash-3.00# cat /opt/stunnel420/etc/stunnel/stunnel.conf
<snip>
[hobbit-server]
accept  = 1999
connect = 1984
<snip>
bash-3.00#
  1. Start the stunnel server on machine A. The startup log shows that the hobbit-server service was initialized.
bash-3.00# /etc/init.d/stunnel420 start
Starting universal SSL tunnel: stunnel2007.04.29 06:47:50 LOG7[1898:1]: RAND_status claims sufficient entropy for the PRNG
2007.04.29 06:47:50 LOG7[1898:1]: PRNG seeded successfully
2007.04.29 06:47:50 LOG7[1898:1]: Certificate: /opt/stunnel420/etc/stunnel/stunnel.pem
2007.04.29 06:47:50 LOG7[1898:1]: Certificate loaded
2007.04.29 06:47:50 LOG7[1898:1]: Key file: /opt/moto/stunnel420/etc/stunnel/stunnel.pem
2007.04.29 06:47:50 LOG7[1898:1]: Private key loaded
2007.04.29 06:47:50 LOG7[1898:1]: SSL context initialized for service pop3s
2007.04.29 06:47:50 LOG7[1898:1]: Certificate: /opt/stunnel420/etc/stunnel/stunnel.pem
2007.04.29 06:47:50 LOG7[1898:1]: Certificate loaded
2007.04.29 06:47:50 LOG7[1898:1]: Key file: /opt/stunnel420/etc/stunnel/stunnel.pem
2007.04.29 06:47:50 LOG7[1898:1]: Private key loaded
2007.04.29 06:47:50 LOG7[1898:1]: SSL context initialized for service hobbit-server
.
bash-3.00#
  1. Make sure stunnel is running.
bash-3.00# ps -eaf |grep stunnel
  nobody  1984     1   0 06:55:00 ?           0:00 /opt/stunnel420/sbin/stunnel
    root  2133  1811   0 07:04:32 pts/2       0:00 grep stunnel
bash-3.00#
  1. Test port 1999 on the HB server directly: type a garbage message such as "asdf", then press Control+D to quit.
bash-3.00# telnet machineA.test.com 1999
Trying 192.168.1.111...
Connected to machineA.test.com.
Escape character is '^]'.
asdf
Connection to machineA.test.com closed by foreign host.
bash-3.00#
  1. The stunnel log file on machine A shows an incoming connection on port 1999 from 192.168.1.141 (machine C).
bash-3.00# tail -10f /opt/stunnel420/etc/stunnel/stunnel.log
2007.04.29 06:55:00 LOG5[1983:1]: 125 clients allowed
2007.04.29 06:55:00 LOG7[1983:1]: FD 4 in non-blocking mode
2007.04.29 06:55:00 LOG7[1983:1]: FD 5 in non-blocking mode
2007.04.29 06:55:00 LOG7[1983:1]: FD 6 in non-blocking mode
2007.04.29 06:55:00 LOG7[1983:1]: SO_REUSEADDR option set on accept socket
2007.04.29 06:55:00 LOG7[1983:1]: pop3s bound to 0.0.0.0:995
2007.04.29 06:55:00 LOG7[1983:1]: FD 7 in non-blocking mode
2007.04.29 06:55:00 LOG7[1983:1]: SO_REUSEADDR option set on accept socket
2007.04.29 06:55:00 LOG7[1983:1]: hobbit-server bound to 0.0.0.0:1999
2007.04.29 06:55:00 LOG7[1984:1]: Created pid file /stunnel.pid
2007.04.29 06:55:35 LOG7[1984:1]: hobbit-server accepted FD=0 from 192.168.1.141:38764
2007.04.29 06:55:35 LOG7[1984:2]: hobbit-server started
2007.04.29 06:55:35 LOG7[1984:2]: FD 0 in non-blocking mode
2007.04.29 06:55:35 LOG7[1984:2]: TCP_NODELAY option set on local socket
2007.04.29 06:55:35 LOG5[1984:2]: hobbit-server accepted connection from 192.168.1.141:38764
2007.04.29 06:55:35 LOG7[1984:2]: SSL state (accept): before/accept initialization
2007.04.29 06:55:39 LOG3[1984:2]: SSL_accept: 1408F10B: error:1408F10B:SSL routines:SSL3_GET_RECORD:wrong version number
2007.04.29 06:55:39 LOG5[1984:2]: Connection reset: 0 bytes sent to SSL, 0 bytes sent to socket
2007.04.29 06:55:39 LOG7[1984:2]: hobbit-server finished (0 left)
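
The "wrong version number" error from SSL_accept at the end of the log above is what stunnel typically reports when a plain-text client such as telnet talks to an SSL port, so the telnet test mainly proves that the port is reachable. A minimal sketch of a real handshake check follows. It assumes the openssl command-line tool is installed on the machine running the test; the function names tls_target and tls_probe are illustrative and not part of Hobbit or stunnel.

```shell
# Build the host:port target for the probe. Port 1999 is the stunnel
# port used in this example and is the default here.
tls_target() {
    printf '%s:%s\n' "$1" "${2:-1999}"
}

# Attempt an SSL handshake against the stunnel port. Redirecting stdin
# from /dev/null makes s_client exit once the handshake is done.
tls_probe() {
    openssl s_client -connect "$(tls_target "$@")" </dev/null
}
```

For example, tls_probe machineA.test.com would be expected to print the certificate served from stunnel.pem if the handshake succeeds. If the server only offers an old protocol version such as SSLv3, a recent openssl build may refuse to negotiate.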

[edit] Configuring hb client to use port 1999

  1. Add hobbitclientLocalIP to BBDISPLAYS in the hobbitclient.cfg file. The Hobbit client should send BB messages to itself, where the local stunnel client listens on port 1984.
bash-3.00# grep ^BBDISPLAYS   /etc/opt/hobbitclient42/hobbitclient.cfg
BBDISPLAYS="myotherhobbitserver.my.com hobbitclientLocalIP"                   # IP of multiple Hobbit servers. BBDISP must be "0.0.0.0".
bash-3.00#
  1. Configure the stunnel client on the Hobbit client to accept BB messages on local port 1984 and forward them to port 1999 on the Hobbit server.
bash-3.00# egrep -v '^;|^$'  /opt/stunnel420/etc/stunnel/stunnel.conf
cert = /opt/stunnel420/etc/stunnel/stunnel.pem
sslVersion = SSLv3
chroot = /opt/stunnel420/var/lib/stunnel/
setuid = nobody
setgid = nogroup
pid = /stunnel.pid
socket = l:TCP_NODELAY=1
socket = r:TCP_NODELAY=1
debug = 7
output = stunnel.log
client = yes
[hobbitclient]
connect  =  hbServerRemoteIP:1999
accept   =  hbLocalIP:1984
bash-3.00#
  1. Log of a Hobbit client successfully tunneling to the Hobbit server on port 1999.
bash-3.00# grep 06:50   stunnel.log
2007.08.19 00:06:50 LOG7[14842:1]: hobbitclient accepted FD=0 from HobbitclientIP:63758
2007.08.19 00:06:50 LOG7[14842:3]: hobbitclient started
2007.08.19 00:06:50 LOG7[14842:3]: FD 0 in non-blocking mode
2007.08.19 00:06:50 LOG7[14842:3]: TCP_NODELAY option set on local socket
2007.08.19 00:06:50 LOG5[14842:3]: hobbitclient accepted connection from HobbitclientIP:63758
2007.08.19 00:06:50 LOG7[14842:3]: FD 1 in non-blocking mode
2007.08.19 00:06:50 LOG7[14842:3]: hobbitclient connecting HobbitServerIP:1999
2007.08.19 00:06:50 LOG7[14842:3]: connect_wait: waiting 10 seconds
2007.08.19 00:06:50 LOG7[14842:3]: connect_wait: connected
2007.08.19 00:06:50 LOG5[14842:3]: hobbitclient connected remote server from HobbitclientIP:63759
2007.08.19 00:06:50 LOG7[14842:3]: Remote FD=1 initialized
2007.08.19 00:06:50 LOG7[14842:3]: TCP_NODELAY option set on remote socket
2007.08.19 00:06:50 LOG7[14842:3]: SSL state (connect): before/connect initialization
2007.08.19 00:06:50 LOG7[14842:3]: SSL state (connect): SSLv3 write client hello A
2007.08.19 00:06:50 LOG7[14842:3]: SSL state (connect): SSLv3 read server hello A
2007.08.19 00:06:50 LOG7[14842:3]: SSL state (connect): SSLv3 read finished A
2007.08.19 00:06:50 LOG7[14842:3]: SSL state (connect): SSLv3 write change cipher spec A
2007.08.19 00:06:50 LOG7[14842:3]: SSL state (connect): SSLv3 write finished A
2007.08.19 00:06:50 LOG7[14842:3]: SSL state (connect): SSLv3 flush data
2007.08.19 00:06:50 LOG7[14842:3]:    1 items in the session cache
2007.08.19 00:06:50 LOG7[14842:3]:    2 client connects (SSL_connect())
2007.08.19 00:06:50 LOG7[14842:3]:    2 client connects that finished
2007.08.19 00:06:50 LOG7[14842:3]:    0 client renegotiations requested
2007.08.19 00:06:50 LOG7[14842:3]:    0 server connects (SSL_accept())
2007.08.19 00:06:50 LOG7[14842:3]:    0 server connects that finished
2007.08.19 00:06:50 LOG7[14842:3]:    0 server renegotiations requested
2007.08.19 00:06:50 LOG7[14842:3]:    1 session cache hits
2007.08.19 00:06:50 LOG7[14842:3]:    0 session cache misses
2007.08.19 00:06:50 LOG7[14842:3]:    0 session cache timeouts
2007.08.19 00:06:50 LOG6[14842:3]: SSL connected: previous session reused
2007.08.19 00:06:50 LOG7[14842:3]: Socket closed on read
2007.08.19 00:06:50 LOG7[14842:3]: SSL write shutdown
2007.08.19 00:06:50 LOG7[14842:3]: SSL alert (write): warning: close notify
2007.08.19 00:06:50 LOG6[14842:3]: SSL socket closed on SSL_shutdown
2007.08.19 00:06:50 LOG7[14842:3]: Socket write shutdown
2007.08.19 00:06:50 LOG5[14842:3]: Connection closed: 30068 bytes sent to SSL, 0 bytes sent to socket
2007.08.19 00:06:50 LOG7[14842:3]: hobbitclient finished (0 left)
bash-3.00#
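
Rather than waiting for the next client run, a status message can be pushed through the tunnel by hand. The sketch below composes a BB status message and hands it to the client's bb utility. It assumes that the BB environment variable holds the path of the bb binary (falling back to bb on the PATH), and the column name "tunnel" and both function names are made up for illustration.

```shell
# Compose a BB status message: "status HOST.TEST COLOR TEXT".
# Dots in the host name are written as commas in the message.
status_message() {
    bbhost=$(printf '%s' "$1" | tr '.' ',')
    printf 'status %s.%s %s %s\n' "$bbhost" "$2" "$3" "$4"
}

# Send a test status to the local stunnel listener.
# $1 is the address stunnel accepts on (hbLocalIP in the config above).
send_test_status() {
    "${BB:-bb}" "$1" "$(status_message "$(uname -n)" tunnel green "tunnel test")"
}
```

After running send_test_status with the local listener address, a new "hobbitclient accepted connection" entry should appear in the client's stunnel.log, and a matching "hobbit-server accepted connection" entry in the log on machine A.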

[edit] 32 bit vs 64 bit binary for hobbit on Solaris

  • This article describes the subject in great detail.

[edit] Configuration

[edit] LDAP Authentication

Example httpd.conf (Apache 2.0.x with LDAP authenticated against Active Directory):

Substitute LDAPSERVER.DOMAIN.COM with your LDAP server.

<USERNAME>: an account with permission to view the LDAP directory.

<PASSWORD>: the password for that account. (You should limit what this account can do.)

<Directory "/var/hobbit/cgi-secure">
    AllowOverride None
    Options ExecCGI Includes
    Order allow,deny
    Allow from all
 
    AuthType Basic
    AuthName "Hobbit Administration"
    AuthLDAPEnabled on
    AuthLDAPURL ldap://LDAPSERVER.DOMAIN.COM:389/dc=DOMAIN,dc=COM?sAMAccountName?sub?(objectClass=person)
    AuthLDAPBindDN "cn=<USERNAME>,cn=Users,dc=DOMAIN,dc=COM"
    AuthLDAPBindPassword <PASSWORD>
    require valid-user
 
</Directory>

The same for a Novell eDirectory LDAP server:

<Directory "/usr/lib/hobbit/cgi-secure">
    AllowOverride None
    Options ExecCGI Includes
    Order allow,deny
    Allow from all
 
    AuthName "Hobbit-Admin"
    AuthType Basic
    AuthLDAPURL ldap://LDAPSERVER.DOMAIN.COM/o=TREE,ou=Users?cn?sub?(groupMembership=cn=your_group,ou=groups,o=TREE)
    require valid-user
</Directory>
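
Before debugging Apache, it can help to confirm that the bind account and search filter from the Active Directory example work on their own. The sketch below assumes the OpenLDAP ldapsearch client is available (other ldapsearch implementations use different options), and the function names ldap_filter and ldap_check are illustrative only.

```shell
# Search filter equivalent to what the AuthLDAPURL in the Active
# Directory example asks for: the URL's filter ANDed with
# sAMAccountName=<login>.
ldap_filter() {
    printf '(&(objectClass=person)(sAMAccountName=%s))\n' "$1"
}

# Look up one login with the bind account. -W prompts for the password
# so that it does not appear in the process list.
# $1 = LDAP server, $2 = bind DN, $3 = search base, $4 = login name
ldap_check() {
    ldapsearch -x -H "ldap://$1:389" -D "$2" -W -b "$3" -s sub "$(ldap_filter "$4")"
}
```

A call such as ldap_check LDAPSERVER.DOMAIN.COM "cn=<USERNAME>,cn=Users,dc=DOMAIN,dc=COM" "dc=DOMAIN,dc=COM" somelogin should return exactly one entry if the directory side is set up as the Apache configuration expects.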

[edit] Alerts setting

  • Pager

Using sms_client [smsclient.org]

Create a shell-script (/usr/bin/hobbitsms) like this:

#!/bin/bash
# Hobbit sets RCPT, BBHOSTSVC, BBCOLORLEVEL and RECOVERED before calling this script.
if [ "${RECOVERED:-0}" != "1" ]; then
    MSG="HOBBIT : $BBHOSTSVC is $BBCOLORLEVEL"
else
    MSG="HOBBIT : $BBHOSTSVC is OK"
fi
# Log the page, then send it.
echo "$RCPT \"$MSG\"" >> /var/log/hobbit/page.log
/usr/bin/sms_client "$RCPT" "$MSG"

Edit hobbit-alerts.cfg and add the lines for the alerts you want to receive:

      SCRIPT /usr/bin/hobbitsms hobbit DURATION>5 FORMAT=SMS REPEAT=180 COLOR=red TIME=W:0730:1800 RECOVERED
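
An alert script can be exercised by hand before it is wired into hobbit-alerts.cfg. The sketch below runs any command with the environment variables the scripts in this section read. All values are made-up test data, and the helper name dry_run_alert and the TEST_RCPT and TEST_RECOVERED overrides are hypothetical.

```shell
# Run an alert script with the environment variables it expects.
# RECOVERED=0 simulates a new alert, RECOVERED=1 a recovery.
dry_run_alert() {
    RCPT="${TEST_RCPT:-5551234}" \
    BBHOSTSVC="www.test.com.http" \
    BBCOLORLEVEL="red" \
    RECOVERED="${TEST_RECOVERED:-0}" \
    BBALPHAMSG="www.test.com.http is red" \
    "$@"
}
```

For example, dry_run_alert /usr/bin/hobbitsms followed by a look at /var/log/hobbit/page.log shows the message that would be sent. Note that this really invokes sms_client, so set TEST_RCPT to a number you control.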
  • Pager.

Using snpp sendpage.org

Create a shell-script (/usr/bin/hobbitsnpp) like this:

#!/bin/bash
# Send the alert text prepared by Hobbit (BBALPHAMSG) to the pager in RCPT.
/usr/bin/snpp -n "$RCPT" <<SCRIPTEOF
$BBALPHAMSG
SCRIPTEOF
  • Email.

[edit] Tuning

[edit] How to shorten Xymon server DNS lookup time?

Every five minutes, the Xymon server performs a large number of DNS lookups for the machines that need to be pinged.

Install a local DNS cache server so that these lookups are answered locally. I use djbdns for this.

[edit] How to shorten the ping test time?

[edit] Hobbit and Remedy Ticket System

[edit] Overview

The Remedy ticket system has a web interface for opening a ticket in a particular ticket queue.

The Perl approach uses the following software to automate the ticket request when an alert occurs:

  • perl
  • LWP
  • trouble_ticket.tgz on http://www.deadcat.net
  • An entry URL on the Remedy server web interface.
  • A Perl subroutine to open a Remedy ticket.

[edit] Open Remedy ticket on hobbit alerts

[edit] Open Remedy ticket on demand

[edit] Migration from BB

[edit] Cost (efforts) of Migration

[edit] System and Inventory Monitoring

System and inventory monitoring can be achieved with an external module that reports a system's inventory information. (TBC)


[edit] Trouble Shooting Guide

[edit] Q. When I click on a status icon I get the message "Status not available". What should I check?

A. First make sure that the server is actually running.

ps -ef | grep hobbitd

You should see several processes similar to:

hobbit   32717 32716  0 Nov07 ?        00:01:07 hobbitd --pidfile....
hobbit   32726 32716  0 Nov07 ?        00:00:03 hobbitd_channel --channel=page...
hobbit   32727 32716  0 Nov07 ?        00:01:58 hobbitd_channel --channel=status...
hobbit   32728 32716  0 Nov07 ?        00:00:01 hobbitd_channel --channel=data...
hobbit   32725 32716  0 Nov07 ?        00:00:00 hobbitd_channel --channel=stachg...

If the server fails to start, look in the Hobbit logs directory. One common location is:

/var/log/hobbit
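
The process check above can be scripted. The following sketch reads a process listing on standard input and prints the channels from the example listing that have no hobbitd_channel worker. The function name missing_channels is made up, and the channel list covers only the four channels shown above; a real installation may run more.

```shell
# Read a process listing on stdin and print every channel from the
# example above that has no hobbitd_channel worker in the listing.
missing_channels() {
    listing=$(cat)
    for ch in page status data stachg; do
        printf '%s\n' "$listing" |
            grep "hobbitd_channel --channel=$ch" >/dev/null || echo "$ch"
    done
}
```

Run it as ps -ef | missing_channels. No output means all four workers were found; any channel name printed points to a worker whose log file is worth checking first.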

[edit] Q. After installing the Hobbit client, my msgs tests are "clear" (sometimes referred to as "white")

A. As of this writing, the Hobbit client does NOT have msgs functionality like the BB client does. It can be added by installing the bb-msgs.sh file from the BB client as an external test. Even so, the Hobbit server will turn the test "clear" instead of showing the expected status. To correct this issue, edit the hobbitlaunch.cfg file (usually found in /etc/hobbit/ or /usr/lib/hobbit/server/etc/), add --no-clear-msgs to the client channel, and restart the server:

CMD hobbitd_channel --channel=client hobbitd_client --no-clear-msgs --log=$BBSERVERLOGS/clientdata.log ...

[edit] Q. Tried to down BOARDBUSY: Invalid argument

A.

On Sat, Dec 09, 2006 at 12:08:02PM -0500, Geoff Hallford wrote:
> I am getting the following error message in various Hobbit logs:
>
> 2006-12-04 07:59:46 Tried to down BOARDBUSY: Invalid argument
>
> Does anyone know what this is referring to or what I need to change?

It often shows up when stopping Hobbit - you can ignore it.


Regards,
Henrik

[edit] Hobbit clients in DMZ zone

[edit] DMZ with NAT

  • diagram

[edit] DMZ with restricted Firewall

  • diagram

[edit] References

  1. http://www.xymon.com/xymon/help/install.html

[edit] See also
