System Monitoring with Xymon/Administration Guide
Everything related to system administration is documented here.
Design Overview
The following is copied from http://www.hswn.dk/hobbiton/2006/11/msg00315.html
I don't have any formal design documents; the Hobbit design has evolved from a few basic principles and some ideas I had. I'll try to give you a quick summary.
Hobbit should be portable across Unix architectures. This is obviously important for the Hobbit client code, but I've done my best to use only "standard" Unix and C for both the server- and client-code. This has turned out to be easier than I thought; there are some really old Unix systems that cannot run the Hobbit server, but any recent version of a Unix-like system (released within the past 5-10 years) runs Hobbit without problems.
Hobbit also has to be backwards compatible with Big Brother version 1.9c, in the sense that you can use BB clients with a Hobbit server. I had over 1000 servers with a BB client on them (Unix and Windows), so doing a "Big Bang" switch changing the servers and clients at once would be impossible.
Hobbit must scale well. When I started using Big Brother I had 40 servers to monitor. When I got to 100, I had to re-implement parts of BB and this became the "bbgen" toolkit. When I got to 500, my BB server was getting overloaded, and I started to work on a replacement. Today I have 2500 servers, and before Christmas I will be monitoring nearly 4000 servers. That's a 100x increase in just 5 years.
Hobbit should not rely on a lot of other infrastructure to work. E.g. it doesn't require a huge database backend; you can add one if you like, but it is not needed. Keeping Hobbit simple makes it robust - and you really, REALLY want your monitoring to work when everything else is crashing. We once had a major power outage at our datacenter; the Hobbit server came up quickly, but there were a lot of systems that needed some manual intervention to get back online. It was quite interesting to see how the activity on the Hobbit server just sky-rocketed, because everyone was looking at Hobbit to see which systems were running, and which were down.
Analyzing data
For me, a key principle of handling the data that is poured into Hobbit is that as much as possible of the data analysis should take place on the Hobbit server. I believe it is a huge benefit to keep configuration settings in one place (the Hobbit server); also, by having access to the raw data you can perform types of analysis that you didn't think of when a script was initially created to collect some data.
E.g. the Hobbit client reports data about who is logged on to a system. This information is not currently used by Hobbit, but I know that someone wrote a custom backend utility to check this data and alert him if there was someone logged in as "root". I had not thought of this, but by making the raw data from the client available, it was very easy for him to implement this check on all of his servers.
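The root-login check described above could be sketched roughly as follows. This is only an illustration, not the utility that user wrote: it assumes the raw `who` output has already been pulled out of the client message as plain text, and the function name `root_sessions` is made up for this example.

```python
def root_sessions(who_output):
    """Return the lines of `who` output that belong to the user root."""
    hits = []
    for line in who_output.splitlines():
        fields = line.split()
        # The first column of `who` output is the login name.
        if fields and fields[0] == "root":
            hits.append(line)
    return hits


# Hypothetical sample of raw `who` data as reported by a client.
sample = """\
alice    pts/0        2006-11-20 09:12 (10.0.0.5)
root     pts/1        2006-11-20 09:30 (10.0.0.9)
"""

for session in root_sessions(sample):
    print("ALERT: root is logged in:", session)
```

Because the check runs on the server against raw data, it can be applied to every host without touching any client.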
This is quite different from Big Brother, where the data is pre-processed into status messages. The BB client can check if a process is running, but then it just reports "process foo is OK". When process "foo" is NOT running you get an error status - but you cannot easily see if process "foo" has stopped because your backup is running at the same time (and they cannot coexist) because you only get part of the information, not the full process listing.
This may seem like a trivial example, but I realized early on that there are far more ways of using these data than I could possibly imagine. So instead of forcing my ideas of how to use the data upon others, it should be possible to just get the raw data and perform your own analysis of it.
Another example is that in Hobbit 4.2, I added a module which saves a copy of the client data if a status goes red on a host. This has turned out to be extremely helpful in diagnosing those "why did the webserver crash at 4 AM last Tuesday" questions ... because you have access to a lot of raw data collected by the client just before the crash happened, including all of the data that Hobbit didn't analyze by itself but which humans can use to put the whole picture together.
This is not implemented completely yet. The network test utility - which was also carried over from the bbgen toolkit - works the "Big Brother" way. One thing on my agenda is to change that, so the network tester just reports that the ping of host "foo" responded in 12.7 ms, the ping of host "bar" failed and so on. Then a module on the Hobbit server can decide if these should result in a red or yellow status, perhaps based on other information it has (e.g. that the response time shouldn't exceed 10 ms during working hours, unless the primary network connection was down so we were running on a backup line with less capacity).
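A server-side decision of that kind might look like the sketch below. It is not part of Hobbit; the function name, the 08:00-17:00 working hours, the choice of yellow for a slow ping and the `on_backup_line` flag are all assumptions made for illustration. Only the 10 ms limit and the 12.7 ms sample come from the text.

```python
from datetime import time


def ping_color(rtt_ms, now, on_backup_line=False, limit_ms=10.0):
    """Map a raw ping result to a status color.

    rtt_ms is the round-trip time in milliseconds, or None if the ping failed.
    now is the time of day at which the test ran.
    """
    if rtt_ms is None:
        return "red"
    # Assumed working hours; a real module would read these from configuration.
    working_hours = time(8, 0) <= now < time(17, 0)
    if working_hours and not on_backup_line and rtt_ms > limit_ms:
        return "yellow"
    return "green"


print(ping_color(12.7, time(10, 30)))                       # slow during working hours
print(ping_color(12.7, time(10, 30), on_backup_line=True))  # slow, but on the backup line
print(ping_color(None, time(10, 30)))                       # ping failed
```

The point of the design is that the network tester only supplies `rtt_ms`; everything else is decided where the rest of the information lives.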
The core daemons
I wanted to have a network daemon holding all of the "current state" information. This information changes all the time as new status reports arrive, so it has to keep this in memory - writing it to disk would be too slow (BB did this, and it doesn't scale). So a core component of Hobbit would be this central daemon (hobbitd). The daemon NEVER does any disk I/O; this would slow it down and I don't want that, because Hobbit must support monitoring of thousands of servers. All communication between hobbitd and the outside world goes via a network connection; this is used both for in-band data (status updates and data messages), but also for out-of-band data like control messages (drop a host, disable a server and so on). Tools that need to fetch the entire status of all servers, or just the detailed status of a host also do this through a network connection to hobbitd.
However, some things must be stored on disk - RRD (graph) files, for instance, or historical eventlogs. So this is handled by a bunch of independent "worker" modules - hobbitd_rrd (RRD updates), hobbitd_history (history logs), hobbitd_alert (sending out alerts). These obviously have to be fed information about the data that flows into the hobbitd daemon - e.g. hobbitd_rrd needs the full status message to extract the data it puts into the RRD files, and hobbitd_history needs information about the status changes from green->red and so on. So I needed a fast inter-process communication mechanism between hobbitd and the worker modules. Also, I wanted to be able to start/stop/restart worker modules on-the-fly; this is extremely nice for testing and makes the system much more robust. Finally, I wanted an interface that was simple to use so that end-users can hook into the data stream if they need to write some custom back-end script.
The solution for this was a mechanism that uses the System V "shared memory" IPC mechanism, combined with a group of semaphores to control access to the shared memory area. So hobbitd copies a message into the shared memory area and ups a semaphore telling the workers that there is a new message. The workers then pick up the message and down another semaphore once they have secured their copy of the message; hobbitd then knows when it is safe to overwrite the shared-memory area with a new message.
I call this IPC mechanism a "channel", and there are in fact several of these: one for each type of message. So there is a channel which receives all of the raw "status" messages; another channel for the raw "data" messages; a channel that receives messages about status changes (for history logging); a channel that receives messages about critical red/yellow statuses (for alerts) and so on. Recently a new channel was added for the "client" messages that come from the Hobbit client.
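The handshake between sender and worker can be illustrated with an in-process analogue. This is a sketch only: it uses Python threads and `threading.Semaphore` in place of System V shared memory and semaphores, it handles a single worker, and the class and attribute names are invented here. Hobbit's real semaphore layout is not shown in the text and may differ.

```python
import threading


class Channel:
    """One-slot message channel guarded by two semaphores."""

    def __init__(self):
        self.slot = None                             # stands in for the shared memory area
        self.message_ready = threading.Semaphore(0)  # signalled by the sender
        self.slot_free = threading.Semaphore(1)      # signalled by the worker after copying

    def post(self, message):
        self.slot_free.acquire()      # wait until the worker has secured its copy
        self.slot = message
        self.message_ready.release()  # tell the worker there is a new message

    def fetch(self):
        self.message_ready.acquire()  # wait for a new message
        copy = self.slot
        self.slot_free.release()      # the sender may now overwrite the slot
        return copy


channel = Channel()
received = []


def worker(count):
    for _ in range(count):
        received.append(channel.fetch())


messages = ["status web1.http green", "status db1.disk red", "status web2.cpu yellow"]
thread = threading.Thread(target=worker, args=(len(messages),))
thread.start()
for message in messages:
    channel.post(message)
thread.join()
print(received)
```

The sender never overwrites the slot before the worker has copied it, which is the property the semaphores exist to guarantee.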
There are some early notes about this mechanism in the hobbitd/new-daemon.txt file in the hobbit sources. Not all of the ideas there have been implemented, e.g. the "streaming" protocol turned out not to be particularly important.
To make sure that the semaphore stuff is handled correctly, I decided to put a "buffer" module between hobbitd and the actual workers. This is the hobbitd_channel module; it serves only one purpose, which is to grab the messages that hobbitd sends out through the IPC mechanism, and queue them for the real worker module (hobbitd_rrd, hobbitd_history etc). The fact that hobbitd_channel acts as a message queue is useful to accommodate spikes in the activity; the alert module, for instance, sometimes gets a huge spike of messages when a network switch dies. hobbitd_channel also makes it easy to build your own backend modules, because it forwards the messages via a simple text-based pipe; so your custom backend modules can just read them from stdin.
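A custom backend module reading from that pipe might start from a sketch like this one. The framing is an assumption, since the text above does not spell it out: each message is taken to begin with a header line starting with "@@" whose fields are separated by "|", and to end with a line containing only "@@". The function name and the sample message are made up for illustration.

```python
import io


def read_messages(stream):
    """Yield (header_fields, body_lines) for each message on the stream.

    Assumed framing: a header line starting with "@@", then body lines,
    then a terminating line that contains only "@@".
    """
    header, body = None, []
    for raw in stream:
        line = raw.rstrip("\n")
        if line == "@@":
            # End of message: hand it over and reset the state.
            if header is not None:
                yield header.split("|"), body
            header, body = None, []
        elif header is None and line.startswith("@@"):
            header = line[2:]
        elif header is not None:
            body.append(line)


# Hypothetical input; a real worker would pass sys.stdin instead.
demo = io.StringIO("@@status#1|1164000000|10.0.0.5\nline one\nline two\n@@\n")
for fields, body in read_messages(demo):
    print(fields[0], body)
```

Since the worker only sees text on stdin, it can be written in any language and restarted without touching hobbitd.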
Another benefit of having hobbitd_channel between hobbitd and the worker modules showed up recently; I am currently working on a new version of hobbitd_channel which can distribute the incoming messages between multiple worker "clones" running on different servers, to perform some load balancing of the heavy tasks (primarily RRD file updates). This has been implemented almost exclusively by changing hobbitd_channel, instead of having to modify all of the worker modules.
So the core design looks like this:
Network tests --\
                 \
                  \   TCP:1984                  IPC
Clients ----------> hobbitd --------------> hobbitd_channel ------> worker modules
                  /            Shared memory                 stdin
                 /
Custom tests ---/
Xymon Protocol
- The Xymon author has published a description of the Xymon protocol in ASCII text format.
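As a rough illustration of how a client talks to the daemon, the sketch below builds a status message and sends it over TCP port 1984. It assumes the usual "status HOSTNAME.TESTNAME COLOR text" layout with dots in the host name written as commas; the helper names and the host `www.example.com` are invented for this example, and the protocol description remains the authority on the details.

```python
import socket


def build_status(host, test, color, text):
    """Build a status message; dots in the host name become commas."""
    return "status {}.{} {} {}".format(host.replace(".", ","), test, color, text)


def send_message(server, message, port=1984, timeout=10.0):
    """Send one message to the server over TCP and close the write side."""
    with socket.create_connection((server, port), timeout=timeout) as conn:
        conn.sendall(message.encode("utf-8"))
        conn.shutdown(socket.SHUT_WR)  # signal the end of the message


print(build_status("www.example.com", "http", "green", "OK"))
```

Calling `send_message("127.0.0.1", build_status(...))` on the server itself would be the programmatic counterpart of running the `bb` command by hand.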
The Web interface
The web interface is mostly carried over from Hobbit's predecessor, the "bbgen" toolkit. I wrote this for Big Brother, to speed up the generation of the Big Brother webpages, and by re-using this in Hobbit I would quickly get a working web interface - all I had to do was to change the programs to grab their data from the hobbitd daemon, instead of reading through the status logfiles that Big Brother uses.
This also means that the web interface is not tied in with the core daemons. Sure, they need to communicate and there are some things in the core daemons that are closely related to how the web interface works - e.g. disabling a host. But it should be possible for an adventurous programmer to use the core Hobbit daemons with their own web front-end tools and come up with a completely different user-interface.
So the web interface is probably the part of Hobbit that has evolved the least from its origins in Big Brother. Some new CGI programs have been added, but nothing revolutionary - it just picks up bits of information from hobbitd and the configuration files and displays them.
One design criterion for the web interface is that it should be as dynamic as possible; it must reflect the current status and configuration as closely as possible. That is why most of the web interface is done with CGI programs; the only static webpages in Hobbit are the overview pages generated by bbgen - and I hope to eliminate those soon.
The clients
So with this background, it is obvious that the Hobbit client is really, really dumb. It is basically just a shell script that runs some normal OS commands - df, ps, who and so on - and then it's up to the Hobbit server to analyze them and generate some status columns. Client data is sent to hobbitd, which feeds it through a channel to the hobbitd_client module. hobbitd_client has some parsers for each of the operating systems it knows about, and uses those to grab the interesting data and compare it to the client configuration rules. Then hobbitd_client generates some "status" messages and sends them to hobbitd. The major challenge with this design is logfiles; you cannot realistically send entire logfiles - some of them are several GB of data - over to Hobbit for analysis every 5 minutes. So some filtering must be done on the client side; to keep all of the configuration data on the Hobbit server this meant that the client has to pick up its filter-configuration from the Hobbit server.
I hope this is enough of an overview for you. Good luck with your thesis.
Regards, Henrik
Architecture of a Hobbit System Monitoring Environment
TBC
Picking an OS for Hobbit Server
These are some notes and advice from Hobbit users.
Linux
Oracle Solaris 10
Pros
- Plus 1: Turbocharged TCP/IP.
- Plus 2: DTrace.
- Plus 3: Self-healing.
- Plus 4: You can configure the root and data disks to use ZFS, with ZFS snapshots enabled.
Cons
- Minus 1: The open source software that Hobbit depends on does not come with Oracle Solaris by default. The following three sites provide it in binary or source code form:
- http://www.blastwave.org
- http://www.sunfreeware.com has lots of open source.
- http://www.thewrittenword.com
List of software required to meet all dependencies, in order of installation:
- common-1.4.5-SunOS5.8-sparc-CSW.pkg.gz
- pcre-4.5-SunOS5.8-sparc-CSW.pkg.gz
- fping-2.4,REV=2004.10.12_rev=b2_to_ipv6-SunOS5.8-sparc-CSW.pkg.gz
- zlib-1.2.3,REV=2007.05.12-SunOS5.8-sparc-CSW.pkg.gz
- png-1.2.18-SunOS5.8-sparc-CSW.pkg.gz
- libiconv-1.9.2-SunOS5.8-sparc-CSW.pkg.gz
- expat-1.95.7-SunOS5.8-sparc-CSW.pkg.gz
- ggettext-0.14.1,REV=2005.06.29-SunOS5.8-sparc-CSW.pkg.gz
- libpopt-1.7,REV=2004.05.15-SunOS5.8-sparc-CSW.pkg.gz
- chkconfig-1.2.24h,REV=2006.12.12-SunOS5.8-sparc-CSW.pkg.gz
- openssl-0.9.8,REV=2007.05.10_rev=e-SunOS5.8-sparc-CSW.pkg.gz
- imaprt-2004,REV=2006.09.02_rev=g-SunOS5.8-sparc-CSW.pkg.gz
- freetype2-2.1.10,REV=2005.12.11-SunOS5.8-sparc-CSW.pkg.gz
- libart-2.3.16-SunOS5.8-sparc-CSW.pkg.gz
- berkeleydb44-4.4.20,REV=2007.01.27-SunOS5.8-sparc-CSW.pkg.gz
- ncurses-5.5,REV=2006.02.10-SunOS5.8-sparc-CSW.pkg.gz
- readline-5.0,REV=2005.06.07-SunOS5.8-sparc-CSW.pkg.gz
- gbc-1.06-SunOS5.8-sparc-CSW.pkg.gz
- gdbm-1.8.3,REV=2006.01.01-SunOS5.8-sparc-CSW.pkg.gz
- perl-5.8.8,REV=2007.03.16-SunOS5.8-sparc-CSW.pkg.gz
- cvs-1.11.22-sol10-sparc-local.gz
- rrdtool-1.2.19,REV=2007.02.07-SunOS5.8-sparc-CSW.pkg.gz
- libnet-1.0.2,REV=2004.04.08_rev=a-SunOS5.8-sparc-CSW.pkg.gz
- berkeleydb4-4.2.52,REV=2005.04.28_rev=p4-SunOS5.8-sparc-CSW.pkg.gz
- sasl-2.1.22,REV=2007.06.19-SunOS5.8-sparc-CSW.pkg.gz
- openldap_rt-2.3.35,REV=2007.04.14-SunOS5.8-sparc-CSW.pkg.gz
- hobbit-4.2.0,REV=2007.04.12-SunOS5.8-sparc-CSW.pkg.gz
- hobbit_client-4.2.0,REV=2007.04.12-SunOS5.8-sparc-CSW.pkg.gz
Notes
- To avoid the "hobbitd status-board not available" error message on the bbgen webpages, add "set ip:do_tcp_fusion = 0x0" to /etc/system to disable TCP fusion.
- References: http://www.hswn.dk/hobbiton/2007/04/msg00187.html
- Solaris 5.10 kernel patch 120011-14-1 fixes this bug: "6449337 kmem exhaustion caused by tcp fusion flow control logic error".
Hobbit Server: Solaris Intel 11/06 U3 VMware appliance on a 2GB flash pen drive
The main steps for building this portable Hobbit server are:
- Use VMware Server 1.0.1 to create a Solaris 10 VMware session.
- Create a 1.9G partition and select custom install.
- Modify the partition table to remove /export/home, leaving only /swap and /.
- Decrease the default 512M swap size to 300M.
- Select "Core group" (about 573M in size).
- Install the httpd server.
- Install the Hobbit server.
Hobbit Server and Development: Solaris Intel 11/06 U3 VMware appliance on a 4GB flash pen drive
- Use VMware Server 1.0.1 to create a Solaris 10 VMware session.
- Use VMware Player 1.0.3 to run it, so that DHCP works.
Hobbit Server Test site
- Solaris Intel 11/06 U3 VMware appliance on a 4GB flash pen drive
Operational difference between Hobbit and BB BTF
Servers
This table compares how common administration tasks are performed on a Hobbit server and on a BB server.
| Operation | Hobbit 4.2.0 and above | Big Brother BTF (Better Than Free, version 1.9c and above) |
| Start/stop server | ~/hobbit.sh start/stop | ~/runbb.sh start/stop |
| Delete a host | ~/bin/bb 127.0.0.1 "drop HOSTNAME [test]" | $BBHOME/bin/bbrm |
| Add a host | 1. add hostnames to bb-hosts | 1. add hostnames to bb-hosts |
| Log data path | 1. | 1. |
Clients
This table compares how common administration tasks are performed on a Hobbit client and on a BB client.
| Operation | Hobbit 4.2.0 and above | Big Brother BTF (Better Than Free, version 1.9c and above) |
| Add an external module | ~hobbit/client/etc/hobbitclient.cfg | $BBHOME/etc/bb-extab |
References
Capacity Planning
A rule of thumb is 5 MB of disk space on the Xymon server per monitored machine.
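Applied to a few fleet sizes, the rule of thumb works out as below. This is just the arithmetic; the function name is made up, and the host counts are examples, not measurements.

```python
MB_PER_HOST = 5  # rule of thumb from the text


def estimated_disk_mb(hosts, mb_per_host=MB_PER_HOST):
    """Estimate the disk space needed on the Xymon server, in MB."""
    return hosts * mb_per_host


for hosts in (100, 2500, 4000):
    print(hosts, "hosts ->", estimated_disk_mb(hosts), "MB")
```

For 4000 monitored machines this gives 20000 MB, i.e. roughly 20 GB.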
Installation
Windows
Client
- Run the BBWin 0.12 installer.
- Under HKEY_LOCAL_MACHINE\SOFTWARE\BBWin in the registry, set the computer name (as it appears in the bb-hosts file).
- Make the top of the config file in C:\Program Files\BBWin\etc (or C:\Program Files (x86)\BBWin\etc on Windows x64 systems) look like this:
<setting name="bbdisplay" value="xymon-server" />
<!-- bbwin mode local or central -->
<setting name="mode" value="central" />
<setting name="configclass" value="win32" />
- Start the service.
- Then, on the Xymon server, edit /home/xymon/server/etc/hobbit-clients.cfg and add:
#Hostname entries from bbwin clients.
#
HOST=[[new host name, as it appears in the bb-hosts file]]
LOAD 65 75 # Load thresholds are in %
DISK C 80 90
DISK D 90 95
MEMPHYS 75 101
MEMSWAP 75 85
MEMACT 75 85
PROC BBWin.exe 1 1
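To make the two numbers on the LOAD, DISK and MEM lines concrete, here is a sketch of how such a rule could be read and applied. It assumes the first number is the warning (yellow) level and the second the panic (red) level, and that a value equal to a level already triggers it; both function names are invented, and PROC lines, which carry process counts rather than levels, are not covered.

```python
def parse_rule(line):
    """Parse a 'KEYWORD [target] WARN PANIC' rule, ignoring a trailing comment."""
    fields = line.split("#", 1)[0].split()
    keyword, warn, panic = fields[0], float(fields[-2]), float(fields[-1])
    # DISK rules carry a target (the drive) between the keyword and the levels.
    target = fields[1] if len(fields) == 4 else None
    return keyword, target, warn, panic


def color_for(value, warn, panic):
    """Return the status color for a measured percentage."""
    if value >= panic:
        return "red"
    if value >= warn:
        return "yellow"
    return "green"


keyword, target, warn, panic = parse_rule("DISK C 80 90")
print(keyword, target, color_for(85, warn, panic))
```

Under these assumptions a panic level above 100, as in the MEMPHYS line, can never be reached by a percentage, so that check would go yellow but never red.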
Server
- /hobbit/server/etc/client-local.cfg:
[win32]
eventlog:Security ignore Success
eventlog:System ignore Information
eventlog:Application ignore Information
- Filtering in /hobbit/server/etc/hobbit-clients.cfg:
CLASS=win32
LOAD 80 90 # Load thresholds are in %
PROC BBWin.exe 1 1
PORT STATE=LISTENING MIN=0 TRACK=Listen TEXT=Listen
LOG %.* %error -.* COLOR=yellow
LOG eventlog:Security %failure.* COLOR=yellow
LOG eventlog:Application %warning.* COLOR=yellow
LOG eventlog:System %error.* COLOR=yellow
- Alternatively, you can use the following, but then every update to the event log is sent to the Xymon server instead of being filtered locally first.
CLASS=win32
LOAD 80 90 # Load thresholds are in %
PROC BBWin.exe 1 1
PORT STATE=LISTENING MIN=0 TRACK=Listen TEXT=Listen
LOG %.* %^error.* COLOR=red #IGNORE=TermServDevices \(
LOG %.* %^warning.* COLOR=yellow IGNORE=%.*TermServDevices.*
LOG %.* %^failure.* COLOR=yellow
Unix-like
- AIX
- Debian (Ubuntu)
- FreeBSD
- HP-UX
- IRIX
- Mandriva (xymon 4.2.3 is available in contrib as of 2009.0, prior to that hobbit was available in contrib)
- NSLU2 Unslung OS.
- RedHat Linux / RedHat Enterprise Linux / Fedora Core (http://rpm.razorsedge.org/ or http://staff.telkomsa.net/packages/)
- Solaris
Client
bash-2.05b# ls -lrt
-r-xr-xr-x 1 root administ 2891 Aug 9 2006 hobbitclient.sh
-r-xr-xr-x 1 root administ 3033 Aug 9 2006 hobbitclient-sunos.sh
-r-xr-xr-x 1 root administ 1841 Aug 9 2006 hobbitclient-sco_sv.sh
-r-xr-xr-x 1 root administ 1701 Aug 9 2006 hobbitclient-osf1.sh
-r-xr-xr-x 1 root administ 1904 Aug 9 2006 hobbitclient-openbsd.sh
-r-xr-xr-x 1 root administ 1907 Aug 9 2006 hobbitclient-netbsd.sh
-r-xr-xr-x 1 root administ 2512 Aug 9 2006 hobbitclient-linux.sh
-r-xr-xr-x 1 root administ 1834 Aug 9 2006 hobbitclient-irix.sh
-r-xr-xr-x 1 root administ 2070 Aug 9 2006 hobbitclient-hp-ux.sh
-r-xr-xr-x 1 root administ 2039 Aug 9 2006 hobbitclient-freebsd.sh
-r-xr-xr-x 1 root administ 1554 Aug 9 2006 hobbitclient-darwin.sh
-r-xr-xr-x 1 root administ 1971 Aug 9 2006 hobbitclient-aix.sh
-rwxr-xr-x 1 root root 832531 Feb 16 16:51 bb
-rwxr-xr-x 1 root root 695294 Feb 16 16:51 hobbitlaunch
-rwxr-xr-x 1 root root 676992 Feb 16 16:52 bbcmd
-rwxr-xr-x 1 root root 842123 Feb 16 16:52 bbhostgrep
-rwxr-xr-x 1 root root 670898 Feb 16 16:52 bbhostshow
-rwxr-xr-x 1 root root 716800 Feb 16 16:52 bbdigest
-rwxr-xr-x 1 root root 944795 Feb 16 16:53 logfetch
-rwxr-xr-x 1 root root 839071 Feb 16 16:53 clientupdate
-rwxr-xr-x 1 root root 830390 Feb 16 16:53 orcahobbit
-rwxr-xr-x 1 root root 698541 Feb 16 16:53 msgcache
bash-2.05b# ./bb
Hobbit version 4.2.0
Usage: ./bb [--debug] [--proxy=http://ip.of.the.proxy:port/] RECIPIENT DATA
RECIPIENT: IP-address, hostname or URL
DATA: Message to send, or "-" to read from stdin
bash-2.05b# uname -a
Linux LKG7BFA96 2.4.22-xfs #1 Sun Jun 12 21:17:17 PDT 2005 armv5b unknown
bash-2.05b# date
Sat Feb 17 11:45:50 CST 2007
bash-2.05b#
Server
Building from package source using TWW HPMS
The TWW Hyper Package Management System helps a software developer or system administrator create native packages in different formats for different operating systems. The package source for compiling and packaging the Hobbit client and server software is in XML format, so builds can be repeated reliably with TWW's sb and pb tools.
The package source for the Hobbit server and client is GPL licensed and available on TWW's support FTP server.
Building from src RPM
Sometimes it's better to build your own RPMs specifically for your environment. If you are using RH Enterprise or CentOS, the Fedora Core or generic RPM may not install correctly. You could also run into this problem if you have versions of dependent libraries that are not compatible with the system that the RPM was built on.
In order to build the src RPM, you'll need several packages:
- openssl-devel, openldap-devel, and pcre-devel from the CentOS CDs.
- You may also have to make a link from /usr/include/pcre/pcre.h to /usr/include/pcre.h
- rrdtool-devel
- I recommend getting this from the DAG repository
- fping
- Also available from the DAG repository
RPMs from a matching version of RHEL usually work on CentOS without problems (for example, RPMs for EL 4 work fine on CentOS 4).
Once you have all the dependencies installed, download the src RPM from SourceForge. Once you have that, just run rpmbuild --rebuild hobbit-xxxx.src.rpm. For example:
rpmbuild --rebuild hobbit-4.1.0-1.src.rpm
The rpmbuild command should compile and build the RPM for you. You can watch the compiler output for any problems. After it is done, you should have new RPMs in the /usr/src/redhat/RPMS/i386 directory (assuming your architecture is i386). This process will build both server and client RPMs for your system. The server RPM also includes the client, so it is not necessary to install both of them.
Ubuntu
With Synaptic, install the PCRE and RRDtool libraries. Then download Xymon and unpack it.
Launch a terminal (Ctrl+Alt+T) and enter the commands below to install the software in your HTTP directory. Example with Apache:
$ adduser xymon
$ cd /home/Desktop/xymon
$ ./configure.server
[...]
Where do you want the Xymon installation [/home/xymon] ?
/var/www/xymon
[...]
What group-ID does your webserver use [nobody] ?
xymon
[...]
$ make
[...]
Now run 'make install' as root
$ make install
[...]
Installation complete.
You must configure your webserver for the Xymon webpages and CGI-scripts.
A sample Apache configuration is in /var/www/xymon/server/etc/xymon-apache.conf
If you have your Administration CGI scripts in a separate directory,
then you must also setup the password-file with the htpasswd command.
To start Xymon, as the xymon user run '/var/www/xymon/server/bin/xymon.sh start'
To view the Xymon webpages, go to http://localhost/xymon
If this has not already been done, configure Apache to execute the CGI programs:
$ vim /etc/apache2/httpd.conf
# Add the following lines without the sharps and save:
<Directory /var/www/*>
Options +ExecCGI
AddHandler cgi-script .cgi
</Directory>
$ /etc/init.d/apache2 restart
$ su xymon /home/xymon/server/bin/xymon.sh start
Xymon started
Finally, test the software: http://localhost/xymon/server/bin/confreport.cgi
Hobbit in HA
There are two approaches to implementing High Availability for Xymon servers: HA-LAN and HA-WAN. Pick one of them according to your network structure.
HA-LAN approach
This approach uses clustering software to fail over between a set of Xymon servers. Each OS has its own clustering software: on Linux you can use Linux-HA plus DRBD, and on Solaris there is Sun Cluster.
The drawback of this approach is that High Availability is limited to the LAN, not the WAN. The clustered servers need to reside on the same LAN subnet, so if the whole cluster site goes down, the Xymon clients have nowhere to send their messages.
HA-LAN using LinuxHA and DRBD
HA-LAN using Solaris Sun Cluster software plus TrueCopy
HA-WAN approach
For networks that span states or countries, failing over from a primary Xymon server to a standby server across the WAN is not an easy networking task.
The following HA-WAN architecture can fail over without involving the network team in DNS or routing changes.
hobbit.test.com hobbit2.test.com
| Primary | Standby Xymon server
| <----- heart beat -----> |
LAN1 | | LAN2
-------------------------- -------------------------
^ ^ ^ ^ ^ ^
| | | | | |
| --------------------------------------- | |
| | | ---------------------------- |
| | | | |-------------------------- |
| | | | | |
hobbitc A hobbitc B hobbitc C
LAN 3 LAN 4 LAN 5
LAN1: California
LAN2: Brazil
LAN3: Argentina
LAN4: Mexico
LAN5: Japan
Requirements
- A script that can detect failure of the hobbit.test.com services.
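Such a detection script could be built around a sketch like the following. It only checks whether a TCP connection to port 1984 can be opened and requires several consecutive failures before declaring the primary down; the function names, the three attempts and the timeout are assumptions for illustration, and a production script would likely also check that the daemon answers sensibly.

```python
import socket


def service_is_up(host, port=1984, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def primary_has_failed(check, attempts=3):
    """Declare failure only if every one of several checks fails."""
    return not any(check() for _ in range(attempts))


# Demonstration with stand-in checks instead of real network calls.
print(primary_has_failed(lambda: False))  # every attempt failed
print(primary_has_failed(lambda: True))   # the service answered
```

On the standby server the check would be something like `lambda: service_is_up("hobbit.test.com")`, with the alert module enabled once `primary_has_failed` returns True.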
Notes
- The pager module on hobbit2.test.com is disabled.
- hobbit2.test.com and hobbit.test.com reside at different sites connected by a WAN.
- Hobbit clients are not tied to hobbit.test.com alone.
- Each Hobbit client sends its messages to both hobbit.test.com and hobbit2.test.com.
- hobbit2.test.com has everything hobbit.test.com has; when it becomes active, it sends out alerts for hobbit.test.com under its own name.
- There is no need for IP failover from hobbit.test.com to hobbit2.test.com.
Pros
- No need to alter the existing network configuration.
Cons
- Network bandwidth use increases, because the same message is sent to two different servers.
HA-WAN 2 approach
From Patrick: we have 3 data centres and each data centre contains a Xymon server. All clients in a data centre report only to their local Xymon server. However, the Xymon servers can communicate with each other using BBDISPLAYS (it's a little more complicated than that, as we utilise a bbproxy in each DC to take the messages and spray them to all 3 Xymon servers).
hobbit1.test.com hobbit2.test.com
| Primary | Standby Xymon server
| <----- bbproxy -----> |
LAN1 | | LAN2
-------------------------- -------------------------
^ ^ ^ ^
| | | |
| | | |
| | | |
| | | |
hobbitc A hobbitc B hobbitc C
LAN1= has hobbitc A,B
LAN2= has hobbitc C
HA-WAN3 approach
This is a two-node, loosely coupled Hobbit cluster across a WAN. The following challenges need to be resolved:
- The hobbit.test.com DNS entry needs to fail over from hobbit1 to hobbit2 when hobbit1 is down.
- The web pages on hobbit1 and hobbit2 are not in sync.
- Maintenance records are not in sync between the two servers.
- The RRD databases on the two Hobbit servers are out of sync after either server has been down for a while.
hobbit.test.com
-> hobbitdynamic.test.com (using CISCO DD software).
-> hobbit1.test.com
-> hobbit2.test.com
hobbit1.test.com hobbit2.test.com
| Primary | Standby Xymon server
| <----- 1985 heart beat -----> |
| <----- 1986 history -----> |
| <----- 1987 heart beat -----> |
LAN1 | | LAN2
-------------------------- -------------------------
^ ^ ^ ^ ^ ^
| | | | | |
| --------------------------------------- | |
| | | ---------------------------- |
| | | | |-------------------------- |
| | | | | |
hobbitc A hobbitc B hobbitc C
LAN 3 LAN 4 LAN 5
LAN1: California
LAN2: Brazil
LAN3: Argentina
LAN4: Mexico
LAN5: Japan
Requirements
- A script that can detect failure of the hobbit.test.com services.
Notes
- The pager module on hobbit2.test.com is disabled.
- hobbit2.test.com and hobbit.test.com reside at different sites connected by a WAN.
- Hobbit clients are not tied to hobbit.test.com alone.
- Each Hobbit client sends its messages to both hobbit.test.com and hobbit2.test.com.
- hobbit2.test.com has everything hobbit.test.com has; when it becomes active, it sends out alerts for hobbit.test.com under its own name.
- There is no need for IP failover from hobbit.test.com to hobbit2.test.com.
Pros
- No need to alter the existing network configuration.
Cons
- Network bandwidth use increases, because the same message is sent to two different servers.
Hobbit HA on LAN
hobbit.test.com hobbit2.test.com
| HA Software |
| <- heart beat -> |
| | LAN1: 192.168.1.0
----------------------------------------------------------------
^ ^ ^
| | |
| | ---------------------------
| | |
| | |
| |
hobbitc A hobbitc B hobbitc C
LAN 2 LAN 3 LAN4
LAN1: California
LAN2: Brazil
LAN3: Argentina
LAN4: Mexico
Notes
- HA Software = Sun Cluster 3.2 + Sun AVS
- hobbit2.test.com and hobbit.test.com reside on the same subnet (same site).
- Cluster software (Sun Cluster 3.2) is used to fail over hobbit.test.com.
- Each Hobbit client sends messages to hobbit.test.com only.
- hobbit2.test.com has everything hobbit.test.com has.
- hobbit2.test.com monitors hobbit.test.com and will assume hobbit.test.com's identity.
- identity: MAC address and IP address of hobbit.test.com
Pros
- Close to real-time fail-over.
Cons
- Fail-over happens only within the LAN, not across the WAN.
SunCluster
Free, open source clustering software from Sun. Commercial technical support is available.
- Using two sol-nv-b68-x86 VMware sessions with Sun Cluster express 07/07.
References
- http://www.opensolaris.org/os/community/ha-clusters
- http://www.sun.com/software/solaris/howtoguides/twonodecluster.jsp
- Analyzing the Application for Suitability
- Using AVS, not TrueCopy
FST HA
An open source clustering solution specifically for Solaris.
Hobbit Configuration and tuning
Hobbit(bb) port 1984 encryption
- References: http://www.stunnel.org/
Plain-text bb messages are an obstacle to using Hobbit as an enterprise solution where high security standards are required. The following is an attempt to make your CIO smile on the Hobbit solution.
- Machine A: runs both the HB server and the stunnel server.
- Machine B: is a BB client.
- Machine C: is a Hobbit client with the stunnel client enabled. The hb client sends its bb messages via the encrypted port 1999.
- Machine D: is a HB client.
- Note: the old bb protocol is one-way; hb's bb protocol is bi-directional.
Machine A (192.168.1.111)
---------------------------
HB Server process | <---------port 1984 <--------- BB client (Machine B)
| |
|1984 | <---------port 1984 ---------> HB client (Machine D)
| |
Stunnel Server process 1999 | <-------- port 1999 ----------> 1999 Stunnel Client
---------------------------- | (Machine C 192.168.1.141)
|
--1984 ---HB client
Configure stunnel server to run in hobbit server
- The stunnel config file on the server directs port 1999 to the local port 1984.
accept = 1999, we accept any incoming bb message on port 1999.
connect = 127.0.0.1:1984, redirect 1999 to 1984 on hb server itself.
bash-3.00# cat /opt/stunnel420/etc/stunnel/stunnel.conf
<snip>
[hobbit-server]
accept = 1999
connect = 1984
<snip>
bash-3.00#
- Start the stunnel server on machine A. The output shows that the hobbit-server port redirection is OK.
bash-3.00# /etc/init.d/stunnel420 start
Starting universal SSL tunnel: stunnel
2007.04.29 06:47:50 LOG7[1898:1]: RAND_status claims sufficient entropy for the PRNG
2007.04.29 06:47:50 LOG7[1898:1]: PRNG seeded successfully
2007.04.29 06:47:50 LOG7[1898:1]: Certificate: /opt/stunnel420/etc/stunnel/stunnel.pem
2007.04.29 06:47:50 LOG7[1898:1]: Certificate loaded
2007.04.29 06:47:50 LOG7[1898:1]: Key file: /opt/moto/stunnel420/etc/stunnel/stunnel.pem
2007.04.29 06:47:50 LOG7[1898:1]: Private key loaded
2007.04.29 06:47:50 LOG7[1898:1]: SSL context initialized for service pop3s
2007.04.29 06:47:50 LOG7[1898:1]: Certificate: /opt/stunnel420/etc/stunnel/stunnel.pem
2007.04.29 06:47:50 LOG7[1898:1]: Certificate loaded
2007.04.29 06:47:50 LOG7[1898:1]: Key file: /opt/stunnel420/etc/stunnel/stunnel.pem
2007.04.29 06:47:50 LOG7[1898:1]: Private key loaded
2007.04.29 06:47:50 LOG7[1898:1]: SSL context initialized for service hobbit-server
.
bash-3.00#
- Make sure stunnel is running.
bash-3.00# ps -eaf |grep stunnel
nobody 1984 1 0 06:55:00 ? 0:00 /opt/stunnel420/sbin/stunnel
root 2133 1811 0 07:04:32 pts/2 0:00 grep stunnel
bash-3.00#
- Test port 1999 on the Hobbit server directly: type a garbage message such as "asdf", then press Control+D to quit.
bash-3.00# telnet machineA.test.com 1999
Trying 192.168.1.111...
Connected to machineA.test.com.
Escape character is '^]'.
asdf
Connection to machineA.test.com closed by foreign host.
bash-3.00#
- The stunnel log file on machine A shows an incoming message on port 1999 from 192.168.1.141 (machine C). The "wrong version number" error is expected for this test, because telnet sends plain text rather than an SSL handshake.
bash-3.00# tail -10f /opt/stunnel420/etc/stunnel/stunnel.log
2007.04.29 06:55:00 LOG5[1983:1]: 125 clients allowed
2007.04.29 06:55:00 LOG7[1983:1]: FD 4 in non-blocking mode
2007.04.29 06:55:00 LOG7[1983:1]: FD 5 in non-blocking mode
2007.04.29 06:55:00 LOG7[1983:1]: FD 6 in non-blocking mode
2007.04.29 06:55:00 LOG7[1983:1]: SO_REUSEADDR option set on accept socket
2007.04.29 06:55:00 LOG7[1983:1]: pop3s bound to 0.0.0.0:995
2007.04.29 06:55:00 LOG7[1983:1]: FD 7 in non-blocking mode
2007.04.29 06:55:00 LOG7[1983:1]: SO_REUSEADDR option set on accept socket
2007.04.29 06:55:00 LOG7[1983:1]: hobbit-server bound to 0.0.0.0:1999
2007.04.29 06:55:00 LOG7[1984:1]: Created pid file /stunnel.pid
2007.04.29 06:55:35 LOG7[1984:1]: hobbit-server accepted FD=0 from 192.168.1.141:38764
2007.04.29 06:55:35 LOG7[1984:2]: hobbit-server started
2007.04.29 06:55:35 LOG7[1984:2]: FD 0 in non-blocking mode
2007.04.29 06:55:35 LOG7[1984:2]: TCP_NODELAY option set on local socket
2007.04.29 06:55:35 LOG5[1984:2]: hobbit-server accepted connection from 192.168.1.141:38764
2007.04.29 06:55:35 LOG7[1984:2]: SSL state (accept): before/accept initialization
2007.04.29 06:55:39 LOG3[1984:2]: SSL_accept: 1408F10B: error:1408F10B:SSL routines:SSL3_GET_RECORD:wrong version number
2007.04.29 06:55:39 LOG5[1984:2]: Connection reset: 0 bytes sent to SSL, 0 bytes sent to socket
2007.04.29 06:55:39 LOG7[1984:2]: hobbit-server finished (0 left)
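When the debug log grows, reading it line by line gets tedious. The following is a minimal sketch, not part of stunnel or Hobbit, that counts accepted connections and failed SSL handshakes for one service. The function name summarize_stunnel_log is hypothetical, and the field positions assume the log layout shown above (date, time, LOGn[pid:thread]:, then the message).

```shell
#!/bin/sh
# Hypothetical helper: summarize a stunnel debug log read from stdin.
# $1 is the service name as it appears in stunnel.conf, e.g. hobbit-server.
summarize_stunnel_log() {
    service="$1"
    awk -v svc="$service" '
        # e.g. "... LOG5[1984:2]: hobbit-server accepted connection from IP:port"
        $4 == svc && $5 == "accepted" && $6 == "connection" { accepted++ }
        # e.g. "... LOG3[1984:2]: SSL_accept: 1408F10B: error:..."
        $4 == "SSL_accept:" { failed++ }
        END { printf "accepted=%d ssl_accept_errors=%d\n", accepted, failed }
    '
}
```

A possible use is `summarize_stunnel_log hobbit-server < /opt/stunnel420/etc/stunnel/stunnel.log`. Note that the SSL_accept error lines do not carry the service name, so the error count covers all services in the log.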
Configuring the Hobbit client to use port 1999
- Add hobbitclientLocalIP to BBDISPLAYS in the hobbitclient.cfg file, so that the Hobbit client sends its BB messages to itself. The local stunnel client accepts them on port 1984 and forwards them to port 1999 on the Hobbit server.
bash-3.00# grep ^BBDISPLAYS /etc/opt/hobbitclient42/hobbitclient.cfg
BBDISPLAYS="myotherhobbitserver.my.com hobbitclientLocalIP" # IP of multiple Hobbit servers. BBDISP must be "0.0.0.0".
bash-3.00#
bash-3.00# egrep -v '^;|^$' /opt/stunnel420/etc/stunnel/stunnel.conf
cert = /opt/stunnel420/etc/stunnel/stunnel.pem
sslVersion = SSLv3
chroot = /opt/stunnel420/var/lib/stunnel/
setuid = nobody
setgid = nogroup
pid = /stunnel.pid
socket = l:TCP_NODELAY=1
socket = r:TCP_NODELAY=1
debug = 7
output = stunnel.log
client = yes
[hobbitclient]
connect = hbServerRemoteIP:1999
accept = hbLocalIP:1984
bash-3.00#
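If many clients need the same tunnel settings, the client-specific part of stunnel.conf could be generated rather than edited by hand. This is only a sketch under stated assumptions: make_client_section is a hypothetical helper, the two arguments stand in for hbServerRemoteIP and hbLocalIP from the listing above, and the certificate, chroot and logging options still have to be added separately.

```shell
#!/bin/sh
# Hypothetical helper: print the client-side tunnel settings for one host.
# $1 = address of the Hobbit server (stunnel server listening on 1999)
# $2 = local address the Hobbit client sends to (port 1984)
make_client_section() {
    server="$1"
    local_ip="$2"
    cat <<EOF
client = yes
[hobbitclient]
connect = ${server}:1999
accept = ${local_ip}:1984
EOF
}
```

For example, `make_client_section 192.168.1.111 127.0.0.1` prints a section that accepts on the loopback address and connects to machine A from the diagram.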
- A successful Hobbit client connection tunneled to the Hobbit server through port 1999.
bash-3.00# grep 06:50 stunnel.log
2007.08.19 00:06:50 LOG7[14842:1]: hobbitclient accepted FD=0 from HobbitclientIP:63758
2007.08.19 00:06:50 LOG7[14842:3]: hobbitclient started
2007.08.19 00:06:50 LOG7[14842:3]: FD 0 in non-blocking mode
2007.08.19 00:06:50 LOG7[14842:3]: TCP_NODELAY option set on local socket
2007.08.19 00:06:50 LOG5[14842:3]: hobbitclient accepted connection from HobbitclientIP:63758
2007.08.19 00:06:50 LOG7[14842:3]: FD 1 in non-blocking mode
2007.08.19 00:06:50 LOG7[14842:3]: hobbitclient connecting HobbitServerIP:1999
2007.08.19 00:06:50 LOG7[14842:3]: connect_wait: waiting 10 seconds
2007.08.19 00:06:50 LOG7[14842:3]: connect_wait: connected
2007.08.19 00:06:50 LOG5[14842:3]: hobbitclient connected remote server from HobbitclientIP:63759
2007.08.19 00:06:50 LOG7[14842:3]: Remote FD=1 initialized
2007.08.19 00:06:50 LOG7[14842:3]: TCP_NODELAY option set on remote socket
2007.08.19 00:06:50 LOG7[14842:3]: SSL state (connect): before/connect initialization
2007.08.19 00:06:50 LOG7[14842:3]: SSL state (connect): SSLv3 write client hello A
2007.08.19 00:06:50 LOG7[14842:3]: SSL state (connect): SSLv3 read server hello A
2007.08.19 00:06:50 LOG7[14842:3]: SSL state (connect): SSLv3 read finished A
2007.08.19 00:06:50 LOG7[14842:3]: SSL state (connect): SSLv3 write change cipher spec A
2007.08.19 00:06:50 LOG7[14842:3]: SSL state (connect): SSLv3 write finished A
2007.08.19 00:06:50 LOG7[14842:3]: SSL state (connect): SSLv3 flush data
2007.08.19 00:06:50 LOG7[14842:3]: 1 items in the session cache
2007.08.19 00:06:50 LOG7[14842:3]: 2 client connects (SSL_connect())
2007.08.19 00:06:50 LOG7[14842:3]: 2 client connects that finished
2007.08.19 00:06:50 LOG7[14842:3]: 0 client renegotiations requested
2007.08.19 00:06:50 LOG7[14842:3]: 0 server connects (SSL_accept())
2007.08.19 00:06:50 LOG7[14842:3]: 0 server connects that finished
2007.08.19 00:06:50 LOG7[14842:3]: 0 server renegotiations requested
2007.08.19 00:06:50 LOG7[14842:3]: 1 session cache hits
2007.08.19 00:06:50 LOG7[14842:3]: 0 session cache misses
2007.08.19 00:06:50 LOG7[14842:3]: 0 session cache timeouts
2007.08.19 00:06:50 LOG6[14842:3]: SSL connected: previous session reused
2007.08.19 00:06:50 LOG7[14842:3]: Socket closed on read
2007.08.19 00:06:50 LOG7[14842:3]: SSL write shutdown
2007.08.19 00:06:50 LOG7[14842:3]: SSL alert (write): warning: close notify
2007.08.19 00:06:50 LOG6[14842:3]: SSL socket closed on SSL_shutdown
2007.08.19 00:06:50 LOG7[14842:3]: Socket write shutdown
2007.08.19 00:06:50 LOG5[14842:3]: Connection closed: 30068 bytes sent to SSL, 0 bytes sent to socket
2007.08.19 00:06:50 LOG7[14842:3]: hobbitclient finished (0 left)
bash-3.00#
32-bit vs 64-bit binary for Hobbit on Solaris
- This article describes this subject in great detail.
Configuration
LDAP Authentication
Example httpd.conf (Apache 2.0.x with LDAP authentication against Active Directory):
Substitute LDAPSERVER.DOMAIN.COM with your LDAP server.
<USERNAME>: an account with permission to view the LDAP directory.
<PASSWORD>: the password for that account (you should limit what this account can do).
<Directory "/var/hobbit/cgi-secure">
    AllowOverride None
    Options ExecCGI Includes
    Order allow,deny
    Allow from all
    AuthType Basic
    AuthName "Hobbit Administration"
    AuthLDAPEnabled on
    AuthLDAPURL ldap://LDAPSERVER.DOMAIN.COM:389/dc=DOMAIN,dc=COM?sAMAccountName?sub?(objectClass=person)
    AuthLDAPBindDN "cn=<USERNAME>,cn=Users,dc=DOMAIN,dc=COM"
    AuthLDAPBindPassword <PASSWORD>
    require valid-user
</Directory>
The same for a Novell eDirectory LDAP server:
<Directory "/usr/lib/hobbit/cgi-secure">
    AllowOverride None
    Options ExecCGI Includes
    Order allow,deny
    Allow from all
    AuthName "Hobbit-Admin"
    AuthType Basic
    AuthLDAPURL ldap://LDAPSERVER.DOMAIN.COM/o=TREE,ou=Users?cn?sub?(groupMembership=cn=your_group,ou=groups,o=TREE)
    require valid-user
</Directory>
Alerts setting
- Pager
Using sms_client [smsclient.org]
Create a shell script (/usr/bin/hobbitsms) like this:
#!/bin/bash
# Log each page and send it by SMS. RECOVERED=1 marks a recovery message.
if [ "$RECOVERED" != 1 ]; then
    echo "$RCPT" \"HOBBIT : $BBHOSTSVC is $BBCOLORLEVEL\" >> /var/log/hobbit/page.log
    /usr/bin/sms_client "$RCPT" "HOBBIT : $BBHOSTSVC is $BBCOLORLEVEL"
else
    echo "$RCPT" \"HOBBIT : $BBHOSTSVC is OK again\" >> /var/log/hobbit/page.log
    /usr/bin/sms_client "$RCPT" "HOBBIT : $BBHOSTSVC is OK"
fi
Edit hobbit-alerts.cfg and add the lines for the alerts you want to receive:
SCRIPT /usr/bin/hobbitsms hobbit DURATION>5 FORMAT=SMS REPEAT=180 COLOR=red TIME=W:0730:1800 RECOVERED
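Before wiring a script into hobbit-alerts.cfg, it can help to check the message text without sending a real page. The sketch below is an assumption-laden dry run: build_sms_text is a hypothetical function that mirrors the branch logic of the hobbitsms script above, and the variable values in the usage note are made up for illustration.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the text the pager script would send.
# Reads RECOVERED, BBHOSTSVC and BBCOLORLEVEL from the environment, as the
# alert script above does; an unset RECOVERED is treated as "not recovered".
build_sms_text() {
    if [ "${RECOVERED:-0}" != "1" ]; then
        printf 'HOBBIT : %s is %s\n' "$BBHOSTSVC" "$BBCOLORLEVEL"
    else
        printf 'HOBBIT : %s is OK\n' "$BBHOSTSVC"
    fi
}
```

Running `RECOVERED=0 BBHOSTSVC=www.example.com.http BBCOLORLEVEL=red build_sms_text` in a shell where the function is defined shows the alert text; setting RECOVERED=1 shows the recovery text.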
- Pager
Using snpp [sendpage.org]
Create a shell script (/usr/bin/hobbitsnpp) like this:
#!/bin/bash
/usr/bin/snpp -n "$RCPT" <<SCRIPTEOF
$BBALPHAMSG
SCRIPTEOF
- Email.
Tuning
How to shorten the Xymon server's name lookup time?
Every five minutes the Xymon server performs a large number of name lookups (nslookup) for the machines that need to be pinged.
Install a local DNS cache server so that these lookups are answered locally. I use djbdns for this.
How to shorten the ping test time?
Hobbit and Remedy Ticket System
Overview
The Remedy ticket system has a web interface for opening a ticket in a particular ticket queue.
The Perl approach uses the following software to automate the ticket request when an alert occurs:
- perl
- LWP
- trouble_ticket.tgz on http://www.deadcat.net
- an entrance URL on the Remedy server web interface.
- A Perl subroutine to open a Remedy ticket.
Open Remedy ticket on Hobbit alerts
Open Remedy ticket on demand
Migration from BB
Cost (efforts) of Migration
System and Inventory Monitoring
System monitoring and inventory monitoring can be achieved with an external module that reports a system's inventory information. (TBC)
Troubleshooting Guide
Q. When I click on a status icon I get the message "Status not available". What should I check?
A. First make sure that the server is actually running.
ps -ef | grep hobbitd
You should see several processes similar to:
hobbit 32717 32716 0 Nov07 ? 00:01:07 hobbitd --pidfile....
hobbit 32726 32716 0 Nov07 ? 00:00:03 hobbitd_channel --channel=page...
hobbit 32727 32716 0 Nov07 ? 00:01:58 hobbitd_channel --channel=status...
hobbit 32728 32716 0 Nov07 ? 00:00:01 hobbitd_channel --channel=data...
hobbit 32725 32716 0 Nov07 ? 00:00:00 hobbitd_channel --channel=stachg...
If the server fails to start, look in the Hobbit logs directory. One common location is:
/var/log/hobbit
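The process check above can be scripted so that a missing channel worker stands out. This is a rough sketch, not a Hobbit tool: missing_channels is a hypothetical function, and the channel list (page, status, data, stachg) is taken only from the sample ps output above, so a given installation may run more or fewer channels.

```shell
#!/bin/sh
# Hypothetical helper: read "ps -ef" output on stdin and print the name of
# every expected hobbitd_channel worker that is not running.
missing_channels() {
    ps_output=$(cat)
    for ch in page status data stachg; do
        case "$ps_output" in
            *"hobbitd_channel --channel=$ch"*) ;;   # worker found
            *) echo "$ch" ;;                        # worker missing
        esac
    done
}
```

Typical use would be `ps -ef | missing_channels`; no output means all four workers from the sample listing were found.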
Q. After installing the Hobbit client, my msgs tests are "clear" (sometimes referred to as "white")
A. As of the time of this writing, the Hobbit client does NOT have msgs functionality like the BB client does. This can be added by installing the bb-msgs.sh file from the BB client as an external test. Even so, the Hobbit server will turn the test to "clear" instead of the expected status. To correct this issue, edit the hobbitlaunch.cfg file (usually found in /etc/hobbit/ or /usr/lib/hobbit/server/etc/), add --no-clear-msgs to the client channel, and restart the server:
CMD hobbitd_channel --channel=client hobbitd_client --no-clear-msgs --log=$BBSERVERLOGS/clientdata.log ...
Q. Tried to down BOARDBUSY: Invalid argument
A.
On Sat, Dec 09, 2006 at 12:08:02PM -0500, Geoff Hallford wrote:
> I am getting the following error message in various Hobbit logs:
>
> 2006-12-04 07:59:46 Tried to down BOARDBUSY: Invalid argument
>
> Does anyone know what this is referring to or what I need to change?
It often shows up when stopping Hobbit - you can ignore it.
Regards, Henrik
Hobbit clients in DMZ zone
DMZ with NAT
- diagram
DMZ with restricted Firewall
- diagram