Mon, 04 Feb 2008
A Review of OpenNMS
Neil H. Watson
Abstract: This is a review of OpenNMS, version 1.3.9, an open source network monitoring system.
Contents
- 1 Introduction
- 2 Installation
- 3 Configuration
- 4 Events, alarms and notifications
- 5 Surveillance and the dashboard
- 6 User authentication
- 7 Customization
- 8 Conclusions
- A References
1 Introduction
OpenNMS is an open source network monitoring system, loosely inspired by HP’s OpenView, Netcool, Spectrum and Tivoli. OpenNMS is Java based and uses XML configuration files, RRD-style history graphs, a PostgreSQL backend, and SNMP and other clients to gather information. The product is designed to scale and monitor thousands of “nodes”, which are network devices such as servers, routers or switches.
OpenNMS has two separate branches. The 1.2.x branch is the “stable” branch, where updates provide only bug fixes and no new features. The 1.3.x branch is the “unstable” branch, which provides both bug fixes and new features. The unstable branch is not so unstable that it cannot be used for production, and it has many new features such as support for SNMPv3. Just be prepared for the possibility that new bugs are introduced.
Official OpenNMS documentation can be found at http://www.opennms.org. Careful study of it is important in the planning and execution of an OpenNMS deployment. Commercial support is also available.
2 Installation
OpenNMS can be installed from source if necessary, but thankfully it comes prebuilt as packages for Linux systems (e.g. Apt, RPM and Yum) and even for OS X and Windows. Following the instructions carefully, I was able to easily install OpenNMS, using Yum, on a CentOS 5 server.
3 Configuration
While installation was easy, configuration requires research and planning. OpenNMS has different parts that each need to be configured in order for it to work as a whole. All configuration files mentioned here are found in the OpenNMS “etc” directory.
Discovery enables OpenNMS to probe a defined group of IP addresses and add them as nodes to its database.
Capabilities remotely probe each newly discovered node for running services. These services are then monitored and graphed. OpenNMS comes with a good list of preconfigured services that are monitored, after discovery, by default. Additionally, OpenNMS allows you to add your own customized service probes. In most cases this can be as simple as altering some configuration, without having to write your own remote agent.
Data collection instructs OpenNMS what information to collect, usually via SNMP, and store for historical graphing and performance monitoring.
4 Events, alarms and notifications
It is important to understand how OpenNMS records what changes it sees amongst the nodes it monitors. At the lowest level are events. The loss of a monitored service, a node going down, a new node being discovered and the reception of an SNMP trap are all examples of events. Each event has its own ID number. The ID number is used to store them in the database. At a higher layer events are categorized by “UEI”. UEI stands for unique event identifier. For example, a UEI may identify a node going down or an HTTP service failure. As the name implies, the UEI differentiates this event from any other. This will be used later to generate alarms and notifications. When studied on their own, events can be a very long list, making the identification of important versus unimportant events time consuming. This is what alarms are for.
Alarms are configured to be generated upon the reception of certain events, such as a service or node going down. Alarms are separate from events and easy to see, either by searching for active alarms or by viewing them on the dashboard. The dashboard is a page of the OpenNMS web interface that gives an overview of the current state of the network that OpenNMS monitors. Alarms can also be configured to clear themselves once a relevant event is generated (e.g. a node has come up). Watching alarms appear on the web page is not always practical. Notifications allow OpenNMS to send notification of an alarm remotely.
Notifications are also configured to trigger when certain events are generated. Notifications can be sent to recipients using such methods as email, paging and Jabber IM. Notifications also appear on the dashboard. The goal is that notifications can be configured to escalate to one or more persons until someone acknowledges them. Notifications on the dashboard are identified as acknowledged or not. This enables staff to understand who is working on an issue. Like alarms, notifications can acknowledge themselves if a situation is resolved.
5 Surveillance and the dashboard
Surveillance is a way to organize your nodes into groups that are meaningful to you. For example, nodes may be organized into service groups such as “production”, “infrastructure” or “www.example.com”. If your organization is large you may have certain people or teams of people that are responsible for different groups of nodes. Using surveillance groups you’ll be able to configure what node groups users of the OpenNMS web page are allowed to view.
The dashboard uses surveillance groups as its guide for what to display. This ensures that users see only what they need to see. This offers both security and the ability for users to focus only on the nodes they are responsible for. Additionally the dashboard uses Ajax to allow users to call up information, such as performance graphs, and page through alarms and notifications quickly and efficiently.
6 User authentication
By default OpenNMS starts with a single user called “admin”. Users are defined in the “users.xml” file and are created and managed using the OpenNMS web page. It is also possible to configure OpenNMS to authenticate with a remote LDAP service.
7 Customization
Network monitoring is a complicated task. The topology and service content of every organization’s network is different. Keeping this in mind the OpenNMS developers have allowed for significant customization. Let’s look at some custom examples.
7.1 Thresholding Linux system loads
Linux keeps three numbers that represent system load for the past one, five and fifteen minutes. These load numbers are available via SNMP. In this example we configure OpenNMS to issue alarms and notifications when any of the load numbers surpass a defined number. Additionally, we will configure OpenNMS to automatically clear alarms and acknowledge notifications when these load numbers drop below an acceptable level. Note that you should not consider this a definitive “howto”. OpenNMS is in a constant state of change. Please consult the official OpenNMS documentation.
Data collection, currently stored in “datacollection-config.xml”, gives OpenNMS the ability to collect these load numbers. They can be found in the file by default.
<!-- datacollection-config.xml -->
<group name="ucd-loadavg" ifType="ignore">
<mibObj oid=".1.3.6.1.4.1.2021.10.1.5"
instance="1" alias="loadavg1" type="integer" />
<mibObj oid=".1.3.6.1.4.1.2021.10.1.5"
instance="2" alias="loadavg5" type="integer" />
<mibObj oid=".1.3.6.1.4.1.2021.10.1.5"
instance="3" alias="loadavg15" type="integer" />
</group>
7.1.1 Thresholds
OpenNMS has knowledge of the SNMP MIB object that contains the three different load average numbers. Using this, we can configure OpenNMS to perform “thresholding” on these objects. In “thresholds.xml” we add our new thresholds to the “default-snmp” group already defined.
<!-- thresholds.xml -->
<!-- note that this number is 100 times... -->
<!-- the number listed by uptime or top -->
<threshold type="high" ds-name="loadavg1" ds-type="node"
ds-label="1min-load"
triggeredUEI="uei.opennms.org/mssd/1min-load-trigger"
rearmedUEI="uei.opennms.org/mssd/1min-load-rearm"
value="400" rearm="200" trigger="1"/>
<threshold type="high" ds-name="loadavg5" ds-type="node"
ds-label="5min-load"
triggeredUEI="uei.opennms.org/mssd/5min-load-trigger"
rearmedUEI="uei.opennms.org/mssd/5min-load-rearm"
value="300" rearm="200" trigger="1"/>
<threshold type="high" ds-name="loadavg15" ds-type="node"
ds-label="15min-load"
triggeredUEI="uei.opennms.org/mssd/15min-load-trigger"
rearmedUEI="uei.opennms.org/mssd/15min-load-rearm"
value="200" rearm="100" trigger="1"/>
There is a lot to consider here. In this stanza there are actually three separate thresholds defined. The first is for the one minute load number. Let’s break this down.
- type=“high”: This defines the threshold as a “high” type. High thresholds trigger events when the monitored number exceeds the given value. Conversely, “low” thresholds trigger events when the monitored number falls below the given value. Threshold types can also be defined using relative changes and mathematical expressions. See the official OpenNMS documentation for more information.
- ds-name=“loadavg1”: This defines the variable to be monitored. It points back to the “alias” defined earlier in “datacollection-config.xml”.
- ds-type=“node”: This defines the data source type. In this case it tells OpenNMS that this is data gathered from a node. If we were thresholding TCP/IP statistics we would probably set this to “if” for interface.
- ds-label=“1min-load”: This is a label that will be listed in reports for this threshold. It should be set to something that will make sense.
- triggeredUEI=“uei.opennms.org/mssd/1min-load-trigger”: This defines what event is generated when the threshold is triggered.
- rearmedUEI=“uei.opennms.org/mssd/1min-load-rearm”: This defines what event is generated when the load number drops below the defined value. This event will allow us to have the alarms and notifications resolve automatically.
- value=“400”: This defines the target load number that must be exceeded in order for the trigger event to be generated.
- rearm=“200”: This defines the number that the acquired load number must fall below in order for the rearm event to be generated.
- trigger=“1”: This defines how many times the threshold must be exceeded before the trigger event is generated. This can allow for the server to have load spikes without generating alarms. For this example we allow only one in order to see faster results during testing.
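The interplay of “value”, “rearm” and “trigger” can be sketched in a few lines of Python. This is only an illustrative model of the behaviour described above and is not OpenNMS code. The class and function names are hypothetical, and the strict comparisons follow the wording “exceeds” and “falls below” used here; the real implementation may treat the boundary values differently.

```python
from dataclasses import dataclass


@dataclass
class HighThreshold:
    """Illustrative model of a "high" threshold; not OpenNMS code."""
    value: float        # level that must be exceeded to trigger
    rearm: float        # level the sample must fall below to rearm
    trigger: int        # consecutive exceeding samples required
    exceeded: int = 0   # running count of consecutive exceeding samples
    armed: bool = True  # False between a trigger and the next rearm

    def evaluate(self, sample):
        """Return "trigger", "rearm" or None for one collected sample."""
        if self.armed:
            if sample > self.value:
                self.exceeded += 1
                if self.exceeded >= self.trigger:
                    self.armed = False
                    self.exceeded = 0
                    return "trigger"
            else:
                self.exceeded = 0  # a normal sample resets the count
        elif sample < self.rearm:
            self.armed = True
            return "rearm"
        return None


def to_snmp_load(load):
    """Convert a load as shown by uptime (e.g. 4.25) to the x100 integer."""
    return round(load * 100)


# Same numbers as the one minute threshold: value=400, rearm=200, trigger=1.
one_min = HighThreshold(value=400, rearm=200, trigger=1)
samples = [1.50, 4.25, 4.80, 2.50, 1.20]  # made-up loads as seen in uptime
events = [one_min.evaluate(to_snmp_load(s)) for s in samples]
print(events)  # [None, 'trigger', None, None, 'rearm']
```

Note that the load of 2.50 produces no event: it is under the trigger value but has not yet fallen below the rearm value, which is what keeps a load hovering around the limit from generating a stream of alarms.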
Note that this testing was performed on version 1.3.9 of OpenNMS. In later versions this configuration and its location will change slightly in order to allow for better scaling and responsiveness.
7.1.2 Events
Now that we have defined the thresholds we need to define our custom UEIs. Events are defined in the “eventconf.xml” file. OpenNMS comes configured with thousands of predefined events. To help organize these, separate event files can be included. The included files are located in the “events” subdirectory of the OpenNMS “etc” directory. For our custom events we’ll create a custom file and include it in “eventconf.xml”.
<!-- eventconf.xml -->
<event-file>/opt/opennms/etc/events/mssd.xml</event-file>
This tells OpenNMS that we have created a file called “mssd.xml” that contains event definitions. Recall that our thresholding configuration defined UEIs of “uei.opennms.org/mssd/1min-load-trigger” and “uei.opennms.org/mssd/1min-load-rearm”. The first part, “uei.opennms.org”, precedes all UEIs. The second part, “mssd”, is a name of our choosing that matches the event file “mssd.xml”, which makes it easy to see where an event such as “1min-load-trigger” is defined. Now let’s look at the mssd.xml file.
<events>
<!-- mssd.xml -->
<!-- custom events -->
<event>
<uei>uei.opennms.org/mssd/1min-load-trigger</uei>
<event-label>High 1 minute CPU load</event-label>
<descr>High load for the past 1 minute on %nodelabel%</descr>
<logmsg dest="logndisplay">
High 1 minute CPU load on %nodelabel%
</logmsg>
<alarm-data reduction-key="%uei%:%nodeid%"
alarm-type="1" auto-clean="false" />
<severity>Minor</severity>
</event>
<event>
<uei>uei.opennms.org/mssd/1min-load-rearm</uei>
<event-label>High 1 minute CPU load cleared</event-label>
<descr>High load for the past 1 minute cleared on %nodelabel%</descr>
<logmsg dest="logndisplay">
High 1 minute CPU load cleared on %nodelabel%
</logmsg>
<alarm-data reduction-key="%uei%:%nodeid%" alarm-type="2"
clear-key="uei.opennms.org/mssd/1min-load-trigger:%nodeid%"
auto-clean="false" />
<severity>Normal</severity>
</event>
<!-- five and fifteen minute load events omitted for brevity -->
</events>
In this example the events for five and fifteen minute loads have been removed for brevity. After discussing thresholds the events should be a little easier to understand. Each event is contained within an “<event>” tag set. The “uei” tag defines the UEI name for this event. The “event-label” and “descr” tags define human readable text for when events are generated. The remaining tags require more explanation. Please refer to the first event.
“logmsg dest="logndisplay"”: This tells OpenNMS to log this message in the database and to display it on the web page.
“alarm-data”: This tag defines if and how an alarm is triggered. The reduction key helps OpenNMS to correlate the alarms and helps prevent duplicate alarms from appearing. In this case it will correlate based on the UEI and the nodeid, which is a unique number that identifies each node. The “alarm-type” is set to 1, which means that this alarm is a trouble alarm. The parameter “auto-clean”, when set to true, tells OpenNMS to remove old events from the database that have been reduced under the same alarm. In this case we’d like to keep these for future reference.
The “severity” tag defines the severity of the event. In OpenNMS severity comes in seven categories:
- Critical (red)
- Major (orange)
- Minor (yellow)
- Warning (cyan)
- Normal (green)
- Cleared (white)
- Indeterminate (light blue)
These are fairly self explanatory, but the official OpenNMS documentation should be consulted for more information. One thing that is worth noting is that it is not wise to configure events with too high a severity. Critical events should mean all hands on deck even at 0300. Bear that in mind as you define your own alarms.
In the second event we see some differences. The “alarm-type” is set to 2. This identifies the alarm for this event as a clearing alarm. Note that this UEI is labeled as a rearm event. Our goal is that when the load average drops to an acceptable level a rearm event is triggered. The event is given a severity of “Normal”. Finally we want this alarm to clear the alarm raised by the previous event. To do this we define the “clear-key” parameter. Note that it names the UEI of the earlier trigger event followed by the node ID, which is exactly the reduction key of the trouble alarm. This is what tells OpenNMS to clear the alarm from the previous event.
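To make the relationship between “reduction-key” and “clear-key” more concrete, here is a small Python sketch. It is a simplified model under stated assumptions and not how OpenNMS is implemented: the alarm table is a plain dictionary, the helper names are hypothetical, and a cleared alarm is simply removed, whereas the real system tracks more state.

```python
def expand(template, event):
    """Replace %name% tokens in a key template with fields from an event."""
    for name, val in event.items():
        template = template.replace(f"%{name}%", str(val))
    return template


TRIGGER_UEI = "uei.opennms.org/mssd/1min-load-trigger"
REARM_UEI = "uei.opennms.org/mssd/1min-load-rearm"

# Hypothetical alarm table: reduction key -> number of events reduced into it.
active_alarms = {}


def handle(event):
    """Apply one event to the alarm table, mimicking the described behaviour."""
    if event["uei"] == TRIGGER_UEI:
        # alarm-type 1: duplicates share a reduction key, so they fold
        # into one alarm instead of creating a new one each time.
        key = expand("%uei%:%nodeid%", event)
        active_alarms[key] = active_alarms.get(key, 0) + 1
    elif event["uei"] == REARM_UEI:
        # alarm-type 2: the clear-key must equal the trouble alarm's
        # reduction key for the clearing to find it.
        clear_key = expand(TRIGGER_UEI + ":%nodeid%", event)
        active_alarms.pop(clear_key, None)


# Node IDs 12 and 30 are made up for the illustration.
handle({"uei": TRIGGER_UEI, "nodeid": 12})
handle({"uei": TRIGGER_UEI, "nodeid": 12})  # duplicate: same alarm, count 2
handle({"uei": TRIGGER_UEI, "nodeid": 30})
handle({"uei": REARM_UEI, "nodeid": 12})    # clears only node 12's alarm
print(active_alarms)
```

After these four events only the alarm for node 30 remains. Because the node ID is part of both keys, a rearm on one node never clears an alarm on another.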
7.1.3 Notifications
Finally we come to notifications. For this test I wanted to have OpenNMS send email as well as display alerts on the dashboard. To get email working I had to alter the “javamail-configuration.properties” file. I added the line:
org.opennms.core.utils.fromAddress=opennms@example.com
which defines who the emails appear to come from. Additionally I had to set the use of the javamail library to false:
org.opennms.core.utils.useJMTA=false
This forces OpenNMS to use normal SMTP methods of sending email.
Next I had to define custom notifications for our load average events and alarms. Here I show only the 1 minute load average notification. This is found in the “notifications.xml” file:
<!-- notifications.xml -->
<notification name="1min-load-trigger" status="on">
<uei>uei.opennms.org/mssd/1min-load-trigger</uei>
<rule>IPADDR != '0.0.0.0'</rule>
<destinationPath>Email-Admin</destinationPath>
<text-message>High load for the past 1
minute on %nodelabel%</text-message>
<subject>Notice #%noticeid%: High 1
minute CPU load on %nodelabel%</subject>
</notification>
Here we are defining a notification for the “1min-load-trigger” that we defined earlier as a threshold, an event and an alarm. We set its status to “on” so that a notification will occur. We identify the UEI, which is “uei.opennms.org/mssd/1min-load-trigger”, of the event that will trigger this notification. The “rule” tag is a filter, should you wish this notification to occur only on nodes with certain IP addresses. In our case we match anything with an IP address that is not “0.0.0.0”. Next we define the “destinationPath”. This defines what escalation path OpenNMS should follow. In this example we tell it to use email. Next is the text message or body of the email. Last is the subject of the message or email.
Finally, we create an automatic acknowledgement in “notifd-configuration.xml”. Again for brevity we show only the 1 minute load average configuration. Our goal here is to have OpenNMS automatically acknowledge a “1min-load-trigger” notification whenever a “1min-load-rearm” event is generated on the same node.
<!-- notifd-configuration.xml -->
<auto-acknowledge resolution-prefix="RESOLVED: "
uei="uei.opennms.org/mssd/1min-load-rearm"
acknowledge="uei.opennms.org/mssd/1min-load-trigger">
<match>nodeid</match>
</auto-acknowledge>
We start with the “auto-acknowledge” tag and define the “resolution-prefix” parameter. This prefix will precede the subject on a notification message, indicating that the previous notification is now resolved. The “uei” parameter defines what event will trigger this auto acknowledgement. Next, the “acknowledge” parameter defines which earlier notification should be automatically acknowledged. Finally we “match” the node ID. This ensures that notifications from other nodes are not auto acknowledged.
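The matching that this stanza asks for can be pictured with the following Python sketch. It is only a model of the behaviour described here, not OpenNMS code. The “Notice” record, the function name, the node IDs and the second notice are all hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass

# Mirrors the attributes of the auto-acknowledge stanza.
RULE = {
    "resolution-prefix": "RESOLVED: ",
    "uei": "uei.opennms.org/mssd/1min-load-rearm",
    "acknowledge": "uei.opennms.org/mssd/1min-load-trigger",
}


@dataclass
class Notice:
    """Hypothetical record of one outstanding notification."""
    uei: str
    nodeid: int
    subject: str
    responder: str = ""  # empty until someone or something acknowledges


def auto_acknowledge(event_uei, event_nodeid, notices, rule=RULE):
    """Acknowledge matching notices; return subjects of the follow-up mails."""
    if event_uei != rule["uei"]:
        return []
    subjects = []
    for notice in notices:
        same_node = notice.nodeid == event_nodeid  # <match>nodeid</match>
        if notice.uei == rule["acknowledge"] and same_node and not notice.responder:
            notice.responder = "auto-acknowledged"
            subjects.append(rule["resolution-prefix"] + notice.subject)
    return subjects


notices = [
    Notice(RULE["acknowledge"], 12,
           "Notice #69: High 1 minute CPU load on mynode.example.com"),
    Notice(RULE["acknowledge"], 30,
           "Notice #70: High 1 minute CPU load on other.example.com"),
]
# A rearm event arrives for node 12 only.
print(auto_acknowledge(RULE["uei"], 12, notices))
```

Only the notice for node 12 is acknowledged and gets a “RESOLVED: ” follow-up. The notice for node 30 stays open until that node produces its own rearm event.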
That was quite a lot to take in. Let’s see how it all works by showing the chain of events.
- Load a system by whatever means you like.
- After perhaps 5 to 10 minutes you will see a minor event listed by the node: “High 1 minute CPU load on mynode.example.com”.
- On the dashboard an alarm will appear: “High 1 minute CPU load on mynode.example.com”.
- At the same time a notification will appear on the dashboard with the same message.
- An email will be sent to the OpenNMS admin user’s email address if it is defined. This will have a subject similar to “Notice #69: High 1 minute CPU load on mynode.example.com”.
- Now stop loading the test system and wait for the load to fall past the rearm threshold.
- After that happens a normal event will be listed by the node: “High 1 minute CPU load cleared on mynode.example.com”.
- On the dashboard the previous alarm will now be gone.
- On the dashboard the previous notification will turn green and the “responder” will be listed as “auto-acknowledged”.
- Finally an email will be sent to the same recipient (admin) with the subject “RESOLVED: Notice #69: High 1 minute CPU load on mynode.example.com”.
That seemed like a lot of work. It is, but you shouldn’t have to do this very often. OpenNMS comes loaded with many events, alarms and notifications that will cover most of your needs, including working with proprietary gear such as Cisco and APC.
7.2 Testing URLs
Suppose you want to monitor the response times and availability of a web site. By default OpenNMS will monitor a given IP address for the existence and the responsiveness of the HTTP daemon. However, to have it request a URL from the HTTP daemon requires a little extra configuration.
In this example let’s assume that we have a URL “https://www.example.com/ping” and that this returns a page that contains “ok”. The expected HTTP response code is “200”. We purposely built this page for such testing. First we must configure a URL test as a “service” and then configure a “monitor” to monitor that service. This is done in the “poller-configuration.xml” file.
<!-- poller-configuration.xml -->
<!-- define service for checking www.example.com ping page -->
<service name="www.example.com-ping" interval="300000"
user-defined="false" status="on">
<parameter key="retry" value="1"/>
<parameter key="timeout" value="30000"/>
<parameter key="port" value="443"/>
<parameter key="host-name" value="www.example.com"/>
<parameter key="url" value="/ping"/>
<parameter key="response" value="200"/>
</service>
<!-- define monitor for monitoring www.example.com ping page -->
<monitor service="www.example.com-ping"
class-name="org.opennms.netmgt.poller.monitors.HttpsMonitor"/>
The first stanza is the “service” tag. This tag defines what must be checked.
- name=“www.example.com-ping”: This is a label that will be referred to later.
- interval=“300000”: This defines the frequency, in milliseconds, that the URL should be tested.
- user-defined=“false”: Services can be defined by users using the OpenNMS web page. This identifies such services.
- status=“on”: This service has not been disabled.
- key=“retry” value=“1”: This defines the number of retries OpenNMS should attempt before determining if the service is down.
- key=“timeout” value=“30000”: This defines the maximum amount of time, in milliseconds, the service should take to respond.
- key=“port” value=“443”: This defines the port on which to test this service.
- key=“host-name” value=“www.example.com”: This defines the host name to test.
- key=“url” value=“/ping”: This defines the URL to add to the host name for the test.
- key=“response” value=“200”: This defines the HTTP status code that should be returned by the service.
The second stanza defines that OpenNMS should monitor this service when it is associated with a node. The “service=” references the name of the service we previously defined. The “class-name=” defines what OpenNMS poller agent should be used. In this case we define an HTTPS monitor agent.
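For readers who want to see what such a check amounts to, here is a rough Python equivalent of one poll using the parameters from the service definition. This is a sketch under assumptions and not the HttpsMonitor implementation: the function names are hypothetical, and the choice to retry on a wrong status code as well as on a connection failure is mine.

```python
import http.client
import time

# Same keys and values as the service parameters shown earlier.
PARAMS = {
    "retry": 1,
    "timeout": 30000,  # milliseconds
    "port": 443,
    "host-name": "www.example.com",
    "url": "/ping",
    "response": 200,
}


def fetch_status(host, port, url, timeout_ms):
    """Request the URL over HTTPS and return the HTTP status code."""
    conn = http.client.HTTPSConnection(host, port, timeout=timeout_ms / 1000)
    try:
        conn.request("GET", url)
        return conn.getresponse().status
    finally:
        conn.close()


def poll(params, fetch=fetch_status):
    """Return (is_up, response_time_ms) after at most retry + 1 attempts."""
    for _ in range(params["retry"] + 1):
        start = time.monotonic()
        try:
            status = fetch(params["host-name"], params["port"],
                           params["url"], params["timeout"])
        except (OSError, http.client.HTTPException):
            continue  # connection failure or timeout: try again
        elapsed_ms = (time.monotonic() - start) * 1000
        if status == params["response"]:
            return True, elapsed_ms
    return False, None


# Offline demonstration with a stand-in fetcher that always answers 200.
# Call poll(PARAMS) without the fetch argument to test a real site.
is_up, _ = poll(PARAMS, fetch=lambda host, port, url, timeout_ms: 200)
print(is_up)  # True
```

The “interval” attribute is not modelled here. It only governs how often the poller would repeat such a check.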
Now that we’ve defined this service we must associate it with a node. To offer more learning potential we’ll assume that the node in question is beyond OpenNMS. That is to say, the node is separated from OpenNMS by enough uncontrollable infrastructure that ICMP and any other testing is beyond our control, and we can only check the URL. In such cases it is useful to add a fake node and associate our service with it.
Fake nodes can be created using the “Provisioning Groups” tool in the “Admin” section of the OpenNMS web page. Use this tool to create a fake node and then a fake interface. The interface IP does not have to match the IP of the URL you are testing since we defined a host name. Do not assign the “SNMP Primary” on the interface to “P”. Choose “S” instead. Now add the “www.example.com-ping” service to the node. Once this is done press the “Done” button. You will see the fake node presented to you. Press the “Import” link to activate it. You can now test this by using Iptables to deny traffic to the remote URL. This should generate an event and an outage. Applying what you learned in our thresholding exercise you can also configure custom events, alarms and notifications.
It should be noted that it is also possible to have OpenNMS poll web sites more interactively, including logins and following links.
8 Conclusions
OpenNMS offers the flexibility to meet the monitoring needs of even the most discriminating users. OpenNMS is a mature product having been in existence since at least the year 2000. The learning curve is steep but the documentation is quite helpful. The OpenNMS mailing lists are also a valuable resource. List members helped me to successfully test the customization examples presented in this review.
One caveat to consider before choosing OpenNMS or any network monitoring system is to define carefully what you want to accomplish. A requirement of “Monitor everything” may lead to disappointment or make your project so large that it cannot be completed within budget. With your requirements clearly defined you’ll be able to evaluate OpenNMS and other products in a more useful manner.
A References
OpenNMS Project site: http://www.opennms.org
The OpenNMS Group (commercial sponsor site): http://www.opennms.com
This document was translated from LaTeX by HeVeA.