- Introduction
Organizations often face the challenge of integrating modern technology with legacy systems. They implement new systems (in this case, a cloud-based distributed database) while still using systems that have served them for years and that they rely on to do their work day in and day out (in this case, IBM Netcool). This case study highlights the importance of balancing innovation with stability while minimizing disruptions.
- Background
2.1. Existing Technology
For many years, corporations around the world have used IBM Netcool/OMNIbus as their centralized alarm system. It is solid, time-tested software that offers many advantages, such as real-time event correlation, customizable rules and policies, integration with other important systems (such as ticketing), high availability and reliability, and scalability. Incumbency adds to all of the above: it was already in place, so there was no need to spend more money to have it.
The Tier 1 carrier made it a non-negotiable requirement to send the alarms shown in Grafana to Netcool.
2.2. New Technology (Grafana)
Grafana is an open-source data visualization and monitoring tool that is very flexible and versatile. Its features include support for many data sources (Prometheus, InfluxDB, Graphite, MySQL, PostgreSQL, Elasticsearch, and AWS CloudWatch, among others), highly customizable dashboards, a powerful alerting system able to send notifications via many channels (email, Slack, PagerDuty, Microsoft Teams, and more), plugins and extensions from many vendors, and, last but not least, Kubernetes and cloud monitoring.
ENEA’s Stratum 4.4 ships with a set of Grafana dashboards to monitor the application and the Kubernetes cluster where it is hosted.
- Methodology
3.1. Assessment Phase
We did some research and found two options: 1) create SNMP traps for each alarm and send them to Netcool, or 2) use Netcool’s webhook probe. We discarded the SNMP traps because they would have entailed several weeks of development and testing, and proceeded to look at the second option.
We held a meeting with the Netcool team. They provided a URL to send the alarms to and gave us examples of messages received by Netcool from another system. We were also told that we needed to package our alarms as JSON payloads.
In the Grafana GUI we created a contact point with the integration set to "Webhook" and configured it to send the alarms to the URL provided by the Netcool team. There was an option to test the contact point and we tried it. We got a pop-up window with a message that the test had been sent successfully. Piece of cake!
Not so fast. The Netcool team informed us in our second meeting that the messages received had the wrong format. What did they mean? Grafana was sending JSON, as required. Back to square one.
3.2. Implementation Strategy
After further research we decided to write a Python script to receive the alarms from Grafana, parse them, clean them up, reformat them and resend them to Netcool.
ENEA’s Stratum 4.4 alarms are very simple: there are only three severity levels, namely critical, warning and cleared. We defined alerts that generate the alarms using MySQL queries.
Each of the three alarm severities generates an alert, so each one has its own Alert Rule in the Grafana GUI, and each rule carries a label signaling the severity level. We then created a Notification Policy that matches the three types of alarm and sends them to our Python script. The script, in turn, sends them to Netcool.
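The receiving side of such a script can be sketched with the standard http.server module. This is a minimal illustration, not the script used in the project: the handler name, the `make_server` helper, the port and the in-memory `received_bodies` buffer are all assumptions made for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative buffer; a real script would parse and forward each body instead.
received_bodies = []


class GrafanaWebhookHandler(BaseHTTPRequestHandler):
    """Accepts the POST requests sent by a Grafana webhook contact point."""

    def do_POST(self):
        # Read exactly as many bytes as the Content-Length header announces.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        received_bodies.append(body)
        # Acknowledge receipt so Grafana does not treat the delivery as failed.
        self.send_response(200)
        self.end_headers()

    def log_message(self, format, *args):
        # Suppress the default per-request log line on stderr.
        pass


def make_server(host="0.0.0.0", port=8080):
    """Build the server; host and port here are hypothetical defaults."""
    return HTTPServer((host, port), GrafanaWebhookHandler)
```

Calling `make_server().serve_forever()` would start listening; the Grafana contact point URL would then point at this host and port.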
We were able to receive the alarms in our script, clean up the JSON a bit and resend only the alarms to Netcool. More testing was needed. The Netcool team told us that all the alarms were arriving together in the same JSON payload and that we needed to split them into single JSON payloads. We also needed to adjust the timing of the alarms.
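The splitting step might look like the sketch below. It assumes the Grafana webhook body is a JSON object whose `alerts` key holds a list with one entry per alarm, which is how Grafana groups notifications; the function name and the sample values are hypothetical.

```python
import json


def split_alerts(grafana_body):
    """Return one JSON string per alert found in a grouped Grafana payload."""
    payload = json.loads(grafana_body)
    # Each element of "alerts" becomes its own standalone JSON document.
    return [json.dumps(alert) for alert in payload.get("alerts", [])]


# Hypothetical grouped payload carrying two alarms in a single request.
sample_body = json.dumps({
    "receiver": "netcool-forwarder",
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"severity": "critical"}},
        {"status": "firing", "labels": {"severity": "warning"}},
    ],
})

single_payloads = split_alerts(sample_body)
```

Each string in `single_payloads` can then be sent to Netcool in a separate request.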
- Challenges Faced
- Formatting of alarms: even though we were sending JSON, we needed to split the alarms into single JSON payloads, one for each alarm.
- Timing of the alarms needed to be adjusted so that we would get alarms within two minutes of each alarm showing in the Grafana dashboard.
- The testing was done over insecure HTTP. We needed to use secure certificates to connect to the Netcool server.
- We had very tight time constraints to deliver all of the above.
- Solutions and Best Practices
We addressed the items above one by one. First, we used Python's standard json library to validate and modify the JSON payload received from Grafana, and to build the outgoing message from the fields agreed upon with the Netcool team.
Furthermore, we lowered the alarm timing settings in the Grafana configuration to the minimum so that alarms reach Netcool at most two minutes after they show in the Grafana dashboards.
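A sketch of the validation and reformatting step follows. The field names agreed with the Netcool team are not listed here, so the output keys (`alarm_name`, `severity`, `summary`, `timestamp`) are placeholders, as are the function names; the input keys (`labels`, `annotations`, `startsAt`) are assumed to follow Grafana's webhook format.

```python
import json
from datetime import datetime, timezone

# The three severity levels used by the alert rule labels.
ALLOWED_SEVERITIES = {"critical", "warning", "cleared"}


def parse_grafana_body(raw):
    """Return the list of alert objects, or [] if the body is not usable JSON."""
    try:
        payload = json.loads(raw)
    except ValueError:  # covers json.JSONDecodeError and bad byte encodings
        return []
    if not isinstance(payload, dict):
        return []
    return [a for a in payload.get("alerts", []) if isinstance(a, dict)]


def build_netcool_message(alert):
    """Map one Grafana alert to a single JSON message (hypothetical fields)."""
    labels = alert.get("labels", {})
    severity = labels.get("severity")
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unexpected severity: {severity!r}")
    message = {
        "alarm_name": labels.get("alertname", "unknown"),
        "severity": severity,
        "summary": alert.get("annotations", {}).get("summary", ""),
        # Fall back to the current UTC time if Grafana sent no start time.
        "timestamp": alert.get("startsAt")
        or datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(message)
```

Rejecting unknown severities early keeps malformed alarms from reaching Netcool, where they would be harder to diagnose.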
We added secure certificates and performed all communications with Netcool using a secure channel.
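Sending over a secure channel with the ssl and urllib.request modules could be sketched as follows. This assumes the server certificate is verified against a CA bundle; if the Netcool side also required a client certificate, `SSLContext.load_cert_chain` would be needed as well. The function names and the `ca_file` parameter are illustrative.

```python
import ssl
import urllib.request


def build_tls_context(ca_file=None):
    """Create a context that verifies the server certificate and hostname.

    ca_file is a hypothetical path to the CA bundle that signed the Netcool
    certificate; None falls back to the system trust store.
    """
    return ssl.create_default_context(cafile=ca_file)


def build_request(url, message_json):
    """Prepare an HTTPS POST carrying a single alarm as JSON."""
    return urllib.request.Request(
        url,
        data=message_json.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_to_netcool(url, message_json, context, timeout=10):
    """POST one alarm and return the HTTP status code of the response."""
    request = build_request(url, message_json)
    with urllib.request.urlopen(request, context=context, timeout=timeout) as response:
        return response.status
```

The default context enables certificate and hostname verification, so a misconfigured or untrusted endpoint fails with an error instead of silently receiving alarms.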
We managed to fit the delivery of the feature within the time allotted by choosing Python as the scripting language, since its standard library covers all of the required tasks. We used the following libraries: json, http.server, urllib.request, urllib.parse, datetime and ssl.
We created a Docker container to run the script and deployed the container into our Kubernetes namespace.
The Netcool team created the Netcool configuration based on the agreed JSON payload and the stakeholders were satisfied with the outcome.
- Results and Impact
We were able to deliver the feature requested by integrating Grafana and Netcool. The stakeholders had made it clear from the beginning that delivery of the alarms to the Netcool system was mandatory and it was a gating requirement for the project to go live.
It was only reasonable to leverage the existing Netcool system for alarming as there were many other systems using Netcool and the NOC operators are used to it.
The stakeholders initially showed some apprehension about using the Grafana dashboards, as they were new to them, but as the project progressed they became more familiar with them and started incorporating them into certain activities, such as system upgrades and testing.
- Conclusion
This project showcased the reality that many companies, large and small, face when adopting new technologies. It feels great to embrace new technologies with enthusiasm but, at the same time, one must consider integrating with legacy systems, leveraging existing resources and know-how and, ultimately, saving the company time and money.
References