Microservices for Monitoring Scale and Resilience
Opsview Monitor 6 has been re-engineered from scratch to meet demanding enterprise and service provider requirements for scale, performance, flexibility, cloud-readiness, and increasingly automated deployment and lifecycle management -- all while maintaining 100% compatibility with Nagios® plugins. We spoke with Alex Burzynski, Chief Architect, to better understand the details.
John Jainschigg: Why did Opsview undertake this redesign?
Alex Burzynski: “We were being held back by aspects of how Nagios Core was designed. Most important, Nagios comprises several code monoliths. These made it very hard to scale Opsview Monitor, except vertically. If we needed to monitor more hosts, or to store many more metrics per host, we could deploy our software on bigger, more powerful servers with more CPUs, RAM, and disk. And we could deploy multiple instances to split up monitoring for very large pools of hosts. This strategy was pretty successful. We kept improving efficiency within the large components we’d inherited from Nagios. And Opsview Monitor was recognized for scalability -- even more so than some SaaS monitoring solutions.”
JJ: So what was the problem?
AB: “Our largest enterprise customers kept growing their IT estates: looking to monitor thousands of hosts per Opsview Monitor instance, or even tens of thousands. Vertical scaling delivers diminishing returns -- sooner rather than later, no matter what resources are thrown at the problem, a limit is going to be reached. Opsview 5, with its monolithic approach, meant using large, expensive servers or virtual machines, which were big investments and represented potential single points of failure. Providing true high availability meant duplicating these large servers several times, at even greater expense. Managing growth meant a lot of advance planning for server expansion. And all this limited our flexibility to configure Opsview Monitor to handle unique customer requirements.”
JJ: What kind of architecture did you want to build, instead?
AB: “We wanted something more cloud-friendly. We wanted to be able to deploy Opsview Monitor anywhere: on commodity servers or VMs, on a customer’s premises or on any cloud provider. And then we wanted to be able to scale by adding more such inexpensive capacity, wherever it was needed. To do this, we broke up the big pieces we inherited from Nagios into roughly ten times the number of simpler, more scalable and tunable microservices for monitoring. We engineered these microservices to communicate using a robust message-brokering solution called RabbitMQ, and where needed, to save their states with a resilient, distributed key-value database called etcd.”
JJ: Why use RabbitMQ?
AB: “The real question is: why use any messaging or stream processing solution? Because they let services communicate with one another -- even from server to server -- without creating dependencies. Services don’t need to know every detail of how to call each other directly: just how to reach the messaging solution and which other service or services they want to talk to.
“This lets services be much simpler and more generic. It lets you scale them out much more easily and place them where resources are available. RabbitMQ also lets us create duplicate message queues, so we can process different aspects of the same data streams in parallel. And it helps us provide resiliency with a true cluster architecture: RabbitMQ can be configured to retain messages until receivers acknowledge that their payloads have been processed, and if no acknowledgement arrives -- because a recipient service has failed, for example -- it puts those messages back in the queue so they can be processed when a receiver is once again available.
“This said, our architecture is actually not strongly dependent on unique RabbitMQ features, and we are continuing to look at a range of solutions -- Apache Kafka being one example -- to evaluate if they’re better for helping us reach our performance, scalability, and other roadmap targets.”
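For readers who want to see the acknowledge-and-requeue pattern concretely, the sketch below shows a minimal RabbitMQ consumer using Python and the pika client. The queue name and processing function are hypothetical; this is an illustration of the general pattern Burzynski describes, not Opsview's actual code.

```python
import pika

# Connect to a local RabbitMQ broker (hypothetical host and queue names).
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="results", durable=True)  # queue survives broker restarts

def process(body):
    # Placeholder for the consumer's real work on one message payload.
    print("processing", body)

def handle(channel, method, properties, body):
    try:
        process(body)
        channel.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # Negative-acknowledge so the broker puts the message back in the
        # queue for another consumer, instead of losing it.
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

# Manual acknowledgements: a message stays in the queue until some consumer
# confirms that its payload has been processed.
channel.basic_consume(queue="results", on_message_callback=handle, auto_ack=False)
channel.start_consuming()
```

Because the channel only acknowledges after process() succeeds, a consumer crash leaves the message queued for another worker to pick up -- the resiliency behavior described above.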
JJ: What about etcd?
AB: “Opsview Monitor comprises several families of microservices for monitoring. Some of these are ‘stateless’ -- parts of the web front-end are good examples -- where we use Node.js to accept input from a browser and send back some information to display. Any stateless component can respond to any user or browser’s input. So we can scale such processes out or back as loads require, and pass requests to them, round-robin, through a load balancer.
“Other components are ‘stateful.’ They need to keep track of peer services making up a service chain, or find a producer or consumer process of a specific type, and so on. So services write this information to etcd, and then they or other services can retrieve it: recovering state or finding one another -- the latter is called ‘service discovery.’ Etcd is very good for this because it’s designed to favor consistency over availability: once you write something into it, that information is replicated around the cluster before the write completes, so any subsequent read returns the same view. This prevents a condition cloud computing people call ‘split brain,’ where two parts of a complex application have different ideas about state.”
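As a rough sketch of the service-discovery pattern described above, the snippet below registers a stateful service in etcd and lets a peer look it up, using Python and the python-etcd3 client. The key layout, service names, and addresses are invented for illustration and are not Opsview's schema.

```python
import etcd3

# Connect to an etcd cluster member (hypothetical endpoint).
etcd = etcd3.client(host="127.0.0.1", port=2379)

# A stateful service registers itself under a well-known key prefix,
# attached to a lease so the entry expires if the service dies.
lease = etcd.lease(ttl=30)
etcd.put("/services/results-writer/instance-1", "10.0.0.12:5671", lease=lease)

# A peer discovers all live instances of that service type by reading the
# prefix; because etcd favors consistency, every reader sees the same
# registered set once a write has been committed.
for value, metadata in etcd.get_prefix("/services/results-writer/"):
    print(metadata.key.decode(), "->", value.decode())
```

Tying the registration to a lease means a crashed service drops out of the registry once its lease expires, so peers stop discovering a dead endpoint.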
JJ: What are the benefits of all this to users?
AB: “We can now scale different parts of Opsview Monitor individually, to optimize overall performance and efficiency on any bare-metal or virtualized infrastructure. As I said before, where we have tasks with variable duration and resource requirements, we can scale out workers and load balance them. For example, executing Nagios-compatible plugins to retrieve metrics from hosts might mean pinging a remote host, which is fast and easy, or it might mean running a Selenium browser-emulation session, which requires a lot of resources and can be very slow or even time out. So here, it makes sense to deploy a lot of single-threaded executors and let them perform new tasks as soon as they become free. We get great performance from this strategy: Opsview Collectors are now tested up to 10,000 hosts per unit.
“Elsewhere, it makes better performance sense to distribute complex work deliberately, assigning it, in a balanced way, to a limited number of parallel pipelines matched to CPU cores and optimized to handle expected dataflows. We use a fixed set of parallel services to write to databases, for example, since we can know and tune simultaneous-write capability for each database in advance. Another place we use this kind of carefully orchestrated parallelism is in handling reloads: we’ve achieved a 5x improvement in how fast we can regenerate configuration, resulting in an average 100% speed improvement overall.”
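To make the second strategy concrete, here is a minimal sketch of a fixed pool of writer processes sized to a database's known concurrent-write capacity, with work handed to them through a shared queue. The pool size, queue, and write function are hypothetical, and this illustrates the general deliberate-parallelism approach rather than Opsview's implementation.

```python
import multiprocessing as mp

WRITER_COUNT = 4  # fixed pool, tuned to what the database can absorb concurrently

def write_batch(batch):
    # Placeholder for a real database write of one batch of metrics.
    print(f"writing {len(batch)} metrics")

def writer(queue):
    # Each writer is one parallel pipeline: it drains batches from the
    # shared queue until it receives a shutdown sentinel.
    while True:
        batch = queue.get()
        if batch is None:          # sentinel: shut down cleanly
            break
        write_batch(batch)

if __name__ == "__main__":
    queue = mp.Queue()
    pool = [mp.Process(target=writer, args=(queue,)) for _ in range(WRITER_COUNT)]
    for p in pool:
        p.start()

    # Producers (e.g. results collectors) would enqueue metric batches here.
    for i in range(10):
        queue.put([{"host": f"host-{i}", "metric": "load", "value": 0.42}])

    for _ in pool:
        queue.put(None)            # one sentinel per writer
    for p in pool:
        p.join()
```

By contrast, the executor strategy described first simply adds or removes such workers as load changes, since each one pulls its next task as soon as it finishes the current one.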
JJ: Aren’t all these services very complex to deploy?
AB: “Manually deploying them could become complicated. But we’re moving beyond manual deployment, and towards much more use of automation. Opsview Monitor 6 uses Ansible to automate deployment, and eventually a range of lifecycle-management functions, like adding Collector nodes or deploying Service Desk integration. Our goal is to use Ansible, wrapped in helper scripts, to enable reliable, rapid automated deployment of appropriately optimized scale-out enterprise configurations of Opsview Monitor -- a process involving, among other requirements, the need to predict loads, calculate the resources required to process them, and orchestrate deployment and startup of those processes.”