Monitoring is an integral part of any on-premises or off-premises IT setup. Whether it is used to track application performance or to check the availability of infrastructure and services, monitoring is a critical part of today’s IT and directly impacts bottom-line results.
Quite often, a monitoring stack is born out of the application stack and the infrastructure that hosts it. It evolves over time as new requirements come in and the application stack changes. Different application stacks can result in different monitoring stacks, eventually leading to ‘monitoring sprawl’. Choosing the right monitoring solution for your IT infrastructure is critical to the success of the business, and directly affects the bottom line. The rise of microservices, together with today’s complex and dynamic environments driven by containerization and virtualization, has made monitoring more crucial than ever.
A well-researched and well-planned monitoring solution improves ROI through reduced downtime, reduced MTTR, improved SLAs and increased productivity of IT staff. Let’s look at some of the key considerations to take into account when deciding on a monitoring solution.
- Problems: It’s very important to be clear about the problems you are trying to solve with monitoring. Is the goal to tackle application performance or availability issues, to get a picture of the existing infrastructure and its availability, to monitor compliance, or something else? First define the basic functionality you need from your monitoring solution.
- Users: The kinds of metrics you want to measure, along with the necessary granularity, visualization and subsequent analysis of the data, are closely tied to the actual users of the monitoring solution. For example, developers might need monitoring data specific to their applications, whereas operators might need data on capacity utilization, availability, security and so on. Similarly, one set of users might want to see only application failures, whereas another might be interested only in server hardware failures.
- Metrics: Once you have identified the users and the key issues the monitoring solution must address, the next step is to identify the actual metrics needed, the granularity at which the data must be available, and the aggregation and correlation that will be required. Granularity, aggregation and correlation for the same metric might vary depending on the intended users. For example, a developer might be interested in real-time CPU utilization for a specific application, whereas a person sizing the infrastructure for that application might only be interested in the average monthly CPU utilization. Consequently, the monitoring tool should be able to handle these varied requirements.
- Implementation: Here are a few questions to keep in mind when reviewing and selecting your monitoring tools from an implementation point of view: What is the overall acquisition cost of the solution? How will it integrate with any existing tools? Do you plan to integrate with collaboration platforms like Jira or Slack? How easy or difficult are the tools to learn, and are the necessary skills available in the market? Could your selection potentially result in vendor lock-in? Is the tool based on open standards? How will you scale the solution?
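To make the granularity point from the Metrics item concrete, here is a minimal Python sketch of how the same CPU data can serve two audiences. The sample values and the function names `latest_sample` and `monthly_average` are made up for illustration; a real monitoring tool would expose its own query and rollup features.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical raw samples: (timestamp, CPU utilization in percent).
samples = [
    (datetime(2024, 1, 5, 10, 0), 40.0),
    (datetime(2024, 1, 5, 10, 1), 60.0),
    (datetime(2024, 1, 20, 9, 0), 80.0),
    (datetime(2024, 2, 3, 14, 0), 30.0),
    (datetime(2024, 2, 3, 14, 1), 50.0),
]


def latest_sample(samples):
    """Fine-grained view (e.g. for a developer): the most recent raw reading."""
    return max(samples, key=lambda s: s[0])


def monthly_average(samples):
    """Coarse view (e.g. for capacity sizing): mean utilization per (year, month)."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[(timestamp.year, timestamp.month)].append(value)
    return {month: mean(values) for month, values in sorted(buckets.items())}


if __name__ == "__main__":
    print("latest:", latest_sample(samples))
    print("monthly:", monthly_average(samples))
```

Both views come from the same raw data; what changes is the aggregation applied before it reaches each user.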
Finally, consider from the outset how your various monitoring tools will work in tandem, and how you will manage and route alerts. Without a proactive plan in place, alerts from multiple monitoring sources can quickly escalate from annoying to chaotic. Some of today’s most popular monitoring tools are also the noisiest, which can lead to unintended consequences such as false alarms and critical data being ignored. Consider alert correlation, not only to cut down the noise without losing visibility, but also to gain insights by combining data from multiple monitoring sources. Take Nagios, for example: the popular monitoring tool demonstrated an average compression ratio of over 97% when raw alerts were correlated into common, related incidents.
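As a rough illustration of what alert correlation can look like, the Python sketch below groups raw alerts into incidents and computes a compression ratio. Everything here is an assumption made for the example: the `Alert` fields, the rule of grouping by host and service within a five-minute window, and the definition of the ratio as one minus incidents divided by alerts. Real correlation engines use richer rules, and this is not how any particular product works.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: int  # seconds since an arbitrary start (hypothetical field)
    host: str
    service: str


def correlate(alerts, window_seconds=300):
    """Group alerts that share a host and service into incidents.

    An alert joins the latest incident for its (host, service) pair if it
    arrives within window_seconds of that incident's most recent alert;
    otherwise it opens a new incident.
    """
    incidents = []
    latest_incident = {}  # (host, service) -> index into incidents
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.host, alert.service)
        index = latest_incident.get(key)
        if (index is not None
                and alert.timestamp - incidents[index][-1].timestamp <= window_seconds):
            incidents[index].append(alert)
        else:
            incidents.append([alert])
            latest_incident[key] = len(incidents) - 1
    return incidents


def compression_ratio(alert_count, incident_count):
    """Share of raw alerts removed by correlation (assumed definition)."""
    if alert_count == 0:
        return 0.0
    return 1 - incident_count / alert_count


if __name__ == "__main__":
    raw = [
        Alert(0, "web01", "http"), Alert(60, "web01", "http"),
        Alert(120, "web01", "http"), Alert(1000, "web01", "http"),
        Alert(30, "db01", "disk"), Alert(90, "db01", "disk"),
    ]
    grouped = correlate(raw)
    print(len(raw), "alerts ->", len(grouped), "incidents")
    print("compression ratio:", compression_ratio(len(raw), len(grouped)))
```

In this toy data set, six alerts collapse into three incidents. The point is only that an on-call engineer sees a handful of incidents rather than every raw alert.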
Hopefully this gives you an idea of the research and planning that should go into deploying a monitoring solution, as well as some of the questions you should ask potential vendors. Remember: to scale your solution successfully and avoid unforeseen chaos down the road, it is critical to develop a comprehensive monitoring strategy alongside selecting your tools. Understanding how the tools in your stack will work together, and how alerts will be managed, analyzed and routed, will have a significant impact on your MTTR and on root-cause identification for critical issues.