If you were to instrument and monitor a VMware + Windows + AD environment of about 50 ESXi hosts and 500 Windows VMs, with about a dozen LoB (line-of-business) applications and services, and had to start from scratch, how would you go about it?
If the question isn’t right for this community - sorry! Would you recommend a more appropriate community that’s also guided by “good reply game” and “yes-and” principles?
(The question isn’t necessarily specific to Grafana - yet having heard a lot of good things about it, I am hoping Grafana will play a prominent part in it.)
In case it’s relevant: it’s a retail chain with dozens of stores and about a dozen LoB (line-of-business) applications and services. Before I was hired, most monitoring was done with , vCenter and vendor-specific tools (e.g. APC and Meraki). In the 10 months since I was hired, we’ve made good progress with monitoring and alerting in SolarWinds SAM. There is more work to do, including log aggregation, SIEM, and potentially migrating the o11y infra to the cloud. What would you recommend looking into, assuming resources are low but interest is high?
Or in other words… If you were in my situation, with SolarWinds, Splunk, some PowerShell scripting experience: what would be your next o11y toolset that could supplement or replace SolarWinds SAM? What would you do to best position yourself, your team and your org for the future in terms of o11y, its costs, efficiencies, scale, and long term viability?
Anything that supports the OpenTelemetry protocol, or at least where the vendor has an OpenTelemetry receiver. You can have different vendors for the collection/agent part (some LoBs may not like the default vendor and prefer a different one), and their data can be pushed through an intermediate OpenTelemetry layer to the final signal storage (on-prem or cloud - probably the LGTM stack).
66% of respondents use 4 or more observability tools within their group, while 52% say their company uses 6 or more, including 11% that say their company uses 16 or more observability tools.
So don’t be sad if you end up with many tools - it’s normal (and the best setup will be if all of them can be connected via OpenTelemetry).
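One rough way to picture that middle layer, as a sketch only: agents from different vendors emit signals, and a single routing step decides which backend stores each signal type. The mapping follows the usual LGTM split (Loki for logs, Tempo for traces, Mimir for metrics, with Grafana on top). The `Signal` class, the `route` function, and the agent names are hypothetical; in practice an OpenTelemetry Collector pipeline would play this role.

```python
from collections import defaultdict
from dataclasses import dataclass

# Usual LGTM split: Loki = logs, Tempo = traces, Mimir = metrics.
BACKEND_BY_SIGNAL = {"logs": "loki", "traces": "tempo", "metrics": "mimir"}


@dataclass(frozen=True)
class Signal:
    source_agent: str  # hypothetical agent/vendor name
    kind: str          # "logs", "traces" or "metrics"
    payload: str


def route(signals):
    """Group signals by destination backend, whichever agent sent them."""
    routed = defaultdict(list)
    for signal in signals:
        backend = BACKEND_BY_SIGNAL.get(signal.kind)
        if backend is None:
            raise ValueError(f"unknown signal kind: {signal.kind!r}")
        routed[backend].append(signal)
    return dict(routed)


# Two different (made-up) agents feed the same routing layer.
incoming = [
    Signal("vendor-a-agent", "metrics", "cpu=0.42"),
    Signal("vendor-b-agent", "logs", "service restarted"),
    Signal("vendor-a-agent", "logs", "disk nearly full"),
]
routed = route(incoming)
```

The point of the design is that swapping or adding an agent only changes what goes into `incoming`; the storage side stays the same.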
Dashboarding, PromQL or something similar for logs.
Things I am unlikely to get (but which we do need): auto-discovery (w/o coding too much or implementing CM/IaC tools for the purpose), where all or some of the nodes (machines) can be imported from AD, including on a regular schedule, and then OOTB or custom monitoring templates applied to them.
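To make the auto-discovery wish concrete, here is a minimal sketch of the matching step only. It assumes the node list has already been exported from AD by some other means (for example a scheduled script), and every name in it (`Node`, `TEMPLATE_RULES`, the OU paths, the template names) is hypothetical rather than taken from any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    os: str
    ou: str  # organizational unit path, as it might appear in an AD export
    templates: list = field(default_factory=list)


# Hypothetical rules: (template name, predicate deciding whether it applies).
TEMPLATE_RULES = [
    ("windows-base", lambda n: n.os.lower().startswith("windows")),
    ("domain-controller", lambda n: "OU=Domain Controllers" in n.ou),
    ("sql-server", lambda n: "OU=SQL" in n.ou),
]


def apply_templates(nodes, rules=TEMPLATE_RULES):
    """Attach every matching monitoring template to each node."""
    for node in nodes:
        node.templates = [name for name, matches in rules if matches(node)]
    return nodes


def diff_inventory(known_names, discovered_nodes):
    """Compare a scheduled import with what is already monitored."""
    discovered = {n.name for n in discovered_nodes}
    added = sorted(discovered - set(known_names))
    removed = sorted(set(known_names) - discovered)
    return added, removed


nodes = apply_templates([
    Node("DC01", "Windows Server 2019", "OU=Domain Controllers,DC=example,DC=local"),
    Node("SQL01", "Windows Server 2022", "OU=SQL,OU=Servers,DC=example,DC=local"),
])
added, removed = diff_inventory({"DC01", "OLD-FS01"}, nodes)
```

Running `diff_inventory` on each scheduled import gives the list of machines to start and stop monitoring, which is the part that otherwise ends up being done by hand.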
More distressed than sad. “Many tools” means tool sprawl, a well-known problem that results in gargantuan amounts of technical debt. We don’t want to be in debt…
I also think all those o11y reports and studies miss the big picture: state- and county-level grocery retail chains, warehouses, medium-size manufacturing plants, etc. don’t know what o11y is, and don’t want to know. They want to solve business problems. They want to know if their hardware, applications, and services are behaving as expected. Historically this all came from the vendor (the maker of the hardware and tools), yet these days it’s often a combo of network and physical hardware, maybe some hypervisors, maybe some cloud infra, maybe some AD, maybe some Linux, maybe some desktops (that would also be good to keep an eye on).
So they want a tool (ideally one, definitely not many) for their single sysadmin (or MSP) to keep an eye on SNMP, WMI, and WinRM devices, and perhaps also a log aggregation system that works OOTB and requires minimal setup and maintenance.
SolarWinds has historically been able to do most of it. It’s expensive, does not scale well, and in a lot of places sits dormant and abandoned because there’s no time or resources to set it up and maintain it.
Good community support. (Lack of answers to the fairly simple question above may be an indicator that OTel / Grafana may be too heavy of a lift for your average sysadmin.)
Big 4 (CPU, memory, storage, network).
Ability to “walk” processes and services, and add them to monitoring templates on the fly.
Disk and processor queue lengths, capacity and resource exhaustion charts.
Custom PS scripts for monitoring file and directory sizes and counters.
Intelligent alerting with reusable macros and conditions. (E.g. set up Slack channel targets in one place, not in every Slack alert; set up firing conditions in each alert allowing arbitrary muting depending on the source of the alert and other variables.)
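As an illustration of the last item, a minimal sketch of "define targets once, mute per alert": notification targets live in one table, and each rule carries its own mute conditions. All names here (`TARGETS`, `Rule`, `dispatch`, the channel and host names) are hypothetical and not tied to SolarWinds, Grafana, or any other product.

```python
from dataclasses import dataclass

# Notification targets are defined in one place and referenced by key.
TARGETS = {"ops": "#ops-alerts", "stores": "#store-alerts"}  # made-up channels
MAINTENANCE_HOSTS = {"esx-07"}  # made-up host currently under maintenance


@dataclass(frozen=True)
class Alert:
    name: str
    source: str    # host or service that raised the alert
    severity: str  # e.g. "info", "warning", "critical"


@dataclass(frozen=True)
class Rule:
    alert_name: str
    target: str          # key into TARGETS
    mute_if: tuple = ()  # reusable predicates; any match mutes the alert


def in_maintenance(alert):
    return alert.source in MAINTENANCE_HOSTS


def is_low_severity(alert):
    return alert.severity == "info"


RULES = [
    Rule("disk_full", "ops", mute_if=(in_maintenance,)),
    Rule("pos_offline", "stores", mute_if=(in_maintenance, is_low_severity)),
]


def dispatch(alert, rules=RULES, targets=TARGETS):
    """Return the channels that should be notified for this alert."""
    channels = []
    for rule in rules:
        if rule.alert_name != alert.name:
            continue
        if any(condition(alert) for condition in rule.mute_if):
            continue  # muted by one of the rule's conditions
        channels.append(targets[rule.target])
    return channels
```

Renaming a Slack channel then means editing `TARGETS` once, and a new mute condition is one more small function that any rule can reuse.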
yet they chose SolarWinds
I think you are looking for the holy grail. Definitely a market opportunity!
So your requirements read more like an astronaut’s guide to o11y than a hitchhiker’s.
They need to crawl before they can soar. It might require that they hire a legit sysadmin with Swiss Army knife skills, or outsource it. Can’t expect a 12-course meal when you’ve got a chef with wooden utensils.
You’re right… Thank you and @jangaraj for the suggestions.
If we think of IT management (including infra monitoring) moving to devops and IaC frameworks where every enlightened future sysadmin understands how to deploy and maintain everything gracefully, including monitoring tools - then @jangaraj’s suggestion of the LGTM stack makes a lot of sense - and this is where I should be heading, too.
(We aren’t there yet in terms of prevalence of devops/IaC in legacy IT, or the suitability of the stack for legacy infra monitoring applications - yet for this hitchhiker, it’s a map, a direction.)