-
What Grafana version and what operating system are you using?
-
What are you trying to achieve?
-
Overview:
We need a comprehensive monitoring solution for our Elasticsearch clusters using Prometheus as the data source and Grafana for visualization. Each Elasticsearch environment (prod, dev, uat, trn1, trn2, trn3) consists of over 100 servers. The primary goal is to monitor the health and performance of these clusters, with specific requirements for visualization, alerting, and historical data analysis.
Requirements:
- Dashboard Structure:
  - Panel 1: Cluster Status Visualization
    - Visualize each environment (prod, dev, uat, trn1, trn2, trn3) as separate icons.
    - Icons should indicate the overall health of the environment:
      - Green: All servers are operational.
      - Yellow: 1 or 2 servers are down or inaccessible.
      - Red: More than 2 servers are down or inaccessible.
    - Clicking an environment icon should display detailed information in Panel 2.
  - Panel 2: Server Status within Environment
    - List all servers within the selected environment.
    - Highlight servers that are down or inaccessible.
    - Clicking a server hostname should display detailed resource consumption in Panel 3.
  - Panel 3: Individual Server Resource Consumption
    - Display detailed resource metrics (CPU, memory, disk usage, etc.) for the selected server.
    - Include alerts for resource thresholds (e.g., CPU usage exceeding a set threshold).
    - Provide the cause of server downtime if available.
    - Clicking a specific resource metric should display historical data in Panel 4.
  - Panel 4: Historical Resource Usage
    - Show historical usage data for the selected resource metric.
    - Provide options to view data over various time ranges (e.g., last 24 hours, last week, last month).
- Alerting:
  - Configure alerts for the following scenarios:
    - Server down or inaccessible.
    - CPU usage exceeding a specified threshold.
    - Any other critical resource alerts as deemed necessary.
  - Set up email notifications for alerts to ensure timely response to issues.
- Data Collection:
  - Ensure Prometheus is configured to scrape metrics from all Elasticsearch servers.
  - Collect metrics such as server availability, CPU usage, memory usage, disk usage, and network activity.
- Integration:
  - Ensure seamless integration between Prometheus and Grafana.
  - Enable drill-down capabilities to move from environment-level views to server-level details and resource-specific historical data.
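
To make the Panel 1 color rule concrete, here is a rough Python sketch of the logic I have in mind. It is only an illustration and rests on assumptions that are not part of our setup yet: that every scrape target carries an `env` label (prod, dev, uat, ...), that the scrape job is named `elasticsearch`, and that the input is the JSON returned by a Prometheus instant query. The names `DOWN_PER_ENV_QUERY`, `classify_environment`, and `health_by_environment` are made up for this example.

```python
# Hypothetical PromQL: counts targets whose `up` metric is 0, grouped by an
# assumed `env` label. The job name "elasticsearch" is also an assumption.
DOWN_PER_ENV_QUERY = 'count by (env) (up{job="elasticsearch"} == 0)'


def classify_environment(down_count: int) -> str:
    """Map the number of down/inaccessible servers to the Panel 1 color."""
    if down_count < 0:
        raise ValueError("down_count cannot be negative")
    if down_count == 0:
        return "green"   # all servers operational
    if down_count <= 2:
        return "yellow"  # 1 or 2 servers down
    return "red"         # more than 2 servers down


def health_by_environment(api_response: dict, environments: list[str]) -> dict[str, str]:
    """Turn a Prometheus instant-query response into {environment: color}.

    The query returns no series for an environment with zero down servers,
    so every requested environment starts at 0 and is only overwritten
    when a sample for it is present.
    """
    down = {env: 0 for env in environments}
    for sample in api_response["data"]["result"]:
        env = sample["metric"].get("env")
        if env in down:
            # Instant-vector samples look like [unix_timestamp, "value"].
            down[env] = int(float(sample["value"][1]))
    return {env: classify_environment(count) for env, count in down.items()}
```

In Grafana itself I assume the same rule would be expressed as thresholds on the query result (0 = green, 1 = yellow, 3 = red) rather than in external code, but I am not sure that is the right approach, which is part of what I am asking.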
Additional Details:
- Accessibility: Ensure the dashboard is accessible to relevant team members and provides an intuitive interface for monitoring and troubleshooting.
- Scalability: The solution should handle the current number of servers and be scalable to accommodate future growth.
- Performance: The monitoring setup should have minimal impact on the performance of the Elasticsearch clusters.
Please ensure that the monitoring setup is robust and provides clear insights into the health and performance of our Elasticsearch clusters. If there are any questions or additional requirements, feel free to reach out.
Thank you!
-
How are you trying to achieve it?
-
What happened?
-
What did you expect to happen?
-
Can you copy/paste the configuration(s) that you are having problems with?
-
Did you receive any errors in the Grafana UI or in related logs? If so, please tell us exactly what they were.
-
Did you follow any online instructions? If so, what is the URL?