Optimizing Prometheus Configuration for Enhanced Performance


I want to optimize my Prometheus configuration for better performance, particularly the scrape settings and job configuration. We have already raised the timeout settings as far as we can, and I suspect the problem may lie in how the different target groups are configured. Here’s the relevant part of my config:

global:
  scrape_interval: 30s
  scrape_timeout: 10s
  scrape_protocols:
    - OpenMetricsText1.0.0
    - OpenMetricsText0.0.1
    - PrometheusText0.0.4
  evaluation_interval: 15s

scrape_configs:
  - job_name: prometheus
    honor_timestamps: true
    track_timestamps_staleness: false
    scrape_interval: 1m
    scrape_timeout: 20s
    enable_compression: true
    enable_http2: true
    static_configs:
      - targets: ['xxxx:8180', 'xxxx:8180…']
        labels:
          group: 'various_groups'

I have several questions:

  1. Compression and Transport Methods: I currently have enable_compression: true and enable_http2: true set for my Prometheus and Pushgateway jobs. Should I apply these settings to all jobs to reduce network load and speed up data transport?
  2. Job Segmentation: How can I effectively segment scrape_configs into smaller, more manageable parts? My configuration has multiple targets grouped together, which could potentially be split by type or priority for better performance.
  3. Remote Write Optimization: Given the current settings in my remote_write configuration, how can I further tune parameters such as capacity and max_shards to improve transmission efficiency without exhausting CPU, memory, or network resources?
  4. Metric Filtering: Could you suggest some metrics that are commonly removed in setups similar to mine to decrease the load on the system? I am using relabel_configs to filter out unnecessary metrics but would appreciate specific examples of metrics that can be excluded.
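
To make questions 2 to 4 concrete, here is a rough sketch of the config sections they refer to. It is illustrative only: the job names, targets, remote-write URL, regex, and numeric values are placeholders, not my actual settings and not recommendations. One thing worth noting is that dropping individual series after a scrape is done with metric_relabel_configs, whereas relabel_configs acts on targets before the scrape.

```yaml
# Illustrative sketch only: names, targets, URL, regex and numbers are placeholders.
scrape_configs:
  # Question 2: one job per target type or priority, each with its own interval.
  - job_name: critical_services        # hypothetical job name
    scrape_interval: 30s
    static_configs:
      - targets: ['host-a:8180']       # placeholder target
        labels:
          group: critical
    # Question 4: drop unwanted series after the scrape, before they are stored.
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'example_unneeded_.*'   # placeholder pattern
        action: drop

  - job_name: low_priority_services    # hypothetical job name
    scrape_interval: 2m
    scrape_timeout: 20s
    static_configs:
      - targets: ['host-b:8180']       # placeholder target
        labels:
          group: low_priority

# Question 3: the remote_write queue parameters being asked about.
remote_write:
  - url: https://remote-storage.example/api/v1/write   # placeholder URL
    queue_config:
      capacity: 10000              # samples buffered per shard
      max_shards: 50               # upper bound on parallel senders
      max_samples_per_send: 2000   # batch size per request
```

Splitting by priority like this would let the slow or less important targets use a longer interval without affecting the critical ones, which is the kind of segmentation I am asking about.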

I appreciate any insights or recommendations you could offer based on your experience. Thank you!