Grafana runs out of memory when querying a longer-timeframe dataset

  • What Grafana version and what operating system are you using?
    I am using Grafana 7.0.0 on RHEL 7.3 (Linux).

  • What are you trying to achieve?
    I am running a query against InfluxDB for 6 months' worth of data.

  • How are you trying to achieve it?
    I have InfluxDB connected as a data source.

  • What happened?
    The 24-hour timeframe works fine, but if I query 7 days' worth then I get an out of memory error. I also had Chronograf run the same query on InfluxDB and I did not get any error; as a matter of fact it is much faster.

  • What did you expect to happen?
    We increased the memory from 16 to 32 GB and went from 6 processors to 12, but there is no change in the behavior.

  • Can you copy/paste the configuration(s) that you are having problems with?

  • Did you receive any errors in the Grafana UI or in related logs? If so, please tell us exactly what they were.
    I turned debug logging on in hopes of catching the issue, but the logs were not very helpful. I did not get any error in the Grafana UI except the out of memory error.

  • Did you follow any online instructions? If so, what is the URL?
    No, I did not follow any URL. I did look through the GitHub repo issue list but did not find anything I could use to tweak the configuration.

Can someone suggest what I should do as next steps?

Interesting. Can you point out exactly where the error occurs? Do you see any errors at all in the Grafana logs? Does the Grafana server process actually crash due to OOM? Or is it just something that you see in the front end? If it’s the latter, are there any more details in your browser’s console or network request logs?

In my experience even a modestly sized Grafana server (4GB RAM) shouldn’t have memory issues with heavy queries, so what you describe sounds peculiar.

Finally, and more generally, I guess you’re not applying any aggregation over time in your query? It’s worth considering whether you actually need to query the raw data; if you’re simply plotting the data, then applying an aggregation may be sensible. Sorry if I have the wrong end of the stick here (I don’t know your use case), but just a thought.
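For illustration, here is what the difference looks like in InfluxQL (the measurement and field names are placeholders, not taken from your setup; `$timeFilter` and `$__interval` are the standard Grafana macros for the InfluxDB data source):

```sql
-- Raw query: returns every stored point in the selected time range
SELECT "value" FROM "my_measurement" WHERE $timeFilter

-- Aggregated query: returns one mean per interval, where Grafana
-- picks $__interval to roughly match the panel's pixel width
SELECT mean("value") FROM "my_measurement" WHERE $timeFilter GROUP BY time($__interval)
```

With the aggregated form, the server does the reduction and the browser only receives on the order of one point per pixel, regardless of how long the time range is.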


Thanks for your reply @svetb. The browser is what reports the OOM. Yes, it is the raw data; you got that right :). I have not started aggregating it yet, though it is under consideration. When I use Chronograf, which is part of the TICK stack, the results come back in seconds and it does not time out or throw any errors. When Grafana reports OOM I do not see anything in the logs about it. I also run top in another terminal and there is plenty of memory free. Grafana, InfluxDB, and Chronograf are all on the same server. We increased the memory from 16 to 32 GB last night and it has made no difference. I had the SA check the system logs and there is no OOM in /var/log/messages etc. I have not made any changes in the config.ini for Grafana that could cause issues. If I view 24 hours on the dashboards, I see no issues and the dashboard shows up at an acceptable speed. When I view a 7-day graph I see the OOM in the browser. I did a browser reset (Edge) but there was no change; Chronograf performs fine regardless. Raw stats are collected once a minute. Also, my boss likes Grafana, which looks like Graphite :(.


Right, so your browser is running out of memory due to the volume of data being thrown at it. As you noticed, upgrading your server won’t make a difference.

There may be various workarounds and tweaks you can try, but adding an aggregation is definitely the best solution. There’s basically little point in querying - and feeding your browser - millions (?) of data points if all you really need is a chart on a screen with a resolution of ~2K pixels.

Happy to try and point you in the right direction if you’re not sure how to approach that.

Yes, I would definitely appreciate your help in pointing me in the right direction. One thing I still cannot understand is why Grafana runs out of memory when Chronograf does not. In fact, I can run larger datasets in less than 30 seconds in Chronograf, using the same browser (the new Edge). If I could understand that, then I would look at workarounds, but I cannot find any reason. Is it because of the resolution?

Are you sure you’re running exactly the same query in Grafana and Chronograf? Chronograf does do aggregation by default, unless you manually write a query that does not have it. In fact, that’s the default behavior in Grafana also. So it’s a bit hard to give you a good diagnostic without seeing the specific query/queries.

Either way, even if you’re running the same query, it’s possible that the Chronograf front end happens to be good at handling a payload with millions of points - while Grafana’s isn’t. Grafana does provide far more complex functionality for post-query data manipulation (i.e. in the front end), so it’s possible that this causes it to be less good at handling massive payloads. I don’t know…even though Grafana and Chronograf look kind of the same, they’re very different tools - so I don’t find it quite as surprising that their behavior might diverge when faced with an edge case.

Hi @svetb,

I went ahead and installed Apache to act as a proxy, and it did a little better, but obviously not good enough. So can you guide me on how to do an aggregation? Is that done using the Telegraf plugin?

Thanks for your help.

Ketan

Aggregation doesn’t require any data modifications - it’s just down to how you query the data. See Explore data using InfluxQL | InfluxDB OSS 1.8 Documentation

If you share your query I can probably point you in the right direction.

Hi @svetb

Thanks for your help. Here are some queries from different dashboards:

Grafana:

SELECT "value" FROM "stat.avedur" WHERE $timeFilter GROUP BY "host"
SELECT "value" FROM "stat-amqp-store-step.count" WHERE $timeFilter GROUP BY "host"

SELECT "usage_idle" * -1 + 100 FROM "autogen"."cpu" WHERE ("cpu" = 'cpu-total' AND "tag" = 'totalcpu' AND "host" = 'servername1.example.com' OR "host" = 'servername2.example.com') AND $timeFilter GROUP BY "host"

Chronograf:
SELECT mean("value") AS "mean_value" FROM "DB_Name"."autogen"."stat.media-read-decrypt.avedur" WHERE time > :dashboardTime: AND time < :upperDashboardTime: GROUP BY time(:interval:), "host" FILL(null)
SELECT mean("value") AS "mean_value" FROM "DB_Name"."autogen"."stat.count" WHERE time > :dashboardTime: AND time < :upperDashboardTime: GROUP BY time(:interval:), "host" FILL(null)

Regards,

Ketan

Hi @kedesai! Is it fair to leave it as “an exercise for the reader” to see how the Chronograf queries are quite different to the Grafana ones?

The GROUP BY time(:interval:) bit is the aggregation I was talking about. Documented here: Explore data using InfluxQL | InfluxDB OSS 1.8 Documentation

In Grafana, the equivalent of

SELECT "value" FROM "stat.avedur" WHERE $timeFilter GROUP BY "host"

with a time aggregation is

SELECT mean("value") FROM "stat.avedur" WHERE $timeFilter GROUP BY time($__interval), "host"

You can also add a FILL(null) clause at the end, like in the Chronograf queries; I don’t remember if that’s really necessary or just a nice-to-have.
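If you do include it, the full query (using the same example measurement) would read:

```sql
-- FILL(null) emits a null for intervals with no data, so gaps in the
-- series show up as gaps in the chart rather than interpolated lines
SELECT mean("value") FROM "stat.avedur" WHERE $timeFilter GROUP BY time($__interval), "host" FILL(null)
```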

Maybe give that a go and see if it helps?


Thanks very much @svetb. I changed the dashboard, which has about 75 graphs, and once I used the aggregation the graphs loaded in a flash, even for a 90-day timeframe. I really appreciate your guidance.
