Loading Grafana dashboard takes too long time

Hi!

I am running Grafana on a EC2 instance in AWS Cloud to visualize data contained in a Timescale database located in the Cloud.

My Grafana dashboard has 14 panels (8 time series, 3 bar charts, 1 pie chart, 2 stats) and all together it performs 69 SQL queries (including 11 queries done in the definition of the defined Grafana variables). These queries are partially done towards the raw data table and partially towards hour aggregate views. The dashboards contains also 17 data transformations. Most of the tests have been done by using a time range of 1 day (144 data points when using the raw data table and 24 data points when using the hour aggregate view).

Example of query:

SELECT mtimestamp, mvalue AS “internal temperature” FROM periodic_measurements WHERE (apartmentid = $apartment AND roomid = $room_id AND metric = 1 AND $__timeFilter(mtimestamp))

When I connect the dashboard to my alfa database (which has a raw data table of about 16 million entries), the dashboard loads in few seconds.

However, when I use the production database (which has over 1.000 million raw data entries) as a data source, the dashboard takes about 40 minutes to load.

The source database is a Timescale database (PostgreSQL-based) hosted in aiven Cloud service. It has 8 CPUs, 16GB RAM and 512GB of disk space. The server runs Postgres version 14.10. and both databases use Timescale version 2.11.2.

The server where Grafana is running is an EC2 instance in AWS running Linux 22.04. It is a m5a.xlarge instance which has 4vCPUs, 16 GB RAM and network bandwidth up to 10Gbps. The server runs Grafana 9.5.1. (free version).

I am accessing the EC2 Linux machine which hosts Grafana through Remote Desktop Connection from Windows 10 Pro (22H2) by using a connection with over 27Mbps for download and over 5 Mbps for upload.

During the tests I got the following results from monitoring the two servers (when using the above mentioned HW configuration):

  • Timescale server
    • CPU usage: 50%
    • IOPS read: 1.600
    • memory usage: 11%


  • EC2 machine
    • CPU usage: 38%
    • Network in (bytes) - 5 minutes: 5M
    • Network out (bytes) - 5 minutes: 210M

In order to decrease the loading time, I have tried using with different HW configurations, obtaining always the same loading time range. The HW configuration described above is the most capable one among those I have been testing.

Also, I have tried to minimize the amount of separate SQL queries that the dashboard is doing by using common panels but that has not improved the performances at all.

Any hints why my Grafana dashboard becomes so slow when reading from tables containing thousands of million of entries?

Thanks a lot in advance,

Bernardo Di Chiara

1 Like

Scary :scream:
Do you have indices on the columns you sre filtering on?

Also what happens if you run these queries against production using a test harness sql script that emulates ghe dashboard queries?

1 Like

Generally: enable Grafana tracing and inspect the longest traces. It is not clear what “the dashboard takes about 40 minutes to load” measures.

1 Like

Hi!

Thanks for your quick reply!

The raw data table does have indexes since all the columns that are used in the query (except obviously the value column) are part of a composite primary key:

CREATE TABLE periodic_measurements
    (mtimestamp TIMESTAMP WITH TIME ZONE NOT NULL,
     apartmentid INTEGER NOT NULL,
     roomid INTEGER NOT NULL,
     sensorid INTEGER NOT NULL,
     metric SMALLINT NOT NULL,
     mvalue NUMERIC NULL,
     PRIMARY KEY (mtimestamp,
                  apartmentid, roomid, sensorid, metric));

The materialized views have no unique indexes. Here is an example:

CREATE MATERIALIZED VIEW max_hour_aggregates
    (mtimestamp, apartmentid, roomid, sensorid, metric, mvalue)
    WITH (timescaledb.continuous) AS
    SELECT time_bucket('1 hour', mtimestamp, 'Europe/Helsinki'),
        apartmentid,
        roomid,
        sensorid,
        metric,
        MAX(mvalue)
    FROM public.periodic_measurements
    WHERE metric IN (1, 2, 5, 6, 7, 16, 19)
    GROUP BY time_bucket('1 hour', mtimestamp, 'Europe/Helsinki'),
        apartmentid,
        roomid,
        sensorid,
        metric
    WITH NO DATA;

However, when testing single queries in SQL CLI, the response time of a single query is acceptable (few seconds) and does not justify the long loading time of the full dashboard.

Kind regards,

Bernardo

Actually, when a continuous aggregate is created in Timescale, an index is automatically created for each GROUP BY column. The index is a composite index, combining the GROUP BY column with the time_bucket column.

Here is the proof:

testdb=> \d+ _timescaledb_internal._materialized_hypertable_205
                                  Table "_timescaledb_internal._materialized_hypertable_205"
   Column    |           Type           | Collation | Nullable | Default | Storage | Compression | Stats target | Description 
-------------+--------------------------+-----------+----------+---------+---------+-------------+--------------+-------------
 mtimestamp  | timestamp with time zone |           | not null |         | plain   |             |              | 
 apartmentid | integer                  |           |          |         | plain   |             |              | 
 roomid      | integer                  |           |          |         | plain   |             |              | 
 sensorid    | integer                  |           |          |         | plain   |             |              | 
 metric      | smallint                 |           |          |         | plain   |             |              | 
 mvalue      | numeric                  |           |          |         | main    |             |              | 
Indexes:
    "_materialized_hypertable_205_apartmentid_mtimestamp_idx" btree (apartmentid, mtimestamp DESC)
    "_materialized_hypertable_205_metric_mtimestamp_idx" btree (metric, mtimestamp DESC)
    "_materialized_hypertable_205_mtimestamp_idx" btree (mtimestamp DESC)
    "_materialized_hypertable_205_roomid_mtimestamp_idx" btree (roomid, mtimestamp DESC)
    "_materialized_hypertable_205_sensorid_mtimestamp_idx" btree (sensorid, mtimestamp DESC)
Triggers:
    ts_insert_blocker BEFORE INSERT ON _timescaledb_internal._materialized_hypertable_205 FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()
Access method: heap
1 Like

But a single query does not emulate a dashboard load.

Also this dynamic indices sounds very suspect

Building composite index on the fly smells like a performance bottle neck

Why do you have metric as part of the primary key? Curious
Is this materialized view crated on top of your olap database ?

True. I am going deeper in Grafana checking logs and traces.

The database is a Timescale database and contains time series: timestamp/key/value

The key is a composite key that includes metric. metric is the metric type (I have used narrow tables).

The database contains a raw data table and continuous aggregate views

The indexes are created when creating the empty tables and views. Then the raw data table is populated and then aggregation is performed.

Just to check whether Grafana is not able to re-order the columns which are part of the composite index, I have modified all the queries by moving the timestamp filter at the beginning of the WHERE clause like this:

SELECT mtimestamp, mvalue AS “internal temperature” FROM periodic_measurements WHERE ($__timeFilter(mtimestamp) AND apartmentid = $apartment AND roomid = $room_id AND metric = 1)

The result on the performance was the same.

1 Like

The problem was caused by certain queries used in the definition of Grafana variables.

Here is an example:

SELECT DISTINCT roomid FROM periodic_measurements WHERE apartmentid = $apartment AND roomid != 0 ORDER BY roomid;

Those queries were very heavy for the measurement database and one reason might be that the periodic_measurements table does not have dedicated index for just the variable apartmentid and no index for the variable roomid either.

I solved the issue by changing the data source of such problematic queries.

The load time has decreased to a couple of minutes or less. Also, the amount of transferred data has decreased in both the Timescale server and the Grafana EC2 machine as well as the CPU utilization in the Timescale server.

Thanks for the hints!

1 Like