"Intro to Mimir" webinar: Q&A

After hosting a session on “Intro to Grafana Mimir”, I would like to share the questions I wasn’t able to answer during the webinar. You can watch the full webinar here.

Does Mimir need to use the AWS S3 “select” feature?

No, it doesn’t. Mimir only needs the standard API calls (upload, head, get, get range, delete), which are typically available in any object storage, including S3-compatible storage.
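To make the list of calls concrete, here is a minimal in-memory sketch of an object store that exposes only those five operations. It is an illustration, not Mimir's actual storage client (Mimir is written in Go); the class name, method signatures, and object key are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryBucket:
    """Toy object store exposing only the five basic operations (hypothetical API)."""

    objects: dict = field(default_factory=dict)

    def upload(self, key: str, data: bytes) -> None:
        """Store the object, overwriting any previous content."""
        self.objects[key] = data

    def head(self, key: str) -> int:
        """Return the object size in bytes without returning its content."""
        return len(self.objects[key])

    def get(self, key: str) -> bytes:
        """Return the whole object."""
        return self.objects[key]

    def get_range(self, key: str, offset: int, length: int) -> bytes:
        """Return `length` bytes starting at byte `offset`."""
        return self.objects[key][offset:offset + length]

    def delete(self, key: str) -> None:
        """Remove the object."""
        del self.objects[key]


bucket = InMemoryBucket()
bucket.upload("tenant-1/block/index", b"0123456789")  # made-up key
size = bucket.head("tenant-1/block/index")
chunk = bucket.get_range("tenant-1/block/index", offset=2, length=3)
```

Nothing here relies on server-side query features such as S3 “select”: reads are either the whole object or a byte range.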

Why is time splitting done on a per-day basis? Was it something you decided empirically, or was there some other reason?

Based on our experience, splitting queries by day is a good trade-off between parallelization and total number of partial (split) queries to execute. A very high number of partial queries may introduce too much coordination overhead, while a very low number of partial queries wouldn’t benefit much from parallelization.

Please keep in mind that queries on a single day are still parallelized by query sharding.
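As a rough illustration of splitting by day, the sketch below cuts a query time range into partial ranges that never cross a day boundary. It is not Mimir's implementation: the function name is made up, and aligning the splits to UTC midnight is an assumption of this example.

```python
from datetime import datetime, timedelta, timezone


def split_by_day(start: datetime, end: datetime) -> list:
    """Split [start, end) into sub-ranges that never cross a UTC midnight.

    Both arguments are assumed to be timezone-aware UTC datetimes.
    """
    parts = []
    cursor = start
    while cursor < end:
        # Midnight at the start of the day after `cursor`.
        next_midnight = datetime(
            cursor.year, cursor.month, cursor.day, tzinfo=timezone.utc
        ) + timedelta(days=1)
        part_end = min(next_midnight, end)
        parts.append((cursor, part_end))
        cursor = part_end
    return parts


# A 2.5-day query starting in the evening becomes 4 partial queries.
parts = split_by_day(
    datetime(2024, 1, 1, 18, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 4, 6, 0, tzinfo=timezone.utc),
)
```

Each partial range can then be executed in parallel and the results merged. A smaller split interval would yield more, smaller partial queries, which is the coordination overhead mentioned above.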

Are queries across all tenants supported?

Yes. It’s supported through a feature called tenant query federation.
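With tenant query federation enabled, a single query names several tenants by joining their IDs with a `|` character in the `X-Scope-OrgID` header. The sketch below only builds such a request. The base URL and tenant IDs are made up, the `/prometheus` path prefix assumes the default HTTP prefix, and the helper function is hypothetical.

```python
from urllib.parse import urlencode


def build_federated_query(base_url: str, tenants: list, promql: str) -> tuple:
    """Return the (url, headers) pair for an instant query across several tenants."""
    url = f"{base_url}/prometheus/api/v1/query?{urlencode({'query': promql})}"
    # Tenant IDs are separated by "|" in the X-Scope-OrgID header.
    headers = {"X-Scope-OrgID": "|".join(tenants)}
    return url, headers


url, headers = build_federated_query(
    "http://mimir.example.internal:8080",  # hypothetical address
    ["team-a", "team-b"],                  # hypothetical tenant IDs
    "up",
)
```

Note that the tenants to query are listed explicitly in the header, so querying “all” tenants means listing all of their IDs.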

What will happen if we don’t have the query-frontend and directly hit querier?

In Mimir, the query-frontend is a required component.

Can Mimir replace Cortex?

Yes, Mimir can replace Cortex. We’re aware of many community users who migrated from Cortex to Mimir and, at Grafana Labs, we migrated from Cortex to Mimir ourselves some time ago.

You can migrate a live cluster from Cortex to Mimir with no downtime. We published a migration guide, as well as a tool to convert your Cortex configuration to Mimir configuration.

The migration guide was tested with Cortex 1.10 and 1.11. More recent versions of Cortex have not been tested, so we recommend that you test the migration in a dev cluster first and reach out to the Mimir team if you need any help.

How is Mimir better than Cortex?

Mimir and Cortex development diverged over time, and we no longer follow Cortex development, so I can’t comment on Cortex improvements after version 1.11.

Speaking of Mimir, we made many improvements to scalability and performance (for example, see How we scaled Grafana Mimir to 1 billion active series), and we’ve reached a scale and performance we weren’t able to achieve with Cortex.

We’ve also introduced many features like out-of-order ingestion, native histograms, or cardinality analysis, in addition to the continuous performance improvements we make release after release.

Did you have a chance to fix cache setup issue for Mimir Helm chart?

We definitely can. Could you please reach out with a link to the issue you’re experiencing?

If I get the error “too many outstanding requests”, does that mean I’m doing exceptional work?

The queriers may be underprovisioned and you should scale them out (add more replicas).

“Too many outstanding requests” is an error returned by the query-frontend (or query-scheduler) when the query queue is full. It means Mimir is receiving more queries per second than the queriers have capacity to execute, which is why it’s typically a signal that you should scale out the queriers.

You can also fine-tune the per-tenant queue size through the setting -querier.max-outstanding-requests-per-tenant.
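The queueing behavior described above can be pictured with a toy model: each tenant has a bounded queue, enqueueing fails once the queue is full, and a slot frees up only when a querier picks up a query. This is a simplified sketch, not Mimir's code; the class, exception, and parameter names are hypothetical.

```python
from collections import defaultdict, deque


class TooManyOutstandingRequests(Exception):
    """Raised when a tenant's queue is already full."""


class TenantQueues:
    """Toy per-tenant bounded queues (hypothetical model of the query queue)."""

    def __init__(self, max_outstanding_per_tenant: int):
        self.max_outstanding = max_outstanding_per_tenant
        self.queues = defaultdict(deque)

    def enqueue(self, tenant: str, query: str) -> None:
        """Add a query to the tenant's queue, or fail if the queue is full."""
        queue = self.queues[tenant]
        if len(queue) >= self.max_outstanding:
            raise TooManyOutstandingRequests(tenant)
        queue.append(query)

    def dequeue(self, tenant: str) -> str:
        """Called when a querier picks up work; frees one slot in the queue."""
        return self.queues[tenant].popleft()


queues = TenantQueues(max_outstanding_per_tenant=2)
queues.enqueue("team-a", "query 1")
queues.enqueue("team-a", "query 2")
# A third enqueue for "team-a" would now raise TooManyOutstandingRequests
# until a querier dequeues one of the pending queries.
```

In this model there are two ways to stop the error: drain the queue faster (more queriers) or allow a longer queue (a higher limit). A longer queue only absorbs bursts; it doesn't add capacity.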

Does Mimir aggregate the metrics over time?
Is downsampling supported? If not, are there plans to add it?

Mimir currently doesn’t support downsampling.

Generally speaking, there’s a common misconception that downsampling would reduce object storage utilization and costs. First of all, object storage cost is a very small part of Mimir’s TCO (Total Cost of Ownership): in our experience, the “data storage” cost of object storage is about 5% of Mimir’s TCO. Second, downsampling doesn’t necessarily reduce object storage utilization, or at least not significantly, because with downsampling we would need to keep multiple different copies of the samples of each series in order to keep returning correct query results for any PromQL function.
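To see why a single downsampled value per window is not enough, consider the sketch below. It keeps several aggregates per window, because different PromQL functions need different ones. This is purely illustrative and not a Mimir feature; the choice of aggregates (count, sum, min, max) and the function name are assumptions.

```python
def downsample(samples: list, window: int) -> dict:
    """Group (timestamp, value) samples into fixed windows, keeping several aggregates."""
    buckets = {}
    for ts, value in samples:
        start = ts - ts % window  # start of the window this sample falls into
        agg = buckets.setdefault(
            start, {"count": 0, "sum": 0.0, "min": value, "max": value}
        )
        agg["count"] += 1
        agg["sum"] += value
        agg["min"] = min(agg["min"], value)
        agg["max"] = max(agg["max"], value)
    return buckets


# Four raw samples, 15 seconds apart, collapsed into one 60-second window.
raw = [(0, 1.0), (15, 9.0), (30, 2.0), (45, 4.0)]
window = downsample(raw, window=60)[0]
average = window["sum"] / window["count"]  # what avg_over_time needs
peak = window["max"]                       # what max_over_time needs
```

Here the average is 4.0 while the peak is 9.0, so storing only the average would give a wrong answer for `max_over_time`. Keeping every aggregate means storing several values per window, on top of the raw samples if those are retained.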

Another argument in favor of downsampling is that it improves query performance for queries over large time ranges. That’s right: the fewer samples to process, the fewer CPU cycles required to execute a query. In Mimir we adopt many other techniques to deliver fast query results, which is why downsampling is a somewhat lower priority for us, though still worth considering.

That being said, even if downsampling is not something we’re actively working on right now, it’s definitely on the list of things we discuss from time to time, and its priority may change at any time. I invite you to join the discussion here.

I must have missed the reference to which the increases in query performance were compared?

I’m not sure which specific comparison you’re referring to, but you can find the webinar recording here and check it out yourself.

Thanks again to everyone who joined the webinar!


P.S. There were a couple of questions about the Grafana Agent which I’ve left out of this post because they’re outside my area of expertise. I suggest that the authors of these questions post them here.

If I install three Mimir instances on EC2 and Prometheus on another EC2 instance, how do I connect them, and how do I scrape data from other EC2 servers?