How to use "Exact" de-duplication on query result?

First of all, I do not have much experience with either Grafana or Loki, so please be patient with me.

I am using Grafana to count GET requests to our server. On the Explore panel, there is a feature to apply de-duplication to identical log lines (ignoring their timestamps).

In my case it reports 86 duplicates, which would otherwise distort the results.

This is exactly what I need to apply to a query on my dashboard, but I can’t find any documentation or a solution for how to do it, and I would really appreciate your help.

I already stumbled on that thread, and it didn’t really help me.

My current query is just

{container_name="<name>"} 

followed by a lot of exclusions (e.g. !="HeadlessChrome").
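
Put together, the whole thing looks roughly like this (the filter values here are just examples, not my real exclusions):

{container_name="<name>"} != "HeadlessChrome" != "SomeBot" != "healthcheck"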

The counting is done separately in another panel; for now I just want a list of the logs.

On the result of that query (or within the query itself), I want a similar de-duplication, ideally the exact same thing, as the “Exact” de-dup option on the Explore panel.

It would be good to first understand precisely what you are trying to do. I am guessing you either want a count of identical log lines or a sense of how many distinct log lines you have. If that is not what you want, please correct me.

Consider the following logs:

i am logging
i am logging
this is different
i am logging

Are you looking for some sort of aggregation like:

{logline="i am logging"} 3
{logline="this is different"} 1

If so, you can do something like this:

sum by (logline) (
  count_over_time(
    {container_name="<name>"} | label_format logline=`{{ __line__ }}` [$__interval]
  )
)
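
To unpack that: label_format with the `{{ __line__ }}` template copies the entire log line into a logline label, count_over_time then counts entries per stream over each $__interval window, and the outer sum by (logline) collapses everything so each distinct line becomes one series carrying its total count.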

Needless to say, on a heavy cluster with a lot of different log lines this will be quite expensive.


Sorry for coming back to this so late…

I am actually just trying to get rid of identical/duplicate log lines. This is about tracking clicks on referral links. If a user makes two requests in a row from the same IP, with the same device, and to the same endpoint, the second click is not really relevant. Likewise, if some automated service sends multiple requests in a row, that is not valuable information for us.

Again, I basically just need to replicate the already existing feature from the Explore panel on my Dashboard.

We’re talking roughly 30,000 clicks/lines per month. Queries with a time range of up to a year should run within a reasonable amount of time (i.e. a few minutes, which is what they already take). Do you think it’s sensible to keep pursuing this? :sweat_smile:

Thanks for your help!

I think this is an interesting problem, and I can’t think of an elegant solution, because you not only want to deduplicate based on exact messages, you also want to deduplicate based on time. Presumably with the following logs:

user1 click on x
user1 click on x
user1 click on x
user2 click on y
<10 minutes passes>
user1 click on x

You’d want the end result to be:

user1 click on x
user2 click on y
<10 minutes passes>
user1 click on x

And I think that’s hard to do with the existing tools. Two solutions I can think of:

  1. Use a recording rule and turn your logs into metrics. You do want to be very clear that the metrics are “cleansed” and that differences from the raw logs are expected. You also want to set the expectation that the deduplication isn’t necessarily sequential, but with a small enough interval that may be inconsequential. Consider the following logs, with labels enclosed in {} and an interval of 10 seconds:
{user="user1",object="x"} user1 click on x
{user="user1",object="x"} user1 click on x
{user="user2",object="y"} user2 click on y
{user="user1",object="x"} user1 click on x

You can aggregate the metrics like so:

count by (user, object) (
  sum by (user, object, logline) (
    count_over_time(
      {<SELECTOR>} | label_format logline=`{{ __line__ }}` [10s]
    )
  )
)

And this should hopefully produce metrics as below (note that the deduplication caused by the aggregation is not sequential):

{user="user1",object="x"} 1
{user="user2",object="y"} 1

You can then write these metrics to a Prometheus instance, if you already have one, using a recording rule, producing metrics like so:

user_clickstream_cleansed{user="user1",object="x"} 1
user_clickstream_cleansed{user="user2",object="y"} 1

Configure the recording rule to evaluate every 10 seconds; a sketch of such a rule is shown after this list. The caveat, of course, is that the result is reduced to just metrics, but if that’s all you need that may be sufficient.

  2. Take a programmatic approach. Write a simple API that queries Loki on a short interval such as 5 minutes, downloads all incoming logs for the past 5 minutes, sorts them and removes duplicated messages, and writes the result back to Loki. Note that when writing back you must write to a different log stream with different labels.
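
For completeness, here is a rough sketch of what the recording rule from option 1 might look like for the Loki ruler. This is an untested outline: the group name and the recorded metric name are placeholders of my own, <SELECTOR> is your stream selector, and the ruler would also need remote_write configured to ship the results to your Prometheus instance.

groups:
  - name: clickstream_dedup
    interval: 10s
    rules:
      - record: user_clickstream_cleansed
        expr: |
          count by (user, object) (
            sum by (user, object, logline) (
              count_over_time(
                {<SELECTOR>} | label_format logline=`{{ __line__ }}` [10s]
              )
            )
          )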

I am facing the exact same issue. I used dedup and distinct in the LogQL query, but it doesn’t help. Is there any other solution using a query to get the desired result?