I think this is an interesting problem, and I can’t think of an elegant solution, because you want to deduplicate not only on the exact message but also on time. Presumably, given the following logs:
user1 click on x
user1 click on x
user1 click on x
user2 click on y
<10 minutes passes>
user1 click on x
You’d want the end result to be:
user1 click on x
user2 click on y
<10 minutes passes>
user1 click on x
And I think that’s hard to do with existing tools. Two solutions I can think of:
- Use a recording rule to turn your logs into metrics. You do want to be very clear that the metrics are “cleansed” and that differences from the raw logs are expected. You also want to set the expectation that the deduplication isn’t necessarily sequential, though with a small enough interval that may be inconsequential. Consider the following logs, with labels enclosed in {} and an interval of 10 seconds:
{user="user1",object="x"} user1 click on x
{user="user1",object="x"} user1 click on x
{user="user2",object="y"} user2 click on y
{user="user1",object="x"} user1 click on x
You can aggregate the metrics like so:
count by (user, object) (
  sum by (user, object, logline) (
    count_over_time({<SELECTOR>} | label_format logline=`{{ __line__ }}` [10s])
  )
)
And this should hopefully produce metrics as below (note that the deduplication caused by the aggregation is not sequential):
{user="user1",object="x"} 1
{user="user2",object="y"} 1
You can then write these metrics to a Prometheus instance, if you already have one, using a recording rule, and produce metrics like so:
user_clickstream_cleansed{user="user1",object="x"} 1
user_clickstream_cleansed{user="user2",object="y"} 1
Then configure the recording rule to evaluate every 10 seconds. The caveat, of course, is that the result is reduced to just metrics, but if that’s all you need, it may be sufficient.
- Take a programmatic approach. Write a simple API that queries Loki on a short interval such as 5 minutes, downloads all logs from the past 5 minutes, sorts them, removes duplicate messages, and writes the result back to Loki. Note that when writing back you must write to a different log stream with different labels.
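
For the programmatic approach, the deduplication step itself could look roughly like the Python below. This is only a sketch under my own assumptions: the name `dedupe`, the `(timestamp, line)` tuple shape, and the `window_s` parameter are all made up for illustration, and I'm assuming "duplicate" means the same line seen again without a quiet gap of at least `window_s` seconds. Fetching from Loki and writing back are left out on purpose.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical entry shape: (unix timestamp in seconds, raw log line).
Entry = Tuple[int, str]


def dedupe(entries: Iterable[Entry], window_s: int = 300) -> List[Entry]:
    """Drop a line if the same line was seen less than window_s seconds earlier.

    A line is kept again only after a quiet gap of at least window_s seconds,
    which matches the "<10 minutes passes>" example (an assumption on my part).
    """
    last_seen: Dict[str, int] = {}  # line -> timestamp of its latest occurrence
    kept: List[Entry] = []
    for ts, line in sorted(entries):  # sort by timestamp, then by line
        prev = last_seen.get(line)
        if prev is None or ts - prev >= window_s:
            kept.append((ts, line))
        # Track every occurrence, kept or not, so a steady stream of
        # repeats stays suppressed until it actually goes quiet.
        last_seen[line] = ts
    return kept


if __name__ == "__main__":
    logs = [
        (0, "user1 click on x"),
        (1, "user1 click on x"),
        (2, "user1 click on x"),
        (3, "user2 click on y"),
        (603, "user1 click on x"),  # roughly 10 minutes later
    ]
    for ts, line in dedupe(logs, window_s=300):
        print(ts, line)
```

In a real job you would feed this from Loki's `/loki/api/v1/query_range` endpoint and send the kept entries to `/loki/api/v1/push` under a different set of labels. If the job runs every 5 minutes, you would probably also need to carry `last_seen` over between runs, otherwise a duplicate that straddles two batches slips through.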