How to discard logs by default in promtail?

We need to be able to only process the logs that match regular expressions and the remaining logs should be dropped.
We tried with the following promtail config file:

> pipeline_stages:
>   - match:
>       selector: '{job="test1"}'
>       stages:
>         - regex:
>             expression: 'some regular expression'
>         - timestamp:
>             source: timestamp
>             format: "2022-01-01 00:03:06.555"
>   - match:
>       selector: '{job="test1"}'
>       stages:
>         - regex:
>             expression: 'some regular expression'
>         - timestamp:
>             source: timestamp
>             format: "2022-01-01 00:03:06.555"
>         - labels:
>             alabel:
>   - drop:
>       expression: ".*.*"

With the above all logs are dropped because of the drop statement. If we remove it, all logs go through including those that do not match the regular expressions.
Any suggestions?

You need to do a regex capture so that you have something to match against. Assuming you are trying to catch the phrase “this is a legit log” in your log, maybe something like this would work:

pipeline_stages:
  - regex:
      expression: ^.*(?P<rcapture>this is a legit log).*$
      source: <IF_APPLICABLE>
  - labels:
      rcapture:
  - match:
      selector: '{rcapture="this is a legit log"}'
      stages:
        - DO_STUFF
  # Drop logs that didn't match.
  - match:
      selector: '{rcapture!="this is a legit log"}'
      action: drop
      drop_counter_reason: non_essential_log
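
A variant of the same idea that avoids adding a label: as far as I know, the match stage selector can carry a LogQL line filter after the stream selector, so the negation can live in the selector itself. A minimal sketch, assuming the `job="test1"` label from the question and the same placeholder phrase; check it against your Promtail version before relying on it:

```yaml
pipeline_stages:
  # Assumption: every entry from this scrape config carries job="test1".
  # Drop any line that does NOT match the wanted phrase.
  - match:
      selector: '{job="test1"} !~ "this is a legit log"'
      action: drop
      drop_counter_reason: non_essential_log
  # Stages placed after this point only see the lines that were kept.
```

The trade-off is that nothing is extracted along the way, so any parsing of the kept lines still needs its own regex stage afterwards.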

Is that the only way? For the logs that we want to drop, we won’t know what they will contain or whether they will look like legit logs, i.e. logs that we want to let through. So a regex might end up matching logs that we want.

Please post sample logs and mark each one as keep or discard. Also, look at this thread.

Your original request was that “we need to be able to only process the logs
that match regular expressions and the remaining logs should be dropped”,
which implies that you can create regexes which match the logs to be processed
(and the remainder dropped).

If that is not the case, how do you identify which logs are of interest and
which should be discarded?

In cases such as this I often find it useful to imagine I am asking a person to
do the job, and explaining to them what they need to pay attention to and what
they should ignore.

Once you can express that, it’s generally just a matter of asking someone who
knows more about regexes than you do (because nobody ever knows enough about
regexes to solve their current requirement) how to put this in terms that a
computer can work with.

Antony.

Sample logs:
[INFO] 2022-12-01 19:30 http code 404 Keep
[DEBUG] 2022-12-01 19:30 <some response time e.g. 15s> Discard
[INFO] 2022-12-01 19:30 request time 5 seconds Keep
[INFO] 2022-12-01 19:30 unknown text from applications Discard

Yes, we can use a regex to get the http code and the request time. Everything else should be discarded.
Do you mean we need to write a regex for each one to match, and then negate it for the drop? That would typically be a very long regex.

What about making the default action “drop”, so that only the logs for which we explicitly define an action are kept?

So the discard requirement is that it has only a date and time and nothing else?

You edited your response. So, your discard and keep lines look awfully identical.

Sorry, I had <unknown text from applications> in there and it was not shown after copy-paste.

Yes, all the logs start with a level and a timestamp, followed by some arbitrary text. We know we need to match the request time and the http status code; we don’t care about anything else.

Another option is probably to drop on a source, e.g. timestamp: when we don’t declare a timestamp, we give it some default value, and we look for that default in the drop section.
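
The default-value idea can be sketched with a regex stage, a template stage and a drop stage. This is only a sketch under assumptions: the key name `wanted` and the marker value `unmatched` are made up for illustration, the two patterns are guessed from the sample lines above, and it relies on the template stage creating the key when the regex did not match.

```yaml
pipeline_stages:
  # Capture into "wanted" only when the line looks like one of the two
  # kinds we care about. "wanted" is a hypothetical key name.
  - regex:
      expression: '(?P<wanted>http code \d+|request time \d+ seconds)'
  # If the regex did not match, "wanted" is missing from the extracted data;
  # give it a default marker value instead ("unmatched" is made up too).
  - template:
      source: wanted
      template: '{{ if .Value }}{{ .Value }}{{ else }}unmatched{{ end }}'
  # Drop every entry that still carries the default marker.
  - drop:
      source: wanted
      value: unmatched
      drop_counter_reason: non_essential_log
```

With the four sample lines, the first and third should be kept and the other two dropped. A single alternation keeps the regex short, and each new kind of line to keep is one more branch.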

Are there cases where you want to keep DEBUG?

But you have marked the log below as keep, and it has no http code.

[INFO] 2022-12-01 19:30 request time 5 seconds Keep

Please provide a clean and accurate requirement.

In this case then I’d say you are overthinking it. Logs are not like metrics, you can have junk data in your logs, as long as you have a way to filter out the part you don’t want later. In general, cost not being a consideration, it’s much better to keep your logging pipeline clean and easy and parse those logs for what you want down the line, provided you have the way to do so.

In this case, if all you want is the HTTP return code and the request time, you can simply log everything to Loki and parse the logs like so (I don’t know your log structure, so I am just making up a pattern):

For http code:

{some_label="some_value"}
  | pattern `[<_>] <_> <_> http code <code>`
  | __error__="" | unwrap code

For request time:

{some_label="some_value"}
  | pattern `[<_>] <_> <_> request time <time_second> seconds`
  | __error__="" | unwrap time_second