Improve my pipeline_stages

Hello,

I’m discovering Grafana, Loki and Promtail to process my Apache and Nginx logs.
I have created this pipeline_stages block, which works well and sets the level label depending on the value of http_code:

  pipeline_stages:
  - match:
      selector: '{job="apache"}'
      stages:
      - regex:
          expression: '^\S+ \S+ \S+ \S+ \S+ \S+ \S+ \[.+\] "\S+ \S+ \S+" (?P<http_code>\d{3}) \S+ "[^"]*" "[^"]*" \S+ \S+ In:\S+ Out:.+:.+pct. \S+$'
      - labels:
          http_code:

  - match:
      selector: '{job="apache", http_code=~"(2|3)\\d{2}"}'
      stages:
      - static_labels:
          level: 'info'
  - match:
      selector: '{job="apache", http_code=~"4\\d{2}"}'
      stages:
      - static_labels:
          level: 'warn'
  - match:
      selector: '{job="apache", http_code=~"5\\d{2}"}'
      stages:
      - static_labels:
          level: 'crit'

At present, I have to create the http_code label in the first match block so that the later selectors can match on it and apply the static_labels.

Is it possible to optimize my pipeline_stages so that:

  • I specify my regex only once, as at present
  • I don’t export the http_code label
  • My selectors for defining the value of my static_labels can retrieve the value of http_code directly from the regex

My aim is to lighten processing as much as possible by avoiding unnecessary label exports to Loki, and to group the processing in one place.

Thank you for your help! :grin:

New file version:

  pipeline_stages:
  - match:
      selector: '{job="apache"}'
      stages:
      - regex:
          expression: '^\S+ \S+ \S+ \S+ \S+ \S+ \S+ \[(?P<time>.+)\] "\S+ \S+ \S+" (?P<http_code>\d{3}) \S+ "[^"]*" "[^"]*" \S+ \S+ In:\S+ Out:.+:.+pct. \S+$'
      - labels:
          http_code:
      - match:
          selector: '{http_code=~"(2|3)\\d{2}"}'
          stages:
          - static_labels:
              level: 'info'
      - match:
          selector: '{http_code=~"4\\d{2}"}'
          stages:
          - static_labels:
              level: 'warn'
      - match:
          selector: '{http_code=~"5\\d{2}"}'
          stages:
          - static_labels:
              level: 'crit'
      - labeldrop:
          - http_code
      - timestamp:
          format: '2006-01-02T15:04:05-0700'
          source: time

I added the labeldrop, but I’m not sure that adding the label and then removing it is the best solution…
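
One possible way to avoid the add-then-labeldrop round trip is to compute level inside the extracted data with a template stage and export only that value as a label. The following is an untested sketch, not a verified config. It assumes the template stage can read other extracted keys such as .http_code, it compares the code as a string, and it maps everything below 400 (including 1xx) to info, which differs slightly from the selectors above.

```yaml
pipeline_stages:
- match:
    selector: '{job="apache"}'
    stages:
    # Same regex as above: http_code and time stay in the extracted data only.
    - regex:
        expression: '^\S+ \S+ \S+ \S+ \S+ \S+ \S+ \[(?P<time>.+)\] "\S+ \S+ \S+" (?P<http_code>\d{3}) \S+ "[^"]*" "[^"]*" \S+ \S+ In:\S+ Out:.+:.+pct. \S+$'
    # Derive "level" from http_code without turning http_code into a label.
    # String comparison is enough here because the regex guarantees 3 digits.
    - template:
        source: level
        template: '{{ if ge .http_code "500" }}crit{{ else if ge .http_code "400" }}warn{{ else }}info{{ end }}'
    # Export only the computed value.
    - labels:
        level:
    - timestamp:
        format: '2006-01-02T15:04:05-0700'
        source: time
```

With this layout the regex is still written once, and http_code never reaches Loki as a label, so no labeldrop is needed.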

The only thing you need to parse is probably the timestamp. The rest, I’d say, you can just send to Loki as is, then use the pattern parser to parse the logs at query time.

For example, let’s say your Nginx logs look something like this:

127.0.0.1 - - [05/Jun/2024:20:59:50 +0000] "GET /api/something HTTP/1.1" 200

You could do:

{SELECTOR} | pattern `<_> <_> <_> [<_>] "<method> <path> <http_version>" <http_status>`
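
On the Promtail side, the "parse only the timestamp" suggestion might reduce to something like the sketch below. The shortened regex is an assumption made for illustration (it captures only the bracketed field), and the timestamp format is copied from the pipeline earlier in the thread, so adjust both to your actual log lines.

```yaml
pipeline_stages:
- match:
    selector: '{job="apache"}'
    stages:
    # Capture only the bracketed timestamp; nothing else is extracted.
    - regex:
        expression: '\[(?P<time>[^\]]+)\]'
    # Use the captured value as the entry timestamp.
    - timestamp:
        source: time
        format: '2006-01-02T15:04:05-0700'
```

No labels are created here beyond the ones the scrape config already sets; the status code, method and path are recovered by the query when needed.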

Lastly, a small nitpick. In my opinion you should not be setting the level label based on your Nginx HTTP status. The level is supposed to denote whether the logs themselves are info or warn, not the content of the logs. But this is, of course, just my opinion.