Promtail stages docker and multiline

Hello :wave:

Thanks for any help and feedback in advance :slight_smile: .

Objective/Intro
I’m trying to get multiline logging working on a container-based (Docker) installation in a Kubernetes cluster, using Loki and Promtail deployed through Helm charts. My solution mostly works, except that it does not handle multiline messages that get split when they hit max_lines. I have to admit that my current setup is partly a workaround for other issues with how Docker JSON logging interacts with multiline processing in Promtail.

Implementation
I started from the documentation, having (some) applications insert the &ZeroWidthSpace; marker at the start of each log message.
Below is a partial example log (a Java stack trace) that Promtail has to process.

{"log":"\u0026ZeroWidthSpace;2022-02-11 09:10:47.352 ERROR 1 --- [nio-8080-exec-2] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is java.time.format.DateTimeParseException: Text 'intentionallybreakingtimestamp' could not be parsed at index 0] with root cause\n","stream":"stdout","time":"2022-02-11T09:10:47.353602146Z"}
{"log":"\n","stream":"stdout","time":"2022-02-11T09:10:47.353639043Z"}
{"log":"java.time.format.DateTimeParseException: Text 'intentionallybreakingtimestamp' could not be parsed at index 0\n","stream":"stdout","time":"2022-02-11T09:10:47.353645616Z"}
{"log":"\u0009at java.base/java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:2046) ~[na:na]\n","stream":"stdout","time":"2022-02-11T09:10:47.353652811Z"}
{"log":"\u0009at java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1948) ~[na:na]\n","stream":"stdout","time":"2022-02-11T09:10:47.353659141Z"}
{"log":"\u0009at java.base/java.time.ZonedDateTime.parse(ZonedDateTime.java:598) ~[na:na]\n","stream":"stdout","time":"2022-02-11T09:10:47.35366469Z"}
{"log":"\u0009at java.base/java.time.ZonedDateTime.parse(ZonedDateTime.java:583) ~[na:na]\n","stream":"stdout","time":"2022-02-11T09:10:47.353670223Z"}
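
To make the content of the `log` field concrete, here is a minimal Python sketch that decodes the first sample line (shortened with `...` for readability). It only illustrates what JSON decoding yields for this input; it is not Promtail code, and Python's `\u200b` is used where Go's regex syntax would write `\x{200B}`.

```python
import json
import re

# First sample line from above, message text shortened with "..." for readability.
raw = r'{"log":"\u0026ZeroWidthSpace;2022-02-11 09:10:47.352 ERROR 1 --- ...\n","stream":"stdout","time":"2022-02-11T09:10:47.353602146Z"}'

entry = json.loads(raw)
line = entry["log"]

# \u0026 is the JSON escape for '&', so the marker is the literal entity text.
starts_with_entity = line.startswith("&ZeroWidthSpace;")

# U+200B is the real zero-width space character; it does not appear in the line.
starts_with_u200b = re.match("^\u200b", line) is not None

# The decoded line still carries its trailing newline.
ends_with_newline = line.endswith("\n")

print(starts_with_entity, starts_with_u200b, ends_with_newline)
```

For this input the decoded line starts with the visible text `&ZeroWidthSpace;`, contains no U+200B character, and ends with a newline.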

These are the stages I have come up with:

      - docker: {}
      - multiline:
        # Use this special stage to combine specifically tagged lines into one multiline message
          firstline: '^​'
      - regex:
        # This regex adds the 'multiline' content to the extracted map
          expression: '^(?s)(?P<multiline>&ZeroWidthSpace;)'
      - labels:
        # Now label messages that have a 'multiline' in their extracted map
          multiline:
      - match:
        # Only filter messages that are multiline labelled
          selector: '{multiline="&ZeroWidthSpace;"}'
          stages:
            - replace:
              # Clean up zerospacewidth
                expression: '^(?s)(?P<zerowidthspace>&ZeroWidthSpace;)'
                replace: ''
            - replace:
              # Remove empty lines
                expression: '(?m)(?P<emptyline>^\s*\n)'
                replace: ''

Considerations/Issues

While working on this, my thought process was roughly as follows:

  1. It seems necessary to have the docker: {} stage first, since it handles the JSON conversion and builds the initial extracted map and labels (stream, timestamp). Swapping it with the multiline stage does not seem possible, as the docker stage won’t handle the character conversion of a multiline message well.
  2. Because of the docker/JSON handling, the &ZeroWidthSpace; marker is unfortunately not invisible: it arrives as the literal text &ZeroWidthSpace; (escaped as \u0026ZeroWidthSpace; in the JSON above) rather than as the U+200B character, so the multiline stage cannot match it with the regex ^\x{200B}.
  3. Each original and output message from the docker: {} stage already has a newline (\n).
  4. The multiline stage inserts an additional newline between the lines it joins, which seems to come from multiline.go in the grafana/loki repository (main branch) on GitHub.
  5. This results in multiline messages that start with the &ZeroWidthSpace; marker and have an empty line every other line.
  6. Given that the &ZeroWidthSpace; marker is visible and I have these empty lines, I then use a regex to detect the multiline messages, apply a label, and clean them up. However, the multiline stage can also split a message when it reaches max_lines (default 128). The overflow does not start with the marker, so it is not captured and still ends up with empty lines in the middle. A further side effect is that original empty lines (see line 2 of the example) are also removed.
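
The behaviour described in this list can be reproduced with a small Python model. This is a simplified sketch, not Promtail's actual implementation: the `combine` function, the sample entries, and the tiny `max_lines=3` are all hypothetical, and the model assumes that every entry keeps its trailing newline and that joined lines get one extra `\n` between them.

```python
import re

# Marker as it appears after JSON decoding (literal entity text).
FIRSTLINE = re.compile(r"^&ZeroWidthSpace;")


def combine(lines, max_lines=128):
    """Simplified model of a multiline stage; not Promtail's actual code."""
    blocks, buffer = [], []
    for line in lines:
        # A new first line flushes whatever has been collected so far.
        if FIRSTLINE.search(line) and buffer:
            blocks.append("\n".join(buffer))
            buffer = []
        buffer.append(line)
        # Hitting the line limit also flushes, splitting the message.
        if len(buffer) == max_lines:
            blocks.append("\n".join(buffer))
            buffer = []
    if buffer:
        blocks.append("\n".join(buffer))
    return blocks


# Hypothetical entries as they might leave the docker stage, each ending in "\n".
entries = [
    "&ZeroWidthSpace;ERROR boom\n",
    "\n",
    "java.lang.Exception\n",
    "\tat a.b.C(C.java:1)\n",
    "\tat a.b.D(D.java:2)\n",
]

blocks = combine(entries, max_lines=3)
for block in blocks:
    print(repr(block))
```

In this model the message is split into two blocks. Only the first starts with the marker, so a cleanup that is gated on the marker never sees the second block, which keeps its doubled newlines.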

Help/Ideas?
Given the above, can you help me achieve robust multiline logging without the issues described :slight_smile: ?
One feature request that might resolve this: the multiline stage could populate an extracted map entry or a label itself, including on any messages split by the max_lines parameter.

Any ideas, anyone? Basically, the path I’ve taken so far leads to two issues:

  1. Long messages (>192 lines) are split, and every message except the first has an empty line every other line.
  2. I lose original empty lines logged by applications, as I can’t distinguish between original ones and the empty lines produced by the multiline stage.
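
As an untested idea for the second issue, the following Python sketch compares two cleanups on hypothetical input. It assumes that each entry ends in `\n` and that the multiline stage adds exactly one `\n` between joined entries; whether this carries over to a Promtail `replace` stage expression has not been verified here.

```python
import re

# Hypothetical entries, each ending in "\n"; the second is an original empty line.
entries = ["first\n", "\n", "second\n", "third\n"]

# Assumed model of the multiline stage: entries are joined with one extra "\n".
joined = "\n".join(entries)

# The blank-line regex from the stages above removes every blank line,
# including the one the application logged on purpose.
strip_all = re.sub(r"(?m)^\s*\n", "", joined)

# Collapsing each "\n\n" pair into a single "\n" removes only the inserted newline.
collapse_pairs = joined.replace("\n\n", "\n")

print(repr(joined))
print(repr(strip_all))
print(repr(collapse_pairs))
```

Under these assumptions the pairwise collapse restores the original text, including the intentional empty line. It also does not depend on the marker, so in principle it would not need to be limited to the labelled messages.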

I welcome any feedback :slight_smile: !