What to do with externally provided labels without compromising label cardinality?


(I posted an earlier version of this question to the Loki mailing list about a week ago. No one replied so I am trying here with a v2 of the question. Please let me know if anything is unclear.)

I’ve been playing around with the Loki logging driver for Docker. One thing I’ve noticed is that the driver by default adds the labels container_name and filename to each log entry. Both of these labels have a high cardinality and are thus not suitable to be used for indexing.

Thus my line of thought is that they need to be dropped in the pipeline stages before sending the log entry to Loki. However, the container_name label is useful to have so I would like to preserve it. There doesn’t seem to be support for annotation/metadata/non-indexable fields so the only way I can see that happening today is by rewriting the log message.

If I go with rewriting the message, I can drop the labels and add container_name to the message with the following tidbit of logging driver configuration:

loki-pipeline-stages: |
    - template:
        source: output
        template: '{{ .Entry }} container_name={{ .container_name }}'
    - labeldrop:
        - container_name
        - filename
    - output:
        source: output

If .Entry is a typical Golang log line such as the line below, then it can be parsed using the logfmt LogQL parser.

level=info ts=2021-01-05T08:23:07.811Z caller=main.go:429 msg=Listening address=:9093

However, that doesn’t work if .Entry is in JSON format since it would then ruin the chance of using the json LogQL parser for label extraction. I’m thinking that in this case the container_name label should instead be added as an additional field to the JSON log line. I’m not sure how to do that with a Golang template and if it really should be done like that.
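
As a rough illustration of the transformation described above (this is not something the Docker driver or the Loki pipeline stages provide), here is a minimal Python sketch. The helper name add_field and the container name web-1 are hypothetical. It merges the extra field into the object when the line is JSON and falls back to appending a logfmt-style key=value pair otherwise, so either LogQL parser would still have something to work with.

```python
import json


def add_field(line: str, key: str, value: str) -> str:
    """Attach key/value to a log line without breaking its format.

    Hypothetical helper: JSON objects get a new field, anything else
    gets a logfmt-style key=value pair appended.
    """
    try:
        parsed = json.loads(line)
    except json.JSONDecodeError:
        parsed = None

    if isinstance(parsed, dict):
        # JSON object: add the field so a JSON parser can still read the line.
        parsed[key] = value
        return json.dumps(parsed)

    # Not a JSON object: treat the line as logfmt/plain text.
    # Quote the value if it contains characters that would confuse logfmt.
    if any(ch.isspace() or ch in '"=' for ch in value):
        value = json.dumps(value)
    return f"{line} {key}={value}"


if __name__ == "__main__":
    print(add_field('{"level":"info","msg":"Listening"}', "container_name", "web-1"))
    print(add_field("level=info msg=Listening address=:9093", "container_name", "web-1"))
```

Note that in this sketch a JSON line that already has a field with the same key would have it overwritten; a real implementation would need to decide how to handle that collision.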

Am I on the wrong path in my line of thinking?

Can this issue of preserving additional labels without paying the penalty of indexing them be achieved today in a good, generic way?

Would it make sense for Loki to have a concept of non-indexable fields, i.e. fields that are not available to the log stream selector but are available to the rest of the log pipeline?

Am I on the wrong path in my line of thinking?

Not at all, your line of thinking here is pretty spot on. However, as you’ve discovered, there isn’t a great solution to this problem.

There have been a few asks for solving this exact problem, and there isn’t any sort of official solution for it. As you mentioned, if the log isn’t JSON you can just add details to it, but that isn’t gonna work so well for JSON.

Would it make sense for Loki to have a concept of non-indexable fields

I’m not quite sure what the right solution to this looks like. I think I would lean towards modifying the log line to add additional data, since this is the simplest approach and likely to be the least confusing (vs introducing new concepts such as non-indexed labels/fields). However, we would need a way to handle this properly/explicitly in the client (which we don’t have yet).

I added this idea to the 2021 list of feature requests: “What do you want from Loki in 2021?” (grafana/loki, issue #3119).

Thanks for the well thought out discussion!


Hi Folks!

I actually have the same requirement. One of the applications sending logs to Loki for us is a .NET application which provides a lot of fields that can be helpful in certain contexts. Some examples: CallerAssembly, CallerClassName, CallerMemberName, CallerFileName, CallerFileLineNumber, Application, ApplicationVersion, CorrelationId, ProcessId, ThreadId, MachineName, certificatePath, EventId, SourceContext, CandidateCount, Path, Endpoint, RoutePattern, EndpointName, ModelBinderProviders, RouteData, MethodInfo, Controller, AssemblyName, ActionId, ActionName, FilterType, Filters, Method, Filter, ValidationState, ActionResult

I end up concatenating some of those with | as the separator, like:

Service[Information] => This is is the log entry | Field1=[AA] | Field2=[bb] | FieldC=[1234]

But then, since in some cases I have a lot of fields, the log entry ends up being too “messy”, which led some developers to ask us to just “drop” the rest and keep only the message so that it is easier to read.

But the trouble is that as soon as I drop it from the message, I lose the ability to query on it… So I was wondering whether there is some way to work around this by showing a simplified version of the log entry in Grafana while still allowing Loki to query the full entry… Something like:

Log Entry:
Service[Information] => This is is the log entry | Field1=[AA] | Field2=[bb] | FieldC=[1234]

Grafana shows:
Service[Information] => This is is the log entry
(I would love to see those as parsed fields in Grafana.)

But I would still be able to query it the same way I can today with the full log entry.
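
The split being asked for here can be sketched in a few lines of Python. This is only an illustration of the idea, not an existing Loki or Grafana feature; the function name split_entry is hypothetical, and it assumes the " | " separator and the Key=[value] layout from the sample above.

```python
import re

# Matches one trailing field of the form Key=[value], as in the sample entry.
FIELD_RE = re.compile(r"^(?P<key>[^=\s]+)=\[(?P<value>.*)\]$")


def split_entry(entry: str, sep: str = " | "):
    """Split a log entry into (display message, dict of extra fields)."""
    parts = entry.split(sep)
    message = parts[0]
    fields = {}
    for part in parts[1:]:
        match = FIELD_RE.match(part.strip())
        if match:
            fields[match.group("key")] = match.group("value")
        else:
            # Not in Key=[value] form: keep it as part of the visible message.
            message += sep + part
    return message, fields


if __name__ == "__main__":
    sample = (
        "Service[Information] => This is is the log entry"
        " | Field1=[AA] | Field2=[bb] | FieldC=[1234]"
    )
    message, fields = split_entry(sample)
    print(message)
    print(fields)
```

The first return value corresponds to what would be shown, and the second to the fields that should remain available for querying.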

Reading this thread, I think this is similar to what @jonas4 proposed. It would be really great to have this “non-indexed” information, which would still be queryable since it is part of the log entry. I actually just upvoted the entry mentioned by @ewelch in the “What do you want from Loki in 2021?” thread…

Or am I missing something?


Another possibility would be to have two fields on a Loki log entry:

  • Log Entry = The one which is visible today; it could be the simplified version, i.e. the Service[Information] => This is is the log entry part from my sample

  • Additional Log Entry = The additional fields that should be “queryable” (via regex, not labels), i.e. the Field1=[AA] | Field2=[bb] | FieldC=[1234] part from my sample

What do you think about that, @ewelch and @jonas4?