Deal with non-standard nginx logs in loki/promtail


I’m not sure what the best approach is for dealing with non-standard (if I may call them so) Nginx log lines, such as those without an actual HTTP request, or with a request that doesn’t contain the HTTP method, request URI and/or HTTP version. For example:

- - [30/Dec/2023:07:00:06 +0000] "" 400 0 "-" "-" "-"
- - [30/Dec/2023:06:26:20 +0000] "\x16\x03\x01\x00\xEE\x01\x00\x00\xEA\x03\x03\xD0\xCF\x9D/\xBE[\xEE\xC8\x9AG,\xCB\x00\x00\x8C\x05Qw\xE2VI,\xC9Y\x9A~\xB3F1\x8B>\xEA \x14ne\xD4\x9AZ\xEEp\xBC/8\xAA\x0Fw\x1C\xFC\xA3\xAE\x83\x96\xEFC\xD4\xEBT\x9By~\x12\x07\x5CF\x00&\xC0+\xC0/\xC0,\xC00\xCC\xA9\xCC\xA8\xC0\x09\xC0\x13\xC0" 400 157 "-" "-" "-"
- - [30/Dec/2023:06:26:20 +0000] "\x16\x03\x01\x00\xCA\x01\x00\x00\xC6\x03\x03\x18j\xA5/\xB3w\xAA\xDD@\xC1\xB4er\xEF\xEE\x09W\x9D\xB8\xE5\xEFS\xE9\x8C\xD6\xDB4\xED,\xDB\x91\x8E\x00\x00h\xCC\x14\xCC\x13\xC0/\xC0+\xC00\xC0,\xC0\x11\xC0\x07\xC0'\xC0#\xC0\x13\xC0\x09\xC0(\xC0$\xC0\x14\xC0" 400 157 "-" "-" "-"

As you can see, this is different from a traditional HTTP request such as:

- - [30/Dec/2023:10:40:06 +0000] "GET /login HTTP/1.1" 200 35153 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36" "-"

where the request is split into three parts (method/uri/http version).

For this I’m using the following regex in promtail:

^(?P<host>[\w\.]+) - (?P<user>[^ ]*) \[(?P<ts>.*)\] "(?P<method>[^ ]*) (?P<request_url>[^ ]*) (?P<request_http_protocol>[^ ]*)" (?P<status>[\d]+) (?P<bytes_out>[\d]+) "(?P<http_referer>[^"]*)" "(?P<user_agent>[^"]*)"?

This successfully matches the last line, but it fails on the non-standard lines above.
How do you normally treat these cases? Online I’m only seeing solutions that simply ignore them, but I don’t think that’s really useful, especially when such requests become abusive and you want to act on them.
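To make the failure concrete, here is a minimal Python sketch that runs the expression from the post against shortened versions of the sample lines. It assumes that Python's `re` module behaves like Promtail's Go regex engine for the constructs used here. The host `203.0.113.7` is a placeholder (the samples above show no client address), and the binary request and user agent are truncated for readability.

```python
import re

# The expression from the post, copied unchanged.
ORIGINAL = re.compile(
    r'^(?P<host>[\w\.]+) - (?P<user>[^ ]*) \[(?P<ts>.*)\] '
    r'"(?P<method>[^ ]*) (?P<request_url>[^ ]*) (?P<request_http_protocol>[^ ]*)" '
    r'(?P<status>[\d]+) (?P<bytes_out>[\d]+) '
    r'"(?P<http_referer>[^"]*)" "(?P<user_agent>[^"]*)"?'
)

HOST = "203.0.113.7"  # placeholder address, not taken from the real logs

NORMAL = HOST + ' - - [30/Dec/2023:10:40:06 +0000] "GET /login HTTP/1.1" 200 35153 "" "Mozilla/5.0" "-"'
EMPTY = HOST + ' - - [30/Dec/2023:07:00:06 +0000] "" 400 0 "-" "-" "-"'
# Truncated version of the binary request from the post.
BINARY = HOST + r' - - [30/Dec/2023:06:26:20 +0000] "\x16\x03\x01\x00\xEE" 400 157 "-" "-" "-"'

for name, line in [("normal", NORMAL), ("empty", EMPTY), ("binary", BINARY)]:
    # The request group demands exactly three space-separated tokens inside
    # the quotes, so an empty or single-token request cannot match.
    print(name, "matches" if ORIGINAL.match(line) else "does not match")
```

The request part of the expression requires two spaces between the quotes, which is why only the well-formed request matches.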

I guess that depends on what you want to do with the information.

If you are creating a graph that aggregates the number of requests by method / URI / HTTP version, then it makes sense to ignore lines that don’t parse, because even if they did parse they would have empty fields anyway.

So if you want to determine how significant the empty log lines are, you could specifically match log lines that come without URI information (match two consecutive double quotes), then match log lines that do have URI information (match a double quote, then .+, then a double quote), compare the counts of the two, and create alerts if necessary.
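The comparison described above could be sketched like this in Python. The two patterns and the `count_requests` helper are illustrative assumptions, not something Promtail or Loki provides. In Loki the same two expressions could probably be used as regex line filters instead. The non-empty pattern uses `[^"]+` rather than `.+` so that it stops at the closing quote of the request field.

```python
import re

# Request field is exactly two double quotes, followed by the status code.
EMPTY_REQUEST = re.compile(r'\] "" \d{3} ')
# Request field has at least one character between the quotes.
NONEMPTY_REQUEST = re.compile(r'\] "[^"]+" \d{3} ')


def count_requests(lines):
    """Return (empty, nonempty) counts for an iterable of access log lines."""
    empty = 0
    nonempty = 0
    for line in lines:
        if EMPTY_REQUEST.search(line):
            empty += 1
        elif NONEMPTY_REQUEST.search(line):
            nonempty += 1
    return empty, nonempty


if __name__ == "__main__":
    sample = [
        '- - [30/Dec/2023:07:00:06 +0000] "" 400 0 "-" "-" "-"',
        '- - [30/Dec/2023:10:40:06 +0000] "GET /login HTTP/1.1" 200 35153 "" "Mozilla/5.0" "-"',
    ]
    empty, nonempty = count_requests(sample)
    print(f"empty={empty} nonempty={nonempty}")
```

An alert could then be based on the ratio of the two counts over a time window.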

The most important part would be (at least for now) the HTTP status codes. For instance, counting the lines with a status >= 400, which I know how to do.

I guess I can add a panel with this information without having to know what the request looks like.

On the other hand, I also want to see the logs themselves by status code.

Now that I think about it, I suppose you’re right and it probably doesn’t make a lot of sense to handle both cases (the three split strings vs a random one). On the other hand, what if I wanted to count requests/order logs by http method? I wouldn’t be able to get that if I just use ".+" for the request field.

At the moment this is what I’ve come up with that matches both cases (3 split words vs random request):

^(?P<host>\d{1,3}(?:\.\d{1,3}){3}) - (?P<user>[^ ]*) \[(?P<ts>[^\]]+)\] "(?:(?P<method>[A-Z]+) (?P<request_url>[^\s]+) (?P<request_http_protocol>[^\s"]+)|(?P<request_random>.*?))" (?P<status>\d{3}) (?P<bytes_out>\d+) "(?P<http_referer>[^"]*)" "(?P<user_agent>[^"]*)"(\s+"(?P<http_x_forwarded_for>[^"]*)")?
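As a quick check of the combined expression, the Python sketch below applies it to shortened sample lines and shows which groups end up populated. It assumes Python's `re` treats this pattern the same way as Promtail's regex stage. The host `203.0.113.7` is a placeholder, and `parse` is a hypothetical helper name.

```python
import re

# The combined expression from the post, split over several lines.
COMBINED = re.compile(
    r'^(?P<host>\d{1,3}(?:\.\d{1,3}){3}) - (?P<user>[^ ]*) \[(?P<ts>[^\]]+)\] '
    r'"(?:(?P<method>[A-Z]+) (?P<request_url>[^\s]+) (?P<request_http_protocol>[^\s"]+)'
    r'|(?P<request_random>.*?))" '
    r'(?P<status>\d{3}) (?P<bytes_out>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<user_agent>[^"]*)"'
    r'(\s+"(?P<http_x_forwarded_for>[^"]*)")?'
)

HOST = "203.0.113.7"  # placeholder address, not taken from the real logs

NORMAL = HOST + ' - - [30/Dec/2023:10:40:06 +0000] "GET /login HTTP/1.1" 200 35153 "" "Mozilla/5.0" "-"'
EMPTY = HOST + ' - - [30/Dec/2023:07:00:06 +0000] "" 400 0 "-" "-" "-"'
# Truncated version of the binary request from the post.
BINARY = HOST + r' - - [30/Dec/2023:06:26:20 +0000] "\x16\x03\x01\x00\xEE" 400 157 "-" "-" "-"'


def parse(line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for line in (NORMAL, EMPTY, BINARY):
        fields = parse(line)
        # A well-formed request fills method/request_url/request_http_protocol;
        # anything else lands in request_random and leaves those three as None.
        print(fields["status"], fields["method"], repr(fields["request_random"]))
```

One thing to keep in mind with this design: for any given line either the three request groups or `request_random` is set, never both, so anything that consumes the extracted fields has to cope with the missing ones.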

Of course this is temporary, but it works, and it helps me understand promtail/loki better in the meantime.

Maybe using ".+" for the request field in promtail would be better after all (I think "[^"]*" might be more efficient, but I’m not 100% sure), and when I need something more specific (HTTP method, request URI, HTTP version) I add that logic in loki? Would that make sense?
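Leaving efficiency aside, the two candidates also differ in behaviour: `.+` needs at least one character, so it cannot match the empty `""` request, while `[^"]*` can. The Python sketch below illustrates this on shortened sample lines, and also shows how the request could be split into its parts in a later step. The pattern names and the `split_request` helper are assumptions for illustration only. In Loki, the later step would be done at query time rather than in Python.

```python
import re

# Two candidate ways of capturing the whole request field as one string.
DOT_PLUS = re.compile(r'\] "(?P<request>.+)" (?P<status>\d{3}) ')
NOT_QUOTE = re.compile(r'\] "(?P<request>[^"]*)" (?P<status>\d{3}) ')

# Optional second step: split a well-formed request into its three parts.
REQUEST_PARTS = re.compile(r'^(?P<method>[A-Z]+) (?P<url>\S+) (?P<protocol>\S+)$')

NORMAL = '- - [30/Dec/2023:10:40:06 +0000] "GET /login HTTP/1.1" 200 35153 "" "Mozilla/5.0" "-"'
EMPTY = '- - [30/Dec/2023:07:00:06 +0000] "" 400 0 "-" "-" "-"'


def split_request(request):
    """Return method/url/protocol as a dict, or None for a malformed request."""
    match = REQUEST_PARTS.match(request)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for line in (NORMAL, EMPTY):
        dot = DOT_PLUS.search(line)
        neg = NOT_QUOTE.search(line)
        print("'.+' matched:", dot is not None, "| '[^\"]*' matched:", neg is not None)
        if neg:
            # Malformed or empty requests simply yield None here.
            print("  parts:", split_request(neg.group("request")))
```

With this split, every line keeps its status code regardless of what the request looks like, and the method is only extracted where it exists.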

By all means if you are looking for status code and URI then you’d want to match for three words. What I meant was that if you are also looking to see the difference between logs with URI and without URI information then you can compare the difference between matching for empty string and non-empty string.