Named Capture Group with Line format not working

Hey all! First time looking for help here. I’ve been working on an educational project at my job with intent to turn it to production.

I am ingesting logs via promtail Windows Events component - running on my Windows Domain Controller. Here is a piece of the log that has the pertinent information for what I am testing.

<Data Name='SubjectUserSid'>S-1-5-18</Data><Data Name='SubjectUserName'>username$</Data><Data Name='SubjectDomainName'>subdomain_domain</Data><Data Name='SubjectLogonId'>0x3e7</Data><Data Name='NewProcessId'>0x1cac</Data><Data Name='NewProcessName'>C:\Windows\System32\wbem\WmiPrvSE.exe</Data><Data Name='TokenElevationType'>%%1936</Data><Data Name='ProcessId'>0x37c</Data><Data Name='CommandLine'></Data><Data Name='TargetUserSid'>S-1-0-0</Data><Data Name='TargetUserName'>-</Data><Data Name='TargetDomainName'>-</Data><Data Name='TargetLogonId'>0x0</Data><Data Name='ParentProcessName'>C:\Windows\System32\svchost.exe</Data><Data Name='MandatoryLabel'>S-1-16-16384</Data>

I am trying to run the following query.

{channel="Security", host="domain_controller"} | json | event_id = 4688 | regexp "<Data Name='SubjectUserName'>(?P<sun>.*?)</Data>" | line_format "Username = {{.sun}}"

The result of this query is my log lines just end up "Username = ".

I validated the re2 expression on https://regex101.com/ using this pattern:
“(?P.*?)”

I also tried “Data Name=‘SubjectUserName’\u003e(?P.*?)\u003c/Data” without the json parser which should capture the same value.

Please help me figure out what exactly I’m doing wrong.

I don’t see anything wrong, and I tested your query and it seems to be working as well (see LogQL Analyzer | Grafana Loki documentation)

Perhaps you can try using the same link as well and test your entire log line and see if it returns expected results. If you still have issue perhaps a screenshot of your Grafana (assuming you are using that as frontend) would be helpful.

1 Like

Thank you for testing. The original log line is what is used against the regex and not the one that is cleaned up after the json parser. So I think I need a regex that will capture whatever is in between the < and > unicode character escape sequences.

\u003cData Name=‘SubjectUserName’\u003e-\u003c/Data\u003e\

Which in this case is a dash ( - ).

If that’s your original logline, then you should for sure remove json from your query because it’s not json formatted.

1 Like

That was it! I just needed to remove json from my query I’m hitting all the buttons on the controller until I get something to work.

Here is my working query. It queries for process creation events and formats the log line to show the user that the process was started under.

channel=“Security”, host=“DC1”} |= event_id":4688, | regexp Data Name='SubjectUserName'\\u003e(?P<sun>.*?)\\u003c | line_format `Username = {{.sun}}

I can see this query getting long with regex if I want to capture multiple variables.

Yeah. Couple of things you can try:

  1. You can use multiple regex capture:
.+\'SubjectUserSid\'\>(?P<SID>.[^\<]+)\<.+\'SubjectUserName\'\>(?P<UserName>.[^\<]+)\<.+ <and so on...>

But of course that depends on the order of the data blocks not changing.

  1. You can try doing some pre-processing in your log pipeline. Log agents such as fluentd and fluentbit can transform XML to JSON, may be worth a try.