Hi folks,
We’ve been running into CPU spikes on our Promtail agents because we have a huge list of pipeline_stages with complex regexes to scrub PII and secrets (like Stripe tokens and emails) before sending logs to Loki.
It’s getting hard to manage, so I’ve been experimenting with offloading the sanitization step entirely. I wrote a small Go sidecar that sits in the K8s pod and intercepts the stdout stream before Promtail even sees it.
Instead of relying on regex alone, it calculates Shannon entropy on the fly to detect random-looking API keys, and replaces identifiers with deterministic HMAC hashes (e.g., [HIDDEN:e9f1a2]). Because the same input always maps to the same hash, we can still trace requests in Grafana without storing the actual sensitive data.
I open-sourced the experiment here: https://github.com/aragossa/pii-shield
Has anyone else moved away from heavy Promtail pipeline_stages towards sanitizing at the edge? Are there any pitfalls with Loki indexing when doing deterministic hashing at the pod level?