The tl;dr here is that I am trying to debug what data is in Loki to figure out why my Grafana dashboards are not accurate. I have tried a Promtail dry run (`promtail --dry-run -config.file=/etc/promtail/config.yml`) but I receive the error:

```
level=error ts=2023-12-25T17:34:08.393335527Z caller=main.go:170 msg="error creating promtail" error="error creating loki server: listen tcp :9080: bind: address already in use"
```
It seems that the already-running Promtail instance prevents me from testing this, but I don’t know how to test it otherwise. I am running Promtail in a Docker container, and whenever I spin it up, Promtail starts listening; I can’t find a way to stop it in order to do a dry run.
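From skimming the Promtail troubleshooting docs, it looks like there may be two ways around the port clash; I haven’t managed to verify either on my setup, so treat these as untested sketches:

```shell
# Untested: pipe a sample log line through the pipeline via stdin instead of
# tailing live files (--stdin is described in Promtail's troubleshooting docs).
cat sample.log | promtail --stdin --dry-run -config.file=/etc/promtail/config.yml

# Untested: override the HTTP port for the dry-run process so it doesn't
# collide with the running instance's 9080 (9081 is an arbitrary free port).
promtail --dry-run -config.file=/etc/promtail/config.yml -server.http-listen-port=9081
```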
Okay, now for the long story…
I am using Grafana/Loki/Promtail to provide useful feedback on logs for a public website. The website has several logs, and I am trying to use the trio so that I can see the count of errors by message. So, seeing “error A happened 5 times” and “error B happened 3 times”, etc. The website is not highly trafficked; probably hundreds of visitors per day and at most a few dozen log entries.
The site is hosted on a Plesk server that already has Grafana installed. I modified the default docker-compose.yaml to run only Loki and Promtail. Using the Grafana dashboard in Plesk, I’ve been able to add Loki as a data source and create some dashboards. It seems as though I have the queries working properly; using a bar gauge, I am listing the label for the error message along with the count of that label. The query looks like this:
```
sum by(msg) (count_over_time({job="homf"} [$__range]))
```
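For anyone less familiar with LogQL, the aggregation that query performs is just a count of entries per `msg` label over the dashboard’s time range; in plain Python over a hypothetical sample it would look like:

```python
from collections import Counter

# Hypothetical sample of (msg label, timestamp) pairs as Loki would see them.
entries = [
    ("error A", "2023-12-25 17:01:02"),
    ("error A", "2023-12-25 17:05:10"),
    ("error B", "2023-12-25 17:06:33"),
    ("error A", "2023-12-25 17:20:41"),
]

# sum by(msg) (count_over_time({job="homf"} [$__range])) groups the entries
# that fall inside the range by their msg label and counts them.
counts = Counter(msg for msg, _ in entries)
print(counts)  # Counter({'error A': 3, 'error B': 1})
```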
The problem that I am having is that my dashboards are very inconsistent. Say I come to check on them and see something like “error A 5 times, error B 4 times, error C 1 time”, then come back a few hours later and I see nothing.
At first, I was using a Promtail configuration applying a custom timestamp (because I wanted to get real results the first time it ingested the logs). After doing a bit of reading (like this) it seemed like using custom timestamps was a bad idea. I need real results immediately, not in a few hours. So, I created a second set of docker containers using a Promtail configuration that would not parse the timestamp from the log entry, to test side-by-side.
Promtail config
```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: homf
    pipeline_stages:
      - match: # Parse the parts of log files to get labels
          selector: '{job="homf"}'
          stages:
            - regex:
                expression: '(?P<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d) \[(?P<ip>(?:[0-9]{1,3}\.){3}[0-9]{1,3})\]\[(?P<userId>.+?)\]\[.+?\]\[(?P<level>.+?)\]\[(?P<category>.+?)\] (?P<msg>[^\$].+)'
            - labels:
                time:
                ip:
                userId:
                level:
                category:
                msg:
            # These lines are not commented out on the instance using custom timestamps:
            # - timestamp:
            #     source: time
            #     format: "2006-01-02 15:04:05"
      - match: # Drop anything that doesn't have a message
          selector: '{job="homf", msg!~"(.+)"}'
          action: drop
    static_configs:
      - targets:
          - localhost
        labels:
          job: homf
          __path__: /var/www/vhosts/example.org/httpdocs/*/runtime/logs/*.log
```
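One thing that let me sanity-check the pipeline offline: the named groups in the regex stage happen to be valid Python `re` syntax too, so the expression can be tested against a sample line without touching Promtail at all (the line below is made up, in the shape my logs use):

```python
import re

# Same expression as in the pipeline's regex stage above.
PATTERN = re.compile(
    r'(?P<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d) '
    r'\[(?P<ip>(?:[0-9]{1,3}\.){3}[0-9]{1,3})\]'
    r'\[(?P<userId>.+?)\]\[.+?\]\[(?P<level>.+?)\]\[(?P<category>.+?)\] '
    r'(?P<msg>[^\$].+)'
)

# Made-up sample line; substitute a real entry from the logs to test.
line = "2023-12-25 17:34:08 [203.0.113.7][42][abc123][error][application] error A happened"

m = PATTERN.match(line)
print(m.groupdict() if m else "no match -> entry gets no msg label and is dropped")
```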
Not using a custom timestamp seemed to help at first, but eventually I saw the same inconsistency, where the results in a dashboard would change over time. Not to mention, the two dashboards (one with the custom timestamp and one without) would show different results.
I was also noticing another error appearing frequently in the dashboards:

```
status 500 message: too many outstanding requests
```
After more reading (like Datasource proxy returning "too many outstanding requests" · Issue #4613 · grafana/loki · GitHub) I decided to update my Loki configuration to try to get rid of those errors. I updated my Loki config as suggested.
Loki config
```yaml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

# Added by Brian
# https://community.grafana.com/t/too-many-outstanding-requests-on-loki-2-7-1/78249/11
# https://github.com/grafana/loki/issues/4613
limits_config:
  split_queries_by_interval: 0

query_scheduler:
  max_outstanding_requests_per_tenant: 2048

# By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
# analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
#
# Statistics help us better understand how Loki is used, and they show us performance
# levels for most users. This helps us prioritize features and documentation.
# For more information on what's sent, look at
# https://github.com/grafana/loki/blob/main/pkg/usagestats/stats.go
# Refer to the buildReport method to see what goes into a report.
#
# If you would like to disable reporting, uncomment the following lines:
#analytics:
#  reporting_enabled: false
```
After doing that, the dashboards backed by the Loki/Promtail pair using the current time showed my results as expected. It was as if all entries from all logs came in instantly; my dashboards were working (although obviously the timing was all wrong). The pair using the timestamp from the log entries themselves showed nothing at all. That was not too surprising, since my previous experience assigning the custom time wasn’t good either; but for some reason I could see all of the labels from the logs properly in Explore, just no data coming through.
I’ve tested the connection to both Loki data sources in Grafana, and it says they are successful. So, I wanted to do a dry-run promtail test to see if the logs were coming in properly, and that’s where I once again hit the error message that opened this topic. Seems I’ve hit a dead end and need some assistance.
In addition, between the time I started writing this topic and now, the Grafana dashboards are once again showing no entries. I’m not sure how, ten minutes ago, the one using the current time showed everything from all the logs properly, and now it shows no data.
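In case it’s useful context for anyone answering: I’ve also been trying to look at what Loki itself returns, bypassing Grafana, via its HTTP API on the host port I mapped (3101 in my compose file). I haven’t gotten conclusive results from this yet, so consider it an untested sketch:

```shell
# List every value Loki currently has for the msg label.
curl -s "http://localhost:3101/loki/api/v1/label/msg/values"

# Run the dashboard's query directly; query_range defaults to the last hour
# when no start/end are given, which is enough for a quick comparison.
curl -s -G "http://localhost:3101/loki/api/v1/query_range" \
  --data-urlencode 'query=sum by(msg) (count_over_time({job="homf"} [1h]))'
```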
In case it helps, here’s the docker-compose.yaml I am using. The two instances (automatic time and custom log time) are basically the same; this is the one for the automatic timestamp:
docker-compose.yaml
```yaml
version: "3"

networks:
  loki-auto-time:

services:
  loki:
    image: grafana/loki:2.9.0
    ports:
      - "3101:3100"
    volumes:
      - /var/lib/docker/volumes/loki-auto-time/etc/loki/local-config.yaml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - loki-auto-time

  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - /var/log:/var/log
      - /var/www/vhosts:/var/www/vhosts
      - /var/lib/docker/volumes/promtail-auto-time/etc/promtail/config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml
    networks:
      - loki-auto-time
```
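Given that compose file, I’ve also wondered whether a throwaway container could run the dry run: a one-off container has its own network namespace, so its port 9080 shouldn’t collide with the running Promtail’s. Something like this (untested; assumes the image’s entrypoint is the promtail binary, so only the arguments are overridden):

```shell
# One-off container reusing the promtail service definition from the compose
# file; --rm removes it when the dry run exits.
docker compose run --rm promtail --dry-run -config.file=/etc/promtail/config.yml
```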
The only other issue I am seeing, although it seems unrelated, is this error message in my browser’s dev-tools console:

```
WebSocket connection to 'wss://inspiring-feynman.174-138-46-244.plesk.page:8443/modules/grafana/service/api/live/ws' failed:
```
ANY insights would be greatly appreciated. I’ve wrestled with this for over a week now and every time I feel like I get somewhere, something inconsistent that I can’t explain happens.