Need help with best practices for debugging Promtail (Loki and Grafana too)

The tl;dr here is that I am trying to see what data is in Loki, to figure out why my Grafana dashboards are not accurate. I have tried a Promtail dry run (promtail --dry-run -config.file=/etc/promtail/config.yml) but I receive the error:

level=error ts=2023-12-25T17:34:08.393335527Z caller=main.go:170 msg="error creating promtail" error="error creating loki server: listen tcp :9080: bind: address already in use"

It seems that the already-running Promtail is what prevents me from testing this, but I don't know how to test it otherwise. I am running Promtail in a Docker container, and whenever I spin it up, Promtail starts listening on that port; I can't find a way to stop it in order to do a dry run.
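One possible way around the bind error, sketched here under two assumptions (that the compose service is named promtail, as in the compose file later in this thread, and that Promtail's standard server flags override the config file when passed explicitly), is to run a one-off container with the embedded web server moved off :9080:

```shell
# Untested sketch: one-off dry-run container alongside the running Promtail.
# -server.http-listen-port moves the embedded server off :9080 to avoid the
# collision; --dry-run prints what would be pushed to Loki instead of pushing.
docker compose run --rm promtail \
  --dry-run \
  -config.file=/etc/promtail/config.yml \
  -server.http-listen-port=9081
```

If that flag is ignored by your version, a copy of the config with http_listen_port set to a different value does the same job. Adding --inspect alongside --dry-run also prints how each pipeline stage transforms every entry, which is useful for debugging the regex and label stages.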

Okay, now for the long story…

I am using Grafana/Loki/Promtail to provide useful feedback on logs for a public website. The website has several logs, and I am trying to use the trio so that I can see the count of errors by message. So, seeing “error A happened 5 times” and “error B happened 3 times”, etc. The website is not highly trafficked; probably hundreds of visitors per day and at most a few dozen log entries.

The site is hosted on a Plesk server with Grafana already installed. I modified the default docker-compose.yaml to run only Loki and Promtail. Using the Grafana dashboard in Plesk, I've been able to add Loki as a data source and create some dashboards. The queries seem to be working properly; using a bar gauge, I list the label for each error message along with the count of that label. The query looks like this:

sum by(msg) (count_over_time({job="homf"} [$__range]))

The problem that I am having is that my dashboards are very inconsistent. Say I come to check on them and see something like “error A 5 times, error B 4 times, error C 1 time”, then come back a few hours later and I see nothing.

At first, I was using a Promtail configuration that applied a custom timestamp parsed from each log line (because I wanted real history the first time it ingested the logs, rather than everything stamped with the ingestion time). After doing a bit of reading (like this) it seemed like using custom timestamps was a bad idea, but I need real results immediately, not in a few hours. So, to test side by side, I created a second set of Docker containers using a Promtail configuration that does not parse the timestamp from the log entry.

Promtail config
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
- job_name: homf
  pipeline_stages:
  - match: # Parse the parts of log files to get labels
      selector: '{job="homf"}'
      stages:
      - regex:
          expression: '(?P<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d) \[(?P<ip>(?:[0-9]{1,3}\.){3}[0-9]{1,3})\]\[(?P<userId>.+?)\]\[.+?\]\[(?P<level>.+?)\]\[(?P<category>.+?)\] (?P<msg>[^\$].+)'
      - labels:
          time:
          ip:
          userId:
          level:
          category:
          msg:
#      - timestamp: # these three lines are uncommented on the instance using custom timestamps
#          source: time
#          format: "2006-01-02 15:04:05"
  - match: # Drop anything that doesn't have a message
      selector: '{job="homf", msg!~"(.+)"}'
      action: drop
  static_configs:
  - targets:
      - localhost
    labels:
      job: homf
      __path__: /var/www/vhosts/example.org/httpdocs/*/runtime/logs/*.log
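As an aside, the pipeline regex itself can be sanity-checked outside Promtail with GNU grep; the sample line and its field values below are invented to match the layout the regex expects:

```shell
# The exact regex from the pipeline stage above, checked as PCRE (GNU grep -P).
regex='(?P<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d) \[(?P<ip>(?:[0-9]{1,3}\.){3}[0-9]{1,3})\]\[(?P<userId>.+?)\]\[.+?\]\[(?P<level>.+?)\]\[(?P<category>.+?)\] (?P<msg>[^\$].+)'

# A made-up line in the assumed layout: time [ip][userId][?][level][category] msg
line='2023-12-25 17:34:08 [203.0.113.9][42][-][error][application] error A happened'

echo "$line" | grep -Pq "$regex" && echo "regex matches"   # -> regex matches

# Pull out just the timestamp, which must fit the Go reference layout
# "2006-01-02 15:04:05" used by the (commented-out) timestamp stage.
echo "$line" | grep -Po '^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}'   # -> 2023-12-25 17:34:08
```

If the first command prints nothing for a real log line, the regex never sets msg, and the match stage with msg!~"(.+)" drops the entry before it ever reaches Loki.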

Not using a custom timestamp seemed to help at first, but eventually I saw the same thing: the results in a dashboard would be inconsistent over time. Not to mention, the two dashboards (one with the custom timestamp and one without) would show different results from each other.

I also noticed another error happening frequently in the dashboards:

status 500 message: too many outstanding requests

After more reading (like Datasource proxy returning "too many outstanding requests" · Issue #4613 · grafana/loki · GitHub) I updated my Loki configuration as suggested there, to try to get rid of those errors.

Loki config
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

# Added by Brian
# https://community.grafana.com/t/too-many-outstanding-requests-on-loki-2-7-1/78249/11
# https://github.com/grafana/loki/issues/4613
limits_config:
  split_queries_by_interval: 0

query_scheduler:
  max_outstanding_requests_per_tenant: 2048

# By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
# analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
#
# Statistics help us better understand how Loki is used, and they show us performance
# levels for most users. This helps us prioritize features and documentation.
# For more information on what's sent, look at
# https://github.com/grafana/loki/blob/main/pkg/usagestats/stats.go
# Refer to the buildReport method to see what goes into a report.
#
# If you would like to disable reporting, uncomment the following lines:
#analytics:
#  reporting_enabled: false

After doing that, I saw the results I expected in the dashboards backed by the Loki/Promtail pair that uses the current time. It was as if all entries from all logs came in instantly; those dashboards were working (although obviously the timing was all wrong). The Loki/Promtail pair using the timestamps from the log entries themselves was showing nothing at all. That was not too surprising, since my previous experience assigning a custom time wasn't good either. But, strangely, I could see all of the labels from the logs properly in Explore; there was just no data coming through to the panels.

I’ve tested the connection to both Loki data sources in Grafana, and it says they are successful. So, I wanted to do a dry-run promtail test to see if the logs were coming in properly, and that’s where I once again hit the error message that opened this topic. Seems I’ve hit a dead end and need some assistance.

In addition, between the time I started writing this topic and now, the Grafana dashboards are once again showing no entries. I'm not sure how, ten minutes ago, the one using the current time showed everything from all the logs properly, and now it shows no data.

In case it helps, here's the docker-compose.yaml I am using. The files are basically the same for both instances (automatic time and custom log time); this is the one for the automatic timestamp:

docker-compose.yaml
version: "3"

networks:
  loki-auto-time:

services:
  loki:
    image: grafana/loki:2.9.0
    ports:
      - "3101:3100"
    volumes:
      - /var/lib/docker/volumes/loki-auto-time/etc/loki/local-config.yaml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - loki-auto-time

  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - /var/log:/var/log
      - /var/www/vhosts:/var/www/vhosts
      - /var/lib/docker/volumes/promtail-auto-time/etc/promtail/config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml
    networks:
      - loki-auto-time

The only other issue I am seeing, although it seems unrelated, is this error message in my browser dev tools console:

WebSocket connection to 'wss://inspiring-feynman.174-138-46-244.plesk.page:8443/modules/grafana/service/api/live/ws' failed:

ANY insights would be greatly appreciated. I’ve wrestled with this for over a week now and every time I feel like I get somewhere, something inconsistent that I can’t explain happens.

I would say you want to solve too many issues in one post. Best practice is to focus on one issue at a time.

I'll pick the last issue, the websocket one: make sure that the middle layer (in your case Plesk) has support for secure websockets enabled and configured. For example, users with Nginx have to define it as:

  # Proxy Grafana Live WebSocket connections.
  location /api/live/ {
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;
    proxy_set_header Host $http_host;
    proxy_pass http://grafana;
  }

So check your Plesk documentation for how to configure that. It doesn't look like a Grafana issue, so nobody with only Grafana knowledge will be able to help you.

Thanks for your reply @jangaraj. I will look into fixing this websockets issue for the time being. Once I figure out more there, I will respond with the results…

I assumed that my long post would be difficult to digest. To clarify, the main focus I need help with is why the panels in Grafana are not consistent. Everything else is explaining what I’ve done to get to the point where I’m at now.

When I initialize Loki and Promtail in the Docker containers, without using a custom timestamp, I see all entries. After several minutes, I refresh the panels and there are no longer any log entries. It would seem this is an issue with Promtail, Loki, or Grafana. Update: most likely not Promtail, because if the entries show up properly labeled when I first create the Docker containers, Promtail must be doing its job.

I am working with under 1000 log entries, so my first idea was to just query Loki for 'everything' it has. I figured the promtail --dry-run command would help me see what's going into Loki, but it does not work and I'm not sure how to get it working. The only other way I know to find out what's in Loki is querying it with curl from the command line, which has produced inconsistent results as well.
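For querying Loki for 'everything' directly, bypassing both Promtail and Grafana, something like this should work against its HTTP API. It assumes the 3101:3100 host port mapping from the compose file above, a wide 30-day window, and jq for formatting:

```shell
# Untested sketch: hit Loki's HTTP API directly from the host.
LOKI=http://localhost:3101

# Instant query: count everything Loki currently has for the job, per msg label.
curl -G -s "$LOKI/loki/api/v1/query" \
  --data-urlencode 'query=sum by(msg) (count_over_time({job="homf"}[30d]))' | jq .

# Range query: pull the raw streams to compare against the source log files.
# start/end are Unix nanoseconds (GNU date).
curl -G -s "$LOKI/loki/api/v1/query_range" \
  --data-urlencode 'query={job="homf"}' \
  --data-urlencode "start=$(date -d '30 days ago' +%s)000000000" \
  --data-urlencode "end=$(date +%s)000000000" \
  --data-urlencode 'limit=1000' | jq .
```

If these return the expected entries while the Grafana panels show nothing, the problem is between Grafana and Loki (data source, time range, query splitting) rather than in ingestion.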

Okay, @jangaraj - taking it one issue at a time, starting with the websockets one. For starters, through Plesk, Grafana can be reached at URLs like the following:

https://my-plesk-domain:8443/modules/grafana/service/d/e14dd4bf-9e7d-48a2-8c71-4be51fe7110c/dashboard-slug?orgId=1

I found the Plesk nginx configuration for Grafana. There are multiple nginx config files in Plesk; I think the one in use for Grafana is /etc/sw-cp-server/conf.d/grafana_plesk.inc. The configuration in that file is a little different from the 'default' for Grafana, so I had to make some tweaks. The file currently looks like this:

location /modules/grafana/service/ {
    proxy_set_header Host $http_host;
    proxy_pass http://127.0.0.1:3030/;

    set $r_scheme $scheme;
    if ($http_x_forwarded_proto = 'https') {
        set $r_scheme https;
    }
    rewrite ^/modules/grafana/service/login$ $r_scheme://$http_host/modules/grafana/service/login/generic_oauth redirect;
}

location ~ ^/modules/grafana/(d|p)/ {
    try_files $uri /modules/grafana/public/index.php?$query_string;
}
location /modules/grafana/relogin {
    try_files $uri /modules/grafana/public/index.php;
}

# BELOW HERE ARE MY ONLY CHANGES TO THE FILE
# Proxy Grafana Live WebSocket connections.
location /modules/grafana/service/api/live {
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;
    proxy_set_header Host $http_host;
    proxy_pass http://127.0.0.1:3030/;
}

Note that the location differs to match Plesk's URL path, and the proxy_pass differs from the docs' example so that it matches the proxy_pass in the first block.

After doing that, nginx reported an error (nginx: [emerg] unknown "connection_upgrade" variable) and the server would not load. The default Grafana nginx example does include a map block defining that variable, but I could not put it in this file, since an http block is not allowed there. So I found the file responsible for the main configuration containing the http block (/etc/sw-cp-server/config) and updated it to include the required map (just a few lines, toward the end):

worker_processes  1;

events {
    worker_connections  1024;
}

http {
    include       mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"'
                      "'$fastcgi_script_name' '$fastcgi_path_info' '$document_root'";


    sendfile        on;
    reset_timedout_connection on;

    #keepalive_timeout  0;
    keepalive_timeout  65;
    #tcp_nodelay        on;

    #gzip  on;
    #gzip_disable "MSIE [1-6]\.(?!.*SV1)";

    server_tokens off;

    fastcgi_max_temp_file_size 0;
    fastcgi_buffers 16 16k;
    fastcgi_buffer_size 32k;
    fastcgi_pass_header X-Accel-Buffering;

    fastcgi_read_timeout 600;
    fastcgi_send_timeout 600;

    client_max_body_size 2048m;

    error_page 497 https://$hostname:$server_port$request_uri;

    # Added to try to get websockets to work w/grafana extension
    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      close;
    }

    include /etc/sw-cp-server/conf.d/*.conf;
}

After doing that, the error went away, and I could load plesk again as desired. However, the error about the websocket connection failing has not gone away. In my browser dev tools console, I still see:

transport_websocket.js:32 WebSocket connection to 'wss://my-plesk-domain:8443/modules/grafana/service/api/live/ws' failed:

I have tried to look into possible reasons this may not be working in plesk forums and grafana ones and no luck so far.

Just to add some info: I created a new Docker container with just Grafana in it. I created a subdomain (logs.mydomain.com) in Plesk and proxied it to the Grafana Docker instance (instead of the default one in Plesk). I added the Loki data source and imported the same dashboards I was using in Plesk's Grafana extension. I implemented all of the same .conf updates, with the necessary changes for it being an actual domain in Plesk rather than the installed Grafana extension. I am still seeing an error message in the console: transport_websocket.js:32 WebSocket connection to 'ws://logs.mydomain.com/api/live/ws' failed:
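To narrow down where the upgrade dies, the WebSocket handshake can be replayed with curl. This is a sketch using my placeholder subdomain and a static, made-up Sec-WebSocket-Key; a 101 status means the Upgrade headers survived the proxy chain:

```shell
# Untested sketch: probe the WebSocket upgrade through the proxy.
# 101 = Upgrade/Connection headers made it through; 400/426/502 usually
# means a proxy hop stripped them or refused HTTP/1.1.
curl -sk -o /dev/null -w '%{http_code}\n' --max-time 5 \
  -H 'Connection: Upgrade' \
  -H 'Upgrade: websocket' \
  -H 'Sec-WebSocket-Version: 13' \
  -H 'Sec-WebSocket-Key: c2FtcGxlLW5vbmNlLTEyMzQ=' \
  'https://logs.mydomain.com/api/live/ws'
```

Running the same probe from the server against Grafana directly (http://127.0.0.1:3030/api/live/ws, matching the proxy_pass above) would distinguish a Grafana Live problem from a proxy problem.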

I guess the next step might be to try a completely "basic" install using the default docker-compose.yaml ( https://raw.githubusercontent.com/grafana/loki/v2.9.1/production/docker-compose.yaml ) and see if that produces a different result.