Why would Grafana / Loki give inconsistent results?

I have set up a basic install of Grafana/Promtail/Loki as a convenient way to see the number of times an error shows up in logs. My issue is that it gives me different results from one minute to the next. I took these screenshots to help explain.

This first screenshot shows me looking at the results around 2 pm (time at top).

I left my browser open to the same panels until about 6 pm and took another screenshot, just for documentation (time at top):

I then changed the timeframe on the panel to 24 hours, to make sure I get all of the same results, plus anything else that happened in the past day and anything that happened since I first checked. However, the results are almost completely different:

As you can see, the two 404 errors for login.php are no longer there, nor is the one for simple.php. This doesn’t make sense: based on when I first viewed the page, those errors must have happened earlier this morning, so they should still fall within the 24-hour window.

I’ve been having these issues consistently across many different setups. Sometimes it shows different results, sometimes it shows no results at all. In my most recent attempt, I did a very basic install using Loki’s suggested docker-compose.yaml with a couple of changes for my needs.

docker-compose.yaml
version: "3"

networks:
  loki:

services:
  loki:
    container_name: loki-logs
    image: grafana/loki:2.9.0
    restart: unless-stopped
    ports:
      - "3100:3100"
    volumes:
      - /var/lib/docker/volumes/loki-auto-time/etc/loki/local-config.yaml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - loki

  promtail:
    container_name: promtail-logs
    image: grafana/promtail:2.9.0
    restart: unless-stopped
    volumes:
      - /var/log:/var/log
      - /var/www/vhosts:/var/www/vhosts
      - /var/lib/docker/volumes/promtail-auto-time/etc/promtail/config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml
    networks:
      - loki

  grafana:
    container_name: grafana-logs
    environment:
      - GF_PATHS_PROVISIONING=/etc/grafana/provisioning
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    entrypoint:
      - sh
      - -euc
      - |
        mkdir -p /etc/grafana/provisioning/datasources
        cat <<EOF > /etc/grafana/provisioning/datasources/ds.yaml
        apiVersion: 1
        datasources:
        - name: Loki
          type: loki
          access: proxy
          orgId: 1
          url: http://loki:3100
          basicAuth: false
          isDefault: true
          version: 1
          editable: false
        EOF
        /run.sh
    image: grafana/grafana:latest
    restart: unless-stopped
    ports:
      - "3000:3000"
    networks:
      - loki

Loki config
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

# Added by Brian
# https://community.grafana.com/t/too-many-outstanding-requests-on-loki-2-7-1/78249/11
# https://github.com/grafana/loki/issues/4613
limits_config:
  split_queries_by_interval: 0

query_scheduler:
  max_outstanding_requests_per_tenant: 2048

# By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
# analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
#
# Statistics help us better understand how Loki is used, and they show us performance
# levels for most users. This helps us prioritize features and documentation.
# For more information on what's sent, look at
# https://github.com/grafana/loki/blob/main/pkg/usagestats/stats.go
# Refer to the buildReport method to see what goes into a report.
#
# If you would like to disable reporting, uncomment the following lines:
#analytics:
#  reporting_enabled: false

Promtail config
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
- job_name: homf
  pipeline_stages:
  - match: # Parse the parts of Yii's log files to get labels
      selector: '{job="homf"}'
      stages:
      - regex:
          expression: '(?P<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d) \[(?P<ip>(?:[0-9]{1,3}\.){3}[0-9]{1,3})\]\[(?P<userId>.+?)\]\[.+?\]\[(?P<level>.+?)\]\[(?P<category>.+?)\] (?P<msg>[^\$].+)'
      - labels:
          time:
          ip:
          userId:
          level:
          category:
          msg:
#      - timestamp:
#          source: time
#          format: "2006-01-02 15:04:05"
  - match: # Drop anything that doesn't have a message
      selector: '{job="homf", msg!~"(.+)"}'
      action: drop
  static_configs:
  - targets:
      - localhost
    labels:
      job: homf
      __path__: /var/www/vhosts/example.org/httpdocs/*/runtime/logs/*.log

Why would this be happening? I’ve tried to find a way to query Loki for “everything” so I can check whether these entries are really disappearing or whether it’s something else, but I haven’t found a good way to do it.

I created another topic that has a lot more details of other things I’ve tried. It’s a bit verbose but there may be some useful information there ( Need help with best practices for debugging promtail - loki and grafana too ).

  1. Try updating Loki version to 2.9.2.

  2. What happens if you query for that error log specifically? Do you see a difference between shorter time frame and longer time frame in terms of number of log lines returned?

When I do a query in Grafana selecting the ‘filename’ label and the specific log, the same inconsistent behavior occurs.

Is there some way for me to query Loki for everything it has stored? I’ve tried using the API endpoints via the command line but they don’t seem to work as desired.
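For anyone else wondering, here is a minimal sketch of pulling raw entries out of Loki over its HTTP API, assuming Loki is reachable on localhost:3100 as in the compose file. Note that the query_range endpoint returns at most `limit` lines per call, so getting “everything” may take several calls with shifted start/end windows:

```shell
# Loki's query_range endpoint takes nanosecond epoch timestamps.
START=$(date -d '7 days ago' +%s)000000000
END=$(date +%s)000000000

# Pull up to 5000 raw lines for the job over the last 7 days.
curl -G -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={job="homf"}' \
  --data-urlencode "start=${START}" \
  --data-urlencode "end=${END}" \
  --data-urlencode 'limit=5000' \
  || echo "query failed -- is Loki up on port 3100?"
```

Grafana’s `logcli` tool wraps the same endpoints and is usually more convenient from a terminal; the `/loki/api/v1/labels` and `/loki/api/v1/series` endpoints are also handy for checking which streams exist at all.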

  1. Try updating Loki version to 2.9.2.

I went ahead and did this. Since they are running in Docker containers, and I’m not sure how Loki stores its data, upgrading essentially reset everything, and now all of my log entries show as having come in just now. So I’ll have to wait some time again to test the behavior.

On that subject, this is a frequent issue for me as well. Troubleshooting is slow because every time I restart these containers, all of the log entries come in with the current timestamp, and I have to wait a day or so for entries to come through with accurate timestamps. I found this topic ( Docker volume for loki container with --target=write ) that had some questions about where Loki’s data is stored for persistent volumes. Would creating volumes for the directories referenced in these three config options ( ingester.wal.dir; storage_config.boltdb_shipper.active_index_directory; ruler.wal.dir ) be enough to persist Loki’s data and help me test?

Your latest reply mentioned three problems; I’ll try to tackle them separately:

  1. If you restart the container and the logs disappear, that means you are not storing your logs correctly. Looking at your docker-compose, I think you want to create a directory on the host and mount it into the container at /loki so that data is preserved between container restarts.

  2. It sounds like the moment you restart promtail, all logs are forwarded to Loki again. This is undesirable, especially once you fix #1, because you don’t want duplicated logs. Promtail keeps a positions file (as you can see in your own configuration, at /tmp/positions.yaml). You want to preserve that file between restarts, so I’d recommend creating it on your host and mounting it into the promtail container as read-write.

  3. Your promtail isn’t parsing timestamps correctly. If you manage to fix #1 and #2, this is likely not going to be too big of an issue, because new logs should be forwarded to Loki as soon as they come in, perhaps with a delay of a few milliseconds. But the best fix is to parse the timestamp in promtail, so that logs sent to Loki use the timestamp from the log line itself rather than the time at which promtail collects and sends them.
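For reference, the promtail config earlier in the thread already has this stage commented out; a sketch of enabling it might look like the following. The `location` value here is an assumption, to be adjusted to whatever timezone the application logs in:

```yaml
pipeline_stages:
  - regex:
      # Same expression as the existing config; only the "time" group matters here.
      expression: '(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ...'
  - timestamp:
      source: time                    # named capture group from the regex stage
      format: "2006-01-02 15:04:05"   # Go reference-time layout matching the logs
      location: "Etc/UTC"             # assumption: set to the app's log timezone
```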

Thank you, this is information I’ve been wondering about.

I tried to create a directory to mount into the Loki container, but it was having permission issues: I was creating the directory as root, while the /loki folder in the container is owned by the loki user. I had a similar permissions issue with positions.yaml, but I think I found how to resolve that here ( Promtail fail to write positions.yaml file mounted · Issue #8543 · grafana/loki · GitHub ). So, I ended up creating a named volume mapped to the /loki folder, and a local directory mapped to /tmp in promtail’s container. Not sure if this is all that needs to be done, but I copied my new docker-compose.yaml below in case it helps anyone in the future.

I have tried this as well, but was having the same issues described here ( Timestamp isn't working for NGINX logs - #2 by armesh49 ). After loading the data with the custom timestamp, I couldn’t get proper query results for hours. That forced me to stop using the custom timestamp so I could speed up troubleshooting the other issues.

I am fine with restarting everything from scratch with empty logs and a clean install of this setup once I can figure out how to get consistent behavior out of these dashboards. At this point I’m pretty sure it’s Loki giving the inconsistent results: if the data is showing up in my Grafana dashboards at all, then Loki must be getting it from Promtail. My next step was to query Loki for all of its entries to see whether the issue was with Grafana or Loki. Perhaps mounting these volumes will affect things. I’ll wait a bit and see, but I’m still open to ideas on how else to troubleshoot.

docker-compose.yaml
version: "3"

networks:
  loki:

volumes:
  loki-main:

services:
  loki:
    container_name: loki
    image: grafana/loki:2.9.3
    restart: unless-stopped
    ports:
      - "3100:3100"
    volumes:
      - /var/lib/docker/volumes/loki/etc/loki/local-config.yaml:/etc/loki/local-config.yaml
      - loki-main:/loki
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - loki

  promtail:
    container_name: promtail
    image: grafana/promtail:2.9.3
    restart: unless-stopped
    volumes:
      - /var/www/vhosts:/var/www/vhosts
      - /var/lib/docker/volumes/promtail/etc/promtail/config.yml:/etc/promtail/config.yml
      - ${PWD}/promtail-tmp:/tmp
    command: -config.file=/etc/promtail/config.yml
    networks:
      - loki

  grafana:
    container_name: grafana
    environment:
      - GF_PATHS_PROVISIONING=/etc/grafana/provisioning
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    entrypoint:
      - sh
      - -euc
      - |
        mkdir -p /etc/grafana/provisioning/datasources
        cat <<EOF > /etc/grafana/provisioning/datasources/ds.yaml
        apiVersion: 1
        datasources:
        - name: Loki
          type: loki
          access: proxy
          orgId: 1
          url: http://loki:3100
          basicAuth: false
          isDefault: true
          version: 1
          editable: false
        EOF
        /run.sh
    image: grafana/grafana:latest
    restart: unless-stopped
    ports:
      - "3000:3000"
    networks:
      - loki

I’d suggest focusing on the biggest issue for now, which is results becoming inconsistent as time goes on. I’d recommend keeping it simple: send one test log, verify that it shows up in Loki, and verify that it gets written to the filesystem after the idle time. Let it sit for however long it usually takes to run into the issue, then verify again that it’s still on the filesystem and still shows up in Loki.
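That loop can be sketched roughly like this. The site directory, the “42” user id, and the marker string are placeholders, and the commented commands assume the container names and port mapping from the docker-compose file above:

```shell
# Step 0: craft a test line that matches the promtail pipeline's regex.
LINE="$(date '+%Y-%m-%d %H:%M:%S') [127.0.0.1][42][-][error][application] loki-test-marker"
echo "$LINE"

# Sanity-check the shape with a grep-flavoured version of the regex prefix.
echo "$LINE" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} \[([0-9]{1,3}\.){3}[0-9]{1,3}\]' \
  && echo "line matches"

# Step 1 (on the host): append it to a file promtail tails, e.g.
#   echo "$LINE" >> /var/www/vhosts/example.org/httpdocs/<site>/runtime/logs/app.log
# Step 2: confirm Loki ingested it:
#   curl -G -s http://localhost:3100/loki/api/v1/query_range \
#     --data-urlencode 'query={job="homf"} |= "loki-test-marker"'
# Step 3: after the idle/flush period, confirm a chunk landed on disk:
#   docker exec loki ls -l /loki/chunks
```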

Sounds good. What I’ve done is set up the Grafana/Loki/Promtail trio on another server running a site using the exact same content management system. The only significant difference, which might matter here, is that one is on Ubuntu 18 and the other is on Ubuntu 20.

Other than that, all configuration files are the same, except that one is sending all of the log files in a folder and the other is only sending one log file. I’ve got each open in different browser tabs. I’ll do the same tests and report back what I find.

Some additional notes that may be of interest, some for personal reference:

  • If I open a new browser session and visit the dashboard after many hours away, it always seems to give correct results (based on vetting the data against the individual log files). The inconsistencies seem to mainly happen when I’m refreshing multiple times per day.
  • On the first instance, I did validate today that the entries showing in Grafana match what’s in the logs. If it keeps working as expected, it may have been upgrading Loki that fixed the issue.
  • On the second instance, I have two panels in the dashboard, one that is querying for results based on the ‘job’ label and the other that is querying for results based on the ‘filename’ label. Since there is only one log being imported, these should always show identical results. If there are inconsistencies between those two, that may be a clue.
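For the record, the two panels’ queries have roughly this shape (the filename value is a placeholder for the one file being imported; with a single log feeding the job, both counts should match):

```logql
count_over_time({job="homf"} |= "404" [$__range])
count_over_time({filename="<path-to-the-single-log-file>"} |= "404" [$__range])
```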

I will leave these dashboards open and do the same refreshing on each as before, to see whether the results are consistent and whether the two setups behave differently. I’ll report back with more details later.

Okay - I wanted to wait a while before reporting back. It honestly looks like everything is working as I’d expect now, on both the new setup and the original one. Based on that, I’d assume it was the version of Loki I was using (2.9.0). Which is kind of odd, considering I didn’t see anyone else having these issues, so it was possibly also something about my environment.

I will come back if I notice anything wrong, but at this point everything looks consistent. Leaving the tabs open for 8 hours and then switching from the 6-hour to the 24-hour timeframe no longer makes results “disappear” like before. I still have no clue what was going on, but after testing periodically over the last week I no longer feel like I’m getting inaccurate reports.

Thanks for your responses @tonyswumac! Even if it was just the version update that solved it, it might have been a while before I realized a new version was available, let alone tried it.

No problem. The reason I recommended upgrading was that there was actually a bug in 2.9.0 that resulted in a quick follow-up release, 2.9.1. If you read the release notes on the GitHub repo you’ll find the bug.
