Loki upgrade from 2.8.2 to 3.1.1 is erroring with Azure blob storage

Hello, I am currently migrating from Loki 2.8.2 (Helm chart loki-distributed 0.69.16) to Loki 3.1.1 (Helm chart 6.16.0), following the upgrade guide. It appears that the structure Loki uses to store logs in Azure object storage has changed, and I think this is causing problems with the migration. I have created three new folders in Azure blob storage called chunks, ruler and admin. My current Loki 2.8.2 instance just uses index as the folder name, which contains the index-named folders. Note: the parent folder for my blob storage is called loki.

I have noticed that there are errors in the index gateway and distributor pods which point to Azure blob storage problems, e.g.

  • level=debug ts=2024-10-16T13:32:56.177743926Z caller=reporter.go:208 msg="failed to read cluster seed file" err="Get \"https://REMOVED.blob.core.windows.net/loki/chunks/loki_cluster_seed.json?timeout=1\": context deadline exceeded"
  • level=error ts=2024-10-16T13:22:38.608194874Z caller=table.go:354 table-name=loki_index2_20012 org_id=self-monitoring traceID=14d2dd943b76a5c1 msg="index set has some problem, cleaning it up" err="Get \"https://REMOVED.blob.core.windows.net/loki/chunks?comp=list&delimiter=%2F&prefix=index%2Floki_index2_20012%2F&restype=container&timeout=1\": context deadline exceeded"

I’d like to note that I do not have a loki_cluster_seed.json file, as I did not know one was needed (I cannot find documentation about this). I have created an empty file as a test, but I still see these errors. I am using the correct account details for Azure, such as the account name and account key, in my Helm values file:

loki:

  migrate:
    fromDistributed:
      enabled: true
      memberlistService: loki-loki-distributed-memberlist

  schemaConfig:
    configs:
      - from: 2023-05-19
        object_store: azure
        store: tsdb
        schema: v12
        index:
          prefix: loki_index2_
          period: 24h
      - from: 2024-10-18
        object_store: azure
        store: tsdb
        schema: v13
        index:
          prefix: loki_index3_
          period: 24h
  ingester:
    chunk_encoding: snappy
  tracing:
    enabled: true
  querier:
    max_concurrent: 4

  storage:
    type: azure
    azure:
      # Name of the Azure Blob Storage account
      accountName: REMOVED
      # Key associated with the Azure Blob Storage account
      accountKey: REMOVED
      # Comprehensive connection string for Azure Blob Storage account (Can be used to replace endpoint, accountName, and accountKey)
      # connectionString: <your-connection-string>
      # Flag indicating whether to use Azure Managed Identity for authentication
      useManagedIdentity: false
      # Flag indicating whether to use a federated token for authentication
      useFederatedToken: false
      # Client ID of the user-assigned managed identity (if applicable)
      # userAssignedId: <your-user-assigned-id>
      # Timeout duration for requests made to the Azure Blob Storage account (in seconds)
      requestTimeout: 120
      # Domain suffix of the Azure Blob Storage service endpoint (e.g., core.windows.net)
      #endpointSuffix: <your-endpoint-suffix>
    bucketNames:
      chunks: "loki/chunks"
      ruler: "loki/ruler"
      admin: "loki/admin"

I have set the requestTimeout to 120 but I keep seeing the &timeout=1 parameter in the error logs.
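For reference, values like these are ultimately rendered into Loki's own storage configuration. Below is a sketch of what the equivalent azure_storage_config block should end up looking like. The field names come from Loki's documented Azure storage config, but how the chart maps bucketNames to container_name is an assumption; hitting the /config endpoint on a running component shows the actual rendered values, including which timeout is in effect:

```yaml
# Sketch (assumption): the Loki storage_config the chart values above
# should translate to. Verify the real values via the /config endpoint.
storage_config:
  azure:
    account_name: REMOVED
    account_key: REMOVED
    container_name: chunks      # how bucketNames maps here is an assumption
    request_timeout: 120s       # check /config to confirm this took effect
    use_managed_identity: false
    use_federated_token: false
```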

I want to see the old logs (indexed by Loki 2.8.2) in Grafana through the new Loki 3.1.1 instance, so I can confirm that once the old Loki instance is turned off we can still access old logs, which we retain for 60 days.

In general I don’t recommend making other changes while upgrading, unless there is a compatibility issue. I think your first priority should be getting back to a working state. Are you able to revert your cluster deployment to what it was before?

You can ignore the seed file; it’s part of the anonymous usage statistics collection that Loki does (which you can disable). So I’d check your other logs to see what other errors are present. I am not aware of any directory structure change in Loki, so I don’t think you needed to change this at all.
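If you want to silence the seed-file requests entirely, usage reporting can be turned off in the Loki config. The analytics block below is real Loki configuration; where exactly it goes in your Helm values depends on the chart:

```yaml
# Disable anonymous usage reporting, which is what reads/writes
# loki_cluster_seed.json in object storage.
analytics:
  reporting_enabled: false
```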

My recommendation would be to revert to a working state, then plan for another upgrade with the following steps:

  1. Upgrade Loki to version 3.1.1 with as little change as possible. Note that you’d want to explicitly disable structured metadata before you change schema to v13.
  2. Once it’s working, create a new index period to upgrade to v13.
  3. After v13 is done, change configuration to enable structured metadata if desired.
  4. Test this process in a lab or dev cluster first.
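Steps 1 and 2 above might look like this in the config. This is a sketch: the 2023-05-19 entry mirrors your existing schema period, and the v13 cut-over date is illustrative; it must be set to a date in the future when you apply it:

```yaml
limits_config:
  # Step 1: keep structured metadata off while v12 is still active
  # (equivalent to -validation.allow-structured-metadata=false).
  allow_structured_metadata: false

schema_config:
  configs:
    - from: 2023-05-19          # existing v12/tsdb period, unchanged
      store: tsdb
      object_store: azure
      schema: v12
      index:
        prefix: loki_index2_
        period: 24h
    - from: 2024-10-18          # Step 2: new period switching to v13
      store: tsdb
      object_store: azure
      schema: v13
      index:
        prefix: loki_index3_
        period: 24h
```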

Alternatively, since you do have a retention of 60 days, you can always deploy a new cluster, and let the old cluster sit for 60 days.


Hi Tony,

Thank you for reaching out. I have tried to keep the upgrade as simple as possible. I am following the migration guide from the loki-distributed helm chart to the loki helm chart. I still have the old instance running alongside the new instance but I can’t really afford to keep both running for 60 days across multiple clusters.

I have been passing in the flag -validation.allow-structured-metadata=false to get around the v12 / v13 schema issue. I noticed that quite early on and hoped it was the root cause of my problem.

There are no errors across the Loki 3 components apart from ones related to Azure storage. The two errors in my OP are the ones I see constantly. I am using the same credentials as my Loki 2.8.2 deployment, on the same cluster.

All I want is to be able to see the older logs in the new instance (Loki 3.1.1). I have enabled debug logging on all of the components and I am still not getting any closer to a fix. I have started testing different authentication methods with Azure, such as managed identity and federated token, but I still get the same error. Do you have any other ideas?

Again, I really appreciate your help.

Thank you,
Alex

Since you are upgrading, and not migrating, I don’t think you should follow the migration guide. The migration guide essentially guides you to deploy one Loki cluster from two Helm charts, but you are not doing that; you are trying to create two clusters with different versions. While I think it’s doable, I don’t think two Loki clusters should share the same storage.

I still think you should test the upgrade process from my comment above and either upgrade in place or operate two clusters side by side. I don’t think it would be too expensive to run two clusters: as soon as you migrate you can remove all write targets from your old cluster, and you can gradually scale down read targets as time goes by. Logs in object storage will naturally age out as well.

If you want to continue to troubleshoot, I’d say try a couple of things:

  1. Hit the /config endpoint from both clusters, and do a diff (share the diff too, if you can).
  2. What do you get if you try to manually perform an API call to the new cluster that’s not working? Try both read and write.
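Concretely, the two checks above could look something like this. It's a sketch under assumptions: the local URLs and ports (e.g. from kubectl port-forward) and the "smoke" job label are illustrative, while /config, /loki/api/v1/push, and /loki/api/v1/query_range are standard Loki endpoints. The || true just keeps one failing call from aborting the script:

```shell
# Assumed port-forwarded URLs; substitute your own service addresses.
OLD_URL="http://localhost:3101"   # old loki-distributed query-frontend
NEW_URL="http://localhost:3100"   # new Loki 3.1.1 gateway

# 1. Dump the effective config from each cluster, then diff:
curl -s "$OLD_URL/config" -o old-config.yaml || true
curl -s "$NEW_URL/config" -o new-config.yaml || true
diff -u old-config.yaml new-config.yaml || true

# 2. Manual write: push one test line to the new cluster
#    (timestamp must be in nanoseconds).
NOW="$(date +%s)000000000"
curl -s -X POST "$NEW_URL/loki/api/v1/push" \
  -H 'Content-Type: application/json' \
  -H 'X-Scope-OrgID: self-monitoring' \
  --data "{\"streams\":[{\"stream\":{\"job\":\"smoke\"},\"values\":[[\"$NOW\",\"test line\"]]}]}" || true

# 3. Manual read: query the line back.
curl -s -G "$NEW_URL/loki/api/v1/query_range" \
  -H 'X-Scope-OrgID: self-monitoring' \
  --data-urlencode 'query={job="smoke"}' || true
```

If the write succeeds but the read of old data still fails, that points at the index/chunk paths rather than credentials.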