S3 Performance Question

moonorb · February 28, 2024, 8:57pm

Hi,

I am running Loki in simple scalable deployment (SSD) mode with S3 as the backend. I am pushing a large volume of logs to Loki via K6 (300K logs every 2 minutes). My EKS cluster and S3 bucket are in the same region. If I search for a string across the previous day (24 hours), I start seeing issues with the backend afterwards. It seems there is a problem with writing to S3 during the read operation. Is there anything that can be done about this? Thanks in advance.

Below is my config:

loki:
  auth_enabled: false
  commonConfig:
    path_prefix: /var/loki
    replication_factor: 3

  storage:
    bucketNames:
      chunks: xxxxx
    type: s3

  schemaConfig:
    configs:
      - from: "2024-02-12"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v12
        store: tsdb
  storage_config:
    aws:
      s3: s3://xxxxxxxxxxxxx
      insecure: false
      s3forcepathstyle: true
      http_config:
        insecure_skip_verify: true
    tsdb_shipper:
      active_index_directory: /var/loki/tsdb-index
      cache_location: /var/loki/tsdb-cache
      cache_ttl: 1h
      shared_store: s3
      resync_interval: 5m
  rulerConfig:
    storage:
      type: local
      local:
        directory: /var/loki/rules
  limits_config:
    query_timeout: 300s
    retention_period: 168h
    per_stream_rate_limit: 10MB
    per_stream_rate_limit_burst: 30MB
    ingestion_rate_mb: 40
    ingestion_burst_size_mb: 60
  query_scheduler:
    max_outstanding_requests_per_tenant: 32768
  ingester:
    chunk_encoding: snappy
  server:
    http_server_write_timeout: 310s
    http_server_read_timeout: 310s
serviceAccount:
  name: xxxxx
  annotations:
    eks.amazonaws.com/role-arn: "xxxxxxxxxxxxxxx"
write:
  resources:
    requests:
      cpu: 200m
      memory: 2Gi
    limits:
      memory: 4Gi
read:
  resources:
    requests:
      cpu: 4000m
      memory: 2Gi
    limits:
      memory: 4Gi
      cpu: 4000m
test:
  enabled: false
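
For a rough sense of scale, the K6 load described in this post can be converted to a sustained rate and compared with `ingestion_rate_mb`. This is only a back-of-the-envelope sketch: the average line size (`AVG_LINE_BYTES`) is an assumption, not a figure from the post.

```python
# Rough sustained ingest rate for the K6 load described in the post.
LOGS_PER_BATCH = 300_000   # from the post: 300K logs per push cycle
BATCH_INTERVAL_S = 120     # from the post: one cycle every 2 minutes
AVG_LINE_BYTES = 200       # ASSUMPTION: average log line size, not stated in the post
INGESTION_RATE_MB = 40     # from limits_config.ingestion_rate_mb


def sustained_rate(logs, interval_s, line_bytes):
    """Return (lines per second, MB per second) averaged over one interval."""
    lines_per_s = logs / interval_s
    mb_per_s = lines_per_s * line_bytes / 1_000_000
    return lines_per_s, mb_per_s


lines_per_s, mb_per_s = sustained_rate(LOGS_PER_BATCH, BATCH_INTERVAL_S, AVG_LINE_BYTES)
print(f"{lines_per_s:.0f} lines/s, {mb_per_s:.2f} MB/s averaged over the interval")
print(f"share of ingestion_rate_mb: {mb_per_s / INGESTION_RATE_MB:.2%}")
```

Note that this is an average. If K6 sends the whole batch within a few seconds, the instantaneous rate is much higher than the figure printed here.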

tonyswumac · February 28, 2024, 11:34pm

Your problem statement is rather vague. What do you mean by “a problem with writing to S3 during read operation”?

Also how many readers and writers do you have?

moonorb · February 29, 2024, 12:21am

Thanks for your response. I have 3 read and 3 write pods. The write operation is a cronjob that triggers a K6 job to push the logs every 2 minutes. When nothing else is running, it all works well.

If I filter for text within a 24-hour window (which takes about 2-3 minutes to complete), I eventually get the result.

However, if I then go back and view the complete log for, say, the past 15 minutes, I see a gap: the logs that K6 should have pushed while the search was running are missing. I don't see any errors in the K6 logs or the write pods. Thanks again in advance.
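
One way to pin down such a gap is to count log lines per time step and flag the steps that have no sample at all. The sketch below assumes a metric query such as `count_over_time({job="k6"}[2m])` run through Loki's `/loki/api/v1/query_range` endpoint with a 120-second step; the `job="k6"` selector and the sample values are hypothetical, and only the gap-finding step is shown.

```python
def find_missing_steps(values, start, end, step):
    """Return step timestamps in [start, end] that have no sample.

    `values` is a list of [unix_ts, "count"] pairs, the shape of one
    series' "values" array in a Loki matrix result.
    """
    seen = {int(float(ts)) for ts, _ in values}
    return [t for t in range(start, end + 1, step) if t not in seen]


# Hypothetical result for a 2-minute step: samples exist at offsets 0, 120
# and 480 s, while the steps at 240 and 360 s returned nothing.
sample = [[1709164800, "300000"], [1709164920, "300000"], [1709165280, "300000"]]
missing = find_missing_steps(sample, start=1709164800, end=1709165280, step=120)
print(missing)  # [1709165040, 1709165160]
```

Comparing the missing steps with the start and end time of the long-running search would show whether the gap really lines up with the query.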

tonyswumac · February 29, 2024, 2:50am

How are you deploying your Loki cluster? Can you show the result of the /ring endpoint on one of the writers?

It sounds to me like you are not actually using simple scalable mode (read and write traffic aren’t separated). I’d double-check to make sure.

moonorb · February 29, 2024, 4:18am

Hi Tony,

Thanks for your help. I am deploying Loki SSD using the related Helm chart. I didn't think the ring was mandatory. Here is the result of the curl command (curl 'http://localhost:3100/ring') against one of the write pods. Is the ring mandatory?

Ring Status

Current time: 2024-02-29 04:12:01.583764656 +0000 UTC m=+28351.302502624

Instance ID	Availability Zone	State	Address	Registered At	Last Heartbeat	Tokens	Ownership	Actions
loki-write-0	-	ACTIVE	10.155.57.118:9095	2024-02-28T20:19:34Z	2.584s ago (04:11:59)	128	31.2%	Forget
loki-write-1	-	ACTIVE	10.155.58.44:9095	2024-02-28T20:18:26Z	5.584s ago (04:11:56)	128	34%	Forget
loki-write-2	-	ACTIVE	10.155.56.25:9095	2024-02-28T20:17:25Z	1.584s ago (04:12:00)	128	34.8%	Forget

I use helm charts to deploy SSD Loki with the following values.

loki:
  auth_enabled: false
  commonConfig:
    path_prefix: /var/loki
    replication_factor: 3

  storage:
    bucketNames:
      chunks: xxxxxxxxx
    type: s3

  schemaConfig:
    configs:
      - from: "2024-02-12"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v12
        store: tsdb
  storage_config:
    aws:
      s3: s3://xxxxxxxxx
      insecure: false
      s3forcepathstyle: true
      http_config:
        insecure_skip_verify: true
    boltdb_shipper:
      active_index_directory: /var/loki/boltdb-index
      cache_location: /var/loki/boltdb-cache
    tsdb_shipper:
      active_index_directory: /var/loki/tsdb-index
      cache_location: /var/loki/tsdb-cache
      cache_ttl: 1h
      shared_store: s3
      resync_interval: 5m
  rulerConfig:
    storage:
      type: local
      local:
        directory: /var/loki/rules
  limits_config:
    query_timeout: 300s
    retention_period: 168h
    per_stream_rate_limit: 10MB
    per_stream_rate_limit_burst: 30MB
    ingestion_rate_mb: 40
    ingestion_burst_size_mb: 60
  query_scheduler:
    max_outstanding_requests_per_tenant: 32768
  ingester:
    chunk_encoding: snappy
  server:
    http_server_write_timeout: 310s
    http_server_read_timeout: 310s
serviceAccount:
  name: xxxxxxxxx
  annotations:
    eks.amazonaws.com/role-arn: "xxxxxxxxx"
write:
  resources:
    requests:
      cpu: 200m
      memory: 2Gi
    limits:
      memory: 4Gi
read:
  resources:
    requests:
      cpu: 4000m
      memory: 2Gi
    limits:
      memory: 4Gi
      cpu: 4000m
test:
  enabled: false
monitoring:
  dashboards:
    enabled: true
  rules:
    enabled: false
  alerts:
    enabled: false
  serviceMonitor:
    enabled: true
  selfMonitoring:
    enabled: false
  lokiCanary:
    enabled: false


tonyswumac · February 29, 2024, 4:05pm

That looks pretty normal. Next I would double-check your frontend (if you are using the Helm chart it should be nginx) and make sure it’s working normally.

I do not think the issue you are having is related to the Loki reader or writer containers directly. If you are using simple scalable mode and the frontend nginx is configured correctly, traffic should be routed to the readers and writers accordingly, meaning they should not interfere with each other.
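
The routing idea can be sketched as a small classifier. This is an illustration of the intent only, not the chart's actual nginx configuration: it assumes push endpoints go to the write target and every other Loki API path goes to the read target.

```python
# ASSUMPTION: simplified routing rules for illustration; check the gateway's
# generated nginx.conf for the real location blocks.
WRITE_PATHS = {"/loki/api/v1/push", "/api/prom/push"}


def target_for(path):
    """Return which target a request path is expected to reach."""
    if path in WRITE_PATHS:
        return "write"
    if path.startswith(("/loki/api/", "/api/prom/")):
        return "read"
    return "unknown"


for p in ("/loki/api/v1/push", "/loki/api/v1/query_range", "/ring"):
    print(p, "->", target_for(p))
```

If queries and pushes really do land on separate pods like this, a slow 24-hour search on the readers should not by itself block K6 pushes on the writers.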

Loki also exposes a lot of metrics. I’d recommend looking at the S3-related ones to see if there are any latency spikes or errors.
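
As a starting point, the S3 request counters can be pulled out of a pod's `/metrics` output. The sketch below parses Prometheus text format and sums request counts per status code. The metric name `loki_s3_request_duration_seconds_count` and its `status_code` label are assumptions to verify against your own `/metrics` output, and the sample text is made up.

```python
import re
from collections import Counter

# ASSUMPTION: metric and label names; confirm them in your own /metrics output.
METRIC = "loki_s3_request_duration_seconds_count"
LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def s3_requests_by_status(metrics_text):
    """Sum S3 request counts per status_code from Prometheus text format."""
    totals = Counter()
    for line in metrics_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m or m.group(1) != METRIC:
            continue  # skips comments, blank lines and other metrics
        labels = dict(LABEL_RE.findall(m.group(2)))
        totals[labels.get("status_code", "unknown")] += float(m.group(3))
    return dict(totals)


# Made-up sample in Prometheus text format.
sample = '''
# TYPE loki_s3_request_duration_seconds histogram
loki_s3_request_duration_seconds_count{operation="S3.GetObject",status_code="200"} 1200
loki_s3_request_duration_seconds_count{operation="S3.PutObject",status_code="200"} 300
loki_s3_request_duration_seconds_count{operation="S3.PutObject",status_code="500"} 4
'''
print(s3_requests_by_status(sample))  # {'200': 1500.0, '500': 4.0}
```

In practice the text would come from something like `curl http://localhost:3100/metrics` on a read or write pod. A rising count of non-2xx status codes while the long query is running would point towards S3.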
