S3 Performance Question

Hi,

I am running Loki in simple scalable deployment (SSD) mode with S3 as the storage backend. I am pushing a large volume of logs to Loki via k6 (300K log lines every 2 minutes). My EKS cluster and the S3 bucket are in the same region. If I search for a string over the previous day (24 hours), I start seeing issues with the backend afterwards. It seems there is a problem writing to S3 while the read operation is running. Is there anything that can be done about this? Thanks in advance.

Below is my config:

loki:
  auth_enabled: false
  commonConfig:
    path_prefix: /var/loki
    replication_factor: 3

  storage:
    bucketNames:
      chunks: xxxxx
    type: s3

  schemaConfig:
    configs:
      - from: "2024-02-12"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v12
        store: tsdb
  storage_config:
    aws:
      s3: s3://xxxxxxxxxxxxx
      insecure: false
      s3forcepathstyle: true
      http_config:
        insecure_skip_verify: true
    tsdb_shipper:
      active_index_directory: /var/loki/tsdb-index
      cache_location: /var/loki/tsdb-cache
      cache_ttl: 1h
      shared_store: s3
      resync_interval: 5m
  rulerConfig:
    storage:
      type: local
      local:
        directory: /var/loki/rules
  limits_config:
    query_timeout: 300s
    retention_period: 168h
    per_stream_rate_limit: 10MB
    per_stream_rate_limit_burst: 30MB
    ingestion_rate_mb: 40
    ingestion_burst_size_mb: 60
  query_scheduler:
    max_outstanding_requests_per_tenant: 32768
  ingester:
    chunk_encoding: snappy
  server:
    http_server_write_timeout: 310s
    http_server_read_timeout: 310s
serviceAccount:
  name: xxxxx
  annotations:
    eks.amazonaws.com/role-arn: "xxxxxxxxxxxxxxx"
write:
  resources:
    requests:
      cpu: 200m
      memory: 2Gi
    limits:
      memory: 4Gi
read:
  resources:
    requests:
      cpu: 4000m
      memory: 2Gi
    limits:
      memory: 4Gi
      cpu: 4000m
test:
  enabled: false
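As a back-of-envelope check of the load described above against `ingestion_rate_mb: 40` and `ingestion_burst_size_mb: 60`, the sketch below converts "300K lines every 2 minutes" into MiB/s. The post does not give an average line size, so 500 bytes is an assumption, as is the 10-second burst window (k6 may push the whole batch at once rather than spreading it over 2 minutes). To our understanding Loki converts the `*_mb` ingestion limits to bytes using 1 MB = 1,048,576 bytes.

```python
MIB = 1024 * 1024  # assumed unit for ingestion_rate_mb (1 MB = 1,048,576 bytes)


def ingest_mib_per_s(lines: int, seconds: float, avg_line_bytes: int) -> float:
    """Average ingest rate in MiB/s for `lines` log lines pushed over `seconds`."""
    return lines * avg_line_bytes / seconds / MIB


# From the post: 300K lines every 2 minutes, if spread evenly over the interval.
steady = ingest_mib_per_s(300_000, 120, avg_line_bytes=500)  # ASSUMED 500-byte lines

# ASSUMPTION: k6 pushes the whole batch within about 10 seconds.
burst = ingest_mib_per_s(300_000, 10, avg_line_bytes=500)

print(f"steady about {steady:.2f} MiB/s, burst about {burst:.2f} MiB/s "
      f"(limits: ingestion_rate_mb 40, ingestion_burst_size_mb 60)")
```

With these assumptions both figures stay under the configured tenant limits, which is consistent with the absence of rate-limit errors in the k6 output; with much larger lines or a shorter burst the picture could change.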

Your problem statement is rather vague. What do you mean by "a problem with writing to S3 during the read operation"?

Also, how many readers and writers do you have?

Thanks for your response. I have 3 read and 3 write pods. Writes come from a CronJob that triggers a k6 job to push the logs every 2 minutes. When nothing else is running, everything works well.

If I filter for text within a 24-hour window (which takes about 2 to 3 minutes to complete), I eventually get the result.

However, if I then go back and search the complete log for, say, the past 15 minutes, I see a gap: the logs that k6 should have pushed while my search was running are missing. I don't see any errors in the k6 logs or in the write pods. Thanks again in advance.
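One way to make the gap concrete is to run a metric query such as `sum(count_over_time({job="k6"}[2m]))` through Loki's `/loki/api/v1/query_range` endpoint with `step=60s` and look for missing steps. The label selector `{job="k6"}` is hypothetical; use whatever labels the k6 job attaches. The sketch assumes the standard `matrix` response shape, where each value is `[<unix seconds>, "<count>"]`, and flags any interval between consecutive samples longer than the step.

```python
def timestamps_from_matrix(response: dict) -> list[float]:
    """Extract sample timestamps (Unix seconds) from a Loki query_range JSON
    response whose resultType is "matrix"."""
    return [
        float(ts)
        for series in response["data"]["result"]
        for ts, _value in series["values"]
    ]


def find_gaps(timestamps: list[float], max_gap_s: float) -> list[tuple[float, float]]:
    """Return (start, end) pairs where consecutive timestamps are more than
    max_gap_s seconds apart. Input may be in any order."""
    ts = sorted(timestamps)
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a > max_gap_s]


# Usage with a hypothetical, hand-written response: samples every 60 s,
# but nothing between t=120 and t=600.
example = {"data": {"result": [{"metric": {}, "values": [
    [0, "5"], [60, "7"], [120, "6"], [600, "4"], [660, "5"]]}]}}
print(find_gaps(timestamps_from_matrix(example), max_gap_s=60))
```

Because the `[2m]` range window is at least as long as the push interval, every step should contain at least one push, so a missing step points at data that was never ingested rather than at the push schedule.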

How are you deploying your Loki cluster? Can you show the output of the /ring endpoint on one of the writers?

It sounds to me like you are not actually using simple scalable mode (read and write traffic aren't separated). I'd double-check to make sure.

Hi Tony,

Thanks for your help. I am deploying Loki in SSD mode using the corresponding Helm chart. I didn't think the ring was mandatory. Here is the output of curl 'http://localhost:3100/ring' against one of the write pods. Is the ring mandatory?

Ring Status

Current time: 2024-02-29 04:12:01.583764656 +0000 UTC m=+28351.302502624

Instance ID  | Availability Zone | State  | Address            | Registered At        | Last Heartbeat        | Tokens | Ownership
loki-write-0 |                   | ACTIVE | 10.155.57.118:9095 | 2024-02-28T20:19:34Z | 2.584s ago (04:11:59) | 128    | 31.2%
loki-write-1 |                   | ACTIVE | 10.155.58.44:9095  | 2024-02-28T20:18:26Z | 5.584s ago (04:11:56) | 128    | 34%
loki-write-2 |                   | ACTIVE | 10.155.56.25:9095  | 2024-02-28T20:17:25Z | 1.584s ago (04:12:00) | 128    | 34.8%

I use the Helm chart to deploy Loki in SSD mode with the following values:

loki:
  auth_enabled: false
  commonConfig:
    path_prefix: /var/loki
    replication_factor: 3

  storage:
    bucketNames:
      chunks: xxxxxxxxx
    type: s3

  schemaConfig:
    configs:
      - from: "2024-02-12"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v12
        store: tsdb
  storage_config:
    aws:
      s3: s3://xxxxxxxxx
      insecure: false
      s3forcepathstyle: true
      http_config:
        insecure_skip_verify: true
    boltdb_shipper:
      active_index_directory: /var/loki/boltdb-index
      cache_location: /var/loki/boltdb-cache
    tsdb_shipper:
      active_index_directory: /var/loki/tsdb-index
      cache_location: /var/loki/tsdb-cache
      cache_ttl: 1h
      shared_store: s3
      resync_interval: 5m
  rulerConfig:
    storage:
      type: local
      local:
        directory: /var/loki/rules
  limits_config:
    query_timeout: 300s
    retention_period: 168h
    per_stream_rate_limit: 10MB
    per_stream_rate_limit_burst: 30MB
    ingestion_rate_mb: 40
    ingestion_burst_size_mb: 60
  query_scheduler:
    max_outstanding_requests_per_tenant: 32768
  ingester:
    chunk_encoding: snappy
  server:
    http_server_write_timeout: 310s
    http_server_read_timeout: 310s
serviceAccount:
  name: xxxxxxxxx
  annotations:
    eks.amazonaws.com/role-arn: "xxxxxxxxx"
write:
  resources:
    requests:
      cpu: 200m
      memory: 2Gi
    limits:
      memory: 4Gi
read:
  resources:
    requests:
      cpu: 4000m
      memory: 2Gi
    limits:
      memory: 4Gi
      cpu: 4000m
test:
  enabled: false
monitoring:
  dashboards:
    enabled: true
  rules:
    enabled: false
  alerts:
    enabled: false
  serviceMonitor:
    enabled: true
  selfMonitoring:
    enabled: false
  lokiCanary:
    enabled: false


That looks pretty normal. In that case, I would double-check your frontend (if you are using the Helm chart, it should be the nginx gateway) and make sure it's working normally.
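The "pretty normal" reading of the ring output can be spelled out: every write instance is ACTIVE, each holds the same number of tokens (128), and the ownership percentages add up to roughly 100% (31.2 + 34 + 34.8). The sketch below encodes those checks over rows transcribed by hand from the posted table; the function name and the 0.5-point tolerance are illustrative choices, not anything Loki defines.

```python
# Rows transcribed from the posted /ring table: (instance, state, tokens, ownership %).
RING = [
    ("loki-write-0", "ACTIVE", 128, 31.2),
    ("loki-write-1", "ACTIVE", 128, 34.0),
    ("loki-write-2", "ACTIVE", 128, 34.8),
]


def ring_looks_healthy(rows, tolerance_pct: float = 0.5) -> bool:
    """Hypothetical sanity check: all instances ACTIVE, equal token counts,
    and ownership summing to about 100%."""
    all_active = all(state == "ACTIVE" for _, state, _, _ in rows)
    equal_tokens = len({tokens for _, _, tokens, _ in rows}) == 1
    total_ownership = sum(own for _, _, _, own in rows)
    return all_active and equal_tokens and abs(total_ownership - 100.0) <= tolerance_pct


print(ring_looks_healthy(RING))
```

A LEAVING or UNHEALTHY instance, or ownership well short of 100%, would be the kind of result that points back at the write path instead of the frontend.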

I don't think the issue you are having is directly related to the Loki read or write containers. If you are using simple scalable mode and the nginx frontend is configured correctly, traffic should be routed to the read and write targets accordingly, which means they should not interfere with each other.
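To illustrate what "routed accordingly" means, here is a simplified model of path-based routing in simple scalable mode: push requests go to the write target and query-type API calls go to the read target. This is not the Helm chart's actual nginx configuration; the path lists are assumptions drawn from Loki's HTTP API (`/loki/api/v1/push` and the legacy `/api/prom/push` for writes) and are deliberately incomplete.

```python
# ASSUMPTION: simplified path lists; the real gateway config has more locations.
WRITE_PATHS = {"/loki/api/v1/push", "/api/prom/push"}
READ_PREFIXES = ("/loki/api/", "/api/prom/")


def route(path: str) -> str:
    """Return which SSD target a request path should reach in this model."""
    if path in WRITE_PATHS:
        return "write"
    if path.startswith(READ_PREFIXES):
        return "read"
    return "unrouted"


for p in ("/loki/api/v1/push", "/loki/api/v1/query_range", "/metrics"):
    print(p, "->", route(p))
```

If the k6 push URL or Grafana's data source points directly at a single service that serves both paths, rather than at the gateway, the separation in this model no longer holds, which is the situation the earlier reply suspected.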

Loki also exposes a lot of metrics. I'd recommend looking at the S3-related ones to see whether there are any latency spikes or errors.
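A minimal sketch of that check, assuming the pods are scraped by Prometheus (the values above enable `serviceMonitor`) and that Loki exposes an S3 request histogram named `loki_s3_request_duration_seconds` with `operation` and `status_code` labels. Treat the metric and label names as assumptions and confirm them against a pod's `/metrics` output first. The helper builds PromQL for p99 latency and non-2xx request rate per operation and sends it to Prometheus's `/api/v1/query` endpoint; the Prometheus URL is up to you.

```python
import json
import urllib.parse
import urllib.request

# ASSUMPTION: metric name; verify it against a Loki pod's /metrics output.
S3_METRIC = "loki_s3_request_duration_seconds"


def s3_latency_query(quantile: float = 0.99, window: str = "5m") -> str:
    """PromQL for the S3 request latency quantile, per operation."""
    return (
        f"histogram_quantile({quantile}, "
        f"sum by (le, operation) (rate({S3_METRIC}_bucket[{window}])))"
    )


def s3_error_rate_query(window: str = "5m") -> str:
    """PromQL for the rate of S3 requests whose status_code is not 2xx."""
    return (
        "sum by (operation, status_code) "
        f'(rate({S3_METRIC}_count{{status_code!~"2.."}}[{window}]))'
    )


def instant_query(prometheus_url: str, promql: str) -> list:
    """Run an instant query against the Prometheus HTTP API and return the result list."""
    url = prometheus_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["data"]["result"]
```

Running both queries for the window in which a 24-hour search was executing, for example `instant_query("http://prometheus:9090", s3_latency_query())` with a hypothetical Prometheus address, should show whether S3 PUTs from the writers slowed down or failed while the readers were fetching chunks.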