Hello!
I’m not very proficient in English, so please excuse the use of a translation tool.
I’m currently using Datadog, but because of the high cost of log ingestion I’m planning to switch to Loki. Since I haven’t worked with Kubernetes before, I built the system on ECS with Docker, using Loki’s simple scalable deployment (SSD) mode.
In Datadog, about 1 TB of logs is ingested per day, and roughly 250 GB of that is now being sent daily to the Grafana Loki system. The write and backend components are performing well, but I’m quite troubled by the poor performance of the read servers.
Write Server Specifications
- EC2: c7g.xlarge (20 instances)
- Per ECS task: 2 vCPU, 3.75 GB memory (40 tasks)
Log Traffic
- 250 GB per day (with plans to increase to 1 TB in the future)
What I Want to Achieve
- Search for fields like “user” within 10 seconds over a 250 GB (24-hour) dataset (see the example queries below)
- Search for other fields (ones I can’t predict in advance) over the same 250 GB (24-hour) range without exceeding the query time limits
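To make this concrete, the kinds of queries I have in mind look roughly like the following (the `service` label and the JSON log format are just placeholders here; my actual stream labels are different):

```logql
# Search the last 24 hours for log lines belonging to one user
{service="api"} | json | user="12345"

# For fields I can't predict in advance, I end up with broad line filters like this
{service="api"} |= "some-value"
```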
In Datadog, whether I search for “user” over a 24- or 48-hour range or for information that isn’t usually queried, results come back within 10 seconds.
On my Loki setup, however, I have to narrow the time range to about 30 minutes to search for “user”; if I try a longer range, the read servers sometimes go down and return a 502 error.
Are my write infrastructure specifications insufficient, or would my requirements be a better fit for an ELK stack? I started with Grafana and Loki to reduce costs and to try Tempo, but I’m worried the expenses may end up much higher than expected. I’m running about 80% of the write servers on spot EC2 instances, and I’m curious what specifications others are running.
Just in case, I’m also sharing my Loki configuration. Thank you for reading this long message.
```yaml
auth_enabled: true

server:
  http_listen_port: 3100
  log_level: warn
  grpc_listen_port: 9095
  grpc_server_max_recv_msg_size: 67108864
  grpc_server_max_send_msg_size: 67108864

memberlist:
  join_members: ****
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s
  bind_addr: ['0.0.0.0']
  bind_port: 7946
  gossip_interval: 2s
  rejoin_interval: 1m

common:
  path_prefix: /loki
  replication_factor: 2
  compactor_address: ****
  ring:
    kvstore:
      store: memberlist

ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
  chunk_idle_period: 15m
  chunk_retain_period: 1m
  autoforget_unhealthy: true
  wal:
    flush_on_shutdown: true

schema_config:
  configs:
    - from: 2023-01-01
      store: tsdb
      object_store: aws
      schema: v13
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
    cache_ttl: 24h
  aws:
    bucketnames: ****
    region: ****

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  volume_enabled: true
  allow_structured_metadata: true
  max_line_size_truncate: true
  split_queries_by_interval: 30m
  tsdb_max_query_parallelism: 512

querier:
  max_concurrent: 4

compactor:
  working_directory: /loki/compactor

frontend:
  # (frontend configuration omitted; it was repeated in the original message)
```