Hi all, I have a cluster of 4 x Loki instances running in monolithic mode (-target=all).
Each of these compute nodes has the following specification:
- 2 x CPU cores (arm64, AWS EC2 c6g.large instances, 8 x cores total across all 4 nodes)
- 4 GB of RAM (16GB total across all 4 nodes)
Loki version is 2.6.1, and is deployed using the precompiled binaries.
We store the logs in AWS S3.
We are currently shipping some access logs to this Loki cluster, and we want to pull some metrics data from these logs.
As an example, we want to pull the top 10 IP addresses from these logs, but I am having trouble getting decent performance out of this cluster, so I am seeking some advice here.
We are investigating Loki as a replacement for our ELK stack.
The query we are running is a variation of the Top Visitor IPs from the Acquisition and Behaviour section in the Loki/Grafana example dashboard here: Grafana
As an example, I would like to pull the top IPs for the dates 2023-03-06 to 2023-03-09.
We have the following query:
topk(5, sum by (haproxy_ip_address) (count_over_time({component="haproxy",environment="live",platform="web"} |= "ip_address" | json | __error__="" and haproxy_ip_address!="" [$__range])))
If I use this query in Grafana, I can at most get the top IP addresses for the last couple of hours.
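One variation I have been meaning to test is pointing the json parser at just the one field I need, on the assumption that parsing the whole document for every line is part of the cost (the parameterised json parser is documented in LogQL; I have not measured whether this actually helps):
topk(5, sum by (haproxy_ip_address) (count_over_time({component="haproxy",environment="live",platform="web"} |= "ip_address" | json haproxy_ip_address="haproxy.ip_address" | __error__="" and haproxy_ip_address!="" [$__range])))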
I understand that this operation is expensive, as per this comment: Performance issue - #2 by ewelch
Is there any alternative approach to this type of query?
With this example, I would like to be able to get the top IP addresses for the last 14 days.
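One idea I have been considering, but have not yet tried, is pre-computing the per-IP counts with a recording rule via the Loki ruler, so the heavy aggregation runs continuously over small windows instead of over 14 days of chunks at query time. A rough sketch of what I think the rule file would look like (the metric name and the 1m window are placeholders of my own):
groups:
  - name: haproxy-top-ips
    interval: 1m
    rules:
      - record: haproxy:requests_by_ip:count1m
        expr: |
          sum by (haproxy_ip_address) (
            count_over_time({component="haproxy",environment="live",platform="web"}
              |= "ip_address" | json | __error__="" and haproxy_ip_address!="" [1m])
          )
The ruler would also need remote-write enabled in the Loki config, along these lines (the Prometheus URL is a placeholder):
ruler:
  storage:
    type: local
    local:
      directory: /loki-data/rules
  rule_path: /loki-data/rules-temp
  ring:
    kvstore:
      store: memberlist
  remote_write:
    enabled: true
    client:
      url: http://prometheus.example.internal:9090/api/v1/write
The 14-day topk would then run against Prometheus instead, e.g. topk(10, sum_over_time(haproxy:requests_by_ip:count1m[14d])). The obvious caveats are that it only covers data from the point the rule starts running, and that one series per client IP could get expensive if the IP cardinality is high, so I am not sure whether this is sane either.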
An example of what these log lines look like as they are ingested by Loki:
{
  "environment": "live",
  "file": "/var/log/haproxy.log",
  "geoip": {
    "city_name": "Sydney",
    "continent_code": "OC",
    "country_code": "AU",
    "country_name": "Australia",
    "latitude": "",
    "longitude": "",
    "metro_code": "",
    "postal_code": "2000",
    "region_code": "NSW",
    "region_name": "New South Wales",
    "timezone": "Australia/Sydney"
  },
  "haproxy": {
    "Tc": "0",
    "Tw": "0",
    "active_request": "0",
    "backend_concurrent_connections": "1",
    "backend_name": "WEB",
    "backend_queue": "0",
    "captured_request_cookie": "-",
    "captured_response_cookie": "-",
    "client_port": "62333",
    "frontend_concurrent_connections": "17",
    "frontend_name": "HTTP",
    "http_method": "GET",
    "http_timestamp": "14/Mar/2023:16:14:13.101",
    "http_version": "HTTP/2.0",
    "ip_address": "8.8.8.8",
    "process_concurrent_connections": "18",
    "referrer": "www.fakewebsite.com",
    "response_time": "0",
    "server_concurrent_connections": "1",
    "server_connection_retries": "0",
    "server_name": "WEB_2",
    "server_queue": "0",
    "size": "287",
    "status_code": "304",
    "termination_state": "--NI",
    "time_to_receive": "0",
    "unix_timestamp": "1678770853",
    "uri": "https://www.fakewebsite.com/siteassets/fake.min.css",
    "user_agent": "like Gecko) Chrome/111.0.0.0 Safari/537.36|2.0||"
  },
  "host": "lb-web-001",
  "message": " 8.8.8.8:62333 1678770853 [14/Mar/2023:16:14:13.101] HTTP WEB/WEB_2 0/0/0/0/0 304 287 - - --NI 18/17/1/1/0 0/0 {www.fakewebsite.com|like Gecko) Chrome/111.0.0.0 Safari/537.36|2.0||} \"GET https://www.fakewebsite.com/siteassets/fake.min.css HTTP/2.0\" 1984",
  "platform": "web",
  "source_type": "file",
  "time": "2023-03-14T05:14:13.101Z"
}
My other question, which ties into this one: is my Loki cluster sized appropriately?
Or is this simply a case of needing to throw a lot more compute at the cluster to get better performance out of this setup?
Here is some output from logcli hitting the load balancer that sits in front of our Loki cluster, querying for a string (that does not exist) over the same time period as my previous example:
time ./logcli-linux-arm64 query --timezone=UTC --from="2023-03-06T00:00:00Z" --to="2023-03-09T00:00:00Z" '{component="haproxy",environment="live",platform="web"}|= "this-string-does-not-match-anything"' --stats
Ingester.TotalReached 0
Ingester.TotalChunksMatched 0
Ingester.TotalBatches 0
Ingester.TotalLinesSent 0
Ingester.TotalChunksRef 0
Ingester.TotalChunksDownloaded 0
Ingester.ChunksDownloadTime 0s
Ingester.HeadChunkBytes 0 B
Ingester.HeadChunkLines 0
Ingester.DecompressedBytes 0 B
Ingester.DecompressedLines 0
Ingester.CompressedBytes 0 B
Ingester.TotalDuplicates 0
Querier.TotalChunksRef 1277
Querier.TotalChunksDownloaded 1277
Querier.ChunksDownloadTime 35.320173727s
Querier.HeadChunkBytes 0 B
Querier.HeadChunkLines 0
Querier.DecompressedBytes 12 GB
Querier.DecompressedLines 8103575
Querier.CompressedBytes 1.4 GB
Querier.TotalDuplicates 0
Cache.Chunk.Requests 0
Cache.Chunk.EntriesRequested 0
Cache.Chunk.EntriesFound 0
Cache.Chunk.EntriesStored 0
Cache.Chunk.BytesSent 0 B
Cache.Chunk.BytesReceived 0 B
Cache.Index.Requests 0
Cache.Index.EntriesRequested 0
Cache.Index.EntriesFound 0
Cache.Index.EntriesStored 0
Cache.Index.BytesSent 0 B
Cache.Index.BytesReceived 0 B
Cache.Result.Requests 0
Cache.Result.EntriesRequested 0
Cache.Result.EntriesFound 0
Cache.Result.EntriesStored 0
Cache.Result.BytesSent 0 B
Cache.Result.BytesReceived 0 B
Summary.BytesProcessedPerSecond 1.8 GB
Summary.LinesProcessedPerSecond 1160333
Summary.TotalBytesProcessed 12 GB
Summary.TotalLinesProcessed 8103575
Summary.ExecTime 6.983834629s
Summary.QueueTime 1.119285312s
real 0m7.019s
user 0m0.021s
sys 0m0.017s
And here is the config that we are currently running:
auth_enabled: false
server:
  http_listen_port: 3100
  http_server_read_timeout: 300s
  http_server_write_timeout: 300s
  http_server_idle_timeout: 290s
  grpc_server_max_concurrent_streams: 500
  grpc_server_max_recv_msg_size: 10000000
  grpc_server_max_send_msg_size: 10000000
memberlist:
  abort_if_cluster_join_fails: false
  bind_port: 7946
  join_members:
    - loki_servicediscovery_cluster_name.loki_servicediscovery_endpoint:7946
  max_join_backoff: 1m
  max_join_retries: 10
  min_join_backoff: 1s
frontend:
  max_outstanding_per_tenant: 2048
  compress_responses: false
frontend_worker:
  grpc_client_config:
    max_send_msg_size: 1.048576e+08
  parallelism: 10
querier:
  max_concurrent: 40
ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
    final_sleep: 30s
  chunk_idle_period: 1h
  chunk_retain_period: 0s
  chunk_encoding: snappy
  wal:
    dir: /loki-data/wal
distributor:
  ring:
    kvstore:
      store: memberlist
query_range:
  parallelise_shardable_queries: true
  cache_results: true
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        max_size_bytes: 2GB
        validity: 24h
compactor:
  working_directory: /loki-data/compactor
  shared_store: s3
  retention_enabled: true
  compaction_interval: 30m
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
### Instance and object limit configuration
limits_config:
  split_queries_by_interval: 12h
  enforce_metric_name: true
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_global_streams_per_user: 10000
  ingestion_rate_mb: 20
  max_query_series: 100000
  max_query_length: 0
  max_query_parallelism: 32
  ### Retention config
  # Global retention period applied if none of the below streams match
  # 31 days
  retention_period: 744h
  retention_stream:
    # 7 days retention for dev logs
    - selector: '{environment="dev"}'
      priority: 1
      period: 168h
    # 14 days retention for ote logs
    - selector: '{environment="ote"}'
      priority: 1
      period: 336h
    # 31 days retention for live logs
    - selector: '{environment="live"}'
      priority: 1
      period: 744h
schema_config:
  configs:
    - from: 2020-05-15
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /loki-data/index
    cache_location: /loki-data/index_cache
    shared_store: s3
    cache_ttl: 168h
    resync_interval: 5m
  max_chunk_batch_size: 300
  aws:
    s3: s3://aws_region/s3_bucket
    region: aws_region
chunk_store_config:
  chunk_cache_config:
    fifocache:
      max_size_bytes: 2GB
      validity: 24h
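For reference, the next tuning pass I had in mind for the query path looks something like this; these numbers are guesses on my part rather than tested values, and I would welcome corrections:
limits_config:
  # smaller splits -> more subqueries for the frontend to fan out in parallel
  split_queries_by_interval: 1h
  max_query_parallelism: 64
querier:
  # closer to the 2 cores each node actually has, to avoid oversubscription
  max_concurrent: 4
My reading is that with only 8 cores across the whole cluster, the current max_concurrent: 40 per querier may just be thrashing, but I am not confident about that.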
Many thanks, any advice appreciated.