We are evaluating different approaches to run Loki in a multi-environment setup where there is no direct connectivity between Dev/Stage/Live. We have several clusters per environment, plus an S3 bucket for each environment.
Our idea for the simple scalable mode approach:
run the Loki write target in the main cluster of Dev and Stage,
run both Loki read and write targets in the Live environment (main), which will also host Grafana,
have the Loki read target in Live join the memberlist with the write endpoints from each environment,
secure the Dev/Stage Loki write endpoints with authentication and mTLS,
then connect Grafana to the main Loki read target in Live.
Logs should be stored in a separate S3 bucket per environment.
Querying Dev logs: Grafana → Loki read → Loki write (Dev) → fetch logs from the Dev S3 bucket + the Dev ingester.
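To make the per-bucket idea concrete, each environment's Loki config would point at its own bucket. A minimal sketch for the Dev write target, assuming recent Loki with TSDB storage (bucket name and region are placeholders, not from the thread):

```yaml
# Dev environment only; Stage and Live would point at their own buckets.
common:
  storage:
    s3:
      bucketnames: dev-loki-chunks   # hypothetical Dev-only bucket
      region: eu-west-1              # placeholder region
schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h
```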
Can you please give us some advice/guidance on whether this will work or not?
If you don’t have direct connectivity between your environments then you probably have to operate one Loki cluster in each environment, and you will need both read and write targets.
Then you’ll need a Grafana instance somewhere, which will need connectivity to your Loki clusters. This will be a problem as well if you don’t have connectivity between your environments.
We want a single datasource in Grafana for all environments; that's why we want the read target in Live connected to the write targets in the other environments, plus the write target from Live.
Is this feasible or not? If it's not, could you please explain why?
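For what it's worth, the single-datasource part is just standard Grafana provisioning pointing at whichever read endpoint ends up serving all queries; roughly like this (the service URL is an assumption about an in-cluster read service):

```yaml
# Grafana datasource provisioning; URL assumes a Live in-cluster read service
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki-read.loki.svc.cluster.local:3100
```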
Since there's no connectivity between environments, we want to make the Loki write endpoints public and secure them with mTLS.
In my opinion you would be much better off operating one Loki cluster in your main environment along with Grafana.
Let's first consider your first potential solution: having write targets in the Dev and Stage environments. Your biggest problem is security. mTLS is good, but on its own it won't fully secure your endpoint, so you would also have to restrict the public endpoint with ingress IP whitelisting. The second problem is cost: with most cloud providers, each elastic IP incurs an additional charge.
If you are already considering public connectivity, it would be much easier to operate your entire Loki cluster in your main environment. Keep everything internal, but expose the write endpoint API via an external load balancer, and whitelist the egress IPs of your Dev and Stage environments.
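With ingress-nginx, exposing only the push API with source-IP whitelisting could look like this sketch (hostname, IPs, and service name are placeholders):

```yaml
# Expose only /loki/api/v1/push, restricted to known egress IPs (placeholders)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: loki-write
  annotations:
    nginx.ingress.kubernetes.io/whitelist-source-range: "198.51.100.10/32,203.0.113.20/32"
spec:
  ingressClassName: nginx
  rules:
    - host: loki-write.example.com
      http:
        paths:
          - path: /loki/api/v1/push
            pathType: Prefix
            backend:
              service:
                name: loki-write
                port:
                  number: 3100
```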
Having complete isolation between environments is one of our PCI DSS requirements.
We could have a single central Loki exposed to the internet, secured with basic authentication + mTLS, and add its endpoint to the Promtail configs in each cluster.
But for PCI DSS we want to isolate each environment's data in its own S3 bucket, and also make the setup more scalable.
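On the Promtail side, the client section supports both basic auth and mTLS out of the box; a sketch for a Dev cluster (URL, tenant, and file paths are placeholders):

```yaml
# Promtail client config: push to the central Loki over HTTPS
clients:
  - url: https://loki-write.example.com/loki/api/v1/push  # placeholder hostname
    tenant_id: dev                       # optional: one tenant per environment
    basic_auth:
      username: promtail
      password_file: /etc/promtail/secrets/password
    tls_config:
      ca_file: /etc/promtail/tls/ca.crt
      cert_file: /etc/promtail/tls/client.crt
      key_file: /etc/promtail/tls/client.key
```

Using a distinct `tenant_id` per environment would keep logs logically separated even within one cluster, though it would not by itself satisfy a one-bucket-per-environment requirement.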
We can test the following setup:
→ Operate a Loki cluster (deploying all simple scalable mode components) in the main environment.
→ Keep main Loki read ↔ Grafana communication internal.
→ Expose the write component of each environment to the internet via a load balancer.
→ Add the write endpoints to the memberlist.join_members parameter of the Loki ConfigMap so they join the cluster.
→ Whitelist the IPs of the main Loki read in the Dev/Stage write Lokis.
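For the memberlist step above, the config fragment would look something like this (hostnames are placeholders). Note that memberlist gossip defaults to port 7946 and requires bidirectional connectivity between all members, which is hard to reconcile with strict environment isolation:

```yaml
# Live cluster memberlist, joining remote write LBs (placeholder hostnames)
memberlist:
  bind_port: 7946
  join_members:
    - loki-memberlist.loki.svc.cluster.local:7946  # in-cluster members
    - loki-write.dev.example.com:7946              # Dev write LB
    - loki-write.stage.example.com:7946            # Stage write LB
```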
I don't understand your comment here; you are already considering exposing the write endpoint between environments.
If you operate the Loki cluster in one environment and expose the API, you have one exposed endpoint. If you operate Loki writers in all environments and expose the writer gRPC port, you have (number of writers × number of environments) exposed endpoints, not to mention your Loki cluster now spans multiple environments instead of being confined to one.
If you want, share a rough diagram and we can discuss further.
I'm struggling to make the Loki writers in the main k8s clusters in Dev/Stage join the Loki cluster in the main k8s cluster in the Live environment.
Below is the diagram for the POC that I'm working on:
If you really want to do this, then you should configure each writer as a separate cluster. In your Loki cluster in the Live environment you would simply live with the fact that logs from Dev and Stage are delayed: you won't see them until they have been flushed to the S3 buckets. I would strongly recommend against this. It's theoretically possible, but I've not done it, and I don't think it's a good idea.
All you are exposing is the Loki API through an application load balancer. You can even add whitelisting on the ALB, and that is already less attack surface than your diagram.