Loki or Apache Spark

Hello @tonyswumac

I am working with a huge syslog file (72 GB), and I am debating whether to use Apache PySpark or Loki to process it.

What would be your recommendation?


I would probably just go with Loki. My opinions on Spark are a bit outdated though, since I haven’t been actively involved in the data warehouse space for a couple of years now.

To use PySpark you’d need a more traditional big data setup, with a compute component (Spark or MapReduce) and a storage component (most commonly HDFS, though it can be based on object storage). There are prebuilt solutions like EMR or Qubole, but they still take more effort (and are probably more expensive) than standing up a Loki cluster, which comes with both compute distribution and storage layers. Obviously a traditional big data setup can scale to crazy levels; if it were a 72 TB file I’d probably say go with Spark :stuck_out_tongue:

LogQL is admittedly a bit lacking in features, but you could pull data from Loki through its HTTP API and then use other analytics tools such as Pandas if you need more complex functionality.
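
As a rough illustration of the "pull from the API, analyze elsewhere" idea, here is a minimal sketch that queries Loki's query_range endpoint and flattens the result into rows that Pandas can load. The function names, the base URL, and the label selector in the usage note are assumptions made up for this example; only the endpoint path and the shape of a "streams" response come from the Loki HTTP API.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, logql, start_ns, end_ns, limit=1000):
    """Build a URL for Loki's query_range endpoint (times in Unix nanoseconds)."""
    params = urllib.parse.urlencode({
        "query": logql,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "forward",
    })
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


def flatten_streams(response):
    """Turn a 'streams' query response into a flat list of row dicts."""
    rows = []
    for stream in response["data"]["result"]:
        labels = stream["stream"]
        # Each value is a [timestamp_ns_as_string, log_line] pair.
        for ts_ns, line in stream["values"]:
            rows.append({**labels, "ts_ns": int(ts_ns), "line": line})
    return rows


def fetch_rows(base_url, logql, start_ns, end_ns, limit=1000):
    """Query Loki and return flattened rows (needs a reachable Loki instance)."""
    url = build_query_url(base_url, logql, start_ns, end_ns, limit)
    with urllib.request.urlopen(url) as resp:
        return flatten_streams(json.load(resp))
```

With a reachable instance, something like pandas.DataFrame(fetch_rows("http://localhost:3100", '{job="syslog"}', start_ns, end_ns)) would give a frame with one column per label plus ts_ns and line. The address and the job label here are hypothetical, and large time ranges would need to be fetched in several requests because each query is capped by the limit parameter.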

Thanks @tonyswumac

It will most probably reach that size in the near future where I work, which is my concern :scream:

So Loki can read directly from a 72 GB file on disk just fine? It would probably need that much RAM.

Loki doesn’t have an import function, so the easiest way to get the file into Loki would probably be Promtail (assuming the content is structured). You’ll probably want to put some sort of rate limit in the Promtail configuration so you don’t overwhelm the Loki cluster.

You probably don’t need a big cluster for just a 72 GB file. I’d try a single standalone Loki with about 8 GB of memory, or a simple scalable cluster with two nodes if needed.

Grafana Cloud gives you 50 GB of free storage, which would be a good place to experiment.

The content is syslog-ish, sprinkled with some custom funky data that might require parsing of some sort on the Loki side, or is it the Promtail side?
Right now I am successfully parsing it using Python with regex.
Have you ever used Databricks?

Then you should be able to parse the file with LogQL once you land it in Loki.
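
For reference, a minimal sketch of the kind of Python regex parsing mentioned above. The line layout is an assumed RFC 3164-style format, not the poster's actual data, and the field names are made up for illustration. The pattern uses named groups in the (?P<name>...) form, which is also the form accepted by the RE2 syntax that LogQL's regexp parser relies on, so a pattern like this may carry over with little change. Reading the file line by line also means the whole 72 GB never has to sit in memory at once.

```python
import re

# Assumed layout: "Oct 11 22:14:15 myhost sshd[1234]: message text"
# The [pid] part is optional. Adjust the pattern for the real data.
SYSLOG_RE = re.compile(
    r"^(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<app>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = SYSLOG_RE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def parse_file(path):
    """Yield parsed records one line at a time, so memory use stays flat."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            record = parse_line(line)
            if record is not None:
                yield record
```

Lines that do not match are silently skipped here; for the custom "funky" parts it may be more useful to count or log them instead.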

I’ve used Databricks briefly, back when I worked in the data warehouse space, but I don’t have much experience with it.

looking great @tonyswumac

thanks so much!


Can I configure Promtail to crawl a Windows folder with an unknown number of nested subfolders and scrape the log files in them into Loki? Some sort of auto discovery.

@tonyswumac looking for your help with this please. I can open a new thread if needed but wanted to keep it together.

I have limited experience with Windows, but on Linux you can use ** to match directories recursively. See Configuring Promtail for service discovery | Grafana Loki documentation.
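
To get a feel for what a recursive ** pattern picks up before pointing Promtail at a directory, a small Python sketch like the one below can list the matching files. This is only an approximation: Python's glob rules are not guaranteed to be identical to Promtail's matcher, and the function names and the "**/*.log" pattern are assumptions for this example.

```python
import tempfile
from pathlib import Path


def find_log_files(root, pattern="**/*.log"):
    """Return files under root that match a recursive glob pattern."""
    return sorted(p for p in Path(root).glob(pattern) if p.is_file())


def demo():
    """Build a throwaway directory tree and list what the pattern matches."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a" / "b").mkdir(parents=True)
        (root / "top.log").write_text("x")
        (root / "a" / "mid.log").write_text("x")
        (root / "a" / "b" / "deep.log").write_text("x")
        (root / "a" / "notes.txt").write_text("x")  # not a .log file
        return [p.relative_to(root).as_posix() for p in find_log_files(root)]
```

In this sketch ** matches the starting directory itself as well as every level below it, so the .log files at all three depths are returned and notes.txt is left out.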

Awesome, let me try that. Much better than copying things out to a common folder.