Looking at your logs - it looks like you are running out of disk space:
t=2019-04-25T06:22:13+0000 lvl=eror msg="Alert Rule Result Error" logger=alerting.evalContext ruleId=48 name="Disk Space Alert () " error="Could not find datasource database is locked" changing state to=alerting
If you have enough disk space, then it looks like it may be related to your file system. (See here for an example of file system problems with SQLite.)
Yes, the systems which I am monitoring are running out of disk space, but not the Grafana host itself. It is still getting these "Could not find datasource database is locked" error messages. These messages are in some cases also visible in the Alert List visualization's recent state changes list. They are not visible in the current status list at all.
What file system are you running on? Sqlite is a file-based database so it does not work on all types of file systems (see my previous reply for an example).
If you have a lot of traffic and are doing a lot of writes to the database, then maybe you have reached the limits of SQLite and it is time to switch to MySQL or Postgres. But this is unlikely unless you have a very large number of alerts or users.
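For reference, the switch is configured in the [database] section of grafana.ini (or conf/custom.ini); the connection values below are placeholders you would replace with your own, assuming a Postgres server is already set up:

[database]
type = postgres
host = 127.0.0.1:5432
name = grafana
user = grafana
# wrap the password in triple quotes if it contains # or ;
password = """your-password"""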
Grafana runs on xfs and the database is located on ext4.
How do you define a lot of traffic?
Only two users and around 200 alerts, so to me it doesn't sound like too much.
No, that is not a lot of traffic and xfs and ext4 are standard file systems. Looking at the error messages I’m not sure if they are sqlite errors. Which datasource is returning the errors?
I noticed this on a server where Prometheus and Grafana were contending heavily for the same storage partition (it has definitely gotten worse in the 6.x series), so making sure Grafana's database is on a dedicated partition should help.
I have a Grafana 8.3.3 setup using the default SQLite config.
I run about 150 alert rules.
I’m getting a lot of those errors in the alerts panel:
could not find datasource: database is locked
Looking at the logs, I also found a lot of
msg="failed to fetch alert rule" err="database is locked"
even
msg="failed to save alert state" err="database is locked"
The system is definitely not busy. It runs in a VM on a single ext4 partition with 35 GB of free space, 7 GB of RAM with 65% free, two cores, almost idle right now.
The alert rules are scheduled with a daily interval, but I suspect they are all evaluated at the same time, so that could be 150 threads trying to access the DB at once.
Could that be the cause?
Does this mean I’m already out of scale and I should move to another DB?
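In case it makes a difference, one thing I'm planning to try before migrating is enabling SQLite's write-ahead log, which should let reads proceed while a write holds the lock. If I read the configuration docs correctly, the sqlite3 backend exposes a wal flag in the [database] section of grafana.ini; treat this as an untested sketch rather than a confirmed fix:

[database]
type = sqlite3
# Write-Ahead Logging: readers no longer block on the single writer,
# which should reduce "database is locked" errors under concurrent access
wal = true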