Veeam software | VEEAM:KB4613
Jun 12, 2024

Backup Failing With `Too many snapshots` When Using Longhorn as a Storage Provisioner

www.veeam.com

Challenge

The Veeam Kasten for Kubernetes backup action for Longhorn volumes fails with the error message:

too many snapshots created

Cause

When integrating with CSI-based volumes, Veeam Kasten for Kubernetes employs VolumeSnapshot resources to create snapshots during backup operations.

With Longhorn, upon the creation of a VolumeSnapshot and its corresponding VolumeSnapshotContent resource by the snapshot-controller, Longhorn generates a snapshots.longhorn.io resource and synchronizes it to produce a Longhorn backend snapshot. As part of its retention policy, Veeam Kasten for Kubernetes deletes the VolumeSnapshotContent resource to remove the snapshot. However, Longhorn does not automatically delete the snapshots.longhorn.io resource it created; the snapshot is merely flagged as removed but not purged from the system.

Over time, this can lead to an accumulation of snapshots for a volume, especially if backups are frequent. Eventually, this may cause the backup process to fail when the number of snapshots reaches Longhorn’s maximum limit of 254 per volume.
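To make the accumulation concrete, here is a rough estimate of how long a volume can go before it reaches the limit. It is only a sketch: the helper name `days_until_limit` is hypothetical, the numbers in the example are illustrative, and it assumes the snapshot count grows at a steady rate with nothing being purged.

```shell
# Longhorn's per-volume snapshot limit, as stated above.
LONGHORN_MAX_SNAPSHOTS=254

# Hypothetical helper: days_until_limit CURRENT_COUNT NEW_PER_DAY
# Prints the whole number of days until the limit is reached, assuming the
# count grows by NEW_PER_DAY every day and nothing is purged.
days_until_limit() {
  current="$1"
  per_day="$2"
  if [ "$per_day" -le 0 ]; then
    echo "per-day growth must be positive" >&2
    return 1
  fi
  remaining=$(( LONGHORN_MAX_SNAPSHOTS - current ))
  if [ "$remaining" -le 0 ]; then
    # Already at or past the limit.
    echo 0
    return 0
  fi
  echo $(( remaining / per_day ))
}

# Illustrative numbers: 85 snapshots today, 24 new snapshots per day.
days_until_limit 85 24   # prints 7
```

With hourly backups, a volume in this state would hit the limit in about a week unless the removed snapshots are purged.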

Below is an example of the snapshot count for an application that was set to retain 8 snapshots in Veeam Kasten for Kubernetes.

_# PVC in one sample namespace_
**❯ kubectl get pvc -n postgresql**
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-postgres-postgresql-0   Bound    pvc-fafda05d-314e-420f-bf37-d7365b31ea1c   8Gi        RWO            longhorn       24h

_# Count of VolumeSnapshot resources_
**❯ kubectl get volumesnapshot -n postgresql --no-headers | wc -l**
8

_# Count of Longhorn snapshot CRs_
**❯ kubectl get snapshots.longhorn.io -n longhorn-system | grep pvc-fafda05d-314e-420f-bf37-d7365b31ea1c | wc -l**
85
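The same check can be run across all volumes at once. The sketch below is one possible way to do it, not a Veeam or Longhorn tool: the function names are hypothetical, the 200-snapshot warning threshold is an arbitrary value below the 254 limit, and the commented kubectl line assumes the Longhorn Snapshot CR exposes its volume name at `.spec.volume`.

```shell
# Hypothetical helper: reads one volume name per line on stdin and prints
# "<count> <volume>" for each volume, highest count first.
count_snapshots_per_volume() {
  sort | uniq -c | sort -rn | awk '{print $1, $2}'
}

# Hypothetical helper: volumes_near_limit [THRESHOLD]
# Reads volume names on stdin and prints only the volumes whose snapshot
# count is at or above THRESHOLD (default 200, an arbitrary warning level).
volumes_near_limit() {
  threshold="${1:-200}"
  count_snapshots_per_volume | awk -v t="$threshold" '$1 >= t {print $2}'
}

# Usage against a live cluster (left commented out because it needs kubectl
# access; assumes the volume name is stored at .spec.volume):
# kubectl get snapshots.longhorn.io -n longhorn-system \
#   -o jsonpath='{range .items[*]}{.spec.volume}{"\n"}{end}' \
#   | count_snapshots_per_volume
```

Counting per volume rather than grepping for a single PVC name makes it easier to spot which volumes are closest to the limit.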

The Longhorn UI shows these hidden snapshots as marked removed but not purged.

[Screenshot: Longhorn UI with hidden snapshots marked as removed but not purged]

Solution

Currently, Longhorn does not automatically purge removed snapshots when the VolumeSnapshot/VolumeSnapshotContent resources are deleted from the Kubernetes cluster.

Starting in Longhorn version 1.4.1, a new type of recurring job was introduced: snapshot-cleanup. This job type purges removed snapshots and system snapshots.

Issue Prevention

Within Longhorn, configure a recurring job for the snapshot-cleanup task type.

From Longhorn UI

Add the default group under Groups if the job should apply broadly: with default in its groups, the recurring job is automatically scheduled on any volume that has no recurring job assigned.

[Screenshot: snapshot-cleanup recurring job in the Longhorn UI]

Alternatively, use the following kubectl command to create the RecurringJob resource from the CLI.

cat << EOF | kubectl create -f -
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: snapshot-cleanup
  namespace: longhorn-system
spec:
  concurrency: 1
  cron: "0 * * * *"
  groups:
  - default
  labels: {}
  name: snapshot-cleanup
  retain: 0
  task: snapshot-cleanup
EOF

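If the default group is not wanted, a volume can instead be opted in to the job individually through a label. The sketch below assumes kubectl access and that Longhorn volumes are exposed as `volumes.longhorn.io` resources in the longhorn-system namespace; the function names are hypothetical, while the label key format is the one that appears in the snapshot-cleanup pod log in this article.

```shell
# Builds the per-volume label that opts a volume in to a named recurring job.
# The key format matches the label shown in the snapshot-cleanup pod log.
recurring_job_label() {
  printf 'recurring-job.longhorn.io/%s=enabled\n' "$1"
}

# Hypothetical wrapper: label_volume_for_job VOLUME_NAME JOB_NAME
# Assumes kubectl access to the cluster and that Longhorn volumes are
# volumes.longhorn.io resources in the longhorn-system namespace.
label_volume_for_job() {
  kubectl -n longhorn-system label volumes.longhorn.io "$1" \
    "$(recurring_job_label "$2")"
}

# Example (requires a cluster, so it is left commented out):
# label_volume_for_job pvc-fafda05d-314e-420f-bf37-d7365b31ea1c snapshot-cleanup
```

Labeling individual volumes keeps the cleanup job scoped to the volumes that Veeam Kasten for Kubernetes actually backs up.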

More Information

The recurring job creates a Kubernetes CronJob resource, which in turn runs a snapshot-cleanup pod on the schedule set by the cron expression specified when the job was created.

Below is the log from the snapshot-cleanup pod that ran after the creation of the recurring job.

❯ kubectl logs snapshot-cleanup-28069140-c8cm5 -n longhorn-system 
 
time="2023-05-15T11:00:00Z" level=debug msg="Setting allow-recurring-job-while-volume-detached is false" 
time="2023-05-15T11:00:00Z" level=debug msg="Get volumes from label recurring-job.longhorn.io/snapshot-cleanup=enabled" 
time="2023-05-15T11:00:00Z" level=debug msg="Get volumes from label recurring-job-group.longhorn.io/default=enabled" 
time="2023-05-15T11:00:00Z" level=info msg="Found 1 volumes with recurring job snapshot-cleanup" 
time="2023-05-15T11:00:00Z" level=info msg="Creating job" concurrent=1 groups=default job=snapshot-cleanup labels="{\"RecurringJob\":\"snapshot-cleanup\"}" retain=0 task=snapshot-cleanup volume=pvc-84d3d7d0-3abc-427c-a959-5ccc7da912a5 
time="2023-05-15T11:00:01Z" level=info msg="job starts running" labels="map[RecurringJob:snapshot-cleanup]" namespace=longhorn-system retain=0 snapshotName=snapshot-90135f33-93ce-4de4-829b-4dd01db2d827 task=snapshot-cleanup volumeName=pvc-84d3d7d0-3abc-427c-a959-5ccc7da912a5 
time="2023-05-15T11:00:01Z" level=info msg="Running recurring snapshot for volume pvc-84d3d7d0-3abc-427c-a959-5ccc7da912a5" labels="map[RecurringJob:snapshot-cleanup]" namespace=longhorn-system retain=0 snapshotName=snapshot-90135f33-93ce-4de4-829b-4dd01db2d827 task=snapshot-cleanup volumeName=pvc-84d3d7d0-3abc-427c-a959-5ccc7da912a5 
time="2023-05-15T11:00:01Z" level=debug msg="Purged snapshots" labels="map[RecurringJob:snapshot-cleanup]" namespace=longhorn-system retain=0 snapshotName=snapshot-90135f33-93ce-4de4-829b-4dd01db2d827 task=snapshot-cleanup volume=pvc-84d3d7d0-3abc-427c-a959-5ccc7da912a5 volumeName=pvc-84d3d7d0-3abc-427c-a959-5ccc7da912a5 
time="2023-05-15T11:00:01Z" level=info msg="Finished recurring snapshot" labels="map[RecurringJob:snapshot-cleanup]" namespace=longhorn-system retain=0 snapshotName=snapshot-90135f33-93ce-4de4-829b-4dd01db2d827 task=snapshot-cleanup volumeName=pvc-84d3d7d0-3abc-427c-a959-5ccc7da912a5 
time="2023-05-15T11:00:01Z" level=info msg="Created job" concurrent=1 groups=default job=snapshot-cleanup labels="{\"RecurringJob\":\"snapshot-cleanup\"}" retain=0 task=snapshot-cleanup volume=pvc-84d3d7d0-3abc-427c-a959-5ccc7da912a5 
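To confirm which volumes were actually purged, the log can be filtered for its "Purged snapshots" entries. This is a minimal sketch that assumes the log format shown above; the function name `purged_volumes` is hypothetical.

```shell
# Hypothetical helper: reads snapshot-cleanup pod logs on stdin and prints the
# volumeName of each "Purged snapshots" entry, one per line.
purged_volumes() {
  sed -n 's/.*msg="Purged snapshots".* volumeName=\([^ ]*\).*/\1/p'
}

# Example (requires a cluster; pod name taken from the log above):
# kubectl logs snapshot-cleanup-28069140-c8cm5 -n longhorn-system | purged_volumes
```

Lines without a "Purged snapshots" message are ignored, so an empty result means the run purged nothing.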

Refer to the Longhorn documentation to read more about recurring jobs.

