On January 31, a GitLab database replica was lagging behind. An administrator decided to wipe the replica and start the cluster again. Due to a mistake, the database master was deleted.

During the recovery attempt it turned out that backups were not actually being made and were never checked for errors. None of the 5 independent backup methods helped. Luckily, there was an LVM snapshot that happened to have been taken by chance.

GitLab's explanation:

1) LVM snapshots are by default only taken once every 24 hours.
2) Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored.
3) Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
4) The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost
5) The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented
6) Our backups to S3 apparently don’t work either: the bucket is empty
7) We don’t have solid alerting/paging for when backups fails, we are seeing this in the dev host too now.
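
Points 6 and 7 come down to the same gap: nothing noticed that the backup location was empty or stale. A minimal sketch of such a check is below. It is not GitLab's tooling; the function name `check_backups`, the local-directory layout, and the thresholds are all assumptions made for illustration. It only looks at whether the newest file in a directory is recent enough and not suspiciously small.

```python
import time
from pathlib import Path


def check_backups(backup_dir, max_age_hours=24.0, min_bytes=1, now=None):
    """Return a list of problems found; an empty list means the check passed.

    Hypothetical helper: directory layout and thresholds are assumptions.
    """
    now = time.time() if now is None else now
    directory = Path(backup_dir)

    if not directory.is_dir():
        return [f"backup directory {directory} does not exist"]

    files = [p for p in directory.iterdir() if p.is_file()]
    if not files:
        # The "bucket is empty" case from point 6.
        return [f"backup directory {directory} is empty"]

    # Only the most recent backup matters for freshness.
    newest = max(files, key=lambda p: p.stat().st_mtime)
    info = newest.stat()

    problems = []
    age_hours = (now - info.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(
            f"newest backup {newest.name} is {age_hours:.1f} h old "
            f"(limit {max_age_hours} h)"
        )
    if info.st_size < min_bytes:
        problems.append(
            f"newest backup {newest.name} is only {info.st_size} bytes "
            f"(minimum {min_bytes})"
        )
    return problems
```

Run from a scheduler, a non-empty result would be forwarded to whatever paging system is in use. A check like this only proves that a recent file exists; confirming that a backup is usable still requires a test restore.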