Roadmap and work estimates for Obnam version 1.0
Date: 2011-04-06
Latest update: 2011-12-03
I was asked what it would take to make Obnam good enough that I would be willing to rely on it as my sole form of backup. I might not ever want to do that: I may decide I'm too paranoid for a single form of backup to ever be satisfactory, however much I might trust Obnam itself. However, it's an excellent thought experiment, if nothing else.
I will dub that version of Obnam "1.0". It's not a trust-inspiring version number, of course. This document explains the criteria I have for 1.0, and what needs to be done, and how much work I expect it to be.
I have several big worries about relying only on Obnam for backups:
- Obnam might lack essential features to be a relevant solution for all my use cases.
- The repository format used by Obnam might change, requiring me to back up all my data again.
- Obnam might have a bug so that it fails to back up some data, or loses or corrupts backed up data, without this being obvious, and I only discover it when I try to restore.
- Obnam might not be robust enough to handle the real world: it might not be able to deal with the filesystem being in turmoil while data is being backed up, or network connections breaking, or other such issues. This would mean Obnam needs a lot of hand-holding to work, and that would be valuable time away from other things I should be doing.
- Obnam might require excessive resources (CPU time, memory, disk space, network bandwidth, etc), making it impractical to use on real data.
- I might lose access to the program. As long as Obnam pretty much only lives on my own systems, they might plausibly all get corrupted or maliciously wiped, at the same time.
I discuss each of these below.
Essential features
At the moment, on 2011-04-06, the essential features Obnam is lacking are encryption, and proper support for concurrent clients.
I am working on encryption, and the basics work. The missing features are proper support for multiple users of the same repository, with their own keys, and key/user management related to encryption. These features are not necessary for basic encryption, but they need to be implemented now so that the repository format does not have to change because of them later.
Support for concurrent clients is mainly lacking in how locking is used, or sometimes not used. For 1.0, it would be good enough for me to have locking that is correct, but does not allow good concurrency, e.g., it might allow only one client to back up at the same time. However, there should be a way to add better concurrency later, without breaking the repository format or requiring all clients to be upgraded at the same time.
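As an illustration of the "correct, but not concurrent" locking described above, here is a minimal sketch of a single exclusive lock file. It is not Obnam's actual locking code: the names `acquire_lock`, `release_lock`, and `RepositoryLocked` are hypothetical, and the sketch assumes a local filesystem where exclusive file creation is atomic.

```python
import os


class RepositoryLocked(Exception):
    """Raised when another client already holds the lock (hypothetical name)."""


def acquire_lock(lock_path):
    # O_CREAT | O_EXCL makes creation atomic on a local filesystem:
    # exactly one caller can create the file, everyone else fails.
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        raise RepositoryLocked(lock_path)
    # Record who holds the lock, to help with manual cleanup of stale locks.
    os.write(fd, str(os.getpid()).encode("ascii"))
    os.close(fd)


def release_lock(lock_path):
    # Removing the file lets the next client acquire the lock.
    os.remove(lock_path)
```

A single lock like this allows only one client to back up at a time, which is the trade-off accepted for 1.0. A finer-grained scheme could later replace it, provided the lock file itself is not treated as part of the repository format.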
There is also no way currently to remove a client from the backup repository.
Obnam does not currently support ACLs or xattrs. I don't use them myself, but this would be an essential feature for others.
TODO:
- Finish encryption support (done in 0.16):
- "obnam list-keys"
- show which keys can access which toplevels
- report toplevels that user does not have access to
- "obnam add-key"
- add key to all shared toplevels
- add key to user's own per-client toplevel, if it exists
- optionally add key to other per-client toplevels as well
- alternatively, add key only to specified toplevels
- "obnam remove-key"
- remove key from all shared toplevels
- remove key from user's own per-client toplevel, if one exists
- optionally remove key from other per-client toplevels
- alternatively, remove key only from specified toplevels
- Fix the concurrency issue (almost done, too much mutual exclusion now):
- design the right way to lock with multiple clients
- implement the design, or a subset, depending on the amount of work
- Implement client removal from repository. (Done in 0.16.)
- remove per-client toplevel
- remove all per-client data chunks
- remove client from clientlist
- remove client's keys from shared toplevels
Repository format
The repository format is, I believe, stable now, except for details related to locking. I don't have any plans to change the format otherwise at this time. After Obnam reaches 1.0, the format might need to change, to support new features. It is possible that this might require incompatible changes, requiring backing things up from scratch. I don't know if that can be avoided.
However, this problem can be greatly mitigated by allowing me to run version 1.0 for all normal backups, and future development versions against other backup repositories, and have both the stable and development version installable in parallel.
Given that, I don't think this risk would stop me from relying on Obnam as my sole backup system.
Bugs
I have tried to be careful in developing Obnam, and I have fairly extensive test suites. These do not prevent bugs from happening.
Regardless of who created a backup system, it needs to be tested. The best way to guard against bugs in the backup system is to verify that it works, before relying on it. So, I would need to do at least the following, before relying on Obnam:
- make a full backup of all files, and a Summain manifest of the files
- the filesystem shall be idle, so that nothing changes during the backup and Summain runs
- make incremental daily backups, for several days, after actively using the system
- ditto manifest, and ditto idle
- restore every backup, and verify that all data is restored correctly
- use the manifests to verify
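The manifest comparison in the steps above can be sketched roughly as follows. This is not Summain and does not reproduce its output format: `build_manifest` and `compare_manifests` are hypothetical helpers, and the sketch only compares file contents (SHA-256), whereas a real manifest would also need to cover metadata such as permissions, ownership, and timestamps.

```python
import hashlib
import os


def build_manifest(root):
    """Map each regular file's path, relative to root, to its SHA-256 digest."""
    manifest = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; this sketch covers content only.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            digest = hashlib.sha256()
            with open(full, "rb") as f:
                # Read in 1 MiB blocks so large files do not fill memory.
                for block in iter(lambda: f.read(1024 * 1024), b""):
                    digest.update(block)
            manifest[os.path.relpath(full, root)] = digest.hexdigest()
    return manifest


def compare_manifests(original, restored):
    """Return (missing, extra, changed) as sorted lists of relative paths."""
    missing = sorted(set(original) - set(restored))
    extra = sorted(set(restored) - set(original))
    changed = sorted(
        path for path in set(original) & set(restored)
        if original[path] != restored[path]
    )
    return missing, extra, changed
```

A restore passes this check only when all three lists are empty. The manifest of the live data has to be taken while the filesystem is idle, otherwise differences may reflect ordinary changes rather than backup bugs.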
TODO:
- Design and implement a verification test. (done)
- Execute the verification test, make note of problems.
- Fix any problems found.
- Repeat until it goes flawlessly.
Robustness
It is not enough for a backup system to work when everything goes just right. Backup systems are meant for dealing with emergencies, which are by definition situations where very little goes right.
Even when there is not an emergency, a backup system should be robust enough to handle the kinds of problems the real world consistently has. For example, the filesystem will change during a backup, data gets corrupted, and network links break. A good backup system needs to handle all of these in some suitable way, without requiring constant attention from the user. Otherwise the user will get annoyed and switch to another system.
Obnam currently has trouble with files changing or going missing while it is backing them up. It does not deal well with some of the data in the backup repository going missing or being corrupt.
It also does not handle the network link going missing, but that is, I think, acceptable. That is an issue best dealt with by running backups frequently. Obnam already has a checkpoint feature, where it will continue a backup from the latest checkpoint if a backup run crashes for any reason.
An important part of robustness is noticing problems. Obnam has the "verify" and "fsck" commands, but they could be improved. That's not all that important for 1.0, however.
TODO:
- Fix Obnam bug about files changing or going missing. (done now)
- develop a test framework to inject such errors systematically during test runs, perhaps by having a wrapper around the Obnam VFS layer: allow N operations, then remove a file
- deploy test framework
- fix any bugs found
- Fix Obnam bug about repository data being corrupt or missing. Restore must be possible for the remaining data. Fsck must be able to notice missing files, and report what is still recoverable. Any missing chunk file must merely result in a hole in the restored file. Missing B-tree nodes must still allow other (reachable) nodes to be used. (done now)
- develop test that removes chunks and checks that restores work otherwise
- develop test that removes B-tree nodes and verifies that files in the remaining nodes are restored correctly
- fix any problems found
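The "allow N operations, then remove a file" idea from the list above could look something like the sketch below. It does not use Obnam's real VFS interface: `LocalFS` is a hypothetical stand-in with two made-up methods, and `FaultInjectingFS` is an assumed name for the wrapper.

```python
import os


class LocalFS:
    """Hypothetical stand-in for a VFS layer; not Obnam's real interface."""

    def listdir(self, path):
        return sorted(os.listdir(path))

    def read(self, path):
        with open(path, "rb") as f:
            return f.read()


class FaultInjectingFS:
    """Pass calls through to a wrapped filesystem object; once the allowed
    number of operations is used up, delete a victim file before the next one."""

    def __init__(self, wrapped, allowed_ops, victim):
        self._wrapped = wrapped
        self._remaining = allowed_ops
        self._victim = victim

    def __getattr__(self, name):
        # Only reached for names not defined on the wrapper itself.
        attr = getattr(self._wrapped, name)
        if not callable(attr):
            return attr

        def counted(*args, **kwargs):
            self._before_operation()
            return attr(*args, **kwargs)

        return counted

    def _before_operation(self):
        if self._remaining > 0:
            self._remaining -= 1
            return
        # Budget exhausted: inject the fault once, then behave normally.
        if self._victim is not None:
            if os.path.exists(self._victim):
                os.remove(self._victim)
            self._victim = None
```

Running the same test scenario with every value of N from zero up to the total number of operations would exercise a file disappearing at each possible point in a backup run.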
Performance
For Obnam to be usable for me, I would need to be able to do a full backup of my laptop, to a brand new backup repository, overnight. In numbers: 250 gigabytes of data, in both large and small files, in 12 hours. Using encryption. Over SFTP to localhost.
I should further need to be able to run a daily incremental backup in less than an hour. Say, up to ten gigabytes of data.
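To make these goals concrete, the required sustained throughput can be worked out with a few lines of arithmetic. The sketch assumes decimal units (1 gigabyte = 10^9 bytes) and takes "less than an hour" as exactly one hour; `required_throughput` is just an illustrative helper.

```python
GB = 10**9  # assumption: decimal gigabytes
MB = 10**6


def required_throughput(total_bytes, seconds):
    """Average bytes per second needed to move total_bytes within seconds."""
    return total_bytes / seconds


full_backup = required_throughput(250 * GB, 12 * 3600)
incremental = required_throughput(10 * GB, 1 * 3600)

print("full backup: %.2f MB/s" % (full_backup / MB))   # full backup: 5.79 MB/s
print("incremental: %.2f MB/s" % (incremental / MB))   # incremental: 2.78 MB/s
```

These are averages over the whole run, including encryption and SFTP overhead, so the per-file processing rate has to be comfortably higher to leave room for many small files.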
TODO:
- Re-implement the Obnam benchmark setup to allow for encryption, network backups, and different profiles. (done)
- Run, profile, and optimize, until performance goal is met.
Losing the program
This risk is easily dealt with by uploading Obnam to Debian, so it gets mirrored all over the world. If the whole planet gets destroyed, I'll have other things to worry about than my backups.
TODO:
- Review and fix Debian packages of Obnam and its dependencies. Then upload them to Debian. (done)
Work estimate
EDIT: Hah, it's months later, and all my estimates are already blown. I'm removing them, since they're not even amusing anymore.