Open bugs in Obnam
See bug-reporting page for hints on what to include in a bug report, if you're unsure.
See also bugs that are done, and bugs.
See also:
- done
- non-wishlist
- bugs-1.0-blockers for bugs that need to be dealt with before 1.0
- Bugs in Debian
It'd be handy if "obnam restore" could exclude parts of the data, based on similar patterns as "obnam backup" does. --liw
An idea for speed improvement: store small chunks not in the per-client B-tree, but a separate per-client B-tree. I suspect they are currently making the metadata too sparse in the per-client B-tree. However, this needs to be measured. --liw
It looks as if changes to xattrs (but nothing else) will not trigger a backup in Obnam. They should. --liw
Suggested by Rafał Gwiazda: "obnam verify" should report all errors, instead of terminating after the first problem found.
--liw
It would be good to have obnam provide an option to enable/disable paramiko logging. It is often voluminous and might not be helpful when reading logs for debugging. --liw
<SLi> liw, I run obnam backup for a small while (probably not until even the first checkpoint) on a new repository, then ctrl-c and say obnam fsck, I get AssertionError on assert 'key_size' in ns_temp.get_metadata_keys().
Two things:
- using assert is wrong in the code, it should raise an exception that results in a humane error message
- crashing before the first checkpoint should not result in an error in the first place
--liw
I've fixed the assertion in larch (it's now a more user-friendly error message). The crash still occurs, but I'm not sure if it's reasonable for Obnam to try to correct that. The same error may happen due to other reasons, and automatically fixing things is too likely to break things. So I'll leave the bug open, but take it off the 1.0 blockers list.
--liw
See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=654211 for info.
See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=655094 for details.
See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=649769. --liw
The gpg agent problem is sorted, but there's a mysterious crash. Does not seem important enough at this point to be a 1.0 blocker. --liw
Obnam should, at least optionally, use fsync or other methods to ensure that everything gets committed to disk by the kernel by the end of a backup run. --liw
I want this to not have a huge performance impact, though. Learning
from the lessons of dpkg, sqlite/liferea/firefox, etc, and using fsync/fdatasync
and sync_file_range in the right ways is going to be necessary. --liw
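A minimal sketch of the basic write-then-fsync pattern, assuming a POSIX system. The function name write_durably and the ".tmp" suffix are hypothetical, not taken from Obnam. It uses neither fdatasync nor sync_file_range, and calling it once per file is probably exactly the kind of performance cost the comment above wants to avoid, so treat it only as the baseline to improve on.

```python
import os


def write_durably(path, data):
    """Write data to path and ask the kernel to commit it before returning."""
    tmp = path + ".tmp"  # hypothetical temporary name
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to commit the file contents
    os.replace(tmp, path)      # atomically move the new file into place
    # fsync the containing directory so the rename is committed as well.
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

Batching the fsync calls at the end of a backup run, rather than after every file, is one way the cost might be reduced, but that would need measuring.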
For better performance, it would be good if Obnam could keep the shared chunks B-trees locked for shorter periods of time. This can be implemented by only updating them at the end of the backup generation (including checkpoints), rather than updating along the way when making a backup. --liw
It would be good to have an architecture diagram of the internals of obnam, to make life easier for new code contributors. --liw
Obnam should do some processing in the background, for example uploading
of data to the backup repository. This would allow better use of the
bottleneck resource (network). Below is a journal entry with my thoughts
on how to implement that. It may be out of date by now, but we'll see.
I have a Python module to simplify the use of multiprocessing to do
jobs in the background (which avoids the Python global interpreter lock,
in case that matters). --liw
Here's a design for Obnam concurrency that came to me the other day while walking.
The core of Obnam (and larch) is quite synchronous: read data from file, read B-tree nodes, push chunks and B-tree nodes into repository. Some of that can be parallelized, but not easily: it's already tricky code, and making it even more tricky is going to require very strong justification.
Things like encrypting and decrypting files need to be done in parallel with other things, for speed. These things are not really in the core, and indeed are provided by plugins.
So here's a way to do them in parallel:
- the core code stays synchronous, the way it is now
- whenever larch code needs to read a B-tree node, it blocks until it gets it
- the node is read, synchronously, from wherever, and put into a background processing queue (using Python multiprocessing)
- the code that waits for the node to be processed polls the queue, and handles any other background jobs that happen to finish while it waits, and returns the desired node when it gets it
- when larch writes a node (after it gets pushed out of the upload queue inside larch), it is put into a background processing queue
- at the same time, if there were any finished background jobs, they're handled (written to repo)
- at the end of the run, the main loop makes sure any pending background jobs finish and are handled
There's a complication that the B-tree code may need a node that is not yet written to the repository, since it is still going through a background processing queue.
I'm going to need to restructure how hooks process files that are written to or read from the repository. Writing should happen asynchronously: files are put in a queue and processed in the background, and then written to the actual repository when background processing is finished. Reading needs to happen synchronously, since there's a B-tree call waiting for them, but to handle the case of needing a node that is still being processed in the background, we need to keep track of what nodes are in the background, and wait for them to be done before reading them.
Reading would thus be something like this, implemented in the
Repository class:
while wanted file is in write queue:
    process a write queue result
read file from repository
process file data through hooks
return file
The write queue is more complicated (again handled somehow in the
Repository class):
- a multiprocessing.Queue instance for holding pending jobs
  - a job is a (pathname, file contents) pair
- another Queue instance for holding unhandled results
  - (pathname, file contents) pair, where the contents may have changed
- a set for holding file identifiers (paths) that have been put into the pending jobs queue, but not yet processed from the results queue
Each plugin can provide one or more Unix commands (filters) through which the file contents gets piped. The background processes run each filter in turn, giving the output of the previous one as input to the next one.
To handle a result from a background job, the following needs to be done:
- remove the pathname from the set
- write the filtered file contents into the repository
To implement this, I'll do this:
- All changes should be in HookedFS
- write_file and overwrite_file put things into the pending jobs queue, and also call a new method handle_background_results
- cat gets changed to wait for files in the write queue, calling handle_background_results
- handle_background_results will do what is needed
This design isn't optimal, since writing things to the repository isn't being done in parallel with other things, but I'll tackle that problem later.
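The bookkeeping in this design can be sketched roughly as follows. This is an illustration, not Obnam's code: the class name BackgroundWriter, the repo dictionary standing in for the repository, and the finish method are all hypothetical, and a background thread with queue.Queue is swapped in for multiprocessing to keep the example self-contained. Filters are plain Python callables here instead of Unix commands.

```python
import queue
import threading
import zlib


class BackgroundWriter:
    """Hypothetical sketch of the write queue; a dict stands in for the repo."""

    def __init__(self, write_filters, read_filters):
        self.write_filters = write_filters  # callables, bytes -> bytes
        self.read_filters = read_filters    # inverse filters, applied on read
        self.jobs = queue.Queue()           # pending (pathname, contents) pairs
        self.results = queue.Queue()        # filtered (pathname, contents) pairs
        self.pending = set()                # queued but not yet written to repo
        self.repo = {}                      # assumption: stand-in repository
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        # Background side: run each filter in turn on every job.
        while True:
            pathname, contents = self.jobs.get()
            for f in self.write_filters:
                contents = f(contents)
            self.results.put((pathname, contents))

    def write_file(self, pathname, contents):
        # Assumes a pathname is not written again while it is still pending.
        self.pending.add(pathname)
        self.jobs.put((pathname, contents))
        self.handle_background_results(wait=False)

    def handle_background_results(self, wait):
        # Write finished jobs to the repository; block for at most one result.
        while True:
            try:
                pathname, contents = self.results.get(block=wait)
            except queue.Empty:
                return
            self.pending.discard(pathname)
            self.repo[pathname] = contents
            wait = False

    def cat(self, pathname):
        # Reading is synchronous: first wait for the file to leave the queue.
        while pathname in self.pending:
            self.handle_background_results(wait=True)
        contents = self.repo[pathname]
        for f in self.read_filters:
            contents = f(contents)
        return contents

    def finish(self):
        # End of run: make sure every pending job is handled.
        while self.pending:
            self.handle_background_results(wait=True)
```

For example, BackgroundWriter([zlib.compress], [zlib.decompress]) compresses in the background on write and decompresses on cat. As in the design, the repository write itself still happens in the main thread.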
It would be nice if Obnam could tell you how much you've saved thanks to the de-duplication. --liw
obnam force-lock currently doesn't work. As a workaround, remove the lockfiles (all files named lock inside the repository) by hand.
find [repository path] -name lock -exec rm '{}' \;
--weinzwang
I confirm that I see this too. This bug exists because I changed how Obnam uses locks: it now locks each directory properly, instead of just the per-client directory. However, I haven't fixed "force-lock" to deal with other locks, so now it's not possible to force the locks for other directories than the per-client one. This is awkward.
To fix this, Obnam needs to know that it can safely remove the locks. There are two cases:
- the lock was created by some other client; in this case, the user (not Obnam automatically) needs to decide if it is safe to remove the lock: just running "obnam force-lock" should not do that, instead the user should provide an option like "--really-force-locks" or something
- the lock was created by the same client, i.e., Obnam running on the same host; in this case, if the Obnam process no longer exists, the lock can be safely removed, otherwise the locks should not be removed (again, unless "--really-force-locks" is used)
To implement this, we need Obnam to store the hostname and process id of the Obnam instance that created the lock, preferably in a way that does not leak sensitive information easily (don't store the client name in cleartext, but the md5sum of it, or something).
--liw
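A rough sketch of the decision described above, assuming a Unix system. The lock record layout and the function names are hypothetical, and SHA-256 is used for the hostname digest in place of the md5sum mentioned above. The really_force parameter stands for the suggested "--really-force-locks" option.

```python
import errno
import hashlib
import os
import socket


def host_digest(hostname):
    """Digest of the hostname, so the lock does not hold it in cleartext."""
    return hashlib.sha256(hostname.encode("utf-8")).hexdigest()


def make_lock_info():
    """What a lock file could record about the process that created it."""
    return {"host": host_digest(socket.gethostname()), "pid": os.getpid()}


def process_exists(pid):
    """True if a process with this pid exists on the local host."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permission
    except OSError as e:
        return e.errno == errno.EPERM  # exists, but belongs to another user
    return True


def may_break_lock(info, really_force=False):
    """Decide whether a lock described by info can be removed."""
    if really_force:
        return True
    if info["host"] != host_digest(socket.gethostname()):
        return False  # another client's lock: the user has to decide
    return not process_exists(info["pid"])
```

A process id can be reused after the original process exits, so a check like this can err on the side of keeping a stale lock.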
As of 0.27, force-locks unconditionally breaks locks, but the lock files will contain sufficient information to allow us to be more intelligent about the breaking of locks in the future.
--kinnison
--
This is not good enough -- I'd like obnam to be able to break locks more kindly -- but it's good enough for 1.0, I think, so removing the blocker tag. --liw
Obnam apparently depends on "utimensat", which only appears in more recent versions of glibc. Unfortunately, CentOS5's glibc does not have this routine, and we are not upgrading to CentOS6 just yet.
[hash@hydrogen]$ uname -a
Linux hydrogen.localdomain 2.6.18-274.18.1.el5 #1 SMP Thu Feb 9 12:45:44 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
Trying to import obnam produces this error:
>>> import obnamlib._obnam
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: /usr/local/lib/python2.6/site-packages/obnamlib/_obnam.so: undefined symbol: utimensat
>>>
-- hash
utimensat was added to the Linux kernel in version 2.6.22 (2007), and to glibc in version 2.6 (2007). It is specified in POSIX 2008. It's needed to set sub-second timestamps on files on restore. As a workaround, you can change the _obnammodule.c to call utimes instead. If you can provide a clean patch, I'll be happy to add it to Obnam proper.
--liw
Normally, GPG can encrypt to a key using only the public key, without needing the private key available. I'd like to have the same model for obnam encrypted backups: perform a backup to a key using only the public key, without having the private key available. That way, I could perform backups of various systems using public keys whose corresponding private keys I keep entirely offline and secured.
Unfortunately, Obnam requires the secret in order to do backups. It needs to download, and decrypt, some metadata from the backup repository in order to add new backups there. See the ondisk and encryption pages for details. --liw
I've read those pages, but as far as I can tell, the ability to decrypt part of the repository doesn't seem like an inherent requirement to meet obnam's goals, just an implementation detail. And as far as I can tell, without some change in this area, obnam does not support letting a client perform backups without giving that client full access to previous backups.
Use case: I want to back up to a storage system that I trust for availability but not for privacy; thus, I need encryption, and I can't do "pull" backups since I don't trust the storage server with access to the system that needs backing up. However, I also want the backups to protect me from security breaches, by giving me the ability to follow the standard advice for how to deal with a system break-in: nuke everything and restore from the last-known good backup. How can I have a "last-known good backup" if the backed-up system can access the old backups?
I understand the desire to do what you want to do, but I haven't found a technical solution to do it. If you can come up with one, then write it up and send me a pointer. --liw
It'd be nice if Obnam could do this, but in the meantime, I think Duplicity can work this way. --AP
If Obnam had a local metadata cache, could this be implemented? If Obnam had the metadata locally, could it do a quick verification that the metadata is the same as that on the remote repository, and then backup to the remote repository "blindly", using only the local cache as a guide? --AP
A concrete existence proof: tarsnap does this. You can do tarsnap backups with only a restricted public key, and that key need not also provide access to decrypt old backups. -- Josh
I've re-opened this bug since there's ongoing discussion. I'll have a think and respond later. --liw
Could Obnam keep a local cache of remote repository metadata, like Duplicity does? I was looking through an Obnam log and noticed how many remote file accesses it does for each file it backs up. If it cached that data locally, it could do it much faster. The last-modified time of the repository could be stored in the remote repo, and if it matches the local cache, a lot of time could be saved.
That is indeed a good idea. Unfortunately, the correctness of caches is often tricky, so I've been putting off implementing this until more important things work first. Also, not caching metadata forces me to do other things to make Obnam fast. But I'd like to do the caching too, some day. --liw
Restore should be able to continue a restore if it gets interrupted. Restore should, perhaps optionally, skip files that already exist, and continue partial files. --liw
The "obnam clients" command should work even for non-clients (modulo encryption), so that you can get a list of the client names without having to guess one. --liw
Currently, there seems to be no easy way to forget all (or all but the newest) checkpoint generations. Something like
obnam --keep 1c forget
would be nice.
-- weinzwang
Would Obnam benefit from being able to use other compression tools, like bzip2, p7zip, and xz?
It would! However, implementing this is slightly tricky until I make Obnam able to run external tools
in the background, rather than waiting for them to finish. The compression it uses right now is
done with the Python standard library gzip module, so it doesn't require executing an
external program, so the issue doesn't arise. However, we need the backgrounding feature
anyway, since encryption uses an external tool, and so there's a big performance impact
from using encryption. After backgrounding works, adding arbitrary compression filters will
be easy. --liw
obnam fsck's memory needs are so big I'm not able to check my repo:
[6402316.939346] Out of memory: Kill process 18036 (obnam) score 906 or sacrifice child
[6402316.939685] Killed process 18036 (obnam) total-vm:2754908kB, anon-rss:1876312kB, file-rss:640kB
--weinzwang
It would be very useful to have a simple way to get a list of the generations a given file appears in, preferably with the relevant metadata.
A way to walk through the backup repository would also be useful - think smbclient-like functionality; or maybe as already mentioned a fuse filesystem.
When --one-file-system is used, it would be nice to not cross bind-mounts. No idea how to figure that out, but it must be possible. --liw
You could look at the inode numbers for . and ./foodir/.. and check they're the same? -- kinnison
Supporting multiple compression methods would be nice.
Also one idea is to allow using one compression method (or no compression) while actually doing the backup, and then have a separate tool, to be run on otherwise idle time, recompress the data using some other, slower algorithm to conserve disk space without making the backup process slower.
-- SLi
Obnam should, optionally, ask for a gpg passphrase, for the key specified with --encrypt-with, so that a user without a gpg agent will be able to do encrypted backups. Obnam should read the passphrase if its ask-passphrase setting is true, and it has access to a terminal. It should not have a setting for the passphrase itself, just for reading it from a terminal (just so that people who don't know better don't put their passphrase in a config file or similar).
Those running obnam from cron will need to have a passphraseless key, since there's no way to give obnam a passphrase in that case, without storing it in the crontab or a config file, and then it's no better than not having a passphrase.
See Debian bug #649769.
--liw
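A small sketch of the behaviour described above: ask only when the setting is on and a terminal is available, and never take the passphrase from configuration. The function name read_passphrase and its parameters are hypothetical; only getpass.getpass and isatty are standard library calls.

```python
import getpass
import sys


def read_passphrase(ask_passphrase, stdin=None, prompt=getpass.getpass):
    """Return a passphrase typed at the terminal, or None if not applicable."""
    stdin = sys.stdin if stdin is None else stdin
    if not ask_passphrase or not stdin.isatty():
        return None  # setting is off, or no terminal (e.g. run from cron)
    return prompt("Passphrase for the GPG key: ")
```

The stdin and prompt parameters exist only so the function can be exercised without a real terminal.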
From my understanding, having a symmetric passphrase stored in a config file is not useless at all. My purpose in encrypting the backup data is to prevent the remote server from having my data in plain-view; or if I back it up to an external drive, I wouldn't want it to be accessible to anyone who picks it up. But if someone gains access to my config file, he'll have direct access to all of my data anyway--he wouldn't need to access my backups.
If I use a passphrase, then if my house burns down and I lose everything, I can get a new computer and download my data and decrypt it with my passphrase--which is long enough to be unfeasible to crack, yet completely memorized by me.
If I use a key, then if my house burns down and I don't have a working copy of my key outside my house, my backups are totally useless, and I really HAVE lost everything. (Sure, I should take precautions to keep from losing my key--but things happen.)
--Adam
It's possible to get obnam to request a passphrase when running from cron:
1. Ensure 'use-agent' is enabled in ~/.gnupg/gpg.conf.
2. Ensure the gpg-agent is running, and GPG_AGENT_INFO is set in your regular environment. Note that if obnam already asks for an encryption passphrase when run normally, then 1 & 2 are already correctly set.
3. Ensure the environment obnam is called from in cron is exporting GPG_AGENT_INFO correctly. This means you must set and export the GPG_AGENT_INFO environment variable in your cron script. gpg writes this information to ~/.gnupg/gpg-agent-info-$(hostname), so in your cron script you must have:
source "$HOME/.gnupg/gpg-agent-info-$(hostname)" && export GPG_AGENT_INFO
Then call obnam as normal.
This will only work on a desktop system where there is someone to notice that a pinentry window has popped up. However it looks like there may be a way to forward the gpg-agent socket over ssh, and thus run obnam with encryption from cron on a headless remote machine (See here). You'd probably have to store the private key on the remote machine though.. so not sure how useful that would be.
--Scott
From Chris:
python2.6 is available in epel, so it should be simple. I've hit the "In order to install package foo, you need package bar".
I'm sure this is a simple thing to do if I actually read the instructions, but for those of us who can't be bothered, a single tarball that just worked would be nice.
Ideally one that could install in my homespace for testing purposes.
From liw:
I think that might be workable. It's almost all pure Python, and can be run directly from the source directories, so a tarball with all the source projects included, plus a little script to set up PATH and PYTHONPATH would work.
obnam should accept SIZE values as well as TIME values for the --checkpoint option. For instance:
obnam backup $HOME --checkpoint=5min
This would be very handy for connections which vary a lot in speed and quality.
-- weinzwang
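One way a single --checkpoint value could carry either kind of quantity is to let the unit suffix decide. The sketch below is purely illustrative: the function name and the unit tables are assumptions, not Obnam's option syntax.

```python
import re

# Hypothetical unit tables; "min" is used for minutes so it cannot be
# confused with a size suffix.
TIME_UNITS = {"s": 1, "min": 60, "h": 3600}
SIZE_UNITS = {"kib": 1024, "mib": 1024 ** 2, "gib": 1024 ** 3}


def parse_checkpoint(value):
    """Return ("time", seconds) or ("size", bytes) for a checkpoint value."""
    match = re.fullmatch(r"(\d+)\s*([a-z]+)", value.strip().lower())
    if match is None:
        raise ValueError("cannot parse checkpoint value: %r" % value)
    number, unit = int(match.group(1)), match.group(2)
    if unit in TIME_UNITS:
        return ("time", number * TIME_UNITS[unit])
    if unit in SIZE_UNITS:
        return ("size", number * SIZE_UNITS[unit])
    raise ValueError("unknown unit in checkpoint value: %r" % value)
```

With these tables, "5min" is read as a time interval and "100MiB" as a size.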
Too much code is excluded from test coverage. For example, as much code as possible in plugins should be unit tested (so coverage can be meaningfully tested). --liw
It would be good to have a FUSE filesystem for restoring data. --liw
Obnam should work on RHEL, Scientific Linux, and CentOS, and other popular distros such as Fedora and Ubuntu. Help porting to these would be welcome. --liw
It would be nice for Obnam to have a tool to answer the question "how much space will be freed if I remove these generations?"
- need to find list of chunks that are used only by the specified gens
- perhaps also count B-tree reduction? count nodes that are unshared by the relevant trees, or only shared by the trees to be deleted
- however, the B-trees are going to be a fraction (a few percent) of the size of the chunk data, so they're not really worth it
--liw
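The chunk part of that calculation is a set difference. A minimal sketch, assuming the generation-to-chunks mapping and the chunk sizes are already available as plain dictionaries (in a real repository they would have to be read from the B-trees), and ignoring the B-tree nodes as suggested above:

```python
def bytes_freed(generations, chunk_sizes, doomed):
    """Bytes of chunk data used only by the generations in doomed.

    generations maps generation id -> set of chunk ids (hypothetical layout);
    chunk_sizes maps chunk id -> size in bytes.
    """
    doomed = set(doomed)
    kept_chunks = set()
    for gen_id, chunk_ids in generations.items():
        if gen_id not in doomed:
            kept_chunks.update(chunk_ids)
    doomed_chunks = set()
    for gen_id in doomed:
        doomed_chunks.update(generations[gen_id])
    # Only chunks that no surviving generation refers to can be removed.
    return sum(chunk_sizes[c] for c in doomed_chunks - kept_chunks)
```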
This is an idea for optimizing Obnam.
Store MD5 of string containing names + relevant metadata of all files in a directory, then re-compute that when backing up a directory: if checksums are same, then no file in the directory has changed, and there's no need to check each of them separately, saving many tree lookups
Consider only non-dirs, since subsubdir can change without it being visible at grandparent level. Thus, recursing is always necessary.
--liw
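A sketch of the per-directory checksum, assuming that name, mode, size, modification time, and owner are the "relevant metadata"; that choice and the function name are assumptions. Subdirectories are skipped, as described above.

```python
import hashlib
import os
import stat


def directory_signature(dirname):
    """MD5 over names and selected metadata of the non-directory entries."""
    h = hashlib.md5()
    for name in sorted(os.listdir(dirname)):
        st = os.lstat(os.path.join(dirname, name))
        if stat.S_ISDIR(st.st_mode):
            continue  # subdirectories are handled when recursing into them
        fields = (name, st.st_mode, st.st_size, st.st_mtime_ns,
                  st.st_uid, st.st_gid)
        h.update(repr(fields).encode("utf-8"))
    return h.hexdigest()
```

If the stored signature equals the recomputed one, the per-file lookups for that directory could be skipped, but the backup still has to recurse into each subdirectory.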
Obnam should support multiple repositories, to be chosen at invocation time.
- all repositories configured in config files
- nicknames for repositories so it's easy to choose
- --repository should accept nicknames
- choose many repositories for one run
- use all available repositories by default
--liw
Could this also be achieved by running Obnam from a wrapper script that uses a different repository for each run? Could Obnam be run in parallel instances backing up the same data to different repos? Is that possible now? --Adam
Adam, it can certainly be done by using wrapper scripts (I've been doing that), and while I haven't actually tried it, there should be no problem with backing up to multiple repositories concurrently, though you may need to fiddle with the configs so that they use different log files. --liw
After some thinking, I think I don't want nicknames for repositories, I want "profiles". Here's a concrete suggestion:
[config]
encrypt-with = CAFEF00D
profile = all
log = /var/log/obnam/obnam-%(profile)s.log
[profile "online"]
repository = sftp://liw@personal.backup.server.example.com/~/repo/
use-if = ping -c1 personal.backup.server.example.com
[profile "usb-drive"]
repository = /media/Drum/obnam-repo/
use-if = test -d /media/Drum/obnam-repo/
[profile "at-work"]
repository = /mnt/backups/
use-if = ping -c1 fs.work.example.com
pre-command = sudo mount /mnt/backups
post-command = sudo umount /mnt/backups
- if --profile=all, then iterate automatically over all profiles
- otherwise, use only the chosen profiles
- some day: run some/all profiles in parallel in one obnam instance; initially, user may run parallel obnam instances
- log file should embed profile name somehow
- should profile be selected based on user too? that can be done with "use-if = test $USER = liw"; better support can be added later, if there's a need
--liw
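To illustrate how the proposed file could be read, here is a sketch using the standard configparser module. The function names are hypothetical, and the handling of use-if (run it through the shell, treat exit status 0 as "usable") is one reading of the proposal, not a settled design. Interpolation is turned off so that %(profile)s in the log setting is left alone.

```python
import configparser
import subprocess


def shell_succeeds(command):
    """Run command through the shell; True if it exits with status 0."""
    return subprocess.call(command, shell=True) == 0


def usable_profiles(config_text, wanted="all", check=shell_succeeds):
    """Return (name, repository) for each chosen profile whose use-if passes.

    wanted is either "all" or a collection of profile names.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    usable = []
    for section in parser.sections():
        if not section.startswith("profile "):
            continue
        name = section[len("profile "):].strip().strip('"')
        if wanted != "all" and name not in wanted:
            continue
        use_if = parser[section].get("use-if")
        if use_if is None or check(use_if):
            usable.append((name, parser[section]["repository"]))
    return usable
```

The check parameter makes it possible to try the selection logic without actually running ping or test.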
Obnam is currently using paramiko as the SFTP implementation. It is a bit more limited than the SFTP protocol is, and so some stuff that Obnam should be doing, such as restoring hardlinks across SFTP, is not possible. There may also be some bugs with regard to timestamp handling.
Possible fixes:
- patch paramiko to support more of SFTP
- switch to twisted's conch or libssh or http://pypi.python.org/pypi/ssh/ or python-ssh2
--liw
Would it be beneficial to choose the size of data chunks based on the type of data, or its size? Text data might benefit from small chunks, whereas video data might benefit from large chunks, for example. Possibly the size of a file would be a sufficient indicator.
The goal would be to reduce the number of chunks, without reducing the effectiveness of de-duplication.
This needs measurements of real data to see if there are interesting parameters.
--liw
Moving forward in smaller increments than individual chunks would allow more rsync-like behavior, and more chance of finding duplicate data. This might be worthwhile, for some users, some of the time. It should be configurable, though, since it's also potentially going to be a big performance problem. --liw
If a file is sparse, and has a large hole, it would be good to skip over
it with SEEK_HOLE and SEEK_DATA. --liw
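A sketch of walking only the data regions of a file, assuming a platform where Python exposes os.SEEK_DATA and os.SEEK_HOLE (Linux, for instance). The function name data_extents is hypothetical. On filesystems without hole support the whole file is simply reported as one data extent.

```python
import errno
import os


def data_extents(fd):
    """Yield (start, end) offsets of the regions of fd that contain data.

    Note that this moves the file position of fd.
    """
    size = os.fstat(fd).st_size
    pos = 0
    while pos < size:
        try:
            start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:  # only a hole remains after pos
                break
            raise
        end = os.lseek(fd, start, os.SEEK_HOLE)  # end of file counts as a hole
        yield start, end
        pos = end
```

A backup could then read only the yielded ranges and record the gaps between them as holes.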
fsck has all the necessary information it needs to reconstruct the chunksums and chunklist B-trees in a repository. It should do so, at least when requested.
--liw
Mysterious logging errors with sftp access to live data.
sftp-root and sftp-repo run on xander: mysterious logging errors
Traceback (most recent call last):
File "/usr/lib/python2.6/logging/__init__.py", line 776, in emit
msg = self.format(record)
File "/usr/lib/python2.6/logging/__init__.py", line 654, in format
return fmt.format(record)
File "/usr/lib/python2.6/logging/__init__.py", line 436, in format
record.message = record.getMessage()
File "/usr/lib/python2.6/logging/__init__.py", line 306, in getMessage
msg = msg % self.args
TypeError: not enough arguments for format string
- happened with repo on localfs, live data on sftp (sftp2)
- running with both on localfs (sftp3): worked fine
- so there's something wrong, possibly in paramiko
- tried a script to transfer 140 gigs of data with paramiko, but this failed, the transfer speed is very slow (less than 1 MiB/s)
--liw
Removing 1.0 blocker tag: access to live data over sftp is limited enough that it's not something I want to guarantee for 1.0. Fixing the limitations may require patching paramiko or replacing it with another sftp implementation. --liw
Obnam should support non-linux file types:
- http://pubs.opengroup.org/onlinepubs/009695399/basedefs/sys/stat.h.html
- http://www.gnu.org/s/hello/manual/libc/Testing-File-Type.html
--liw
How does Obnam perform if the live data has massive numbers of hardlinks? Such hardlink trees are not rare, but it would be good to at least know how Obnam handles them. Perform a benchmark comparing three cases:
- a very large number (one million?) of small files, all unique
- same, but all files identical
- same, but all files hardlinks to the same content
Ideally, Obnam should perform about the same for all. --liw
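The three test trees could be generated with something like the following sketch. The function name, the file naming scheme, and the kind labels are made up for illustration. Some filesystems cap the number of links to one inode, so the hardlink case may need a smaller count or several link targets.

```python
import os


def make_tree(root, count, kind):
    """Create count small files under root.

    kind is "unique", "identical", or "hardlinks" (hypothetical labels).
    """
    if kind not in ("unique", "identical", "hardlinks"):
        raise ValueError("unknown kind: %r" % kind)
    os.makedirs(root)
    first = None
    for i in range(count):
        path = os.path.join(root, "file-%07d" % i)
        if kind == "hardlinks" and first is not None:
            os.link(first, path)  # another name for the same inode
            continue
        data = b"%d\n" % i if kind == "unique" else b"same\n"
        with open(path, "wb") as f:
            f.write(data)
        if first is None:
            first = path
```

Each tree would then be backed up into a fresh repository and the run times compared.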
Joey asked:
have you done anything in obnam to deal with it needing to keep the symmetric key, decrypted, in RAM?
yeah, it's tough. probably could be avoided by having gpg decrypt the passphrase and pipe it to the encrypting gpg .. but then gpg would constantly be using the public key
It might be possible to have a C extension that holds the symmetric key, locks it into RAM, and feeds it to gpg whenever necessary, via a file descriptor.
--liw
From Joey Hess:
My take on this is that, by choosing to use a tool that uses hashes, I am giving up (near-)absolute certainty for speed, or space, or whatever. So it's important that the hash type be good at collision resistance (for example, no two likely filenames should hash the same; "/etc/passwd" should only tend to collide with blobs that are very unlike a filename). It's also important that the tool be upfront about using hashes, and about what hash it uses. And if it's not designed to allow swapping the hash out when it gets broken, I will trust it less (hello git).
Ah, the replacement of hash functions is an interesting problem.
For pathnames, it's not at all important, I think, except perhaps for performance, since pathnames will be compared byte-by-byte instead of by hashes.
For file data, replacing is easy, if one is willing to back up everything from scratch. Supporting several hashes in the same backup store is a little bit more work, but not a whole lot: instead of having just one tree for mapping checksums to chunk identifiers, one would have one per checksum algorithm.
--liw
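The "one tree per checksum algorithm" idea can be pictured with ordinary dictionaries standing in for the B-trees. The class and method names below are hypothetical, and hashlib.new is used only so the algorithm can be chosen by name.

```python
import hashlib


class ChunkIndex:
    """Sketch: one checksum-to-chunk-ids mapping per hash algorithm."""

    def __init__(self):
        self.trees = {}  # algorithm name -> {digest: [chunk ids]}

    def add(self, algorithm, chunk_id, data):
        """Record a chunk under the algorithm in use when it was backed up."""
        digest = hashlib.new(algorithm, data).digest()
        tree = self.trees.setdefault(algorithm, {})
        tree.setdefault(digest, []).append(chunk_id)

    def find(self, data):
        """Look for existing chunks with this content in every tree."""
        found = []
        for algorithm, tree in self.trees.items():
            digest = hashlib.new(algorithm, data).digest()
            found.extend(tree.get(digest, []))
        return found
```

The cost of supporting several algorithms shows up in find: new data has to be hashed once per algorithm that still has a tree.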