Open bugs in Obnam

If you have a problem with Obnam, please send mail to the mailing list. Don't add it to this list. See the contact page for information about the mailing list. This wiki page is meant to help developers keep track of confirmed bugs; it is not a support channel. (This changed on 2012-07-08.)

See the bug-reporting page for hints on what to include in a bug report, if you're unsure.

See also bugs that are done, and bugs.

When "obnam verify" shows progress, it's based on number of files. It should be based on number of bytes instead, for better progress reporting. --liw

Posted Wed May 1 07:26:02 2013 Tags:

S.B. suggests that backup generations have an optional description.

  1. Named generations -- There are certain generations that are more important than others. Some are automatically created by Obnam itself, some are routinely scheduled, and some were explicitly created. For example, I always run Obnam immediately before traveling with my laptop in case it gets stolen or broken. The same goes for backups before major system upgrades. It would be nice to have something approximately analogous to the Windows "restore point" functionality, which has a description field. Sometimes they are only automatically created system checkpoints. But if the user explicitly creates a new restore point, he can add the description "before traveling to Europe" or "before upgrading OS" or whatever. Similarly, the automatic backup script could be programmed to label it as "cron backup".
Posted Sun Mar 24 17:16:31 2013 Tags:

S.B. suggests that generations could be tagged so they aren't automatically deleted.

  1. Unforgettable generations -- In scenarios similar to the above, I would also find it useful to be able to mark certain important generations as "unforgettable". That way, when I run an automatic time based forget command, I can be sure that it will preserve certain milestone generations, even if they weren't the last generation of the month or the week or the day or whatever.
Posted Sun Mar 24 17:16:31 2013 Tags:

From Lars Kruse:

Thus the situation seems to be that obnam stores all relevant chunks, but due to the symlink pointing to something outside of the repository I cannot restore

From liw: This is obviously a bug. Obnam should either give an error if the backup root is a symlink, or back it up as a symlink, not recurse into the directory and then back up that and only then ruin everything by backing up the symlink as a symlink.

Posted Sat Feb 16 15:45:29 2013

Ben Kelly reported on August 31, 2012, that he's seeing crashes due to file descriptor leaks. See list mail archive for logs and suggested patches. I have not been able to reproduce this, however. --liw

Posted Fri Feb 8 20:16:16 2013

A small performance improvement on startup may be possible by replacing len(key('',0,0)) with struct.calcsize(self.fmt):

python -m timeit -s \
  'import struct; checksum_length = 32; fmt = "!%dsQQ" % checksum_length'\
  'struct.calcsize(fmt)'
1000000 loops, best of 3: 0.201 usec per loop

vs

python -m timeit -s \
  'import struct; checksum_length = 32; fmt = "!%dsQQ" % checksum_length'\
  'len(struct.pack(fmt,"",0,0))'
1000000 loops, best of 3: 0.753 usec per loop
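
A quick check that the two expressions give the same answer, written for Python 3 (where the packed string has to be bytes). The checksum length of 32 is only the illustrative value used in the timings above.

```python
import struct

checksum_length = 32  # illustrative value, as in the timings above
fmt = "!%dsQQ" % checksum_length

# "!" selects network byte order with no alignment padding, so the size
# is simply 32 + 8 + 8 bytes.
size_from_calcsize = struct.calcsize(fmt)
size_from_pack = len(struct.pack(fmt, b"", 0, 0))

print(size_from_calcsize, size_from_pack)
```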

I can't attach files, so I can't attach the bundle I think you prefer, but here's the diff, and I'll email you the bundle as an attachment.

=== modified file 'obnamlib/checksumtree.py'
--- obnamlib/checksumtree.py        2011-07-26 13:23:55 +0000
+++ obnamlib/checksumtree.py        2013-02-08 07:09:22 +0000
@@ -33,7 +33,7 @@
                  upload_queue_size, lru_size, hooks):
         tracing.trace('new ChecksumTree name=%s' % name)
         self.fmt = '!%dsQQ' % checksum_length
-        key_bytes = len(self.key('', 0, 0))
+        key_bytes = struct.calcsize(self.fmt)
         obnamlib.RepositoryTree.__init__(self, fs, name, key_bytes, node_size, 
                                          upload_queue_size, lru_size, hooks)
         self.keep_just_one_tree = True

=== modified file 'obnamlib/chunklist.py'
--- obnamlib/chunklist.py   2011-07-26 13:23:55 +0000
+++ obnamlib/chunklist.py   2013-02-08 07:09:22 +0000
@@ -35,14 +35,15 @@

     def __init__(self, fs, node_size, upload_queue_size, lru_size, hooks):
         tracing.trace('new ChunkList')
-        self.key_bytes = len(self.key(0))
+        self.fmt = '!Q'
+        self.key_bytes = struct.calcsize(self.fmt)
         obnamlib.RepositoryTree.__init__(self, fs, 'chunklist', self.key_bytes, 
                                          node_size, upload_queue_size, 
                                          lru_size, hooks)
         self.keep_just_one_tree = True

     def key(self, chunk_id):
-        return struct.pack('!Q', chunk_id)
+        return struct.pack(self.fmt, chunk_id)

     def add(self, chunk_id, checksum):
         tracing.trace('chunk_id=%s', chunk_id)

=== modified file 'obnamlib/clientlist.py'
--- obnamlib/clientlist.py  2011-07-26 13:23:55 +0000
+++ obnamlib/clientlist.py  2013-02-08 07:09:22 +0000
@@ -49,7 +49,7 @@
         tracing.trace('new ClientList')
         self.hash_len = len(self.hashfunc(''))
         self.fmt = '!%dsQB' % self.hash_len
-        self.key_bytes = len(self.key('', 0, 0))
+        self.key_bytes = struct.calcsize(self.fmt)
         self.minkey = self.hashkey('\x00' * self.hash_len, 0, 0)
         self.maxkey = self.hashkey('\xff' * self.hash_len, obnamlib.MAX_ID, 
                                    self.SUBKEY_MAX)
Posted Fri Feb 8 07:20:19 2013

cliapp now allows overriding parts of the logging setup, and obnam should use that as soon as cliapp has been released.

Posted Sun Jan 20 20:48:12 2013

If you accidentally back up some large or sensitive files, but don't want to delete all the generations they're in, it would be handy for Obnam to be able to delete just the specific files from the generations, and leave the rest.

Posted Sat Dec 29 15:02:42 2012 Tags:

Obnam does not currently seem to notice when the sftp connection breaks. It should, and it should then abort the backup. --liw

Posted Tue Dec 4 13:27:25 2012 Tags:

Obnam should, arguably, use ctime changes to trigger backups, so that if a file's size and mtime are the same, because whatever fool program modified a file reset the mtime, obnam will still back up the changed data.

I have code for this, but it requires a repository format change, and breaks the upgrade from format 5 to 6. --liw
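
The check itself is small. The following is a minimal sketch of the idea, not the code mentioned above (which is not shown on this page). StoredMeta and has_changed are hypothetical names, and the sketch does not touch the repository format at all.

```python
from collections import namedtuple

# Hypothetical record of what was stored for a file in the previous generation.
StoredMeta = namedtuple("StoredMeta", "size mtime_ns ctime_ns")


def has_changed(stored, st):
    """Return True if the file should be backed up again.

    `st` is an os.stat_result, or anything with the same attributes.
    A program can reset mtime after modifying a file, but the kernel
    still updates ctime, so comparing ctime catches that case.
    """
    return (stored.size != st.st_size
            or stored.mtime_ns != st.st_mtime_ns
            or stored.ctime_ns != st.st_ctime_ns)
```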

Posted Sun Nov 25 17:59:51 2012 Tags:

Problem: If chunk size is reasonably large (say, a megabyte), then most files will be smaller, and the repository ends up with a large number of identical files.

Idea: collect chunks into groups, called "salsa tins".

  • salsa tin = list of chunks
  • salsa tin has an id
  • chunk id = salsa tin id + suitable number of extra bits for index into list
  • chunk id may be 64 bits total, or 64+32, or whatever seems convenient
  • no chunk gets stored alone, only in salsa tins

This lets a client put things into the repository at will, without synchronisation or locking beyond what the filesystem provides (exclusive creation of files).
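
The id scheme in the list above could be sketched like this. The 32-bit index width is an assumption (the text leaves the width open), and both function names are made up for illustration.

```python
INDEX_BITS = 32  # assumed width of the index part; the text leaves this open


def make_chunk_id(tin_id, index):
    """Combine a salsa tin id and a position in its chunk list into one id."""
    if not 0 <= index < (1 << INDEX_BITS):
        raise ValueError("index does not fit in %d bits" % INDEX_BITS)
    return (tin_id << INDEX_BITS) | index


def split_chunk_id(chunk_id):
    """Return (tin_id, index) for a chunk id built by make_chunk_id."""
    return chunk_id >> INDEX_BITS, chunk_id & ((1 << INDEX_BITS) - 1)
```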


Having multiple chunks in a single file complicates the logic for managing files in the repository, and deleting unused chunks.

Therefore, an alternative idea: instead of shoving multiple chunks into one file, allow files to use parts of chunks. Currently a file's metadata lists the chunks that have its contents. Change this to be a list of (chunk id, offset, length) triplets, where offset and length specify a part of a chunk. This way, a client can create one chunk that contains the data of many small files, and they can all just use the relevant part of the chunk. Managing removal of those files is easy: it is the current code without modification.

--liw
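
A small illustration of the (chunk id, offset, length) idea, assuming chunks can be fetched as byte strings. read_file_contents and get_chunk are hypothetical names, and a dict stands in for the repository.

```python
def read_file_contents(parts, get_chunk):
    """Reassemble a file from (chunk_id, offset, length) triplets.

    `get_chunk` is a hypothetical callable returning a chunk's bytes.
    """
    return b"".join(
        get_chunk(chunk_id)[offset:offset + length]
        for chunk_id, offset, length in parts)


# One shared chunk holding the data of three small files.
chunks = {1: b"alphabetagamma"}
files = {
    "a.txt": [(1, 0, 5)],
    "b.txt": [(1, 5, 4)],
    "c.txt": [(1, 9, 5)],
}
```

Removing one of the files only removes its triplets; the chunk itself can go once no file refers to any part of it.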

Posted Fri Nov 23 16:13:10 2012 Tags:

Obnam needs a way to remove clients from the repository. The current remove-client command just deals with encryption.

Suggested-by: Daniel Silverstone

Posted Wed Nov 7 19:43:46 2012 Tags:

Obnam needs a way to rename clients in the client list.

Suggested-by: Daniel Silverstone

Posted Wed Nov 7 19:43:46 2012 Tags:

Would it be faster to use the sftp put and get methods instead of the current open/read/write/close code for transferring files to and from the repository?

Posted Wed Nov 7 19:43:46 2012 Tags:

When larch is processing a journal (committing or deleting at startup, committing at end), obnam should be showing useful progress reporting for that.

Posted Sat Nov 3 13:19:21 2012 Tags:

I've done a fair bit of work to optimise away a whole bunch of round trips over sftp. For example, renaming a file over sftp can fail, if the target name already exists. So we try to remove it first. The pretty way of doing that would be to check if the target exists before trying to remove it. However, that would mean an extra round trip, when the target does exist.

Pretty:

1. does target exist?
2. if it exists, remove
3. rename

Ugly:

1. try to remove target, and ignore a failure due to target not
   existing
2. rename
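
In code, the ugly variant might look like the sketch below. It assumes the file system layer reports a missing target as an OSError (IOError) whose errno is ENOENT; rename_overwriting is a hypothetical name, and the os module is used as a local stand-in for an sftp client with remove() and rename() methods.

```python
import errno
import os


def rename_overwriting(fs, old, new):
    """Rename old to new without checking first whether new exists.

    `fs` is any object with remove() and rename(); passing the os module
    gives a local stand-in for an sftp client.
    """
    try:
        fs.remove(new)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise  # only "target does not exist" is expected and ignored
    fs.rename(old, new)
```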

Unfortunately, doing things the ugly way results in the log file containing a lot of tracebacks from paramiko. They're benign, but they look scary, and they fill the log with useless text.

Reported-By: Jordi Marqués

Posted Sun Oct 21 21:03:50 2012 Tags:

It would be practical if "obnam restore" could skip files based on various criteria, at least pathname.

Posted Sun Oct 21 19:46:47 2012 Tags:

Review obnam black box tests and document how you're supposed to write them.

  • hacking section in README
Posted Sat Oct 20 19:28:35 2012 Tags:
  • Review http://doriath.cz/obnam/rollingsum-split-2.patch ...faster, but I don't like it much. :-)
    • http://doriath.cz/obnam/rollingsum-split-2.patch ...but maybe a different rolling checksum function (and/or a different window size) would be better, it needs more testing.
Posted Sat Oct 20 19:28:35 2012 Tags:

obnam backup should report actual transferred bytes in addition to live data bytes, for reporting of overhead, and should calculate speed with overhead too

Posted Sat Oct 20 19:28:35 2012 Tags:

ugly error message from obnam:

00h28m12s 42015 files; 108.21 MiB (224.7 KiB/s)
/home/liw/.mozilla/firefox/td0gERROR: Cannot back up
/home/liw/.mozilla/firefox/td0gy7bo.default/urlclassifier3.sqlite-journal:
(2, 'No such file or directory',
'/home/liw/.mozilla/firefox/td0gy7bo.default/urlclassifier3.sqlite-journal')
  • should start error message on new line
  • should not print Python tuple

--liw

--

I believe the tuple printing is now fixed. --liw
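
For illustration, both points from the list above in one small sketch. report_backup_error is a hypothetical helper, not the function Obnam uses; it relies only on the standard strerror attribute of OSError.

```python
import sys


def report_backup_error(pathname, exc, out=sys.stderr):
    """Print a backup failure on its own line, without the exception tuple."""
    # The leading newline ends whatever progress line is currently shown.
    out.write("\nERROR: Cannot back up %s: %s\n" % (pathname, exc.strerror))
```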

Posted Sat Oct 20 19:28:35 2012 Tags:

Document how to check that Obnam is actually encrypting anything. --liw

Posted Sat Oct 20 19:28:35 2012 Tags:

Would be nice to have a script to gather information for reporting a bug about obnam?

  • Python version
  • versions of obnam, larch, cliapp, and any other of my projects obnam uses
  • config dump
  • is _obnam available?
  • mount point, file system types (/proc/mounts on linux)
  • linux kernel version, or other uname -a output
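
A possible starting point for such a script, covering only the items that need no knowledge of Obnam internals. gather_bug_report_info is a hypothetical name, and the assumption that obnamlib exposes a __version__ attribute is guarded with a fallback. The config dump, the _obnam check, and the larch and cliapp versions are left out.

```python
import platform
import sys


def gather_bug_report_info():
    """Collect environment details that are useful in a bug report."""
    info = {
        "python": sys.version.split()[0],
        "uname": " ".join(platform.uname()),
    }
    try:
        with open("/proc/mounts") as f:   # Linux only
            info["mounts"] = f.read()
    except OSError:
        info["mounts"] = "(not available)"
    try:
        import obnamlib                   # assumed to expose __version__
        info["obnam"] = getattr(obnamlib, "__version__", "(unknown)")
    except Exception:
        info["obnam"] = "(obnamlib not importable)"
    return info


if __name__ == "__main__":
    for key, value in sorted(gather_bug_report_info().items()):
        print("%s: %s" % (key, value))
```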
Posted Sat Oct 20 19:28:35 2012 Tags:

Make obnam fsck remove extraneous files (e.g., tmp*). --liw

Posted Sat Oct 20 19:28:35 2012 Tags:
<zeri> liw: I tried out obnam yesterday and noticed you guys
    simply call gpg -c --batch for the symmetric encryption part
<zeri> liw: also if I am not mistaken you sort of have a master
    key for the repository (64 bit hex number encrypted for all keys
    in the repository that is used as passphrase)
<zeri> liw: since you do not specify the
    s2k-algo,s2k-mode,s2k-count configuration options of gpg as
    well as the compression-algo option a gpg.conf that is usually
    considered good will slow down obnam to a couple of bytes/sec
<zeri> liw: since key derivation is of no use if a sufficiently
    big random secret is used you might want to consider specifying
    --s2k-mode 1 to disable most of the key strengthening in gpg and
    simply hash the password once with a salt ... speeding up the
    encryption of every block at least by one order of magnitude
<zeri> (default behaviour is to compute a hash chain of at least
    1024 length up to 65011712 which was in my gpg.conf)
<zeri> also I didn't check whether you do compression in obnam
    but gpg can do that for you as well but it was turned off in my
    gpg.conf (I used gpg primarily for large tar balls where the one
    time overhead doesn't matter)
<zeri> liw: then again I might be totally wrong and overlooked
    some switch to do all that without changing the code :)
<zeri> oh and I didn't emphasise this yet ... this hash chain
    I talked about before is computed for every "chunk" which were
    between a couple of bytes and 16k in my test ... the effort to
    compute this hash chain (to obtain encryption/authentication keys)
    exceeds the effort to encrypt 16k with any block cipher by far
    I suppose
<zeri> and its sole purpose is to prevent weak passwords from
    being guessed in short time (since the computational effort to
    test a password is equivalent to computing this hash chain thus
    slowing brute force down by the factor of the length of the chain)
<zeri> "64 bit hex number encrypted" << that should have been 64
    digits :)
Posted Sat Oct 20 19:28:35 2012 Tags:

The following is ugly:

  File "/usr/lib/python2.7/dist-packages/paramiko/sftp_client.py", line 667, in _read_response
    raise SSHException('Server connection dropped: %s' % (str(e),))
SSHException: Server connection dropped: 
Exception OSError: OSError(32, 'Broken pipe') in <bound method SFTPFile.__del__ of <paramiko.SFTPFile object at 0x18861cd0>> ignored

There should not be a traceback, but a proper error message.

Posted Sat Oct 20 19:28:35 2012 Tags:

Write Obnam plugin to exclude files based on size.

Posted Sat Oct 20 19:28:35 2012 Tags:

Exclude based on mime type of file.

Posted Sat Oct 20 19:28:35 2012 Tags:

Add an Obnam cmdtest to verify v6 repos can be restored from.

Posted Sat Oct 20 19:28:35 2012

Should obnam split VFS for repo access and live data access? repo needs so much less, after all. This would perhaps also make it easier to add support for things such as S3. -liw

Posted Sat Oct 20 19:28:35 2012 Tags:

make "obnam backup --pretend" print a list of would-be-backed-up files to stdout instead of tty

Posted Sat Oct 20 19:28:35 2012 Tags:

obnam should allow specifying the ssh key for sftp to use

Posted Sat Oct 20 19:28:35 2012 Tags:

Instead of in-place conversions, which are error prone and clunky, a better way would be nice. Maybe some kind of dump/undump pair, using a streamable format?

Posted Sat Oct 20 19:28:35 2012 Tags:

The progress reporting for "obnam forget" seems to be broken. When forgetting more than two generations, the progress display is stuck at

forgetting generations: 2/64 done

until the very end. It's updated to 64/64 right before finishing.

-- weinzwang

Posted Thu Sep 6 09:10:11 2012 Tags:

From mail:

I think there is problem in keep policy, see attached file, --keep=Nh always do same as --keep=Nd

Seems to be true. Need cmdtest test case for reproduction.

Posted Mon Aug 13 21:21:49 2012

I decided to try running fsck on a remote repo from a user account that doesn't have the GPG key used to encrypt with. I got this error, which isn't helpful. I think, at least, it should mention the possibility that there is a missing key.

ERROR: Unknown filter tag encountered: '\x8c\r\x04\x03\x03\x02n\x7fX8\xa5\xd7\xf1\xbd`\xc9[\xba\xc0\x11kU\xd0\x8f\xf3\r\x97J@\x0c\xc9\x11\xaan\xefxx\x89\xe9\x9b\xac\x96\xce \x1cD{:\x02\x10I\xda\xbb;\x0e\xac\xd3d9\x18(\x0c\xe4\x8a\xf89\xed\x9a!\x85\xd0t\xd0\xaa^0\xee\x17{\xe3\x9d\xa3\x89O\xc6\x04$c\x96\xee\xf3\xbb{.|\xd5\xb0\x96Z\xb3\x11\x0f&B\xf0'

Someone else also reported this:

Minor improvement: using Obnam 1.0 on Debian Unstable, the error message when using an encrypted repository without specifying any key is not helpful. Please consider using a human-readable error message to warn the user that the key is missing (or wrong).

Here is an example of error message when using "obnam fsck" without the key:

Traceback (most recent call last):
[lots of debug info]
File "/usr/lib/python2.7/dist-packages/obnamlib/hooks.py", line 122, in run_filter_read
  tag, content = data.split("\0", 1)
ValueError: need more than 1 value to unpack
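
One way the filter code could fail more helpfully, as a sketch only. FilterTagError and split_filter_tag are hypothetical names, not part of obnamlib, and the set of known tags is passed in because this page does not say which tags exist.

```python
class FilterTagError(Exception):
    """Hypothetical error type; not part of obnamlib."""


HINT = ("Repository data has no recognisable filter tag. If the repository "
        "is encrypted, the encryption key may be missing or wrong.")


def split_filter_tag(data, known_tags):
    """Split data of the form: tag, NUL byte, content."""
    tag, sep, content = data.partition(b"\0")
    # Still-encrypted data either has no NUL at all, or has random bytes
    # in front of the first NUL; both cases get the same readable hint.
    if not sep or tag not in known_tags:
        raise FilterTagError(HINT)
    return tag, content
```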
Posted Thu Jun 21 06:37:49 2012 Tags:

Obnam version 1.0

Host: Gentoo amd64

When excluding a file that already exists on the backup, the file is not removed from the last backup, but it is kept, without being updated.

The file excluded should not appear on generations created with the exclude filter.

How to reproduce:

#############################################
#!/bin/sh

rm -rf /tmp/test
mkdir -p /tmp/test/backup
touch /tmp/test/backup/file1
touch /tmp/test/backup/file2

# 1st run, exclude file1
# only file2 is present on backup
obnam -r /tmp/test/repo --exclude=/tmp/test/backup/file1 backup /tmp/test/backup
obnam -r /tmp/test/repo ls

# 2nd run
# none excluded, both files present on backup
obnam -r /tmp/test/repo backup /tmp/test/backup
obnam -r /tmp/test/repo ls

# 3rd run, modify files and backup with exclude=file1
# both files present, file2 is updated on backup, file1 is neither updated, or removed
echo "000" > /tmp/test/backup/file1
echo "000" > /tmp/test/backup/file2
obnam -r /tmp/test/repo --exclude=/tmp/test/backup/file1 backup /tmp/test/backup
obnam -r /tmp/test/repo ls

cd /
rm -rf /tmp/test/backup
obnam -r /tmp/test/repo restore /tmp/test/backup
ls -lha /tmp/test/backup
#############################################

OUTPUT:

+ rm -rf /tmp/test
+ mkdir -p /tmp/test/backup
+ touch /tmp/test/backup/file1
+ touch /tmp/test/backup/file2

# 1st run, exclude file1
# only file2 is present on backup
+ obnam -r /tmp/test/repo --exclude=/tmp/test/backup/file1 backup /tmp/test/backup
Backed up 2 files, uploaded 0.0 B in 0s at 0.0 B/s average speed
+ obnam -r /tmp/test/repo ls
Generation 2 (2012-06-20 09:23:41 - 2012-06-20 09:23:41)
drwxr-xr-x    24 root     root           4096 2012-06-20 05:27:42 /
drwxrwxrwx    19 root     root            480 2012-06-20 07:23:40 /tmp
drwxrwx---     4 jordi    jordi            80 2012-06-20 07:23:41 /tmp/test
drwxrwx---     2 jordi    jordi            80 2012-06-20 07:23:40 /tmp/test/backup
-rw-rw----     1 jordi    jordi             0 2012-06-20 07:23:40 /tmp/test/backup/file2

# 2nd run
# none excluded, both files present on backup
+ obnam -r /tmp/test/repo backup /tmp/test/backup
Backed up 3 files, uploaded 0.0 B in 0s at 0.0 B/s average speed
+ obnam -r /tmp/test/repo ls
Generation 5 (2012-06-20 09:23:41 - 2012-06-20 09:23:41)
drwxr-xr-x    24 root     root           4096 2012-06-20 05:27:42 /
drwxrwxrwx    19 root     root            480 2012-06-20 07:23:40 /tmp
drwxrwx---     4 jordi    jordi            80 2012-06-20 07:23:41 /tmp/test
drwxrwx---     2 jordi    jordi            80 2012-06-20 07:23:40 /tmp/test/backup
-rw-rw----     1 jordi    jordi             0 2012-06-20 07:23:40 /tmp/test/backup/file1
-rw-rw----     1 jordi    jordi             0 2012-06-20 07:23:40 /tmp/test/backup/file2

# 3rd run, modify files and backup with exclude=file1
# both files present, file2 is updated on backup, file1 is neither updated, or removed
+ obnam -r /tmp/test/repo --exclude=/tmp/test/backup/file1 backup /tmp/test/backup
Backed up 2 files, uploaded 4.0 B in 0s at 13.8 B/s average speed
+ obnam -r /tmp/test/repo ls
Generation 8 (2012-06-20 09:23:42 - 2012-06-20 09:23:42)
drwxr-xr-x    24 root     root           4096 2012-06-20 05:27:42 /
drwxrwxrwx    19 root     root            480 2012-06-20 07:23:40 /tmp
drwxrwx---     4 jordi    jordi            80 2012-06-20 07:23:41 /tmp/test
drwxrwx---     2 jordi    jordi            80 2012-06-20 07:23:40 /tmp/test/backup
-rw-rw----     1 jordi    jordi             0 2012-06-20 07:23:40 /tmp/test/backup/file1
-rw-rw----     1 jordi    jordi             4 2012-06-20 07:23:41 /tmp/test/backup/file2

# when restoring, they appear, the new file2, and the old file1.
+ rm -rf /tmp/test/backup
+ cd /
+ obnam -r /tmp/test/repo restore /tmp/test/backup
--h--m--s 4 files 4 B (100 %) 1.4 KiB/s /tmp/test/backup
+ ls -lha /tmp/test/backup
total 4,0K
drwxrwx--- 2 jordi jordi 80 jun 20 09:23 .
drwxrwx--- 4 jordi jordi 80 jun 20 09:23 ..
-rw-rw---- 1 jordi jordi  0 jun 20 09:23 file1
-rw-rw---- 1 jordi jordi  4 jun 20 09:23 file2
Posted Wed Jun 20 10:34:10 2012

The idea is to change or extend the --exclude-caches feature so that one can configure which filename Obnam looks for to decide that a directory should be skipped.


Changing --exclude-caches seems wrong to me: it has a specific purpose (to implement the cache directory tagging spec, http://www.bford.info/cachedir/spec.html).

Adding a new option to ignore directories that contain a specific file (or directory) would be fine.

--liw
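
A sketch of what such an option would do while scanning live data. The function name and the marker name used in the example are hypothetical; no such Obnam option is implied.

```python
import os


def walk_skipping_marked(root, marker):
    """Yield file paths under root, skipping any directory containing marker.

    The marker may be a file or a directory inside the skipped directory.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        if marker in filenames or marker in dirnames:
            dirnames[:] = []   # prune: do not descend any further
            continue
        for name in filenames:
            yield os.path.join(dirpath, name)
```

For example, list(walk_skipping_marked("/home/liw", ".nobackup")) would leave out every directory that contains an entry named .nobackup, including its subdirectories.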

Posted Mon Jun 18 20:26:06 2012 Tags:

It would be good for Obnam to do the whole-file checksum with a different checksum algorithm, or by using a suitable salt, to catch problems with single-chunk files, e.g., when there is a hash collision. --liw

Posted Fri Jun 15 23:07:02 2012 Tags:

First of all, I realise that Obnam stores full paths because it is necessary for saving every file in the system, even when belonging to different users.

However, for cases where only a backup of one directory is needed, this could be made more flexible by letting the backup store only the path that is given on the command line, in the spirit of rsync.

When does this show up? For example, when migrating from another backup system, the easiest way would be to dump all the generations from the older system, one by one, to a temporary location, and run Obnam on each one to replicate the same history. However, since Obnam stores full paths, the path to the temporary directory used for the migration is also stored. This can happen on production servers, where restoring each old generation into the directory the data originally lived in is not possible.

Rickard Nilsson suggested on the mailing list to have a "root" option that could be used for stripping the first part of the path. The stripped part would be the one not mentioned in the command line. That way, the backup will have a path computed like this: root + given_path.

~$ obnam backup --repository=/media/backups/... --root=/ mydata

...would be stored into /mydata instead of in /home/${USER}/mydata

Thank you for taking this into consideration!


If this gets implemented, I suggest the following:

  • The Repository class will provide a hook for mangling the pathnames.
  • The hook will get the pathname as it exists in live data and will return the pathname to store in the backup.
  • The hook will be called at every point where live data pathnames are used by Repository.
  • Someone writes a plugin that adds the suitable functionality.

--liw
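
The mapping such a hook would perform could be sketched as follows. mangle_pathname and its parameters are hypothetical; the sketch only shows the "root + given_path" computation, not how it would be wired into Repository.

```python
import os


def mangle_pathname(live_path, strip_prefix, root="/"):
    """Map a live-data pathname to the pathname stored in the backup.

    The part of live_path below strip_prefix is re-rooted under root.
    """
    relative = os.path.relpath(live_path, strip_prefix)
    if relative == os.pardir or relative.startswith(os.pardir + os.sep):
        raise ValueError("%s is not below %s" % (live_path, strip_prefix))
    return os.path.normpath(os.path.join(root, relative))
```

With strip_prefix set to the user's home directory and root set to /, a file in ~/mydata would be stored under /mydata, as in the --root example above.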


For the fun of it I added a mangle_filename() method to Repository. What I quickly learned: if you back up /root/bar, the process backs up "/", "/root", and "/root/bar". In reality you only want "bar" in the backup.

So the mangling hook would have to be allowed to drop paths entirely, but this feels very crude.

I propose not to change Repository, and to think of Repository as just getting virtual paths from its callers. Instead, the functions calling into Repository should be changed; in this case, the backup command. I have stopped here.

-- Elrond

Posted Fri Jun 15 09:28:45 2012 Tags:

Obnam needs to be able to rename a client.

Posted Thu Jun 14 20:55:22 2012 Tags:

It would be cool if Obnam could warn about common client names, such as localhost or localhost.localdomain.

Posted Thu Jun 14 20:55:22 2012 Tags:

From Enrico: It might be good to have a way for Obnam to automatically exclude certain kinds of common stuff, such as web browser caches, Liferea caches, etc. This should be easy to enable, and should be off by default (safe defaults are important).

Posted Tue Jun 5 09:17:32 2012 Tags:

It would be nice if "obnam forget" could have more detailed progress reporting than at the granularity of one generation. --liw

Posted Tue Jun 5 09:17:32 2012 Tags:

As a system administrator I'd like to see data on how the memory and cpu usage of different obnam operations depend on the number of files and their sizes before adopting obnam for my own systems.

I imagine that in the simplest case you could use something like

for total_size in 50M 1G 50G 400G 1T 2T 3T; do
   for file_size in 1k 2k 4k 8k 16k 32k 1M 64M; do
       generate_filesystem $total_size $file_size original/
       /usr/bin/time -f "%e real %P cpu %M KB maxrss" backup original/ backup/
   done
done

to get cpu and memory usage of backup operations. Repeat same for obnam fsck if you are afraid that its memory usage could depend on total_size or file_size.

Posted Tue Jun 5 09:09:57 2012 Tags:

The --seivot-branch option should not be required for ./run-benchmark. --liw

Posted Mon Jun 4 19:40:29 2012 Tags:

One of the common (?) use cases of obnam is to back up laptops and workstations. Sometimes we grab ISOs etc and have them lying around, but they shouldn't be backed up.

It'd be handy if obnam had a couple of extra options. One to warn about large files, the other to skip large files. They are not mutually exclusive.

This would have saved me backing up a 700MB ISO. Ouch.


Great idea. --AP


I second that great idea :) leto

Posted Fri Jun 1 05:09:58 2012 Tags:

Currently, obnam fsck reports chunks that are unused:

chunk 16541095925909528379 not used by anyone

but doesn't do anything about it. There should be an option to remove those unused chunks from the repository. --weinzwang

Posted Sat May 26 22:12:59 2012 Tags:

It'd be handy if "obnam restore" could exclude parts of the data, based on similar patterns as "obnam backup" does. --liw

Posted Sat May 5 15:29:31 2012 Tags:

It would be good to have obnam provide an option to enable/disable paramiko logging. It is often voluminous and might not be helpful when reading logs for debugging. --liw

Posted Sat Apr 28 16:42:05 2012 Tags:
Posted Sun Apr 22 14:41:03 2012 Tags:

Obnam should, at least optionally, use fsync or other methods to ensure that everything gets committed to disk by the kernel by the end of a backup run. --liw

I want this to not have a huge performance impact, though. Learning from the lessons of dpkg, sqlite/liferea/firefox, etc, and using fsync/fdatasync and sync_file_range in the right ways is going to be necessary. --liw
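
For reference, the simplest (and most expensive) variant looks like the sketch below: one fsync per file plus one on the containing directory. write_durably is a hypothetical name. Doing this for every file is exactly the pattern that can hurt performance, and batching or sync_file_range is not covered here.

```python
import os


def write_durably(pathname, data):
    """Write data and ask the kernel to commit the file and its directory entry."""
    with open(pathname, "wb") as f:
        f.write(data)
        f.flush()               # move Python's buffer into the kernel
        os.fsync(f.fileno())    # ask the kernel to commit the file contents
    # Syncing the containing directory commits the new directory entry (POSIX).
    dir_fd = os.open(os.path.dirname(os.path.abspath(pathname)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```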

Posted Sun Apr 22 10:44:42 2012 Tags:

It would be good to have an architecture diagram of the internals of obnam, to make life easier for new code contributors. --liw

Posted Sun Apr 22 09:08:35 2012 Tags:

Obnam should do some processing in the background, for example uploading of data to the backup repository. This would allow better use of the bottleneck resource (network). Below is a journal entry with my thoughts on how to implement that. It may be out of date by now, but we'll see. I have a Python module to simplify the use of multiprocessing to do jobs in the background (which avoids the Python global interpreter lock, in case that matters). --liw


Here's a design for Obnam concurrency that came to me the other day while walking.

The core of Obnam (and larch) is quite synchronous: read data from file, read B-tree nodes, push chunks and B-tree nodes into repository. Some of that can be parallelized, but not easily: it's already tricky code, and making it even more tricky is going to require very strong justification.

Things like encrypting and decrypting files need to be done in parallel with other things, for speed. These things are not really in the core, and indeed are provided by plugins.

So here's a way to do them in parallel:

  • the core code stays synchronous, the way it is now
  • whenever larch code needs to read a B-tree node, it blocks until it gets it
  • the node is read, synchronously, from wherever, and put into a background processing queue (using Python multiprocessing)
  • the code that waits for the node to be processed polls the queue, and handles any other background jobs that happen to finish while it waits, and returns the desired node when it gets it
  • when larch writes a node (after it gets pushed out of the upload queue inside larch), it is put into a background processing queue
  • at the same time, if there were any finished background jobs, they're handled (written to repo)
  • at the end of the run, the main loop makes sure any pending background jobs finish and are handled

There's a complication that the B-tree code may need a node that is not yet written to the repository, since it is still going through a background processing queue.

I'm going to need to restructure how hooks process files that are written to or read from the repository. Writing should happen asynchronously: files are put in a queue and processed in the background, and then written to the actual repository when background processing is finished. Reading needs to happen synchronously, since there's a B-tree call waiting for them, but to handle the case of needing a node that is still being processed in the background, we need to keep track of what nodes are in the background, and wait for them to be done before reading them.

Reading would thus be something like this, implemented in the Repository class:

while wanted file is in write queue:
    process a write queue result

read file from repository
process file data through hooks
return file

The write queue is more complicated (again handled somehow in the Repository class):

  • a multiprocessing.Queue instance for holding pending jobs
    • a job is a (pathname, file contents) pair
  • another Queue instance for holding unhandled results
    • (pathname, file contents) pair, where the contents may have changed
  • a set for holding file identifiers (paths) that have been put into the pending jobs queue, but not yet processed from the results queue

Each plugin can provide one or more Unix commands (filters) through which the file contents gets piped. The background processes run each filter in turn, giving the output of the previous one as input to the next one.

To handle a result from a background job, the following needs to be done:

  • remove the pathname from the set
  • write the filtered file contents into the repository

To implement this, I'll do this:

  • All changes should be in HookedFS
  • write_file and overwrite_file put things into the pending jobs queue, and also call a new method handle_background_results
  • cat gets changed to wait for files in the write queue, calling handle_background_results
  • handle_background_results will do what is needed

This design isn't optimal, since writing things to the repository isn't being done in parallel with other things, but I'll tackle that problem later.
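
The bookkeeping in the design above (pending jobs, unhandled results, and the set of in-flight pathnames) can be tried out with a toy model. This is a sketch under loud assumptions: a thread and queue.Queue replace multiprocessing, a Python function replaces the Unix filter commands, a dict replaces the repository, and the class name BackgroundWriter is made up. Only the method names write_file, cat and handle_background_results come from the text.

```python
import queue
import threading
import zlib


class BackgroundWriter:
    """Toy version of the write queue; a dict stands in for the repository."""

    def __init__(self, repo, filter_func, unfilter_func):
        self.repo = repo
        self.filter_func = filter_func      # stands in for the plugin filters
        self.unfilter_func = unfilter_func  # inverse, applied when reading
        self.jobs = queue.Queue()           # pending (pathname, contents) jobs
        self.results = queue.Queue()        # filtered results, not yet written
        self.in_flight = set()              # queued, but not yet in the repo
        threading.Thread(target=self._work, daemon=True).start()

    def _work(self):
        while True:
            pathname, contents = self.jobs.get()
            self.results.put((pathname, self.filter_func(contents)))

    def write_file(self, pathname, contents):
        self.in_flight.add(pathname)
        self.jobs.put((pathname, contents))
        self.handle_background_results(block=False)

    def handle_background_results(self, block):
        """Write finished jobs to the repo; if block, wait for at least one."""
        while True:
            try:
                pathname, filtered = self.results.get(block=block)
            except queue.Empty:
                return
            self.repo[pathname] = filtered
            self.in_flight.discard(pathname)
            block = False   # after the first result, only drain what is ready

    def cat(self, pathname):
        while pathname in self.in_flight:
            self.handle_background_results(block=True)
        return self.unfilter_func(self.repo[pathname])

    def finish(self):
        """Called at the end of the run: wait for all pending jobs."""
        while self.in_flight:
            self.handle_background_results(block=True)


repo = {}
writer = BackgroundWriter(repo, zlib.compress, zlib.decompress)
writer.write_file("chunks/1", b"hello")
print(writer.cat("chunks/1"))
writer.finish()
```

As in the design, the actual write to the repository still happens in the main thread; only the filtering runs in the background.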

Posted Sun Apr 22 09:03:12 2012 Tags:

It would be nice if Obnam could tell you how much you've saved thanks to the de-duplication. --liw

From discussion on the mailing list (archive):

  • how much space will be freed if I remove this generation?
  • or a set of generations?
  • how much space has been saved by de-duplicated chunks in the whole repo?

Also, how much space has been saved with compression (alone or with encryption)?

Also, store per-generation data in the generation for faster retrieval.

  • Bytes added by this generation.
  • Number of added/changed/removed files in this generation.

--liw
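
The first three questions reduce to counting chunk references. A sketch, assuming each generation can be described as a list of chunk ids and that chunk sizes are known; the data layout and both function names are invented for illustration and say nothing about how the repository stores this.

```python
from collections import Counter


def dedup_savings(generations, chunk_sizes):
    """Bytes saved by sharing chunks across the whole repository.

    generations: {generation id: list of chunk ids}
    chunk_sizes: {chunk id: size in bytes}
    """
    refs = Counter(c for chunks in generations.values() for c in chunks)
    stored = sum(chunk_sizes[c] for c in refs)
    referenced = sum(chunk_sizes[c] * n for c, n in refs.items())
    return referenced - stored


def freed_by_removing(gen_ids, generations, chunk_sizes):
    """Bytes freed if the given generations were removed."""
    doomed = set(gen_ids)
    kept = {c for g, chunks in generations.items() if g not in doomed
            for c in chunks}
    gone = {c for g in doomed for c in generations[g]} - kept
    return sum(chunk_sizes[c] for c in gone)
```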

Posted Fri Apr 6 16:45:16 2012 Tags:

obnam force-lock currently doesn't work. As a workaround, remove the lockfiles (all files named lock inside the repository) by hand.

find [repository path] -name lock -exec rm '{}' \;

--weinzwang


I confirm that I see this too. This bug exists because I changed how Obnam uses locks: it now locks each directory properly, instead of just the per-client directory. However, I haven't fixed "force-lock" to deal with other locks, so now it's not possible to force the locks for other directories than the per-client one. This is awkward.

To fix this, Obnam needs to know that it can safely remove the locks. There are two cases:

  • the lock was created by some other client; in this case, the user (not Obnam automatically) needs to decide if it is safe to remove the lock: just running "obnam force-lock" should not do that, instead the user should provide an option like "--really-force-locks" or something
  • the lock was created by the same client, i.e., Obnam running on the same host; in this case, if the Obnam process no longer exists, the lock can be safely removed, otherwise the locks should not be removed (again, unless "--really-force-locks" is used)

To implement this, we need Obnam to store the hostname and process id of the Obnam instance that created the lock, preferably in a way that does not leak sensitive information easily (don't store the client name in cleartext, but the md5sum of it, or something).
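A minimal sketch of that idea in Python follows. The lock format (a dict with "client", "hostname" and "pid" keys) and all function names are hypothetical, and SHA-256 is used here where the text suggests "md5sum, or something". The really_force argument stands in for the proposed "--really-force-locks" option.

```python
import hashlib
import os
import socket


def hash_client(client_name):
    # Hash the client name rather than storing it in cleartext.
    return hashlib.sha256(client_name.encode("utf-8")).hexdigest()


def make_lock_info(client_name):
    """Return the data to record in a lock file (hypothetical format)."""
    return {
        "client": hash_client(client_name),
        "hostname": socket.gethostname(),
        "pid": os.getpid(),
    }


def process_exists(pid):
    """Check whether a process with this id exists on the local host."""
    try:
        os.kill(pid, 0)  # signal 0 only performs error checking
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def may_break_lock(info, client_name, really_force=False):
    """Decide whether the lock described by `info` can be removed."""
    if really_force:
        return True
    same_client = info["client"] == hash_client(client_name)
    same_host = info["hostname"] == socket.gethostname()
    if not (same_client and same_host):
        return False  # created elsewhere: the user has to decide
    return not process_exists(info["pid"])
```

Note that a process id can be reused by an unrelated process, so a "process exists" answer is conservative rather than exact.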

--liw


As of 0.27, force-lock unconditionally breaks locks, but the lock files will contain sufficient information to allow us to be more intelligent about the breaking of locks in the future.

--kinnison

--

This is not good enough -- I'd like obnam to be able to break locks more kindly -- but it's good enough for 1.0, I think, so removing the blocker tag. --liw


Making the lock breaking more benign and intelligent is a wishlist. Adding tag. --liw

Posted Tue Apr 3 07:57:46 2012 Tags:

Could Obnam keep a local cache of remote repository metadata, like Duplicity does? I was looking through an Obnam log and noticed how many remote file accesses it does for each file it backs up. If it cached that data locally, it could do it much faster. The last-modified time of the repository could be stored in the remote repo, and if it matches the local cache, a lot of time could be saved.


That is indeed a good idea. Unfortunately, the correctness of caches is often tricky, so I've been putting off implementing this until more important things work first. Also, not caching metadata forces me to do other things to make Obnam fast. But I'd like to do the caching too, some day. --liw

Posted Thu Mar 8 22:44:24 2012 Tags:

Restore should be able to continue a restore if it gets interrupted. Restore should, perhaps optionally, skip files that already exist, and continue partial files. --liw

Posted Sun Mar 4 13:33:23 2012 Tags:

The "obnam clients" command should work even for non-clients (modulo encryption), so that you can get a list of the client names without having to guess one. --liw

Posted Sun Mar 4 13:32:17 2012 Tags:

Currently, there seems to be no easy way to forget all (or all but the newest) checkpoint generations. Something like

obnam --keep 1c forget

would be nice.

-- weinzwang

Posted Tue Feb 28 14:35:03 2012 Tags:

Would Obnam benefit from being able to use other compression tools, like bzip2, p7zip, and xz?


It would! However, implementing this is slightly tricky until I make Obnam able to run external tools in the background, rather than waiting for them to finish. The compression it uses right now is done with the Python standard library gzip module, so it doesn't require executing an external program, and the issue doesn't arise. However, we need the backgrounding feature anyway, since encryption uses an external tool, and so there's a big performance impact from using encryption. After backgrounding works, adding arbitrary compression filters will be easy. --liw

Posted Tue Feb 28 07:16:35 2012 Tags:

When --one-file-system is used, it would be nice to not cross bind-mounts. No idea how to figure that out, but it must be possible. --liw

You could look at the inode numbers for . and ./foodir/.. and check they're the same? -- kinnison

The inode check will not work if foodir is a symlink. --mathstuf

Posted Sun Jan 15 18:39:03 2012 Tags:

Supporting multiple compression methods would be nice.

Also one idea is to allow using one compression method (or no compression) while actually doing the backup, and then have a separate tool, to be run on otherwise idle time, recompress the data using some other, slower algorithm to conserve disk space without making the backup process slower.

-- SLi

Posted Mon Jan 2 14:17:15 2012 Tags:

Obnam should, optionally, ask for a gpg passphrase, for the key specified with --encrypt-with, so that a user without a gpg agent will be able to do encrypted backups. Obnam should read the passphrase if its ask-passphrase setting is true, and it has access to a terminal. It should not have a setting for the passphrase itself, just for reading it from a terminal (just so that people who don't know better don't put their passphrase in a config file or similar).

Those running obnam from cron will need to have a passphraseless key, since there's no way to give obnam a passphrase in that case, without storing it in the crontab or a config file, and then it's no better than not having a passphrase.

See Debian bug #649769.

--liw

From my understanding, having a symmetric passphrase stored in a config file is not useless at all. My purpose in encrypting the backup data is to prevent the remote server from having my data in plain-view; or if I back it up to an external drive, I wouldn't want it to be accessible to anyone who picks it up. But if someone gains access to my config file, he'll have direct access to all of my data anyway--he wouldn't need to access my backups.

If I use a passphrase, then if my house burns down and I lose everything, I can get a new computer and download my data and decrypt it with my passphrase--which is long enough to be unfeasible to crack, yet completely memorized by me.

If I use a key, then if my house burns down and I don't have a working copy of my key outside my house, my backups are totally useless, and I really HAVE lost everything. (Sure, I should take precautions to keep from losing my key--but things happen.)

--Adam

It's possible to get obnam to request a passphrase when running from cron:

  1. Ensure 'use-agent' is enabled in ~/.gnupg/gpg.conf.
  2. Ensure the gpg-agent is running, and GPG_AGENT_INFO is set in your regular environment. Note that if obnam already asks for an encryption passphrase when run normally, then 1 & 2 are already correctly set.
  3. Ensure the environment obnam is called from in cron is exporting GPG_AGENT_INFO correctly. This means you must set and export the GPG_AGENT_INFO environment variable in your cron script. gpg writes this information to ~/.gnupg/gpg-agent-info-$(hostname), so in your cron script you must have:

    source "$HOME/.gnupg/gpg-agent-info-$(hostname)" && export GPG_AGENT_INFO

Then call obnam as normal.

This will only work on a desktop system where there is someone to notice that a pinentry window has popped up. However it looks like there may be a way to forward the gpg-agent socket over ssh, and thus run obnam with encryption from cron on a headless remote machine (See here). You'd probably have to store the private key on the remote machine though.. so not sure how useful that would be.

--Scott

Posted Sun Jan 1 15:33:57 2012 Tags:

obnam should accept TIME values as well as SIZE values for the --checkpoint option. For instance:

obnam backup $HOME --checkpoint=5min

This would be very handy for connections which vary a lot in speed and quality.

-- weinzwang

Posted Mon Dec 12 14:12:30 2011 Tags:

It would be good to have a FUSE filesystem for restoring data. --liw


I didn't get very far today, but here's a draft: http://p.sipsolutions.net/d9720a80c9cc5e99.txt, there's still a lot of XXX in there but maybe it helps somebody get started. I probably won't have much time to work on it.

You start it with "obnam fuse /mountpoint", or you can do "obnam fuse -- -h" to get fuse help. Need the "--" so obnam doesn't parse the "-h" itself ...

--Johannes

Posted Sat Dec 10 21:54:36 2011 Tags:

It would be nice for Obnam to have a tool to answer the question "how much space will be freed if I remove these generations?"

  • need to find list of chunks that are used only by the specified gens
  • perhaps also count B-tree reduction? count nodes that are unshared by the relevant trees, or only shared by the trees to be deleted
  • however, the B-trees are going to be a fraction (a few percent) of the size of the chunk data, so they're not really worth it
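The chunk part of the calculation can be sketched with plain sets. This is only an illustration: the function name and the two dicts (generation id to chunk ids, chunk id to size) are assumed inputs, not Obnam's actual data structures, and B-tree nodes are ignored as suggested above.

```python
def freed_bytes(gen_chunks, chunk_sizes, doomed):
    """Return the bytes used only by the generations in `doomed`.

    gen_chunks:  dict mapping generation id -> set of chunk ids
    chunk_sizes: dict mapping chunk id -> size in bytes
    doomed:      iterable of generation ids to be removed
    """
    doomed = set(doomed)

    # Chunks still referenced by a generation that is kept.
    kept = set()
    for gen, chunks in gen_chunks.items():
        if gen not in doomed:
            kept |= chunks

    # Chunks referenced by the doomed generations and by nothing else.
    removable = set()
    for gen in doomed:
        removable |= gen_chunks[gen]
    removable -= kept

    return sum(chunk_sizes[chunk] for chunk in removable)
```

A chunk shared between two doomed generations is counted once, and a chunk shared with any kept generation is not counted at all.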

--liw

Posted Sat Dec 10 21:54:36 2011 Tags:

This is an idea for optimizing Obnam.

Store the MD5 of a string containing the names and relevant metadata of all files in a directory, then re-compute it when backing up that directory: if the checksums are the same, then no file in the directory has changed, and there's no need to check each of them separately, saving many tree lookups.

Consider only non-dirs, since subsubdir can change without it being visible at grandparent level. Thus, recursing is always necessary.
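Here is a small sketch of such a checksum. Which metadata fields count as "relevant" is an assumption of this example (mode, size, mtime, uid, gid), as is the function name.

```python
import hashlib
import os
import stat


def directory_checksum(dirname):
    """MD5 over names and selected metadata of the non-directory entries."""
    md5 = hashlib.md5()
    for name in sorted(os.listdir(dirname)):
        st = os.lstat(os.path.join(dirname, name))
        if stat.S_ISDIR(st.st_mode):
            continue  # subdirectories are always recursed into separately
        # The choice of fields here is an assumption for illustration.
        fields = (name, st.st_mode, st.st_size, st.st_mtime_ns,
                  st.st_uid, st.st_gid)
        md5.update(repr(fields).encode("utf-8"))
    return md5.hexdigest()
```

If the stored value for a directory equals the freshly computed one, the per-file lookups for that directory could be skipped.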

--liw

Posted Sat Dec 10 21:54:36 2011 Tags:

Obnam should support multiple repositories, to be chosen at invocation time.

  • all repositories configured in config files
  • nicknames for repositories so it's easy to choose
  • --repository should accept nicknames
  • choose many repositories for one run
  • use all available repositories by default

--liw


Could this also be achieved by running Obnam from a wrapper script that uses a different repository for each run? Could Obnam be run in parallel instances backing up the same data to different repos? Is that possible now? --Adam


Adam, it can certainly be done by using wrapper scripts (I've been doing that), and while I haven't actually tried it, there should be no problem with backing up to multiple repositories concurrently, though you may need to fiddle with the configs so that they use different log files. --liw


After some thinking, I think I don't want nicknames for repositories, I want "profiles". Here's a concrete suggestion:

[config]
encrypt-with = CAFEF00D
profile = all
log = /var/log/obnam/obnam-%(profile)s.log

[profile "online"]
repository = sftp://liw@personal.backup.server.example.com/~/repo/
use-if = ping -c1 personal.backup.server.example.com

[profile "usb-drive"]
repository = /media/Drum/obnam-repo/
use-if = test -d /media/Drum/obnam-repo/

[profile "at-work"]
repository = /mnt/backups/
use-if = ping -c1 fs.work.example.com
pre-command = sudo mount /mnt/backups
post-command = sudo umount /mnt/backups

  • if --profile=all, then iterate automatically over all profiles
  • otherwise, use only the chosen profiles
  • some day: run some/all profiles in parallel in one obnam instance; initially, user may run parallel obnam instances
  • log file should embed profile name somehow
  • should profile be selected based on user too? that can be done with "use-if = test $USER = liw"; better support can be added later, if there's a need
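To make the selection logic concrete, here is a hedged Python sketch that reads a config in the suggested format and returns the profiles whose use-if command succeeds. The function name is hypothetical, profiles without use-if are assumed to be always usable, and interpolation is switched off so that "%(profile)s" in the log setting is left alone.

```python
import configparser
import subprocess

PREFIX = 'profile "'


def usable_profiles(config_text, chosen="all"):
    """Return {profile name: repository} for profiles that can be used now."""
    cp = configparser.ConfigParser(interpolation=None)
    cp.read_string(config_text)
    result = {}
    for section in cp.sections():
        if not section.startswith(PREFIX):
            continue  # e.g. the [config] section
        name = section[len(PREFIX):-1]
        if chosen != "all" and name != chosen:
            continue
        test = cp.get(section, "use-if", fallback=None)
        # Assumption: a profile without use-if is always usable.
        if test is None or subprocess.call(test, shell=True) == 0:
            result[name] = cp.get(section, "repository")
    return result
```

The pre-command and post-command settings are not handled here; they would run around the backup for each selected profile.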

--liw

Posted Sat Dec 10 21:54:36 2011 Tags:

Obnam is currently using paramiko as the SFTP implementation. It is a bit more limited than the SFTP protocol is, and so some things that Obnam should be doing, such as restoring hardlinks across SFTP, are not possible. There may also be some bugs with regard to timestamp handling.

Possible fixes:

--liw

Posted Sat Dec 10 21:54:36 2011 Tags:

Would it be beneficial to choose the size of data chunks based on the type of data, or its size? Text data might benefit from small chunks, whereas video data might benefit from large chunks, for example. Possibly the size of a file would be a sufficient indicator.

The goal would be to reduce the number of chunks, without reducing the effectiveness of de-duplication.

This needs measurements of real data to see if there are interesting parameters.

--liw

Posted Sat Dec 10 21:54:36 2011 Tags:

Moving forward in smaller increments than individual chunks would allow more rsync-like behavior, and more chance of finding duplicate data. This might be worthwhile, for some users, some of the time. It should be configurable, though, since it's also potentially going to be a big performance problem. --liw

Posted Sat Dec 10 21:54:36 2011 Tags:

If a file is sparse, and has a large hole, it would be good to skip over it with SEEK_HOLE and SEEK_DATA. --liw
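For reference, Python exposes these as os.SEEK_HOLE and os.SEEK_DATA on platforms that support them (Linux, for example). The sketch below, with a made-up function name, lists the data regions of an open file so that a reader could skip the holes. On filesystems without hole support the whole file is typically reported as a single data region.

```python
import errno
import os


def data_extents(fd):
    """Yield (start, end) offsets of the data regions of an open file.

    Note: this moves the file offset of `fd`.
    """
    size = os.fstat(fd).st_size
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:
                return  # only a hole remains before end of file
            raise
        # There is an implicit hole at end of file, so this always succeeds.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        yield start, end
        offset = end
```

A backup could then read only the ranges that are yielded and record the gaps between them as holes.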

Posted Sat Dec 10 21:54:36 2011 Tags:

fsck has all the necessary information it needs to reconstruct the chunksums and chunklist B-trees in a repository. It should do so, at least when requested.

--liw

Posted Sat Dec 10 21:54:36 2011 Tags:

How does Obnam perform if the live data has massive numbers of hardlinks? Such hardlink trees are not rare, but it would be good to at least know how Obnam handles them. Perform a benchmark comparing three cases:

  1. a very large number (one million?) of small files, all unique
  2. same, but all files identical
  3. same, but all files hardlinks to the same content

Ideally, Obnam should perform about the same for all. --liw
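The three test trees could be generated with something like the sketch below. The function name, file naming scheme, and file contents are arbitrary choices for illustration, and everything goes into one flat directory, which may itself affect the benchmark.

```python
import os


def make_tree(root, count, mode):
    """Create `count` small files under the new directory `root`.

    mode is "unique", "identical", or "hardlinks".
    """
    os.makedirs(root)
    first = None
    for i in range(count):
        path = os.path.join(root, "file-%07d" % i)
        if mode == "hardlinks" and first is not None:
            os.link(first, path)  # another name for the same inode
            continue
        with open(path, "w") as f:
            f.write("%d\n" % i if mode == "unique" else "same content\n")
        if first is None:
            first = path
```

Each tree would then be backed up into a fresh repository and the run times compared.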

Posted Sat Dec 10 21:54:36 2011 Tags:

Joey asked:

have you done anything in obnam to deal with it needing to keep the symmetric key, decrypted, in RAM?

yeah, it's tough. probably could be avoided by having gpg decrypt the passphrase and pipe it to the encrypting gpg .. but then gpg would constantly be using the public key

It might be possible to have a C extension that holds the symmetric key, locks it into RAM, and feeds it to gpg whenever necessary, via a file descriptor.

--liw

Posted Sat Dec 10 21:54:36 2011 Tags:

From Joey Hess:

My take on this is that, by choosing to use a tool that uses hashes, I am giving up (near-)absolute certainty for speed, or space, or whatever. So it's important that the hash type be good at collision resistance (for example, no two likely filenames should hash the same; "/etc/passwd" should only tend to collide with blobs that are very unlike a filename). It's also important that the tool be upfront about using hashes, and about what hash it uses. And if it's not designed to allow swapping the hash out when it gets broken, I will trust it less (hello git).

Ah, the replacement of hash functions is an interesting problem.

For pathnames, it's not at all important, I think, except perhaps for performance, since pathnames will be compared byte-by-byte instead of by hashes.

For file data, replacing is easy, if one is willing to back up everything from scratch. Supporting several hashes in the same backup store is a little bit more work, but not a whole lot: instead of having just one tree for mapping checksums to chunk identifiers, one would have one per checksum algorithm.
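A toy version of "one tree per checksum algorithm" is sketched below, with plain dicts standing in for the B-trees. The class name, the default algorithm list, and the method names are assumptions made for this example only.

```python
import hashlib


class ChunkIndex:
    """One checksum -> chunk ids mapping per hash algorithm."""

    def __init__(self, algorithms=("md5", "sha256")):
        # Dicts stand in for the per-algorithm trees.
        self.trees = {name: {} for name in algorithms}

    def add(self, chunk_id, data):
        # Record the chunk under its checksum in every tree.
        for name, tree in self.trees.items():
            digest = hashlib.new(name, data).hexdigest()
            tree.setdefault(digest, set()).add(chunk_id)

    def find(self, data, algorithm):
        """Return the ids of chunks whose checksum matches `data`."""
        digest = hashlib.new(algorithm, data).hexdigest()
        return self.trees[algorithm].get(digest, set())
```

Retiring a broken algorithm would then mean dropping its tree, provided every chunk is already present in a tree for a newer algorithm.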

--liw

Posted Sat Dec 10 21:54:36 2011 Tags: