Disclaimer: If you need to store and retrieve large amounts of data on disk, SQLite is probably what you're looking for. If you want to quickly look up data in memory, the Python dict class is probably what you need. However, if you want something for which neither is suitable, a B-tree might be helpful.

This is an implementation of particular kind of B-tree, based on research by Ohad Rodeh. See "B-trees, Shadowing, and Clones" (copied here with permission of author) for details on the data structure. This is the same data structure that btrfs uses. Note that my implementation is independent from the btrfs one, and might differ from what the paper describes.

The distinctive feature of this B-tree is that a node is never modified (sort-of). Instead, all updates are done by copy-on-write. Among other things, this makes it easy to clone a tree, and modify only the clone, while other processes access the original tree. This is utterly wonderful for my backup application, and that's the reason I wrote larch in the first place.

I have tried to keep the implementation generic and flexible, so that you may use it in a variety of situations. For example, the tree itself does not decide where its nodes are stored: you provide a class that does that for it. I have two implementations of the NodeStore class, one for in-memory and one for on-disk storage.

The tree attempts to guarantee this: all modifications you make will be safely stored in the node store when the larch.Forest.commit method is called. After that, unless you actually modify the committed tree yourself, it will be safe from further modifications. (You need to take care to create a new tree for further modifications, though.)

Documentation is sparse. Docstrings and reading the code are your best hope.

Contributions are welcome. Every level of contribution is most appreciated: bug reports, spelling and grammar fixes, code patches, questions on how to get started, etc. See README for information on how to modify the code. And if anything's unclear, ask!

Status: in production use. The on-disk file format is not expected to change anymore.

Hi, does this btree support concurrency? I mean can we start several python instances accessing same btree?
Comment by carlopires [blogspot.com] Wed Sep 8 13:41:45 2010
Nope, it makes no allowance for concurrency, I'm afraid. It hasn't been interesting to my backup application.
Comment by liw.fi Wed Sep 8 17:10:08 2010

I have several question:

  • Do larch works with 1 writer several reader ?
  • What is a common use of clones ? I though it was used for multiple writers
  • If it doesn't support multiple writers what would it take to support it ? Have any
  • It possible to write bsddb-clone on top of larch but not a sqlite, isn't it ?
  • What happens when a commit fails ?

Best regards

Comment by Aneglus Thu Nov 24 16:04:44 2011

Larch works with one writer at a time. If you set up mutual exclusion (locking) between them, many writers will work too. However, Larch itself does not deal with locking, and leaves it entirely up to you.

I use clones in my backup application to implement backup generations. Clones are not sufficient for multiple writers, because some things, such as the reference counts, are shared between all clones and need mutual exclusion for updates.

I have not given much thought for supporting multiple writers directly in Larch. It's not been something that's been interesting to the case where I use Larch (my backup program).

I see no reason why an SQL implementation like Sqlite could not be built on top of Larch. However, it's not something I have any interest working on.

If a commit fails, the "forest" metadata does not get updated, and the readers do not see any changes. There may be some garbage files on the disk (nodes that got written, but whose trees got lost from the list of clones in the forest), but apart from wastig space, they should be harmless.

Comment by Lars Wirzenius Sat Nov 26 20:42:04 2011