Will implement cloning of trees in my Rodeh B-tree implementation. The first step is to add per-node reference counts. For the in-memory implementation of NodeStore that is easy. For disk? I'll need to figure out something smart, but for now I will do a sparse-ish array of nodes. NodeStoreDisk gets a directory in which it stores the nodes. I will add a file there with a list of node ids for nodes that contain reference counts. But this trespasses on the Tree's name space for nodes. So, for now, I will put refcounts in a separate file, called refcounts, indexed by node identifier. Each refcount is of a fixed size.

Which brings me to the question of the size of the refcount. One byte would allow 255 clones of trees, and that's clearly too little. The next size is 16 bits, which allows 65535 clones, which should be enough. Should the reference count size be a user-adjustable parameter? Maybe in the future. But not now.

Should I first implement arbitrary key/value string pairs as metadata for a node store? Or perhaps I could have a "node numbers in use" list of integers? Since the list is reasonably dense, like Usenet article numbers, I could store it like this:

nodes: 1-2, 6, 7-9999, 1000000-9999999999

That should be quite compact. I could then have another such list, for refcount nodes.

I will implement the key/value metadata, then the node lists on top of that.

How should I store the metadata on disk? Storing one key/value pair per file is wasteful. Instead, I will pack them in a single file, using some simple encoding. To save my coding, I will just use ConfigParser. That is easy enough, but now the tests run slowly. Time to optimize. I added a test that inserts 1024 different keys... that's probably excessive. It's also slow because the metadata is saved for every key. Should I have a separate commit step? I think so.

Next step is to implement the dense list of node identifiers. Ids are non-negative integers, allocated sequentially. This will be a fun helper class to write. But I should not get carried away, this does not need to be a complete set() implementation.

Then, start on the actual refcounting.

Hmm.. do I actually need the lists of node identifiers? Maybe not.

Refcounting storage on disk should probably be something like this:

  • store refcounts into leaf nodes
  • node id is the key
  • count is the value
  • this re-uses existing code, but is still fairly compact
  • there is a fixed number of pairs in each node
  • each node starts from a particular value (first_key)
  • the node is stored in a file named after the starting value
  • nodes where all values would be zero are not stored on disk

However, this presumes that counts are stored as strings, not as integers. I guess that's reasonably reasonable. So what do I need to do?

  • Add an in-memory cache of refcounts in memory. Just like NodeStoreMemory.
  • Add a method for storing changed refcounts to disk.
  • Add a method for loading refcounts from disk on demand.

Hmm, do I want to revisit the encoding? I think I do. Instead of abusing a LeafNode, I'll just encode everything using struct:

  • first a 64-bit number giving the first node_id
  • then a 16-bit number giving the number of references
  • then that many 16-bit numbers giving the refcounts

OK, after some mishaps, NodeStoreDisk now stores refcounts on disk.

That means I can start working on using refcounts in the tree itself.

The next thing I need to get working is the actual cloning. This will be aided by the refcounting infrastructure I just did.

But I will do that later, now my brain is tired, and I want to go out on a walk with Soile.