I wish to review all of the btree code. It is currently 2907 lines, of which 1158 are in test modules. That's all lines, including empty lines and copyright boilerplate. I'll skip reviewing the tests, so that I have to review everything else.

General:

  • It would be good to have the API clearly defined. For example, it should be made clear which methods are public and which are not. What is a good way to do that? Prefix everything non-public with an underscore? But not all classes are meant to be public, either. Should look up what is Python best practice for this. (Bug filed.)
  • A tutorial would be good, too. Preferably with executable, automatically tested examples. (Buf filed.)
  • ReST should be used for all docstrings. (Buf filed.)
  • fsck-btree should check that no node is too big or small. (Buf filed.)

bsearch.py:

  • It would be nice to use the Python bisect module instead. It is faster, according to my own bsearch-speed tests. This would result in an API change. bsearch is not being used in all that many places, though, so it might be worth it. Current nodes.py and tree.py use it, in a total of about a dozen places. I should review those and see how big the changes are. Then run speed-test and the Obnam benchmark with and without the change to see of the speedup actually results in a faster tree. Also, bisect has insertion sort functions, too, which might simplify my code a bit. (Buf filed.)
  • Unfortunately, bisect API does not allow passing a method for extracting the key. Thus, if I want to use it, I would need to restructure all the places where the getkey parameter for bsearch is used. It seems to be used everywhere. (Buf filed.)
  • I am basically only searching for lists of pairs, and it might be possible to restructure things so that I have two lists (one of keys, one of values) rather than one list (of (key, value) pairs). (Bug filed.)

forest.py:

  • Forest.remove_tree just removes the tree from the list of trees in memory. It does not seem to remove things from disk. commit will update the list of root nodes, which will no longer include the node of the removed tree, but will not remove the actual nodes. Or at least that's what a quick analysis indicates. This should be verified. If it turns out to be true, then probably remove_tree should do a remove_range for all possible keys. It would perhaps be faster to have a routine specifically for removing all nodes in a tree, but the range removal needs to be fast, too, so optimizing that should deal with both problems at the same time. (Buf filed.)
  • Should new_tree use BTree.new_root instead of fudging the root node creation manually? (Fixed.)

nodes.py:

  • keys could perhaps call self._dict.keys instead of constructing a new list manually? (No, the list needs to be sorted.)
  • Likewise, values could call self._dict.values? (Likewise.)
  • IndexNode.__init__ has a spurious assertion loop. It might not be visible in profiling, but it's pointless. Why check types only in this one place? (Fixed.)
  • Why are the limits exclusive for find_children_in_range? Inclusive would seem to be more natural. The method is used from only one place, BTree._lookup_range, and at first glance, inclusiveness would seem to be what it assumes, anyway. Perhaps it is inclusive? The docstring says exclusive, but what does the code say? The code is actually inclusive, at least in some cases. Let's write some test cases. All tests pass. (Fixed now.)
  • Also, why is there such a method only for IndexNode and not LeafNode? This leads to BTree having a method for looking up ranges in leaves rather than having the leaf node itself having it. (Fixed.)

nodestore_disk.py:

  • Should RefcountStore and UploadQueue be in separate modules? (Bugs filed.)
  • RefcountStore should have a docstring explaining what it does, and a comment that explains how it works. (Bug filed.)
  • Is 32 KiB a good size for refcount groups? How would one measure? I guess it would be OK to add a knob that the user can tweak, and then run benchmarks tweaking that knob in various ways. However, I would guess any size in the tens of kilobytes range would do. (Ignored.)
  • Are there better ways to store the refcounts? Perhaps a sparse file? But that assume there are sparse files, which for some hypothetical VFS implementations is not going to be true. For example, a key/value database. (Ignored.)
  • Why is there a logging level test for the logging.debug call in save_refcounts? Is the call so high that it impacts performance? If so, that should be documented in a comment. Actually, the actual call is probably fast enough, but computing the values to be logged is going to take some time, perhaps? Either way, benchmark and document. Or, are those statistics useful at all? (Removed.)
  • Name of method group could be better. Rename to group_start_id? (Fixed.)
  • UploadQueue should have a docstring explaining what it does, and a comment explaining how it works. (Bug filed.)
  • Could the LRU class be used for UploadQueue? Perhaps augmented with a callback for items that get dropped from the LRU queue? In that case UploadQueue would almost entirely vanish, and would not contain the LRU implementation mixed with other stuff. (Bug filed.)
  • UploadQueue.__init__ could initialize node_before and node_after with dictionary values: `{ None: None }'. (Ignored, due to LRU.)
  • Would it be cleaner to provide NodeStoreDisk with a VFS class instead of subclassing and overriding methods? (Yes, but too much change for fairly little gain at this point. Bug filed anyway.)
  • Should get_node prefer the upload queue over the node cache? Or, rather, is there ever a case when the upload queue might contain a newer node than the cache? No, probably not, since put_node puts it in the cache as well. (Ignored.)
  • Actually, is it a good idea for put_node to put things into the cache? It might be that in most cases we only fetch things while they are in the upload queue anyway, and pushing out otherwise-usable nodes from the cache is bad. put_node could remove from the cache, and get_node could check the upload queue first. Needs benchmarking. (Bug filed.)
  • Should the whole caching thing be providing by a separate layer? (Ignored.)
  • Is list_nodes interesting for anything except fsck? Should it be implemented as a generator, since a very big B-tree will result in a very large list? On the other hand, fsck-btree will want to have the number of nodes anyway, and a generator would not help with that. How many nodes is a very large B-tree, anyway? In the test backup of havelock I did the other day, the live data is maybe 250 thousand files, and the per-client B-tree is 36 thousand nodes. Not all of those are in the same tree, but probably most of them are in all trees. That's not enough nodes to worry about, yet, for fsck. Maybe after the tree is millions of nodes. (Ignored.)
  • Would it be easier for the caller to have increment_refcount and decrement_refcount methods as well? BTree already has a method for incrementing a refcount that would map directly to one in NodeStore. However, BTree.decrement would not be much simpler if NodeStore.decrement_refcount existed. It's perhaps easier to keep the code as it is now. (Ignored.)

tree.py:

  • Should this not be called btree.py? (Yes, but big change. Bug filed.)
  • The file docstring is wrong, now: some nodes are modified in place. The note about fullness of nodes is also wrong. Removed both comments for now, should perhaps be added back later, with updates. (Fixed.)
  • The BTree docstring is also out of date, but not actively wrong. (Fixed.)
  • BTree.min_index_length is not used anywhere. This is a symptom of the tree merging adjacent siblings too aggressively: whenever it can, not just when a node becomes too small. This should be fixed, assuming benchmarks show it is useful. (Bug filed.)
  • Are helper methods like BTree.new_id, which just map to methods in another class, actually useful? There would be fewer methods without them, but actual method calls would be a bit longer. (Ignored.)
  • BTree.set_root does not seem to be useful as a separate method. (Fixed.)
  • Does BTree._leaf_size actually save enough to warrant itself? Probably does. (Ignored.)
  • Could I rewrite _lookup using looping, and would that have any benefit for clarity, or speed? Ditto for _lookup_range. (Bug filed.)
  • lookup_range should call _check_key_size for both parameters. (Fixed.)
  • WTF is BTree._new_root and how does it compare with BTree.new_root? At least one difference is how refcount incrementing happens: in method or in caller. (Fixed.)
  • When we replace a tree's root node, what happens to the old one? Should its reference count be reduced? However, that would mean it drops to 0, and the node gets removed, and yet the node might belong to a different tree. So BTree can't do that. Removing tree should happen by calling Forest.remove_tree. However, currently there is a scenario where a tree's root node gets created for it, then becomes empty, then a new value is inserted, and a new root node gets created. This results in the old root node leaking: it has a refcount of 1, but nobody actually uses it. Perhaps Forest should keep track of root nodes that get dropped this way, and remove the nodes? But then it would have to be notified of all root node changes, which seems excessive. Instead, I can fix BTree.insert to not replace an empty root node (exists, but no children), and instead insert into that node directly. Oh, the code does that already. Never mind. (Ignored.)
  • BTree._insert_into_index docstring should say that it creates new nodes as necessary so that the number of children it returns will fit as children of a single index node. (Fixed.)
  • However, is there every a case that this will cause the tree to become imbalanced? No, because it will then return two children, not one. It will create a sibling, not a new parent. If a new parent needs to be created, then the previous level will take care of that. (Ignored.)
  • _insert_into_leaf does not actually seem to guarantee that the result of a node split are small enough. It splits based on number of keys, not size of values. This can break in extreme circumstances: assume original node is two values, an empty one and a maximum-sized one, then insert a new maximum-sized value with a key that results in both maximum-sized values going to the same side of the split. Oops. The key counting works for index nodes, where all values are of the same size. (Bug filed.)
  • Why is the assert for refcount being 1 useful in remove? (Fixed.)
  • Rename _remove_single_index_children to _reduce_height and move it later in the class. (Fixed.)
  • Use old_index and new_index in _remove_from_index, for clarity. (Fixed.)
  • Should _remove_from_index be split into two sub-methods? (Ignored.)
  • Oh, look at that: I do already actually avoid merging sibling index nodes if they are large. (Ignored.)
  • The implementation of _merge_index and _merge_leaf are almost identical, as are those of _add_or_merge_index and _add_or_merge_leaf. This suggests a helper method would be a good idea to avoid bugs and simplify code. (Fixed.)
  • remove_range should check key sizes. (Fixed.)
  • _remove_range_from_index has a bunch of asserts commented out. Either remove them completely or uncomment them. (Fixed.)

Did not find my bug, but fixed a bunch of other things. Time well spent.