It seems that my B-tree code is creating a large number of small leaf nodes. The proof: I wrote a small utility that collects the size of each regular file, and ran it on the chunklist tree in the backup store for havelock.

total_files: 3402244
sum of sizes: 266097109
files per dir:
  fewest: 3 /media/Klatch/obnam-backups/chunklist/
  most: 8192 /media/Klatch/obnam-backups/chunklist/nodes/99
  average: 8120
  median: 8189
file sizes:
  smallest: 49 /media/Klatch/obnam-backups/chunklist/metadata
  largest: 65546 /media/Klatch/obnam-backups/chunklist/refcounts/refcounts-983040
  average: 78
  median: 60

(The directories are too large, too, I think.)

That's 3.4 million files, with a median size of 60 bytes. Which means that 1.7 million files are 60 bytes or smaller. Which is insane.

A quick loop over all the files in the nodes directory indicates that most small files are leaf nodes (start with ORBL, not ORBI). There should be no reason why so many leaf nodes are small.

Let me count them...

Also, running fsck-btree, just in case the nodes are leftovers.

That's going to take a long time. I'll run other tests on xander while I wait for the run on havelock to finish.

Made a copy of speed-test as foo.py, made it do inserts but nothing else, and ran:

python foo.py temp 100000 no
find temp/nodes -type f -printf '%s\n' | sort -n | numaverage -M

This outputs 318, meaning the median size for nodes is 318 bytes. The maximum node size is 64 KiB, so this is a reproduction of the bug.

find temp/nodes -type f -size -32000 | 
while read x; do dd if=$x bs=4 count=1 2>/dev/null; echo; done | 
uniq -c

The above outputs this:

    571 ORBL
      1 ORBI
    368 ORBL
      1 ORBI
    333 ORBL
      1 ORBI
   1255 ORBL
      1 ORBI
   2319 ORBL

In other words, all but four of the nodes less than half the maximum node size are leaf nodes.

The test only does inserts, not removes, so the problem is almost certainly in insert. I don't see anything obviously wrong, so I'll need to add some logging to see what's going on.
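
For illustration, the sort of thing I mean; the split itself happens elsewhere, and the names here are made up for the sketch, not taken from the real btree code:

    import logging

    # Hypothetical sketch: just log node sizes around each leaf split,
    # so that undersized results show up in the log.
    def log_leaf_split(old_size, b_size, c_size, max_size):
        logging.debug('split leaf: %d bytes -> %d + %d (max %d)',
                      old_size, b_size, c_size, max_size)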

Aha, the code to split overfull leaf nodes is broken. I thought the problem was that the caller did not merge adjacent small nodes, but Richard pointed out that inserting alone should not result in nodes that need merging.

I've fixed this code before, well, tried to. So now it's time to do it right.

The problem: I have a leaf node that is larger than is allowed, call it A. It needs to be split into two nodes (or some key/values need to be moved to one new node, same thing), call them B and C. Afterwards, the following conditions should all be true:

  • size(B) <= max
  • size(C) <= max
  • all keys in B are <= all keys in C
  • every key in A is either in B or in C

where

  • size(x) is size of node x in bytes
  • max is maximum size of a node in bytes
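
For concreteness, here is a minimal sketch of checking these conditions on the result of a split; the list-of-pairs representation and the size() helper are assumptions for illustration, not the real node API:

    def split_is_valid(a_pairs, b_pairs, c_pairs, size, max_size):
        # a_pairs, b_pairs, c_pairs: sorted (key, value) lists for A, B, C.
        # size(pairs): assumed helper giving the encoded size of a pair list.
        keys_ok = (not b_pairs or not c_pairs or
                   max(k for k, v in b_pairs) <= min(k for k, v in c_pairs))
        return (size(b_pairs) <= max_size and
                size(c_pairs) <= max_size and
                keys_ok and
                sorted(b_pairs + c_pairs) == sorted(a_pairs))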

From discussion with Richard... I need to impose an upper size limit on values, so that I don't end up with scenarios like this: a node holds values X and Z, both half the maximum size, and then we insert value Y, which is as large as a full node, resulting in either X+Y and Z, or X and Y+Z. But both X+Y and Y+Z are bigger than the maximum size of a node, so that breaks.

If values are at most half the size of a node, then it will work. So that's what we'll require.
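
In code this is just a guard at insert time; a minimal sketch, with made-up names:

    def check_value_size(value, max_node_size):
        # Reject values larger than half the maximum node size, so that an
        # overfull leaf can always be split into two legal nodes.
        if len(value) > max_node_size // 2:
            raise ValueError('value of %d bytes exceeds limit of %d bytes' %
                             (len(value), max_node_size // 2))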

Now we should be able to do this:

  • put half the key/values from A into B, and the rest in C
  • we are now guaranteed that at most one of B or C is too big
  • while size(C) > max, move the first pair in C to B
  • otherwise, while size(B) > max, move the last pair in B to C

Let O be the original node, before the new key/value got inserted (creating A). Then half of O's key/values would automatically result in a node with size <= max/2. Thus, adding a key/value that is less than max/2 in size results in a new node that is at most max in size. So yeah, that should work.
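
As a sketch of the whole split, assuming a sorted list of (key, value) pairs and a pair_size() helper (neither is the real node representation):

    def split_pairs(pairs, pair_size, max_size):
        # pairs: the key/values of the overfull node A, sorted by key.
        # pair_size(pair): assumed helper giving the encoded size of one pair.
        def size(ps):
            return sum(pair_size(p) for p in ps)

        b = pairs[:len(pairs) // 2]
        c = pairs[len(pairs) // 2:]
        # size(A) <= 1.5 * max (old node <= max, new value <= max/2),
        # so at most one of the halves can be too big.
        while size(c) > max_size:
            b.append(c.pop(0))      # C's first pair has its smallest key
        while size(b) > max_size:
            c.insert(0, b.pop())    # B's last pair has its largest key
        return b, c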

Code works now. Making btree release 0.16.