Proposal for a high-performance storage system for OSMIC branching histories

The storage system consists of a binary ledger, an append-only concatext, a reverse index of hashes of large chunks of text (organized…

Proposal for a high-performance storage system for OSMIC branching histories

Introduction:

OSMIC is a tree versioning system proposed & used as part of Project Xanadu®. This document proposes an alternative, compacted, high-speed storage system for OSMIC ledgers. For an introduction to OSMIC & the Enfilade, see my introduction to Xanadu technologies. Alternatively, you may look at my proof of concept implementations of OSMIC and the Enfilade.

Abstract:

The storage system consists of a binary ledger, an append-only permascroll, a reverse index of hashes of large chunks of text (organized first by length), and an index & reverse index for translating OSMIC numbers into offsets into the ledger.

Binary ledger:

The binary ledger consists of three fixed-length integers: a length, an insertion offset into the versioned blob (concatext), and a start offset into the permascroll.

The length is a signed 16-bit integer — negative for a delete operation; the other two are unsigned 64-bit integers.

Permascroll:

The permascroll contains all historically-inserted text concatenated together in insertion order.

Reverse hash index:

For long chunks of text (where length is more than some n times the length of the output of some fast hashing algorithm), we avoid double-inserting strings into the permascroll by checking them against a reverse hash index. This index consists of k buckets organized by size. Each bucket associates a hashed string to its position in the concatext. For long strings, the maximum prefix for each bucket size will be filed into those buckets.

The value of n must be calculated based on the cost of hashing & index lookup. I would start with n=3 as a potentially-reasonable value.

The value of k must be calculated based on the likelihood of collision for a given number of hashes. The hashing algorithm should be fast and produce relatively small output, in order to ensure performance, so collision is possible at smaller sizes. (Cryptographic hashes are not reasonable in this context.)

OSMIC number to ledger offset index:

This index should be an enfilade with 32-bit integer indices.

Lookup produces a structure consisting of:

a signed 64-bit integer row number in the ledger, which must be multiplied by 18 (2 bytes length + 8 bytes for each offset) to get a byte offset. If the row number is positive, then the next row is contiguous (i.e., the next OSMIC number is the next row).
a pointer to the previous & next OSMIC number entries in the enfilade.

OSMIC number to ledger offset reverse index:

An array of null-padded fixed-length string representations of OSMIC numbers, in hexadecimal. Their ordering is the same as the corresponding rows in the ledger. They are capped at 100 levels — in other words, 100 hex-encoded 32-bit integers, or 800 bytes.

Insert behavior:

If an inserted string is smaller than n, then it is added to the permascroll and a row is added to the end of the ledger.

If an inserted string is larger than n, then it is hashed & looked for in the reverse hash index. If it is not found, successive prefixes sized for each buckets are tried. If a prefix is found, that prefix is used in a ledger row & the remaining suffix is passed to regular insert. Any string larger than n that is not found at all will be put into the permascroll & reverse hash index.

All new ledger rows correspond to new OSMIC numbers, and these numbers are added to both the forward and reverse OSMIC number indexes.

As with regular OSMIC, delete behavior is the same as insert behavior, with the delete indicator (in this case, the sign of the length) inverted.

Single lookup behavior:

To reconstruct an action, we first look up the row number in the forward OSMIC number index. We fetch the row. We then perform general lookup behavior.

General lookup behavior:

If we are in ‘rewind’ or ‘undo’ mode, we invert the sign on the ledger row’s length value.

If this new sign is negative, we are in ‘delete’ mode: we do not fetch the value of the string from the permascroll, but instead merely remove the corresponding number of bytes from the concatext, beginning at the index provided.

Otherwise, we fetch that number of bytes from the permascroll starting at the permascroll index provided, and insert it into the concatext.

Bulk lookup (fast-forward / rewind) behavior:

To reconstruct a series of actions, we fetch the entire range from the OSMIC number forward lookup enfilade, then break it into sets of contiguous ledger rows. We fetch those sets of rows with single seek+read operations, creating a single and fully-contiguous virtual ledger row list.

If we are in rewind mode, we reverse the order of these rows.

After that, we perform general lookup behavior.