Description of the problem
I have a data state space: a set of datasets, each of which can be modeled as a collection of arbitrary key-value pairs. Each of these datasets is a branch of the evolution of a parent dataset, forming a tree (not a graph: datasets branch but never merge). The root of the state tree is an empty dataset.
I'm not looking at / exploring the entirety of this data state space. Rather, I have a list of all the leaf-node datasets that exist in the real world. I only care about these leaf-node datasets and (sometimes) their ancestors.
I want to find (or create!) a persistent data structure (on disk, distributed/sharded if necessary) to store and query these datasets. I would also accept a complete database management system built around such a structure.
This data structure would need to support the following operations:
- define a new dataset in the store, in terms of a parent-dataset identifier, a new identifier, and one set of "written" key-value pairs which would produce this dataset if applied to the referenced parent;
- open the store against a particular dataset identifier, returning a dataset descriptor;
- given a store + dataset handle, request the value of a particular key;
- given a store + dataset handle, request the key-value pairs in a contiguous range of keys;
- given a store + dataset handle, request a dump of all the key-value pairs in the dataset;
- (possibly) obtain a cursor given a store + dataset handle + an initial key, and use it to iterate forward/backward through the key-value pairs, each iteration returning one key-value pair.
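To pin down the intended semantics of these operations, here is a minimal in-memory Python model. Every name in it is mine, and the materialize-on-open strategy is purely illustrative (a real store would live on disk and could not afford a full replay per open):

```python
from typing import Dict, Iterator, Optional, Tuple

class BranchingStore:
    def __init__(self) -> None:
        # dataset id -> (parent id or None, writes applied on top of parent)
        self._defs: Dict[str, Tuple[Optional[str], Dict[str, str]]] = {}

    def define_dataset(self, new_id: str, parent_id: Optional[str],
                       writes: Dict[str, str]) -> None:
        """Register a new dataset as parent + a set of written KV pairs."""
        self._defs[new_id] = (parent_id, dict(writes))

    def open(self, dataset_id: str) -> "DatasetHandle":
        """The deliberately slow (~seconds) step: replay root..dataset."""
        chain = []
        cur: Optional[str] = dataset_id
        while cur is not None:
            parent, writes = self._defs[cur]
            chain.append(writes)
            cur = parent
        merged: Dict[str, str] = {}
        for writes in reversed(chain):   # apply oldest first
            merged.update(writes)
        return DatasetHandle(merged)

class DatasetHandle:
    def __init__(self, kv: Dict[str, str]) -> None:
        self._kv = kv

    def get(self, key: str) -> Optional[str]:
        return self._kv.get(key)

    def get_range(self, lo: str, hi: str) -> Iterator[Tuple[str, str]]:
        for k in sorted(self._kv):
            if lo <= k <= hi:
                yield k, self._kv[k]

    def dump(self) -> Iterator[Tuple[str, str]]:
        yield from sorted(self._kv.items())
```

The point of the sketch is the interface shape, not the implementation: `open` is where the slow, amortized work is allowed to happen, while everything on the handle must be fast.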
Constraints imposed by the data:
Many leaf-node datasets in the state space contain billions of key-value pairs.
Many leaf-node datasets are more than 10 million levels "deep", i.e. that many steps away from the root.
Many datasets are trivial changes from their parent, consisting of a single update or even zero updates (though a zero-update dataset still keeps a separate identity.)
Many datasets are non-trivial: hundreds of thousands of updates relative to their parent.
Each branch node in the tree has, on average, 1.5 children (most have only one; some have two; very few have more than two.) The tree consists mostly of long linear runs of nodes, with branch nodes splitting off into a new "main branch" on one side and a short, terminal "side branch" on the other.
The read operations must be fast (i.e. soft real-time / bounded latency), because the purpose of this datastore is to serve read-intensive analytical requests.
Insertion of new datasets into the store has to scale into the millions without costs exploding; but otherwise, the inserts can be quite slow, taking on the order of a few seconds to validate a new dataset definition in the store.
Opening a dataset in the store can be time-consuming (again, ~seconds), but this overhead must grow slowly enough with depth that even a dataset buried "deep" in the store can be opened. This "free" time can be used to decode the dataset out of whatever delta-compression/encoding format is used, to pre-load the required data or intermediate data structures from disk into memory, etc.
The store must also be as space-efficient on disk as possible (allowing for whatever disk-space overhead is needed to satisfy the other constraints.) I don't have petabytes to throw at this store!
The disk-space consideration is where things get interesting, IMHO.
Without the disk-space-efficiency requirement, the naïve solution is to simply store a full copy of each dataset separately in its own read-indexed storage files, and then distribute the separate datasets onto their own network shards so that read requests can be routed to them independently.
But this naïve solution would push the storage requirements for this state space (reminder: ~a billion KV pairs each, ~a million nodes) into multi-petabyte territory, and I don't have that kind of space.
Intuitively, from my experience tuning analytical database systems, I would expect a good on-disk data structure for OLAP backing storage (including all required indexes) to introduce at most a 10x overhead on top of a change-data-capture (CDC) representation of the source data. The CDC representation of all these datasets (i.e. the dataset definitions described above) currently stands at ~50 GB; so I would intuitively expect this data structure to require no more than ~500 GB of disk. Am I crazy to expect this?
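As a sanity check on these figures, a quick back-of-envelope calculation; every number here is either quoted from the problem statement or an assumed illustration (the 16 bytes per stored pair in particular is my guess):

```python
# The 10x-overhead-on-CDC budget, using the problem statement's own numbers.
cdc_gb = 50          # ~50 GB CDC representation of all dataset definitions
overhead = 10        # intuited worst-case OLAP storage/index overhead
budget_gb = cdc_gb * overhead
print(budget_gb)     # 500 -> the ~500 GB expectation

# Naive full copies, for contrast: ~1e6 datasets x ~1e9 pairs each, at an
# assumed ~16 bytes per stored pair, lands deep in petabyte territory.
naive_pb = 1_000_000 * 1_000_000_000 * 16 / 1e15
print(naive_pb)      # 16.0 petabytes
```

So the gap between the naive layout and the hoped-for budget is four to five orders of magnitude, all of which has to come from structural sharing between related datasets.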
I know I can get some easy wins in storage overhead from generic "deduplication" of datasets, just by relying on a filesystem with block-level copy-on-write, where each dataset in the state space becomes its own copy-on-write snapshot and the snapshots form a tree structure. But it seems this would not scale operationally. Either I would use some sort of sorted flat-file dataset format (in which case insertions "in the middle" of the data cause a storage explosion in descendant snapshots); or I would use something like an on-disk LSM tree (e.g. LevelDB) or B+ tree (e.g. LMDB), in which case each snapshot would add another "level" to the tree, causing either an explosion of file inodes in the LevelDB case, or fragmentation of each file into tiny extents per layer in the LMDB case. Either way, reads against a dataset a million branches deep in such a store would carry a lot of filesystem-bookkeeping overhead.
I suspect a good data structure for this would involve, at some level:
- tries (a HAMT?)
- a notion of "keyframes" versus "interstitial frames", to represent sets of trivial changes
- a notion of "hot paths" through the tree, where the "major branches" are repacked (defragmented?) and the "minor branches" are re-stored as differences from a "main branch"
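To make the keyframe idea concrete, here is a toy Python sketch (my own construction, not any existing system): every K-th dataset along a chain materializes its full map, so a point lookup never walks more than K-1 interstitial delta frames.

```python
from typing import Dict, Optional

K = 4  # keyframe interval (an assumed tuning knob)

class Frame:
    def __init__(self, parent: Optional["Frame"],
                 writes: Dict[str, str], depth: int) -> None:
        self.parent = parent
        self.writes = dict(writes)
        self.is_keyframe = depth % K == 0
        if self.is_keyframe:
            # Materialize the full map; frames before this point never
            # need to be consulted again for lookups in this frame.
            full = dict(parent.materialize()) if parent else {}
            full.update(writes)
            self.writes = full

    def materialize(self) -> Dict[str, str]:
        if self.is_keyframe:
            return self.writes
        merged = dict(self.parent.materialize())
        merged.update(self.writes)
        return merged

    def get(self, key: str) -> Optional[str]:
        # Walk delta frames back toward the nearest keyframe: at most
        # K-1 hops, regardless of how deep this dataset sits in the tree.
        frame = self
        while frame is not None:
            if key in frame.writes:
                return frame.writes[key]
            if frame.is_keyframe:
                return None  # keyframes are complete: safe to stop
            frame = frame.parent
        return None
```

The trade-off is the usual time/space one: keyframes bound read latency at the cost of duplicating the materialized map every K levels, which is presumably where tries/structural sharing would come back in.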
I am aware of Datomic, which seems to have an architecture and a set of operations similar to what I am looking for, but which only supports a linear timeline rather than a tree of branching timelines. I don't know whether its architecture could be extended to support branching time without fundamental changes.
I am also aware of what blockchain systems (e.g. Ethereum) do with storage based on Merkle Patricia tries. I evaluated exactly that, but (at least as the blockchains themselves implement the approach) the read performance does not scale to analytical workloads. (It works for those systems because their state-evolution steps are OLTP workloads, almost always computed against a "main branch" state, i.e. the result of the most recent previous computation, which can therefore be kept almost entirely hot in memory. Anything other than the most recent "main branch" state can be treated as "cold". None of this holds for an OLAP use case: OLAP queries examine arbitrary branches of the state space at arbitrary times, with no branch or node being "hot".) I would expect, however, that some modification (relaxation?) of this approach could be appropriate.
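For what it's worth, the structural sharing that makes the trie family attractive here can be shown with a toy path-copying binary trie in Python. This illustrates the general persistent-trie technique, not Ethereum's actual MPT (no hashing, no nibble paths): each new version copies only the root-to-leaf path it touches and shares every other subtree with its parent.

```python
from typing import Optional

BITS = 8  # toy key width; real systems path on hash bits or key nibbles

class Node:
    __slots__ = ("left", "right", "value")
    def __init__(self, left=None, right=None, value=None):
        self.left, self.right, self.value = left, right, value

def insert(node: Optional[Node], key: int, value: str,
           bit: int = BITS - 1) -> Node:
    # Path copying: allocate new nodes only along the root-to-leaf path;
    # every other subtree is shared, unmodified, with the prior version.
    node = Node() if node is None else Node(node.left, node.right, node.value)
    if bit < 0:
        node.value = value
        return node
    if (key >> bit) & 1:
        node.right = insert(node.right, key, value, bit - 1)
    else:
        node.left = insert(node.left, key, value, bit - 1)
    return node

def lookup(node: Optional[Node], key: int,
           bit: int = BITS - 1) -> Optional[str]:
    if node is None:
        return None
    if bit < 0:
        return node.value
    child = node.right if (key >> bit) & 1 else node.left
    return lookup(child, key, bit - 1)
```

A one-key update thus costs O(key length) new nodes per dataset version, independent of dataset size, which is exactly the per-version cost profile this problem seems to need; the open question is making reads across millions of such versions fast on disk.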