Expand description
Embedded index for the .forest.car.zst
format.
Maps from Cid
s to candidate zstd frame offsets.
§Design statement
- Create once, read many times. This means that existing databases are overkill - most of their API complexity is for write support.
- Embeddable in-file. This precludes most existing databases, which operate on files or folders.
- Lookups must NOT require reading the index into memory.
This precludes using e.g
serde::Serialize
- (Bonus) efficient merging of multiple indices.
§Implementation
The simplest implementation is a sorted list of (Cid, u64)
, pairs.
We’ll call each such pair an entry
.
But this simple implementation has a couple of downsides:
O(log(n))
time complexity for searches with binary search. (We could try to amortise this by doing an initial scan for checkpoints, but seeking backwards in the file may still be penalised by the OS).- Variable length, possibly large entries on disk, which balloons our size and/or implementation complexity.
We can address this by using an open-addressed hash table with linear probing.
- “hash table”: Have a linear array
We create a linear array of equal-length Slot
s.
- hashing the
Cid
gives us a fixed length entry. - A
hash::ideal_slot_ix
gives us a likely location to find the entry, given a table size. (That is, a hash in a collection of length 10 has a differentideal_slot_ix
than if the same hash were in a collection of length 20.) We insert paddingSlot::Empty
s to ensure each entry is at or after itsideal_slot_ix
. - We sort the hashes from lowest to highest, so lookups can
scan forwards from the
ideal_slot_ix
to find the hash they’re looking for. This is called linear probing. - We have two types of collisions.
Both must be handled by callers of
Reader::get
.- Hash collisions, where two different
Cid
s have the same hash. hash::ideal_slot_ix
collisions.
- Hash collisions, where two different
- A slot is always found at or within
V1Header::longest_distance
after itshash::ideal_slot_ix
. This is calculated at construction time.
So the layout on disk is as follows:
┌──────────────┐
│Version::V1 │
├──────────────┤
│Header │ <- Contains the "intial width", required to perform lookups
├──────────────┤
│Slot::Occupied│
├──────────────┤
│Slot::Empty │
├──────────────┤ The hash table does not know how many slots it contains:
│Slot::Empty │ Length information must be stored out of band (e.g in the
├──────────────┤ <- Zstd skip frame header)
Modules§
Structs§
- Iter 🔒
- RawSlot 🔒A
Slot
as it appears on disk. - Reader for the
.forest.car.zst
’s embedded index. - V1Header 🔒
- Writes the actual slot table to disk.
Enums§
Constants§
Traits§
- Readable 🔒
Functions§
- Useful for exhaustiveness checking