Module forest_filecoin::db::car::forest::index

source ·
Expand description

Embedded index for the .forest.car.zst format.

Maps from Cids to candidate zstd frame offsets.

§Design statement

  • Create once, read many times. This means that existing databases are overkill - most of their API complexity is for write support.
  • Embeddable in-file. This precludes most existing databases, which operate on files or folders.
  • Lookups must NOT require reading the index into memory. This precludes using e.g serde::Serialize
  • (Bonus) efficient merging of multiple indices.

§Implementation

The simplest implementation is a sorted list of (Cid, u64), pairs. We’ll call each such pair an entry. But this simple implementation has a couple of downsides:

  • O(log(n)) time complexity for searches with binary search. (We could try to amortise this by doing an initial scan for checkpoints, but seeking backwards in the file may still be penalised by the OS).
  • Variable length, possibly large entries on disk, which balloons our size and/or implementation complexity.

We can address this by using an open-addressed hash table with linear probing.

  • “hash table”: Have a linear array

We create a linear array of equal-length Slots.

  • hashing the Cid gives us a fixed length entry.
  • A hash::ideal_slot_ix gives us a likely location to find the entry, given a table size. (That is, a hash in a collection of length 10 has a different ideal_slot_ix than if the same hash were in a collection of length 20.) We insert padding Slot::Emptys to ensure each entry is at or after its ideal_slot_ix.
  • We sort the hashes from lowest to highest, so lookups can scan forwards from the ideal_slot_ix to find the hash they’re looking for. This is called linear probing.
  • We have two types of collisions. Both must be handled by callers of Reader::get.
  • A slot is always found at or within V1Header::longest_distance after its hash::ideal_slot_ix. This is calculated at construction time.

So the layout on disk is as follows:

┌──────────────┐
│Version::V1   │
├──────────────┤
│Header        │ <- Contains the "intial width", required to perform lookups
├──────────────┤
│Slot::Occupied│
├──────────────┤
│Slot::Empty   │
├──────────────┤    The hash table does not know how many slots it contains:
│Slot::Empty   │    Length information must be stored out of band (e.g in the
├──────────────┤ <- Zstd skip frame header)

Modules§

Structs§

Enums§

Constants§

Traits§

Functions§