
A simple log-structured merge tree in Rust

October 28, 2025

This is an attempt at a high-level introduction to the architecture of LSM trees. Certain implementation details or features (there are a lot!) may not be covered. If you're interested in LSM trees in detail, a great resource is the RocksDB wiki.

A log-structured merge tree (or an LSM tree) is a write-optimized data structure that is used in databases like RocksDB and Cassandra.

By buffering writes in memory, and then flushing sorted files to disk, LSM trees can achieve high write throughput and efficient reads to power use cases like observability, streaming, and more.

Fundamentals

There are a few fundamental pieces to chat through before we jump into an implementation.

The write path of LSM trees does a good job showcasing how each core component fits together, so we'll first cover what happens when a write comes in. Then, we'll briefly cover the read path to complete the picture.

Writes

             ┌─────────┐
             │  Write  │
             └────┬────┘

        ┌─────────┴──────────┐
        ▼                    ▼
  ┌─────┴─────┐       ┌──────┴──────┐
  │    WAL    │       │  Memtable   │
  │ (on disk) │       │ (in memory) │
  └───────────┘       └──────┬──────┘
                             │ flush
                  ┌──────────┘

      ┌───────────┴────────────┐
      │ SSTable1  SSTable2 ... │ (on disk)
      └───────────┬────────────┘
                  │ compact

           merged SSTable

When a write comes in, it is simultaneously written to a memtable and a write-ahead log (WAL).

Memtable

The memtable is an in-memory buffer that temporarily holds all incoming writes (before they get persisted to disk). Because writing to memory is orders of magnitude faster than writing to disk, we can quickly acknowledge the writes.

Memtables are usually built from skip lists or red-black trees, which store sorted key-value pairs (ordered by key). The sorted layout gives us the benefit of efficient lookups, and speeds up future on-disk merges.

Once a certain size limit is hit, the memtable gets saved to disk. While being saved to disk, the memtable being flushed will become immutable (although still queryable), and another memtable will take over incoming writes.
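
To make that hand-off concrete, here's a minimal sketch (not how the implementation later in this post works; that one keeps a single memtable for simplicity) of an active memtable paired with an immutable one that's mid-flush:

use std::collections::BTreeMap;

/// Hypothetical pair of memtables: `active` takes new writes, while `flushing`
/// holds the previous memtable (still readable) until its SSTable is on disk.
#[derive(Default)]
struct MemtableSet {
    active: BTreeMap<Vec<u8>, Vec<u8>>,
    flushing: Option<BTreeMap<Vec<u8>, Vec<u8>>>,
}

impl MemtableSet {
    /// Freeze the current memtable and start a fresh one for incoming writes.
    fn begin_flush(&mut self) {
        self.flushing = Some(std::mem::take(&mut self.active));
    }

    /// Reads check the active memtable first, then the one being flushed.
    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        self.active
            .get(key)
            .or_else(|| self.flushing.as_ref().and_then(|m| m.get(key)))
            .map(|v| v.as_slice())
    }
}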

Write-ahead log (WAL)

Given that the memtable lives in memory, we also want to provide durability for our system. Enter the WAL.

The WAL is an append-only file that gives us data durability. Without the WAL, if the system were to crash, we would lose all of the data in the memtable that hasn't yet been flushed to disk (no good!). With the WAL, we can replay all of the operations in the log file to reconstruct the last state of the memtable.

Once the memtable is safely flushed to disk (up next) and the data is persisted, the corresponding WAL entries can be purged.

Sorted string tables (SSTables)

As previously described, once the memtable reaches its size limit and "gets flushed", it gets saved to a Sorted String Table, or SSTable.

An SSTable is an immutable, sorted file that stores key-value pairs (ordered by key). These files usually also contain index blocks to quickly locate keys within the file, and bloom filters to filter through files that don't contain a given key.

One important performance aspect here is that SSTables are written using sequential I/O, meaning we batch the data in large, contiguous chunks on disk. This makes our writes (and reads, which we'll cover soon) faster than random I/O. It also helps us fully utilize the disk throughput when flushing (or when compacting, which we'll cover next).

Because SSTables are immutable, each memtable flush creates a new SSTable. Over time, these SSTables accumulate, often containing different versions of keys.

Compaction

Periodically, we take the accumulated SSTables, and we compact and consolidate them into larger SSTables.

Compaction keeps data sorted, removes outdated versions and deleted keys, and reduces the number of files that must be searched during reads. It's an important process for maintaining stable performance as data grows.

There are several compaction algorithms, but we'll briefly look at two of the most common: leveled compaction and size-tiered compaction.

Leveled compaction

Used by RocksDB, leveled compaction organizes SSTables into different levels (L0, L1, L2, ...). Each level has a maximum size, and after L0, SSTables within the same level are guaranteed to have no overlapping key ranges.

New SSTables are written to L0, so the non-overlapping guarantee can't hold there. But once the size limit of L0 is hit, its files get merged into the next level (each level is ~10x bigger than the previous), producing SSTables with non-overlapping key ranges.

The result is strong read performance: beyond L0, at most one SSTable needs to be looked at per level.

This comes at the cost of high write amplification due to all of the small merging operations as data falls through the levels.
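
To see why that single-lookup-per-level property holds (past L0), here's a hypothetical sketch: if we track each SSTable's (min key, max key) within a level, sorted and non-overlapping, a binary search over those ranges points at the one candidate file.

/// Hypothetical per-level metadata: each SSTable's (min_key, max_key),
/// sorted by min_key with no overlapping ranges. Returns the index of the
/// only SSTable in the level that could contain `key`, if any.
fn candidate_sstable(ranges: &[(Vec<u8>, Vec<u8>)], key: &[u8]) -> Option<usize> {
    // last range whose min_key <= key
    let idx = ranges.partition_point(|(min, _)| min.as_slice() <= key);
    if idx == 0 {
        return None;
    }
    let (_, max) = &ranges[idx - 1];
    (key <= max.as_slice()).then_some(idx - 1)
}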

Size-tiered compaction

Used by Cassandra, size-tiered compaction groups SSTables of similar sizes into tiers. Once there are enough SSTables of roughly the same size within a tier, they get merged into a single, larger SSTable and moved to the next tier.

The result is high write throughput due to the lower write amplification (fewer compactions).

The tradeoff is higher read amplification, since a key might appear in multiple overlapping SSTables.
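
As a rough sketch of that grouping logic (the names and thresholds here are made up, not Cassandra's actual parameters), a tier picker might bucket tables whose sizes fall within some ratio of each other and only merge once a bucket is big enough:

/// Hypothetical tier picker: sort table sizes, bucket neighbors that fall
/// within `bucket_ratio` of the bucket's smallest member, and return the
/// first bucket with at least `min_threshold` tables to merge.
fn pick_tier(mut sizes: Vec<u64>, bucket_ratio: f64, min_threshold: usize) -> Option<Vec<u64>> {
    sizes.sort_unstable();
    let mut bucket: Vec<u64> = Vec::new();
    for size in sizes {
        let fits = bucket
            .first()
            .map(|&first| size as f64 <= first as f64 * bucket_ratio)
            .unwrap_or(true);
        if fits {
            bucket.push(size);
        } else if bucket.len() >= min_threshold {
            return Some(bucket);
        } else {
            bucket = vec![size];
        }
    }
    (bucket.len() >= min_threshold).then_some(bucket)
}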

Reads

 
                ┌────────┐
                │  Read  │
                └───┬────┘

    ┌──────────────────────────────────┐
    │  Memtable (in memory)            │
    │  Immutable Memtable (flushing)   │
    └───────────────┬──────────────────┘
                    │ (not found in memory)

                    │ (check L0 bloom filter)

  L0 (on disk) - - - - - - - - - - - - - -
        ┌────────┐     ┌────────┐
        │   L0   │     │   L0   │
        │ [SSTs] │     │ [SSTs] │
        └────────┘     └────────┘
            (return if found)

                    │ (check L1 bloom filter)

  L1 (on disk) - - - - - - - - - - - - - -
  ┌───────────────┐   ┌───────────────┐
  │      L1       │   │      L1       │
  │    [SSTs]     │   │    [SSTs]     │
  └───────────────┘   └───────────────┘
            (return if found)
                   ...

                    │ (check LN bloom filter)

  LN (on disk) - - - - - - - - - - - - - -
  ┌───────────────────────────────────┐
  │                 LN                │
  │               [SSTs]              │
  └───────────────────────────────────┘
 

Now that we've covered the high-level pieces on the write side, we can outline what an incoming read looks like.

The main idea for reads is a layered search: the most recent writes live in memory, and older data sits on disk in progressively older and larger SSTables.

Step 1: Check the memtable

Our first check is whether the key lives in memory. The memtable holds our most recent writes, so we avoid disk I/O entirely if we get a match. If we're in the middle of a flush, we can also check the immutable memtable being flushed, since it's still queryable.

Step 2: Bloom filters

Before scanning through SSTables on disk, we use bloom filters to rule out any files that cannot contain the key.

The tl;dr with bloom filters is that they either tell you that a key probably exists in a set (false positives are possible here), or that a key does not exist in a set (false negatives are not possible). This check is key in helping to minimize unnecessary disk I/O, and we can leverage it to skip levels entirely.
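
We'll skip bloom filters in the implementation later on, but a toy version is small enough to sketch here (a real one sizes the bit array and number of hash functions for a target false-positive rate):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy bloom filter: k hash functions simulated by seeding one hasher k ways.
struct BloomFilter {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        Self { bits: vec![false; num_bits], num_hashes }
    }

    fn bit_index(&self, key: &[u8], seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.bits.len()
    }

    fn insert(&mut self, key: &[u8]) {
        for seed in 0..self.num_hashes {
            let idx = self.bit_index(key, seed);
            self.bits[idx] = true;
        }
    }

    /// `false` means the key is definitely absent; `true` means it *might* be present.
    fn may_contain(&self, key: &[u8]) -> bool {
        (0..self.num_hashes).all(|seed| self.bits[self.bit_index(key, seed)])
    }
}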

Step 3: SSTables

With some candidates filtered out, we search through the remaining SSTables in order from newest to oldest (newer tables have the newer writes), level by level.

Because each SSTable is sorted by key and carries its own index blocks, we can achieve fairly efficient reads. We can also take advantage of sequential I/O during scans, since the tables were written contiguously.
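
The index blocks we're glossing over work roughly like this (a hedged sketch, not the file format we'll use below): record every Nth key along with its byte offset in the file, binary search that index for the last key at or before the target, and scan forward from that offset.

/// Hypothetical sparse index: (key, byte offset into the SSTable file) for
/// every Nth entry. Returns the offset to start scanning from, or None if
/// the target sorts before the first indexed key.
fn seek_offset(index: &[(Vec<u8>, u64)], target: &[u8]) -> Option<u64> {
    let idx = index.partition_point(|(key, _)| key.as_slice() <= target);
    if idx == 0 {
        None
    } else {
        Some(index[idx - 1].1)
    }
}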

Implementation

Now with some background, let's write out some of the basic pieces we described above. This is meant to be a very simple implementation just to illustrate key concepts, so we'll skip over some details and optimizations.

Our goal here is to achieve a basic structure that looks like this:

pub struct LsmTree {
    memtable: Memtable,
    wal: Wal,
    sstables: Vec<SSTable>,
    path: PathBuf,
    next_sstable_id: u32,
}

and we'll implement the following operations:

impl LsmTree {
    pub fn open(path: &Path) -> Result<Self>; // replays if exists
    pub fn put(&mut self, key: Vec<u8>, value: Vec<u8>) -> Result<()>;
    pub fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>>;
    pub fn flush(&mut self) -> Result<()>;
    pub fn compact_all(&mut self) -> Result<()>; // very basic
}

Since we already have a lot of code to write, I'm going to skip over a few things like:

  • Deleting records (by adding tombstones, you can read more here)
  • Index blocks and bloom filters for reads
  • We'll assume we're in a single-threaded environment / avoid parallelism

One other note: I'll be using anyhow to simplify error handling (it can be added with cargo add anyhow).

Let's get started!

Memtable

We'll start with the memtable. Above, we described that memtables typically use skip lists or red-black trees; to keep things simple, we're going to use a BTreeMap, which still keeps our memtable ordered by key.

use std::collections::BTreeMap;
 
pub struct Memtable {
    map: BTreeMap<Vec<u8>, Vec<u8>>,
}
 
impl Default for Memtable {
    fn default() -> Self {
        Self::new()
    }
}
 
impl Memtable {
    pub fn new() -> Self {
        Self {
            map: BTreeMap::new(),
        }
    }
 
    pub fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.map.insert(key, value);
    }
 
    pub fn get(&self, key: &[u8]) -> Option<&[u8]> {
        self.map.get(key).map(|v| v.as_slice())
    }
 
    pub fn iter(&self) -> impl Iterator<Item = (&[u8], &[u8])> {
        self.map.iter().map(|(k, v)| (k.as_slice(), v.as_slice()))
    }
 
    pub fn size(&self) -> usize {
        self.map.len()
    }
 
    pub fn total_bytes(&self) -> usize {
        self.map.iter().map(|(k, v)| k.len() + v.len()).sum()
    }
 
    pub fn is_empty(&self) -> bool {
        self.map.is_empty()
    }
 
    pub fn clear(&mut self) {
        self.map.clear();
    }
}

Nothing crazy here, as this is largely just a wrapper over built-in BTreeMap operations, but it gives us something to build on in a bit.
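
As a quick sanity check (assuming the Memtable above is in scope), insertion order doesn't matter; iterating always walks keys in sorted order:

let mut memtable = Memtable::new();
memtable.put(b"wizard".to_vec(), b"Gandalf".to_vec());
memtable.put(b"hobbit".to_vec(), b"Frodo".to_vec());

// the BTreeMap keeps keys ordered, so "hobbit" comes back before "wizard"
let keys: Vec<&[u8]> = memtable.iter().map(|(k, _)| k).collect();
assert_eq!(keys, vec![b"hobbit".as_slice(), b"wizard".as_slice()]);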

WAL

Next, we need some durability so we can sleep easy at night knowing our data is on-disk.

A few small notes about the implementation here:

  • We're just going to follow a very basic log scheme of <u32 key length><u32 value length><key bytes><val bytes>
  • No deletes

use std::{
    fs::{File, OpenOptions},
    io::{BufReader, BufWriter, Read, Write},
    path::{Path, PathBuf},
};
 
use anyhow::{anyhow, bail, Context, Result};
 
/// Helper function to read out key/value pairs from a file given
/// our binary format: `<u32 key length><u32 value length><key bytes><val bytes>`
fn read_entry_from_header(reader: &mut BufReader<File>) -> Result<Option<(Vec<u8>, Vec<u8>)>> {
    let mut header = [0u8; 8];
 
    match reader.read_exact(&mut header) {
        Ok(()) => {}
        Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e).context("Failed to read entry header"),
    }
 
    let key_length = u32::from_le_bytes(
        header
            .get(0..4)
            .ok_or_else(|| anyhow!("Invalid header: missing key length"))?
            .try_into()
            .context("Invalid key length slice")?,
    ) as usize;
 
    let value_length = u32::from_le_bytes(
        header
            .get(4..8)
            .ok_or_else(|| anyhow!("Invalid header: missing value length"))?
            .try_into()
            .context("Invalid value length slice")?,
    ) as usize;
 
    // defensive check to avoid any OOM even though we also check on write
    if key_length > LsmTree::MAX_ENTRY_SIZE || value_length > LsmTree::MAX_ENTRY_SIZE {
        bail!(
            "Corrupt entry: header saying key or value is too large (key length={} value length={})",
            key_length,
            value_length
        )
    }
 
    let mut key = vec![0u8; key_length];
    reader.read_exact(&mut key)?;
 
    let mut value = vec![0u8; value_length];
    reader.read_exact(&mut value)?;
 
    Ok(Some((key, value)))
}
 
pub struct Wal {
    path: PathBuf,
    writer: BufWriter<File>,
}
 
impl Wal {
    pub fn open(path: &Path) -> Result<Self> {
        let path = path.to_path_buf();
        let file = OpenOptions::new()
            .create(true)
            .append(true)
            .read(true)
            .open(&path)?;
 
        Ok(Self {
            path,
            writer: BufWriter::new(file),
        })
    }
 
    /// Append a `Put(key, value)` record to the WAL and fsync
    /// using the following log format:
    /// `<u32 key length><u32 value length><key bytes><val bytes>`
    pub fn append(&mut self, key: &[u8], value: &[u8]) -> Result<()> {
        let key_length = key.len() as u32;
        let value_length = value.len() as u32;
 
        self.writer.write_all(&key_length.to_le_bytes())?;
        self.writer.write_all(&value_length.to_le_bytes())?;
        self.writer.write_all(key)?;
        self.writer.write_all(value)?;
 
        // flush buffer and sync the file for durability
        self.writer.flush()?;
        self.writer.get_ref().sync_all()?;
 
        Ok(())
    }
 
    /// Replay all records in the WAL (returns key/value pairs to rebuild the memtable)
    pub fn replay(&mut self) -> Result<Vec<(Vec<u8>, Vec<u8>)>> {
        // flush out any buffered writes
        self.writer.flush()?;
 
        let file = OpenOptions::new().read(true).open(&self.path)?;
        let mut reader = BufReader::new(file);
 
        let mut records = Vec::new();
        loop {
            let (key, value) = match read_entry_from_header(&mut reader) {
                Ok(Some((key, value))) => (key, value),
                Ok(None) => break, // EOF
                Err(e) => return Err(e).context("Failed to read WAL record"),
            };
            records.push((key, value));
        }
 
        Ok(records)
    }
 
    /// Purges the WAL after a flush
    pub fn purge(&mut self) -> Result<()> {
        // truncate then re-init
        OpenOptions::new()
            .create(true)
            .write(true)
            .truncate(true)
            .open(&self.path)?;
 
        let file = OpenOptions::new()
            .append(true)
            .read(true)
            .open(&self.path)?;
 
        self.writer = BufWriter::new(file);
 
        Ok(())
    }
}
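
A quick way to convince ourselves this works (a hypothetical snippet against a fresh log file, run inside something that returns a Result):

let mut wal = Wal::open(Path::new("./wal_demo.log"))?;
wal.append(b"ring bearer", b"Frodo Baggins")?;
wal.append(b"wizard", b"Gandalf the Grey")?;

// replay hands records back in append order so the memtable can be rebuilt
let records = wal.replay()?;
assert_eq!(records.len(), 2);
assert_eq!(records[0].0, b"ring bearer".to_vec());

// once the memtable is safely flushed, the log can be purged
wal.purge()?;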

SSTables

We'll also make a few simplifying assumptions for the SSTable implementation.

  • One file per flush
  • File is sorted by entry key, but we're just going to do a linear scan instead of a binary search

// ...continued from above
 
pub struct SSTable {
    path: PathBuf,
}
 
impl SSTable {
    const FILE_NAME_PREFIX: &'static str = "sstable_";
    const FILE_EXT: &'static str = ".sst";
 
    /// Creates an SSTable file from the data in a memtable.
    /// Format of the entries remains the same.
    pub fn from_memtable(path: &Path, memtable: &Memtable) -> Result<Self> {
        let path_buf = path.to_path_buf();
        let mut file = File::create(path)?;
 
        // write all of the pre-sorted data
        for (key, value) in memtable.iter() {
            let key_length = key.len() as u32;
            let value_length = value.len() as u32;
            file.write_all(&key_length.to_le_bytes())?;
            file.write_all(&value_length.to_le_bytes())?;
            file.write_all(key)?;
            file.write_all(value)?;
        }
 
        file.flush()?;
        file.sync_all()?;
 
        Ok(Self { path: path_buf })
    }
 
    /// Find a single key on-disk.
    /// Sequentially scans each record until a match is found or we EOF.
    pub fn get(&self, target_key: &[u8]) -> Result<Option<Vec<u8>>> {
        let file = File::open(&self.path)?;
        let mut reader = BufReader::new(file);
        loop {
            let (key, value) = match read_entry_from_header(&mut reader) {
                Ok(Some((key, value))) => (key, value),
                Ok(None) => break, // EOF
                Err(e) => return Err(e).context("Failed to read WAL record"),
            };
 
            // linear scan, but we could be smarter here for an optimization
            if target_key == key {
                return Ok(Some(value));
            }
        }
 
        Ok(None)
    }
 
    /// Simple iterator for convenience to go over all key/values
    /// in an SSTable (primarily for compaction)
    pub fn iter(&self) -> Result<impl Iterator<Item = Result<(Vec<u8>, Vec<u8>)>>> {
        let file = File::open(&self.path)?;
        let mut reader = BufReader::new(file);
        Ok(std::iter::from_fn(move || {
            match read_entry_from_header(&mut reader) {
                Ok(Some((key, value))) => Some(Ok((key, value))),
                Ok(None) => None, // EOF
                Err(e) => Some(Err(e).context("Failed to read SSTable record")),
            }
        }))
    }
}

LSM Tree

Now let's put it all together and finish up our implementation!

One note here is we will just provide a compaction function, but we won't work it into the lifecycle of the LSM tree.

pub struct LsmTree {
    memtable: Memtable,
    wal: Wal,
    sstables: Vec<SSTable>,
    path: PathBuf,
    next_sstable_id: u32,
}
 
impl LsmTree {
    const MAX_ENTRY_SIZE: usize = 64 * 1024; // 64 KB
    const MAX_MEMTABLE_SIZE: usize = 128 * 1024; // 128 KB
 
    /// Open (or create) an LSM tree given the directory.
    /// If the structure exists already, we will:
    ///   - open and replay the WAL
    ///   - load any existing SSTables
    pub fn open(path: &Path) -> Result<Self> {
        let path_buf = path.to_path_buf();
        std::fs::create_dir_all(&path_buf)?;
 
        // we'll have the WAL live at `<path>/wal.log`
        let wal_path = path_buf.join("wal.log");
        let mut wal = Wal::open(&wal_path)?;
 
        // fill the memtable with the WAL replay
        let mut memtable = Memtable::new();
        for (key, value) in wal.replay()? {
            memtable.put(key, value);
        }
 
        // grab any existing SSTables (`<path>/sstable_<id>.sst`)
        let mut sstables_with_id = Vec::new();
        for dir_entry_result in std::fs::read_dir(&path_buf)? {
            let dir_entry = dir_entry_result?;
            let dir_entry_path = dir_entry.path();
 
            if !dir_entry_path.is_file() {
                continue;
            }
 
            match dir_entry_path.file_name().and_then(|x| x.to_str()) {
                Some(file_name)
                    if file_name.starts_with(SSTable::FILE_NAME_PREFIX)
                        && file_name.ends_with(SSTable::FILE_EXT) =>
                {
                    let sstable_id_opt = file_name
                        .strip_prefix(SSTable::FILE_NAME_PREFIX)
                        .and_then(|s| s.strip_suffix(SSTable::FILE_EXT))
                        .and_then(|s| s.parse::<u32>().ok());
 
                    if let Some(sstable_id) = sstable_id_opt {
                        sstables_with_id.push((
                            sstable_id,
                            SSTable {
                                path: dir_entry_path,
                            },
                        ));
                    }
                }
                _ => continue,
            }
        }
 
        // sort them by id (age), oldest first
        sstables_with_id.sort_by_key(|(id, _)| *id);

        // continue numbering after the newest existing SSTable so a future
        // flush doesn't overwrite a file we just loaded
        let next_sstable_id = sstables_with_id
            .last()
            .map(|(id, _)| *id + 1)
            .unwrap_or(0);

        let sstables: Vec<SSTable> = sstables_with_id
            .into_iter()
            .map(|(_, sstable)| sstable)
            .collect();

        Ok(Self {
            memtable,
            wal,
            sstables,
            path: path_buf,
            next_sstable_id,
        })
    }
 
    /// Put a key/value pair onto the WAL and memtable
    pub fn put(&mut self, key: Vec<u8>, value: Vec<u8>) -> Result<()> {
        // reject oversized entries up front so the read-side header check holds
        if key.len() > Self::MAX_ENTRY_SIZE || value.len() > Self::MAX_ENTRY_SIZE {
            bail!(
                "Entry too large (key={} bytes, value={} bytes)",
                key.len(),
                value.len()
            );
        }

        self.wal.append(&key, &value)?;
        self.memtable.put(key, value);
 
        if self.memtable.total_bytes() > Self::MAX_MEMTABLE_SIZE {
            self.flush()?;
        }
 
        Ok(())
    }
 
    /// Search for a key first against the memtable, then against
    /// SSTables from newest to oldest
    pub fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>> {
        if let Some(value) = self.memtable.get(key) {
            return Ok(Some(value.to_vec()));
        }
 
        for table in self.sstables.iter().rev() {
            if let Some(value) = table.get(key)? {
                return Ok(Some(value));
            }
        }
 
        Ok(None)
    }
 
    /// Write the current memtable to an SSTable on-disk, and then clear
    /// the existing memtable and reset the WAL
    pub fn flush(&mut self) -> Result<()> {
        if self.memtable.is_empty() {
            return Ok(());
        }
 
        let file_name = format!(
            "{}{}{}",
            SSTable::FILE_NAME_PREFIX,
            self.next_sstable_id,
            SSTable::FILE_EXT
        );
        self.next_sstable_id += 1;
        let path = self.path.join(file_name);
 
        let sstable = SSTable::from_memtable(&path, &self.memtable)?;
        self.sstables.push(sstable);
 
        self.memtable.clear();
        self.wal.purge()?;
 
        Ok(())
    }
 
    /// Very basic compaction that is called independently.
    /// It will take all SSTables, merge them, sort the merged map,
    /// write the result as a single SSTable, and then drop the old tables
    pub fn compact_all(&mut self) -> Result<()> {
        // we wouldn't generally use the memtable to do this, but all of the
        // operations are set up for us, so we can reuse one here
        let mut merged = Memtable::new();
 
        for table in &self.sstables {
            for read_result in table.iter()? {
                let (key, value) = read_result?;
                merged.put(key, value);
            }
        }
 
        let file_name = format!(
            "{}{}{}",
            SSTable::FILE_NAME_PREFIX,
            self.next_sstable_id,
            SSTable::FILE_EXT
        );
        self.next_sstable_id += 1;
        let path = self.path.join(file_name);
 
        let compacted_table = SSTable::from_memtable(&path, &merged)?;
 
        let paths_to_delete: Vec<PathBuf> =
            self.sstables.iter().map(|sst| sst.path.clone()).collect();
 
        self.sstables.clear();
        self.sstables.push(compacted_table);
 
        // clean up old tables on disk
        for path in &paths_to_delete {
            std::fs::remove_file(path)?;
        }
 
        Ok(())
    }
}
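
One note from earlier was that we won't work compaction into the lifecycle of the tree. If we did want to, a simple (hypothetical) trigger would be to compact whenever the number of SSTables crosses a threshold, for example with a small wrapper inside the impl LsmTree block:

    // Hypothetical threshold: once we accumulate this many SSTables, compact.
    const MAX_SSTABLES_BEFORE_COMPACT: usize = 4;

    /// Hypothetical wrapper around `put` that folds compaction into the write path.
    pub fn put_and_maybe_compact(&mut self, key: Vec<u8>, value: Vec<u8>) -> Result<()> {
        self.put(key, value)?;
        if self.sstables.len() >= Self::MAX_SSTABLES_BEFORE_COMPACT {
            self.compact_all()?;
        }
        Ok(())
    }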

Trying it out

With everything in place, let's try a few operations and see how we do.

fn main() -> Result<()> {
    let lsm_tree_path = Path::new("./tmp");
    let mut lsm_tree = LsmTree::open(lsm_tree_path)?;
 
    lsm_tree.put(b"ring bearer".to_vec(), b"Frodo Baggins".to_vec())?;
    lsm_tree.put(b"wizard".to_vec(), b"Gandalf the Grey".to_vec())?;
    lsm_tree.put(b"meal".to_vec(), b"second breakfast".to_vec())?;
    lsm_tree.put(b"pipeweed".to_vec(), b"Longbottom Leaf".to_vec())?;
 
    println!("ring bearer   -> {:?}", lsm_tree.get(b"ring bearer")?.map(|b| String::from_utf8_lossy(&b).to_string()));
    println!("wizard        -> {:?}", lsm_tree.get(b"wizard")?.map(|b| String::from_utf8_lossy(&b).to_string()));
    println!("meal          -> {:?}", lsm_tree.get(b"meal")?.map(|b| String::from_utf8_lossy(&b).to_string()));
    println!("pipeweed      -> {:?}", lsm_tree.get(b"pipeweed")?.map(|b| String::from_utf8_lossy(&b).to_string()));
    println!("dragon        -> {:?}", lsm_tree.get(b"dragon")?.map(|b| String::from_utf8_lossy(&b).to_string())); // not there, currently in The Lonely Mountain
 
    Ok(())
}
 ring bearer   -> Some("Frodo Baggins")
 wizard        -> Some("Gandalf the Grey")
 meal          -> Some("second breakfast")
 pipeweed      -> Some("Longbottom Leaf")
 dragon        -> None

Looks like put and get are working as expected! You'll also notice that a WAL file was created in the tmp/ directory after running this for the first time.

We can quickly check that the replay works as expected by commenting out the put calls to see if we can still get our data.

// ... still in main
 
let lsm_tree_path = Path::new("./tmp");
let mut lsm_tree = LsmTree::open(lsm_tree_path)?;
 
// lsm_tree.put(b"ring bearer".to_vec(), b"Frodo Baggins".to_vec())?;
// lsm_tree.put(b"wizard".to_vec(), b"Gandalf the Grey".to_vec())?;
// lsm_tree.put(b"meal".to_vec(), b"second breakfast".to_vec())?;
// lsm_tree.put(b"pipeweed".to_vec(), b"Longbottom Leaf".to_vec())?;
 
println!("ring bearer   -> {:?}", lsm_tree.get(b"ring bearer")?.map(|b| String::from_utf8_lossy(&b).to_string()));
println!("wizard        -> {:?}", lsm_tree.get(b"wizard")?.map(|b| String::from_utf8_lossy(&b).to_string()));
println!("meal          -> {:?}", lsm_tree.get(b"meal")?.map(|b| String::from_utf8_lossy(&b).to_string()));
println!("pipeweed      -> {:?}", lsm_tree.get(b"pipeweed")?.map(|b| String::from_utf8_lossy(&b).to_string()));
println!("dragon        -> {:?}", lsm_tree.get(b"dragon")?.map(|b| String::from_utf8_lossy(&b).to_string())); // not there, currently in The Lonely Mountain
 ring bearer   -> Some("Frodo Baggins")
 wizard        -> Some("Gandalf the Grey")
 meal          -> Some("second breakfast")
 pipeweed      -> Some("Longbottom Leaf")
 dragon        -> None

Pretty sweet. Now let's try flushing.

lsm_tree.flush()?;
 
println!("(after flush)...");
 
println!("ring bearer   -> {:?}", lsm_tree.get(b"ring bearer")?.map(|b| String::from_utf8_lossy(&b).to_string()));
println!("wizard        -> {:?}", lsm_tree.get(b"wizard")?.map(|b| String::from_utf8_lossy(&b).to_string()));
println!("meal          -> {:?}", lsm_tree.get(b"meal")?.map(|b| String::from_utf8_lossy(&b).to_string()));
println!("pipeweed      -> {:?}", lsm_tree.get(b"pipeweed")?.map(|b| String::from_utf8_lossy(&b).to_string()));
println!("dragon        -> {:?}", lsm_tree.get(b"dragon")?.map(|b| String::from_utf8_lossy(&b).to_string())); // not there, currently in The Lonely Mountain

You should notice that a tmp/sstable_0.sst file appeared. Looks like it worked as expected.

Now let's try updating a value with put to see if our layered search approach works correctly. If it does, the value will get written to the memtable and on get we should check there first, returning the new value.

lsm_tree.put(b"meal".to_vec(), b"elevenses".to_vec())?;
println!("meal (new)    -> {:?}", lsm_tree.get(b"meal")?.map(|b| String::from_utf8_lossy(&b).to_string()));
 meal (new)    -> Some("elevenses")

Let's flush again so we now have two SSTables, and we can see if our latest write still wins.

lsm_tree.flush()?;
println!("(after second flush)");
 
println!("ring bearer   -> {:?}", lsm_tree.get(b"ring bearer")?.map(|b| String::from_utf8_lossy(&b).to_string()));
println!("wizard        -> {:?}", lsm_tree.get(b"wizard")?.map(|b| String::from_utf8_lossy(&b).to_string()));
println!("meal (new)    -> {:?}", lsm_tree.get(b"meal")?.map(|b| String::from_utf8_lossy(&b).to_string()));
println!("pipeweed      -> {:?}", lsm_tree.get(b"pipeweed")?.map(|b| String::from_utf8_lossy(&b).to_string()));
println!("dragon        -> {:?}", lsm_tree.get(b"dragon")?.map(|b| String::from_utf8_lossy(&b).to_string())); // not there, currently in The Lonely Mountain
 
 (after second flush)
 ring bearer   -> Some("Frodo Baggins")
 wizard        -> Some("Gandalf the Grey")
 meal (new)    -> Some("elevenses")
 pipeweed      -> Some("Longbottom Leaf")
 dragon        -> None

Looks correct! We can also see that tmp/sstable_1.sst appeared, which is expected.

The last thing to try is compaction. Let's give it a go.

lsm_tree.compact_all()?;
println!("(after compaction)");
 
println!("ring bearer   -> {:?}", lsm_tree.get(b"ring bearer")?.map(|b| String::from_utf8_lossy(&b).to_string()));
println!("wizard        -> {:?}", lsm_tree.get(b"wizard")?.map(|b| String::from_utf8_lossy(&b).to_string()));
println!("meal (new)    -> {:?}", lsm_tree.get(b"meal")?.map(|b| String::from_utf8_lossy(&b).to_string()));
println!("pipeweed      -> {:?}", lsm_tree.get(b"pipeweed")?.map(|b| String::from_utf8_lossy(&b).to_string()));
println!("dragon        -> {:?}", lsm_tree.get(b"dragon")?.map(|b| String::from_utf8_lossy(&b).to_string())); // not there, currently in The Lonely Mountain
 (after compaction)
 ring bearer   -> Some("Frodo Baggins")
 wizard        -> Some("Gandalf the Grey")
 meal (new)    -> Some("elevenses")
 pipeweed      -> Some("Longbottom Leaf")
 dragon        -> None

Works as expected! You'll notice we cleaned up the two old SSTables and merged them into a combined tmp/sstable_2.sst.


Even with a toy implementation, we can start to understand the inner mechanics and see the utility of LSM trees. There are countless optimizations we could layer on top, some we touched on (like index blocks and bloom filters), and others we didn't explore, such as caching, compression, and background compaction strategies.

All in all, LSM trees underpin a huge portion of modern data infrastructure, and hopefully this served as an approachable introduction to how they work under the hood.

Thanks for reading! You can find the full source code for this implementation here.


Resources

There are some links scattered throughout, but here are some useful resources that helped guide some of the writing above: