Cover image generated by Nano Banana
Disclaimer: A 50× speedup is not guaranteed. Actual performance depends on the nature of the dataset and the hardware on which the code is run. Please refer to the Benchmarking section below for more information.
Introduction
Recently, I finally found some time to learn the Rust programming language. I find its memory-safety guarantees quite elegant, although they come with the trade-off of a steep learning curve, especially around Rust’s ownership and lifetime system. This is very appealing to someone like me, who mainly uses a scripting language and writes low-level code only from time to time; in those circumstances, writing C/C++ can easily lead to unstable runtime behavior or unexpected results.
For practice, I translated one of my CLI tools from Julia to Rust with the assistance of GPT-5-Codex. The tool performs a video understanding task via a LitServe-based HTTP server. I used the Tokio library to run the three stages (frame extraction, frame inference, and post-processing) as concurrent tasks. In my opinion, Rust is much better suited to this kind of task than Julia. I’m also satisfied with the modularity and clarity of the new Rust codebase.
However, because I still mostly use Python for my various projects, making Rust code callable from Python would offer the best of both worlds and empower me to write more efficient and maintainable code. Incidentally, while studying the fenic library, I found that it includes Polars extensions written in Rust, which are integrated into the main package to speed up certain compute-intensive tasks (e.g., handling JSON strings). I took the opportunity to study the tooling around Polars extensions and learned that Polars itself and some other popular Rust-powered Python libraries (e.g., Tokenizers) use the same toolchain (usually with more sophisticated build configurations than fenic’s, though).
The tooling includes PyO3[1] and Maturin[2]. PyO3 is a Rust library that provides a bridge between Rust and Python, while Maturin is a build system that simplifies the process of building and distributing Rust-based Python packages. To try out the tooling, I decided to build a Levenshtein distance[3] calculation function in Rust that can be imported and run in Python code. Getting a working implementation was much simpler than I expected, and the speedup was immediately visible. However, getting it to a more robust, publish-ready state took me a bit more digging. This blog post will walk you through the process and highlight some key configurations and concepts that may confuse beginners.
The code for this post is available on GitHub at ceshine/pyo3-Levenshtein.
Single-Threaded Implementation
Let’s start with the most straightforward implementation of the Levenshtein distance, which runs on a single thread. It immediately provides more than a 50× speedup compared to a single-threaded Python implementation of the same algorithm. Further optimization of the algorithm and multi-threaded execution could yield a significantly greater speedup, but let’s keep things simple to help you grasp the key ideas.
We’ll use v0.2.2 of the pyo3-Levenshtein package to demonstrate the single-threaded implementation.
Prerequisites:
- You need to have Rust and Cargo installed on your system.
- I recommend using uv (written in Rust) to manage your Python environment, and I will use it throughout the rest of this post. However, this process should work with any Python environment manager.
Setting up the project
First, install Maturin globally using uv tool install maturin. The maturin command should be available in your terminal afterward.
I manually created Cargo.toml and pyproject.toml in the project root, following Maturin’s documentation [2]. However, you can also use the maturin new command to create these files.
Below are the contents of my Cargo.toml:
[package]
name = "pyo3-levenshtein"
version = "0.2.2"
edition = "2024"
[dependencies]
pyo3 = { version = "0.27.1", features = ["abi3-py310"] }
rayon = "1.10"
once_cell = "1.20"
dashmap = "6.1"
[features]
default = []
extension-module = ["pyo3/extension-module"]
[lib]
name = "pyo3_levenshtein"
crate-type = ["cdylib", "rlib"]
Notes:
- The rayon, once_cell, and dashmap dependencies are for the multi-threaded implementation. For the single-threaded implementation, pyo3 is the single dependency needed.
- cdylib is required for the Python binding; rlib is required for running the Rust doctests.
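With rlib enabled, cargo test can also compile and run doctests written in the library’s doc comments. Below is a minimal, hypothetical sketch of such a doctest, sitting directly above the pub fn levenshtein definition shown later (the real doc comments are omitted from this post’s listings):

/// Computes the Levenshtein distance between two strings.
///
/// ```
/// use pyo3_levenshtein::levenshtein;
/// // "kitten" -> "sitting": k->s, e->i, plus an inserted g = 3 edits.
/// assert_eq!(levenshtein("kitten", "sitting"), 3);
/// ```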
And here are the contents of my pyproject.toml file:
[project]
name = "pyo3-levenshtein"
dynamic = ["version"]
description = "Add your description here"
requires-python = ">=3.10"
dependencies = []
[build-system]
requires = ["maturin>=1.0,<2.0"]
build-backend = "maturin"
[dependency-groups]
dev = [
"ipython>=8.37.0",
"joblib>=1.5.2",
"levenshtein>=0.27.3",
"pytest>=9.0.1",
"pytest-benchmark>=5.2.3",
]
[tool.uv]
cache-keys = [{file = "pyproject.toml"}, {file = "Cargo.toml"}, {file = "**/*.rs"}]
Notes:
- None of the dev dependencies are needed for the package to work. They are only for development and benchmarking purposes.
- Maturin automatically pulls the version information from Cargo.toml, so we can use dynamic = ["version"] in this file [4].
- The [tool.uv] section makes Python environment management tools such as uv track changes in Rust source files, so the uv run command will automatically rebuild the Rust code when changes are detected.
- Remember to update the description field if you plan to publish your package on PyPI.
Implementing the Rust code
For simplicity, we’ll use the pure Rust project layout [5] in this section. The trade-off is that Python type checkers won’t have type information for the Rust-powered function because we haven’t provided any type stubs.
We’ll implement the classic Wagner–Fischer algorithm to calculate the Levenshtein distance, which measures the edit distance between two strings.
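For example, the distance between “kitten” and “sitting” is 3 (two substitutions plus one insertion). The algorithm computes this with dynamic programming: letting D(i, j) be the distance between the first i characters of s1 and the first j characters of s2, the standard recurrence is

D(i, 0) = i
D(0, j) = j
D(i, j) = min( D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + cost ), where cost is 0 if s1[i] = s2[j] and 1 otherwise.

The code below is a direct transcription of this recurrence.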
Put this code in src/lib.rs (docstring omitted to save space):
use pyo3::prelude::*;
#[pyfunction]
pub fn levenshtein(s1: &str, s2: &str) -> usize {
let len1 = s1.chars().count();
let len2 = s2.chars().count();
// Handle edge cases: if either string is empty,
// the distance is the length of the other string
if len1 == 0 {
return len2;
}
if len2 == 0 {
return len1;
}
// Convert strings to char vectors to handle Unicode properly
let s1_chars: Vec<char> = s1.chars().collect();
let s2_chars: Vec<char> = s2.chars().collect();
// Create a matrix where matrix[i][j] represents the minimum edits
// needed to transform the first i characters of s1 into the first j characters of s2
let mut matrix = vec![vec![0; len2 + 1]; len1 + 1];
// Initialize first column: distance from empty string to s1[0..i]
// requires i deletions
for i in 0..=len1 {
matrix[i][0] = i;
}
// Initialize first row: distance from empty string to s2[0..j]
// requires j insertions
for j in 0..=len2 {
matrix[0][j] = j;
}
// Fill in the rest of the matrix using dynamic programming
for i in 1..=len1 {
for j in 1..=len2 {
// If characters match, no edit is needed (cost = 0)
// Otherwise, substitution is needed (cost = 1)
let cost = if s1_chars[i - 1] == s2_chars[j - 1] {
0
} else {
1
};
// Take the minimum of three possible operations:
matrix[i][j] = std::cmp::min(
std::cmp::min(
matrix[i - 1][j] + 1, // deletion: remove char from s1
matrix[i][j - 1] + 1, // insertion: add char to s1
),
matrix[i - 1][j - 1] + cost, // substitution: replace char in s1
);
}
}
// The bottom-right cell contains the final distance
matrix[len1][len2]
}
#[pymodule]
mod pyo3_levenshtein {
#[pymodule_export]
use super::levenshtein;
}
Run maturin develop --release --uv to build and install the Rust code and Python extension in your local environment. Afterwards, you can run uv run python -c "from pyo3_levenshtein import levenshtein; print(levenshtein('abc', 'ebd'))" to check if the levenshtein function is now available in Python. The output of this command should be 2 (two substitution operations).
Summary
We’ve finished implementing a Rust-powered Levenshtein distance function for Python! If you’re in a hurry, you can stop here: we’ve already achieved our goal of a much faster Levenshtein distance function in Python.
In the following sections, we’ll discuss benchmarking details, the multi-threaded implementation, Python type stubs, and further performance improvements.
Benchmarking
I use pytest-benchmark [6] in this project. It provides a benchmark fixture in pytest that times any function passed to it and prints the statistics at the end of the test run.
The benchmark dataset is synthetic and consists of string pairs with random lengths from 1 to 50 (not using strings of length 0 because they will trigger an early return), with characters sampled from all ASCII letters and digits. Because the Rust implementation is really fast, especially when run in multi-threaded mode, we use 2^14 (16,384) pairs of strings in the dataset and measure the time required for the program to calculate the distances for all pairs.
The benchmarking code looks like the following (find the full benchmarking script for the single-threaded implementations here):
from typing import Callable

from pyo3_levenshtein import levenshtein as pyo3_levenshtein_
TEST_DATASET_SIZE = 2**14
def run_levenshtein(test_dataset: list[tuple[str, str, int]], distance_func: Callable[[str, str], int]) -> list[int]:
"""Run a levenshtein distance function on a test dataset.
Args:
test_dataset: List of tuples containing (str1, str2, expected_distance)
distance_func: The levenshtein distance function to use
Returns:
List of computed distances
"""
result: list[int] = []
for str1, str2, _ in test_dataset:
result.append(distance_func(str1, str2))
return result
def test_pyo3_levenshtein(test_dataset: list[tuple[str, str, int]], benchmark):
    # The `benchmark` fixture (from pytest-benchmark) repeatedly times the
    # call and returns the function's result from one of the runs.
    res = benchmark(run_levenshtein, test_dataset, pyo3_levenshtein_)
assert len(res) == TEST_DATASET_SIZE
for pred, (_, _, gt) in zip(res, test_dataset):
assert pred == gt
You can run the benchmark with the following command: uv run pytest benchmarking/benchmarks_single.py --benchmark-max-time 10. This command will repeat the test until the specified maximum time elapses and then report the statistics.
I copied the pure-Python implementation from toastdriven/pylev into benchmarking/pure_python_impl.py as a baseline.
The single-threaded Python implementation of the Wagner–Fischer algorithm takes about 2,500 ms to process all 16,384 pairs of strings on average, while our Rust/PyO3 version takes about 40 ms on average. That’s a 60× speedup!
Sample benchmark output (v0.3.0); results vary slightly from the README due to statistical variance.
Multi-threaded implementation
I use the Rust crate rayon [7] to provide data parallelism. The parallel implementation is much more involved than the single-threaded one because it must explicitly release the Python GIL (Global Interpreter Lock) before distributing work to the thread pool.
I also added a global thread pool cache to avoid repeatedly creating thread pools during benchmarking. In practice, it would be more reasonable to use a single thread pool if the user is not expected to change the number of threads during the program’s lifetime.
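The get_or_create_pool helper used below is not included in the excerpt. As an illustration of the idea (not a copy of the repository’s actual helper), here is a minimal sketch of such a cache built on the once_cell and dashmap dependencies declared in Cargo.toml:

use std::sync::Arc;

use dashmap::DashMap;
use once_cell::sync::Lazy;
use rayon::{ThreadPool, ThreadPoolBuilder};

// Global map from thread count to a reusable Rayon thread pool.
static POOLS: Lazy<DashMap<usize, Arc<ThreadPool>>> = Lazy::new(DashMap::new);

fn get_or_create_pool(num_threads: usize) -> Result<Arc<ThreadPool>, String> {
    // Fast path: reuse a pool that was already built for this thread count.
    if let Some(pool) = POOLS.get(&num_threads) {
        return Ok(Arc::clone(pool.value()));
    }
    // Slow path: build and cache a new pool. Two callers racing here would
    // simply build the pool twice, and the map keeps the last one; harmless.
    let pool = Arc::new(
        ThreadPoolBuilder::new()
            .num_threads(num_threads)
            .build()
            .map_err(|e| e.to_string())?,
    );
    POOLS.insert(num_threads, Arc::clone(&pool));
    Ok(pool)
}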
Below is the multi-threaded levenshtein_batch function taken from version 0.3.0. I’ve omitted the doc comments to save space. You can find the full version on GitHub.
use rayon::prelude::*;
use pyo3::exceptions::PyValueError;
#[pyfunction(signature = (pairs, num_threads=None, grapheme_segmentation=false))]
pub fn levenshtein_batch(
py: Python<'_>,
pairs: Vec<(String, String)>,
num_threads: Option<usize>,
grapheme_segmentation: bool,
) -> PyResult<Vec<usize>> {
// Handle empty input
if pairs.is_empty() {
return Ok(Vec::new());
}
// Validate thread count if specified
if let Some(threads) = num_threads {
if threads == 0 {
return Err(PyValueError::new_err("num_threads must be at least 1"));
}
}
// Release GIL and perform computation
// The computation is entirely in Rust and doesn't need Python access
py.detach(|| {
if let Some(threads) = num_threads {
// Use cached thread pool for the specified thread count
let pool = get_or_create_pool(threads).map_err(PyValueError::new_err)?;
Ok(pool.install(|| {
pairs
.par_iter()
.map(|(s1, s2)| levenshtein(s1, s2, grapheme_segmentation))
.collect()
}))
} else {
// Use the global thread pool (lazily initialized and reused by Rayon)
// This avoids creating a new thread pool on every call
Ok(pairs
.par_iter()
.map(|(s1, s2)| levenshtein(s1, s2, grapheme_segmentation))
.collect())
}
})
}
A side effect of incorporating the rayon crate and interacting with the Python GIL is that the cargo test command now requires a Python interpreter and its development headers.
I created a shell script, run-cargo-tests.sh, that determines the paths to the Python interpreter and the Python library directory, sets the appropriate environment variables, and runs cargo test. You should run this script in your continuous integration (CI) environment instead of invoking cargo test directly.
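To make the requirement concrete, here is a hedged sketch of the kind of embedded-interpreter test that creates it: a plain Rust test must initialize a Python interpreter before it can hand a Python<'_> token to levenshtein_batch. (The repository’s actual tests may be structured differently.)

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn batch_matches_expected_distances() {
        // Embedding the interpreter is what makes `cargo test` depend on a
        // Python installation and its dev headers. The initialization call is
        // version-dependent; recent PyO3 releases expose Python::initialize().
        Python::initialize();
        Python::attach(|py| {
            let pairs = vec![("kitten".to_string(), "sitting".to_string())];
            let distances = levenshtein_batch(py, pairs, None, false).unwrap();
            assert_eq!(distances, vec![3]);
        });
    }
}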
The parallelism provided by the rayon crate is highly efficient. With a 12-thread pool, which in theory uses all hyperthreads on the six performance cores of my Core i5-13500 CPU, I see about a 6× speedup compared to the single-threaded version (6.5 ms vs. 40 ms). This suggests there is very little overhead.
The benchmarking code for the multi-threaded implementation is available at benchmarking/benchmarks_multi.py.
Python Type Stubs
To provide typing information to Python type checkers, we need to convert the project from a pure-Rust layout to a mixed Rust/Python layout [5].
We need to create a Python package in python/pyo3_levenshtein, which will contain the following files:
- __init__.py: Imports the Rust-powered module (this is done automatically by maturin in a pure-Rust layout) and provides version information.
- py.typed: An empty marker file that tells type checkers the package ships type information (PEP 561).
- pyo3_levenshtein.pyi: Provides type stubs for the Rust-powered module.
Contents of __init__.py:
from importlib import metadata
# Import the Rust-powered module
from .pyo3_levenshtein import * # noqa
try:
__version__ = metadata.version("pyo3-levenshtein")
except metadata.PackageNotFoundError:
# This handles the case where the package is imported
# without being installed (e.g., local script execution)
__version__ = "unknown"
Contents of pyo3_levenshtein.pyi (docstrings omitted to save space):
from typing import List, Tuple, Optional
def levenshtein(s1: str, s2: str, grapheme_segmentation: bool = False) -> int:
...
def levenshtein_batch(pairs: List[Tuple[str, str]], num_threads: Optional[int] = None, grapheme_segmentation: bool = False) -> List[int]:
...
Run maturin develop --release --uv again, and your IDE and type checker should no longer complain about missing type stubs for the levenshtein and levenshtein_batch functions.
Further Optimizations
Despite already being significantly faster than the pure-Python implementation, our Rust implementation is still much slower than the C++ implementation used in the Levenshtein Python package [8], which is powered by rapidfuzz-cpp [9]. I used Claude Code to summarize the optimizations employed in rapidfuzz-cpp and stored the report in rapidfuzz-levenshtein.md. Unfortunately, these techniques are too sophisticated to be incorporated into this learning project.
However, I found another Python package, jellyfish, that provides high-performance implementations of various string-comparison metrics (including Levenshtein distance); it also uses PyO3 and Maturin to integrate Rust code into the package. In version 0.3.0 of my pyo3-levenshtein package, I ported some of the techniques it uses and achieved a 1.7× speedup in single-threaded mode, along with much lower memory usage (a better space complexity). I’ve also added support for grapheme-cluster segmentation for more complex Unicode characters, but it incurs some performance overhead, so it’s disabled by default and excluded from the benchmarks.
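To give a flavor of the memory optimization: the full Wagner–Fischer matrix can be replaced by a single row plus one scalar, reducing the space complexity from O(len1 × len2) to O(len2). Below is a minimal sketch of this classic single-row variant; it illustrates the general idea rather than reproducing the exact v0.3.0 code:

fn levenshtein_single_row(s1: &[char], s2: &[char]) -> usize {
    // row[j] holds the distance between the current prefix of s1 and s2[..j].
    let mut row: Vec<usize> = (0..=s2.len()).collect();
    for (i, &c1) in s1.iter().enumerate() {
        // `diag` carries the value of the cell diagonally up-left
        // (matrix[i][j] in the full-matrix formulation).
        let mut diag = row[0];
        row[0] = i + 1;
        for (j, &c2) in s2.iter().enumerate() {
            let cost = if c1 == c2 { 0 } else { 1 };
            let next = std::cmp::min(
                std::cmp::min(row[j + 1] + 1, row[j] + 1), // deletion / insertion
                diag + cost,                               // substitution
            );
            diag = row[j + 1];
            row[j + 1] = next;
        }
    }
    row[s2.len()]
}

A common companion trick is to arrange for s2 to be the shorter of the two inputs so the row stays as small as possible.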
Explaining the details of these optimizations would require a whole new blog post. For now, I encourage you to read the code directly to learn more about them.
References
[1] PyO3 - Python Functions
[2] Maturin Tutorial
[3] Levenshtein Distance
[4] GitHub Issue: Add dynamic = ["version"] to pyproject.toml
[5] Maturin: Project Layout
[6] pytest-benchmark
[7] Crate rayon
[8] Levenshtein (formerly python-levenshtein)
[9] rapidfuzz-cpp
AI Use Disclosure
I used AI tools to revise my writing (primarily for grammar and lexical correctness) in this post. However, I wrote most of the content myself; it was not generated with prompts.