Performance

Loft programs can run in three execution modes — the interpreter, native code (compiled to Rust via --native), and WebAssembly (--native-wasm). The table below shows wall-clock times for ten micro-benchmarks run on the same machine, with CPython 3 and a hand-written Rust reference as comparison baselines.

| Benchmark | Python 3 (CPython) | loft interpreter | loft native (rustc -O) | loft wasm (wasm32-wasip2) | Rust reference (rustc -O) |
|---|---:|---:|---:|---:|---:|
| 01 fibonacci (recursive, n=38) | 3395 ms | 4819 ms | 169 ms | 257 ms | 92 ms |
| 02 sum loop (10 M integers) | 66 ms | 584 ms | 15 ms | 21 ms | 8 ms |
| 03 prime sieve (trial division, n=100 000) | 49 ms | 141 ms | 4 ms | 6 ms | 4 ms |
| 04 Collatz lengths (1..1 M) | 7393 ms | 14379 ms | 334 ms | 599 ms | 149 ms |
| 05 Mandelbrot (200×200, 256 iter) | 135 ms | 344 ms | 7 ms | 10 ms | 6 ms |
| 06 Newton sqrt (1 M calls) | 1481 ms | 3437 ms | 159 ms | 159 ms | 152 ms |
| 07 string build (500 K appends) | 70 ms | 61 ms | 33 ms | 68 ms | 23 ms |
| 08 word frequency (hash map) | 46 ms | 169 ms | 32 ms | 60 ms | 2 ms |
| 09 dot product (5 M floats) | 158 ms | 428 ms | 36 ms | 86 ms | 3 ms |
| 10 insertion sort (3 000 integers) | 131 ms | 291 ms | 29 ms | 56 ms | 4 ms |

Measured on a single core, Linux x86-64. Times are wall clock from a single warm run (no averaging). Run bench/run_bench.sh from the project root to reproduce.

Key takeaways

The three loft execution modes have very different performance profiles depending on what the program does.

loft interpreter vs Python

The loft interpreter runs your program directly, with no separate compilation step. For tight numeric loops it is slower than CPython — typically 1.4–9× — because CPython's core is highly optimised C, while loft's interpreter is Rust that must also check the type of each value before acting on it. The gap is largest for integer-heavy workloads (sum loop: 9×) and narrows for float arithmetic (Mandelbrot: 2.5×). Notably, string building is faster in the loft interpreter (61 ms vs Python's 70 ms) because loft's format-string concatenation creates fewer temporary objects.
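
As a rough Rust analogy (this is not loft's implementation, just the shape of the difference), building a string through a fresh temporary per append costs far more than growing one buffer in place:

```rust
// Sketch only: one temporary object per append vs in-place growth.
fn build_with_temporaries(parts: &[&str]) -> String {
    let mut s = String::new();
    for p in parts {
        // format! allocates a brand-new String every iteration and copies
        // all of `s` into it: one temporary per append.
        s = format!("{s}{p}");
    }
    s
}

fn build_in_place(parts: &[&str]) -> String {
    let mut s = String::new();
    for p in parts {
        s.push_str(p); // grows the existing buffer; no temporary
    }
    s
}
```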

loft native vs Rust

The native pipeline compiles loft source to Rust via --native or --native-release, then invokes rustc -O. For pure floating-point workloads the generated Rust is essentially as fast as hand-written Rust (Newton sqrt: 159 ms vs 152 ms; Mandelbrot: 7 ms vs 6 ms). For integer arithmetic the gap is 1.8–2.5×, and for data-structure workloads (word frequency, dot product, insertion sort) it widens to 7–16×. The bottleneck in those cases is the codegen_runtime layer: the generated code calls runtime helpers (hash lookup, text comparison, vector indexing) that carry more overhead than the idiomatic Rust equivalents.
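
The shape of that overhead, sketched in Rust (the helper name and store layout here are invented for illustration; the real helpers live in src/codegen_runtime.rs):

```rust
use std::collections::HashMap;

// Hypothetical store-based helper in the style the generated code calls.
struct Store {
    maps: Vec<HashMap<String, i64>>,
}

// Each lookup pays a bounds-checked store indirection plus a null-sentinel
// encoding on top of the actual hash lookup.
fn rt_map_get(store: &Store, handle: usize, key: &str) -> i64 {
    let map = &store.maps[handle]; // indirection through the store
    match map.get(key) {
        Some(v) => *v,
        None => i64::MIN, // null sentinel instead of Option
    }
}

// What the hand-written Rust benchmark does instead: one direct call.
fn direct_get(map: &HashMap<String, i64>, key: &str) -> Option<i64> {
    map.get(key).copied()
}
```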

loft wasm vs loft native

WebAssembly adds a modest 1.5–2× overhead over native for most workloads. The exception is floating-point throughput: Newton sqrt runs at identical speed in wasm and native (both 159 ms) because the bottleneck is the FPU, not the wasm runtime. String building is slower in wasm (68 ms) than in native (33 ms) due to wasm's memory model for dynamic strings. Wasm is a good target when native compilation is unavailable — it runs anywhere wasmtime or a browser is available.

Current bottlenecks

Interpreter — bytecode dispatch overhead. The interpreter executes one instruction at a time and checks the type of each value before every operation; there is no just-in-time (JIT) compiler or other mechanism to speed up code that runs repeatedly. Each iteration of a tight loop pays this per-instruction cost, which is the dominant overhead for sum loop and Collatz.
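
A minimal sketch of that per-instruction cost (the opcode and value layout are invented for the example, not loft's actual bytecode):

```rust
// Invented value and opcode types, for illustration only.
enum Value { Int(i64), Float(f64) }
enum Op { Add }

fn step(op: &Op, a: Value, b: Value) -> Value {
    match op {                        // 1. dispatch on the opcode
        Op::Add => match (a, b) {     // 2. then check both operand types
            (Value::Int(x), Value::Int(y)) => Value::Int(x + y),
            (Value::Float(x), Value::Float(y)) => Value::Float(x + y),
            _ => panic!("type error"),
        },
    }
}
// Compiled code reduces an i64 `x + y` to a single machine instruction;
// the interpreter runs all of `step` for every iteration of a hot loop.
```
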
Interpreter — long arithmetic always uses the i64 sentinel path. The Collatz benchmark uses long for range safety. Loft's long arithmetic goes through a separate opcode path with additional null-sentinel checks, making it roughly 2× slower than integer arithmetic in the interpreter. The native path narrows this gap considerably (334 ms vs 149 ms for hand-written Rust).
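
Sketched in Rust (the sentinel encoding here is an assumption for the example, not loft's actual representation):

```rust
const LONG_NULL: i64 = i64::MIN; // hypothetical null sentinel

// The long path: every operation tests both operands before the arithmetic.
fn long_add(a: i64, b: i64) -> i64 {
    if a == LONG_NULL || b == LONG_NULL {
        return LONG_NULL;
    }
    a.wrapping_add(b)
}

// The plain integer path: just the add.
fn int_add(a: i64, b: i64) -> i64 {
    a.wrapping_add(b)
}
```
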
Native — codegen_runtime helper overhead for data structures. Hash lookup (word frequency: 32 ms native vs 2 ms Rust, 16×), vector indexing (dot product: 36 ms vs 3 ms, 12×), and sorting (insertion sort: 29 ms vs 4 ms, 7×) all go through helpers in src/codegen_runtime.rs. These helpers perform bounds checks, null-sentinel tests, and store-pointer indirections that hand-written Rust avoids. Eliminating these indirections for simple in-memory collections is a planned optimisation.

Native — function call overhead for recursive code. The recursive Fibonacci benchmark is 1.8× slower in native loft (169 ms) than in hand-written Rust (92 ms) because the generated code passes stores and additional runtime context through every call frame. Reducing per-call overhead for pure functions that do not touch the heap is a planned optimisation.
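
A hypothetical before/after of the call shape (names and context layout invented for illustration):

```rust
// Roughly what the generated code does: a runtime context is threaded
// through every frame even though this function never touches the heap.
struct RuntimeCtx { /* stores, heap handles, ... */ }

fn fib_generated(ctx: &mut RuntimeCtx, n: i64) -> i64 {
    if n < 2 {
        return n;
    }
    // Every recursive call also passes and spills the context pointer.
    fib_generated(ctx, n - 1) + fib_generated(ctx, n - 2)
}

// What a human writes: only the argument crosses each call boundary.
fn fib(n: i64) -> i64 {
    if n < 2 { n } else { fib(n - 1) + fib(n - 2) }
}
```
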
wasm — dynamic string memory model. String building in wasm (68 ms) is roughly 2× slower than in native (33 ms): dynamic strings in the wasm build are heap-allocated inside wasm's linear memory with an extra layer of indirection compared to a native Rust String. This is a structural limitation of the wasm target; the loft wasm build prioritises compatibility over raw string throughput.

What is planned