I built a 2x faster lexer, then discovered I/O was the real bottleneck

https://news.ycombinator.com/rss
Summary

I built an ARM64 assembly lexer (well, I generated one from my own parser generator, but this post is not about that) that processes Dart code 2x faster than the official scanner, a result I achieved using statistical methods to reliably measure small performance differences. Then I benchmarked it on 104,000 files and discovered my lexer was not the bottleneck. I/O was. This is the story of how I accidentally learned why pub.dev stores packages as tar.gz files.

The setup

I wanted to benchmark my lexer against the official Dart scanner. The pub cache on my machine had 104,000 Dart files totaling 1.13 GB, a perfect test corpus. I wrote a benchmark that:

- Reads each file from disk
- Lexes it
- Measures time separately for I/O and lexing

Simple enough.

The first surprise: lexing is fast

Here are the results:

| Metric         | ASM Lexer | Official Dart |
|----------------|-----------|---------------|
| Lex time       | 2,807 ms  | 6,087 ms      |
| Lex throughput | 402 MB/s  | 185 MB/s      |

My lexer was 2.17x faster. Success! But wait:

| Metric        | ASM Lexer | Official Dart |
|---------------|-----------|---------------|
| I/O time      | 14,126 ms | 14,606 ms     |
| Total time    | 16,933 ms | 20,693 ms     |
| Total speedup | 1.22x     | -             |

The total speedup was only 1.22x. My 2.17x lexer improvement was being swallowed by I/O. Reading files took 5x longer than lexing them.

The second surprise: the SSD is not the bottleneck

My MacBook has an NVMe SSD that can read at 5-7 GB/s. I was getting 80 MB/s. That is 1.5% of the theoretical maximum.

The problem was not the disk. It was the syscalls. For 104,000 files, the operating system had to execute:

- 104,000 open() calls
- 104,000 read() calls
- 104,000 close() calls

That is over 300,000 syscalls. Each syscall involves:

- A context switch from user space to kernel space
- Kernel bookkeeping and permission checks
- A context switch back to user space

Each syscall costs roughly 1-5 microseconds. Multiply that by 300,000 and you get 0.3-1.5 seconds of pure overhead, before any actual disk I/O happens. Add filesystem metadata lookups, directory traversal, and you understand where the time goes.
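The measurement approach and the arithmetic above can be sketched in Python. This is only a minimal illustration under stated assumptions, not the author's benchmark, which is not shown in this excerpt. The names `benchmark_files`, `total_speedup`, and `syscall_overhead_seconds` are hypothetical, and the `process` callback is a stand-in for a real lexer.

```python
import time
from typing import Callable, Dict, Iterable


def benchmark_files(paths: Iterable[str],
                    process: Callable[[bytes], object]) -> Dict[str, float]:
    """Read and process each file, timing the two phases separately."""
    io_seconds = 0.0
    process_seconds = 0.0
    total_bytes = 0
    file_count = 0
    for path in paths:
        # I/O phase: at least one open, one read, and one close per file.
        start = time.perf_counter()
        with open(path, "rb") as handle:
            data = handle.read()
        io_seconds += time.perf_counter() - start

        # Processing phase: stand-in for the lexer (assumption).
        start = time.perf_counter()
        process(data)
        process_seconds += time.perf_counter() - start

        total_bytes += len(data)
        file_count += 1
    return {
        "files": file_count,
        "bytes": total_bytes,
        "io_seconds": io_seconds,
        "process_seconds": process_seconds,
    }


def total_speedup(base_io_ms: float, base_lex_ms: float,
                  new_io_ms: float, new_lex_ms: float) -> float:
    """End-to-end speedup when I/O time is included in both totals."""
    return (base_io_ms + base_lex_ms) / (new_io_ms + new_lex_ms)


def syscall_overhead_seconds(file_count: int,
                             calls_per_file: int = 3,
                             micros_per_call: float = 1.0) -> float:
    """Rough syscall overhead: files x calls per file x cost per call."""
    return file_count * calls_per_file * micros_per_call / 1_000_000


# Figures taken from the tables above (milliseconds).
print(round(6087 / 2807, 2))                               # lex-only speedup: 2.17
print(round(total_speedup(14606, 6087, 14126, 2807), 2))   # end to end: 1.22
# 104,000 files x 3 calls at 1 and 5 microseconds per call.
print(syscall_overhead_seconds(104_000, micros_per_call=1.0))
print(syscall_overhead_seconds(104_000, micros_per_call=5.0))
```

Keeping the two timers separate is what exposes the imbalance: with the I/O term sitting in both the numerator and the denominator, even a large lexing speedup moves the total only slightly.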
I tried a few things that did not hel...

First seen: 2026-01-25 08:54

Last seen: 2026-01-25 10:54