
How we built a WebAssembly CSV parser

Huge CSV opens CSV files that crash spreadsheets — tens of gigabytes, hundreds of millions of rows — entirely in your browser, with nothing uploaded. The part that has to read every byte is written in C and compiled to a WebAssembly module under 4 KB. Here is the source, the trick that lets it parse chunks in parallel, the exact compile command, and a benchmark where it runs 2.3× faster than the same logic in JavaScript.

webassembly · csv · rfc 4180 · clang wasm32 · performance

What the parser actually does

When you drop a CSV onto Huge CSV, we never build an in-memory array of rows — a 21 GB file would never fit. Instead we make a single streaming pass over the bytes that records just enough to render and scroll the file: how many rows there are, the byte offset of periodic row boundaries so the virtual scroller can jump straight to row eighty million, and the widest cell in each column so the table can auto-fit. That pass is the only work that scales with file size, so it is the work worth making fast.

The file is split into chunks, each chunk is handed to a Web Worker, and the workers run the pass in parallel while the main thread stays free to paint. Inside each worker the hot loop is a byte-at-a-time RFC 4180 state machine. That loop is where all the time goes, and it is why it is written in WebAssembly rather than JavaScript.
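As a rough illustration of the splitting step, here is a minimal sketch that plans the byte ranges for a file of a given size, using the 256 KB block size the pipeline works with. The names `planBlocks` and `BLOCK_SIZE` are hypothetical, and the hand-off to workers is only described in a comment, since the article does not show that code.

```javascript
// Sketch only: planBlocks and BLOCK_SIZE are illustrative names,
// not identifiers from the Huge CSV source.
const BLOCK_SIZE = 256 * 1024; // 256 KB

// Split [0, fileSize) into consecutive half-open byte ranges.
function planBlocks(fileSize, blockSize = BLOCK_SIZE) {
  const blocks = [];
  for (let start = 0, index = 0; start < fileSize; start += blockSize, index++) {
    blocks.push({ index, start, end: Math.min(start + blockSize, fileSize) });
  }
  return blocks;
}

// In a browser, each range would typically become file.slice(start, end)
// and be posted to a worker; here we only print the plan.
const plan = planBlocks(1000000);
console.log(plan.length, plan[plan.length - 1]);
```

Keeping the plan as plain `{index, start, end}` records makes it easy to reassemble results in file order, whichever worker finishes first.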

Why WebAssembly

WebAssembly is a compact binary instruction format that every modern browser compiles to real machine code. You write C (or Rust, or Zig), compile to a .wasm file, and the engine turns it into native code for the CPU it is running on. For a tight byte loop that matters for two reasons. First, there is no garbage collector pausing in the middle of a hundred-million-iteration scan. Second, the performance is predictable: the code is statically typed, so there is no type-speculating JIT to warm up and no deoptimization when a hidden class changes; it is the same code path every time. A WASM module also has no DOM and no network; its whole world is one flat array of bytes called linear memory, which is exactly the shape of "here is a block of bytes, scan it." A sandboxed number-cruncher living inside a worker is precisely what we want.

Unlike our sibling JSON parser, which keeps a readable JavaScript engine alongside the WASM one, the CSV viewer is WASM-only: the worker loads the module, checks it, and parses. The JavaScript version below exists only as the yardstick we benchmark against, to show why the C path earns its place.

The interesting problem: chunks that start mid-quote

Here is the wrinkle that makes a streaming CSV parser harder than it looks. We do not hand the parser a whole file — we hand each worker a 256 KB block, and a block can begin anywhere. RFC 4180 lets a field be quoted, and a quoted field can contain commas and even literal newlines. So a block boundary can land in the middle of a quoted cell, and a lone newline just after it might be ordinary text inside that cell, not a row terminator. The block has no way to know which — that depends on a block it never sees.

Our answer is to run both hypotheses at once. Every chunk is parsed twice in lockstep: H0 assumes the chunk starts outside a quoted field, H1 assumes it starts inside one. Each hypothesis tracks its own row count, column widths and quote state. When the chunks are stitched back together, each chunk reports whether it ended inside a quote, and that picks the surviving hypothesis for the next chunk. (Our JSON parser needs 25 such entry scenarios; CSV needs only these two.) That two-for-one design is the reason the loop has to be cheap — we pay for it twice on every byte.
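The stitching idea can be sketched in a few lines of JavaScript. This is an illustration, not the production parser: it only counts rows, it scans each hypothesis in a separate pass rather than in lockstep, and it simplifies quote handling by toggling the state on every quote byte (which also covers the doubled-quote escape, since two toggles cancel out). The names `scanChunk`, `scanBoth` and `stitch` are hypothetical.

```javascript
// Illustrative sketch, not the production parser.
const QUOTE = 0x22, LF = 0x0a;

// Scan one chunk under one starting assumption.
function scanChunk(bytes, startsInQuote) {
  let inQuote = startsInQuote, rows = 0;
  for (const b of bytes) {
    if (b === QUOTE) inQuote = !inQuote;
    else if (b === LF && !inQuote) rows++;
  }
  return { rows, endsInQuote: inQuote };
}

// Run both hypotheses for a chunk: H0 starts outside a quote, H1 inside.
function scanBoth(bytes) {
  return { h0: scanChunk(bytes, false), h1: scanChunk(bytes, true) };
}

// Stitch results in file order: the file itself starts outside a quote,
// and each chunk's surviving end state selects the next chunk's hypothesis.
function stitch(results) {
  let inQuote = false, rows = 0;
  for (const r of results) {
    const survivor = inQuote ? r.h1 : r.h0;
    rows += survivor.rows;
    inQuote = survivor.endsInQuote;
  }
  return rows;
}

const bytes = new TextEncoder().encode('a,"x\ny",b\nc,d\n');
const cut = 4; // the boundary lands inside the quoted cell
const total = stitch([scanBoth(bytes.subarray(0, cut)), scanBoth(bytes.subarray(cut))]);
console.log(total); // 2
```

For the second chunk in this example the two hypotheses disagree (H0 counts one row, H1 counts two), and only the first chunk's end state tells the stitcher that H1 is the right one.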

// csv-parser.c — the unquoted-context core of the state machine.
// `delim` is the active delimiter; `h` is one hypothesis' state.
if (b == delim) { end_field(h); continue; }
if (b == BYTE_LF) {                       // row terminator outside quotes
  end_record(h);
  // the next row starts at relStart + i + 1, with h->rows rows before it:
  // that's this block's seek sample for the virtual scroller.
  if (needSample) { hyp_push_sample(h, relStart + i + 1u, h->rows, 1); needSample = 0; }
  continue;
}
if (b == BYTE_CR) { continue; }                  // CRLF: skip CR, the LF ends the row
if (b == BYTE_QUOTE && !h->curFieldStarted) {
  h->inQuote = 1; h->curFieldStarted = 1; continue;  // an opening quote
}
if ((b & 0xC0u) != 0x80u) h->curWidth++;     // count a UTF-8 codepoint, not a byte
h->curFieldStarted = 1;

The whole parser is one freestanding C file, sites/csv/wasm/csv-parser.c.

Two details in that excerpt are worth calling out. The width counter only increments when (b & 0xC0) != 0x80, i.e. it skips UTF-8 continuation bytes, so a column of Japanese or Ukrainian text is measured in code points, not bytes, and auto-fit stays sensible for non-Latin scripts. And every block emits exactly one record-aligned "seek sample", a {byteOffset, rowsBefore} pair, which is the sparse index the virtual scroller binary-searches to map a scrollbar position to a byte offset without ever re-reading the file.
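Both details are easy to try out in isolation. The sketch below mirrors the continuation-byte test from the C excerpt and shows one plausible way to binary-search a list of seek samples. The function names and the sample values are made up for illustration, and the article does not describe how rows between two samples are reached, so the lookup only returns the nearest sample at or before the target row.

```javascript
// Sketch with hypothetical names; the sample shape {byteOffset, rowsBefore}
// follows the article, but the sample values below are invented.

// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
function codePointWidth(bytes) {
  let width = 0;
  for (const b of bytes) if ((b & 0xc0) !== 0x80) width++;
  return width;
}

// Binary search for the last sample whose rowsBefore <= targetRow.
// Assumes samples are sorted by rowsBefore (file order) and non-empty.
function findSample(samples, targetRow) {
  let lo = 0, hi = samples.length - 1, best = 0;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (samples[mid].rowsBefore <= targetRow) { best = mid; lo = mid + 1; }
    else hi = mid - 1;
  }
  return samples[best];
}

const enc = new TextEncoder();
const cell = enc.encode("日本語");
console.log(cell.length, codePointWidth(cell)); // 9 3

const samples = [
  { byteOffset: 0, rowsBefore: 0 },
  { byteOffset: 262150, rowsBefore: 2600 },
  { byteOffset: 524301, rowsBefore: 5230 },
];
const s = findSample(samples, 3000);
console.log(s.byteOffset, 3000 - s.rowsBefore); // 262150 400
```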

Freestanding C: no libc, no malloc

The module is freestanding — it links against no standard library at all. There is no malloc; all analyzer-scoped memory comes from a bump allocator that just walks a pointer forward through linear memory, and resetting it is moving the pointer back to the start. A worker can parse ten thousand chunks in a row without leaking a byte, because each new analyzer resets the arena.
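To make the bump-allocator idea concrete, here is a small JavaScript model of it over an ArrayBuffer. The real allocator is C inside the module and is not shown in the article, so `BumpArena`, its methods and the 8-byte default alignment are all assumptions made for the sake of the example.

```javascript
// Illustrative model only: the real allocator lives in the C module.
// BumpArena and its methods are hypothetical names.
class BumpArena {
  constructor(buffer, base = 0) {
    this.buffer = buffer; // stands in for WebAssembly linear memory
    this.base = base;     // first byte the arena may use
    this.top = base;      // next free byte
  }
  // Reserve `size` bytes aligned to `align` (a power of two).
  alloc(size, align = 8) {
    const start = (this.top + align - 1) & ~(align - 1);
    if (start + size > this.buffer.byteLength) throw new RangeError("arena exhausted");
    this.top = start + size;
    return start; // an offset into the buffer, like a C pointer
  }
  // Free everything at once by moving the pointer back to the start.
  reset() { this.top = this.base; }
}

const arena = new BumpArena(new ArrayBuffer(64 * 1024));
const a = arena.alloc(12); // offset 0
const b = arena.alloc(4);  // offset 16: 12 rounded up to the next multiple of 8
arena.reset();
const c = arena.alloc(12); // offset 0 again: the same bytes are reused
console.log(a, b, c); // 0 16 0
```

There is no per-allocation free: the only way to release memory is `reset()`, which is why resetting at the start of each analyzer is enough to keep a long-running worker from growing.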

// No stdlib. A single bump arena owns all analyzer-scoped data;
// new_analyzer() resets it so long-running workers never leak.
#include <stdint.h>
#include <stddef.h>

// clang may still lower a big struct copy to memcpy under -O3, and with
// -nostdlib there is nothing to bind it to — so we provide our own.
void *memcpy(void *dst, const void *src, size_t n) {
  uint8_t *d = (uint8_t *)dst; const uint8_t *s = (const uint8_t *)src;
  for (size_t i = 0; i < n; i++) d[i] = s[i];
  return dst;
}

Providing tiny memcpy/memset ourselves keeps the linker happy under -nostdlib.

This bit us once: even with -fno-builtin, the compiler emitted a memcpy call for a struct copy, and because there was no memcpy to link, the module failed to instantiate — and the worker silently produced nothing. A freestanding build means you own every symbol the optimizer might reach for.

How we compile it

We compile straight with LLVM's clang, which ships a wasm32 backend, and link with wasm-ld. No Emscripten runtime, no wasm-bindgen, no npm toolchain in the hot path — the whole build is one script, and the flags are short:

# Flag notes (kept out of the command itself, because a comment after a
# trailing backslash breaks the line continuation):
#   -msimd128        enable the 128-bit SIMD instruction set
#   -nostdlib        freestanding: no libc
#   -Werror          the parser builds clean or not at all
#   -O3 -flto        optimize hard, link-time optimization
#   -Wl,--no-entry   a library, not a program with main()
clang \
  --target=wasm32 \
  -msimd128 \
  -nostdlib \
  -fno-builtin \
  -Wall -Wextra -Werror \
  -O3 -flto \
  -Wl,--no-entry \
  -Wl,--export-dynamic \
  -Wl,--strip-all \
  -o src/wasm/csv-parser.wasm wasm/csv-parser.c

The whole module — the full RFC 4180 state machine, both hypotheses, the width and seek-sample bookkeeping — comes out to 3,864 bytes. That is the entire parser in less space than this paragraph's worth of JavaScript. The .wasm is committed to the repository on purpose: anyone cloning the project, and the static host serving it, never needs clang installed.

One honest note on -msimd128: we turn SIMD on so the optimizer is free to use it, but unlike the JSON parser — which hand-writes 16-byte-wide vector scans — the CSV core is a straightforward scalar state machine. CSV's grammar branches on almost every byte (delimiter? quote? newline? continuation?), so there is far less of the long single-character run that SIMD vectorizes so well in JSON string bodies. The CSV win comes from branch-predictable native code, no GC, and a tight per-byte loop, not from hand-vectorization. It is a candidate for a future SIMD pass over quoted-field interiors; for now the scalar loop is already comfortably ahead of JavaScript.

How we ship and guard it

The browser fetches the .wasm and instantiates it once per worker, then refuses to use a module that is not exactly the one it expects — before it ever touches a byte:

const { instance } = await WebAssembly.instantiate(bytes, {});
const exports = instance.exports;

// Catch an ABI drift or a truncated module at load time, not from a
// wrong row count three million rows into a file.
if ((exports.parser_abi_version() >>> 0) !== EXPECTED_ABI)
  throw new Error('WASM ABI mismatch — rebuild with `npm run build:wasm`');
const probe = 0x01020304;
if ((exports.ping(probe) >>> 0) !== ((probe ^ 0xA5A5A5A5) >>> 0))
  throw new Error('WASM ping sanity check failed');

The C side and the JS loader agree on a struct, CsvSnapshot, that the loader reads field-by-field through a DataView. The version number and the ping round-trip are cheap insurance that the two layouts have not drifted. Once the page has loaded, a service worker caches the app and the .wasm, so the parser keeps working with no network at all.
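Reading a C struct out of linear memory through a DataView looks roughly like the sketch below. The article does not publish CsvSnapshot's fields, so the layout here (three 32-bit fields named `rows`, `endedInQuote` and `maxWidth`) is purely hypothetical, and a plain ArrayBuffer stands in for the module's memory.

```javascript
// Hypothetical layout, for illustration only. These field names and
// offsets are assumptions, not the real CsvSnapshot:
//   offset 0: uint32 rows
//   offset 4: uint32 endedInQuote (0 or 1)
//   offset 8: uint32 maxWidth
const SNAPSHOT_SIZE = 12;

function readSnapshot(memoryBuffer, ptr) {
  const view = new DataView(memoryBuffer, ptr, SNAPSHOT_SIZE);
  // WebAssembly linear memory is little-endian, hence the `true` flag.
  return {
    rows: view.getUint32(0, true),
    endedInQuote: view.getUint32(4, true) !== 0,
    maxWidth: view.getUint32(8, true),
  };
}

// Stand-in for instance.exports.memory.buffer, filled the way C would fill it.
const memory = new ArrayBuffer(64);
const ptr = 16;
const writer = new DataView(memory);
writer.setUint32(ptr + 0, 1066800, true);
writer.setUint32(ptr + 4, 0, true);
writer.setUint32(ptr + 8, 42, true);

console.log(readSnapshot(memory, ptr));
```

Because every offset is hard-coded on the JavaScript side, any change to the struct in C silently shifts the fields, which is the kind of drift a version check is there to catch.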

The benchmark

Talk is cheap, so here is the run. We built a ~106 MB CSV (1,066,800 rows, mixed Latin/Japanese/Cyrillic cells, with a share of quoted fields containing commas and embedded newlines) and fed it through the real pipeline, with 256 KB blocks, both hypotheses and the full finalize, against an equivalent JavaScript state machine. To keep it apples-to-apples, the JS baseline does the same two-hypothesis work the production WASM path does; a single-hypothesis JS run is listed for reference. Best of six iterations on this machine:

| Engine (106 MB, 1.07 M rows) | Wall time | Throughput | Relative |
|---|---|---|---|
| JavaScript, 1 hypothesis | 382 ms | 278 MB/s | |
| JavaScript, 2 hypotheses (production-equivalent) | 774 ms | 137 MB/s | 1.0× |
| WASM, 2 hypotheses | 343 ms | 309 MB/s | 2.26× |

Same file, same work, same machine: the only difference is whether the byte loop is JavaScript or WebAssembly. The WASM module does the full two-hypothesis parse at 309 MB/s against the equivalent JavaScript's 137 MB/s, roughly 2.3× faster (2.26× by wall time, 774 ms versus 343 ms). We expect the gap to hold or widen on bigger files, where the per-byte loop dominates everything else. On disk-bound multi-gigabyte files the wall-clock is gated by read speed, so the parser is not the bottleneck.
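The ratios can be rechecked directly from the published figures. This small sketch only redoes the arithmetic on the numbers in the table; the variable names are ours.

```javascript
// Figures copied from the benchmark table above.
const runs = {
  js1: { ms: 382, mbps: 278 },   // JavaScript, 1 hypothesis
  js2: { ms: 774, mbps: 137 },   // JavaScript, 2 hypotheses
  wasm2: { ms: 343, mbps: 309 }, // WASM, 2 hypotheses
};

const speedupByTime = runs.js2.ms / runs.wasm2.ms;           // 774 / 343
const speedupByThroughput = runs.wasm2.mbps / runs.js2.mbps; // 309 / 137
const secondHypothesisCost = runs.js2.ms / runs.js1.ms;      // 774 / 382

console.log(
  speedupByTime.toFixed(2),
  speedupByThroughput.toFixed(2),
  secondHypothesisCost.toFixed(2)
); // 2.26 2.26 2.03
```

The last figure shows that, in JavaScript, adding the second hypothesis roughly doubles the wall time, which matches the "we pay for it twice on every byte" framing.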

One thing this table is not: a comparison against String.prototype.split or a library that builds a full array of row objects. Those measure a different, larger job: materializing every cell as a string. Our number is the cost of the streaming validate-and-index pass the viewer actually needs, nothing more.

The short version

Huge CSV opens CSV files far too large for a spreadsheet, entirely in your browser, with nothing uploaded. The expensive step is one streaming pass over every byte to count rows, index row boundaries for the scroller, and measure column widths for auto-fit. We wrote that pass in freestanding C, as an RFC 4180 state machine that runs two boundary hypotheses at once so 256 KB chunks can be parsed in parallel, and compiled it with clang to a 3.8 KB WebAssembly module. On the ~106 MB benchmark file it does the work at 309 MB/s, about 2.3× the speed of the same logic in JavaScript, and it keeps running offline once the page is cached. You can reproduce the input with the generator and watch the parser at work in the viewer.
