Go to the homepage

Looping over indexOf() matches

by Tim Severien

The other day, I was writing an RFC 4180-compliant CSV parser for fun. This file format uses delimiters to separate fields and records (i.e. columns and rows respectively). A field (i.e. cell) may contain one of these delimiters, so parsing is slightly more involved than a string.split().

In my first and naive implementation, I read over every character as I wrote it to the buffer. This allowed me to keep track of a simple state machine and flush the fields and records the moment a delimiter is hit. In the vast majority of these iterations, the symbol isn’t a delimiter, so it only gets appended to the buffer. That is very wasteful and noticeable in the benchmark where I compared it to other CSV parsers.

A more top-down approach is to jump directly between potential delimiters in the buffer instead of inspecting every character. For every candidate, we can do more analysis to figure out whether it’s part of a field’s content or a true delimiter.

This is a great example of something you want to do iteratively. A buffer may or may not contain a delimiter, we have to know the delimiter’s location to be able to do more analysis, and looking up all matches when the first is likely good enough is wasteful. The iterative approach and desire to stop early makes this a great use-case for generator functions!

We can write a generic utility function that is useful beyond parsing CSV files to loop over indexOf() calls.

Here’s a simplified/less optimised version of what I’m using:

function* indexOfAll(string: string, substring: string): Generator<number> {
	// Early return for cases where we're not going to find any matches
	if (
		string.length === 0 ||
		substring.length === 0 ||
		substring.length > string.length
	) {
		return;
	}

	let index = -1;
	let offset = 0;

	// 1. Get the index of the first occurrence of `substring` in `string`, starting at `offset`
	// 2. Assign `indexOf()` return value to `index`
	// 3. Loop while `index > -1`
	while ((index = string.indexOf(substring, offset)) > -1) {
		yield index;

		// For next iteration, we want to start looking after current match, so set offset accordingly
		offset = index + substring.length;
	}
}

And then we can use that in a for…of loop:

const string = 'Hello world';
const substring = 'l';

for (const index of indexOfAll(string, substring)) {
	console.log(index); // 2, 3, 9

	// If we're not interested in other results, we can `break` out of the loop
}

This is not nearly enough to get my CSV parser up to par with alternatives, but it’s a small, reusable optimization that removes work from a hot loop.