Decompressing Large Gzip Files with Deno

I've been working on a script to import a large dataset into Neolace. The dataset consists of huge .gz files containing JSON-encoded entities, with one JSON object per line. Our import script is written for Deno, so the initial version had some code like this:

import { gunzip } from "https://deno.land/x/compress@v0.4.1/mod.ts";

const compressed = await Deno.readFile("/path/to/data.gz");
const binaryData = gunzip(compressed);
const stringData = new TextDecoder().decode(binaryData);
const lines = stringData.split("\n");

for (const line of lines) {
  // JSON decode and import this line.
}

This works perfectly well until you try to process a file that's over 512MB. Then you'll start getting errors saying that the data exceeds the maximum string or buffer size, because this approach holds the entire file in memory at once.

In the course of fixing this, I found out that there's an even simpler way to decompress .gz files in Deno that doesn't require any third-party dependencies at all: DecompressionStream. Because it's a stream-based API, DecompressionStream processes the data chunk by chunk, so it works regardless of the file size.

Here's the new approach:

import { TextLineStream } from "https://deno.land/std@0.167.0/streams/text_line_stream.ts";

const fileHandle = await Deno.open("/path/to/data.gz");

const stream = fileHandle.readable
  .pipeThrough(new DecompressionStream("gzip"))
  .pipeThrough(new TextDecoderStream())
  .pipeThrough(new TextLineStream());
const reader = stream.getReader();

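Now you can just await reader.read() to get the data one line at a time, already decompressed, decoded into a Unicode string, and split into lines. Each call resolves to an object with a value (the line) and a done flag that becomes true once the stream is exhausted.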
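As a rough sketch of what that read loop might look like, the helper below takes a reader like the one above and yields one parsed value per line. The name parseJsonLines is made up for illustration and is not part of Deno, and skipping blank lines is a defensive assumption about the data rather than something the format guarantees.

```typescript
// Hypothetical helper (not part of Deno or its standard library):
// reads lines from a reader and yields one parsed JSON value per line.
async function* parseJsonLines(
  reader: ReadableStreamDefaultReader<string>,
): AsyncGenerator<unknown> {
  try {
    while (true) {
      const result = await reader.read();
      // done is true once the underlying stream has been fully consumed.
      if (result.done) break;
      const line = result.value;
      // Assumption: blank lines carry no data, so skip them instead of
      // letting JSON.parse throw on an empty string.
      if (line.trim() === "") continue;
      yield JSON.parse(line);
    }
  } finally {
    // Release the lock so the stream can be read or cancelled elsewhere.
    reader.releaseLock();
  }
}

// Possible usage with the reader created above:
//
//   for await (const entity of parseJsonLines(reader)) {
//     // Import this entity.
//   }
```

Only one line is held in memory at a time, so the size of the file no longer matters. A malformed line still makes JSON.parse throw, which stops the loop; whether to catch and skip such lines depends on the dataset.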

Using TextLineStream is important for use cases like mine. Without it, the decompressed text arrives in chunks that don't necessarily end on a line boundary, and you'd get an error when trying to parse an incomplete line.
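To make the chunk boundary problem concrete, here is a simplified sketch of the kind of buffering a line-splitting stream has to do. It is not the actual TextLineStream implementation, the name splitLines is hypothetical, and it assumes lines are separated by "\n" only (no "\r\n" handling).

```typescript
// Illustrative stand-in for a line-splitting stream. NOT the real
// TextLineStream; assumes "\n" is the only line separator.
function splitLines(): TransformStream<string, string> {
  // Holds the trailing partial line left over from the previous chunk.
  let buffered = "";
  return new TransformStream<string, string>({
    transform(chunk, controller) {
      buffered += chunk;
      const parts = buffered.split("\n");
      // The last element is an incomplete line, or "" if the chunk
      // happened to end exactly on a newline. Keep it for the next chunk.
      buffered = parts.pop() ?? "";
      for (const line of parts) {
        controller.enqueue(line);
      }
    },
    flush(controller) {
      // Emit whatever remains when the input ends without a final newline.
      if (buffered.length > 0) {
        controller.enqueue(buffered);
      }
    },
  });
}
```

If a chunk ends in the middle of a JSON object, the partial text stays in the buffer until the rest of the line arrives, so downstream code only ever sees complete lines. In practice, TextLineStream from the standard library is the better choice than hand-rolled code like this.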

And that worked perfectly: streaming decompression and decoding of huge files, with no third-party dependencies.

Updated on 2022-12-06: Changed this from custom code to handle the line joining to TextLineStream after @deno_land pointed it out on Twitter.