What Building a Goodreads Import Taught Me About Book Data

One of the fun parts of building InkTree has been discovering just how strange the seemingly simple world of book cataloguing actually is.

In the last couple of posts, I talked about why star ratings fail readers and why I'm building InkTree. Those were about the semantics — how we talk about books, what ratings miss, why dimensions matter. That stuff was easy. Conceptually clear, fun to write about.

Then I started building the Goodreads import. Take a reader's CSV export, parse a few hundred books, put them somewhere useful. Simple, right?

I went in assuming book data was a solved problem. Libraries have been cataloguing books for centuries. Dewey Decimal. Library of Congress. There are literal international standards for this.

It is not a solved problem.

Here's the first thing that broke my brain: there's a difference between a book and a work. Your paperback copy of Stoner? That's a book — the Picador edition, specific ISBN, specific cover, specific page count. But "Stoner" by John Williams — the thing he actually wrote — that's a work. One work, dozens of editions.

Librarians have understood this distinction forever. Most book software hasn't bothered.

When I imported my own 400+ book Goodreads library to test our new import feature, everything looked broken.

Search "Stoner" in my database: three results. The Picador paperback. The NYRB Classics edition. The Audible audiobook. Same work, three entries, 13 reviews fragmented across them — 4 here, 5 there, 4 over there. None of them looked popular. Aggregated under one work? 13 reviews. That changes everything.

Search "The Hobbit": hardcover, mass market, illustrated edition. Three hits. You know it's one book. The database doesn't.

And the metadata gets wild. The Great Gatsby's "first publication year" showed as 2003 in my database. Not 1925. Because the system was tracking the Scribner edition, not Fitzgerald's original publication. The data was technically correct — that edition was published in 2003 — and completely useless.

This is where it actually gets interesting as a builder.

Our original database held 6,183 book records populated from Google Books. After clustering by normalized title and author last name, they collapsed into 4,845 unique works. That's 1,338 duplicate editions — 22% of the entire database — eliminated by nothing fancier than basic string normalization.

The clustering is almost comically simple. Strip articles, lowercase everything, remove punctuation, grab the author's last name. "The Great Gatsby" by "F. Scott Fitzgerald" becomes great gatsby|||fitzgerald. Any edition matching that key belongs to the same work. Pick the edition with the most reviews as canonical. Promote its metadata. Done.
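Here's roughly what that looks like in Python — a sketch of the approach, not our production code, so the function and field names are illustrative:

```python
import re
from collections import defaultdict

ARTICLES = {"the", "a", "an"}

def work_key(title: str, author: str) -> str:
    """Build a crude clustering key: normalized title + author last name."""
    words = re.sub(r"[^\w\s]", "", title.lower()).split()
    # Strip leading articles so "The Great Gatsby" and "Great Gatsby" collide
    while words and words[0] in ARTICLES:
        words = words[1:]
    last_name = author.lower().split()[-1]
    return " ".join(words) + "|||" + last_name

def cluster_editions(editions):
    """Group edition records by work key, then promote the most-reviewed
    edition in each cluster as the canonical one."""
    clusters = defaultdict(list)
    for e in editions:
        clusters[work_key(e["title"], e["author"])].append(e)
    return {
        key: max(group, key=lambda e: e["review_count"])
        for key, group in clusters.items()
    }

# Two editions, one work: both collapse to "great gatsby|||fitzgerald"
editions = [
    {"title": "The Great Gatsby", "author": "F. Scott Fitzgerald", "review_count": 4},
    {"title": "Great Gatsby", "author": "Fitzgerald", "review_count": 9},
]
```

It's deliberately dumb — no fuzzy matching, no edit distance — and that's most of its charm: it's predictable, and it catches the overwhelming majority of duplicates.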

Except not done. Because then you need actual work-level data — the real first publication year, subjects, descriptions. This is where OpenLibrary saved me. Their API is work-centric by design. Give it an ISBN and it returns the parent work, plus metadata that knows The Great Gatsby is from 1925, not 2003.
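For the curious, it's a two-hop lookup: ISBN to edition record, edition to parent work. The /isbn/{isbn}.json and /works/{key}.json endpoints are real OpenLibrary routes; the helper names and error handling below are my own sketch:

```python
import json
import urllib.request

API = "https://openlibrary.org"

def work_key_from_edition(edition: dict):
    """Pull the parent work's key (e.g. "/works/OL468431W") out of an
    OpenLibrary edition record, if one is linked."""
    works = edition.get("works") or []
    return works[0]["key"] if works else None

def fetch_json(path: str) -> dict:
    with urllib.request.urlopen(f"{API}{path}.json") as resp:
        return json.load(resp)

def lookup_work(isbn: str) -> dict:
    """Two-hop lookup: ISBN -> edition record -> parent work record."""
    edition = fetch_json(f"/isbn/{isbn}")
    key = work_key_from_edition(edition)
    if key is None:
        raise LookupError(f"No parent work linked for ISBN {isbn}")
    return fetch_json(key)  # the key already starts with "/works/"
```

The work record is where the good stuff lives: the real title, subjects, descriptions, and a first-publication date that predates whatever reissue happens to be in your database.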

I also learned that Goodreads exports wrap ISBNs in ="..." in the CSV — a hack to stop Excel from eating leading zeros. So your import pipeline needs to know that ="9780756413026" is actually just 9780756413026. The kind of thing nobody tells you until your parser chokes on it.
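The fix is trivial once you know the wrapper is there. A sketch of the unwrapping — the column names match Goodreads' export, but the sample row is made up:

```python
import csv
import io

def clean_isbn(raw: str) -> str:
    """Unwrap Goodreads' Excel-protection hack: '="9780756413026"' becomes
    '9780756413026'. Missing ISBNs export as '=""'; plain values pass through."""
    raw = raw.strip()
    if raw.startswith('="') and raw.endswith('"'):
        raw = raw[2:-1]
    return raw

# Stand-in for a real export file (Goodreads column names, made-up row)
sample = 'Title,ISBN,ISBN13\nThe Name of the Wind,="9780756404741",="9780756413026"\n'
rows = [
    {k: clean_isbn(v) if k.startswith("ISBN") else v for k, v in row.items()}
    for row in csv.DictReader(io.StringIO(sample))
]
```

Note that the CSV parser leaves the wrapper intact — the field doesn't start with a quote character, so the quotes come through literally — which is exactly why you need the cleaning step after parsing, not during.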

I thought book data would be the straightforward part. Build the review model, add some search, build a Goodreads import — how hard could cataloguing be?

It turns out every layer has its own rabbit hole. Editions vs works was the first one. Getting high-quality cover art without publisher deals was the next — a multi-week saga involving API rate limits, placeholder detection, and the discovery that Google Books will sometimes return a photo of the table of contents instead of a cover.

That's next time. Building a book app is a masterclass in "things that sound simple."


I'm building InkTree — a reading companion that replaces star ratings with five dimensions. It's in beta, and I'm writing about the weird stuff I find along the way.