Jeff Keen

Dev Stuff

07 Aug 2019

How to Correct 32,000 Incorrect CSV Files in Fewer Than 32,000 Steps

Or how impossible problems can become possible when given no other choice.

CSV feels like the simplest of file formats: once you’ve mentally expanded the acronym (Comma-Separated Values), it might seem there’s not much left to know. But tell me this: if it’s so simple, why are there so many CSV parsing libraries, alternative CSV parsing libraries, CSV parsing libraries that claim to be better or smarter, and a mountain of mangled CSVs in existence?

CSV isn’t so much a file format as it is a loose set of guidelines for converting tabular data into text. The closest thing to a spec for it is this, which deals with vital and often overlooked questions such as:

“What happens if a value has a comma in it?” - oh, you quote it
“What happens if a value has a quote in it?” - oh, you put another quote before it
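To make those two rules concrete, here is a minimal sketch using Python's standard `csv` module (my choice for illustration, not something the post itself uses). The helper names `to_csv_line` and `from_csv_line` and the sample values are made up for the example.

```python
import csv
import io


def to_csv_line(values):
    """Serialize one row, letting the csv module apply the quoting rules."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(values)
    return buffer.getvalue()


def from_csv_line(line):
    """Parse one CSV line back into a list of values."""
    return next(csv.reader([line]))


# Hypothetical values: one contains a comma, one contains quotes.
line = to_csv_line(["Keen, Jeff", 'He said "hi"', "plain"])
print(line, end="")         # the first two values get wrapped in quotes
print(from_csv_line(line))  # the original three values come back intact
```

The value with a comma is wrapped in quotes, and each quote inside a value is doubled. Values that need neither are left bare, which is the writer's default behaviour.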

One question the spec definitely does not cover is the one I needed answered: “What do you do with 32,000 files that claim to be valid CSVs, when an unknown number of their 750,000-some lines have extra unquoted commas hidden in the values, making the data untrustworthy?” This is not a simple problem, but it’s an interesting one.
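As a rough illustration of why this is hard, the sketch below flags rows that have more fields than the header, which is what an extra unquoted comma produces. It is only a detection step under my own assumptions, not the approach the post goes on to describe. The function name `find_suspect_rows`, the column names, and the sample data are all hypothetical.

```python
import csv
import io


def find_suspect_rows(text):
    """Return (line_number, fields) for rows with more fields than the header.

    Assumes the first line is a trustworthy header and that no value
    spans multiple lines.
    """
    rows = csv.reader(io.StringIO(text))
    header = next(rows)
    suspects = []
    for fields in rows:
        if len(fields) > len(header):
            # line_num counts lines read so far, so it points at this row.
            suspects.append((rows.line_num, fields))
    return suspects


# Hypothetical data: the last row has an unquoted comma inside the artist.
sample = (
    "artist,title,played_at\n"
    "Prince,Kiss,09:15\n"
    "Earth, Wind & Fire,September,09:20\n"
)

for line_number, fields in find_suspect_rows(sample):
    print(line_number, fields)
```

Counting fields tells you that a row is broken, but not which comma is the intruder. Deciding where the real column boundaries are is the part that needs more than a parser.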


