CSV feels like the simplest of file formats, where it might seem that there’s not much to know after mentally expanding the acronym - Comma Separated Values. But tell me this: if it’s so simple, then why are there so many CSV parsing libraries, alternative CSV parsing libraries, and CSV parsing libraries that claim to be better or smarter, and a mountain of mangled CSVs in existence?
CSV isn’t so much a file format as it is a loose set of guidelines for converting tabular data into text. The closest thing to a spec for it is this, which deals with vital and often overlooked questions such as:
-
“What happens if a value has a comma in it?” - oh, you quote it
-
“What happens if a value has a quote in it?” - oh, you put another quote before it
One question the spec definitely does not cover is one I needed answering: “What do you do with 32,000 files claiming to be valid CSVs but of the 750,000 some lines an unknown number of them have extra unquoted commas hidden in the values, basically making the data untrustworthy?” This is not such a simple problem, but it’s an interesting problem.
Read the rest of this