CSV ↔ JSON Conversion Pitfalls: Escaping, Quoting, Encoding
CSV looks trivial until you ship it. Embedded commas, Excel's auto-formatting, UTF-8 BOMs, and the locale wars — with rules for getting it right.
CSV is the format everyone thinks they understand until they try to import 50,000 rows of customer data and half the addresses land in the wrong columns. It looks like "values separated by commas," and that definition is right about 85% of the time. The other 15% is where your Tuesday morning goes.
Here's what actually trips people up, and how to not become the person who ships a broken CSV.
The Spec Nobody Reads
There is, in fact, a spec: RFC 4180. It's two pages long. Most tools implement a dialect of it, not the thing itself. The key rules:
- Fields containing commas, quotes, or newlines must be wrapped in double quotes
- Double quotes inside a quoted field are escaped by doubling them:
"becomes"" - Records are separated by
CRLF— yes, Windows line endings, in 2026, because that's what the RFC says
Most modern parsers accept LF too. Some legacy ones do not.
Pitfall 1: Embedded Commas and Quotes
This is the one everyone hits first.
id,name,address
1,Jane Doe,123 Main St, Apt 4B, Portland
2,"Smith, Bob",42 Oak Ave, Chicago
The first row silently becomes 5 fields. The second row is technically valid but the address still spills over.
id,name,address
1,"Jane Doe","123 Main St, Apt 4B, Portland"
2,"Smith, Bob","42 Oak Ave, Chicago"
Any field that might contain the delimiter gets quoted. Consistent, safe, boring. Boring is the goal.
And the nested-quote case:
If your generator emits \" instead of "", it's lying to you — that's JSON escaping, not CSV escaping. Different rules.
Pitfall 2: Excel's "Helpful" Auto-Interpretation
Excel is the most popular CSV editor in the world and it will actively corrupt your data.
01234(ZIP code) →1234(leading zero stripped, cast to number)0123456789(phone) →1.23457E+08(scientific notation)3/4(fraction in a product SKU) →4-Mar(guessed as a date)GENE1(gene symbol) →Jan-1(Excel's gene-name bug is so famous scientists renamed 27 genes to work around it)- Long numeric IDs (credit card, order ID) → silently rounded because IEEE 754
If your users will open your CSV in Excel, you have three options: (1) ship XLSX instead and keep types, (2) force-quote every numeric-looking string, or (3) document that "double-click this file in Excel" is not a supported workflow. Pick one and mean it.
The force-quote trick works some of the time — "01234" in the file often still gets coerced by Excel on open. The truly paranoid prefix with a tab character or single quote: ="01234". It's ugly. That's the price.
Pitfall 3: Encoding — UTF-8 BOM, Windows-1252, the Usual Suspects
Your CSV is bytes. Bytes need an encoding. If you don't declare one, every tool guesses, and they guess differently.
The specific failures you'll see:
- UTF-8 without BOM opened in Excel on Windows →
caféshows up ascafé - UTF-8 with BOM sent to a strict parser → first column name becomes
\uFEFFidand your JOIN fails - Windows-1252 ("ANSI") sent to a UTF-8 parser → emoji and smart quotes break
- Mixed encoding within one file → someone concatenated two exports and you're in pain
The pragmatic default for modern pipelines is UTF-8 with BOM, because Excel needs the BOM to detect UTF-8 and most other parsers strip it gracefully. If your consumer is strictly a program (not a human opening the file), UTF-8 without BOM.
Pitfall 4: Line Endings
Linux writes \n. Windows wants \r\n. Some old Macs wrote \r. Most tools handle all three — until you get the one that doesn't, and your "CSV" is now a single row with 4 million fields.
Rule: emit \r\n if you're producing a file for unknown consumers. Accept anything when parsing.
Pitfall 5: Semicolons vs Commas (the Locale Wars)
In Europe, the decimal separator is a comma. So 1,5 means "one and a half." If you use , as both decimal separator and field separator, you're doomed. European Excel solves this by using ; as the field delimiter:
If you're shipping CSVs to a mixed audience, consider:
- TSV (tab-separated) — unambiguous, no decimals collide with tabs
- JSON — if both ends are machines, just do this
- Explicitly document the delimiter and decimal separator in the file header or API docs
Rules for Shipping CSV That Survives First Contact
- UTF-8 with BOM for human-consumed files; without BOM for machine pipelines
- Always quote fields that contain
,,", newlines, or leading/trailing whitespace - Double up internal quotes (
""), never backslash-escape - CRLF line endings for maximum compatibility
- No empty header row — if you have no header, say so explicitly
- Treat strings as strings — wrap ZIP codes, phone numbers, and IDs in quotes and hope the consumer respects it
- If you control both ends and it's more than trivial data, use JSON or Parquet instead
When JSON Is the Right Call
CSV was designed for spreadsheets. JSON was designed for structured data. If your fields contain nested objects, arrays, or anything that isn't a scalar string/number, you're fighting the format.
The converter handles the RFC 4180 quoting rules correctly and lets you pick the header behavior, so you can flip back and forth without hand-editing.
The Takeaway
CSV is a simple format with a complicated ecosystem. The format itself is fine — it's the 40 years of inconsistent implementations, Excel's opinions, and locale differences that will get you. Be boring: quote everything questionable, declare UTF-8, document your dialect, and for anything non-trivial, skip CSV entirely.
More articles
Cron Expression Cheatsheet With Real Examples
The handful of cron patterns you actually need, the difference between 5-field and 6-field syntax, and the timezone gotcha that will burn you.
Favicon Sizes That Actually Matter in 2026
The classic 'generate 47 favicon files' advice is outdated. Here's the minimum viable set modern browsers and PWAs actually use.
Encrypt Text in Your Browser Without Sending It Anywhere
How WebCrypto's AES-GCM works, why PBKDF2 iteration count matters, and what passphrase-based encryption actually protects you from.