ZairussalamTools

CSV ↔ JSON Conversion Pitfalls: Escaping, Quoting, Encoding

CSV looks trivial until you ship it. Embedded commas, Excel's auto-formatting, UTF-8 BOMs, and the locale wars — with rules for getting it right.

·Ibrahimsyah Zairussalam·

CSV is the format everyone thinks they understand until they try to import 50,000 rows of customer data and half the addresses land in the wrong columns. It looks like "values separated by commas," and that definition is right about 85% of the time. The other 15% is where your Tuesday morning goes.

Here's what actually trips people up, and how to not become the person who ships a broken CSV.

The Spec Nobody Reads

There is, in fact, a spec: RFC 4180. It's two pages long. Most tools implement a dialect of it, not the thing itself. The key rules:

  • Fields containing commas, quotes, or newlines must be wrapped in double quotes
  • Double quotes inside a quoted field are escaped by doubling them: " becomes ""
  • Records are separated by CRLF — yes, Windows line endings, in 2026, because that's what the RFC says

Most modern parsers accept LF too. Some legacy ones do not.

Pitfall 1: Embedded Commas and Quotes

This is the one everyone hits first.

Broken
id,name,address
1,Jane Doe,123 Main St, Apt 4B, Portland
2,"Smith, Bob",42 Oak Ave, Chicago

The first row silently becomes 5 fields. The second row is technically valid but the address still spills over.

Correct
id,name,address
1,"Jane Doe","123 Main St, Apt 4B, Portland"
2,"Smith, Bob","42 Oak Ave, Chicago"

Any field that might contain the delimiter gets quoted. Consistent, safe, boring. Boring is the goal.

And the nested-quote case:

Escaping a literal double quote

If your generator emits \" instead of "", it's lying to you — that's JSON escaping, not CSV escaping. Different rules.

Pitfall 2: Excel's "Helpful" Auto-Interpretation

Excel is the most popular CSV editor in the world and it will actively corrupt your data.

  • 01234 (ZIP code) → 1234 (leading zero stripped, cast to number)
  • 0123456789 (phone) → 1.23457E+08 (scientific notation)
  • 3/4 (fraction in a product SKU) → 4-Mar (guessed as a date)
  • GENE1 (gene symbol) → Jan-1 (Excel's gene-name bug is so famous scientists renamed 27 genes to work around it)
  • Long numeric IDs (credit card, order ID) → silently rounded because IEEE 754
Excel is not a CSV tool

If your users will open your CSV in Excel, you have three options: (1) ship XLSX instead and keep types, (2) force-quote every numeric-looking string, or (3) document that "double-click this file in Excel" is not a supported workflow. Pick one and mean it.

The force-quote trick works some of the time — "01234" in the file often still gets coerced by Excel on open. The truly paranoid prefix with a tab character or single quote: ="01234". It's ugly. That's the price.

Pitfall 3: Encoding — UTF-8 BOM, Windows-1252, the Usual Suspects

Your CSV is bytes. Bytes need an encoding. If you don't declare one, every tool guesses, and they guess differently.

The specific failures you'll see:

  • UTF-8 without BOM opened in Excel on Windows → café shows up as café
  • UTF-8 with BOM sent to a strict parser → first column name becomes \uFEFFid and your JOIN fails
  • Windows-1252 ("ANSI") sent to a UTF-8 parser → emoji and smart quotes break
  • Mixed encoding within one file → someone concatenated two exports and you're in pain

The pragmatic default for modern pipelines is UTF-8 with BOM, because Excel needs the BOM to detect UTF-8 and most other parsers strip it gracefully. If your consumer is strictly a program (not a human opening the file), UTF-8 without BOM.

Pitfall 4: Line Endings

Linux writes \n. Windows wants \r\n. Some old Macs wrote \r. Most tools handle all three — until you get the one that doesn't, and your "CSV" is now a single row with 4 million fields.

Rule: emit \r\n if you're producing a file for unknown consumers. Accept anything when parsing.

Pitfall 5: Semicolons vs Commas (the Locale Wars)

In Europe, the decimal separator is a comma. So 1,5 means "one and a half." If you use , as both decimal separator and field separator, you're doomed. European Excel solves this by using ; as the field delimiter:

Locale-aware dialect

If you're shipping CSVs to a mixed audience, consider:

  • TSV (tab-separated) — unambiguous, no decimals collide with tabs
  • JSON — if both ends are machines, just do this
  • Explicitly document the delimiter and decimal separator in the file header or API docs

Rules for Shipping CSV That Survives First Contact

  1. UTF-8 with BOM for human-consumed files; without BOM for machine pipelines
  2. Always quote fields that contain ,, ", newlines, or leading/trailing whitespace
  3. Double up internal quotes (""), never backslash-escape
  4. CRLF line endings for maximum compatibility
  5. No empty header row — if you have no header, say so explicitly
  6. Treat strings as strings — wrap ZIP codes, phone numbers, and IDs in quotes and hope the consumer respects it
  7. If you control both ends and it's more than trivial data, use JSON or Parquet instead

When JSON Is the Right Call

CSV was designed for spreadsheets. JSON was designed for structured data. If your fields contain nested objects, arrays, or anything that isn't a scalar string/number, you're fighting the format.

The converter handles the RFC 4180 quoting rules correctly and lets you pick the header behavior, so you can flip back and forth without hand-editing.

The Takeaway

CSV is a simple format with a complicated ecosystem. The format itself is fine — it's the 40 years of inconsistent implementations, Excel's opinions, and locale differences that will get you. Be boring: quote everything questionable, declare UTF-8, document your dialect, and for anything non-trivial, skip CSV entirely.