“I expected a clean table, but my CSV file wrote a novel in cell C3.”
– Me, one dark evening
We’ve all been there. You crack open a CSV file expecting a tidy, tabular layout, and boom – cell contents are suddenly poets, breaking into multiple lines like they’re performing at open mic night.
Imagine this beauty:
Col1;col2;"col3 l1
Col3 l2
Col3 l3";col4;col5
Or even worse:
Col1;col2;"col3 l1"
"Col3 l2"
"Col3 l3";col4;col5
It’s like Excel (or whatever poisonous program you use) got drunk and started mixing Shakespeare into your sales data.
So what’s going on?
CSV files can handle newlines inside fields – if you wrap those fields in quotes! The problem? Not every tool handles that gracefully. Sometimes the quotes are inconsistent. Sometimes every line of the multi-line value has quotes. Sometimes, you wonder if JSON was really the better choice all along.
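For reference, a spec-compliant parser like Python's built-in csv module will happily read a properly quoted multi-line field as a single logical row. A minimal sketch, using the semicolon-delimited layout from the examples above:

import csv
import io

# A well-formed CSV: the line breaks live inside a quoted field, so a
# compliant parser treats the three physical lines as one logical row.
raw = 'Col1;col2;"col3 l1\nCol3 l2\nCol3 l3";col4;col5\n'

rows = list(csv.reader(io.StringIO(raw), delimiter=';'))
print(len(rows))          # 1 -- one logical row despite three physical lines
print(repr(rows[0][2]))   # 'col3 l1\nCol3 l2\nCol3 l3'

The trouble starts when the file isn't that well-formed, or when the tool reading it insists on treating every physical line as a new row.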
But fear not: I’ve got a Python script that’ll whip those multi-line rebels back into shape.
Python to the rescue
Here’s a little Python script that does the dirty work; it has saved me in more than one situation in my daily work. The script is really simple and will:
- Detect when a field spans multiple lines
- Merge it all into one line
- Make sure there’s only one pair of quotes around the cleaned-up string
- Leave your CSV nice, parseable, and drama-free
def clean_multiline_csv(input_file, output_file):
    cleaned_rows = []
    current_row = ""
    inside_multiline = False

    with open(input_file, encoding='utf-8') as infile:
        for line in infile:
            line = line.rstrip('\n')
            if not inside_multiline:
                current_row = line
                # An odd number of quotes means a quoted field was opened
                # but not closed on this line: the row continues below.
                if current_row.count('"') % 2 == 1:
                    inside_multiline = True
            else:
                # Glue the continuation line onto the open row, inserting a
                # literal backslash-n marker in place of the swallowed newline.
                current_row += "\\n" + line.strip()
                if current_row.count('"') % 2 == 0:
                    inside_multiline = False
                    # If the merged field collected more than one pair of quotes
                    # (each original line was quoted on its own), strip them all,
                    # turn the markers into spaces, and re-wrap the cleaned-up
                    # value in a single pair of quotes.
                    parts = current_row.split(';')
                    for i, part in enumerate(parts):
                        if part.count('"') > 2:
                            parts[i] = '"' + part.replace('"', '').replace("\\n", " ") + '"'
                            break
                    current_row = ';'.join(parts)
            if not inside_multiline:
                cleaned_rows.append(current_row)

    with open(output_file, 'w', encoding='utf-8', newline='') as outfile:
        for row in cleaned_rows:
            outfile.write(row + '\n')


if __name__ == "__main__":
    clean_multiline_csv('input.csv', 'output.csv')
Stick that in a .py file, run it on your messy CSV, and rejoice as the chaos is compressed back into order.
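If you want to double-check the result, a quick sketch like this (assuming the output.csv name and semicolon delimiter from the script above) will show whether every line now parses as a single row with the expected column count:

import csv

# Sanity check: every physical line should now parse as exactly one row.
with open('output.csv', encoding='utf-8', newline='') as f:
    for row in csv.reader(f, delimiter=';'):
        print(len(row), row)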
Conclusion
Parsing CSV files should be simple—but when multi-line values sneak in, things can spiral fast. This Python script gives you a quick fix, so your columns stop getting all emotional mid-row.
Feel free to fork it, modify it, or just scream at it in cathartic frustration. Either way—happy data wrangling!
It’s also available here: https://gist.github.com/kimusan/db7b74ec9f413afd990ed58c2394011b