Notes on reading a UTF-8 encoded CSV in Python
Here’s a problem I solved today: I have a CSV file to parse which contained UTF-8 strings, and I want to parse it using Python. I want to do it in a way that works in both Python 2.7 and Python 3.
This proved to be non-trivial, so this blog post is a quick brain dump of what I did, in the hope it’s useful to somebody else and/or my future self.
Problem statement
Consider the following minimal example of a CSV file:
1,alïce
2,bøb
3,cárol
We want to parse this into a list of lists:
[
["1", "alïce"],
["2", "bøb"],
["3", "cárol"],
]
Experiments
The following code can read the file in Python 2.7; here we treat the file as a bag of bytes and only decode after the CSV parsing is done:
import csv
with open("example.csv", "rb") as csvfile:
csvreader = csv.reader(csvfile, delimiter=",")
for row in csvreader:
row = [entry.decode("utf8") for entry in row]
print(": ".join(row))
But if you run that code in Python 3, you get the following error:
Traceback (most recent call last):
File "reader2.py", line 6, in <module>
for row in csvreader:
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
The following code can read the file in Python 3:
import csv
with open("example.csv", encoding="utf8") as csvfile:
csvreader = csv.reader(csvfile, delimiter=",")
for row in csvreader:
print(": ".join(row))
But the encoding
argument to open()
is only in Python 3 or later, so you can’t use this in Python 2.
In theory this is backported as codecs.open()
, but I get a different error if I use codecs.open()
in this file with Python 2.7:
Traceback (most recent call last):
File "reader3.py", line 7, in <module>
for row in csvreader:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 4: ordinal not in range(128)
This feels like it should be possible using only the standard library, but it was becoming sufficiently complicated that I didn’t want to bother.
I considered defining these as two separate functions, and running:
import sys
if sys.version_info[0] == 2:
read_csv_python2()
else:
read_csv_python3()
but that felt a little icky, and would have been annoying for code coverage. Having two separate functions also introduces a source of bugs – I might remember to update one function, but not the other.
I found csv23 on PyPI, whose description sounded similar to what I wanted. The following snippet does what I want:
import csv23
with csv23.open_reader("example.csv") as csvreader:
for row in csvreader:
print(": ".join(row))
This reads the CSV file as UTF-8 in both Python 2 and 3. Having a third-party library is mildly annoying, but it’s easier than trying to write, test and maintain this functionality myself.
tl;dr
Python 2 only:
import csv
with open("example.csv", "rb") as csvfile:
csvreader = csv.reader(csvfile, delimiter=",")
for row in csvreader:
row = [entry.decode("utf8") for entry in row]
print(": ".join(row))
Python 3 only:
import csv
with open("example.csv", encoding="utf8") as csvfile:
csvreader = csv.reader(csvfile, delimiter=",")
for row in csvreader:
print(": ".join(row))
Both Python 2 and 3:
import csv23
with csv23.open_reader("example.csv") as csvreader:
for row in csvreader:
print(": ".join(row))