csvread: first cell incorrect with UTF-8 CSV files from MS Excel

Greetings, I have noticed that csvread incorrectly returns the first cell as ‘0’ when the CSV file was saved from MS Excel as ‘UTF-8 csv’, but works correctly when the export type is ‘CSV’. I suspect something is going wrong with how csvread handles the ‘byte order mark’ that MS Excel now writes for this export type. While I am not implying csvread contains a bug (it is probably meant to work only with ASCII CSV), it may be appropriate to add a caveat to the documentation, i.e. csvread’s function reference.

To replicate using GNU Octave 5.2 for Windows and Excel 365, create a CSV file containing [1 2 3; 4 5 6] and save it as UTF-8 CSV (csvu.csv), then save the same numbers as plain (ASCII) CSV (csva.csv). In Octave, call csvread on both files and observe that the first value is 0 from csvu.csv and 1 from csva.csv.
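
For anyone without Excel at hand, the two files can be approximated with a short script. This is only a sketch in Python (not Octave), and it assumes that Excel's UTF-8 export differs from the plain one solely by the three leading bytes EF BB BF and that both use CRLF line endings; the helper names are made up for illustration.

```python
import os
import tempfile

# Assumption: the UTF-8 export is prefixed with the encoded form of U+FEFF.
BOM = b"\xef\xbb\xbf"
# Assumption: CRLF line endings, as is usual for files written on Windows.
DATA = b"1,2,3\r\n4,5,6\r\n"


def write_test_files(directory):
    """Write csva.csv (plain) and csvu.csv (BOM-prefixed); return both paths."""
    plain = os.path.join(directory, "csva.csv")
    with_bom = os.path.join(directory, "csvu.csv")
    with open(plain, "wb") as f:
        f.write(DATA)
    with open(with_bom, "wb") as f:
        f.write(BOM + DATA)
    return plain, with_bom


def first_field_bytes(path):
    """Return the raw bytes of the first comma-separated field of the file."""
    with open(path, "rb") as f:
        return f.readline().split(b",")[0]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        plain, with_bom = write_test_files(d)
        print(first_field_bytes(plain))     # b'1'
        print(first_field_bytes(with_bom))  # b'\xef\xbb\xbf1'
```

The first field of the second file is no longer a bare digit, which would explain why a reader expecting plain ASCII numbers fails to parse it.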

Regards,
Adrian

Thank you for your report. Would it be possible for you to upload a compressed ZIP file of your CSV files here? (Uploading ZIP files should be allowed by this forum.)

Hi, thanks for your reply. I have attached the CSV. I also found some information on this problem at this blog:
https://donatstudios.com/CSV-An-Encoding-Nightmare

csva_and_csvu.zip (323 Bytes)

The elementary storage unit for UTF-8 is a byte. So imho, a byte-order-mark doesn’t really make sense (unlike for e.g. UTF-16 or UTF-32).
But since MS seems to use the BOM to mark files containing UTF-8 (as opposed to any other encoding), we should probably handle it somehow. It should be safe for dlmread (the work-horse behind csvread) to just skip the BOM if it is present.
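
To illustrate the idea of skipping the BOM when it is present, here is a rough sketch in Python. It is not how dlmread is implemented; the function names are hypothetical and the parsing is deliberately simplistic (numeric cells only).

```python
import csv
import io

UTF8_BOM = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF


def strip_utf8_bom(raw):
    """Return raw without a leading UTF-8 BOM; other input is returned unchanged."""
    if raw.startswith(UTF8_BOM):
        return raw[len(UTF8_BOM):]
    return raw


def read_numeric_csv(raw):
    """Parse comma-separated numeric data from bytes, tolerating a leading BOM."""
    text = strip_utf8_bom(raw).decode("utf-8")
    reader = csv.reader(io.StringIO(text, newline=""))
    # Skip empty rows (e.g. a trailing blank line) and convert each cell to float.
    return [[float(cell) for cell in row] for row in reader if row]


if __name__ == "__main__":
    print(read_numeric_csv(b"\xef\xbb\xbf1,2,3\r\n4,5,6\r\n"))
    print(read_numeric_csv(b"1,2,3\r\n4,5,6\r\n"))
```

Because the check only looks at the first three bytes, files without a BOM go through untouched, which is why skipping it should be safe.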

Could you open a bug report for this on savannah (linking to this topic)?
https://savannah.gnu.org/bugs/?func=additem&group=octave

Bug report opened: https://savannah.gnu.org/bugs/index.php?58813 (moving the discussion there).

UTF-8 with a BOM isn’t common, but it is valid. I think it’s something we should support.

You’re right; for UTF-8, there’s no endianness, so a BOM doesn’t serve to indicate byte ordering. It just acts as a magic cookie that tells you it’s a UTF-8 text stream, which can be useful if you’re trying to “sniff” a file as UTF-8 vs ASCII vs ISO 8859-1 vs CP-1252 or whatever.
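
As a rough illustration of that kind of sniffing, a sketch in Python might look like the following. The function name and the returned labels are made up for this example, and it does not attempt to tell the various 8-bit code pages apart, since that cannot be done reliably from the bytes alone.

```python
def sniff_encoding(raw):
    """Guess a label for raw bytes, checking for the UTF-8 BOM first."""
    if raw.startswith(b"\xef\xbb\xbf"):
        # The BOM acts as an explicit marker for UTF-8.
        return "utf-8-sig"
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Could be ISO 8859-1, CP-1252, or something else entirely.
        return "unknown-8bit"


if __name__ == "__main__":
    print(sniff_encoding(b"\xef\xbb\xbf1,2,3"))
    print(sniff_encoding(b"1,2,3"))
    print(sniff_encoding("é".encode("utf-8")))
    print(sniff_encoding(b"\xe9"))
```

With the BOM the answer is immediate; without it the guess depends on which decoders happen to accept the bytes.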

Agreed.
Stripping the UTF-8 BOM for .csv files has been implemented on the current default branch, which will probably become Octave 7 some time next year.