Reading TSV files

Is it possible to read and write tab-separated values (.TSV) files with Octave?

Did you try dlmread with \t as the delimiter argument?

At least, the following works for me for writing and reading tab-separated data files:

data = rand (4);
fname = 'toto.csv';
dlmwrite (fname, data, '\t');

## Open in default application (LibreOffice here) works
open (fname)

## Reread also works
data2 = dlmread (fname, '\t');

This works for numeric data. Is it also possible to read strings from TSV files? I have a TSV file with both numeric and text columns, so I need a function in which a range can be specified (as in dlmread).

If you know how many columns the tab separated file contains, you could use textscan:

fid = fopen('test.csv');
a = textscan(fid, '%s %s %s %s %s', 'Delimiter', '\t')
fclose(fid)
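
If some of the columns are numeric, textscan can also convert them on the fly; a minimal sketch, assuming a hypothetical three-column file with text in the second column:

fid = fopen('test.csv');
## %f converts the field to double, %s keeps it as a string
c = textscan(fid, '%f %s %f', 'Delimiter', '\t');
fclose(fid);
numbers = c{1};   # double column vector (column 1)
labels = c{2};    # cell array of strings (column 2)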



I have found that csv2cell (io package) also works fine.

You can also use csv2cell in the io package; that function can read mixed-type .csv and .tsv files. Type 'help csv2cell' for the options you need.
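
For instance, a minimal sketch (the file name and range here are just placeholders):

pkg load io
## csv2cell returns numeric fields as doubles and text fields as strings,
## all in a single cell array
C = csv2cell ('test.tsv', 'A1:E100', "\t");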


I still have a problem with a TSV file of about 13 million lines, from which I need to read a column of string elements. The csv2cell function of the io package does not work for a file of this size (while it works for smaller files). I use textscan instead, but it only manages to read the first 12 million lines or so; after that it appears to read random elements from the whole TSV file (which has 21 columns). I have made several attempts to solve this problem. First, I read the first 12 million lines with the instructions:

fid = fopen(file_name);
for i=1:n1
  temp = textscan(fid, '%s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s', 1, 'EndOfLine', "\r\n", 'Delimiter', "\t");
  % variants_autosomes_h_1 {i,1} = char(temp {1,1});
endfor
fclose(fid)

Then I have read the following lines with the instructions:

fid = fopen(file_name);
temp = textscan(fid, '%s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s', 1, 'EndOfLine', "\r\n", 'Delimiter', "\t", 'HeaderLines', n1);
variants_autosomes_h_2 {1,1} = char(temp {1,1});
for i=2:n2
  temp = textscan(fid, '%s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s', 1, 'EndOfLine', "\r\n", 'Delimiter', "\t");
  variants_autosomes_h_2 {i,1} = char(temp {1,1});
endfor
fclose(fid)

Here n1 + n2 = n, the total number of lines. It doesn't work: the first cell array is fine, but the second one is empty. If I then run only the script for the second array, it is made up of random elements taken from the input TSV file.

Problems with 13 million lines? Do you run a 32-bit version of Octave?

If the file is so big, you might try splitting it up and reading it part by part.
Would csv2cell work if you instruct it to read a spreadsheet-like range (it accepts that; just look at its help), and read the file part by part that way? A rough sketch of that idea follows below.

Maybe you hit problems with the size of the output array - it would be good to watch Octave's RAM usage.
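
Something along these lines, perhaps - a sketch only; the chunk size, the column span A..Y (25 columns), and the number of chunks are assumptions to adjust to your file:

pkg load io
chunk = 1000000;                          # lines per chunk; tune to your RAM
parts = {};
for k = 1:13                              # ~13 million lines in total
  rng = sprintf ("A%d:Y%d", (k-1)*chunk + 1, k*chunk);
  parts{end+1} = csv2cell (file_name, rng, "\t");
endfor
## keep the chunks in separate cell arrays; concatenating them may exhaust RAM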


I run octave-6.3.0-w64.

Yes, I have tried csv2cell with the following instructions:

strn = num2str(n);
sep = "\t"
prot = "\t"
range = strcat("A1:A", strn)
cell_array = csv2cell (file_name, range, sep, prot);

Even with a small n (just a few lines) it doesn't work: the cell array comes back empty.

Splitting it up could be the solution, let me try! Excel lets me open only about two million lines at a time. I can try splitting into one file of 11 million lines and one of 2 million lines, for instance.

I was wrong: Excel refuses to open the 13-million-line file. Even File viewer plus can't open it. So I can't split it into smaller files.

You should be able to use the "split" utility, which is included with the Octave distribution.

https://www.man7.org/linux/man-pages/man1/split.1.html
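
For instance, from within Octave (a sketch; the file names are placeholders, and this assumes split is on your PATH):

## -l sets lines per output file, -d requests numeric suffixes;
## this should produce part_00, part_01, part_02, ...
system ('split -l 5000000 -d big_file.tsv part_');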

I have managed to split the 13 million line file into three smaller files. I now try again with textscan. Let’s see if I succeed.

Yes, I have tried csv2cell with the following instructions:

strn = num2str(n);
sep = "\t"
prot = "\t"
range = strcat("A1:A", strn)
cell_array = csv2cell (file_name, range, sep, prot);

Little wonder, Paolo. You specify the same character as separator and protection character - that won’t work. The protection character cannot be the same as a separator - how would Octave be able to distinguish between a separator and the start and end of a field containing separator literals?
You don't need to specify the protection character unless the string values in the file contain the separator character and/or spaces and the protection character is something other than a double quote ("), which is the usual protection character, or the cell string values contain embedded double quotes.
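
To illustrate with a hypothetical file where one field contains an embedded tab, protected by double quotes:

## a line in quoted.tsv:  abc<TAB>"two<TAB>words"<TAB>42
## the quotes protect the inner tab, so three fields are read, not four
C = csv2cell ('quoted.tsv', 'A1:C1', "\t", '"');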

Some questions:

  • what is the max. number of fields per line in your .tsv file? From one of your earlier posts I infer it’s about 25, right?
  • are there any fields containing tabs and/or spaces?
  • if so, are those fields enclosed in double quotes?

So if you don’t need a protection character, just try:
C = csv2cell (file_name, "A1:Y5000000", "\t"); ## for a 5-million-line file

BTW, as far as I'm aware csv2cell itself has no limit on the number of lines, but perhaps it runs out of memory somewhere. Looking at the code for csv2cell, I fail to see where. But:
TTBOMK cell arrays and structs can be fairly memory-inefficient data structures. So even if you manage to split the files and read them separately, you might hit out-of-memory issues when trying to concatenate the output cell arrays. If I am right here, that will be equally valid for csv2cell and textscan.

The fact that Excel won't load your file is no surprise - spreadsheets may be small on disk but really explode in RAM due to an enormous overhead from pointers, styles, internal references, etc. Plus, AFAIK most MS-Office and Excel installations are still 32-bit, so after subtracting the RAM usage of Excel + support modules you're left with less than 1.5 GB for pure data.


@PhilipN Yes, it is a TSV file with 25 columns and 13 million lines (2.86 GB). The separation character is space.

I have managed to split the file into three files of 5 million lines each and to read them with textscan. It works fine.

I now try to use cell2csv without the prot character and by reading only 5 million lines, as you suggested. But I note that the same csv2cell lines with my prot character "\t" did work for a file with 13 million lines, but only 10 columns.

Thanks for reporting back.

cell2csv or csv2cell?

If csv2cell works for 13 million lines and 10 columns, it seems there's a limit of around 130 million entries in the output cell array. I wonder where that limitation comes from - or is it a limit of 10 columns?
Weird.

Luckily you do have a working solution now (admittedly more of a workaround).


@PhilipN csv2cell, sorry…

I have now tested csv2cell on a 5 million lines TSV file with 25 columns, with the script:

n = 5000000
temp = cell(n,1);
temp = csv2cell (file_name, range, sep);   # range and sep as before, e.g. "A1:A5000000" and "\t"
file_name = strcat("variants_variants_autosomes_", strm, ".csv")   # strm: part number suffix, defined elsewhere
cell2csv (file_name, temp, " ")

It works fine and it is about 20 times faster than textscan called as follows:

fid = fopen(file_name);
for i=1:n
  temp = textscan(fid, '%s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s', 1, 'EndOfLine', "\r\n", 'Delimiter', "\t");
  array {i,1} = char(temp {1,1});
endfor
fclose(fid)

I confirm that csv2cell does not work for the original TSV of 13 million lines and 25 columns.

So, my advice is to split these big TSV files (which usually come from genetic data) into smaller files (there are many utilities that do this work) and to use csv2cell instead of textscan. The time saved is huge: in my case, reading a column from each of the three files I split the original into with textscan (and writing them with cell2csv as three files) required 1.5 hours, while the same operations with csv2cell required about 4.5 minutes.
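
For the record, a sketch of that workflow (the file names and sizes are illustrative):

pkg load io
parts = {'part_00', 'part_01', 'part_02'};          # produced by the split utility
for k = 1:numel (parts)
  col = csv2cell (parts{k}, "A1:A5000000", "\t");   # one string column per part
  cell2csv (sprintf ('column_A_part%d.csv', k), col, " ");
endfor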

I have not merged the arrays into a single array of 13 million lines: I tried to, but it doesn't work.

I must mention that all these problems arise only with columns of the original file containing string elements. For columns containing floating-point elements I used dlmread without having to split the file of 13 million lines and 25 columns.
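
For instance, dlmread also accepts a spreadsheet-style range, so a single numeric column can be extracted in one call (the column letter and line count here are just examples):

vals = dlmread (file_name, "\t", "C1:C13000000");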

Just out of interest: Why do you use textscan in a loop with n set to 1? Wouldn’t it be faster to read the complete file in one go?

fid = fopen ('filename....');
array = textscan(fid, '%s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s', 'EndOfLine', "\r\n", 'Delimiter', "\t");
fclose (fid);

Or if you are just interested in the values from the first column, skip the remaining ones:

fid = fopen ('filename....');
array = textscan(fid, '%s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s', 'EndOfLine', "\r\n", 'Delimiter', "\t");
fclose (fid);

@mmuetzel The answer to these two questions is that I didn’t know textscan could be used the way you proposed…