Reading TSV files

Hmmm, I wonder why you read a .csv file only to make a new one with the same data and a different file name. Or do you select just some data using the range parameter?

Does your textscan trick work to read all 13 million lines into one array?
If I follow that code correctly, you only use the first column.
textscan’s output is one cell array that in turn contains a cell array (or vector) for each column of data in the file. I see that you read all columns as one row of data into “temp” but only append the first column to the variable “array” and discard the rest.
If I’m right, csv2cell might also be able to read just that first column of the whole file using a range of “A1:A15000000”. (Just a hunch, didn’t try it here.)
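Something like this is what I have in mind; a hypothetical, untested sketch, assuming csv2cell (from the io package) accepts a spreadsheet-style range as its second argument, and with a made-up file name:

pkg load io
# Hypothetical: read only column A for all rows, skipping the other columns.
col1 = csv2cell ("bigdata.csv", "A1:A15000000");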

You write that csv2cell won’t read the original file but can read it in 3 consecutive chunks. What would be the size of the largest chunk of that file that csv2cell is able to read?
Note: “whos” doesn’t help much here; it only shows the number of bytes occupied by the data itself, but AFAIK not the empty RAM slack around it. To get the real RAM usage, look in Task Manager at Octave’s occupied RAM before and after the call to csv2cell.

I am not surprised that string elements occupy more RAM than floats. Arrays of floating-point numbers use RAM much more efficiently because they are stored contiguously, in contrast to cellstr arrays and mixed-type cell arrays, which may contain a lot of interspersed slack space.
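If that first column is actually numeric, converting it from a cellstr to a double vector right after reading should give you that compact, contiguous storage; a minimal sketch, assuming the strings parse as plain numbers (str2double returns NaN for anything that doesn’t):

# array{1} is the cell array of strings returned by textscan’s ‘%s’ conversion.
col1 = str2double (array{1});   # contiguous double vector, 8 bytes per value
clear array                     # release the cellstr storage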

I’ll be away until maybe next Monday, but I am curious about your answers, if any.

This doesn’t make sense to me. TSV stands for Tab-Separated Values, but it now sounds like the separation character between columns is a space.

Obviously the original file is too large to share, so I created a test file of my own using this code.

str = ['A':'Y'];                         # 25 characters, 'A' through 'Y'
strmat = repmat (str, [13e6, 1]);        # repeat that row 13 million times
dlmwrite ("tstdata.tsv", strmat, '\t');  # write it out with tab separators

This file has 13M lines of 25 columns where each column is a string (in this case, just one character). The separation character is a tab (’\t’) and, because my PC runs Linux, the line endings are “\n”.
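As a quick sanity check (the expectation in the comment is mine), the first line of the generated file can be inspected like this:

fid = fopen ("tstdata.tsv");
disp (fgetl (fid))   # expect the letters A..Y separated by tab characters
fclose (fid);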

I can then read the data back using mmuetzel’s code.

fid = fopen ('tstdata.tsv');
tic; array = textscan(fid, '%s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s', 'Delimiter', "\t"); toc
fclose (fid);

Results are

Elapsed time is 44.1789 seconds.

which seems quite acceptable.
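As an aside, that long format string can also be built programmatically; this should be equivalent, assuming 25 columns of which only the first is kept:

fmt = ['%s', repmat(' %*s', 1, 24)];   # '%s' plus 24 skipped ('%*s') columns
fid = fopen ('tstdata.tsv');
tic; array = textscan (fid, fmt, 'Delimiter', "\t"); toc
fclose (fid);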


This might again be a case of Linux being the superior OS :slight_smile:
I got 28 seconds on Linux, and on Windows it is still going after a few minutes…

P.S. I redid the test with the test file on the same USB disk (same computer):
Linux → 30 sec, Windows → 222 sec (Octave 6.3.1).
