That said, “how long is a piece of string?” is a frequent question, so: when implementing numel (unicode2native (...)), does it need to internally allocate new memory? If so, is there a low-level function we could use that would just do the counting?
Yes, it has to allocate new memory because it goes through a transcoding step. And that’s a bummer for just a character-counting function.
There is a low-level “mbslen” function in gnulib that almost does what you want: I believe it takes a UTF-8 encoded string (which is Octave’s native representation for char strings at this point) and gives you a character count in Unicode code points. That is only “almost” if you want to be fully compatible with Matlab, though: Matlab’s strlength doesn’t count Unicode characters; it counts UTF-16 code units. (The two counts are the same unless the string contains characters outside the BMP, which UTF-16 encodes as surrogate pairs.) I don’t see an easy way to get that count without doing the transcoding and allocating memory.
So if you care about full Matlab compatibility, I think you have to eat the cost. If you’re happy with a little divergence, then you could have core Octave expose gnulib’s mbslen and get it cheaper (I think).
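To make the difference between the two counts concrete, here is a minimal C++ sketch. It is not gnulib’s mbslen and not anything in Octave: the names Utf8Counts and count_utf8 are hypothetical, and the code assumes the input is already well-formed UTF-8 (it does no validation). Under that assumption, both counts can be read off the lead bytes in a single pass, without a transcoding buffer; whether that assumption is acceptable for core Octave is a separate question.

```cpp
#include <cstddef>
#include <string>

// Both length notions for a byte string assumed to be well-formed UTF-8.
struct Utf8Counts {
    std::size_t code_points;       // Unicode code points
    std::size_t utf16_code_units;  // units a UTF-16 encoding would need
};

// Hypothetical helper: one pass over the bytes, no allocation, no validation.
Utf8Counts count_utf8(const std::string& s) {
    Utf8Counts n{0, 0};
    for (unsigned char b : s) {
        // Bytes of the form 10xxxxxx continue the previous character.
        if ((b & 0xC0) == 0x80) {
            continue;
        }
        ++n.code_points;
        // A lead byte >= 0xF0 starts a 4-byte sequence, i.e. a code point
        // above U+FFFF, which UTF-16 stores as a surrogate pair (2 units).
        n.utf16_code_units += (b >= 0xF0) ? 2 : 1;
    }
    return n;
}
```

For the bytes of “a” followed by U+1F600 (`"a\xF0\x9F\x98\x80"`), this gives 2 code points but 3 UTF-16 code units, which is exactly the case where a code-point count and Matlab’s strlength would disagree.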
Should the piece of code discussed here end up in a strlength.m function instead of a method of the string class?
IMHO, no. This code deals with the internal structures (private fields) and implementation of the string class; thus it belongs inside the string class. It is perfectly fine for classes to override generic functions; IMHO that’s how you’re supposed to do it. There should just be another strlength.m function to handle other types. Because of how method dispatch works (object methods take precedence over plain functions), the general function and specialized methods don’t even have to know about each other or interact, aside from having compatible semantics.
Given all the great work you have already done, it sounds very sensible to use it for this goal, if you agree with it. This would mean teasing apart everything string-related from Tablicious; otherwise it is a never-ending task, as I don’t think Tablicious is going to be merged into Octave (if you so wish) anytime soon.
Oh yes; feel free to take and use it. The string class implementation is pretty much standalone; I don’t think it has any dependencies on the rest of Tablicious. You could just copy my string.m file and drop it into another code base and I think it would just work. And it’s GPL, so no license concerns there.
string arrays are such a basic, widely used data type that I think Octave would want to end up with a C++ implementation of them instead of an MCOS/M-code-level one, both for performance and for more flexibility in the internal representation of strings, so you can do things like “cheat” and lazily cache the results of strlength() etc.
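As a rough illustration of the “lazily cache” idea, here is a small C++17 sketch. The CachedString class and its members are hypothetical and do not reflect Octave’s actual internals; it assumes the stored bytes are well-formed UTF-8 and counts UTF-16 code units the way Matlab’s strlength is described above.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>

// Hypothetical string element; not Octave's real internal representation.
class CachedString {
public:
    explicit CachedString(std::string utf8) : bytes_(std::move(utf8)) {}

    // Replace the contents and drop the cached length, since it is now stale.
    void assign(std::string utf8) {
        bytes_ = std::move(utf8);
        utf16_len_.reset();
    }

    // UTF-16 code unit count: computed on the first call, reused afterwards.
    std::size_t strlength() const {
        if (!utf16_len_) {
            std::size_t n = 0;
            for (unsigned char b : bytes_) {
                if ((b & 0xC0) == 0x80) continue;  // continuation byte
                n += (b >= 0xF0) ? 2 : 1;          // 4-byte lead => surrogate pair
            }
            utf16_len_ = n;
        }
        return *utf16_len_;
    }

    // Exposed only so the caching behaviour can be observed.
    bool length_is_cached() const { return utf16_len_.has_value(); }

private:
    std::string bytes_;                             // UTF-8 payload
    mutable std::optional<std::size_t> utf16_len_;  // empty until first use
};
```

The cache is `mutable` so that strlength() can stay a const query; any operation that changes the bytes has to reset it, which is the kind of bookkeeping that is easier to enforce in a C++ implementation than in M-code.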