Implementation of a string class

This thread is about the implementation of a string class in Octave and aims to link to all previous discussions and existing initiatives.

Matlab documentation is here:

and @apjanke has a highly relevant blog post here:

This is mentioned as a possible Octave project here:

and has been discussed previously on the maintainers mailing list here:

@apjanke probably has the most advanced implementation as part of tablicious:

A first step would be to have a complete implementation of string as a classdef M-file, then fix existing functions so that they accept char arrays and strings wherever it’s relevant (for compatibility with Matlab), and then deal with the issue of double quotes, which currently have a different meaning in Matlab and Octave.

Should we recommend from now on that Octave M-files use single quotes when dealing with strings when it makes sense?

Aw, thanks!

Shameless plug: can we add my blog post, “String Representation in Matlab is a Mess” (Andrew Janke’s Blog), to the reference material list? :wink:

A first step would be to have a complete implementation of string as a classdef M-file then fix existing functions so that they can accept char arrays and strings wherever it’s relevant (in compatibility with Matlab) then deal with the issue of double quotes, which currently have a different meaning in Matlab and Octave.

I think this is maybe (probably?) a good idea. In fact, this is how Matlab did it: the string array class existed in Matlab for a version or two before the double-quoted-string-literal syntax was added; if you wanted to use it, you had to call string('foo'). I’ve also worked with custom string classes implemented as MCOS classes in userland Matlab code (from like the 2014 days), and they worked fine, and were useful.

But they weren’t dealing with the baggage of an existing double-quoted-string-literal syntax that was already in the code base. Hmmm…

Thank you and sorry - I added your blog post in the list. I came across it yesterday but somehow forgot it in the meantime.

I feel like a string class has been discussed elsewhere but couldn’t find further links, e.g. in the bug tracker on Savannah. Perhaps it was on IRC.

Thanks!

Oh, yeah, we’ve discussed it repeatedly and all over the place. I couldn’t even point you to everywhere. But there are probably some Savannah bugs we should reference. Hard to easily search for all of them, but I see:

I feel I’m missing some, but haven’t been able to easily surface them in the Savannah search function.

One other relevant Octave-Maintainers discussion:

Thank you for these links. I was looking at what you already implemented and made this graph:

You used Java for Unicode support: how much do you think can now be natively handled with Octave?

And here is the list of methods for the string class as reported by R2021a:

>> methods string

Methods for class string:

append          erase           insertBefore    or              startsWith      
cellstr         eraseBetween    ismissing       pad             strip           
char            extract         issorted        plus            strlength       
compose         extractAfter    join            replace         upper           
contains        extractBefore   le              replaceBetween  
count           extractBetween  lower           reverse         
double          ge              lt              sort            
endsWith        gt              matches         split           
eq              insertAfter     ne              splitlines      

All of it, I think, especially if you’re willing to modify Octave itself a bit to expose more of the gnulib Unicode functionality to the M-code level. I just used Java because I was lazy.

The encode() and decode() methods can be implemented directly with calls to Octave’s native2unicode() and unicode2native(). Honestly I don’t know why I didn’t do that in the first place.
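To illustrate the point about encode() and decode(), here is a minimal sketch written as plain functions on UTF-8 char vectors rather than as methods of the string class. The names encode_sketch and decode_sketch are made up for this example; only unicode2native() and native2unicode() are existing Octave functions.

```octave
1;  % script-file marker so that functions can be defined in a script

% Hypothetical stand-in for an encode() method:
% UTF-8 char vector -> uint8 bytes in the requested charset.
function bytes = encode_sketch (str, charset)
  bytes = unicode2native (str, charset);
endfunction

% Hypothetical stand-in for a decode() method:
% uint8 bytes in the given charset -> UTF-8 char vector.
function str = decode_sketch (bytes, charset)
  str = native2unicode (bytes, charset);
endfunction

a_umlaut = char ([195 164]);                        % UTF-8 bytes of U+00E4
latin1 = encode_sketch (a_umlaut, 'ISO-8859-1');    % a single byte in Latin-1
roundtrip = decode_sketch (latin1, 'ISO-8859-1');   % back to the two UTF-8 bytes
```

The test string is built from byte values so that the example does not depend on the encoding of the source file.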

For the others - strlength() and reverse() - if you don’t mind a bit of inefficiency, I think you could do those right now by just using unicode2native() to transcode them to UTF-32, typecast that to uint32 and operate on that, and then in the case of reverse() use native2unicode() to transcode it back to UTF-8. If you want to go faster, I think core Octave would need to expose the gnulib functions to do these operations; I bet that would be easy.

E.g., I think the native-Octave strlength() implementation is literally just:

out(i) = numel(typecast(unicode2native(this.strs{i}, 'UTF-32'), 'uint32'));

In fact, I’ll go see if I can do this for Tablicious this weekend: issue #77, “Remove Java use from string implementation”, in apjanke/octave-tablicious on GitHub.

Oh, looks like you need to strip the BOM because unicode2native('foo', 'UTF-32') comes back with a BOM. Aside from that, looks like it just works. I’ll change Tablicious to do that shortly.

To be on the safe side you should probably do something like this:

[~, ~, endian] = computer ();
a = typecast (unicode2native ('foo', sprintf ('UTF-32%cE', endian)), 'uint32')

At least on Windows, that also avoids the BOM.

That’s a good idea. That’s probably the right way to do it.

Okay, I updated Tablicious’s string to use native Octave operations instead of Java: https://github.com/apjanke/octave-tablicious/commit/3b905b4d190d4869bb6cde2baa37b9273a61822b

Went pretty much like we talked about here.
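For readers who don't want to dig through the commit, the approach discussed above could look roughly like the sketch below. This is not the Tablicious code; the function names (utf32_native, to_codepoints, strlength_utf32, reverse_utf32) are invented for illustration, and it works on a single UTF-8 char vector rather than on a string object.

```octave
1;  % script-file marker so that functions can be defined in a script

% Name of the native-endian UTF-32 encoding, e.g. 'UTF-32LE'.
% Naming the byte order explicitly is what avoids the BOM.
function enc = utf32_native ()
  [~, ~, endian] = computer ();
  enc = sprintf ('UTF-32%cE', endian);
endfunction

% Transcode a UTF-8 char vector to a vector of uint32 code points.
function cp = to_codepoints (str)
  if isempty (str)
    cp = zeros (1, 0, 'uint32');
  else
    cp = typecast (unicode2native (str, utf32_native ()), 'uint32');
  endif
endfunction

% Length in Unicode code points (not UTF-16 code units).
function n = strlength_utf32 (str)
  n = numel (to_codepoints (str));
endfunction

% Reverse the order of the code points and transcode back to UTF-8.
function out = reverse_utf32 (str)
  cp = to_codepoints (str);
  if isempty (cp)
    out = str;
    return
  endif
  out = native2unicode (typecast (cp(end:-1:1), 'uint8'), utf32_native ());
endfunction
```

Note that reversing code points is still not the same as reversing user-perceived characters: combining sequences would come out in the wrong order.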

Thanks @apjanke and @mmuetzel !

As mentioned in the last developer meeting, speed shouldn’t necessarily be a priority when implementing a string class as m-file classdef. That said, asking “how long is a piece of string?” is a frequent question so when implementing strlength using numel (unicode2native (...)), does it need to internally allocate new memory? If so, would there be a low-level function we could use that would just do the counting?

strlength also has to be defined for char vectors and cellstr. Should the piece of code discussed here end up in a strlength.m function instead of a method of the string class? Or should that function simply be implemented as strlength (string (str)) ?

Finally, my aim with this thread is to have a string class in Octave so that we can focus on the (probably harder) task of modifying all Octave m-files so that they work natively with char arrays or string inputs. Given all the great work you have already done, it sounds very sensible to use it for this goal, if you agree with it. This would mean teasing apart everything string-related from Tablicious; otherwise it is a never-ending task, as I don’t think Tablicious is going to be merged in Octave (if you so wish) anytime soon.

We have the function unicode_idx that almost does that, but it returns an index vector the same size as the input char array. I’d guess we can use a very similar approach to just count the number of Unicode code points. (That’s neither the number of UTF-16 code units nor the number of characters in the string, but it is probably the most useful answer such a function could return easily.)

Reading the Matlab documentation for that function, I’d agree that it should be a “general” function. As long as we don’t have a native string type in Octave, the unicode_idx-like function would need to work on UTF-8 character vectors anyway.
But IIUC, that doesn’t mean that the string class can’t overload that function. Maybe that would make input handling easier.

Ah, thanks, I was not aware of this function. It doesn’t seem to be used anywhere in Octave and is not a Matlab function. Would you know why it was introduced with this syntax?
Looking at its code, it would indeed be easy to have a small variant that would just compute the value for strlength. Or, for non-empty strings, we could use unicode_idx (str)(end).
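As an M-code-only alternative to unicode_idx (str)(end), counting code points in a UTF-8 char vector can be sketched by counting the bytes that are not continuation bytes (continuation bytes have the bit pattern 10xxxxxx). The function name count_codepoints is made up for this example, and the sketch assumes the input is valid UTF-8.

```octave
1;  % script-file marker so that functions can be defined in a script

% Count Unicode code points in a UTF-8 encoded char vector.
% Every code point contributes exactly one byte that is NOT a
% continuation byte, so counting those bytes counts the code points.
function n = count_codepoints (str)
  bytes = uint8 (str);
  is_continuation = bitand (bytes, uint8 (192)) == uint8 (128);
  n = nnz (~is_continuation);
endfunction
```

This avoids the transcoding step but still creates temporaries the size of the input, so it does not answer the memory allocation question by itself.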

Yes, as we might have this pattern for a number of functions/methods, I am just wondering which approach we should settle on: the method as a wrapper of the function, or vice versa.

That said, asking “how long is a piece of string?” is a frequent question so when implementing strlength using numel (unicode2native (...)), does it need to internally allocate new memory? If so, would there be a low-level function we could use that would just do the counting?

Yes, it has to allocate new memory because it goes through a transcoding step. And that’s a bummer for just a character-counting function.

There is a low-level “mbslen” function in gnulib that almost does what you want: I believe it takes a UTF-8 encoded string (which is Octave’s native string representation in chars at this point), looks at it, and gives you a character count (Unicode code points). That is fine unless you want to be fully compatible with Matlab: Matlab’s strlength doesn’t count Unicode characters; it counts UTF-16 code units. (Which is the same thing unless you have surrogate-pair characters outside the BMP.) I don’t see an easy way to do that without doing the transcoding and allocating memory.

So if you care about full Matlab compatibility, I think you have to eat the cost. If you’re happy with a little divergence, then you could have core Octave expose gnulib’s mbslen and get it cheaper (I think).
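A minimal sketch of the “eat the cost” option, assuming the goal is a count of UTF-16 code units: transcode to native-endian UTF-16 and count the resulting 16-bit units. The function name strlength_utf16 is hypothetical; unicode2native(), typecast() and computer() are existing Octave functions.

```octave
1;  % script-file marker so that functions can be defined in a script

% Number of UTF-16 code units needed for a UTF-8 encoded char vector.
% Code points outside the BMP count twice (one surrogate pair each).
function n = strlength_utf16 (str)
  if isempty (str)
    n = 0;
    return
  endif
  [~, ~, endian] = computer ();
  enc = sprintf ('UTF-16%cE', endian);   % explicit byte order, so no BOM
  units = typecast (unicode2native (str, enc), 'uint16');
  n = numel (units);
endfunction
```

For example, a char vector holding only U+1F600 (UTF-8 bytes 240 159 152 128) would give 2 here, where a code point count would give 1.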

Should the piece of code discussed here end up in a strlength.m function instead of a method of the string class?

IMHO, no. This code deals with the internal structures (private fields) and implementation of the string class; thus it belongs inside the string class. It is perfectly fine for classes to override generic functions; IMHO that’s how you’re supposed to do it. There should just be another strlength.m function to handle other types. Because of how method dispatch works - object methods take precedence over plain functions - the general function and specialized methods don’t even have to know about each other or interact (aside from having compatible semantics).
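To make the split concrete, the general function for the non-object types might look like the sketch below. It is named strlength_generic here only to avoid clashing with anything real, and it counts code points from the UTF-8 bytes, which is an assumption about the semantics we would pick. A strlength method on the string class would live in the classdef and be dispatched to for string objects, so this function never needs to know about it.

```octave
1;  % script-file marker so that functions can be defined in a script

% Hypothetical general-purpose strlength for char vectors and cellstrs.
% Counts code points by counting non-continuation bytes (valid UTF-8 assumed).
function n = strlength_generic (s)
  count = @(c) nnz (bitand (uint8 (c), uint8 (192)) ~= uint8 (128));
  if ischar (s)
    n = count (s);             % a char vector is treated as one string
  elseif iscellstr (s)
    n = cellfun (count, s);    % one length per cell, same shape as s
  else
    error ('strlength_generic: input must be a char vector or a cellstr');
  endif
endfunction
```

Usage would be along the lines of strlength_generic ('abc') for a char vector, or strlength_generic ({'a', 'bcd'}) for a cellstr.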

Given all the great work you have already done, it sounds very sensible to use it for this goal, if you agree with it. This would mean teasing apart everything string-related from Tablicious; otherwise it is a never-ending task, as I don’t think Tablicious is going to be merged in Octave (if you so wish) anytime soon.

Oh yes; feel free to take and use it. The string class implementation is pretty much standalone; I don’t think it has any dependencies on the rest of Tablicious. You could just copy my string.m file and drop it in to another code base and I think it would just work. And it’s GPL, so no license concerns there.

However, string arrays are such a basic and widely used data type that I think Octave would want to end up with a C++ implementation of them instead of an MCOS/M-code level implementation, both for performance and for more flexibility in the internal representation of strings, so that it can do things like “cheat” and lazily cache the results of strlength() etc.

(Which is the same thing unless you have surrogate-pair characters outside the BMP.) I don’t see an easy way to do that without doing the transcoding and allocating memory.

Without knowing much about the available low-level functions, I’m assuming there isn’t a low-cost way to tell whether the string contains such characters without going through it character by character in the first place? If there were, you could check first and then use the more efficient way when applicable.

There is not. Because the native form of the string representation in Octave is UTF-8, you have to parse the UTF-8 in order to extract the code point values; then values above 65535 (0xFFFF) are outside the BMP. As far as I know, any tricks for reading UTF-8 code unit runs to detect range areas are computationally equivalent to just doing that parsing.

If Octave char used UTF-16 internally, then you could just check the values of each individual UTF-16 code unit to detect non-BMP chars. But if char was UTF-16 internally, then strlength() would just be numel(str). :wink:
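To show what the scan over UTF-8 amounts to in practice: in valid UTF-8, code points outside the BMP are exactly the ones encoded in four bytes, and their lead byte is 0xF0 or higher. A sketch of the check, with the made-up name has_non_bmp and assuming valid UTF-8 input, could be:

```octave
1;  % script-file marker so that functions can be defined in a script

% True if a UTF-8 encoded char vector contains a code point outside the BMP.
% Only the lead byte of a four-byte sequence has a value of 0xF0 (240) or
% more, so one pass over the bytes is enough.
function tf = has_non_bmp (str)
  tf = nnz (uint8 (str) >= 240) > 0;
endfunction
```

It is still a full pass over every byte of the string, so it is a cheaper constant factor rather than a shortcut around the parsing.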

If Octave char used UTF-16 internally

Sounds like another proposal for another day. Would that solve all of Octave’s other Unicode issues?

I was planning on using it for something. But I forgot what that something was.
IIRC, the motivation was that there was no easy way to iterate through the code points in a UTF-8 encoded char array in Octave (e.g. to reverse a string, or get or replace the n-th code point, or other similar things). That function was meant to make something like that easier. (But admittedly still cumbersome.)
Would that become easier with the string class?

I’d agree that the number of Unicode codepoints in a string is probably more useful than the number of code units it would require if it was encoded in UTF-16.

I had a quick look at unistr/u8-mbsnlen. It seems easy enough to wrap that into an Octave function.

It would solve many of them, especially in regards to Matlab compatibility, including any M-code functions that do characterwise operations. But it would introduce others: basically none of the libraries Octave interacts with (including C/C++ stdlib) take UTF-16 strings in their interfaces, so you’d have to have transcoding layers all over the place in the core Octave code base. It would be a large project.

See this whole discussion:

No. The stuff you’re talking about is characterwise operations, which you still need to do at a lower level with Octave’s char or something like that. The string class does nothing for you here, because it’s a slightly higher-level thing for representing arrays of strings, where an entire string is the “basic” element, and doing operations that work on whole strings as their “atomic” unit.

A utf16char or uchar class that encapsulates UTF-16-encoded, or flexible-width-encoded Unicode characters/code points (as opposed to UTF-8 code units/octets) could help out here, but it’s really a different animal.