Should I use `.oct-config` when I have utf-8 strings in m-files?

I’m debugging some packaging troubles related to running “doctest”, which I’ve traced to having UTF-8 characters in a regexp [1]. Is it current best practice to add a .oct-config file containing encoding=utf-8 [2]? I haven’t seen any other packages doing that yet (but maybe other people are sane and don’t include non-ASCII in their m-files…).

[1] error: regexp: unrecognized character after (? or (?- at position 6 of expression · Issue #251 · catch22/octave-doctest · GitHub
[2] GNU Octave - Bugs: bug #49685, Set .m file encoding on a... [Savannah]

By default, Octave uses the system locale to read .m files. That might have been different in older versions (where the default encoding was platform dependent iirc).
In newer versions, it also defaults to the system locale when reading and writing files in text mode. The latter was done for compatibility with Matlab iirc.
You can query the encoding that Octave deduced from the system locale with __locale_charset__, and check which encoding is currently used with __mfile_encoding__.
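For example, a minimal sketch that prints both values. Both names are internal (double-underscore) functions taken from the paragraph above, so they may be missing or renamed depending on the Octave version; the `exist` guard is only there to avoid an error in that case.

```octave
% Internal functions named in the post above; availability may vary
% between Octave versions, so check before calling them.
names = {"__locale_charset__", "__mfile_encoding__"};

for i = 1:numel (names)
  if (exist (names{i}))
    % Called without arguments, each function returns an encoding name.
    printf ("%s: %s\n", names{i}, feval (names{i}));
  else
    printf ("%s: not available in this Octave version\n", names{i});
  endif
endfor
```

If the two values differ, the m-file encoding was presumably changed from the locale default, for instance through the GUI preferences.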

Another change that might be relevant concerns regexp: it used to handle non-ASCII characters in UTF-8 encoded strings incorrectly. (E.g., it might have matched single bytes of multibyte characters, so that the pattern [ö] matched a single byte of ä.) UTF-8 is the internal encoding that Octave uses for its character arrays (strings). Now that this is fixed, regexp emits an error if it encounters invalid UTF-8. That might be the error that you are seeing.
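To illustrate the byte-level overlap (this is only a sketch of the arithmetic, not of how regexp is implemented): in UTF-8, ö (U+00F6) and ä (U+00E4) are both two-byte sequences that start with the same lead byte. The bytes are written as numbers here so that the snippet does not itself depend on the m-file encoding.

```octave
% UTF-8 byte sequences, given numerically to stay encoding-independent.
bytes_o = uint8 ([195 182]);   % 0xC3 0xB6, the character o-umlaut
bytes_a = uint8 ([195 164]);   % 0xC3 0xA4, the character a-umlaut

% A matcher that treats the class [ö] as a set of single bytes would
% accept any byte that the two sequences have in common.
shared = intersect (bytes_o, bytes_a);
printf ("shared byte: %d\n", shared);
```

The shared lead byte is why a byte-wise character class for one of them could match part of the other.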

Both changes (or yet another change) might be relevant here.

The default encoding can be changed either with __mfile_encoding__ or with the dropdown list on the “Editor” tab of the GUI preferences.

The solution with the .oct-config file looks reasonable to me if it works. In principle, that setting only applies to parsing .m files (see also dir_encoding). I’m not sure if it is also used when files are opened in text mode (it probably isn’t). And I don’t know where the string came from that regexp complains about in the report you linked.
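As a rough sketch, assuming the file only needs the single encoding=utf-8 line quoted in the question, it could be created from Octave like this. The directory is a throwaway temporary one standing in for the package's source directory, and the variable names are made up for illustration.

```octave
% Hypothetical stand-in for the directory that holds the package's .m files.
pkg_dir = tempname ();
mkdir (pkg_dir);

% Write the one-line config file mentioned in the question.
cfg = fullfile (pkg_dir, ".oct-config");
fid = fopen (cfg, "w");
fputs (fid, "encoding=utf-8\n");
fclose (fid);

% Read it back to confirm what was written.
disp (fileread (cfg));
```

In a real package the file would simply be committed next to the m-files it applies to.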

Thanks for all the pointers!

> And I don’t know where the string came from that regexp complains about in the report you linked.

It’s just a string in the source code m-file:

      L = regexprep (L, '^(\s*)(?:⇒|=>|⊣|-\||error→|error->)', '$1', 'once', 'lineanchors');

Note that there are non-ASCII characters in there. The file is UTF-8 encoded:

$ file inst/private/doctest_collect.m 
inst/private/doctest_collect.m: Unicode text, UTF-8 text

I filed a pull request to add the .oct-config to Doctest, and CC’d you. Please look if you have time: