Working with non-ASCII characters on Windows has always been a nightmare.
Historically, only single-byte encodings were supported for C library functions and the Windows API. To support the whole range of Unicode characters, Microsoft added “Unicode versions” of many runtime and API functions that take `wchar_t*` instead of `char*` in their input. These wide character strings were initially UCS-2 encoded, but most (or all) of these functions support UTF-16 now. In Windows jargon, the `char*` interfaces are called “ANSI” and the `wchar_t*` interfaces “Unicode”.
In contrast, most (or all?) POSIX platforms use UTF-8 encoded `char*` strings in the C runtime.
Until recently (before Windows 10 1903), it was not possible to set UTF-8 as the locale encoding for the C runtime or the Windows API.
Octave chose to use UTF-8 encoded strings internally, and many other libraries on POSIX platforms did so, too. That makes it easy to pass strings between those libraries and Octave on POSIX platforms.
To correctly support non-ASCII characters on Windows, this means we need to convert back and forth between UTF-8 and UTF-16 at the interfaces where we call the C runtime library or the Windows API directly. We tried to do that in as many places as possible. But every so often a use case is reported where this still fails.
Additionally, we don’t have any influence on how other libraries treat strings containing non-ASCII characters or how they interface with the C runtime library on Windows. Most likely, they are using the ANSI interfaces with the native (non-UTF-8) locale encoding, which might lead to a misinterpretation of the characters encoded in those strings.
As already hinted at above, Windows allows applications to set the locale to UTF-8 starting with Windows 10 1903. However, that only works with the “Universal C Run-Time” (UCRT).
That is a different C runtime from the one that was used before (MSVCRT, the “Microsoft Visual C Run-Time”). Applications linked against the new runtime can only be executed on Windows versions supporting the UCRT. IIUC, that is only Windows 10 and newer.
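For reference, with the UCRT an application can opt in to a UTF-8 active code page either by calling `setlocale (LC_ALL, ".UTF-8")` or by declaring it in the application manifest. According to Microsoft’s documentation for Windows 10 1903 and later, a manifest fragment along these lines does it:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <application>
    <windowsSettings>
      <!-- Force the process's ANSI code page to UTF-8 (Windows 10 1903+) -->
      <activeCodePage
        xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>
```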
Additionally, libraries linked against one of the two C runtimes cannot be linked into applications using the other one.
That means we’d need to rebuild Octave and all of its dependencies with the UCRT to be able to use any of the new features. Since we are building everything as part of MXE Octave anyway, that might not be a big issue.
That also means that .oct or .mex files built with an Octave linking to the MSVCRT cannot be used in an Octave linking to the UCRT (and vice versa). (Not sure if this is an issue.)
Using a native UTF-8 encoding on Windows would probably resolve the issues with non-ASCII characters in any of the libraries we use. Most likely, most things would “just work” like they currently do on POSIX platforms. It might also resolve the issues with non-ASCII characters in the command window on Windows. But I guess only testing will show.
There is an overview on this topic written by R developers. If you replace “R” by “GNU Octave” on that page, it pretty much applies exactly to our situation:
Windows UTF-8 UCRT - R
We probably don’t want to drop support for all Windows versions older than version 10 1903. So, we’d probably want to continue building Octave linked to MSVCRT.
However, could we consider additionally building Octave linked against the UCRT? It probably doesn’t make much sense to build these versions for 32-bit targets because most Windows 10 machines are probably running a 64-bit version of the OS anyway.
Would it be ok to distribute a new variant of Octave (w64-ucrt) in addition to the existing variants (w32, w64, w64-64)?