Windows UTF-8 and UCRT - yet another variant of Octave on Windows?

Working with non-ASCII characters on Windows has always been a nightmare.
Historically, only single-byte encodings were supported for C library functions and the Windows API. To support the whole range of Unicode characters, MS added “Unicode versions” of many runtime and API functions that take wchar_t* instead of char* in their input. These wide character strings were initially UCS-2 encoded but most (or all) of these functions support UTF-16 now. In Windows jargon, the char* interfaces are called “ANSI” and the wchar_t* interfaces “Unicode”.
In contrast, most (or all?) POSIX platforms use UTF-8 encoded char* in the C runtime.
Up until recently (before Windows 10 1903), it was not possible to set UTF-8 as the locale encoding for the C runtime and the Windows API in Windows.

Octave chose to use UTF-8 encoded strings internally. And many other POSIX libraries did so, too. That makes it easy to pass strings between libraries and Octave on POSIX platforms.
To correctly support non-ASCII characters on Windows, this means we need to convert back and forth between UTF-8 and UTF-16 at the interfaces where we call the C runtime library or the Windows API directly. We tried to do that in as many places as possible. But every so often a use case is reported where this still fails.
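
To illustrate the kind of conversion meant here, below is a minimal, portable sketch that decodes UTF-8 and re-encodes it as UTF-16. The name utf8_to_utf16 is made up for this example and is not Octave's actual helper. On Windows one would more likely call the system function MultiByteToWideChar with CP_UTF8 instead of hand-rolling this. Replacing malformed input with U+FFFD is just one possible policy.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not Octave's API): convert UTF-8 to UTF-16.
// Malformed sequences are replaced by U+FFFD (an assumption of this sketch).
std::u16string utf8_to_utf16 (const std::string& in)
{
  std::u16string out;
  std::size_t i = 0;
  while (i < in.size ())
    {
      const unsigned char lead = static_cast<unsigned char> (in[i]);
      std::size_t extra = 0;  // number of continuation bytes expected
      char32_t cp = 0;
      bool ok = true;

      if (lead < 0x80)                { cp = lead; }
      else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
      else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
      else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
      else                            { ok = false; }

      ++i;
      for (std::size_t k = 0; ok && k < extra; ++k)
        {
          if (i < in.size ()
              && (static_cast<unsigned char> (in[i]) & 0xC0) == 0x80)
            cp = (cp << 6) | (static_cast<unsigned char> (in[i++]) & 0x3F);
          else
            ok = false;
        }

      // Reject overlong forms, surrogate code points and values > U+10FFFF.
      static const char32_t min_cp[] = { 0, 0x80, 0x800, 0x10000 };
      if (ok && (cp < min_cp[extra] || cp > 0x10FFFF
                 || (cp >= 0xD800 && cp <= 0xDFFF)))
        ok = false;
      if (! ok)
        cp = 0xFFFD;

      if (cp < 0x10000)
        out.push_back (static_cast<char16_t> (cp));
      else
        {
          // Encode as a surrogate pair.
          cp -= 0x10000;
          out.push_back (static_cast<char16_t> (0xD800 + (cp >> 10)));
          out.push_back (static_cast<char16_t> (0xDC00 + (cp & 0x3FF)));
        }
    }
  return out;
}
```

The UTF-16 code units produced this way are what the wchar_t* ("Unicode") interfaces on Windows expect. The reverse direction is needed for anything those interfaces return.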
Additionally, we don’t have any influence on how other libraries treat strings containing non-ASCII characters and how they interface with the C runtime library on Windows. Most likely, they are using the ANSI interface with the native (non-ASCII) locale encoding, which might lead to a misinterpretation of the characters encoded in the strings.
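
A small sketch of that kind of misinterpretation: the function below takes the bytes of a string and reads each byte as one ISO-8859-1 (Latin-1) character, returning the result as UTF-8 so it can be displayed. Latin-1 only stands in for a typical single-byte "ANSI" code page here; the actual code page depends on the system locale. The function name is made up for this example.

```cpp
#include <string>

// Hypothetical illustration: interpret every byte of `bytes` as one
// ISO-8859-1 character and return what a reader would see, encoded as UTF-8.
std::string misread_as_latin1 (const std::string& bytes)
{
  std::string out;
  for (unsigned char b : bytes)
    {
      if (b < 0x80)
        out.push_back (static_cast<char> (b));
      else
        {
          // Code points U+0080..U+00FF take two bytes in UTF-8.
          out.push_back (static_cast<char> (0xC0 | (b >> 6)));
          out.push_back (static_cast<char> (0x80 | (b & 0x3F)));
        }
    }
  return out;
}
```

Feeding it the UTF-8 encoding of "é" (the bytes C3 A9) yields the two characters "Ã©", which is the typical garbled output seen when a library assumes a single-byte encoding for a UTF-8 string.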

As already hinted at above, Windows allows applications to set the locale to UTF-8 starting with Windows 10 1903. However, that only works with the “Universal C Run-Time” (UCRT).
That is a different C runtime from the one that was used before (MSVCRT - “Microsoft Visual C Run-Time”). Applications linked to that new runtime can only be executed on Windows versions supporting the UCRT. IIUC, that is only Windows 10 and newer.

Additionally, libraries linked with one of the two C runtimes cannot be linked with applications linked with the other C runtime.
That means that we’d need to rebuild Octave and all dependencies with the UCRT to be able to use any of the new features. Since we are building everything anyway as part of MXE Octave, that might not be a big issue.
That also means that .oct or .mex files built with an Octave linking to the MSVCRT cannot be used in an Octave linking to the UCRT (and vice versa). (Not sure if this is an issue.)

Using a native UTF-8 encoding on Windows will probably cover all issues with non-ASCII characters with any of the libraries we use. Most likely most things will “just work” like they currently do on POSIX platforms. It might also resolve the issues with non-ASCII characters in the command window on Windows. But I guess only testing will show.

There is an overview on this topic written by R developers. If you replace “R” by “GNU Octave” on that page, it pretty much applies exactly to our situation:
Windows UTF-8 UCRT - R

We probably don’t want to drop support for all Windows versions older than version 10 1903. So, we’d probably want to continue building Octave linked to MSVCRT.
However, could we consider additionally building Octave linked to the UCRT? It probably doesn’t make much sense to build these versions for 32-bit targets because most Windows 10 machines are probably running a 64-bit version of the OS anyway.

Would it be ok to distribute a new variant of Octave (w64-ucrt) in addition to the existing variants (w32, w64, w64-64)?

Halfway through reading this I was thinking “haven’t other programs had to deal with this? How’d they do it?” That R document was an interesting read. Misery loves company!

I like the idea of adding a UCRT build, understanding it’s yet another Windows option to maintain. Agreed there’s little sense in a 32-bit UCRT build. I’m a bit curious: what’s the current downside to the w64-64 version that prevents us from making that the default w64 build? Does it add significant overhead or performance degradation? My thought being, if we’re going to add a w64-UCRT, it might make sense to use the 64-bit indexing on all 64-bit versions and reduce the number of separate builds to maintain.

Perhaps not right away, but I think we should have a “legacy” build (w32) that runs on all supported configs and do all other builds with the UCRT.

The first patch adds the option to configure MXE Octave to build with the UCRT for Windows:
mxe-ucrt-1-add-option.patch (4.1 KB)
To build with that option, start with a clean MXE Octave and configure with --with-windows-msvcrt=ucrt.

A few packages didn’t build in that configuration. The following patch fixes that for me:
mxe-ucrt-2-fix-packages.patch (5.7 KB)

This is the first step to get this going. Nothing much (if anything at all) would be different in the resulting Octave package. The main difference is probably that this version will only start on Windows 10.

The next step would probably be to set a UTF-8 locale for Octave. IIUC, that is done by some kind of manifest:
Use the Windows UTF-8 code page - Windows apps | Microsoft Docs

I haven’t looked in much detail into how this would be done. But I’d be happy for comments on the initial patch.
How should a --with-* configure parameter that doesn’t take yes or no be handled “correctly”? Fail on unrecognized values?
How should the makefile variable be named? Currently it’s USE_MSVCRT which could be set to either msvcrt or ucrt.

Edit: IIUC, the manifest part is about setting the code page for the Windows API. I’d guess that the C runtime could be set to a UTF-8 locale with setlocale now. Maybe we’ll need some kind of configure check to see if the target supports UTF-8 locales?

IIUC, many (or all?) major distributions package the Fortran linear algebra libraries with 32-bit Fortran indexing (at least in their default version). That means that the 64-bit Fortran indexing versions are probably not as thoroughly tested as the 32-bit Fortran indexing versions.
IIRC, we didn’t want to package the 64-bit Fortran indexing versions as the only option. ISTR that our motivation was that we didn’t want to be the “testing ground” for those libraries.

We didn’t receive many (any?) reports about issues with the Fortran 64-bit indexing version so far. At the same time, we don’t have any statistics of how much that version is actually used in the wild. We actively recommend that Windows users download and install the 32-bit Fortran indexing version. Given that, it might not be used very widely currently.
If we remove the (more thoroughly tested) 32-bit Fortran indexing version, we might open the flood gates. Or it might go pretty much unnoticed. Hard to tell in advance…

To be clear: This is not the indexing size of Octave arrays. That always matches the native index size of the architecture now. This is only the size of the indexing type used by the (Fortran) linear algebra libraries.
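
To make the distinction concrete, here is a small sketch with made-up names (array_index and fits_fortran_int are not Octave identifiers). It assumes a 64-bit build where the array index type is a 64-bit signed integer, and checks whether a dimension can be passed to a library built with a given Fortran integer width.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical stand-in for the index type of arrays on a 64-bit build.
using array_index = std::int64_t;

// Returns true if dimension n can be represented by the integer type that
// the (Fortran) linear algebra library was built with.
template <typename FortranInt>
bool fits_fortran_int (array_index n)
{
  return n >= 0
         && n <= static_cast<array_index> (std::numeric_limits<FortranInt>::max ());
}
```

With std::int32_t as FortranInt, a dimension of 2147483648 (2^31) no longer fits, while it does with std::int64_t. Arrays that large can still exist in either build; they just cannot be handed to a library using 32-bit Fortran integers.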

I pushed the (slightly modified) initial patches here:
mxe-octave: 7c0066684448
mxe-octave: 88dfa92d0c86

Linking with the UCRT is optional and the default is to link with the MSVCRT. So, nothing should be different unless MXE Octave is configured with --with-windows-msvcrt=ucrt.

Edit: Pushed an additional patch to the Octave repository that should switch to a UTF-8 locale in the C runtime if possible on Windows:
octave: bf619727bf6c (gnu.org)

UCRT ships by default with Windows 10, but it can be installed on Windows Vista and later, so you wouldn’t be limiting support per se (might not work OOTB though, which could confuse some users; not sure if that can be checked in the installer).

Could we maybe rename the configure option to --with-windows-crt? The above msvcrt=ucrt looks a bit “weird” (msvcrt and ucrt are exclusive in my mind).

I realize mingw-w64 itself uses --with-default-msvcrt, but I guess Octave doesn’t have to propagate it…

That didn’t do what I hoped it would. Hopefully fixed with this change:
octave: a12b5a32f94b (gnu.org)

That was my motivation for using that name. GCC uses the same name.
IIUC, the library that is used for linking to the UCRT is still -lmsvcrt. So GCC’s and mingw-w64’s naming convention makes sense to me. The switch selects whether that library will be the MSVCRT or the UCRT…

…but we’ll call it MSVCRT anyway :wink: Not convoluted at all. :confused: Thanks for the background explanation, I guess ok to just go w/ the (legacy) flow then, happy to see the UCRT build option.

Good point.
However, MS documents that UTF-8 will only work with Windows 10 1803 or later:
setlocale, _wsetlocale | Microsoft Docs
I guess we’ll need a better check for deciding if we can use that code page. I was hoping we could do that at compile time instead of at run time because I suspect that we’d need to select different code paths depending on that in many different places in the code.
Maybe we could instead artificially limit execution of these builds to Windows versions that provide the required features…