Chapter 9 Code set conversion

omniORB 4.0 supports full code set negotiation, used to select and translate between different character code sets, for the transmission of chars, strings, wchars and wstrings. The support is mostly transparent to application code, but there are a number of options that can be selected. This chapter covers the options, and also gives some pointers about how to implement your own code sets, in case the ones that come with omniORB are not sufficient.

9.1 Native code sets

For the ORB to know how to handle strings and wstrings given to it by the application, it must know what code set they are represented with, so it can properly translate them if need be. The defaults are ISO 8859-1 (Latin 1) for char and string, and UTF-16 for wchar and wstring. Different code sets can be chosen at initialisation time with the nativeCharCodeSet and nativeWCharCodeSet parameters. The supported code sets are printed out at initialisation time if the ORB traceLevel is 15 or greater.

For most applications, the defaults are fine. Some applications may need to set the native char code set to UTF-8, allowing the full Unicode range to be supported in strings.
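As a rough illustration, the native char code set could be switched to UTF-8 with a configuration file entry like the one below. This is a sketch that assumes the usual omniORB `key = value` configuration file syntax and the code set name `UTF-8`. Check the exact spelling against the list of supported code sets that the ORB prints at traceLevel 15.

```
# omniORB.cfg fragment (sketch; code set names assumed as printed by the ORB)
nativeCharCodeSet = UTF-8
```

The same parameter can normally be given on the command line with the -ORB prefix, for example `-ORBnativeCharCodeSet UTF-8`.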

Note that the default for wchar is always UTF-16, even on Unix platforms where wchar is a 32-bit type. Select the UCS-4 code set if you need to use characters outside the first plane without having to use UTF-16 surrogates¹.
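For readers who do need to know what surrogates are, the following stand-alone sketch shows the standard UTF-16 arithmetic for a character outside the first plane. It is independent of omniORB; the function names `to_surrogates` and `from_surrogates` are made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair.
std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);   // only supplementary planes
    std::uint32_t v = cp - 0x10000;            // 20-bit offset
    std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));   // top 10 bits
    std::uint16_t low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)); // bottom 10 bits
    return {high, low};
}

// Recombine a surrogate pair into the original code point.
std::uint32_t from_surrogates(std::uint16_t high, std::uint16_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000 + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
                   + (static_cast<std::uint32_t>(low) - 0xDC00);
}
```

For instance, U+1F600 becomes the two 16-bit units 0xD83D and 0xDE00. With UCS-4 as the native wchar code set, the same character occupies a single 32-bit wchar.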

9.2 Code set library

To save space in the main ORB core library, most of the code set implementations are in a separate library named omniCodeSets4. To use the extra code sets, you must link your application with that library. On most platforms, if you are using dynamic linking, specifying the omniCodeSets4 library in the link command is sufficient to have it initialised, and for the code sets to be available. With static linking, or platforms with less intelligent dynamic linkers, you must force the linker to initialise the library. You do that by including the omniORB4/optionalFeatures.h header. By default, that header enables several optional features. Look at the file contents to see how to turn off particular features.

9.3 Implementing new code sets

It is quite easy to implement new code sets, if you need support for code sets (or marshalling formats) that do not come with the omniORB distribution. There are extensive comments in the headers and ORB code that explain how to implement a code set; this section just serves to point you in the right direction.

The main definitions for the code set support are in include/omniORB4/codeSets.h. That defines a set of base classes used to implement code sets, plus some derived classes that use look-up tables to convert simple 8-bit and 16-bit code sets to Unicode.

When sending or receiving string data, there are a total of four code sets in action: a native char code set, a transmission char code set, a native wchar code set, and a transmission wchar code set. The native code sets are as described above; the transmission code sets are the ones selected to communicate with a remote machine. They are responsible for understanding the GIOP marshalling formats, as well as the code sets themselves. Each of the four code sets has an object associated with it which contains methods for converting data.

There are two ways in which a string/wstring can be transmitted or received. If the transmission code set in action knows how to deal directly with the native code set (the trivial case being that they are the same code set, but more complex cases are possible too), the transmission code set object can directly marshal or unmarshal the data into or out of the application buffer. If the transmission code set does not know how to handle the native code set, it converts the string/wstring into UTF-16, and passes that to the native code set object (or vice-versa). All code set implementations must therefore know how to convert to and from UTF-16.
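The two paths can be sketched in a few lines of stand-alone C++. This is only an illustration of the idea, not omniORB's implementation: the types `Latin1Native`, `TransmissionCodeSet`, `Latin1Transmission` and `Utf8Transmission` and the function `marshalString` are hypothetical, and the GIOP details (length prefix, terminating null, byte order) are left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical native code set: application strings are ISO 8859-1.
struct Latin1Native {
    static const char* name() { return "ISO-8859-1"; }
    static std::u16string toUtf16(const std::string& s) {
        std::u16string out;
        for (unsigned char c : s) out.push_back(static_cast<char16_t>(c));
        return out;   // Latin-1 byte values equal their Unicode code points
    }
};

// Hypothetical transmission code set interface.
struct TransmissionCodeSet {
    virtual ~TransmissionCodeSet() = default;
    // Direct path: succeeds only if this object understands the native format.
    virtual bool fastMarshal(const std::string& nativeName,
                             const std::string& s, std::string& wire) const = 0;
    // Fallback path: marshal from the UTF-16 intermediate form.
    virtual std::string marshalUtf16(const std::u16string& u) const = 0;
};

struct Latin1Transmission : TransmissionCodeSet {
    bool fastMarshal(const std::string& nativeName,
                     const std::string& s, std::string& wire) const override {
        if (nativeName != "ISO-8859-1") return false;
        wire = s;                        // same code set: copy the bytes
        return true;
    }
    std::string marshalUtf16(const std::u16string& u) const override {
        std::string out;
        for (char16_t c : u) {
            assert(c <= 0xFF);           // a real ORB would report a conversion error
            out.push_back(static_cast<char>(c));
        }
        return out;
    }
};

struct Utf8Transmission : TransmissionCodeSet {
    bool fastMarshal(const std::string& nativeName,
                     const std::string& s, std::string& wire) const override {
        if (nativeName != "UTF-8") return false;
        wire = s;
        return true;
    }
    std::string marshalUtf16(const std::u16string& u) const override {
        std::string out;
        for (char16_t c : u) {
            assert(c < 0xD800 || c > 0xDFFF);   // surrogates omitted for brevity
            if (c < 0x80) {
                out.push_back(static_cast<char>(c));
            } else if (c < 0x800) {
                out.push_back(static_cast<char>(0xC0 | (c >> 6)));
                out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
            } else {
                out.push_back(static_cast<char>(0xE0 | (c >> 12)));
                out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
                out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
            }
        }
        return out;
    }
};

// Try the direct path first, then fall back to the UTF-16 intermediate form.
std::string marshalString(const TransmissionCodeSet& tcs, const std::string& s,
                          bool* usedPivot = nullptr) {
    std::string wire;
    bool direct = tcs.fastMarshal(Latin1Native::name(), s, wire);
    if (!direct) wire = tcs.marshalUtf16(Latin1Native::toUtf16(s));
    if (usedPivot) *usedPivot = !direct;
    return wire;
}
```

With a Latin-1 native string, `Latin1Transmission` takes the direct path and copies the bytes unchanged, whereas `Utf8Transmission` does not recognise the native code set and so goes through UTF-16. Unmarshalling would follow the same pattern in the opposite direction.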

With this explanation, the classes in codeSets.h should be easy to understand. The next place to look is in the various existing code set implementations, which are files of the form cs-*.cc in the src/lib/omniORB/orbcore and src/lib/omniORB/codesets directories. Note how all the 8-bit code sets (the ISO 8859-* family) consist entirely of data and no code, since they are driven by look-up tables.
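To give a feel for what "driven by look-up tables" means, here is a small stand-alone sketch of a table-driven 8-bit code set. It does not use the omniORB classes; the names `Table`, `makeLatin9Table`, `toUtf16` and `fromUtf16` are invented for the example, and ISO 8859-15 is chosen only because it differs from ISO 8859-1 in just eight positions. The reverse conversion uses a simple linear search for brevity.

```cpp
#include <array>
#include <cstddef>
#include <string>

// One Unicode value per byte: the whole code set is described by this data.
using Table = std::array<char16_t, 256>;

// ISO 8859-15 is ISO 8859-1 with eight positions replaced.
Table makeLatin9Table() {
    Table t{};
    for (std::size_t i = 0; i < t.size(); ++i) t[i] = static_cast<char16_t>(i);
    t[0xA4] = 0x20AC;  // euro sign
    t[0xA6] = 0x0160;  // S with caron
    t[0xA8] = 0x0161;  // s with caron
    t[0xB4] = 0x017D;  // Z with caron
    t[0xB8] = 0x017E;  // z with caron
    t[0xBC] = 0x0152;  // OE ligature
    t[0xBD] = 0x0153;  // oe ligature
    t[0xBE] = 0x0178;  // Y with diaeresis
    return t;
}

// Native bytes to UTF-16: a plain table look-up per character.
std::u16string toUtf16(const Table& t, const std::string& s) {
    std::u16string out;
    for (unsigned char c : s) out.push_back(t[c]);
    return out;
}

// UTF-16 to native bytes: returns false if a character has no mapping.
bool fromUtf16(const Table& t, const std::u16string& u, std::string& out) {
    out.clear();
    for (char16_t c : u) {
        std::size_t i = 0;
        while (i < t.size() && t[i] != c) ++i;   // linear reverse search
        if (i == t.size()) return false;         // not representable
        out.push_back(static_cast<char>(i));
    }
    return true;
}
```

The conversion functions are generic, so supporting another simple 8-bit code set only needs a different table. Note that U+00A4 (the currency sign) is not representable in this table, because ISO 8859-15 reuses that byte for the euro sign.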

1: If you have no idea what this means, don't worry; you're better off not knowing unless you really have to.