Chapter 9 Code set conversion
omniORB 4.0 supports full code set negotiation, used to select and
translate between different character code sets, for the transmission
of chars, strings, wchars and wstrings. The support is mostly
transparent to application code, but there are a number of options
that can be selected. This chapter covers the options, and also gives
some pointers about how to implement your own code sets, in case the
ones that come with omniORB are not sufficient.
9.1 Native code sets
For the ORB to know how to handle strings and wstrings given to it by
the application, it must know what code set they are represented
with, so it can properly translate them if need be. The defaults are
ISO 8859-1 (Latin 1) for char and string, and UTF-16 for wchar and
wstring. Different code sets can be chosen at initialisation time with
the nativeCharCodeSet and nativeWCharCodeSet
parameters. The supported code sets are printed out at initialisation
time if the ORB traceLevel is 15 or greater.
For most applications, the defaults are fine. Some applications may
need to set the native char code set to UTF-8, allowing the full
Unicode range to be supported in strings.
Note that the default for wchar is always UTF-16, even on Unix
platforms where wchar is a 32-bit type. Select the UCS-4 code set to
select characters outside the first plane without having to use UTF-16
surrogates1.
9.2 Code set library
To save space in the main ORB core library, most of the code set
implementations are in a separate library named omniCodeSets4. To use
the extra code sets, you must link your application with that
library. On most platforms, if you are using dynamic linking,
specifying the omniCodeSets4 library in the link command is sufficient
to have it initialised, and for the code sets to be available. With
static linking, or platforms with less intelligent dynamic linkers,
you must force the linker to initialise the library. You do that by
including the omniORB4/optionalFeatures.h header. By default,
that header enables several optional features. Look at the file
contents to see how to turn off particular features.
9.3 Implementing new code sets
It is quite easy to implement new code sets, if you need support for
code sets (or marshalling formats) that do not come with the omniORB
distribution. There are extensive comments in the headers and ORB code
that explain how to implement a code set; this section just serves to
point you in the right direction.
The main definitions for the code set support are in
include/omniORB4/codeSets.h. That defines a set of base classes
use to implement code sets, plus some derived classes that use look-up
tables to convert simple 8-bit and 16-bit code sets to Unicode.
When sending or receiving string data, there are a total of four code
sets in action: a native char code set, and transmission char code
set, a native wchar code set, and a transmission wchar code set. The
native code sets are as described above; the transmission code sets
are the ones selected to communicate with a remote machine. They are
responsible for understanding the GIOP marshalling formats, as well as
the code sets themselves. Each of the four code sets has an object
associated with it which contains methods for converting data.
There are two ways in which a string/wstring can be transmitted or
received. If the transmission code set in action knows how to deal
directly with the native code set (the trivial case being that they
are the same code set, but more complex cases are possible too), the
transmission code set object can directly marshal or unmarshal the
data into or out of the application buffer. If the transmission code
set does not know how to handle the native code set, it converts the
string/wstring into UTF-16, and passes that to the native code set
object (or vice-versa). All code set implementations must therefore
know how to convert to and from UTF-16.
With this explanation, the classes in codeSets.h should be easy
to understand. The next place to look is in the various existing code
set implementations, which are files of the form cs-*.cc in the
src/lib/omniORB/orbcore and src/lib/omniORB/codesets.
Note how all the 8-bit code sets (the ISO 8859-* family) consist
entirely of data and no code, since they are driven by look-up tables.
- 1
- If you have no idea what this means, don't
worry---you're better off not knowing unless you really have
to.