String Conversion

On the wire, Ice transmits all strings as Unicode strings in UTF‑8 encoding (see Chapter 34). For languages other than C++, Ice uses strings in their language-native Unicode representation and converts automatically to and from UTF‑8 for transmission, so applications can transparently use characters from non-English alphabets.

However, for C++, how strings are represented inside a process depends on the platform and on which mapping is chosen for a particular string: the default mapping to std::string, or the alternative mapping to std::wstring (see Section 6.6.1).¹ This section explains how strings are encoded by the Ice for C++ run time, and how you can achieve automatic conversion of strings in their native representation to and from UTF‑8.² By default, the Ice run time encodes strings as follows:

• Narrow strings (that is, strings mapped to std::string) are presented to the application in UTF‑8 encoding and, similarly, the application is expected to provide narrow strings in UTF‑8 encoding to the Ice run time for transmission.

With this default behavior, the application code is responsible for converting between the native codeset for 8‑bit characters and UTF‑8. For example, if the native codeset is ISO Latin‑1, the application is responsible for converting between UTF‑8 and narrow (8‑bit) characters in ISO Latin‑1 encoding.

Note that the default behavior does not require the application to do anything if it uses only characters in the ASCII range, because a string containing only 7‑bit ASCII characters is also a valid UTF‑8 string.

• Wide strings (that is, strings mapped to std::wstring) are automatically encoded as Unicode by the Ice run time as appropriate for the platform. For example, for Windows, the Ice run time converts between UTF‑8 and UTF‑16 in little-endian representation whereas, for Linux, the Ice run time converts between UTF‑8 and UTF‑32 in the endian-ness appropriate for the host CPU.

With this default behavior, wide strings are transparently converted between their on-the-wire representation and their native C++ representation as appropriate, so application code need not do anything special. (The exception is if an application uses a non-Unicode encoding, such as Shift‑JIS, as its native wstring codeset.)
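The platform dependence of wide strings can be made concrete with a small sketch. It assumes, as the text above describes for Windows and Linux, that a 2-byte wchar_t holds UTF‑16 and any wider wchar_t holds UTF‑32; the helper names wideEncodingName and wideUnitsFor are hypothetical and are not part of Ice.

```cpp
#include <cstddef>
#include <string>

// Name of the Unicode encoding form a std::wstring most plausibly holds on
// this platform, judged only by the width of wchar_t (an assumption).
inline const char* wideEncodingName()
{
    return sizeof(wchar_t) == 2 ? "UTF-16" : "UTF-32";
}

// Number of wchar_t code units needed to store one Unicode code point.
inline std::size_t wideUnitsFor(char32_t codePoint)
{
    // Only UTF-16 needs a surrogate pair, and only outside the
    // Basic Multilingual Plane (code points above U+FFFF).
    return (sizeof(wchar_t) == 2 && codePoint > 0xFFFF) ? 2 : 1;
}
```

Application code that indexes or measures wide strings therefore cannot assume one wchar_t per character on every platform, even though the conversion to and from UTF‑8 itself is transparent.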

28.23.1 Installing String Converters

The default behavior of the run time can be changed by providing application-specific string converters. If you install such converters, all Slice strings will be passed to the appropriate converter when they are marshaled and unmarshaled. Therefore, the string converters allow you to convert all strings transparently into their native representation without having to insert explicit conversion calls whenever a string crosses a Slice interface boundary.

You can install string converters on a per-communicator basis when you create a communicator by setting the stringConverter and wstringConverter members of the InitializationData structure (see Section 28.3). Any strings that use the default (std::string) mapping are passed through the specified stringConverter, and any strings that use the wide (std::wstring) mapping are passed through the specified wstringConverter. The string converter classes are defined as follows:

namespace Ice {

class UTF8Buffer {
public:
    virtual Byte* getMoreBytes(size_t howMany,
                               Byte* firstUnused) = 0;
    virtual ~UTF8Buffer() {}
};

template<typename charT>
class BasicStringConverter : public IceUtil::Shared {
public:
    virtual Byte*
        toUTF8(const charT* sourceStart, const charT* sourceEnd,
               UTF8Buffer&) const = 0;

    virtual void fromUTF8(const Byte* sourceStart,
                          const Byte* sourceEnd,
                          std::basic_string<charT>& target) const = 0;
};

typedef BasicStringConverter<char> StringConverter;
typedef IceUtil::Handle<StringConverter> StringConverterPtr;

typedef BasicStringConverter<wchar_t> WstringConverter;
typedef IceUtil::Handle<WstringConverter> WstringConverterPtr;

}

As you can see, the narrow and wide string converters are simply instantiations of the same template, with either a narrow or a wide character type (char or wchar_t) as the template parameter.

28.23.2 Converting to UTF‑8

If you have a string converter installed, the Ice run time calls the toUTF8 function whenever it needs to convert a native string into UTF‑8 representation for transmission. The sourceStart and sourceEnd pointers point at the first character and one beyond the last character of the source string, respectively. The implementation of toUTF8 must return a pointer to the first unused byte following the converted string.

Your implementation of toUTF8 must allocate the returned string by calling the getMoreBytes member function of the UTF8Buffer class that is passed as the third argument. (getMoreBytes throws a MemoryLimitException if it cannot allocate enough memory.) The firstUnused parameter must point at the first unused byte of the allocated memory region. You can make several calls to getMoreBytes to incrementally allocate memory for the converted string. If you do, getMoreBytes may relocate the buffer in memory. (If it does, it copies the part of the string that was converted so far into the new memory region.) The function returns a pointer to the first unused byte of the (possibly relocated) memory.

Conversion with toUTF8 can fail because getMoreBytes can cause the message size to exceed Ice.MessageSizeMax. In this case, you should let the MemoryLimitException thrown by getMoreBytes propagate to the caller.

Conversion can also fail because the encoding of the source string is internally incorrect. In that case, you should throw a StringConversionFailed exception from toUTF8.
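The allocation protocol described above can be sketched as follows. This is a minimal illustration, not the Ice implementation: the sketch namespace, the Byte typedef, and the UTF8Buffer class are stand-ins that mirror the declarations in Section 28.23.1 so that the example compiles without Ice, VectorBuffer is a hypothetical buffer used only for demonstration, and passing 0 as firstUnused on the first call is an assumption. The function latin1ToUTF8 has the parameters of toUTF8 and converts a narrow ISO Latin‑1 string.

```cpp
#include <cstddef>
#include <string>
#include <vector>

namespace sketch {

typedef unsigned char Byte; // stand-in for Ice::Byte

// Stand-in for Ice::UTF8Buffer (see the declaration in Section 28.23.1).
class UTF8Buffer {
public:
    virtual Byte* getMoreBytes(std::size_t howMany, Byte* firstUnused) = 0;
    virtual ~UTF8Buffer() {}
};

// Hypothetical vector-backed buffer, for demonstration only.
class VectorBuffer : public UTF8Buffer {
public:
    virtual Byte* getMoreBytes(std::size_t howMany, Byte* firstUnused)
    {
        // Bytes written so far; a null firstUnused means "nothing yet".
        std::size_t used = firstUnused
            ? static_cast<std::size_t>(firstUnused - _bytes.data()) : 0;
        _bytes.resize(used + howMany); // may relocate; keeps converted bytes
        return _bytes.data() + used;   // first unused byte after relocation
    }

    // The converted bytes up to (not including) end.
    std::string str(const Byte* end) const
    {
        return std::string(_bytes.begin(),
                           _bytes.begin() + (end - _bytes.data()));
    }

private:
    std::vector<Byte> _bytes;
};

// Sketch of a toUTF8 body for ISO Latin-1 input.
Byte* latin1ToUTF8(const char* sourceStart, const char* sourceEnd,
                   UTF8Buffer& buffer)
{
    const std::size_t chunk = 16; // small on purpose, to force regrowth
    Byte* out = buffer.getMoreBytes(chunk, 0);
    Byte* limit = out + chunk;
    for (const char* p = sourceStart; p != sourceEnd; ++p) {
        if (limit - out < 2) {
            // Pass the first unused byte and continue from the returned
            // pointer, because the buffer may have moved.
            out = buffer.getMoreBytes(chunk, out);
            limit = out + chunk;
        }
        const Byte c = static_cast<Byte>(*p);
        if (c < 0x80) {
            *out++ = c; // ASCII is unchanged in UTF-8
        } else {
            *out++ = static_cast<Byte>(0xC0 | (c >> 6));   // lead byte
            *out++ = static_cast<Byte>(0x80 | (c & 0x3F)); // continuation
        }
    }
    return out; // first unused byte following the converted string
}

} // namespace sketch
```

Every byte value is a valid ISO Latin‑1 character, so this particular conversion has no invalid-source case. A converter for an encoding that can be malformed would throw at the point where it detects the bad input, and any exception thrown by getMoreBytes is simply allowed to propagate.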

28.23.3 Converting from UTF‑8

During unmarshaling, the Ice run time calls the fromUTF8 member function on the corresponding string converter. The function converts a UTF‑8 string into its native form as a std::basic_string<charT>, that is, a std::string for a narrow converter or a std::wstring for a wide converter. (The string into which the function must place the converted characters is passed to fromUTF8 as the target parameter.)
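As an illustration of the reverse direction, the following sketch converts UTF‑8 into a narrow ISO Latin‑1 string. It is not the Ice implementation: the sketch namespace and Byte typedef are stand-ins so that the example compiles without Ice, the function name utf8ToLatin1 is hypothetical (it has the parameters of fromUTF8), and std::runtime_error stands in for whatever exception a real converter would throw on bad input.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

namespace sketch {

typedef unsigned char Byte; // stand-in for Ice::Byte

// Sketch of a fromUTF8 body whose native encoding is ISO Latin-1.
void utf8ToLatin1(const Byte* sourceStart, const Byte* sourceEnd,
                  std::string& target)
{
    std::string result;
    result.reserve(static_cast<std::size_t>(sourceEnd - sourceStart));
    for (const Byte* p = sourceStart; p != sourceEnd; ++p) {
        if (*p < 0x80) {
            // ASCII maps to itself.
            result += static_cast<char>(*p);
        } else if ((*p == 0xC2 || *p == 0xC3) && p + 1 != sourceEnd &&
                   (p[1] & 0xC0) == 0x80) {
            // Two-byte sequences with lead byte C2 or C3 cover exactly
            // U+0080..U+00FF, the upper half of ISO Latin-1.
            result += static_cast<char>(((*p & 0x03) << 6) | (p[1] & 0x3F));
            ++p; // skip the continuation byte
        } else {
            // Stand-in for the converter's failure exception.
            throw std::runtime_error(
                "malformed UTF-8 or character outside ISO Latin-1");
        }
    }
    target.swap(result); // target is modified only on success
}

} // namespace sketch
```

Building the result in a local string and swapping at the end is a design choice of this sketch: the caller's target is left untouched if the conversion fails partway through.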

28.23.4 Built-In String Converters

Ice provides the following built-in string converters:

• UnicodeWstringConverter is a string converter that converts between Unicode wide strings and UTF‑8 strings. Unless you install a different string converter, this is the default converter that is used for wide strings.

• IconvStringConverter is a string converter that converts strings using the Linux and Unix iconv conversion facility (see Section 28.23.5). It can be used to convert either wide or narrow strings.

28.23.5 The iconv String Converter

For Linux and Unix platforms, Ice provides an IconvStringConverter template class that uses the iconv conversion facility to convert between the native encoding and UTF‑8. The only member function of interest is the constructor:

template<typename charT>
class IconvStringConverter
    : public Ice::BasicStringConverter<charT>
{
public:
    IconvStringConverter(const char* = nl_langinfo(CODESET));

    // ...
};

To use this string converter, you specify whether the conversion you want is for narrow or wide characters via the template argument, and you specify the corresponding native encoding with the constructor argument. For example, to create a converter that converts between ISO Latin‑1 and UTF‑8, you can instantiate the converter as follows:

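InitializationData id;
id.stringConverter = new IconvStringConverter<char>("ISO-8859-1");

Similarly, to convert between the internal wide character encoding and UTF‑8, you can instantiate a wide string converter as follows:

InitializationData id;
id.wstringConverter = new IconvStringConverter<wchar_t>("WCHAR_T");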

Using the IconvStringConverter template makes it easy to install code converters for any available encoding without having to explicitly write (or call) conversion routines whose implementation is typically non-trivial.
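To give a sense of the work the template saves you, the following sketch calls the POSIX iconv functions directly to convert a narrow string to UTF‑8. It is not the implementation of IconvStringConverter. The function name toUtf8WithIconv is hypothetical, the output buffer size is an assumed upper bound, and the sketch ignores stateful encodings, which would need an extra flushing call.

```cpp
#include <iconv.h>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Convert input from the iconv encoding named by from into UTF-8.
std::string toUtf8WithIconv(const std::string& input, const char* from)
{
    iconv_t cd = iconv_open("UTF-8", from); // arguments: to-code, from-code
    if (cd == reinterpret_cast<iconv_t>(-1)) {
        throw std::runtime_error("iconv_open failed");
    }

    // Assumed upper bound: at most four output bytes per input byte.
    std::vector<char> out(input.size() * 4 + 4);

    // glibc declares the input pointer as char**; some other platforms
    // use const char** instead, in which case the cast is unnecessary.
    char* inPtr = const_cast<char*>(input.data());
    std::size_t inLeft = input.size();
    char* outPtr = &out[0];
    std::size_t outLeft = out.size();

    const std::size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("iconv failed: invalid or incomplete input");
    }
    // The bytes actually produced are the buffer size minus what is left.
    return std::string(&out[0], out.size() - outLeft);
}
```

For example, toUtf8WithIconv("caf\xE9", "ISO-8859-1") yields the UTF‑8 bytes for "café" on a system whose iconv knows that encoding name.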

28.23.6 The Ice String Converter Plugin

The Ice run time includes a plugin that supports conversion between UTF-8 and native encodings on Unix and Windows platforms. You can use this plugin to install converters for narrow and wide strings into the communicator of an existing program. This feature is primarily intended for use in scripting language extensions such as Ice for Python; if you need to use string converters in your C++ application, we recommend using the technique described in Section 28.23.1 instead.

Note that an application must be designed to operate correctly in the presence of a string converter. A string converter assumes that it converts strings in the native encoding into the UTF-8 encoding, and vice versa. An application that performs its own conversions on strings that cross a Slice interface boundary can cause encoding errors when those strings are processed by a converter.

Installing the Plugin

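You can install the plugin using a configuration property of the following form, written on a single line:

Ice.Plugin.Converter=Ice:createStringConverter iconv=encoding[,encoding] windows=code-page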

You can use any name you wish for the plugin; in this example, we used Converter. The first component of the property value represents the plugin’s entry point, which includes the abbreviated name of the shared library or DLL (Ice) and the name of a factory function (createStringConverter).

The iconv argument is optional on Unix platforms and ignored on Windows platforms. If specified, it defines the iconv names of the narrow string encoding and the optional wide-string encoding. If this argument is not specified, the plugin installs a narrow string converter that uses the default locale-dependent encoding. The windows argument names the code page to use on Windows platforms.

The plugin’s argument semantics are designed so that the same configuration property can be used on both Windows and Unix platforms, as shown in the following example:

Ice.Plugin.Converter=Ice:createStringConverter iconv=ISO8859-1 windows=1252

If the configuration file containing this property is shared by programs in multiple implementation languages, you can use an alternate syntax that is loaded only by the Ice for C++ run time:

Ice.Plugin.Converter.cpp=Ice:createStringConverter iconv=ISO8859-1 windows=1252

28.23.7 Dynamically Installing Custom String Converters

If the string converter plugin described in Section 28.23.6 does not satisfy your requirements, you can implement your own solution with help from the StringConverterPlugin class:

namespace Ice {
class StringConverterPlugin : public Ice::Plugin {
public:

    StringConverterPlugin(const CommunicatorPtr& communicator,
                          const StringConverterPtr&,
                          const WstringConverterPtr& = 0);

    virtual void initialize();

    virtual void destroy();
};
}

The converters are installed by the StringConverterPlugin constructor (you can supply an argument of 0 for either converter if you do not wish to install it). The initialize and destroy methods are empty, but you can subclass StringConverterPlugin and override these methods if necessary.

To install your own plugin, use a configuration property like the following:

Ice.Plugin.MyConverterPlugin=myconverter:createConverter ...

The first component of the property value represents the plugin’s entry point, which includes the abbreviated name of the shared library or DLL (myconverter) and the name of a factory function (createConverter).