MediaWiki  REL1_22
README
This directory contains some Unicode normalization routines. These routines
are meant to be reusable in other projects, so I'm not tying them to the
MediaWiki utility functions.

The main function to care about is UtfNormal::toNFC(); this will convert
a given UTF-8 string to Normalization Form C if it is not already in that
form. The function assumes that the input string is already valid UTF-8; if
there are corrupt characters this may produce erroneous results.

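As a rough illustration of the call described above, here is a minimal sketch. It assumes UtfNormal.php sits in the same directory as the script (the path is an assumption; adjust it for your layout), and the sample string is made up for the example.

```php
<?php
// Assumption: UtfNormal.php is in the same directory as this script.
require_once __DIR__ . '/UtfNormal.php';

// Valid UTF-8 in decomposed form: "e" followed by
// U+0301 COMBINING ACUTE ACCENT.
$decomposed = "e\xCC\x81";

// toNFC() expects valid UTF-8 and returns the string in
// Normalization Form C.
$composed = UtfNormal::toNFC( $decomposed );

// Show both byte sequences as hex so the difference is visible.
echo bin2hex( $decomposed ), ' -> ', bin2hex( $composed ), "\n";
```

In NFC the pair above composes to the single code point U+00E9, so the second hex dump should be shorter than the first.
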
To also check for illegal characters, use UtfNormal::cleanUp(). This will
strip illegal UTF-8 sequences and characters that are illegal in XML, and
if necessary convert the string to Normalization Form C.

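For untrusted input, a sketch along these lines may help. It again assumes UtfNormal.php is in the script's directory; the input string and the validity check via preg_match() are illustrative choices, not part of these routines.

```php
<?php
// Assumption: UtfNormal.php is in the same directory as this script.
require_once __DIR__ . '/UtfNormal.php';

// Untrusted input: the byte 0xFF can never occur in valid UTF-8.
$untrusted = "caf\xFFe\xCC\x81";

// cleanUp() deals with the illegal sequence and normalizes the rest.
$clean = UtfNormal::cleanUp( $untrusted );

// PCRE's /u modifier refuses invalid UTF-8, so an empty pattern
// serves as a cheap validity check on the result.
$isValid = preg_match( '//u', $clean ) === 1;

echo bin2hex( $untrusted ), ' -> ', bin2hex( $clean ), "\n";
echo $isValid ? "result is valid UTF-8\n" : "result is still invalid\n";
```

Since cleanUp() does more checking than toNFC(), it seems reasonable to reserve it for input whose validity is unknown.
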
Performance is kind of stinky in absolute terms, though it should be speedy
on pure ASCII text. ;) On text that can be determined quickly to already be
in NFC it's not too awful but it can quickly get uncomfortably slow,
particularly for Korean text (the hangul decomposition/composition code is
extra slow).


== Regenerating data tables ==

UtfNormalData.inc and UtfNormalDataK.inc are generated from the Unicode
Character Database by the script UtfNormalGenerate.php. On a *nix system
'make' should fetch the necessary files and regenerate the tables if the
scripts have been changed or you have removed the tables.


== Testing ==

'make test' will run the conformance test (UtfNormalTest.php), fetching the
data from the net if necessary. If it reports failure, something is
going wrong!

You may have to set up PHPUnit first.

$ pear channel-discover pear.phpunit.de
$ pear install phpunit/PHPUnit

== Benchmarks ==

Run 'make bench' to download some sample texts from Wikipedia and run some
cheap benchmarks of some of the functions. Take all numbers with large
grains of salt.


== PHP module extension ==

There's an experimental PHP extension module which wraps the ICU library's
normalization functions. This is *MUCH* faster than doing this work in pure
PHP code. This is at https://git.wikimedia.org/summary/mediawiki%2Fextensions%2Fnormal.git.
It is used by the WMF, which currently runs PHP 5.3.10 on Linux. It hasn't been
thoroughly tested on other configurations, but may work.

If the php_normal.so module is loaded in php.ini, the normalization functions
will automatically use it. If you can't (or don't want to) load it in php.ini,
you may be able to load it using the dl() function before the inclusion of
UtfNormal.php, and it will be picked up.

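The runtime-loading route might look roughly like the sketch below. The helper tryLoadNormalModule() is hypothetical (it is not part of UtfNormal), and the location of UtfNormal.php is assumed to be the script's own directory. Note that dl() is unavailable or disabled on many PHP configurations, so the sketch treats a failed load as a normal outcome.

```php
<?php
// Hypothetical helper, not part of UtfNormal: try to load the module
// at runtime. Returns true only if dl() exists and the load succeeded.
function tryLoadNormalModule( $moduleFile = 'php_normal.so' ) {
	// dl() is not defined on many SAPIs and configurations.
	if ( !function_exists( 'dl' ) ) {
		return false;
	}
	// Suppress the warning dl() emits when the module cannot be loaded.
	return (bool)@dl( $moduleFile );
}

// Attempt the load first...
$loaded = tryLoadNormalModule();

// ...and include UtfNormal.php only afterwards, as described above.
// Assumption: UtfNormal.php is in the same directory as this script.
require_once __DIR__ . '/UtfNormal.php';

echo $loaded ? "php_normal.so loaded\n" : "php_normal.so not loaded\n";
```

The order matters here: the load attempt has to happen before UtfNormal.php is included for the module to be picked up.
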