PHP Cross Reference of MediaWiki-1.24.0

/includes/normal/ -> README (source)

This directory contains some Unicode normalization routines. These routines
are meant to be reusable in other projects, so I'm not tying them to the
MediaWiki utility functions.

The main function to care about is UtfNormal::toNFC(); this will convert
a given UTF-8 string to Normalization Form C if it isn't already in that
form. The function assumes that the input string is already valid UTF-8; if
it contains corrupt sequences, the results may be wrong.
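
As a rough sketch of calling it, assuming UtfNormal.php and its companion
files have been copied next to your script (the path and variable names are
mine, adjust to taste):

```php
<?php
// Assumption: UtfNormal.php and its companion files sit next to this script.
require_once __DIR__ . '/UtfNormal.php';

// "e" followed by U+0301 COMBINING ACUTE ACCENT: valid UTF-8, but not NFC.
$decomposed = "e\xCC\x81";

// NFC composes the pair into the single code point U+00E9.
$composed = UtfNormal::toNFC( $decomposed );
var_dump( $composed === "\xC3\xA9" );

// Text that is already in NFC (plain ASCII included) comes back unchanged.
var_dump( UtfNormal::toNFC( 'plain ASCII' ) === 'plain ASCII' );
```

Both comparisons should come out true. Remember that nothing here validates
the input; that is the caller's job when using toNFC().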

To also check for illegal characters, use UtfNormal::cleanUp(). This will
strip illegal UTF-8 sequences and characters that are illegal in XML, and,
if necessary, convert the result to Normalization Form C.
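
For untrusted input, a call might look something like the sketch below. The
include path and the sample string are assumptions for illustration only.

```php
<?php
// Assumption: UtfNormal.php and its companion files sit next to this script.
require_once __DIR__ . '/UtfNormal.php';

// "cafe" with an accented e, plus a stray 0xFF byte. 0xFF can never appear
// in well-formed UTF-8, so this string is corrupt.
$untrusted = "caf\xC3\xA9\xFF";

$clean = UtfNormal::cleanUp( $untrusted );

// preg_match() with the /u modifier returns false on malformed UTF-8, which
// makes it a cheap well-formedness check on the cleaned result.
var_dump( preg_match( '//u', $clean ) === 1 );

// The corrupt input should not survive unchanged.
var_dump( $clean !== $untrusted );
```

Exactly what ends up in place of the bad byte is up to cleanUp(), so it is
safer not to depend on that detail in your own code.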

Performance is kind of stinky in absolute terms, though it should be speedy
on pure ASCII text. ;) On text that can quickly be determined to already be
in NFC it's not too awful, but on anything else it can get uncomfortably
slow, particularly for Korean text (the hangul decomposition/composition
code is extra slow).
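
The ASCII case can be cheap because pure ASCII is always already in NFC, so
one byte-range scan is enough to skip the real work. A rough sketch of that
kind of fast path follows; isPureAscii() and normalizeWithAsciiFastPath() are
names made up for this example, not functions in UtfNormal.php, and this is
not meant as a description of the library's actual internals.

```php
<?php
/**
 * Hypothetical helper: true if $text contains no byte >= 0x80.
 */
function isPureAscii( $text ) {
	// No /u modifier, so the pattern is matched byte by byte.
	return !preg_match( '/[\x80-\xff]/', $text );
}

/**
 * Hypothetical wrapper: skip $normalizer entirely for pure ASCII input.
 * $normalizer is any PHP callback, e.g. array( 'UtfNormal', 'toNFC' ).
 */
function normalizeWithAsciiFastPath( $text, $normalizer ) {
	if ( isPureAscii( $text ) ) {
		return $text;
	}
	return call_user_func( $normalizer, $text );
}

var_dump( isPureAscii( 'Hello, world' ) );
var_dump( isPureAscii( "caf\xC3\xA9" ) );
```

The first check is true and the second is false, since the encoded accented
character uses bytes above 0x7F.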


== Regenerating data tables ==

UtfNormalData.inc and UtfNormalDataK.inc are generated from the Unicode
Character Database by the script UtfNormalGenerate.php. On a *nix system
'make' should fetch the necessary files and regenerate the data tables if
the scripts have been changed or you have removed the tables.


== Testing ==

'make test' will run the conformance test (UtfNormalTest.php), fetching the
data from the net if necessary. If it reports failure, something is
going wrong!

You may have to set up PHPUnit first.

$ pear channel-discover pear.phpunit.de
$ pear install phpunit/PHPUnit

== Benchmarks ==

Run 'make bench' to download some sample texts from Wikipedia and run some
cheap benchmarks of some of the functions. Take all numbers with large
grains of salt.


== PHP module extension ==

There's an experimental PHP extension module which wraps the ICU library's
normalization functions. This is *MUCH* faster than doing this work in pure
PHP code. The extension lives at
https://git.wikimedia.org/summary/mediawiki%2Fextensions%2Fnormal.git.
It is used by the WMF, which currently runs PHP 5.3.10 on Linux. It hasn't
been thoroughly tested on other configurations, but may work.

If the php_normal.so module is loaded in php.ini, the normalization functions
will automatically use it. If you can't (or don't want to) load it in php.ini,
you may be able to load it using the dl() function before the inclusion of
UtfNormal.php, and it will be picked up.
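
A minimal sketch of the dl() route, under a couple of assumptions: that the
extension provides a function called utf8_normalize() (which is what I
believe UtfNormal.php looks for; check your copy), and that UtfNormal.php
sits next to the script.

```php
<?php
// Assumption: the extension provides utf8_normalize(). Only try dl() if that
// function is missing and dl() itself exists in this SAPI.
if ( !function_exists( 'utf8_normalize' ) && function_exists( 'dl' ) ) {
	// dl() is disabled in many SAPIs, and the module file must live in the
	// configured extension_dir. The @ hides the warning raised on failure.
	@dl( 'php_normal.so' );
}

// Include the library only after the dl() attempt, so it can detect the
// extension when it decides which code path to use.
require_once __DIR__ . '/UtfNormal.php';

echo function_exists( 'utf8_normalize' )
	? "php_normal extension is loaded\n"
	: "php_normal extension is not loaded\n";
```

If dl() is unavailable or fails, the script still runs; it just won't get
the speedup from the extension.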


Generated: Fri Nov 28 14:03:12 2014 Cross-referenced by PHPXref 0.7.1