MediaWiki  REL1_24
UtfNormal Class Reference

Unicode normalization routines for working with UTF-8 strings.

Static Public Member Functions

static cleanUp ($string)
 The ultimate convenience function! Clean up invalid UTF-8 sequences, and convert to normal form C, canonical composition.
static loadData ()
 Load the basic composition data if necessary.
static placebo ($string)
 This is just used for the benchmark, comparing how long it takes to iterate through a string without really doing anything of substance.
static quickIsNFC ($string)
 Returns true if the string is _definitely_ in NFC.
static quickIsNFCVerify (&$string)
 Returns true if the string is _definitely_ in NFC.
static toNFC ($string)
 Convert a UTF-8 string to normal form C, canonical composition.
static toNFD ($string)
 Convert a UTF-8 string to normal form D, canonical decomposition.
static toNFKC ($string)
 Convert a UTF-8 string to normal form KC, compatibility composition.
static toNFKD ($string)
 Convert a UTF-8 string to normal form KD, compatibility decomposition.

Public Attributes

const UNORM_DEFAULT = self::UNORM_NFC
const UNORM_FCD = 6
const UNORM_NFC = 4
const UNORM_NFD = 2
const UNORM_NFKC = 5
const UNORM_NFKD = 3
const UNORM_NONE = 1
 For using the ICU wrapper.

Static Public Attributes

static $utfCanonicalComp = null
static $utfCanonicalDecomp = null
static $utfCheckNFC
static $utfCombiningClass = null
static $utfCompatibilityDecomp = null

Static Private Member Functions

static fastCombiningSort ($string)
 Sorts combining characters into canonical order.
static fastCompose ($string)
 Produces canonically composed sequences, i.e. normal form C or KC.
static fastDecompose ($string, $map)
 Perform decomposition of a UTF-8 string into either D or KD form (depending on which decomposition map is passed to us).
static NFC ($string)
static NFD ($string)
static NFKC ($string)
static NFKD ($string)
static replaceForNativeNormalize ($string)
 Function to replace some characters that we don't want but most of the native normalize functions keep.

Detailed Description

Unicode normalization routines for working with UTF-8 strings.

Currently assumes that input strings are valid UTF-8!

Not as fast as I'd like, but should be usable for most purposes. UtfNormal::toNFC() will bail early if given ASCII text or text it can quickly determine is already normalized.

All functions can be called statically.

See description of forms at http://www.unicode.org/reports/tr15/
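
As a small illustration of the forms described in TR15, the sketch below shows that the precomposed and decomposed spellings of "é" are different byte strings even though they are canonically equivalent. It does not call UtfNormal. The variable names are made up for this example, and the second half assumes that PHP's optional intl extension (the Normalizer class) is loaded.

```php
<?php
// Two canonically equivalent spellings of "é" (PHP 7.0+ "\u{...}" escapes).
$composed   = "\u{00E9}";   // U+00E9, precomposed: 2 bytes in UTF-8
$decomposed = "e\u{0301}";  // U+0065 + U+0301 combining acute: 3 bytes

// A byte-wise comparison treats them as different strings.
$sameBytes = ($composed === $decomposed);   // false

// Assumption: the intl extension is available. If it is, NFC maps the
// decomposed spelling onto the precomposed one, and NFD does the reverse.
if (class_exists('Normalizer')) {
    $nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);
    $nfd = Normalizer::normalize($composed, Normalizer::FORM_D);
    echo bin2hex($nfc), "\n";   // c3a9
    echo bin2hex($nfd), "\n";   // 65cc81
}
```

Comparing or indexing text that has not been brought to a single normal form can therefore treat visually identical strings as distinct.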

Definition at line 48 of file UtfNormal.php.


Member Function Documentation

static UtfNormal::cleanUp ( string) [static]

The ultimate convenience function! Clean up invalid UTF-8 sequences, and convert to normal form C, canonical composition.

Fast return for pure ASCII strings; some lesser optimizations for strings containing only known-good characters. Not as fast as toNFC().

Parameters:
string $string a UTF-8 string
Returns:
string a clean, shiny, normalized UTF-8 string

Definition at line 78 of file UtfNormal.php.

References NFC(), quickIsNFCVerify(), and replaceForNativeNormalize().

Referenced by MWDebug\debugMsg(), FeedUtils\formatDiffRow(), MediaWikiSite\normalizePageName(), DjVuImage\pageTextCallback(), CleanUpTest\testAscii(), CleanUpTest\testBomRegression(), CleanUpTest\testBytes(), CleanUpTest\testChunkRegression(), CleanUpTest\testDoubleBytes(), CleanUpTest\testForbiddenRegression(), CleanUpTest\testHangulRegression(), CleanUpTest\testInterposeRegression(), CleanUpTest\testLatin(), CleanUpTest\testLatinNormal(), CleanUpTest\testNull(), CleanUpTest\testOverlongRegression(), CleanUpTest\testSurrogateRegression(), CleanUpTest\testTripleBytes(), and CleanUpTest\XtestAllChars().
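
Since cleanUp() accepts input that may contain invalid sequences while the toNF*() functions do not validate their input, a caller may want to know which case applies. The following is a minimal sketch of a well-formedness test. isValidUtf8() is a hypothetical helper, not part of UtfNormal, and it relies only on PCRE's built-in UTF-8 check.

```php
<?php
// Hypothetical helper (not part of UtfNormal): true if $s is well-formed UTF-8.
// With the /u modifier, preg_match() returns false when the subject is not
// valid UTF-8, and 1 when the empty pattern matches a valid subject.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$good = "caf\u{00E9}";   // "café" encoded as UTF-8
$bad  = "caf\xE9";       // Latin-1 byte 0xE9, not a valid UTF-8 sequence

var_dump(isValidUtf8($good));   // bool(true)
var_dump(isValidUtf8($bad));    // bool(false)
```

Input that fails such a test is the kind of string that should go through cleanUp() rather than toNFC().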

static UtfNormal::fastCombiningSort ( string) [static, private]

Sorts combining characters into canonical order.

This is the final step in creating decomposed normal forms D and KD.

Parameters:
string $string a valid, decomposed UTF-8 string. Input is not validated.
Returns:
string a UTF-8 string with combining characters sorted in canonical order

Definition at line 573 of file UtfNormal.php.

References loadData().

Referenced by NFD().
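
Canonical ordering can be sketched as a stable sort of each run of combining marks by combining class. The code below is an illustration only: it works on arrays of code points rather than UTF-8 bytes, canonicalOrder() is a hypothetical name, and the four-entry class table is a hand-picked excerpt, not the data that loadData() provides.

```php
<?php
// Hypothetical excerpt of canonical combining classes (code point => class).
// Code points missing from the table are treated as class 0 (starters).
$ccc = [
    0x0300 => 230,  // COMBINING GRAVE ACCENT
    0x0301 => 230,  // COMBINING ACUTE ACCENT
    0x0323 => 220,  // COMBINING DOT BELOW
    0x0327 => 202,  // COMBINING CEDILLA
];

// Stable insertion sort: a mark moves left only past marks with a strictly
// higher, non-zero class. Starters (class 0) never move and are never crossed.
function canonicalOrder(array $codepoints, array $ccc): array
{
    $n = count($codepoints);
    for ($i = 1; $i < $n; $i++) {
        for ($j = $i; $j > 0; $j--) {
            $cur  = $ccc[$codepoints[$j]] ?? 0;
            $prev = $ccc[$codepoints[$j - 1]] ?? 0;
            if ($cur === 0 || $prev <= $cur) {
                break;
            }
            // Swap the adjacent pair that is out of canonical order.
            $tmp = $codepoints[$j - 1];
            $codepoints[$j - 1] = $codepoints[$j];
            $codepoints[$j] = $tmp;
        }
    }
    return $codepoints;
}

// "a" + acute (230) + dot below (220): the dot below sorts first.
$sorted = canonicalOrder([0x0061, 0x0301, 0x0323], $ccc);
```

Marks of equal class keep their relative order, which matters because their order is significant for rendering.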

static UtfNormal::fastCompose ( string) [static, private]

Produces canonically composed sequences, i.e. normal form C or KC.

Parameters:
string $string a valid UTF-8 string in sorted normal form D or KD. Input is not validated.
Returns:
string a UTF-8 string with canonical precomposed characters used where possible.

Definition at line 628 of file UtfNormal.php.

References loadData().

Referenced by NFC(), and NFKC().
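
The idea of composition can be sketched with a pair table that maps a starter followed by a combining mark to a precomposed character. This is a deliberately simplified illustration: composePairs() and its two-entry table are hypothetical, it works on code point arrays, it only combines a mark with the character immediately before it, and it ignores the blocking rules and algorithmic Hangul composition that a complete implementation needs.

```php
<?php
// Hypothetical, tiny composition table: starter => [mark => composite].
// A real table would be derived from the Unicode Character Database.
$pairs = [
    0x0065 => [0x0301 => 0x00E9],  // e + combining acute     -> é
    0x0061 => [0x0308 => 0x00E4],  // a + combining diaeresis -> ä
];

// Walk the input once; whenever the last emitted code point and the current
// one form a known pair, replace the last one with the composite.
function composePairs(array $codepoints, array $pairs): array
{
    $out = [];
    foreach ($codepoints as $cp) {
        $last = count($out) - 1;
        if ($last >= 0 && isset($pairs[$out[$last]][$cp])) {
            $out[$last] = $pairs[$out[$last]][$cp];
        } else {
            $out[] = $cp;
        }
    }
    return $out;
}

$composed = composePairs([0x0061, 0x0308, 0x0065, 0x0301], $pairs);
// $composed is [0x00E4, 0x00E9]
```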

static UtfNormal::fastDecompose ( string, map ) [static, private]

Perform decomposition of a UTF-8 string into either D or KD form (depending on which decomposition map is passed to us).

Input is assumed to be *valid* UTF-8. Invalid code will break.

Parameters:
string $string valid UTF-8 string
array $map hash of expanded decomposition map
Returns:
string a UTF-8 string decomposed, not yet normalized (needs sorting)

Definition at line 512 of file UtfNormal.php.

References loadData().

Referenced by NFD().
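
Because the map is described as expanded, each entry can already hold the full decomposition, so one substitution pass is enough. The sketch below illustrates that idea with strtr() and a hypothetical two-entry map. It is not how fastDecompose() is implemented, and decomposeWithMap() is a made-up name. As in the documented function, the output still needs canonical sorting.

```php
<?php
// Hypothetical excerpt of an expanded canonical decomposition map, keyed by
// the UTF-8 encoding of each character. Each value is the full decomposition.
$map = [
    "\u{00E9}" => "e\u{0301}",          // é -> e + combining acute
    "\u{1E69}" => "s\u{0323}\u{0307}",  // ṩ -> s + dot below + dot above
];

// strtr() with an array replaces every key it finds and never rescans its own
// output. On valid UTF-8 a key can only match at a character boundary.
function decomposeWithMap(string $string, array $map): string
{
    return strtr($string, $map);
}

echo bin2hex(decomposeWithMap("caf\u{00E9}", $map)), "\n";   // 63616665cc81
```

Passing a compatibility map instead of a canonical one would give the KD variant of the same step.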

static UtfNormal::loadData ( ) [static]

Load the basic composition data if necessary.

Access:
private

Definition at line 190 of file UtfNormal.php.

Referenced by fastCombiningSort(), fastCompose(), fastDecompose(), NFD(), quickIsNFC(), and quickIsNFCVerify().

static UtfNormal::NFC ( string) [static, private]
Parameters:
string $string
Returns:
string

Definition at line 464 of file UtfNormal.php.

References fastCompose(), and NFD().

Referenced by cleanUp(), CleanUpTest\testDoubleBytes(), CleanUpTest\testTripleBytes(), toNFC(), and CleanUpTest\XtestAllChars().

static UtfNormal::NFD ( string) [static, private]
Parameters:
string $string
Returns:
string

Definition at line 473 of file UtfNormal.php.

References fastCombiningSort(), fastDecompose(), and loadData().

Referenced by NFC(), and toNFD().

static UtfNormal::NFKC ( string) [static, private]
Parameters:
string $string
Returns:
string

Definition at line 485 of file UtfNormal.php.

References fastCompose(), and NFKD().

Referenced by toNFKC().

static UtfNormal::NFKD ( string) [static, private]
Parameters:
string $string
Returns:
string

Definition at line 494 of file UtfNormal.php.

Referenced by NFKC(), and toNFKD().

static UtfNormal::placebo ( string) [static]

This is just used for the benchmark, comparing how long it takes to iterate through a string without really doing anything of substance.

Parameters:
string $string
Returns:
string

Definition at line 763 of file UtfNormal.php.

static UtfNormal::quickIsNFC ( string) [static]

Returns true if the string is _definitely_ in NFC.

Returns false if not or uncertain.

Parameters:
string $string a valid UTF-8 string. Input is not validated.
Returns:
bool

Definition at line 202 of file UtfNormal.php.

References loadData().

Referenced by toNFC().
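
The "definitely, or else uncertain" contract means a quick check may answer false for text that is in fact already NFC. The sketch below shows that contract with the cheapest possible test, an ASCII-only check. Both function names are hypothetical, and the real quickIsNFC() consults loaded data to recognize more cases than this.

```php
<?php
// Hypothetical conservative check: true means "definitely NFC", false means
// "not NFC, or not sure". Pure ASCII is unchanged by every normal form, so a
// string without any byte >= 0x80 is certainly in NFC.
function definitelyNfc(string $s): bool
{
    return preg_match('/[\x80-\xff]/', $s) === 0;
}

// Callers skip the expensive path only when the answer is a definite yes.
// $slowPath stands in for a full normalization routine.
function normalizeIfNeeded(string $s, callable $slowPath): string
{
    return definitelyNfc($s) ? $s : $slowPath($s);
}

var_dump(definitelyNfc("hello"));        // bool(true)
var_dump(definitelyNfc("caf\u{00E9}"));  // bool(false): it is NFC, but this check cannot tell
```

A false negative only costs time, because the slow path returns already-normalized text unchanged.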

static UtfNormal::quickIsNFCVerify ( &$string) [static]

Returns true if the string is _definitely_ in NFC.

Returns false if not or uncertain.

Parameters:
string $string a UTF-8 string, altered on output to be valid UTF-8 safe for XML.
Returns:
bool

Definition at line 243 of file UtfNormal.php.

References loadData().

Referenced by Exif\charCodeString(), cleanUp(), IPTC\convIPTCHelper(), GIFMetadataExtractor\getMetadata(), and JpegMetadataExtractor\segmentSplitter().

static UtfNormal::replaceForNativeNormalize ( string) [static, private]

Function to replace some characters that we don't want but most of the native normalize functions keep.

Parameters:
string $string The string
Returns:
string String with the character codes replaced.

Definition at line 780 of file UtfNormal.php.

Referenced by cleanUp().

static UtfNormal::toNFC ( string) [static]

Convert a UTF-8 string to normal form C, canonical composition.

Fast return for pure ASCII strings; some lesser optimizations for strings containing only known-good characters.

Parameters:
string $string a valid UTF-8 string. Input is not validated.
Returns:
string a UTF-8 string in normal form C

Definition at line 119 of file UtfNormal.php.

References NFC(), and quickIsNFC().

Referenced by normalize_form_c(), and normalize_form_c_php().

static UtfNormal::toNFD ( string) [static]

Convert a UTF-8 string to normal form D, canonical decomposition.

Fast return for pure ASCII strings.

Parameters:
string $string a valid UTF-8 string. Input is not validated.
Returns:
string a UTF-8 string in normal form D

Definition at line 137 of file UtfNormal.php.

References NFD().

Referenced by normalize_form_d(), and normalize_form_d_php().

static UtfNormal::toNFKC ( string) [static]

Convert a UTF-8 string to normal form KC, compatibility composition.

This may cause irreversible information loss; use judiciously. Fast return for pure ASCII strings.

Parameters:
string $string a valid UTF-8 string. Input is not validated.
Returns:
string a UTF-8 string in normal form KC

Definition at line 156 of file UtfNormal.php.

References NFKC().

Referenced by normalize_form_kc(), and normalize_form_kc_php().
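
The information loss mentioned above comes from compatibility mappings folding distinct characters onto the same plain ones. The sketch below illustrates this with a hypothetical two-entry excerpt of such a mapping applied through strtr(). It does not call toNFKC(), and the variable names are made up.

```php
<?php
// Hypothetical excerpt of a compatibility mapping (UTF-8 char => replacement).
$compat = [
    "\u{00B2}" => "2",   // SUPERSCRIPT TWO         -> DIGIT TWO
    "\u{FB01}" => "fi",  // LATIN SMALL LIGATURE FI -> f + i
];

$squared = strtr("x\u{00B2}", $compat);  // "x2"
$plain   = strtr("x2", $compat);         // "x2"

// Two different inputs now compare equal, so the superscript cannot be
// recovered from the result.
var_dump($squared === $plain);           // bool(true)
```

This is why the compatibility forms suit searching and matching better than storing text that must round-trip.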

static UtfNormal::toNFKD ( string) [static]

Convert a UTF-8 string to normal form KD, compatibility decomposition.

This may cause irreversible information loss; use judiciously. Fast return for pure ASCII strings.

Parameters:
string $string a valid UTF-8 string. Input is not validated.
Returns:
string a UTF-8 string in normal form KD

Definition at line 175 of file UtfNormal.php.

References NFKD().

Referenced by normalize_form_kd(), and normalize_form_kd_php().


Member Data Documentation

UtfNormal::$utfCanonicalComp = null [static]

Definition at line 61 of file UtfNormal.php.

Referenced by CleanUpTest\XtestAllChars().

UtfNormal::$utfCanonicalDecomp = null [static]

Definition at line 62 of file UtfNormal.php.

Referenced by benchmarkForm(), and CleanUpTest\XtestAllChars().

UtfNormal::$utfCheckNFC [static]

Definition at line 66 of file UtfNormal.php.

UtfNormal::$utfCombiningClass = null [static]

Definition at line 60 of file UtfNormal.php.

UtfNormal::$utfCompatibilityDecomp = null [static]

Definition at line 65 of file UtfNormal.php.

const UtfNormal::UNORM_DEFAULT = self::UNORM_NFC

Definition at line 58 of file UtfNormal.php.

const UtfNormal::UNORM_FCD = 6

Definition at line 57 of file UtfNormal.php.

const UtfNormal::UNORM_NFC = 4

Definition at line 55 of file UtfNormal.php.

Referenced by donorm().

const UtfNormal::UNORM_NFD = 2

Definition at line 53 of file UtfNormal.php.

const UtfNormal::UNORM_NFKC = 5

Definition at line 56 of file UtfNormal.php.

const UtfNormal::UNORM_NFKD = 3

Definition at line 54 of file UtfNormal.php.

const UtfNormal::UNORM_NONE = 1

For using the ICU wrapper.

Definition at line 52 of file UtfNormal.php.


The documentation for this class was generated from the following files: