MediaWiki
REL1_20
XHTML sanitizer for MediaWiki.
Static Public Member Functions | |
static | attributeWhitelist ($element) |
Fetch the whitelist of acceptable attributes for a given element name. | |
static | checkCss ($value) |
Pick apart some CSS and check it for forbidden or unsafe structures. | |
static | cleanUrl ($url) |
static | cleanUrlCallback ($matches) |
static | cssDecodeCallback ($matches) |
static | decCharReference ($codepoint) |
static | decodeChar ($codepoint) |
Return UTF-8 string for a codepoint if that is a valid character reference, otherwise U+FFFD REPLACEMENT CHARACTER. | |
static | decodeCharReferences ($text) |
Decode any character references, numeric or named entities, in the text and return a UTF-8 string. | |
static | decodeCharReferencesAndNormalize ($text) |
Decode any character references, numeric or named entities, in the text and normalize the resulting string. | |
static | decodeCharReferencesCallback ($matches) |
static | decodeEntity ($name) |
If the named entity is defined in the HTML 4.0/XHTML 1.0 DTD, return the UTF-8 encoding of that character. | |
static | decodeTagAttributes ($text) |
Return an associative array of attribute names and values from a partial tag string. | |
static | encodeAttribute ($text) |
Encode an attribute value for HTML output. | |
static | escapeClass ($class) |
Given a value, escape it so that it can be used as a CSS class and return it. | |
static | escapeHtmlAllowEntities ($html) |
Given HTML input, escape with htmlspecialchars but un-escape entities. | |
static | escapeId ($id, $options=array()) |
Given a value, escape it so that it can be used in an id attribute and return it. | |
static | fixTagAttributes ($text, $element) |
Take a tag soup fragment listing an HTML element's attributes and normalize it to well-formed XML, discarding unwanted attributes. | |
static | getAttribsRegex () |
Regular expression to match HTML/XML attribute pairs within a tag. | |
static | hackDocType () |
Hack up a private DOCTYPE with HTML's standard entity declarations. | |
static | hexCharReference ($codepoint) |
static | mergeAttributes ($a, $b) |
Merge two sets of HTML attributes. | |
static | normalizeCharReferences ($text) |
Ensure that any entities and character references are legal for XML and XHTML specifically. | |
static | normalizeCharReferencesCallback ($matches) |
static | normalizeEntity ($name) |
If the named entity is defined in the HTML 4.0/XHTML 1.0 DTD, return the equivalent numeric entity reference (except for the core &lt; &gt; &amp; &quot;). | |
static | normalizeSectionNameWhitespace ($section) |
Normalizes whitespace in a section name, such as might be returned by Parser::stripSectionName(), for use in the ids that are used for section links. | |
static | removeHTMLcomments ($text) |
Remove '<!--', '-->', and everything between. | |
static | removeHTMLtags ($text, $processCallback=null, $args=array(), $extratags=array(), $removetags=array()) |
Cleans up HTML, removes dangerous tags and attributes, and removes HTML comments. | |
static | safeEncodeAttribute ($text) |
Encode an attribute value for HTML tags, with extra armoring against further wiki processing. | |
static | setupAttributeWhitelist () |
For each array key (an allowed HTML element), return an array of allowed attributes. | |
static | stripAllTags ($text) |
Take a fragment of (potentially invalid) HTML and return a version with any tags removed, encoded as plain text. | |
static | validateAttributes ($attribs, $whitelist) |
Take an array of attribute names and values and normalize or discard illegal values for the given whitelist. | |
static | validateEmail ($addr) |
Does a string look like an e-mail address? | |
static | validateTagAttributes ($attribs, $element) |
Take an array of attribute names and values and normalize or discard illegal values for the given element type. | |
Public Attributes | |
const | CHAR_REFS_REGEX |
Regular expression to match various types of character references in Sanitizer::normalizeCharReferences and Sanitizer::decodeCharReferences. | |
const | EVIL_URI_PATTERN = '!(^|\s|\*/\s*)(javascript|vbscript)([^\w]|$)!i' |
Blacklist for evil URIs like javascript:. WARNING: DO NOT use this in any place that actually requires blacklisting for security reasons. | |
const | XMLNS_ATTRIBUTE_PATTERN = "/^xmlns:[:A-Z_a-z-.0-9]+$/" |
Static Public Attributes | |
static | $attribsRegex |
Lazy-initialised attributes regex, see getAttribsRegex() | |
static | $htmlEntities |
List of all named character entities defined in HTML 4.01 (http://www.w3.org/TR/html4/sgml/entities.html), as well as &apos;, which is only defined starting in XHTML1. | |
static | $htmlEntityAliases |
Character entity aliases accepted by MediaWiki. | |
Static Private Member Functions | |
static | armorLinksCallback ($matches) |
Regex replace callback for armoring links against further processing. | |
static | getTagAttributeCallback ($set) |
Pick the appropriate attribute value from a match set from the attribs regex matches. | |
static | normalizeAttributeValue ($text) |
Normalize whitespace and character references in an XML source- encoded text for an attribute value. | |
static | normalizeWhitespace ($text) |
static | validateCodepoint ($codepoint) |
Returns true if a given Unicode codepoint is a valid character in XML. |
XHTML sanitizer for MediaWiki.
Definition at line 31 of file Sanitizer.php.
static Sanitizer::armorLinksCallback | ( | $ | matches | ) | [static, private] |
Regex replace callback for armoring links against further processing.
$matches | Array |
Definition at line 1068 of file Sanitizer.php.
References $matches.
static Sanitizer::attributeWhitelist | ( | $ | element | ) | [static] |
Fetch the whitelist of acceptable attributes for a given element name.
$element | String |
Definition at line 1376 of file Sanitizer.php.
References setupAttributeWhitelist().
Referenced by validateTagAttributes().
static Sanitizer::checkCss | ( | $ | value | ) | [static] |
Pick apart some CSS and check it for forbidden or unsafe structures.
Returns a sanitized string. This sanitized string will have character references and escape sequences decoded, and comments stripped. If the input is just too evil, only a comment complaining about evilness will be returned.
Currently, URL references, 'expression', and 'tps' are forbidden.
NOTE: Despite the fact that character references are decoded, the returned string may contain character references given certain clever input strings. These character references must be escaped before the return value is embedded in HTML.
$value | String |
Definition at line 760 of file Sanitizer.php.
References $matches, $value, decodeCharReferences(), StringUtils\delimiterReplace(), and utf8ToCodepoint().
Referenced by SanitizerTest\testCssCommentsChecking(), and validateAttributes().
static Sanitizer::cleanUrl | ( | $ | url | ) | [static] |
$url | string |
Definition at line 1605 of file Sanitizer.php.
Referenced by Parser\makeFreeExternalLink().
static Sanitizer::cleanUrlCallback | ( | $ | matches | ) | [static] |
static Sanitizer::cssDecodeCallback | ( | $ | matches | ) | [static] |
$matches | array |
Definition at line 854 of file Sanitizer.php.
References $matches, and codepointToUtf8().
static Sanitizer::decCharReference | ( | $ | codepoint | ) | [static] |
$codepoint |
Definition at line 1246 of file Sanitizer.php.
References validateCodepoint().
Referenced by normalizeCharReferencesCallback().
static Sanitizer::decodeChar | ( | $ | codepoint | ) | [static] |
Return UTF-8 string for a codepoint if that is a valid character reference, otherwise U+FFFD REPLACEMENT CHARACTER.
$codepoint | Integer |
Definition at line 1343 of file Sanitizer.php.
References codepointToUtf8(), and validateCodepoint().
Referenced by decodeCharReferencesCallback().
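The fallback behaviour described for decodeChar() can be sketched as a standalone script. This is a minimal illustration, not the code in Sanitizer.php: the function names isValidXmlCodepoint(), codepointToUtf8Sketch() and decodeCharSketch() are hypothetical, and the validity test assumes the Char production of XML 1.0.

```php
<?php
// Hypothetical helper: true if $cp is allowed by the XML 1.0 "Char" production.
function isValidXmlCodepoint(int $cp): bool {
    return $cp === 0x09 || $cp === 0x0A || $cp === 0x0D
        || ($cp >= 0x20 && $cp <= 0xD7FF)
        || ($cp >= 0xE000 && $cp <= 0xFFFD)
        || ($cp >= 0x10000 && $cp <= 0x10FFFF);
}

// Hypothetical helper: encode a codepoint as a UTF-8 byte string.
function codepointToUtf8Sketch(int $cp): string {
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Valid codepoints are returned as UTF-8; anything else becomes U+FFFD.
function decodeCharSketch(int $cp): string {
    return isValidXmlCodepoint($cp) ? codepointToUtf8Sketch($cp) : "\u{FFFD}";
}

echo decodeCharSketch(0x41), "\n";          // a plain ASCII letter
echo bin2hex(decodeCharSketch(0)), "\n";    // NUL is not a valid XML character
```

Surrogates such as 0xD800 and the NUL codepoint both fall outside the allowed ranges, so they are replaced rather than emitted.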
static Sanitizer::decodeCharReferences | ( | $ | text | ) | [static] |
Decode any character references, numeric or named entities, in the text and return a UTF-8 string.
$text | String |
Definition at line 1289 of file Sanitizer.php.
Referenced by CoreLinkFunctions\categoryLinkHook(), checkCss(), decodeTagAttributes(), UploadBase\detectScript(), Skin\doEditSectionLink(), escapeHtmlAllowEntities(), escapeId(), ImageCleanup\processRow(), SanitizerTest\testDecodeMixedComplexEntities(), SanitizerTest\testDecodeMixedEntities(), SanitizerTest\testDecodeNamedEntities(), SanitizerTest\testDecodeNumericEntities(), SanitizerTest\testInvalidAmpersand(), SanitizerTest\testInvalidEntities(), and SanitizerTest\testInvalidNumberedEntities().
static Sanitizer::decodeCharReferencesAndNormalize | ( | $ | text | ) | [static] |
Decode any character references, numeric or named entities, in the text and normalize the resulting string.
(bug 14952)
This is useful for page titles, not for text to be displayed; MediaWiki allows HTML entities to escape normalization as a feature.
$text | String (already normalized, containing entities) |
Definition at line 1306 of file Sanitizer.php.
References $count, and $wgContLang.
Referenced by Title\newFromText().
static Sanitizer::decodeCharReferencesCallback | ( | $ | matches | ) | [static] |
$matches | String |
Definition at line 1324 of file Sanitizer.php.
References $matches, decodeChar(), and decodeEntity().
static Sanitizer::decodeEntity | ( | $ | name | ) | [static] |
If the named entity is defined in the HTML 4.0/XHTML 1.0 DTD, return the UTF-8 encoding of that character.
Otherwise, returns pseudo-entity source (eg "&foo;")
$name | String |
Definition at line 1359 of file Sanitizer.php.
References codepointToUtf8().
Referenced by decodeCharReferencesCallback().
static Sanitizer::decodeTagAttributes | ( | $ | text | ) | [static] |
Return an associative array of attribute names and values from a partial tag string.
Attribute names are forced to lowercase, and character references are decoded to UTF-8 text.
$text | String |
Definition at line 1080 of file Sanitizer.php.
References $value, decodeCharReferences(), and getTagAttributeCallback().
Referenced by fixTagAttributes(), Linker\makeKnownLinkObj(), and SanitizerTest\testDecodeTagAttributes().
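The following standalone sketch illustrates the kind of result described above: an associative array with lowercased names and decoded values. It is not the MediaWiki implementation. The function name decodeTagAttributesSketch() and its simplified attribute pattern are hypothetical, and PHP's html_entity_decode() stands in for Sanitizer::decodeCharReferences().

```php
<?php
// Hypothetical, simplified stand-in for Sanitizer::decodeTagAttributes().
function decodeTagAttributesSketch(string $text): array {
    $attribs = [];
    // name = "double quoted" | 'single quoted' | unquoted
    $pattern = '/([A-Za-z0-9_:.-]+)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/';
    if (preg_match_all($pattern, $text, $sets, PREG_SET_ORDER)) {
        foreach ($sets as $set) {
            // Only one of groups 2..4 participates in a given match.
            $value = $set[4] ?? $set[3] ?? $set[2] ?? '';
            // Names are forced to lowercase; references are decoded to UTF-8.
            $attribs[strtolower($set[1])] =
                html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        }
    }
    return $attribs;
}

print_r(decodeTagAttributesSketch('CLASS="a" title=\'x &amp; y\' width=10'));
```

The real attribute regex is more lenient than the pattern used here; see getAttribsRegex().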
static Sanitizer::encodeAttribute | ( | $ | text | ) | [static] |
Encode an attribute value for HTML output.
$text | String |
Definition at line 917 of file Sanitizer.php.
Referenced by Xml\expandAttributes(), ApiFormatXml\recXmlPrint(), and safeEncodeAttribute().
static Sanitizer::escapeClass | ( | $ | class | ) | [static] |
Given a value, escape it so that it can be used as a CSS class and return it.
$class | String |
Definition at line 1040 of file Sanitizer.php.
Referenced by ChangeTags\formatSummaryRow(), SpecialStatistics\getGroupStats(), Skin\getPageClasses(), and SkinTemplate\outputPage().
static Sanitizer::escapeHtmlAllowEntities | ( | $ | html | ) | [static] |
Given HTML input, escape with htmlspecialchars but un-escape entities.
This allows (generally harmless) entities like &#160; to survive.
$html | String to escape |
Definition at line 1055 of file Sanitizer.php.
References decodeCharReferences().
Referenced by Linker\formatComment(), and wfMsgExt().
static Sanitizer::escapeId ( $id, $options = array() ) [static]
Given a value, escape it so that it can be used in an id attribute and return it.
This will use HTML5 validation if $wgExperimentalHtmlIds is true, allowing anything but ASCII whitespace. Otherwise it will use HTML 4 rules, which means a narrow subset of ASCII, with bad characters escaped with lots of dots.
To ensure we don't have to bother escaping anything, we also strip ', ", & even if $wgExperimentalHtmlIds is true. TODO: Is this the best tactic? We also strip # because it upsets IE, and % because it could be ambiguous if it's part of something that looks like a percent escape (which doesn't work reliably in fragments cross-browser).
$id | String: id to escape |
$options | Mixed: string or array of strings (default is array()): |
'noninitial': This is a non-initial fragment of an id, not a full id, so don't pay attention if the first character isn't valid at the beginning of an id. Only matters if $wgExperimentalHtmlIds is false.
'legacy': Behave the way the old HTML 4-based ID escaping worked even if $wgExperimentalHtmlIds is used, so we can generate extra anchors and links won't break.
Definition at line 996 of file Sanitizer.php.
References $options, and decodeCharReferences().
Referenced by HTMLFormField\__construct(), Skin\addToSidebarPlain(), MonoBookTemplate\customBox(), Title\escapeFragmentForURL(), SpecialListGroupRights\execute(), VectorTemplate\execute(), Parser\guessLegacySectionNameFromWikiText(), and validateAttributes().
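The HTML 4 style of escaping ("bad characters escaped with lots of dots") can be sketched as below. This is an illustration under stated assumptions, not the code at line 996: it assumes that the escaping is percent-encoding with each '%' rewritten to '.', that spaces become underscores, and that an 'x' is prepended when the result does not start with a letter. The function name legacyEscapeIdSketch() is hypothetical, and the decoding of character references is omitted.

```php
<?php
// Hypothetical sketch of HTML 4 style id escaping; see the assumptions above.
function legacyEscapeIdSketch(string $id, array $options = []): string {
    // Spaces become underscores, then everything unsafe is percent-encoded.
    $id = urlencode(strtr($id, ' ', '_'));
    // Keep ':' readable and turn the remaining '%' escapes into dots.
    $id = str_replace(['%3A', '%'], [':', '.'], $id);
    // HTML 4 ids must start with a letter unless this is a non-initial fragment.
    if (!preg_match('/^[A-Za-z]/', $id) && !in_array('noninitial', $options, true)) {
        $id = 'x' . $id;
    }
    return $id;
}

echo legacyEscapeIdSketch('Foo bar'), "\n";
echo legacyEscapeIdSketch("\u{E9}"), "\n";                 // non-ASCII input
echo legacyEscapeIdSketch("\u{E9}", ['noninitial']), "\n"; // no 'x' prefix
```

With 'noninitial' the leading-letter rule is skipped, which matches the description of that option for fragments that are appended to an existing id.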
static Sanitizer::fixTagAttributes ( $text, $element ) [static]
Take a tag soup fragment listing an HTML element's attributes and normalize it to well-formed XML, discarding unwanted attributes.
Output is safe for further wikitext processing, with escaping of values that could trigger problems.
$text | String |
$element | String |
Definition at line 894 of file Sanitizer.php.
References $value, decodeTagAttributes(), safeEncodeAttribute(), and validateTagAttributes().
Referenced by removeHTMLtags(), and SanitizerTest\testDeprecatedAttributesUnaltered().
static Sanitizer::getAttribsRegex | ( | ) | [static] |
Regular expression to match HTML/XML attribute pairs within a tag.
Allows some... latitude. Used in Sanitizer::fixTagAttributes and Sanitizer::decodeTagAttributes
Definition at line 333 of file Sanitizer.php.
References $attribsRegex.
static Sanitizer::getTagAttributeCallback | ( | $ | set | ) | [static, private] |
Pick the appropriate attribute value from a match set from the attribs regex matches.
$set | Array |
Definition at line 1116 of file Sanitizer.php.
Referenced by decodeTagAttributes().
static Sanitizer::hackDocType | ( | ) | [static] |
Hack up a private DOCTYPE with HTML's standard entity declarations.
PHP 4 seemed to know these if you gave it an HTML doctype, but PHP 5.1 doesn't.
Use for passing XHTML fragments to PHP's XML parsing functions
Definition at line 1592 of file Sanitizer.php.
Referenced by Xml\isWellFormedXmlFragment().
static Sanitizer::hexCharReference | ( | $ | codepoint | ) | [static] |
$codepoint |
Definition at line 1259 of file Sanitizer.php.
References validateCodepoint().
Referenced by normalizeCharReferencesCallback().
static Sanitizer::mergeAttributes ( $a, $b ) [static]
Merge two sets of HTML attributes.
Conflicting items in the second set will override those in the first, except for 'class' attributes which will be combined (if they're both strings).
$a | Array |
$b | Array |
Definition at line 731 of file Sanitizer.php.
References $out.
Referenced by Linker\linkAttribs(), and Linker\makeKnownLinkObj().
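The merge rule described above can be sketched in a few lines. The function name mergeAttributesSketch() is hypothetical, and removing duplicate class names is an assumption of this sketch rather than something this page states.

```php
<?php
// Hypothetical sketch: later values win, but string 'class' values are combined.
function mergeAttributesSketch(array $a, array $b): array {
    $out = array_merge($a, $b);
    if (isset($a['class'], $b['class'])
        && is_string($a['class']) && is_string($b['class'])
    ) {
        $classes = preg_split(
            '/\s+/', $a['class'] . ' ' . $b['class'], -1, PREG_SPLIT_NO_EMPTY
        );
        // Assumption: duplicates are dropped, keeping first-seen order.
        $out['class'] = implode(' ', array_unique($classes));
    }
    return $out;
}

print_r(mergeAttributesSketch(
    ['class' => 'a b', 'id' => 'first'],
    ['class' => 'b c', 'id' => 'second']
));
```

If either 'class' value is not a string, the sketch falls back to plain override, mirroring the "if they're both strings" condition.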
static Sanitizer::normalizeAttributeValue | ( | $ | text | ) | [static, private] |
Normalize whitespace and character references in an XML source- encoded text for an attribute value.
See http://www.w3.org/TR/REC-xml/#AVNormalize for background, but note that we're not returning the value, but are returning XML source fragments that will be slapped into output.
$text | String |
Definition at line 1149 of file Sanitizer.php.
References normalizeCharReferences().
static Sanitizer::normalizeCharReferences | ( | $ | text | ) | [static] |
Ensure that any entities and character references are legal for XML and XHTML specifically.
Any stray bits will be &-escaped to result in a valid text fragment.
a. Named char refs can only be &lt; &gt; &amp; &quot;; others are numericized (this way we're well-formed even without a DTD).
b. Any numeric char refs must be legal chars, not invalid or forbidden.
c. Use lower-cased "&#x", not "&#X".
d. Fix or reject non-valid attributes.
$text | String |
Definition at line 1193 of file Sanitizer.php.
Referenced by CoreParserFunctions\displaytitle(), and normalizeAttributeValue().
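Rules a to c can be sketched with a single regex callback. This is a simplified illustration, not the code in Sanitizer.php: the function names are hypothetical, the entity table is a tiny hand-picked excerpt standing in for Sanitizer::$htmlEntities, and MediaWiki-specific aliases and rule d are not covered.

```php
<?php
// Hypothetical helper: XML 1.0 "Char" production.
function isLegalXmlChar(int $cp): bool {
    return $cp === 0x09 || $cp === 0x0A || $cp === 0x0D
        || ($cp >= 0x20 && $cp <= 0xD7FF)
        || ($cp >= 0xE000 && $cp <= 0xFFFD)
        || ($cp >= 0x10000 && $cp <= 0x10FFFF);
}

function normalizeCharReferencesSketch(string $text): string {
    // Tiny excerpt of a name => codepoint table (assumption for illustration).
    $entities = ['lt' => 60, 'gt' => 62, 'amp' => 38, 'quot' => 34,
                 'nbsp' => 160, 'eacute' => 233];
    $core = ['lt', 'gt', 'amp', 'quot'];
    $regex = '/&([A-Za-z0-9\x80-\xff]+);|&#([0-9]+);|&#[xX]([0-9A-Fa-f]+);|(&)/';

    return preg_replace_callback($regex, function (array $m) use ($entities, $core) {
        if (($m[1] ?? '') !== '') {                 // named reference
            if (in_array($m[1], $core, true)) {
                return "&{$m[1]};";                 // rule a: core names stay
            }
            return isset($entities[$m[1]])
                ? "&#{$entities[$m[1]]};"           // rule a: numericize
                : "&amp;{$m[1]};";                  // unknown name: escape it
        }
        if (($m[2] ?? '') !== '') {                 // decimal reference
            $cp = (int)$m[2];
            return isLegalXmlChar($cp) ? "&#$cp;" : htmlspecialchars($m[0]);
        }
        if (($m[3] ?? '') !== '') {                 // hex reference
            $cp = hexdec($m[3]);
            return isLegalXmlChar((int)$cp)         // rule c: lower-case "&#x"
                ? sprintf('&#x%x;', $cp)
                : htmlspecialchars($m[0]);
        }
        return '&amp;';                             // stray ampersand
    }, $text);
}

echo normalizeCharReferencesSketch('&eacute; &lt; &#X41; &#0; &foo; a & b'), "\n";
```

Anything that is not a recognised, legal reference ends up &-escaped, which is what makes the output a valid text fragment even without a DTD.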
static Sanitizer::normalizeCharReferencesCallback | ( | $ | matches | ) | [static] |
$matches | String |
Definition at line 1203 of file Sanitizer.php.
References $matches, decCharReference(), hexCharReference(), and normalizeEntity().
static Sanitizer::normalizeEntity | ( | $ | name | ) | [static] |
If the named entity is defined in the HTML 4.0/XHTML 1.0 DTD, return the equivalent numeric entity reference (except for the core &lt; &gt; &amp; &quot;).
If the entity is a MediaWiki-specific alias, returns the HTML equivalent. Otherwise, returns HTML-escaped text of pseudo-entity source (e.g. &amp;foo;).
$name | String |
Definition at line 1229 of file Sanitizer.php.
Referenced by normalizeCharReferencesCallback().
static Sanitizer::normalizeSectionNameWhitespace | ( | $ | section | ) | [static] |
Normalizes whitespace in a section name, such as might be returned by Parser::stripSectionName(), for use in the ids that are used for section links.
$section | String |
Definition at line 1174 of file Sanitizer.php.
References $section.
Referenced by Linker\formatAutocommentsCallback(), and Parser\guessLegacySectionNameFromWikiText().
static Sanitizer::normalizeWhitespace | ( | $ | text | ) | [static, private] |
static Sanitizer::removeHTMLcomments | ( | $ | text | ) | [static] |
Remove '<!--', '-->', and everything between.
To avoid leaving blank lines, when a comment is both preceded and followed by a newline (ignoring spaces), trim leading and trailing spaces and one of the newlines.
$text | String |
Definition at line 580 of file Sanitizer.php.
References wfProfileIn(), and wfProfileOut().
Referenced by removeHTMLtags().
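The blank-line rule described above can be sketched with two regex passes. This is a swapped-in, regex-based illustration rather than the implementation at line 580, and the function name removeHtmlCommentsSketch() is hypothetical. Unterminated comments are left untouched in this sketch.

```php
<?php
// Hypothetical regex-based sketch of the comment-stripping rule.
function removeHtmlCommentsSketch(string $text): string {
    // One comment; the lookahead stops the body from running past '-->'.
    $comment = '<!--(?:(?!-->).)*-->';

    // Pass 1: a comment alone on its line (ignoring spaces) is removed together
    // with its leading newline, so no blank line is left behind.
    $text = preg_replace('/\n *' . $comment . ' *(?=\n)/s', '', $text) ?? $text;

    // Pass 2: every other comment is removed in place.
    return preg_replace('/' . $comment . '/s', '', $text) ?? $text;
}

echo removeHtmlCommentsSketch("a\n <!-- note --> \nb"), "\n";
echo removeHtmlCommentsSketch("foo <!-- note --> bar"), "\n";
```

In the first call the two lines "a" and "b" end up adjacent; in the second call only the comment itself is removed, so both surrounding spaces remain.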
static Sanitizer::removeHTMLtags ( $text, $processCallback = null, $args = array(), $extratags = array(), $removetags = array() ) [static]
Cleans up HTML, removes dangerous tags and attributes, and removes HTML comments.
$text | String |
$processCallback | Callback to do any variable or parameter replacements in HTML attribute values |
$args | Array for the processing callback |
$extratags | Array for any extra tags to include |
$removetags | Array for any tags (default or extra) to exclude |
Definition at line 366 of file Sanitizer.php.
References $params, $t, fixTagAttributes(), removeHTMLcomments(), wfProfileIn(), wfProfileOut(), wfRestoreWarnings(), and wfSuppressWarnings().
Referenced by CoreParserFunctions\displaytitle(), and SanitizerTest\testSelfClosingTag().
static Sanitizer::safeEncodeAttribute | ( | $ | text | ) | [static] |
Encode an attribute value for HTML tags, with extra armoring against further wiki processing.
$text | String |
Definition at line 938 of file Sanitizer.php.
References encodeAttribute(), and wfUrlProtocols().
Referenced by fixTagAttributes().
static Sanitizer::setupAttributeWhitelist | ( | ) | [static] |
For each array key (an allowed HTML element), return an array of allowed attributes.
Definition at line 1391 of file Sanitizer.php.
Referenced by attributeWhitelist().
static Sanitizer::stripAllTags | ( | $ | text | ) | [static] |
Take a fragment of (potentially invalid) HTML and return a version with any tags removed, encoded as plain text.
Warning: this return value must be further escaped for literal inclusion in HTML output as of 1.10!
$text | String: HTML fragment |
Definition at line 1572 of file Sanitizer.php.
Referenced by MWDebug\appendDebugInfoToApiResult(), and CoreParserFunctions\displaytitle().
static Sanitizer::validateAttributes ( $attribs, $whitelist ) [static]
Take an array of attribute names and values and normalize or discard illegal values for the given whitelist.
$attribs | Array |
$whitelist | Array: list of allowed attribute names |
Todo: Check for legal values where the DTD limits things.
Todo: Check for unique id attribute :P
Definition at line 650 of file Sanitizer.php.
References $out, $value, checkCss(), escapeId(), and wfUrlProtocols().
Referenced by validateTagAttributes().
static Sanitizer::validateCodepoint | ( | $ | codepoint | ) | [static, private] |
Returns true if a given Unicode codepoint is a valid character in XML.
$codepoint | Integer |
Definition at line 1273 of file Sanitizer.php.
Referenced by decCharReference(), decodeChar(), and hexCharReference().
static Sanitizer::validateEmail | ( | $ | addr | ) | [static] |
Does a string look like an e-mail address?
This validates an email address using an HTML5 specification found at: http://www.whatwg.org/specs/web-apps/current-work/multipage/states-of-the-type-attribute.html#valid-e-mail-address Which as of 2011-01-24 says:
A valid e-mail address is a string that matches the ABNF production 1*( atext / "." ) "@" ldh-str *( "." ldh-str ) where atext is defined in RFC 5322 section 3.2.3, and ldh-str is defined in RFC 1034 section 3.5.
This function is an implementation of the specification as requested in bug 22449.
Client-side forms will use the same standard validation rules via JS or HTML 5 validation; additional restrictions can be enforced server-side by extensions via the 'isValidEmailAddr' hook.
Note that this validation doesn't 100% match RFC 2822, but is believed to be liberal enough for wide use. Some invalid addresses will still pass validation here.
$addr | String E-mail address |
Definition at line 1684 of file Sanitizer.php.
Referenced by Autopromote\checkCondition(), SanitizerValidateEmailTest\checkEmail(), EmailConfirmation\execute(), and WebInstaller_Name\submit().
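The ABNF production quoted above translates fairly directly into a regular expression. The sketch below is only that translation, under the assumption that atext is the RFC 5322 set (letters, digits and !#$%&'*+-/=?^_`{|}~) and that ldh-str is a run of letters, digits and hyphens. The function name looksLikeEmailSketch() is hypothetical, and the 'isValidEmailAddr' hook is not modelled.

```php
<?php
// Hypothetical sketch of: 1*( atext / "." ) "@" ldh-str *( "." ldh-str )
function looksLikeEmailSketch(string $addr): bool {
    $pattern = '/^[A-Za-z0-9!#$%&\'*+\/=?^_`{|}~.-]+'   // 1*( atext / "." )
             . '@'
             . '[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*$/D';  // ldh-str *( "." ldh-str )
    return preg_match($pattern, $addr) === 1;
}

var_dump(looksLikeEmailSketch('user.name+tag@sub.example.org'));
var_dump(looksLikeEmailSketch('user name@example.org'));
var_dump(looksLikeEmailSketch('.@x'));
```

The last call shows why the text warns that some invalid addresses still pass: the production accepts a local part made only of dots and a domain with a single label.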
static Sanitizer::validateTagAttributes ( $attribs, $element ) [static]
Take an array of attribute names and values and normalize or discard illegal values for the given element type.
$attribs | Array |
$element | String |
Todo: Check for legal values where the DTD limits things.
Todo: Check for unique id attribute :P
Definition at line 630 of file Sanitizer.php.
References attributeWhitelist(), and validateAttributes().
Referenced by fixTagAttributes(), and CoreTagHooks\pre().
Sanitizer::$attribsRegex [static] |
Lazy-initialised attributes regex, see getAttribsRegex()
Definition at line 326 of file Sanitizer.php.
Referenced by getAttribsRegex().
Sanitizer::$htmlEntities [static] |
List of all named character entities defined in HTML 4.01 (http://www.w3.org/TR/html4/sgml/entities.html), as well as &apos;, which is only defined starting in XHTML1.
Definition at line 59 of file Sanitizer.php.
Sanitizer::$htmlEntityAliases [static] |
Initial value: array( 'רלמ' => 'rlm', 'رلم' => 'rlm', )
Character entity aliases accepted by MediaWiki.
Definition at line 318 of file Sanitizer.php.
const Sanitizer::CHAR_REFS_REGEX = '/&([A-Za-z0-9\x80-\xff]+); |&\#([0-9]+); |&\#[xX]([0-9A-Fa-f]+); |(&)/x' |
Regular expression to match various types of character references in Sanitizer::normalizeCharReferences and Sanitizer::decodeCharReferences.
Definition at line 36 of file Sanitizer.php.
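The four alternatives of this regex map to four capture groups: a named entity, a decimal reference, a hexadecimal reference, and a bare ampersand. The standalone sketch below copies the pattern from above and reports which group matched; the function name classifyCharRefs() and its labels are hypothetical.

```php
<?php
// Pattern copied from the documentation above; the /x flag ignores the
// whitespace between alternatives, and '\#' keeps '#' from starting a comment.
$charRefsRegex = '/&([A-Za-z0-9\x80-\xff]+);
    |&\#([0-9]+);
    |&\#[xX]([0-9A-Fa-f]+);
    |(&)/x';

// Hypothetical helper: label each match by the capture group that fired.
function classifyCharRefs(string $text, string $regex): array {
    $kinds = [];
    preg_match_all($regex, $text, $sets, PREG_SET_ORDER);
    foreach ($sets as $m) {
        if (($m[1] ?? '') !== '') {
            $kinds[] = 'named:' . $m[1];
        } elseif (($m[2] ?? '') !== '') {
            $kinds[] = 'decimal:' . $m[2];
        } elseif (($m[3] ?? '') !== '') {
            $kinds[] = 'hex:' . $m[3];
        } else {
            $kinds[] = 'bare';
        }
    }
    return $kinds;
}

print_r(classifyCharRefs('&amp; &#65; &#x41; AT&T', $charRefsRegex));
```

The groups are compared against the empty string rather than tested with empty(), because a decimal reference such as &#0; captures the string "0".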
const Sanitizer::EVIL_URI_PATTERN = '!(^|\s|\*/\s*)(javascript|vbscript)([^\w]|$)!i' |
Blacklist for evil URIs like javascript:.
WARNING: DO NOT use this in any place that actually requires blacklisting for security reasons.
There are NUMEROUS[1] ways to bypass blacklisting; the only way to be secure from javascript: URI-based XSS vectors is to whitelist things that you know are safe and deny everything else.
[1]: http://ha.ckers.org/xss.html
Definition at line 50 of file Sanitizer.php.
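The warning can be made concrete with a short standalone sketch that runs the pattern, copied from above, against a few strings. The helper name looksEvil() and the sample inputs are hypothetical. The tab-split input is one illustration of a string the pattern does not flag, on the assumption that some URL parsers ignore embedded tabs.

```php
<?php
// Pattern copied from the constant documented above.
$evilUriPattern = '!(^|\s|\*/\s*)(javascript|vbscript)([^\w]|$)!i';

// Hypothetical helper: true if the blacklist pattern fires on $value.
function looksEvil(string $value, string $pattern): bool {
    return preg_match($pattern, $value) === 1;
}

$samples = [
    'javascript:alert(1)',
    '  VBScript:msgbox(1)',
    '/* c */javascript:x',
    "java\tscript:alert(1)",                // scheme name split by a tab
    'http://example.org/javascript.html',   // harmless path segment
];

foreach ($samples as $sample) {
    printf("%-40s %s\n", json_encode($sample), looksEvil($sample, $evilUriPattern) ? 'flagged' : 'passes');
}
```

The first three samples are flagged and the last two pass, which is why the pattern is unsuitable as a security boundary and a whitelist of known-safe protocols is the recommended approach.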
const Sanitizer::XMLNS_ATTRIBUTE_PATTERN = "/^xmlns:[:A-Z_a-z-.0-9]+$/" |
Definition at line 51 of file Sanitizer.php.