PHPXRef 0.7.1 : MediaWiki-1.24.0 : Detail view of Sanitizer.php

HTML sanitizer for MediaWiki

Regular expression to match HTML/XML attribute pairs within a tag.
Allows some... latitude.
Used in Sanitizer::fixTagAttributes and Sanitizer::decodeTagAttributes

return: string

removeHTMLtags( $text, $processCallback = null, $args = array(), $extratags = array(), $removetags = array() ) X-Ref

Cleans up HTML, removes dangerous tags and attributes, and
removes HTML comments

param: string $text
param: callable $processCallback Callback to do any variable or parameter replacements in HTML attribute values
param: array|bool $args Arguments for the processing callback
param: array $extratags For any extra tags to include
param: array $removetags For any tags (default or extra) to exclude
return: string

removeHTMLcomments( $text ) X-Ref

Remove '<!--', '-->', and everything between.
To avoid leaving blank lines, when a comment is both preceded
and followed by a newline (ignoring spaces), trim leading and
trailing spaces and one of the newlines.

param: string $text
return: string
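
The newline rule is easiest to see on a small input. The sketch below is a regex-based approximation of the behaviour described above, not the code in Sanitizer.php; the function name sketchRemoveComments is made up for this example, and an unterminated comment is simply left in place.

```php
<?php
// Hypothetical helper: approximates the comment-removal rule described above.
function sketchRemoveComments( $text ) {
    // Body of a comment: anything that does not contain the closing "-->".
    $body = '(?:(?!-->).)*';
    // Case 1: the comment is preceded and followed by a newline (spaces may
    // sit on either side). Remove the leading newline, the spaces and the
    // comment; the lookahead keeps the trailing newline, so one survives.
    $text = preg_replace( '/\n *<!--' . $body . '--> *(?=\n)/s', '', $text );
    // Case 2: every other comment is removed on its own.
    return preg_replace( '/<!--' . $body . '-->/s', '', $text );
}

echo sketchRemoveComments( "a\n  <!-- note -->  \nb" ), "\n";
```

With this input the two surrounding lines end up adjacent, with no blank line left where the comment was.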

validateTag( $params, $element ) X-Ref

Takes attribute names and values for a tag and the tag name and
validates that the tag is allowed to be present.
This DOES NOT validate the attributes, nor does it validate the
tags themselves. This method only handles the special circumstances
where we may want to allow a tag within content but ONLY when it has
specific attributes set.

param: string $params
param: string $element
return: bool

validateTagAttributes( $attribs, $element ) X-Ref

Take an array of attribute names and values and normalize or discard
illegal values for the given element type.

- Discards attributes not on a whitelist for the given element
- Unsafe style attributes are discarded
- Invalid id attributes are re-encoded

param: array $attribs
param: string $element
return: array

validateAttributes( $attribs, $whitelist ) X-Ref

Take an array of attribute names and values and normalize or discard
illegal values for the given whitelist.

- Discards attributes not on the given whitelist
- Unsafe style attributes are discarded
- Invalid id attributes are re-encoded

param: array $attribs
param: array $whitelist List of allowed attribute names
return: array

mergeAttributes( $a, $b ) X-Ref

Merge two sets of HTML attributes. Conflicting items in the second set
will override those in the first, except for 'class' attributes which
will be combined (if they're both strings).

param: array $a
param: array $b
return: array
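
A minimal sketch of the merge rule, assuming attributes are plain associative arrays. sketchMergeAttributes is a hypothetical name used only for this illustration; it follows the description above and is not copied from Sanitizer.php.

```php
<?php
// Hypothetical stand-in for the merge rule described above.
function sketchMergeAttributes( array $a, array $b ) {
    // Items in the second set override those in the first.
    $out = array_merge( $a, $b );
    if ( isset( $a['class'], $b['class'] )
        && is_string( $a['class'] ) && is_string( $b['class'] )
    ) {
        // Both classes are strings: combine them and drop duplicates.
        $classes = preg_split(
            '/\s+/', $a['class'] . ' ' . $b['class'], -1, PREG_SPLIT_NO_EMPTY
        );
        $out['class'] = implode( ' ', array_unique( $classes ) );
    }
    return $out;
}

print_r( sketchMergeAttributes(
    array( 'class' => 'a b', 'id' => 'x' ),
    array( 'class' => 'b c', 'id' => 'y' )
) );
```

Here 'id' is taken from the second array while the two 'class' values are combined into one list.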

normalizeCss( $value ) X-Ref

Normalize CSS into a format we can easily search for hostile input
- decode character references
- decode escape sequences
- convert characters that IE6 interprets into ASCII
- remove comments, unless the entire value is one single comment

param: string $value the css string
return: string normalized css

checkCss( $value ) X-Ref

No description

cssDecodeCallback( $matches ) X-Ref

param: array $matches
return: string

fixTagAttributes( $text, $element ) X-Ref

Take a tag soup fragment listing an HTML element's attributes
and normalize it to well-formed XML, discarding unwanted attributes.
Output is safe for further wikitext processing, with escaping of
values that could trigger problems.

- Normalizes attribute names to lowercase
- Discards attributes not on a whitelist for the given element
- Turns broken or invalid entities into plaintext
- Double-quotes all attribute values
- Attributes without values are given their own name as the value
- Double attributes are discarded
- Unsafe style attributes are discarded
- Prepends space if there are attributes.

param: string $text
param: string $element
return: string
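
The steps listed above can be sketched as a small pipeline. Everything here is an assumption made for illustration: the attribute pattern is a simplified stand-in for the real one, the whitelist is passed in as a hypothetical $allowed parameter, entity repair is reduced to htmlspecialchars() with double encoding turned off, the style check is omitted, and for a repeated attribute this sketch keeps the last value.

```php
<?php
// Hypothetical, simplified version of the normalization steps listed above.
function sketchFixTagAttributes( $text, array $allowed ) {
    // Groups: 1 = name, 2 = "double-quoted", 3 = 'single-quoted', 4 = unquoted.
    $pattern = '/([A-Za-z0-9:_-]+)'
        . '(?:\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+)))?/';
    preg_match_all( $pattern, $text, $sets, PREG_SET_ORDER );

    $out = array();
    foreach ( $sets as $set ) {
        $name = strtolower( $set[1] );            // lowercase the name
        if ( !in_array( $name, $allowed, true ) ) {
            continue;                             // not on the whitelist
        }
        // The value is the last captured group; a bare attribute has none,
        // so it gets its own name as the value.
        $value = count( $set ) > 2 ? $set[count( $set ) - 1] : $name;
        // Always double-quote; stray "&" and quotes are escaped, while
        // entities that are already valid are left alone.
        $out[$name] = $name . '="'
            . htmlspecialchars( $value, ENT_QUOTES, 'UTF-8', false ) . '"';
    }
    // Prepend a space only if at least one attribute survived.
    return $out ? ' ' . implode( ' ', $out ) : '';
}

echo sketchFixTagAttributes(
    'CLASS=foo onclick="x()" hidden',
    array( 'class', 'hidden' )
), "\n";
```

Returning the leading space as part of the result means a caller can append the string directly after the element name without checking whether any attributes remain.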

encodeAttribute( $text ) X-Ref

Encode an attribute value for HTML output.

param: string $text
return: string HTML-encoded text fragment

safeEncodeAttribute( $text ) X-Ref

Encode an attribute value for HTML tags, with extra armoring
against further wiki processing.

param: string $text
return: string HTML-encoded text fragment

escapeId( $id, $options = array() ) X-Ref

Given a value, escape it so that it can be used in an id attribute and
return it.  This will use HTML5 validation if $wgExperimentalHtmlIds is
true, allowing anything but ASCII whitespace.  Otherwise it will use
HTML 4 rules, which means a narrow subset of ASCII, with bad characters
escaped with lots of dots.

To ensure we don't have to bother escaping anything, we also strip ', ",
& even if $wgExperimentalHtmlIds is true.  TODO: Is this the best tactic?
We also strip # because it upsets IE, and % because it could be
ambiguous if it's part of something that looks like a percent escape
(which don't work reliably in fragments cross-browser).

param: string $id Id to escape
param: string|array $options String or array of strings (default is array())
return: string
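
To make "escaped with lots of dots" concrete, here is a minimal sketch of HTML 4 style escaping. It is an approximation only: sketchEscapeIdHtml4 is a hypothetical name, the function ignores $options and $wgExperimentalHtmlIds, and it assumes that spaces become underscores and that an id must start with a letter.

```php
<?php
// Hypothetical sketch of HTML 4 style id escaping; not the Sanitizer code.
function sketchEscapeIdHtml4( $id ) {
    // Assumption: spaces are turned into underscores first.
    $id = urlencode( strtr( $id, ' ', '_' ) );
    // Percent escapes become dot escapes; ":" is legal in an HTML 4 id.
    $id = str_replace( array( '%3A', '%' ), array( ':', '.' ), $id );
    // An HTML 4 id must begin with a letter.
    if ( !preg_match( '/^[a-zA-Z]/', $id ) ) {
        $id = 'x' . $id;
    }
    return $id;
}

echo sketchEscapeIdHtml4( 'Foo bar' ), "\n";   // Foo_bar
echo sketchEscapeIdHtml4( "\xC3\x89" ), "\n";  // x.C3.89
```

Because "%" is replaced wholesale, the result never contains a percent sign, which sidesteps the ambiguity with percent escapes mentioned above.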

escapeClass( $class ) X-Ref

Given a value, escape it so that it can be used as a CSS class and
return it.

param: string $class
return: string

escapeHtmlAllowEntities( $html ) X-Ref

Given HTML input, escape with htmlspecialchars but un-escape entities.
This allows (generally harmless) entities like &#160; to survive.

param: string $html HTML to escape
return: string Escaped input
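
One way to get a comparable effect in plain PHP is to call htmlspecialchars() with its double_encode argument set to false. This is an assumption about a possible approach, not a statement of how Sanitizer.php does it, and sketchEscapeAllowEntities is a hypothetical name.

```php
<?php
// Hypothetical sketch: escape markup but leave existing valid entities alone.
function sketchEscapeAllowEntities( $html ) {
    // Fourth argument false = do not re-encode entities already present.
    return htmlspecialchars( $html, ENT_QUOTES, 'UTF-8', false );
}

echo sketchEscapeAllowEntities( '<b>&#160;&amp; & "x"</b>' ), "\n";
```

The tags and quotes are escaped, the bare "&" becomes "&amp;", and the two existing entities pass through unchanged.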

armorLinksCallback( $matches ) X-Ref

Regex replace callback for armoring links against further processing.

param: array $matches
return: string

decodeTagAttributes( $text ) X-Ref

Return an associative array of attribute names and values from
a partial tag string. Attribute names are forced to lowercase,
character references are decoded to UTF-8 text.

param: string $text
return: array

safeEncodeTagAttributes( $assoc_array ) X-Ref

Build a partial tag string from an associative array of attribute
names and values as returned by decodeTagAttributes.

param: array $assoc_array
return: string

getTagAttributeCallback( $set ) X-Ref

Pick the appropriate attribute value from a match set from the
attribs regex matches.

param: array $set
return: string

normalizeAttributeValue( $text ) X-Ref

Normalize whitespace and character references in an XML source-
encoded text for an attribute value.

See http://www.w3.org/TR/REC-xml/#AVNormalize for background,
but note that we're not returning the value, but are returning
XML source fragments that will be slapped into output.

param: string $text
return: string

normalizeWhitespace( $text ) X-Ref

param: string $text
return: string

normalizeSectionNameWhitespace( $section ) X-Ref

Normalizes whitespace in a section name, such as might be returned
by Parser::stripSectionName(), for use in the id's that are used for
section links.

param: string $section
return: string

normalizeCharReferences( $text ) X-Ref

Ensure that any entities and character references are legal
for XML and XHTML specifically. Any stray bits will be
&-escaped to result in a valid text fragment.

a. named char refs can only be &lt; &gt; &amp; &quot;, others are
numericized (this way we're well-formed even without a DTD)
b. any numeric char refs must be legal chars, not invalid or forbidden
c. use lower cased "&#x", not "&#X"
d. fix or reject non-valid attributes

param: string $text
return: string

normalizeCharReferencesCallback( $matches ) X-Ref

param: string $matches
return: string

normalizeEntity( $name ) X-Ref

If the named entity is defined in the HTML 4.0/XHTML 1.0 DTD,
return the equivalent numeric entity reference (except for the core &lt;
&gt; &amp; &quot;). If the entity is a MediaWiki-specific alias, returns
the HTML equivalent. Otherwise, returns HTML-escaped text of
pseudo-entity source (eg &amp;foo;)

param: string $name
return: string

decCharReference( $codepoint ) X-Ref

param: int $codepoint
return: null|string

hexCharReference( $codepoint ) X-Ref

param: int $codepoint
return: null|string

validateCodepoint( $codepoint ) X-Ref

Returns true if a given Unicode codepoint is a valid character in XML.

param: int $codepoint
return: bool
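
The check can be written directly from the Char production of the XML 1.0 specification. The sketch below uses hypothetical names (sketchIsXmlChar, sketchDecRef, sketchHexRef) and assumes integer codepoints as input; it also shows how the two reference builders above could return null for a codepoint that fails the check.

```php
<?php
// XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
//               | [#x10000-#x10FFFF]
function sketchIsXmlChar( $cp ) {
    return $cp === 0x09 || $cp === 0x0A || $cp === 0x0D
        || ( $cp >= 0x20 && $cp <= 0xD7FF )
        || ( $cp >= 0xE000 && $cp <= 0xFFFD )
        || ( $cp >= 0x10000 && $cp <= 0x10FFFF );
}

// Decimal reference, or null when the codepoint is not a legal XML character.
function sketchDecRef( $cp ) {
    return sketchIsXmlChar( $cp ) ? sprintf( '&#%d;', $cp ) : null;
}

// Hexadecimal reference with a lowercase "x", or null as above.
function sketchHexRef( $cp ) {
    return sketchIsXmlChar( $cp ) ? sprintf( '&#x%x;', $cp ) : null;
}

var_dump( sketchDecRef( 65 ), sketchHexRef( 0xE9 ), sketchDecRef( 0 ) );
```

Surrogates (U+D800 to U+DFFF), U+FFFE, U+FFFF and most control characters fall outside the ranges and are rejected.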

decodeCharReferences( $text ) X-Ref

Decode any character references, numeric or named entities,
in the text and return a UTF-8 string.

param: string $text
return: string
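
For comparison, PHP's built-in html_entity_decode() performs a similar decoding step. The sketch below is only an analogy using that built-in (the wrapper name sketchDecodeRefs is hypothetical); details such as the handling of invalid numeric references may differ from the method documented here.

```php
<?php
// Hypothetical wrapper around PHP's built-in decoder, for illustration only.
function sketchDecodeRefs( $text ) {
    return html_entity_decode( $text, ENT_QUOTES | ENT_HTML401, 'UTF-8' );
}

// Named, decimal and hexadecimal forms of the same character, plus an
// unknown name that is left untouched.
echo sketchDecodeRefs( '&eacute; &#233; &#xe9; &foo;' ), "\n";
```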

decodeCharReferencesAndNormalize( $text ) X-Ref

Decode any character references, numeric or named entities,
in the text and normalize the resulting string. (bug 14952)

This is useful for page titles, not for text to be displayed;
MediaWiki allows HTML entities to escape normalization as a feature.

param: string $text Already normalized, containing entities
return: string Still normalized, without entities

decodeCharReferencesCallback( $matches ) X-Ref

param: string $matches
return: string

decodeChar( $codepoint ) X-Ref

Return UTF-8 string for a codepoint if that is a valid
character reference, otherwise U+FFFD REPLACEMENT CHARACTER.

param: int $codepoint
return: string
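
A self-contained sketch of this behaviour, assuming "valid" means a legal XML 1.0 character. The name sketchDecodeChar is hypothetical, and the UTF-8 encoding is done by hand so the example does not depend on the mbstring extension.

```php
<?php
// Hypothetical sketch: codepoint to UTF-8, or U+FFFD if the codepoint is not
// a legal XML 1.0 character (an assumption about what "valid" means here).
function sketchDecodeChar( $cp ) {
    $valid = $cp === 0x09 || $cp === 0x0A || $cp === 0x0D
        || ( $cp >= 0x20 && $cp <= 0xD7FF )
        || ( $cp >= 0xE000 && $cp <= 0xFFFD )
        || ( $cp >= 0x10000 && $cp <= 0x10FFFF );
    if ( !$valid ) {
        return "\xEF\xBF\xBD"; // U+FFFD REPLACEMENT CHARACTER in UTF-8
    }
    if ( $cp < 0x80 ) {
        return chr( $cp );                                   // 1 byte
    }
    if ( $cp < 0x800 ) {
        return chr( 0xC0 | ( $cp >> 6 ) )                    // 2 bytes
            . chr( 0x80 | ( $cp & 0x3F ) );
    }
    if ( $cp < 0x10000 ) {
        return chr( 0xE0 | ( $cp >> 12 ) )                   // 3 bytes
            . chr( 0x80 | ( ( $cp >> 6 ) & 0x3F ) )
            . chr( 0x80 | ( $cp & 0x3F ) );
    }
    return chr( 0xF0 | ( $cp >> 18 ) )                       // 4 bytes
        . chr( 0x80 | ( ( $cp >> 12 ) & 0x3F ) )
        . chr( 0x80 | ( ( $cp >> 6 ) & 0x3F ) )
        . chr( 0x80 | ( $cp & 0x3F ) );
}

echo bin2hex( sketchDecodeChar( 0x20AC ) ), "\n"; // e282ac
echo bin2hex( sketchDecodeChar( 0xD800 ) ), "\n"; // efbfbd
```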

decodeEntity( $name ) X-Ref

If the named entity is defined in the HTML 4.0/XHTML 1.0 DTD,
return the UTF-8 encoding of that character. Otherwise, returns
pseudo-entity source (eg "&foo;")

param: string $name
return: string

attributeWhitelist( $element ) X-Ref

Fetch the whitelist of acceptable attributes for a given element name.

param: string $element
return: array

setupAttributeWhitelist() X-Ref

For each array key (an allowed HTML element), return an array
of allowed attributes

return: array

stripAllTags( $text ) X-Ref

Take a fragment of (potentially invalid) HTML and return
a version with any tags removed, encoded as plain text.

Warning: this return value must be further escaped for literal
inclusion in HTML output as of 1.10!

param: string $text HTML fragment
return: string
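
The warning matters because decoding entities can reintroduce characters such as "<" and "&". The sketch below illustrates the point with a simplified, hypothetical sketchStripAllTags (a plain regex plus html_entity_decode(), not the Sanitizer implementation) followed by the extra escaping step a caller would need.

```php
<?php
// Hypothetical, simplified tag stripper used only to illustrate the warning.
function sketchStripAllTags( $html ) {
    $text = preg_replace( '/<[^>]*>/', '', $html );          // drop tags
    $text = html_entity_decode( $text, ENT_QUOTES | ENT_HTML401, 'UTF-8' );
    return trim( preg_replace( '/\s+/', ' ', $text ) );      // tidy whitespace
}

$plain = sketchStripAllTags( '<p>1 &lt; 2 &amp; <b>so on</b></p>' );
// $plain now contains a raw "<" and "&": it is plain text, not HTML.
$safe = htmlspecialchars( $plain, ENT_QUOTES, 'UTF-8' );

echo $plain, "\n"; // 1 < 2 & so on
echo $safe, "\n";  // 1 &lt; 2 &amp; so on
```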

hackDocType() X-Ref

Hack up a private DOCTYPE with HTML's standard entity declarations.
PHP 4 seemed to know these if you gave it an HTML doctype, but
PHP 5.1 doesn't.

Use for passing XHTML fragments to PHP's XML parsing functions

return: string

cleanUrl( $url ) X-Ref

param: string $url
return: mixed|string

cleanUrlCallback( $matches ) X-Ref

param: array $matches
return: string

validateEmail( $addr ) X-Ref

Does a string look like an e-mail address?

This validates an email address using an HTML5 specification found at:
http://www.whatwg.org/html/states-of-the-type-attribute.html#valid-e-mail-address
Which as of 2011-01-24 says:

A valid e-mail address is a string that matches the ABNF production
1*( atext / "." ) "@" ldh-str *( "." ldh-str ) where atext is defined
in RFC 5322 section 3.2.3, and ldh-str is defined in RFC 1034 section
3.5.

This function is an implementation of the specification as requested in
bug 22449.

Client-side forms will use the same standard validation rules via JS or
HTML 5 validation; additional restrictions can be enforced server-side
by extensions via the 'isValidEmailAddr' hook.

Note that this validation doesn't 100% match RFC 2822, but is believed
to be liberal enough for wide use. Some invalid addresses will still
pass validation here.

param: string $addr E-mail address
return: bool
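
The ABNF production quoted above translates fairly directly into a regular expression. The following is a sketch under that reading, with a hypothetical function name; it does not call any hook and is not taken from Sanitizer.php.

```php
<?php
// Hypothetical sketch of: 1*( atext / "." ) "@" ldh-str *( "." ldh-str )
function sketchLooksLikeEmail( $addr ) {
    // atext (RFC 5322): letters, digits and ! # $ % & ' * + - / = ? ^ _ ` { | } ~
    $local = '[a-z0-9!#$%&\'*+\-\/=?^_`{|}~.]+';
    // ldh-str (RFC 1034): letters, digits and hyphens
    $label = '[a-z0-9-]+';
    // "i" = case-insensitive; "D" = "$" matches only at the very end.
    $regex = '/^' . $local . '@' . $label . '(?:\.' . $label . ')*$/iD';
    return preg_match( $regex, $addr ) === 1;
}

var_dump( sketchLooksLikeEmail( 'user.name+tag@sub.example.org' ) ); // true
var_dump( sketchLooksLikeEmail( 'a@b' ) );                           // true
var_dump( sketchLooksLikeEmail( 'no-at-sign' ) );                    // false
```

The second call shows the liberal side of the rule: a domain without any dot still passes, which is consistent with the note that some invalid addresses are accepted.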

File Size:	1885 lines (56 kb)
Included or required:	0 times
Referenced:	1 time
Includes or requires:	0 files

/includes/ -> Sanitizer.php (summary)

Defines 1 class