XML Schema Formalization

MSL
A model for W3C XML Schema

Editors:

Allen Brown, Microsoft, allenbr@microsoft.com

Matthew Fuchs, Commerce One, matthew.fuchs@commerceone.com

Jonathan Robie, Software AG, jonathan.robie@SoftwareAG-USA.com

Philip Wadler, Avaya Labs, wadler@avaya.com

February 19, 2001


Copyright is held by the author/owner.
WWW10, May 2-5, 2001, Hong Kong
ACM 1-58113-348-0/01/0005.

Abstract

MSL (Model Schema Language) is an attempt to formalize some of the core idea in XML Schema. The benefits of a formal description is that it is both concise and precise. MSL has already proved helpful in work on the design of XML Query. We expect that similar techniques can be used to extend MSL to include most or all of XML Schema.

We expect an MSL document to be released by the W3C and for MSL to continue to evolve, as will XML Schema. This paper concentrates on the key element of MSL and XML Schema, namely normalization, refinement, and validation.

This HTML version of the paper is based (due to problems in HTML conversion) on an older version of the MSL document. Please see the proceedings version or the online Latex version.

1. Introduction

XML is based on two simple ideas: represent documents and data as trees, and represent the types of documents and data using tree grammars. Tree grammars are represented using DTDs [4] or XML Schema [6,11,3]. XML Schema is being developed under the auspices of the World Wide Web Consortium (W3C), the body responsible for HTML and XML, among other things. As of this writing, XML Schema recently entered candidate recommendation status.

XML Schema is more powerful than DTDs. Among other things, it uses a uniform XML syntax, supports derivation of document types (similar to subclassing in object-oriented languages), permits `all' groups and nested definitions, and provides atomic data types (such as integers, floating point, dates) in addition to character data. XML Schema is also more complex than DTDs, requiring a couple of hundred pages to describe, as opposed to the thirty or so in the original specification of XML 1.0 (which included DTDs). The remainder of this paper assumes some familiarity with XML Schema.

MSL (Model Schema Language) is an attempt to formalize some of the core idea in XML Schema. The benefits of a formal description is that it is both concise and precise, although it does require some familiarity with mathematical notation.

MSL is described with an inference rule notation. Originally developed by logicians [7,13], this notation is now widely used for describing type systems and semantics of programming languages [10]. A basic understanding of grammar rules and first-order predicate logic should be adequate to understand this paper; all other notations are defined before they are used.

We hope our work on MSL may make XML Schema easier to understand, and may aid in the process of designing other specifications and tools that build on XML Schema. In particular, MSL has already proved helpful in work on the design of XML Query, another W3C standard currently under development. We expect that similar techniques can be used to extend MSL to include most or all of XML Schema.

MSL (like XML Schema) draws on standard ideas about type systems for semistructured data as described in the literature [1, 9], notably the use of regular expressions and tree grammars. In particular, MSL closely resembles the type system in Xduce [8]. Another example of formalizing part of an XML specification can be found in [12].

Many important aspects of XML Schema are not modeled by MSL. We have focussed on the core material in XML Schema Part I (Structures), as we believe this is the most complex. Many features of XML Schema are not modeled. These include the following.

We have begun to work on modeling the first two of these. We believe that most or all of the items in the above list could be modeled with additional effort.

In addition, MSL differs from XML Schema on the definition of restriction. We believe the definition of restriction in XML Schema is unnecessarily ad hoc, as explained in Appendix A.

We expect an MSL document to be released by the W3C and for MSL to continue to evolve, as will XML Schema. This paper concentrates on the key element of MSL and XML Schema, namely normalization, refinement, and validation.

The central point the reader should take away is that it is possible to create a simple model that captures the essence of these features. This was not obvious before we began, and indeed the initial version of MSL was much more complex than what is presented here.

The remainder of this paper is organized as follows. Section 2 gives an overview of MSL. Section 3 defines the basic MSL structures, including names, groups, and components. Section 4 describes normalization. Section 5 describes refinement. Section 6 describes validation. Appendix A lists problems with the current XML Schema Drafts (at the time of writing), uncovered during the MSL work. Appendix B lists suggestions for improving XML Schema, based on the MSL work.

2. Overview

This section uses a running example to introduce the MSL representation of a schema. MSL uses a mathematical notation that is easier to manipulate formally than the XML syntax of Schema. One aspect of MSL is the use of normalization to provide a unique name for each component of a schema.

Figure 1 shows a sample schema written in W3C XML Schema syntax, and Figure 2 shows an XML document which matches the schema. These will be used as running examples throughout this section.

<xsi:schema targetns = "http://www.foo.org/baz.xsd"
    xmlns = "http://www.foo.org/baz.xsd"
    xmlns:xsi = "http://www.w3.org/1999/XMLSchema">
  <xsi:element name="a" type="t"/>
  <xsi:simpleType name="b"/>
    <xsi:list base="xsi:integer"/>
  </xsi:simpleType>
  <xsi:complexType name="t">
    <xsi:attribute name="b" type="xsi:string"/>
    <xsi:attribute name="c" type="b" use="optional"/>
  </xsi:complexType>
  <xsi:complexType name="u">
    <xsi:complexContent>
      <xsi:extension base="t">
        <xsi:choice>
          <xsi:element name="d">
            <xsi:complexType>
              <xsi:sequence>
                <xsi:element name="a"
                    type="xsi:string"
                    minOccurs="1"
                    maxOccurs="unbounded"/>
              </xsi:sequence>
            </xsi:complexType>
          </xsi:element>
          <xsi:element name="e" type="xsi:string"/>
        </xsi:choice>
      </extension>
    </complexContent>
  </xsi:complexType>
</xsi:schema> 
Figure 1 - An XML Schema

<baz:a xmlns:baz="http://www.foo.org/baz.xsd"
    xsi:type="baz:u"
    b="zero"
    c="1 2">
  <d>
    <a>three</a>
    <a>four</a>
  </d>
</baz:a> 
Figure 2 - An XML document

Normalization

MSL uses a normalized form of a schema, which assigns a unique universal name to each component of a schema, and flattens the structure. A component is anything which may be defined or declared: an element, an attribute, a simple type, a complex type, a group, or an attribute group.

The following figure shows the normalized universal names of the components in our sample schema. For each name, we list two forms: the long form is the name proper, while the short form is an abbreviated version we use in examples to improve readability.

long form
       short form
http://www.foo.org/baz.xsd#element::a
       a
http://www.foo.org/baz.xsd#type::s
       s
http://www.foo.org/baz.xsd#type::t
       t
http://www.foo.org/baz.xsd#type::t/attribute::b
       t/@b
http://www.foo.org/baz.xsd#type::t/attribute::c
       t/@c
http://www.foo.org/baz.xsd#type::u
       u
http://www.foo.org/baz.xsd#type::u/element::d
       u/d
http://www.foo.org/baz.xsd#type::u/element::d/type::*
       u/d/ *
http://www.foo.org/baz.xsd#type::u/element::d/type::*/element::a
       u/d/ */a
http://www.foo.org/baz.xsd#type::u/element::e
       u/e

The names reflect the nesting structure of the original schema. Nested elements or attributes are given the name of the element or attribute; nested types are given the anonymous name *. The syntax of names is chosen to be similar to XPath [5].

MSL's normalized names clearly distinguish local names from global names. Where previously it might be possible to confuse the global a element with the local a element, now these are given distinct names, a and u/d/*/a respectively. The latter indicates that the global type u contains a local element d which contain an anonymous type * which contains a local element a. Each attribute or element contains at most one anonymous type, so using the name * for such types leads to no confusion.

Model groups

MSL uses a mathematical notation that is more compact than XML, and more amenable to formal use. Mathematical notation is used for model groups, components, and documents.

MSL uses standard regular expression notation [2] for model groups. In what follows, g stands for a model group (as does g1, g2, and so on). The constructors for model groups include the following.

e        empty sequence
 
Æ        empty choice
 
g1,g2     sequence, g1 followed by g2
 
g1 | g2     choice, g1 or g2
 
g1 & g2   all, g1 and g2 in either order
 
g{m,n} repetition, g repeated between minimum m and maximum n times
    (m is a natural number, n is a natural number or ¥)
mixed(g)    mixed content in group g
a[g]    attribute, with name a and content in group g
e[g]    element, with name e and content in group g
w            wildcard, with name in wildcard w
p            atomic datatype (such as xsi:string or xsi:integer)
t              complex type name
x            component name 

Here is an example of a group in MSL notation, which corresponds to an a element with content of type u in our running example.

a[
    (t/@b[xsi:string] t/@c[xsi:integer{0,¥}]{0,1}),
    (u/d[u/d/*/a[xsi:string]{1,¥}] | u/d/*/e[xsi:string])
] 

Note that the group constructors are used uniformly in several contexts. Repetition (g{m, n}) is used for lists of atomic datatypes, to indicate whether an attribute is optional, and for elements. Similarly, interleaving (g1 & g2) is used for attributes and for elements.

Components

Next, we show how components are represented in MSL notation. Each component is one of seven sorts: element, attribute, simple type, complex type, attribute group, or model group. Regardless of sort, each component is represented uniformly as a record with seven fields:


sort
is the sort of the component (s);
name
is the name of the component (x);
base
is the name of the base component of the structure (x);
derivation
specifies how the component is derived from the base (der);
refinement
is the set of permitted derivations from this component as base (ders);
abstract
is a boolean, representing whether this type is abstract (b);
content
is the content of the structure, a model group (g).

Here are the components of the normalized schema represented in MSL notation.

component(
  sort = element,
  name = a,
  base = xsi:UrElement,
  derivation = restriction,
  refinement = {restriction,extension},
  abstract = false,
  content = a[t]
) 
component(
  sort = simpleType,
  name = s,
  base = xsi:UrSimpleType,
  derivation = restriction,
  refinement = {restriction},
  abstract = false,
  content = xsi:integer{0,¥}
) 
component(
  sort = complexType,
  name = t,
  base = xsi:UrType,
  derivation = restriction,
  refinement = {restriction,extension},
  abstract = false,
  content = t/@b & t/@c
) 
component(
  sort = attribute,
  name = t/@b,
  base = xsi:UrAttribute,
  derivation = restriction,
  refinement = {restriction},
  abstract = false,
  content = t/@b[xsi:string]
) 
component(
  sort = attribute,
  name = t/@c,
  base = xsi:UrAttribute,
  derivation = restriction,
  derivation = {restricition},
  abstract = false,
  content = t/@c[s]{0,1}
)
component(
  sort = complexType,
  name = u,
  base = t,
  derivation = extension,
  refinement = {restriction,extension},
  abstract = false,
  content = (t/@b t/@c),
  (u/d | u/e)
) 
component(
  sort = element,
  name = u/d,
  base = xsi:UrElement,
  derivation = restriction,
  refinement = {},
  abstract = false,
  content = u/d[u/d/*]
) 
component(
  sort = complexType,
  name = u/d/*,
  base = xsi:UrType,
  derivation = restriction,
  refinement = {},
  abstract = false,
  content = u/d/*{0,¥}
) 
component(
  sort = element,
  name = u/d/*/a,
  base = xsi:UrElement,
  derivation = restriction,
  refinement = {},
  abstract = false,
  content = u/d/*/a[xsi:string]
) 
component(
  sort = element,
  name = u/e,
  base = xsi:UrElement,
  derivation = restriction,
  refinement = {},
  abstract = false,
  content = u/e[xsi:string]
) 

Observe that if we start with the content of the top-level a node, replace the name t with its refinement u, and then expand out all names (that is, replace the names t/@b and t/@c with the contents of those attributes, and so on), then the result is the same as the model group given in the previous subsection.

Note that MSL does not nest declarations to express their scope. Instead, the scope of a declaration is reflected in its normalized name.

Documents

MSL also provides a compact notation for XML documents, both before and after normalization. Here is the original document in MSL notation.

a[
  @xsi:type["u"],
  @b["zero"],
  @c[1,2],
  d[
    a["three"],
    a["four"]
    ]
  ] 

Note that attributes and elements are represented uniformly, as are sequences of attributes, sequences of elements, and lists of atomic datatypes.

Here is the normalized document in MSL notation.

a[
  @xsi:type["u"],
  t/@b["zero"],
  t/@c[1,2],
  u/d[
    u/d/*/a["three"],
    u/d/*/a["four"]
    ]
  ] 

Finally, here is the normalized document in MSL notation with type information added.

a[
  u ' t/@b[xsi:string ' "zero"],
  t/@c[s ' 1,2],
  u/d[
    u/d/* ' u/d/*/a[xsi:string ' "three"],
    u/d/*/a[xsi:string ' "four"]
    ]
  ] 

Unlike the xsi:type convention, the MSL notation allows one to uniformly express information about element and attribute types. The type of an element or attribute is indicated by writing x[t ' d] where x is an attribute or element name, t is a type name, and d is the content of the attribute or element.

3. Structures

This section defines the structures used in MSL: names, wildcards, atomic datatypes, model groups, components, and documents.

Names

A namespace is a URI, and a local name is an NCName, as in the Namespace recommendation. We let i range over namespaces and j range over local names.

    
    
namespace
    i
::=
URI  reference |    e
local name
    j
:=
NCName

A symbol space is one of the six symbol spaces in XML Schema. We let ss range over symbol spaces.

    
    
symbol space
    ss
::=
element    
   
|
attribute    
   
|
type    
   
|
attributeGroup    
   
|
modelGroup    
   
|
notation

(We make no further use of attributeGroup, modelGroup, and notation in this document.)

A symbol name consists of a symbol space paired with a local name or with * (the latter names an anonymous component). A name consists of a URI followed by a sequence of one or more symbol names. We let sn range over symbol names, and x range over names.

    
    
symbol name
    sn
::=
ss::j
    symbol space ss, local name j
   
|
ss::*
    symbol space ss, anonymous name
name
    x
::=
i#sn 1/¼/ snl
    namespace i, symbol names sn1,¼,sn l

It is sometimes convenient to use a short form of names. For these, we make the simplifying assumption that there are only two symbol spaces. Symbol names in the attribute symbol space are written @j, all other symbol names are written j (or * for anonymous names). The URI may be dropped when it is obvious from context.

Example names are shown above. The scope length of the second to last name, with short form u/d/*/a, is 4.

Before normalization, all names in a document have scope length equal to one. It is helpful to define functions to extract the namespace from a name, and to extract the symbol space and local name of the last symbol name. If x = i#sn1/.../snl and snl = ss::j then we define namespace(x) = i, symbol(x) = ss, and local(x) = j.

We also introduce several subclasses of names. An attribute name is the name of an attribute component, and similarly for element, simple type, and complex type names. We let a, e, s, k range over attribute, element, simple type, and complex type names.

    
    
attribute name
    a
::=
x    
element name
    e
::=
x    
simple type name
    s
::=
x    
complex type name
    k
::=
x

The class of a name must be consistent with its symbol space.

symbol(a)
=
attribute
symbol(e)
=
element
symbol(t)
=
type

Wildcards

A wildcard denotes a set of element names. A wildcard item is of the form *:* denoting any name in any namespace, or i:* denoting any name in namespace i. A wildcard consists of wildcard items, union of wildcards, or difference of wildcards. We let v range over wildcard items, and w range over wildcards.

    
    
wildcard item
    v
::=
*:*
    any namespace, any local name
   
|
i:*
    namespace i, any local name
   
|
i:j
    namespace i, local name j
wildcard expression
    u
::=
v
    all names in v
   
|
u1+ u2
    all names in u1 or in u2
   
|
u1! u2
    all names in u1 and not in u2
wildcard
    w
::=
u
    a wildcard without processing
   
|
u%strict
    a wildcard with strict processing
   
|
u%lax
    a wildcard with lax processing
   
|
u%skip
    a wildcard with skip processing

A wildcard union w1+w2 stands for any name in w1 or w2. A wildcard difference w1!w2 stands for any name in w1 but not in w2. For example, the wildcard *:*!(baz:*+xsi:string) denotes any name in any namespace, except for names in namespace baz, and local name string in namespace xsi.

Atomic datatypes

MSL takes as given the atomic datatypes from XML Schema Part 2 [3]. We let p range over atomic datatypes, and c range over constants of such datatypes.

Typically, an atomic datatype is either a primitive datatype, or is derived from another atomic datatype by specifying a set of facets. Lists of atomic datatypes are specified using repetition, while unions are specified using alternation, as defined below.

An example of an atomic datatype is xsi:string, and a constant of that datatype is "zero".

Model groups

We use traditional regular expression notation for model groups. Let g range over model groups.

    
    
group
    g
::=
e
    empty sequence
   
|
Æ
    empty choice
   
|
g1 , g2
    sequence, g1 followed by g2
   
|
g1 | g2
    choice, g1 or g2
   
|
g1 & g2
    all, g1 and g2 in either order
   
|
g{m,n}
    repetition of g between m and n times
   
|
a[g]
    attribute with name a and content in g
   
|
e[g]
    element with name e and content in g
   
|
mixed(g )
    mixed content in group g
   
|
w
    wildcard with name in w
   
|
p
    atomic datatype
   
|
x
    component name
minimum
    m
    natural number
maximum
    n
::=
m
    natural number
   
|
¥
    unbounded

An example of a group appears in Section .

The empty sequence matches only the empty document; it is an identity for sequence and interleaving. The empty choice matches no document; it is an identity for choice.

e,g
=
g
=
g ,e
Æ|g
=
g
=
g |e
e&g
=
g
=
g &e

For use with repetitions, we extend arithmetic to include ¥ in the obvious way. For any n we have n + ¥ = ¥+ n = ¥ and n £ ¥ is always true, and ;¥ < n is always false.

Components

A sort is one of the six sorts of component in XML Schema. We let srt range over sorts.

    
    
sort
    srt
::=
attribute    
   
|
element    
   
|
simpleType    
   
|
complexType    
   
|
attributeGroup    
   
|
modelGroup    
   
|
identityConstraint    

(We make no further use of attributeGroup or modelGroup in this document.)

A derivation specifies how a component is derived from its base. We let der range over derivations, and ders range over sets of derivations.

    
    
derivation
    der
::=
extension    
   
|
refinement    
derivation set
    ders
::=
{der1,¼,der l}

We let b range over booleans.

    
    
boolean
    b
::=
true    
   
|
false

A component is a record with seven fields.


sort
is the sort of the component (srt);
name
is the name of the component (x);
base
is the name of the base component of the structure (x);
derivation
specifies how the component is derived from the base (der);
refinement
is the set of permitted derivations from this component as base (ders);
abstract
is a boolean, representing whether this type is abstract (b);
content
is the content of the structure, a model group (g).

We let cmp range over components.

    
    
component
    cmp
::=
component(    
   
    sort = srt    
   
    name = x    
   
    base = x    
   
    derivation = der    
   
    refinement = ders    
   
    abstract = b    
   
    content = g    
   
)

Examples of components appear in Section

deref(a).sort
=
attribute
deref(e).sort
=
element
deref(s).sort
=
simpleType
deref(k).sort
=
complexType

Constraints

Recall that the group constructs are used uniformly in several contexts. Repetition (g{m, n}) is used for lists of atomic datatypes, to indicate whether an attribute is optional, and for elements. Similarly, interleaving (g1 & g2) is used for attributes and for elements. The advantage of this approach is that the semantics of groups is uniform, and need be given only once. Thus, for instance, how repetition acts is defined once, not separately for attributes and elements.

However, it is helpful to define additional syntactic categories that specify various subsets of groups. These syntactic classes are then used to constrain the content of components, for instance, to indicate that the content of an element component should be an element, and that the content of a type component should consist of attributes followed by elements.

An attribute group contains only attribute names, combined using interleaving. An element group contains no attribute names.

    
    
attribute group
    ag
::=
e
    empty sequence
   
|
ag1 & ag2
    all
   
|
a
    attribute name
element group
    eg
::=
e
    empty sequence
   
|
Æ
    empty choice
   
|
eg1 ,eg2
    sequence
   
|
eg1 |eg2
    choice
   
|
eg1 &eg2
    all
   
|
eg{m,n}
    repetition
   
|
w
    wildcard
   
|
e
    element name
   
|
t
    type name

(Interleaving in element groups might be further constrained, as in Schema Part 1, Section 5.7, All Group Limited.)

Given the above, we can specify the allowed contents of the four sort of component as follows. An attribute content is either an attribute or an optional attribute, where the content is a simple type name. An element content is an element where the content is a type name. A simple type content is an atomic datatype or a list of atomic datatype. A complex type content is an attribute group followed by either a simple type content or an element group.

    
    
attribute content
    ac
::=
a[s]
    attribute
   
|
a[s]{ 0,1}
    optional attribute
element content
    ec
::=
e[t]
    element
union content
    uc
::=
p
    simple type
   
|
uc | uc
    union of simple types
simple content
    sc
::=
uc
    union content
   
|
uc{m,n}
    list of a union content
complex content
    kc
::=
ag ,sc
    attributes follwed by simple content
   
|
ag ,eg
    attributes followed by elements
   
|
ag,mixed( eg)
    mixed content

The content of each sort of component corresponds to the syntax above. That is, the content of an attribute component is always an attribute content ac; the content of an element component is always an element content ec; the content of a simple type component is always a simple content sc; and the content of a complex type component is always a complex content kc. Further, for an attribute or element component the name of the component is the same as the name of the attribute or element in its content.

It is easy to confirm that the example components in Section Documents

A document is a sequence of items, where each item is an atomic datatype, or an attribute (with a document as content), or an element (with a document as content). We let d range over documents.

    
    
document
    d
::=
e
    empty document
   
|
d1,d2
    sequence
   
|
c
    constant of atomic datatype
   
|
a[d]
    attribute, with name a and content d
   
|
e[d]
    element, with name e and content d

Examples of documents appear in Section .

A typed document is a document with added type information for each element and attribute. We let td range over typed documents.

    
    
typed document
    td
::=
e
    empty document
   
|
td1,td2
    sequence
   
|
c
    constant of atomic datatype
   
|
a[s ' td]
    attribute, with simple type name s
   
|
e[t ' td]
    element, with type name t

Examples of typed documents appear in Section . Typed documents are the result of normalization, Section .

An document item is either a constant of atomic datatype, or an attribute, or an element.

    
    
document item
    di
::=
   
   
|
e
   constant
   
|
a[c]
    attribute
   
|
e[d]
    element

Document items are used to define interleaving, Section .

Inference rules

Operations such as normalization and validation are described with an inference rule notation. Originally developed by logicians [7,13], this notation is now widely used for describing type systems and semantics of programming languages [10]. In this notation, when all judgements above the line hold, then the judgement below the line holds as well. Here is an example of a rule used later on. We write d Î g if document d matches group g.

d1 Î g1                   d2 Î g2


d1 ,d2 Î g1, g2

The rule says that if both d1 Î g1 and d2 Î g2 hold, then d1 , d2 Î g1 , g2 holds as well. For instance, take d1 = a[1], d2 = b["two"], g1 = a[xsi:integer], and g2 = b[xsi:string]. Then since both a[1] Î a[xsi:integer] and b["two"] Î b[xsi:string] hold, we may conclude that a[1] , b["two"] Î a[xsi:integer] , b[xsi:string] holds.

4. Normalization

Normalization of a document replaces each name by the corresponding normalized name, and adds type information to the document. Normalization is performed with respect to a given schema; in our formalism the schema is determined by the dereferencing map, deref(). Section gives an example of a document before and after normalization.

Prior to normalization, all names in the document have exactly one symbol name. To build normalized names, we use an operation that extends a name with an additional symbol name. Let x1 and x2 be two names, and where the second name has only one symbol name. We write x1 |> x2 for the operation that extends x1 by adding the symbol name of x2 (if they are in the same namespace) or is the name x2 (if they are in different namespaces).

i#sn 1/¼/ snl |> i#sn
=
i#sn 1/¼/ snl /sn

For each sort there is a root type (AnyType, AnyElement, etc), for which we define

|> AnyElement
=
AnyElement.

In effect, the root types are in a fixed namespace, so the rule for roots is subsumed by the rule for different namespaces.

We write x |- a Þ a' or x |- e Þ e' to indicate that in the context specified by name x that attribute name a normalizes to a' or that element name e normalizes to e'. To normalize an attribute or element name we extend the context by the name (if extension yields the normalized name of some component; these are the `extend' rules), or use the element name directly (otherwise; these are the `reset' rules). We write x Î dom(deref()) and x Ï deref() to indicate whether or not x is in the domain of the dereferencing map; that is, whether or not x names some component.

|> a Î dom(deref())


|-Þ|> a

|> e Î dom(deref())


x |-Þ x   |> e

|> a Ï dom(deref())


x |-Þ a

|> a Ï dom(deref())


x |-Þ e

We write x |- d Þ dt to indicate that in the context specified by name x that document d normalizes to typed document td. We write @xsi:type Ï d if d does not contain an xsi:type attribute. Note that the type names in xsi:type attributes always refer to global types and need not be normalized.



x |-Þ c'


x |- e  Þ e

x |- d1 Þ td1      y |- d2 Þ td2


x |- d1,d1 Þ td1 , td2

x |- a Þ a'
deref(a').content = a'[s] or deref(a').content = a'[s]{0,1}
s
|- d Þ td


x |- a[d]  Þ a'[s ' td]

x |- e Þ e'
@xsi:type Ï d
deref(e').contnet = e'[t]
t |- d Þ td 

x |- e[d]  Þ e'[t ' td]

x |- e Þ e'
deref(e').content = e'[t]
t <:ext t'
t
|- d Þ td


x |- e[@xsi:type[t], d]  Þ e'[t ' td]

The typed element rule uses the relation x <: x', which is defined Section 5. Refinement

Base chain

We write x <: x' if starting from the component with name x one can reach the component with name x' by successively following base links.


x <:ext x
x <:ext x'    x' <:ext x''


x <:ext x''
deref(x).base = x'


x <:ext x'

Extension

We write g <:ext g' if group g is derived from group g' by adding attributes and elements. It is specified using attribute groups ag and element groups eg as defined in Section .



ag , eg < :ext (ag & ag') , eg , eg'

Restriction

We write g <:res g' if the instances of group g are a subset of the instance of group g'. That is, g <:res g' if for every document d such that d Î g it is also the case that d Î g'.

Constraints

A schema must satisfy certain constraints on refinement to be well-formed. Define x <:der x' to be x <:res x' if der = restriction and x <:ext x' if der = extension. We write |- x to indicate that the component with name x is well-formed with respect to refinement.

x' = deref(x).base
der = deref(x).derivation
der = deref(x').refinement
deref(x).content
< :der deref(x').content


|- x

6. Validation

We write c Îp p if constant c matches atomic datatype p. We do not specify this relation further, but simply assume it is as defined in Schema Part~2.

We write e Îv v and e Îw w if element name e matches wildcard item v or wildcard w. We write e Ïw w if it is not the case that e Îw w.



e Îw*:*

namespace(x) = i


e Îw i:*

namespace(e = i
local(e) = j


e Îw i:j

e ÎW v


e ÎW v

e ÎW w1


e ÎW w1 + w2

e ÎW w2


e ÎW w1 + w2

e Î W w1    e ÏW w2


e ÎW w1 ! w2

We write d interd' ; d'' (read ``d interleaves d' and d'''') if some interleaving of d' and d'' yields d. This relates one sequence to many pairs of sequences. For example,

a[1],a[2],a[3] inter a[1],a[2];a[3]
a[1],a[2],a[3] inter a[2];a[1],a[3]
a[1],a[2],a[3] inter e;a[1],a[2],a[3]

among other possibilities. This is used in the rule for interleaving, g1 & g2.


e  inter e;e

d1  interd'1;d"1      d2  inter d'2;d"2


d1 , d2  inter d'1 , d'2 ;d"1,d"2



inter d;e



inter e;d

We write d unmix d' (read ``d unmixes to d''') if d is a sequence of elements and character data and d' is the same sequence with all character data removed. For example, a[1],"x",a[2] unmix a[1],a[2] This is used in the rule for mixed content, mixed(g).



e  unmix e

d1  unmixd'1      d2  unmix d'2


d1 , d2  unmix d'1 , d'2



e[d]  unmix e[d']



unmix e

We write d Î g if document d matches model group g.



e  Î e

d1  Î g1     d2  Î g2


d1 , d2  Î g1 , g2

Î g1


Î g1 | g2

Î g2


Î g1 | g2

d |®inter d'1;d"2    d1 Î g1    d2 Î g2


d Î g1 & g2

d1 Î g    d2 Î g{m,n}


d1 ,d2  Î g{m +1,n+1}

d1 Î g    d2 Î g{0,n}


d1 ,d2  Î g{0,n+1}



e  Î g{0,n}

Î g


a[d]  Î a[g]

Î g


e[d]  Î e[g]

unmix d'  d'  Î g


Î mixed(g}

Î s


e[s ' d]  Î e[s]

Î t    t <: t'


e[t ' d]  Î e[t']

Î e    e ÎW w


Î w

c Îp p


Î p

Î e    e <: e'


Î e'

Î g    g = deref(x).content


Î x

The typed element and element refinement rules use the relation x <: x', which is defined Section . The check that t <: t' in the typed element rule is redundant, as it is also performed during normalization.

When processing a normalized document with types, the attribute and element rules are not required, the typed attribute and typed element are used instead.

A. Problems with XML Schema

B. Suggestions for XML Schema

C. Bibliography

[1] Serge Abiteboul and Peter Buneman and Dan Suciu, Data on the Web : From Relations to Semistructured Data and XML Morgan Kaufmann, 1999

[2] Alfred V. Aho, "Algorithms for Finding Patterns in Strings", Handbook of Theoretical Computer Science, Jan van Leeuwen, MIT Press Elsevier, Cambridge, Massachusetts, 1990

[3] Paul V. Biron and Ashok Malhotra, XML Schema Part 2: Datatypes, , World Wide Web Consortium, 2000

[4] Tim Bray and Jean Paoli and C. M. Sperberg-McQueen, Extensible Markup Language (XML) 1.0, , World Wide Web Consortium, 1998

[5] James Clark and Steve DeRose, XML path language (XPath) version 1.0, , World Wide Web Consortium, 1999

[6] David C. Fallside, XML Schema Part 0: Primer , World Wide Web Consortium, 2000

[7] Gottlob Frege, "Begriffsschrift, a formula language, modeled upon that of arithmetic, for pure thought (1879)", From Frege to Godel: A Source Book in Mathematical Logic, 1879-1931, ed. Jan van Heijenoort, , Harvard University Press, Cambridge, Massachusetts, 1967

[8] Haruo Hosoya and Jerome Vouillon and Benjamin C. Pierce, "Regular Expression Types for XML", Proceedings of the International Conference on Functional Programming (ICFP), 2000

[9] Tova Milo and Dan Suciu, "Type Inference for Queries on Semistructured Data", Proceedings of the ACM Symposium on Principles of Database Systems, 1999

[10] John C. Mitchell, Foundations for Programming Languages, , MIT Press, 1996, Cambridge, Massachusetts

[11] Henry S. Thompson and David Beech and Murray Maloney and Noah Mendelsohn, XML Schema Part 1: Structures, , World Wide Web Consortium, 2000

[12] Philip Wadler, "A formal semantics of patterns in XSLT", Proceedings of Markup Technologies, ed. B. Tommie Usdin and Deborah A. Lapeyre and C. M. Sperberg-McQueen, Philadelphia, 1999

[13] Philip Wadler, "New Language, Old Logic", Dr Dobbs Journal, December, 2000

[14] Brown, Jr., Allen L., "Derivation, tolerance and validity: a formal model of type definition in XML Schemas", Proceedings of XML Europe 2000, ed. Pamela Gennusa, Paris, 2000

[15] Gaisi Takeuti, Proof Theory, Studies in Logic and the Foundations of Mathematics, vol. 81, North-Holland/American Elsevier, Amsterdam, 1975

[16] Dag Prawitz, Natural Deduction: A Proof-Theoretical Study, , Almqvist & Wiksell, Stockholm, 1965,

[17] Solomon Feferman, "Theories of Finite Type Related to Mathematical Practice", Handbook of Mathematical Logic, ed. Jon Barwise, Elsevier North Holland, Amsterdam, 1977