Analyzing XML schemas with the Schema Infoset Model

Contents:
Example: Analyzing your schemas
Loading schemas into the library
Convenient schema querying
The schema components model
Your report: Types missing max/min facets
Conclusion
Sample code
Resources
About the author

Easily perform complex queries on your schemas with this model

Level: Intermediate

Shane Curcuru (shane_curcuru@us.ibm.com)
Advisory Software Engineer, IBM
July 2002

As the use of schemas grows, so does the need for tools to manipulate them. The new Schema Infoset Model provides a complete model of schemas themselves, covering both their concrete representations and the abstract relationships within a schema or a set of schemas. This article shows some of the power of this library by querying the model of a schema for detailed information about it; the same model could also be used to update the schema to fix any problems found and to write the schema back out.

Note: This article assumes you have a basic knowledge of schema documents; there are a number of links to schema documentation and a tutorial in Resources.

Although there are a number of parsers and tools that use schemas to validate or analyze XML documents, tools that allow querying and advanced manipulation of schema documents themselves are still being built. The Schema Infoset Model (also known as org.eclipse.xsd.*, or just "the library") provides a rich API that models schemas -- both their concrete representations (perhaps in a schema.xsd file) and the abstract concepts in a schema as defined by the specification. As anyone who has read the schema specs knows, they're quite detailed, and this model strives to expose all the details within any schema. This allows you to manage your schema collection efficiently and to support higher-level schema tools, such as schema-aware parsers and transformers.

Schema Infoset Model UML diagrams

The library includes various UML diagrams for the actual library classes, which give a quick overview of the relationships and attributes of common schema components.

Abstract Schema Component relationships
This diagram shows the relations between Schema Infoset Components -- the abstract relationships between schema objects as modeled in the library. Black diamonds show strong composition or aggregation; open diamonds show weak aggregation.

Abstract Schema Component attributes
This diagram shows some of the attributes of the abstract schema components as modeled in the library, as well as part of the class hierarchy.

Schema Library class listing
This listing shows the core classes included in org.eclipse.xsd.

These diagrams, along with several other UML diagrams for both the abstract and concrete class trees, are included in the library's documentation.

For an interface listing of the library showing all the schema objects modeled, please see Schema Infoset Model UML diagrams. The library also includes the UML diagrams used in building the library interfaces themselves; these diagrams show the relationships between the library objects, which very closely mimic the concepts in the schema specifications.

Example: Analyzing your schemas
In this example, you'll check a schema for integer-derived types that fail to specify range restrictions. This could be useful for ensuring that all order quantities in purchase orders are bounded. Here, the schemas must be very specific, so you want to require that every simple type that derives from integer has both a minimum facet (minInclusive or minExclusive) and a maximum facet (maxInclusive or maxExclusive). A facet inherited from a base type counts: if a type's base type already supplies the bound, that is sufficient.

While you can use XSLT or XPath to query a schema's concrete representation in an .xsd file or inside some other .xml content, it is much more difficult to discover the type derivations and interrelationships that schema components actually have. Since the Schema Infoset Model library models both the concrete representation and the abstract concept of the schema, it can easily be used to collect details about its components, even when the schema may have deep type hierarchies or be defined in multiple schema files.

In this simple schema, you will find some types that meet the criteria of having max/min facets, and some that do not. (You can find the full schema in FindTypesMissingFacets.xsd included in the zip file.)

Listing 1. Sample schema
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.research.ibm.com/XML/NS/xsd"
xmlns="http://www.research.ibm.com/XML/NS/xsd">

<!-- SimpleType missing both max/min facets -->
<xsd:simpleType name="integer-noFacets">
    <xsd:restriction base="xsd:integer"/>
</xsd:simpleType>

<!-- Derived type has inherited min facet but missing max facet -->
<xsd:simpleType name="positiveInteger-inheritedMinFacet">
    <xsd:restriction base="xsd:positiveInteger"/>
</xsd:simpleType>

<!-- Derived type with both effective max/min facets -->
<xsd:simpleType name="positiveInteger-bothFacets">
    <xsd:restriction base="positiveInteger-inheritedMinFacet">
        <xsd:maxExclusive value="100"/>
    </xsd:restriction>
</xsd:simpleType>
<!-- etc... -->
</xsd:schema>
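
To make the limits of a purely concrete query tangible, here is a minimal sketch that uses only the JDK's DOM and XPath support, not the Schema Infoset Model. The class and method names (XPathFacetQuery, typesDeclaring) are hypothetical, and the three types from Listing 1 are inlined as a string. The query finds types that literally declare a facet in their own restriction. It has no way of knowing that xsd:positiveInteger already carries a minimum bound of 1, so it reports no type as having a min facet.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; it is not part of the Schema Infoset Model.
final class XPathFacetQuery {
    // The three types from Listing 1, inlined so the sketch is self-contained.
    static final String LISTING_1 =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='http://www.research.ibm.com/XML/NS/xsd'"
        + " xmlns='http://www.research.ibm.com/XML/NS/xsd'>"
        + "<xsd:simpleType name='integer-noFacets'>"
        + "<xsd:restriction base='xsd:integer'/></xsd:simpleType>"
        + "<xsd:simpleType name='positiveInteger-inheritedMinFacet'>"
        + "<xsd:restriction base='xsd:positiveInteger'/></xsd:simpleType>"
        + "<xsd:simpleType name='positiveInteger-bothFacets'>"
        + "<xsd:restriction base='positiveInteger-inheritedMinFacet'>"
        + "<xsd:maxExclusive value='100'/></xsd:restriction></xsd:simpleType>"
        + "</xsd:schema>";

    /** Names of simpleTypes whose own restriction contains one of the two named facets. */
    static List<String> typesDeclaring(String schemaXml, String facetA, String facetB) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(schemaXml)));
            // local-name() sidesteps namespace-prefix setup for this sketch.
            String query = "//*[local-name()='simpleType']"
                + "[*[local-name()='restriction']"
                + "/*[local-name()='" + facetA + "' or local-name()='" + facetB + "']]"
                + "/@name";
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(query, doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                names.add(hits.item(i).getNodeValue());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("Could not query schema", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("declare a max facet: "
            + typesDeclaring(LISTING_1, "maxInclusive", "maxExclusive"));
        System.out.println("declare a min facet: "
            + typesDeclaring(LISTING_1, "minInclusive", "minExclusive"));
    }
}
```

Only positiveInteger-bothFacets shows up for the max facet, and the min facet list is empty, even though two of the types effectively have a minimum inherited from xsd:positiveInteger. Closing that gap with XPath alone would mean hand-coding the built-in type hierarchy and following every base reference yourself.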

Loading schemas into the library
The library can read and write schema objects from a variety of sources. I'll show it using the org.eclipse.emf ResourceSet framework to easily load sets of schemas; you can also build and emit schemas directly from or to a DOM object that you manage yourself. The library provides a custom XSDResourceSet implementation that can intelligently and automatically load sets of schemas related by includes, imports, and redefines. The abstract relationship between related schemas is also modeled in the library.

Listing 2. Loading a schema
// String variable schemaURL is "FindTypesMissingFacets.xsd" or the URL to your schema
// Create a resource set and load the main schema file into it.
ResourceSet resourceSet = new ResourceSetImpl();
XSDResourceImpl xsdSchemaResource = (XSDResourceImpl)resourceSet.getResource(
        URI.createDeviceURI(schemaURL), true);

// getResources() returns a list of all the resources in the set: the main resource
// and those that have been included, imported, or redefined.
for (Iterator resources = resourceSet.getResources().iterator(); 
    resources.hasNext(); /* no-op */)
{
    // Return the first schema object found, which is the main schema 
    //   loaded from the provided schemaURL
    Resource resource = (Resource)resources.next();
    if (resource instanceof XSDResourceImpl)
    {
        XSDResourceImpl xsdResource = (XSDResourceImpl)resource;
        // This returns a org.eclipse.xsd.XSDSchema object
        return xsdResource.getSchema();
    }
}
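
For a sense of what loading a set of related schemas involves, the following sketch lists the schemaLocation references that a loader would have to follow. It uses only the JDK's DOM parser; the class name SchemaReferenceScan and the file names in main are hypothetical, and this is not how the library itself is implemented. A real loader would also resolve each location against the including document's URI and repeat the scan on every schema it finds.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; it is not part of the Schema Infoset Model.
final class SchemaReferenceScan {
    static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

    /** Returns "kind: location" for each include, import, and redefine that names a location. */
    static List<String> referencedLocations(String schemaXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(schemaXml)));
            List<String> locations = new ArrayList<>();
            for (String kind : new String[] {"include", "import", "redefine"}) {
                NodeList nodes = doc.getElementsByTagNameNS(XSD_NS, kind);
                for (int i = 0; i < nodes.getLength(); i++) {
                    Element reference = (Element) nodes.item(i);
                    // schemaLocation is optional on import, so check before reading it.
                    if (reference.hasAttribute("schemaLocation")) {
                        locations.add(kind + ": " + reference.getAttribute("schemaLocation"));
                    }
                }
            }
            return locations;
        } catch (Exception e) {
            throw new IllegalStateException("Could not scan schema", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical file names, used only for illustration.
        String schema =
            "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
            + "<xsd:include schemaLocation='common-types.xsd'/>"
            + "<xsd:import namespace='urn:example:other' schemaLocation='other.xsd'/>"
            + "</xsd:schema>";
        System.out.println(referencedLocations(schema));
    }
}
```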

Convenient schema querying
Now that you have an XSDSchema object, you need to query it to find any types that are missing max/min facets. First, you'll use some convenient library methods to quickly find all of its simpleTypeDefinitions that derive from the built-in integer type. Since the library provides a complete model of the abstract meaning of a schema, this turns out to be very straightforward. You can query the XSDSchema for its getTypeDefinitions() listing, and then filter for XSDSimpleTypeDefinitions that actually inherit from the base integer type.

Listing 3. Getting a list of specific types
// A handy convenience method quickly gets all 
//   typeDefinitions within the schema
List allTypes = schema.getTypeDefinitions();
ArrayList allIntegerTypes = new ArrayList();

for (Iterator iter = allTypes.iterator(); 
        iter.hasNext(); /* no-op */)
{
    XSDTypeDefinition typedef = (XSDTypeDefinition)iter.next();
    // Filter out for only simpleTypes...
    if ((typedef instanceof XSDSimpleTypeDefinition) 
        // ... and filter for types derived from the built-in integer type.
        // Use a worker method in the very handy sample 
        //  program org.eclipse.xsd.util.XSDSchemaQueryTools
        && XSDSchemaQueryTools.isTypeDerivedFrom(typedef, 
                schema.getSchemaForSchemaNamespace(), "integer"))
    {
        // The filter found one; save it and continue.
        allIntegerTypes.add(typedef);
    }
}
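
The derivation test used in the filter above can be pictured as a walk up the chain of base types. The sketch below is a plain-Java illustration of that idea, not the implementation of XSDSchemaQueryTools.isTypeDerivedFrom. It assumes a hypothetical map from each type name to its base type name, ignores namespaces, and treats a type as derived only from its strict ancestors. The built-in chain (positiveInteger, nonNegativeInteger, integer, decimal) follows XML Schema Part 2.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model; names are compared without namespaces for brevity.
final class DerivationSketch {
    /** Maps each type name to the name of its base type. */
    static Map<String, String> sampleBaseTypes() {
        Map<String, String> baseOf = new HashMap<>();
        // Part of the built-in hierarchy from XML Schema Part 2.
        baseOf.put("positiveInteger", "nonNegativeInteger");
        baseOf.put("nonNegativeInteger", "integer");
        baseOf.put("integer", "decimal");
        baseOf.put("decimal", "anySimpleType");
        baseOf.put("string", "anySimpleType");
        // The user-defined types from Listing 1.
        baseOf.put("integer-noFacets", "integer");
        baseOf.put("positiveInteger-inheritedMinFacet", "positiveInteger");
        baseOf.put("positiveInteger-bothFacets", "positiveInteger-inheritedMinFacet");
        return baseOf;
    }

    /** True if ancestor appears somewhere above type in the base-type chain. */
    static boolean isDerivedFrom(Map<String, String> baseOf, String type, String ancestor) {
        Set<String> seen = new HashSet<>(); // guards against accidental cycles in the map
        for (String current = baseOf.get(type);
                current != null && seen.add(current);
                current = baseOf.get(current)) {
            if (current.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> baseOf = sampleBaseTypes();
        for (String type : new String[] {"positiveInteger-bothFacets", "string"}) {
            System.out.println(type + " derives from integer: "
                + isDerivedFrom(baseOf, type, "integer"));
        }
    }
}
```

Because the walk follows every link in the chain, positiveInteger-bothFacets is recognized as an integer type even though its immediate base is another user-defined type.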

The schema components model
Every component defined in the W3C schema specifications is modeled in detail in the library. Now that you have a list of all XSDSimpleTypeDefinitions that derive from integer, you can query this list for ones that are missing either their max or min facets, and produce a report. Note that the library can conveniently group the effective max/minExclusive or max/minInclusive facets together for quick searching; it also provides detailed access to each type, including the actual lexical values if needed.

Listing 4. Querying XSDSimpleType components
for (Iterator iter = allIntegerTypes.iterator(); 
        iter.hasNext(); /* no-op */)
{
    XSDSimpleTypeDefinition simpleType = (XSDSimpleTypeDefinition)iter.next();
    // First, exclude any UNION or LIST types, since 
    //  the schema spec says they can't have min/max facets:
    //  Part 2: Datatypes in:
    //  '4.1.5 Constraints on Simple Type Definition Schema Components'
    if ((XSDVariety.LIST_LITERAL == simpleType.getVariety())
        || (XSDVariety.UNION_LITERAL == simpleType.getVariety()))
    {
        // Unions and lists cannot have min/max facets at all,
        //  so there's no need to report them
        continue;
    }

    // Get the effective max/min facets for each type - 
    //  this includes ones declared in this type or 
    //  ones that are inherited, and so forth
    XSDMaxFacet maxFacet = simpleType.getEffectiveMaxFacet();
    XSDMinFacet minFacet = simpleType.getEffectiveMinFacet();

    // If you don't have the proper ones, report the error.
    if ((null == maxFacet) || (null == minFacet))
    {
        if (null != simpleType.getName())
        {
            // A component's URI in the library is effectively 
            //  its <target namespace>#<name>
            System.out.println("Schema named component: " + simpleType.getURI() );
        }
        else
        {
            // It's an anonymous type, so ask the library 
            //  to construct a default 'alias' for it
            System.out.println("Schema anonymous component: " + simpleType.getAliasURI() );
        }
        System.out.print(" is missing these required facets: ");
        if (null == maxFacet)
        {
            System.out.print(" XSDMaxFacet (either inclusive or exclusive) ");
        }
        if (null == minFacet)
        {
            System.out.print(" XSDMinFacet (either inclusive or exclusive) ");
        }
        // End the report line for this type.
        System.out.println();
        // You could also report on the facets this type does have, like:
        // if ((null != minFacet) && minFacet.isExclusive()) {
        //     System.out.println("minFacet.getValue=" + minFacet.getValue());
        // }
    }
}
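
The idea of an "effective" facet, as used by getEffectiveMaxFacet() and getEffectiveMinFacet() above, can be illustrated with a toy model: the bound that applies to a type is the nearest one declared on the type itself or on any of its ancestors. The class below is a hypothetical sketch, not the library's XSDSimpleTypeDefinition. It ignores the inclusive/exclusive distinction and, as a simplification, places positiveInteger directly under integer.

```java
// Hypothetical toy model of an integer-derived simple type.
final class ToyIntegerType {
    final String name;
    final ToyIntegerType base;  // null for the root of the chain
    final Integer declaredMin;  // bound declared on this type itself, or null
    final Integer declaredMax;  // bound declared on this type itself, or null

    ToyIntegerType(String name, ToyIntegerType base, Integer declaredMin, Integer declaredMax) {
        this.name = name;
        this.base = base;
        this.declaredMin = declaredMin;
        this.declaredMax = declaredMax;
    }

    /** Nearest min bound declared on this type or an ancestor, or null if there is none. */
    Integer effectiveMin() {
        for (ToyIntegerType t = this; t != null; t = t.base) {
            if (t.declaredMin != null) {
                return t.declaredMin;
            }
        }
        return null;
    }

    /** Nearest max bound declared on this type or an ancestor, or null if there is none. */
    Integer effectiveMax() {
        for (ToyIntegerType t = this; t != null; t = t.base) {
            if (t.declaredMax != null) {
                return t.declaredMax;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ToyIntegerType integer = new ToyIntegerType("integer", null, null, null);
        // Simplification: the built-in positiveInteger carries a minimum of 1.
        ToyIntegerType positive = new ToyIntegerType("positiveInteger", integer, 1, null);
        ToyIntegerType inherited =
            new ToyIntegerType("positiveInteger-inheritedMinFacet", positive, null, null);
        ToyIntegerType both =
            new ToyIntegerType("positiveInteger-bothFacets", inherited, null, 100);
        for (ToyIntegerType t : new ToyIntegerType[] {inherited, both}) {
            System.out.println(t.name + ": min=" + t.effectiveMin() + ", max=" + t.effectiveMax());
        }
    }
}
```

In this model, a type is reported only when one of the two effective bounds comes back as null, which mirrors the null checks in Listing 4.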

Your report: Types missing max/min facets
With just a little bit of code, you've discovered some fairly detailed information about the schema. If you download the sample code and run it against the provided schema file, you should see a listing like this:

Listing 5. The output report
Schema missing max/min facet report on: FindTypesMissingFacets.xsd
Schema named component: http://www.research.ibm.com/XML/NS/xsd#integer-minFacet
  is missing these required facets:  XSDMaxFacet (either inclusive or exclusive)

Schema named component: http://www.research.ibm.com/XML/NS/xsd#integer-noFacets
  is missing these required facets:  XSDMaxFacet (either inclusive or exclusive)
  XSDMinFacet (either inclusive or exclusive)

Schema named component: http://www.research.ibm.com/XML/NS/xsd#positiveInteger-inheritedMinFacet
  is missing these required facets:  XSDMaxFacet (either inclusive or exclusive)

Conclusion
Although this is a contrived example, it shows how the library's detailed representation of a schema makes it easy to find exactly the parts of a schema you need. The library provides setter methods for the properties of schema components, so it would be easy to extend the sample to fix the reported types automatically by adding the missing facets. And since the library models the concrete representation of the schema as well, you can write your updated schema back out to an .xsd file.
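
The library's own setter calls are not shown in this article, so the following is only a library-independent sketch of the fix-up step at the concrete level, using the JDK's DOM and Transformer APIs. The class AddMaxFacetSketch and its methods are hypothetical, and the bound passed in is an arbitrary illustration; choosing a sensible maximum for each reported type is a decision for the schema's owner.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; it edits the schema document directly with DOM.
final class AddMaxFacetSketch {
    static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse schema", e);
        }
    }

    /** Appends a maxInclusive facet to the restriction of the named simpleType. */
    static String addMaxInclusive(String schemaXml, String typeName, String bound) {
        Document doc = parse(schemaXml);
        NodeList types = doc.getElementsByTagNameNS(XSD_NS, "simpleType");
        for (int i = 0; i < types.getLength(); i++) {
            Element type = (Element) types.item(i);
            NodeList restrictions = type.getElementsByTagNameNS(XSD_NS, "restriction");
            if (typeName.equals(type.getAttribute("name")) && restrictions.getLength() > 0) {
                Element restriction = (Element) restrictions.item(0);
                // Reuse the prefix the schema already uses for its own elements.
                String prefix = restriction.getPrefix();
                Element facet = doc.createElementNS(XSD_NS,
                    (prefix == null ? "" : prefix + ":") + "maxInclusive");
                facet.setAttribute("value", bound);
                restriction.appendChild(facet);
            }
        }
        try {
            Transformer writer = TransformerFactory.newInstance().newTransformer();
            writer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            writer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize schema", e);
        }
    }

    /** Counts facet elements with the given local name anywhere in the schema. */
    static int countFacets(String schemaXml, String facetName) {
        return parse(schemaXml).getElementsByTagNameNS(XSD_NS, facetName).getLength();
    }
}
```

A DOM edit like this works only on the document text; it cannot tell whether the new bound is consistent with facets inherited from base types, which is where a model of the abstract components has the advantage.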

Sample code
A sample program, XSDFindTypesMissingFacets.java, shows the example in this article. It uses a schema document FindTypesMissingFacets.xsd which has a number of types with and without max/min facets.

You can download the sample program and the following sample .java files in a zip file.

Copies of several other sample .java files normally shipped with the Schema Infoset Model are also attached. These include:

Resources

This content was adapted from an article on IBM developerWorks at http://www.ibm.com/developerWorks/.

About the author
Shane Curcuru has been a developer and quality engineer at Lotus and IBM for 12 years and is a member of the Apache Software Foundation. He has worked on such diverse projects as Lotus 1-2-3, Lotus eSuite, Apache's Xalan-J XSLT processor, and a variety of XML Schema tools. Questions about this article or about automated testing can be sent to him at shane_curcuru@us.ibm.com.