Programming Java applications using XML4J

 

How do I construct a parser?

 

How do I construct a parser in XML4J version 2?

In XML4J version 2, the DOM API is implemented on top of the SAX API. XML4J version 2 has a modular architecture and comes pre-bundled with four parser configurations, all in the com.ibm.xml.parsers package:

  • Non Validating SAX parser (com.ibm.xml.parsers.SAXParser)
  • Validating SAX parser (com.ibm.xml.parsers.ValidatingSAXParser)
  • Non Validating DOM parser (com.ibm.xml.parsers.NonValidatingDOMParser)
  • Validating DOM parser (com.ibm.xml.parsers.DOMParser)

There are two ways to instantiate the parser classes. The first is to create a string containing the fully qualified name of the parser class and pass it to the org.xml.sax.helpers.ParserFactory.makeParser() method. This approach is useful if your application needs to switch between parser configurations. The snippet below uses it to instantiate a validating DOMParser.

    import org.xml.sax.Parser;
    import org.xml.sax.helpers.ParserFactory;
    import com.ibm.xml.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.xml.sax.SAXException;
    import java.io.IOException;
    ...
    String parserClass = "com.ibm.xml.parsers.DOMParser";
    String xmlFile = "file:///xml4j2/data/personal.xml";
    Parser parser = null;
    try {
         // makeParser loads the class by name, so it is called inside the
         // try block: it declares checked exceptions of its own
         parser = ParserFactory.makeParser(parserClass);
         parser.parse(xmlFile);
    }
    catch (SAXException se) {
         se.printStackTrace();
    }
    catch (IOException ioe) {
         ioe.printStackTrace();
    }
    catch (Exception e) {
         // ClassNotFoundException, IllegalAccessException, InstantiationException
         e.printStackTrace();
    }
    // The next line is only for DOM Parsers
    Document doc = ((DOMParser) parser).getDocument();
    ...

    The second way is to instantiate the parser class directly, as in the following example, which creates a validating DOM parser. Use this approach when you know exactly which parser configuration you need and are sure you will not need to switch configurations.

      import com.ibm.xml.parsers.DOMParser;
      import org.w3c.dom.Document;
      import org.xml.sax.SAXException;
      import java.io.IOException;
       
      ...
      String xmlFile = "file:///xml4j2/data/personal.xml";
      DOMParser parser = new DOMParser();
      try {
           parser.parse(xmlFile);
      }
      catch (SAXException se) {
           se.printStackTrace();
      }
      catch (IOException ioe) {
           ioe.printStackTrace();
      }
      // The next line is only for DOM Parsers
      Document doc = parser.getDocument();
      ...

    Once you have the Document object, you can call any method on it as defined by the DOM specification.
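
As a rough illustration of working with the Document object, the sketch below walks a tree using only DOM Level 1 calls (getDocumentElement, getChildNodes, getNodeType, getNodeName). It assumes the JDK's bundled JAXP DocumentBuilder as a stand-in for obtaining the Document, so that it runs without the XML4J jar; with XML4J you would call parser.getDocument() instead. The class name DomWalk and the sample XML are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomWalk {
    /** Parses XML text with the JDK's JAXP parser (a stand-in for XML4J's DOMParser). */
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    /** Collects element names in document order using DOM Level 1 methods only. */
    public static List<String> elementNames(Node start) {
        List<String> names = new ArrayList<>();
        collect(start, names);
        return names;
    }

    private static void collect(Node node, List<String> names) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            names.add(node.getNodeName());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), names);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample document
        Document doc = parse("<personnel><person><name>Ann</name></person></personnel>");
        System.out.println(elementNames(doc.getDocumentElement()));
    }
}
```

Because the traversal touches only org.w3c.dom interfaces, the same elementNames method should work on a Document returned by either DOM parser configuration.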

How do I create a DOM parser?

 

Use either of the instantiation methods described under "How do I construct a parser?". Specify com.ibm.xml.parsers.DOMParser to get a validating parser or com.ibm.xml.parsers.NonValidatingDOMParser to get a non-validating parser.

To access the DOM tree, you can call the getDocument() method on the parser.

How do I create a SAX parser?

 

Use either of the instantiation methods described under "How do I construct a parser?". Specify com.ibm.xml.parsers.ValidatingSAXParser to get a validating parser or com.ibm.xml.parsers.SAXParser to get a non-validating parser.

Once you have the parser instance, you can use the standard SAX methods to set the various handlers provided by SAX.
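
The following is a minimal sketch of an event handler, not XML4J-specific code. So that it runs on a plain JDK, it uses the SAX 2 DefaultHandler and the JAXP SAXParserFactory, which are swapped in for the SAX 1 DocumentHandler interface and the XML4J parser classes this FAQ describes. The class name ElementLogger is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ElementLogger extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + qName);
    }

    public List<String> events() {
        return events;
    }

    /** Parses the given XML text and returns the element events in the order received. */
    public static List<String> run(String xml) {
        ElementLogger handler = new ElementLogger();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.events();
    }

    public static void main(String[] args) {
        System.out.println(run("<a><b/></a>"));
    }
}
```

With an XML4J SAX parser, the equivalent step would be to pass your handler object to the parser's setDocumentHandler() method before calling parse().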

How do I create XML4J v1 parser?

 

How do I create a parser compatible with XML4J version 1?

As an aid to developers currently using XML4J version 1, classes in the com.ibm.xml.parser and com.ibm.xml.xpointer packages are provided for backward compatibility. If you need parser functionality that is provided in version 1 and has no corresponding functionality in version 2, you can use these "TX compatibility" classes the same way you used them in version 1.

However, you cannot mix the native classes with the TX compatibility classes. Use the version 1 method for creating a parser class, for causing the parser to read its input, and for setting all options. The DOM returned by the compatibility classes will be an instance of the TX* classes from version 1.

Not all the functions available on com.ibm.xml.parser.Parser are supported or implemented:

  • Not supported:
    Calling these methods will throw java.lang.IllegalArgumentException.
    • addNoRequiredAttributeHandler
    • getReaderBufferSize
    • setErrorNoByteMark
    • setReaderBufferSize
       
  • Not implemented:
    These methods are present but should not be expected to function the same as in the old parser.
    • setProcessExternalDTD
    • setWarningNoDoctypeDecl
    • setWarningNoXMLDecl
    • setWarningRedefinedEntity
    • stop

XML4J version 1 occasionally inserted extra TX nodes in its DOM tree. Even though the compatibility classes provide a TX DOM tree, these extra nodes will not be present. If your application relies on their presence, you will need to modify your code.

Users who are moving to the new parser architecture but want to use the catalog file format supported by the old parser should use the com.ibm.xml.internal.TXCatalog class. See the question "How do I use catalogs?".

What is the difference?

 

What's the difference between the two DOM implementations?

In XML4J version 2, there are two different DOM implementations provided:

  1. The TX Compatibility DOM provides a large number of features not provided by the standard DOM API, but it is not tuned for performance.
  2. The Standard DOM provides just the standard DOM Level 1 API, and it is highly tuned for performance. 

Because XML4J version 2 is modular, you choose the DOM implementation you need for your application when you write your code. Note, however, that you cannot use both DOMs in the same parser at the same time.

A summary of DOM features is shown below:

 

Feature                                                         Standard DOM    TX Compatibility DOM
--------------------------------------------------------------  --------------  --------------------
performance                                                     best            good
revalidation of modified DOM                                    YES             YES
write validation (you can ask "what is legal to insert here")   ---             YES
XLink/XPointer                                                  ---             YES
namespace support                                               standard        expanded

What new options are available?

 

What new options are available on parsers?

  • setAllowJavaEncodingName()
    If set to true, it allows Java's names for encodings to be used as well as the names defined by the XML standard.
  • setWarningOnDuplicateAttDef()
    If set to true, it warns if there are duplicate attribute definitions.
  • setCheckNamespace()
    If set to true, it performs syntactic checking of namespaces when they are present.
  • setContinueAfterFatalError()
    If set to true, it keeps processing, even if a fatal error occurs.
  • setDocumentTypeHandler()
    It sets the XMLDocumenthandler
  • setEntityHandler()
    It sets the EntityHandler
  • setValidationHandler()
    It sets the ValidationHandler
  • setDocumentHandler()
    It sets the SAX DocumentHandler
  • setLocale()
    It sets the locale to use for messages
  • setEntityResolver()
    It sets the SAX EntityResolver
  • setDTDHandler()
    It sets the SAX DTDHandler
  • setErrorHandler()
    It sets the SAX ErrorHandler

How do I use setNodeExpansion?

 

How do I use the setNodeExpansion call on DOMParser and NonValidatingDOMParser?

The native DOM parser classes, com.ibm.xml.parsers.DOMParser and com.ibm.xml.parsers.NonValidatingDOMParser, now use a DOM implementation that takes advantage of lazy evaluation to improve performance. The setNodeExpansion call on these classes controls the use of lazy evaluation. The argument to setNodeExpansion takes one of two values: FULL and DEFERRED (the default).

If node expansion is set to FULL, then the DOM classes behave as they always have, creating all nodes in the DOM tree by the end of parsing.

If node expansion is set to DEFERRED, nodes in the DOM tree are created only when they are accessed. This means that a call to getDocument will return a DOM tree that consists only of the Document node. When your program accesses a child of Document, the children of the Document node are created. In general, all the immediate children of a Node are created when any of that Node's children is accessed. This shortens the time it takes to parse an XML file and create a DOM tree, but it increases the time it takes to access a node that has not yet been created. After nodes have been created, they are cached, so this overhead occurs only on the first access to a Node.
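
The deferred behavior described here can be pictured with a small conceptual sketch. It is not XML4J's implementation. LazyNode and its methods are hypothetical names, and the sketch only illustrates the idea that all immediate children are built on first access and then cached.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

public class LazyNode {
    private final String name;
    private final Supplier<List<LazyNode>> childFactory;
    private List<LazyNode> children;   // stays null until a child is first requested
    private int expansionCount = 0;    // how many times the children were built

    public LazyNode(String name, Supplier<List<LazyNode>> childFactory) {
        this.name = name;
        this.childFactory = childFactory;
    }

    /** Convenience constructor for a leaf node with no children. */
    public LazyNode(String name) {
        this(name, () -> Collections.emptyList());
    }

    public String name() {
        return name;
    }

    public boolean isExpanded() {
        return children != null;
    }

    public int expansionCount() {
        return expansionCount;
    }

    /** Accessing any child builds ALL immediate children once; later calls use the cache. */
    public LazyNode child(int index) {
        if (children == null) {
            children = childFactory.get();
            expansionCount++;
        }
        return children.get(index);
    }
}
```

In this model the first call to child() pays the construction cost for every sibling, and every later call is a plain list lookup, which mirrors the trade-off between parse time and first-access time.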

How do I use namespaces?

 

In XML4J version 2, the easiest way to get namespace support is to use the TX compatibility classes, which provide additional, non-standard APIs for working with namespace information. The standard DOM and SAX packages have no standard APIs for namespace manipulation.

When using the Standard DOM API, element names containing colons (":") are treated as normal element names.

NOTE: The namespace specification does not currently define how validation behaves in the presence of namespaces. The behavior of all validating parsers, not just XML4J, is therefore undefined when namespaces are in use.

If you want to use "namespace-like" element names (e.g. a:foo) with validation, create a new DTD that contains the fully qualified names from all the DTDs in use. Because the colon is treated as a normal element name character, this merged DTD lets you validate documents that use these "namespace-like" names.
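
A merged DTD of this kind might look like the one embedded in the sketch below, where a:order and b:item are declared literally, colon included. The sketch assumes the JDK's JAXP parser with namespace processing switched off as a stand-in for an XML4J validating parser. The class name, the element names, and the DTD are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class ColonNameValidation {
    // Hypothetical merged DTD: the prefixed names are ordinary element names.
    static final String DTD =
          "<!DOCTYPE a:order [\n"
        + "  <!ELEMENT a:order (b:item+)>\n"
        + "  <!ELEMENT b:item (#PCDATA)>\n"
        + "]>\n";

    /** Returns the number of validity errors reported for the given document body. */
    public static int validityErrors(String body) {
        final int[] errors = {0};
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(false); // treat ':' as a plain name character
            factory.setValidating(true);      // validate against the DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) { errors[0]++; }
                @Override public void fatalError(SAXParseException e) throws SAXParseException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
        } catch (Exception e) {
            throw new IllegalStateException("document is not well-formed", e);
        }
        return errors[0];
    }

    public static void main(String[] args) {
        System.out.println(validityErrors("<a:order><b:item>pen</b:item></a:order>"));
    }
}
```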

How do I use catalogs?

 

XML4J Version 2 supports two catalog file formats: the SGML Open catalog that was supported in version 1, and the proposed XCatalog specification.

To use the original catalog file format, set a TXCatalog instance as the parser's EntityResolver. For example:

    XMLParser parser = new DOMParser();
    Catalog catalog = new TXCatalog(parser.getParserState());
    parser.getEntityHandler().setEntityResolver(catalog);

Once the catalog is installed, catalog files that conform to the TXCatalog format can be appended to the catalog by calling the loadCatalog method on the parser or the catalog instance. The following example loads the contents of two catalog files:

    parser.loadCatalog(new InputSource("catalogs/cat1.xml"));
    parser.loadCatalog(new InputSource("http://host/catalogs/cat2.xml"));

To use the XCatalog catalog, you must first have a catalog in XCatalog format. The current version of the XCatalog catalog supports the XCatalog proposal draft 0.2 posted to the xml-dev mailing list by John Cowan. XCatalog is an XML representation of the SGML Open TR9401:1997 catalog format. The current proposal supports public identifier maps, system identifier aliases, and public identifier prefix delegates. Refer to the XCatalog DTD for the full specification of this catalog format at http://www.ccil.org/~cowan/XML/XCatalog.html.

In order to use XCatalogs, you must write the catalog files with the following restrictions:

  • Use the XCatalog grammar.
  • Specify the <!DOCTYPE> line with the PUBLIC specified as "-//DTD XCatalog//EN" or make sure that the system identifier is able to locate the XCatalog 0.2 DTD. The XCatalog 0.2 DTD is included in the Jar file containing the com.ibm.xml.internal.XCatalog class. For example:

    <!DOCTYPE
       XCatalog
       PUBLIC "-//DTD XCatalog//EN"
       "com/ibm/xml/internal/xcatalog.dtd">
     
  • The enclosing document root element is not optional -- it must be specified.
  • The Version attribute of the XCatalog element has been modified from '#FIXED "1.0"' to '(0.1|0.2) "0.2"'.

To use this catalog in a parser, set an XCatalog instance as the parser's EntityResolver. For example:

    XMLParser parser = new SAXParser();
    Catalog catalog = new XCatalog(parser.getParserState());
    parser.getEntityHandler().setEntityResolver(catalog);

Once installed, catalog files that conform to the XCatalog grammar can be appended to the catalog by calling the loadCatalog method on the parser or the catalog instance. The following example loads the contents of two catalog files:

    parser.loadCatalog(new InputSource("catalogs/cat1.xml"));
    parser.loadCatalog(new InputSource("http://host/catalogs/cat2.xml"));

Limitations: The following are the current limitations of this XCatalog implementation:

  • No error checking is done to avoid circular Delegate or Extend references.
  • Do not specify a combination of catalog files that reference each other.
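
To make the idea of a public identifier map concrete, here is a minimal sketch of an EntityResolver that redirects known public identifiers to replacement system identifiers. It is not the TXCatalog or XCatalog implementation and does not read either catalog file format. MapCatalog, mapPublic, and the identifiers used are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class MapCatalog implements EntityResolver {
    private final Map<String, String> publicMap = new HashMap<>();

    /** Registers a replacement system identifier for a public identifier. */
    public void mapPublic(String publicId, String systemId) {
        publicMap.put(publicId, systemId);
    }

    /**
     * Returns an InputSource pointing at the mapped location, or null so that
     * the parser falls back to the system identifier given in the document.
     */
    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String mapped = (publicId == null) ? null : publicMap.get(publicId);
        if (mapped == null) {
            return null;
        }
        InputSource source = new InputSource(mapped);
        source.setPublicId(publicId);
        return source;
    }

    public static void main(String[] args) {
        MapCatalog catalog = new MapCatalog();
        // Hypothetical identifiers, for illustration only
        catalog.mapPublic("-//Example//DTD Personal//EN", "file:///dtds/personal.dtd");
        InputSource hit = catalog.resolveEntity("-//Example//DTD Personal//EN",
                                                "http://host/personal.dtd");
        System.out.println(hit.getSystemId());
    }
}
```

Any object implementing org.xml.sax.EntityResolver can be installed on a SAX parser through setEntityResolver(), which is the same hook the catalog classes use.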

How do I use the revalidation API?

 

In XML4J version 2, you can validate a document after it has been parsed and converted to a DOM tree. To do this, use the RevalidatingDOMParser or the TXRevalidatingDOMParser class. The validate method on these classes takes a DOM node as an argument and performs a validity check on the DOM tree rooted at that node, using the DTD of the current document. Currently, the native DOM prevents the insertion of invalid nodes, so this feature is less useful for the native DOM.

This is an experimental feature, and the details of its operation will change in future releases of XML4J version 2. We are including it in order to hear your feedback on the functionality of these APIs.

The sample program below parses a document, inserts an illegal node into the TX DOM and then tries to re-validate the document.

    import java.io.IOException;
    import com.ibm.xml.parser.TXElement;
    import com.ibm.xml.parsers.TXRevalidatingDOMParser;
    import org.xml.sax.SAXException;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public class RevalidateSample {
         public static void main(String args[]) {
              String xmlFile = "file:///Work/xml4j2/data/personal.xml";
              TXRevalidatingDOMParser parser = new TXRevalidatingDOMParser();
              try {
                   parser.parse(xmlFile);
              }
              catch (SAXException se) {
                   System.out.println("SAX error: caught "+se.getMessage());
                   se.printStackTrace();
              }
              catch (IOException ioe) {
                   System.out.println("I/O Error: caught "+ioe);
                   ioe.printStackTrace();
              }
              Document doc = parser.getDocument();
              System.out.println("Doing initial validation");
              // A null result is reported as valid; otherwise the returned node is printed
              Node pos = parser.validate(doc.getDocumentElement());
              if (pos == null) {
                   System.out.println("ok.");
              }
              else {
                   System.out.println("Invalid at " + pos);
                   System.out.println(pos.getNodeName());
              }
              // Now insert dirty data before the second child of the root element
              Node junk = new TXElement("bar");
              Node corrupt = doc.getDocumentElement();
              System.out.println("Corrupting: "+corrupt.getNodeName());
              corrupt.insertBefore(junk, corrupt.getFirstChild().getNextSibling());
              System.out.println("Doing post-corruption validation");
              pos = parser.validate(doc.getDocumentElement());
              if (pos == null) {
                   System.out.println("ok.");
              }
              else {
                   System.out.println("Invalid at " + pos);
                   System.out.println(pos.getNodeName());
              }
         }
    }

How do I handle errors?

 

When you create a parser instance, the default error handler does nothing. This means that your program will fail silently when it encounters an error. You should register an error handler with the parser by supplying a class which implements the org.xml.sax.ErrorHandler interface. This is true regardless of whether your parser is a DOM based or SAX based parser.
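
A minimal sketch of such a handler follows. It records every warning, error, and fatal error instead of letting them pass silently. The org.xml.sax.ErrorHandler interface is the standard one. The JAXP DocumentBuilder is assumed as a stand-in parser so that the example runs on a plain JDK, and the class name CollectingErrorHandler is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class CollectingErrorHandler implements ErrorHandler {
    private final List<String> messages = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        record("warning", e);
    }

    @Override
    public void error(SAXParseException e) {
        record("error", e);
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        record("fatal", e);
        throw e; // stop parsing: the document is not well-formed
    }

    private void record(String level, SAXParseException e) {
        messages.add(level + " at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    public List<String> messages() {
        return messages;
    }

    /** Parses the XML text and returns every problem the parser reported. */
    public static List<String> check(String xml) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(handler);
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Already recorded by fatalError above
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return handler.messages();
    }

    public static void main(String[] args) {
        System.out.println(check("<a><b></a>"));
    }
}
```

With an XML4J parser, the handler would be registered through the parser's setErrorHandler() method in the same way.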

How does entity expansion work?

 

How does entity expansion work in XML4J version 2?

If you are using the TX Compatibility classes, you can already control entity expansion. See the API docs for details.

If you are using the native 2.0 DOM classes, the function setExpandEntityReferences controls how entities appear in the DOM tree. When setExpandEntityReferences is set to false (the default), an occurrence of an entity reference in the XML document will be represented by a subtree with an EntityReference node at the root whose children represent the entity expansion.

Unlike the TX compatibility classes and XML4J version 1.1.x, the entity expansion will be a DOM tree representing the structure of the entity expansion, not a text node containing the entity expansion as text.

If setExpandEntityReferences is true, an entity reference in the XML document is represented by only the nodes that represent the entity expansion. Again, unlike the TX compatibility classes and XML4J version 1.1.x, the entity expansion will be a DOM tree representing the structure of the entity expansion, not a text node containing the entity expansion as text.
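
The effect of the flag can be observed with a sketch like the one below. It assumes the JAXP DocumentBuilderFactory from the JDK, which happens to offer a method of the same name, as a stand-in for the XML4J DOM parsers. Note that the JAXP default is true, whereas the XML4J default described here is false, and that the content placed under an EntityReference node may differ between parser implementations. The class name and sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class EntityRefDemo {
    // Hypothetical document with one internal entity
    static final String XML =
        "<!DOCTYPE doc [<!ENTITY greet \"hello\">]><doc>&greet; world</doc>";

    /** Counts EntityReference nodes directly under the root element. */
    public static int entityReferenceNodes(boolean expand) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setExpandEntityReferences(expand);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            int count = 0;
            for (int i = 0; i < children.getLength(); i++) {
                if (children.item(i).getNodeType() == Node.ENTITY_REFERENCE_NODE) {
                    count++;
                }
            }
            return count;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("expand=false: " + entityReferenceNodes(false));
        System.out.println("expand=true:  " + entityReferenceNodes(true));
    }
}
```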

What does "non-validating" mean?

 

Why does "non-validating" not mean "well-formedness checking only"?

Using a "non-validating" parser does not mean that only well-formedness checking is done! There are still many things that the XML specification requires of the parser, including entity substitution, defaulting of attribute values, and attribute normalization.

This table describes what "non-validating" really means for XML4J parsers. In this table, "no DTD" means no internal or external DTD subset is present.

 

                              non-validating parsers      validating parsers
                              DTD present    no DTD       DTD present    no DTD
----------------------------  -----------    ------       -----------    ------
DTD is read                   YES            no           YES            error
entity substitution           YES            no           YES            error
defaulting of attributes      YES            no           YES            error
attribute normalization       YES            no           YES            error
check against content model   no             no           YES            error
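
The "DTD present" column for non-validating parsers can be checked with a short experiment. The sketch below assumes the JDK's JAXP parser in non-validating mode as a stand-in for an XML4J non-validating parser. The memo document, its status attribute, and the class name are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class NonValidatingDefaults {
    // Internal DTD subset with a default attribute value and an internal entity
    static final String XML =
          "<!DOCTYPE memo [\n"
        + "  <!ELEMENT memo (#PCDATA)>\n"
        + "  <!ATTLIST memo status CDATA \"draft\">\n"
        + "  <!ENTITY who \"Ann\">\n"
        + "]>\n"
        + "<memo>From &who;</memo>";

    /** Parses the sample without validation and returns the root element. */
    public static Element parseNonValidating() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false); // no content model checking
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        Element memo = parseNonValidating();
        // The attribute was never written in the document; it comes from the DTD
        System.out.println("status = " + memo.getAttribute("status"));
        System.out.println("text   = " + memo.getTextContent());
    }
}
```

Even though validation is off, the DTD is still read, so the defaulted attribute and the substituted entity text both show up in the resulting tree.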

How do I associate my data with a node?

 

How do I associate my own data with a node in the DOM tree?

The class com.ibm.xml.dom.NodeImpl provides a void setUserData(Object o) and an Object getUserData() method that you can use to attach any object to a node in the DOM tree.
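
The sketch below shows the same idea using the keyed setUserData/getUserData pair that DOM Level 3 later added to org.w3c.dom.Node, as available in the JDK. This is a swapped-in API: the XML4J methods described here take a single Object and no key, and are defined on com.ibm.xml.dom.NodeImpl. The class name and the key "app.payload" are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NodeUserData {
    private static final String KEY = "app.payload"; // hypothetical key name

    /** Attaches an application object to a new element and reads it back. */
    public static Object roundTrip(Object payload) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element person = doc.createElement("person");
            // The third argument is an optional UserDataHandler; none is needed here
            person.setUserData(KEY, payload, null);
            return person.getUserData(KEY);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("row 42 of the staff table"));
    }
}
```

The attached object is not part of the XML content and is not written out when the tree is serialized.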

How do I parse several documents?

 

How do I more efficiently parse several documents sharing a common DTD?

DTDs are not currently cached by the parser. Because the common DTD is specified in each XML document, it will be re-parsed once for each document.

However, there are things that you can do now to make the process of reading DTDs more efficient:

  • Keep your DTD and DTD references local.
  • Use internal DTD subsets, if possible.
  • Load files from the server to the local client before parsing.
  • Cache document files in a local client cache. Before accessing a document over the network, issue an HTTP header request to check whether it has changed.
  • Do not reference an external DTD or internal DTD subset at all. In this case, no DTD will be read.