Common Problems

I tried to use XML4J to parse HTML, and it generated an error.  What did I do wrong?

Unfortunately, HTML does not, in general, follow the XML grammar rules. Most HTML is actually not well-formed XML.  Therefore, the XML parser generates XML well-formedness errors.

Typical errors include:

  • Missing end tags, e.g. <P> with no </P> (end tags are not required in HTML)
  • Missing closing slash on empty elements, e.g. <IMG SRC="foo"> instead of <IMG SRC="foo" /> (not required in HTML)
  • Missing quotes on attribute values, e.g. <IMG width=600>  (not generally required in HTML)
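
The errors listed above are easy to reproduce. The following is a minimal sketch that uses the standard JAXP DocumentBuilder API from the JDK rather than XML4J's own classes (that substitution is an assumption made for illustration, as are the class and method names). It reports whether a snippet is well-formed, and the parser's message if it is not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of XML4J.
class WellFormedCheck {
    /** Returns null if the text is well-formed XML, otherwise the parser's error message. */
    static String check(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default handler, which would print fatal errors to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(text)));
            return null;
        } catch (SAXException e) {
            // Well-formedness violations are reported as SAX exceptions.
            return e.getMessage();
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected for in-memory input.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<BODY><P>text</BODY>",   // missing end tag
            "<IMG SRC=\"foo\">",      // missing closing slash
            "<IMG width=600 />",      // missing quotes
            "<IMG SRC=\"foo\" />"     // well-formed
        };
        for (String sample : samples) {
            String problem = check(sample);
            System.out.println(sample + " : " + (problem == null ? "well-formed" : problem));
        }
    }
}
```

The exact wording of each message depends on the parser implementation, so the sketch only distinguishes well-formed input from everything else.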

 

I get an error: "invalid UTF-8 character"

According to the XML spec, many Unicode characters are not allowed in an XML document. The most common offenders are control characters, which remain illegal even if you escape them using the character reference form &#xxxx;. See sections 2.2 and 4.1 of the XML spec for details. If the parser generates this error, the document very likely contains a character that you can't see. You can generally find it with a UNIX command like "od -hc".
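
As an alternative to reading a hex dump, a short program can locate the offending character. The sketch below is an illustration only and is not part of XML4J; the class and method names are hypothetical. It checks each code point against the Char production in section 2.2 of the XML 1.0 spec and reports the first one that falls outside it. It assumes the file is meant to be UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of XML4J.
class XmlCharScanner {
    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index (in UTF-16 code units) of the first illegal character, or -1 if none. */
    static int firstIllegal(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java XmlCharScanner <file>");
            return;
        }
        // Malformed UTF-8 bytes decode to U+FFFD, which is a legal XML character,
        // so this scan finds illegal characters, not broken byte sequences.
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        int at = firstIllegal(text);
        if (at < 0) {
            System.out.println("No illegal XML characters found.");
        } else {
            System.out.printf("Illegal character U+%04X at offset %d%n", text.codePointAt(at), at);
        }
    }
}
```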

NOTE: Version 2.0.0 had a bug that caused problems with many non-UTF-8 encodings. The fix is included in version 2.0.2 of the parser.

 

I get an error when I access EBCDIC XML files -- what's happening?

If an XML document/file is not UTF-8, then you MUST specify the encoding in the XML declaration. When transcoding a UTF-8 document to EBCDIC, remember to change this:

    <?xml version="1.0" encoding="UTF-8"?>

to something like this:

    <?xml version="1.0" encoding="ebcdic-cp-us"?>
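
If the transcoding is done in Java, the declaration can be rewritten in the same step. The following is a rough sketch under stated assumptions: the class and method names are hypothetical, the input is assumed to contain the declaration encoding="UTF-8" in double quotes, and the EBCDIC charset is assumed to be installed in the Java runtime (the sketch checks for it before use).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of XML4J.
class Transcoder {
    /**
     * Decodes UTF-8 bytes, rewrites the encoding declaration to declaredName,
     * and re-encodes the text in the target charset.
     */
    static byte[] transcode(byte[] utf8Bytes, Charset target, String declaredName) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        String oldDecl = "encoding=\"UTF-8\"";
        int at = text.indexOf(oldDecl);
        if (at < 0) {
            throw new IllegalArgumentException("no UTF-8 encoding declaration found");
        }
        // Only the first occurrence, which should be the XML declaration, is replaced.
        String updated = text.substring(0, at)
            + "encoding=\"" + declaredName + "\""
            + text.substring(at + oldDecl.length());
        return updated.getBytes(target);
    }

    public static void main(String[] args) {
        String name = "ebcdic-cp-us";
        // Assumption: not every Java runtime ships EBCDIC charsets, so check first.
        if (!Charset.isSupported(name)) {
            System.out.println("Charset " + name + " is not available in this runtime.");
            return;
        }
        byte[] input = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><doc/>"
            .getBytes(StandardCharsets.UTF_8);
        byte[] output = transcode(input, Charset.forName(name), name);
        System.out.println("Wrote " + output.length + " EBCDIC bytes.");
    }
}
```

The charset and the declared name are passed separately because the label a parser expects in the declaration may differ from the name the runtime uses for the same encoding.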

 

I get an error on the EOF character (0x1A).

No, the parser isn't broken.  You're probably using the LPEX editor, which automatically inserts an End-of-file character at the end of your XML document (other editors might do this as well).  Unfortunately, the EOF character (0x1A) is an illegal character according to the XML specification, and XML4J correctly generates an error.
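
If you cannot stop the editor from appending the character, one workaround is to strip it before the document reaches the parser. The sketch below is an illustration with hypothetical class and method names; it assumes an ASCII-compatible encoding such as UTF-8, where the byte 0x1A always means the EOF (Control-Z) character.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of XML4J.
class EofStripper {
    /** Returns a copy of the data with any trailing 0x1A bytes removed. */
    static byte[] stripTrailingEof(byte[] data) {
        int end = data.length;
        while (end > 0 && data[end - 1] == 0x1A) {
            end--;
        }
        return Arrays.copyOf(data, end);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java EofStripper <file>");
            return;
        }
        Path path = Paths.get(args[0]);
        byte[] original = Files.readAllBytes(path);
        byte[] cleaned = stripTrailingEof(original);
        // Rewrite the file only if something was actually removed.
        if (cleaned.length != original.length) {
            Files.write(path, cleaned);
        }
    }
}
```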

 

I get an error on the NP character (0x0C, also known as Control-L).

This character isn't legal in XML.  The only legal XML characters below the Unicode 0x20 code position are 0x09 (TAB), 0x0A (LF), and 0x0D (CR).  0x0C is the form feed character and is not among them.