Defining a document model for HTML documents

The HTML parser converts the text to code page UTF-8. It recognizes HTML tags and classifies them into tag classes.

It recognizes all character entity references defined in HTML 4, such as "&auml;" (ä), and resolves them to the corresponding code points in UTF-8.
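As an illustration only, the entity resolution described above can be sketched with Python's standard library. This is not the product's parser. The helper name resolve_entities is hypothetical, and html.unescape covers the HTML 5 entity set, which includes the HTML 4 entities.

```python
import html


def resolve_entities(text: str) -> bytes:
    """Resolve character entity references, then encode the result as UTF-8."""
    # html.unescape turns references such as &auml; into the character itself.
    return html.unescape(text).encode("utf-8")


resolved = resolve_entities("M&auml;rchen")
print(resolved)  # b'M\xc3\xa4rchen'
```

The character ä (U+00E4) becomes the two bytes C3 A4 in UTF-8, which is what the output shows.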

It recognizes meta tags and parses the meta tag text.
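The following minimal sketch shows one way to pick up meta tag names and values, using Python's html.parser module rather than the product's own parser. The class name MetaCollector is hypothetical and exists only for this example.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect NAME/CONTENT pairs from META tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag == "meta":
            pairs = dict(attrs)
            if "name" in pairs and "content" in pairs:
                self.meta[pairs["name"]] = pairs["content"]


collector = MetaCollector()
collector.feed('<HEAD><META NAME="year" CONTENT="2002"></HEAD>')
collector.close()
print(collector.meta)  # {'year': '2002'}
```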

Here is an example of an HTML document:

<HTML>
<HEAD>
<META NAME="year" CONTENT="2002">
<TITLE> The Firm </TITLE>
</HEAD>
<BODY>
<H1>Synopsis</H1>
:
<H1>Prologue</H1>
:
</BODY>

Here is an example of an HTML document model:

<?xml version="1.0"?>
<HTMLModel>

 <HTMLFieldDefinition
 name="subtitle"
 tag="title" 
 exclude="YES" /> 

 <!-- This is the start of the text field -->
 <HTMLFieldDefinition
 name="header1"
 tag="h1"
 exclude="YES" />
 <!-- This is the end of the text field -->

 <!-- This is the start of the document attribute -->
 <HTMLAttributeDefinition
 name="year"
 tag="meta"
 meta-qualifier="year"
 type="NUMBER" />
 <!-- This is the end of the document attribute -->
 </HTMLModel>

The first line, <?xml version="1.0"?>, declares that the document model itself is written in XML. Note that the model describes HTML documents, not XML documents.

Each field is defined within an HTMLFieldDefinition or HTMLAttributeDefinition tag, which contains the element parameters.

All the text field definitions must be contained within the <HTMLModel> tag.
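To make the relationship between the model and the document concrete, here is a rough sketch that reads the example model and applies it to the example document. It uses Python's standard library, not the product's parser. The names load_model and ModelExtractor are hypothetical. The sketch assumes that type="NUMBER" means a numeric value, and it ignores the exclude parameter.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

MODEL = """<?xml version="1.0"?>
<HTMLModel>
 <HTMLFieldDefinition name="subtitle" tag="title" exclude="YES" />
 <HTMLFieldDefinition name="header1" tag="h1" exclude="YES" />
 <HTMLAttributeDefinition name="year" tag="meta"
  meta-qualifier="year" type="NUMBER" />
</HTMLModel>"""

DOCUMENT = """<HTML>
<HEAD>
<META NAME="year" CONTENT="2002">
<TITLE> The Firm </TITLE>
</HEAD>
<BODY>
<H1>Synopsis</H1>
<H1>Prologue</H1>
</BODY>"""


def load_model(xml_text):
    """Return (tag -> field name, meta-qualifier -> (attribute name, type))."""
    root = ET.fromstring(xml_text)
    fields = {d.get("tag"): d.get("name")
              for d in root.iter("HTMLFieldDefinition")}
    attributes = {d.get("meta-qualifier"): (d.get("name"), d.get("type"))
                  for d in root.iter("HTMLAttributeDefinition")}
    return fields, attributes


class ModelExtractor(HTMLParser):
    """Collect text fields and document attributes named by the model."""

    def __init__(self, fields_by_tag, attrs_by_qualifier):
        super().__init__()
        self.fields_by_tag = fields_by_tag
        self.attrs_by_qualifier = attrs_by_qualifier
        self.fields = {}       # field name -> list of text pieces
        self.attributes = {}   # attribute name -> value
        self._open = None      # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            pairs = dict(attrs)
            spec = self.attrs_by_qualifier.get(pairs.get("name"))
            if spec and pairs.get("content") is not None:
                name, value_type = spec
                value = pairs["content"]
                # Assumption: NUMBER values are stored as floats.
                self.attributes[name] = (
                    float(value) if value_type == "NUMBER" else value)
        elif tag in self.fields_by_tag:
            self._open = self.fields_by_tag[tag]

    def handle_endtag(self, tag):
        if tag in self.fields_by_tag:
            self._open = None

    def handle_data(self, data):
        text = data.strip()
        if self._open and text:
            self.fields.setdefault(self._open, []).append(text)


fields_by_tag, attrs_by_qualifier = load_model(MODEL)
extractor = ModelExtractor(fields_by_tag, attrs_by_qualifier)
extractor.feed(DOCUMENT)
extractor.close()
print(extractor.fields)
print(extractor.attributes)
```

In this sketch, each definition in the model maps an HTML tag (or a meta tag name) to a field or attribute name, so the text of both H1 elements ends up under header1 and the meta value ends up under year.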