The HTML parser converts the text to code page UTF-8. It recognizes HTML tags and classifies them into tag classes.
It recognizes all character entity references defined in HTML 4, such as "&auml;" (ä), and resolves them to the corresponding code points in UTF-8.
It recognizes meta tags and parses the meta tag text.
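As a rough illustration of the entity resolution described above, the following Python sketch resolves a character entity reference and encodes the result as UTF-8. It uses Python's standard `html` module, not the HTML parser described in this document, and the sample word is an assumption chosen for illustration.

```python
import html

# "&auml;" is the HTML 4 character entity reference for "ä" (U+00E4).
# The sample text is hypothetical.
text = "M&auml;rchen"

resolved = html.unescape(text)          # entity reference -> Unicode code point
utf8_bytes = resolved.encode("utf-8")   # code point -> UTF-8 byte sequence

print(resolved)
print(utf8_bytes)
```

In UTF-8, the single character ä is stored as the two bytes 0xC3 0xA4.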
Here is an example of an HTML document:
<HTML>
<HEAD>
<META NAME="year" CONTENT="2002">
<TITLE> The Firm </TITLE>
</HEAD>
<BODY>
<H1>Synopsis</H1>
<H1>Prologue</H1>
:
:
</BODY>
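To make the tag recognition and meta tag parsing more concrete, here is a minimal sketch that walks the sample document with Python's standard `html.parser` module. It is not the parser described in this document; the class name `TagCollector` and its attributes are hypothetical, and the elided body content of the sample is left out.

```python
from html.parser import HTMLParser

# The sample document from the text, without the elided body content.
SAMPLE = """<HTML>
<HEAD>
<META NAME="year" CONTENT="2002">
<TITLE> The Firm </TITLE>
</HEAD>
<BODY>
<H1>Synopsis</H1>
<H1>Prologue</H1>
</BODY>
"""


class TagCollector(HTMLParser):
    """Hypothetical helper: records meta tags and the text of selected tags."""

    def __init__(self, text_tags=("title", "h1")):
        super().__init__(convert_charrefs=True)
        self.text_tags = set(text_tags)
        self.meta = {}      # meta NAME -> CONTENT
        self.text = {}      # tag name -> list of text chunks
        self._open = None   # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag in self.text_tags:
            self._open = tag
            self.text.setdefault(tag, []).append("")

    def handle_data(self, data):
        if self._open is not None:
            self.text[self._open][-1] += data

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None


collector = TagCollector()
collector.feed(SAMPLE)
collector.close()

result_meta = collector.meta
result_text = {
    tag: [chunk.strip() for chunk in chunks]
    for tag, chunks in collector.text.items()
}
print(result_meta)
print(result_text)
```

The sketch keeps meta tag values separate from tag text, which mirrors the distinction the document model draws between document attributes and text fields.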
Here is an example of an HTML document model:
<?xml version="1.0"?>
<HTMLModel>
<HTMLFieldDefinition
    name="subtitle"
    tag="title"
    exclude="YES" />
<HTMLFieldDefinition           - This is the start of the text field
    name="header1"
    tag="h1"
    exclude="YES" />           - This is the end of the text field
<HTMLAttributeDefinition       - This is the start of the document attribute
    name="year"
    tag="meta"
    meta-qualifier="year"
    type="NUMBER" />           - This is the end of the document attribute
</HTMLModel>
The first line, <?xml version="1.0"?>, specifies that the document model is written using XML tags. Note that the model describes HTML documents; it is not written for documents in XML format.
Each field is defined within an HTMLFieldDefinition or HTMLAttributeDefinition tag, which contains the element parameters.
All the text field definitions must be contained within the <HTMLModel> tag.
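The structure of the model can be checked with a short script. The sketch below reads the example model with Python's standard `xml.etree.ElementTree` module and separates text field definitions from document attribute definitions. The function name `read_model` is hypothetical, and the script only illustrates the layout of the model file; it says nothing about how the product itself loads a model.

```python
import xml.etree.ElementTree as ET

# The example model from the text, without the explanatory annotations.
MODEL = """<?xml version="1.0"?>
<HTMLModel>
<HTMLFieldDefinition name="subtitle" tag="title" exclude="YES" />
<HTMLFieldDefinition name="header1" tag="h1" exclude="YES" />
<HTMLAttributeDefinition name="year" tag="meta" meta-qualifier="year" type="NUMBER" />
</HTMLModel>
"""


def read_model(xml_text):
    """Hypothetical helper: return (text fields, document attributes)."""
    root = ET.fromstring(xml_text)
    # All definitions must be contained within the <HTMLModel> tag.
    if root.tag != "HTMLModel":
        raise ValueError("definitions must be contained within <HTMLModel>")

    fields, attributes = [], []
    for element in root:
        if element.tag == "HTMLFieldDefinition":
            fields.append(dict(element.attrib))
        elif element.tag == "HTMLAttributeDefinition":
            attributes.append(dict(element.attrib))
    return fields, attributes


fields, attributes = read_model(MODEL)
for definition in fields:
    print("text field:", definition["name"], "from tag", definition["tag"])
for definition in attributes:
    print("attribute:", definition["name"], "of type", definition["type"])
```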