Document models
A document model primarily controls which parts of a document's structure
are indexed and how they are indexed. Its purpose is to:
- Identify text fields that should be distinguished in the source document
- Determine the type of each such text field
- Assign a field name to the text field
When the document model identifies text as belonging to a text field,
the text is considered to be part of the textual content of the document,
and terms are extracted and stored in the index.
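The three tasks above (identify a text field, determine its type, assign a field name) can be illustrated with a minimal Python sketch. This is not the product's document model syntax or parser. The names `FieldRule`, `FieldCollector`, and `index_fields`, the type label "text", and the lowercase whitespace tokenization are all assumptions made for illustration. The sketch also assumes that text outside any matched tag is simply skipped.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass(frozen=True)
class FieldRule:
    """Hypothetical rule: text inside `tag` becomes field `name` of type `kind`."""
    tag: str
    name: str
    kind: str  # illustrative type label only; carried along but not interpreted


class FieldCollector(HTMLParser):
    """Collects terms from the text found inside tags that match a rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = {rule.tag: rule for rule in rules}
        self.stack = []   # open tags that matched a rule, innermost last
        self.fields = {}  # field name -> list of extracted terms

    def handle_starttag(self, tag, attrs):
        if tag in self.rules:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Assumed tokenization: lowercase, split on whitespace.
        terms = data.lower().split()
        if self.stack and terms:
            rule = self.rules[self.stack[-1]]
            self.fields.setdefault(rule.name, []).extend(terms)


def index_fields(html, rules):
    """Return a mapping of field name to the terms extracted for that field."""
    collector = FieldCollector(rules)
    collector.feed(html)
    collector.close()
    return collector.fields
```

For example, with rules that map `h1` to a field named `heading` and `p` to a field named `body`, only text inside those two tags contributes terms, and each term is stored under the field name assigned by its rule.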
The elements of a document model vary depending on the parser used for
the document format:
- For HTML format, a document model uses the HTML tag names to define which
tags should be indexed, and how to handle meta-tag information.
- For XML format, there is no predefined set of tags, so a document model
must first define which tags are of interest. XML elements of the same name
can also be distinguished based on what other elements they are embedded in.
- For GPP (general purpose parser) format, the document model interacts
even more deeply with the parser, because the model itself has to determine
the boundaries of the text fields. Each field definition must therefore
specify the strings used to detect those boundaries.
- For Outside In formats, a document model uses tags similar
to HTML tag names to define which tags should be indexed, and how to handle
meta-tag information. Note that the Outside In filtering format is also known
as INSO.
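Two of the ideas in the list above can be sketched in Python: distinguishing XML elements of the same name by the elements they are embedded in, and locating GPP fields by boundary strings. This is only a rough illustration under stated assumptions, not the product's behavior. The functions `xml_fields` and `gpp_fields`, the slash-separated path notation, and the use of one start string and one end string per field are hypothetical.

```python
import xml.etree.ElementTree as ET


def xml_fields(xml_text, path_to_field):
    """Map element paths such as 'library/book/title' to field names.

    Two elements with the same tag end up in different fields when
    their enclosing elements differ.
    """
    fields = {}

    def walk(elem, parent_path):
        path = f"{parent_path}/{elem.tag}" if parent_path else elem.tag
        name = path_to_field.get(path)
        if name and elem.text and elem.text.strip():
            fields.setdefault(name, []).append(elem.text.strip())
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return fields


def gpp_fields(text, boundaries):
    """Extract fields from untagged text using boundary strings.

    `boundaries` maps a field name to a (start, end) pair of non-empty
    strings; this pairing is an assumption of the sketch.
    """
    fields = {}
    for name, (start, end) in boundaries.items():
        if not start or not end:
            raise ValueError("boundary strings must be non-empty")
        pos = 0
        while True:
            begin = text.find(start, pos)
            if begin == -1:
                break
            begin += len(start)
            stop = text.find(end, begin)
            if stop == -1:
                break
            fields.setdefault(name, []).append(text[begin:stop].strip())
            pos = stop + len(end)
    return fields
```

In the XML case, a `title` element inside `book` and a `title` element inside `journal` can be assigned to separate fields because the lookup key is the full path, not the tag name alone. In the GPP case there are no tags at all, so the field definition carries the strings that delimit each field.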
For details, see the "Defining a Document Model" section for the relevant format.
For the document model syntax, given as a Document Type Definition (DTD),
and for text field limitations, see Appendix G, "Document model reference".