Extensible Markup Language (XML) and Hypertext Markup Language (HTML) are both derived from the more complex Standard Generalized Markup Language (SGML). SGML's complexity and high cost of implementation spurred interest in developing alternatives.
HTML is the most widely used markup language for Web-based documents. As the popularity of HTML has increased, the limitations of the language have become more apparent. One limitation is that the user is restricted to a relatively small set of tags. HTML authors cannot create their own HTML tags, because commercially available Web browsers have no knowledge of tags that are not part of the HTML standards.
Another limitation of HTML is that tags that control presentation are in the same file as tags that describe the document content. Although HTML 4 and Cascading Style Sheets enable HTML authors to separate content from presentation, HTML 4 remains weak in its ability to describe the content of a document.
XML overcomes limitations of HTML and other markup languages, while providing capabilities that are not a part of the earlier languages. Here's a simple XML document and an HTML document that contains the same data:
XML document:

    <?xml version="1.0" standalone="yes" ?>
    <state stateid="MN">
      <city cityid="12">
        <name>Johnson</name>
        <population>5000</population>
      </city>
      <city cityid="15">
        <name>Pineville</name>
        <population>60000</population>
      </city>
      <city cityid="20">
        <name>Lake Bell</name>
        <population>20</population>
      </city>
    </state>

HTML document:

    <html>
      <h1 id="MN">State</h1>
      <h2 id="12">City</h2>
      <dl>
        <dt>Name</dt>
        <dd>Johnson</dd>
        <dt>Population</dt>
        <dd>5000</dd>
      </dl>
      <h2 id="15">City</h2>
      <dl>
        <dt>Name</dt>
        <dd>Pineville</dd>
        <dt>Population</dt>
        <dd>60000</dd>
      </dl>
      <h2 id="20">City</h2>
      <dl>
        <dt>Name</dt>
        <dd>Lake Bell</dd>
        <dt>Population</dt>
        <dd>20</dd>
      </dl>
    </html>
In the XML document, the tag names convey the meaning of the data they contain. The structure of the document is easily discerned and follows a regular pattern: each city is an element with a name and a population. In contrast, the HTML tag names reveal little about the meaning of their content, and the structure is not particularly useful for manipulating the document or exchanging it between applications.
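To make the contrast concrete, here is a minimal Python sketch that extracts the same city records from both documents using the standard library's `xml.etree.ElementTree`. The function names `cities_from_xml` and `cities_from_html` are hypothetical, chosen for illustration. The HTML version works here only because this particular sample happens to be well-formed XML; real-world HTML usually is not and would need a dedicated HTML parser.

```python
import xml.etree.ElementTree as ET

XML_DOC = """<?xml version="1.0" standalone="yes" ?>
<state stateid="MN">
  <city cityid="12"><name>Johnson</name><population>5000</population></city>
  <city cityid="15"><name>Pineville</name><population>60000</population></city>
  <city cityid="20"><name>Lake Bell</name><population>20</population></city>
</state>"""

HTML_DOC = """<html>
<h1 id="MN">State</h1>
<h2 id="12">City</h2>
<dl><dt>Name</dt><dd>Johnson</dd><dt>Population</dt><dd>5000</dd></dl>
<h2 id="15">City</h2>
<dl><dt>Name</dt><dd>Pineville</dd><dt>Population</dt><dd>60000</dd></dl>
<h2 id="20">City</h2>
<dl><dt>Name</dt><dd>Lake Bell</dd><dt>Population</dt><dd>20</dd></dl>
</html>"""


def cities_from_xml(text):
    """Read each field by its tag name; the markup itself says what the data is."""
    state = ET.fromstring(text)
    return [
        {
            "id": city.get("cityid"),
            "name": city.findtext("name"),
            "population": int(city.findtext("population")),
        }
        for city in state.iter("city")
    ]


def cities_from_html(text):
    """Recover the same records from presentation markup.

    Assumptions (not guaranteed by HTML itself): a city is an <h2> whose
    text is "City", its data is the <dl> that immediately follows, and
    the <dt> labels are spelled exactly "Name" and "Population".
    """
    children = list(ET.fromstring(text))
    cities = []
    for index, element in enumerate(children):
        if element.tag == "h2" and element.text == "City":
            definition_list = children[index + 1]
            labels = [dt.text for dt in definition_list.findall("dt")]
            values = [dd.text for dd in definition_list.findall("dd")]
            fields = dict(zip(labels, values))
            cities.append(
                {
                    "id": element.get("id"),
                    "name": fields["Name"],
                    "population": int(fields["Population"]),
                }
            )
    return cities
```

Both functions return the same list of records, but the HTML version depends on element order and on the wording of visible labels, so a purely cosmetic change to the page (renaming "Population" to "Pop.", say) would break it. The XML version depends only on the tag names.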