Class: BeautifulSoup
In: lib/rubyful_soup.rb
Parent: BeautifulStoneSoup
This parser knows the following facts about HTML:
Most tags can't be nested at all. For instance, the occurrence of a <p> tag should implicitly close the previous <p> tag.
<p>Para1<p>Para2 should be transformed into: <p>Para1</p><p>Para2
Some tags can be nested arbitrarily. For instance, the occurrence of a <blockquote> tag should not implicitly close the previous <blockquote> tag.
Alice said: <blockquote>Bob said: <blockquote>Blah should NOT be transformed into: Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah
Some tags can be nested, but the nesting is reset by the interposition of other tags. For instance, a <tr> tag should implicitly close the previous <tr> tag within the same <table>, but not close a <tr> tag in another table.
<table><tr>Blah<tr>Blah should be transformed into: <table><tr>Blah</tr><tr>Blah
but <tr>Blah<table><tr>Blah should NOT be transformed into: <tr>Blah<table></tr><tr>Blah
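The three rules above can be sketched as a small, self-contained Ruby function. This is an illustration only, not the code in lib/rubyful_soup.rb: the NESTING_RULES table and the implicitly_closed method are hypothetical names made up for this sketch, and the table covers only the tags used in the examples above.

```ruby
# Hypothetical rule table for this sketch; not the library's own data.
# A tag mapped to [] may nest arbitrarily. A tag mapped to a list of
# "reset" tags may nest only when one of those tags has been opened
# since the previous tag of the same name. Unlisted tags never nest.
NESTING_RULES = {
  'blockquote' => [],
  'tr'         => ['table']
}.freeze

# open_tags: names of the currently open tags, outermost first.
# Returns the open tags (innermost first) that should be implicitly
# closed when new_tag is opened, or [] if nothing should be closed.
def implicitly_closed(open_tags, new_tag)
  resets = NESTING_RULES[new_tag]
  return [] if resets && resets.empty?           # arbitrarily nestable

  to_close = []
  open_tags.reverse_each do |tag|
    return [] if resets && resets.include?(tag)  # nesting was reset
    to_close << tag
    return to_close if tag == new_tag            # previous tag of same name found
  end
  []                                             # no previous tag of the same name
end
```

With this table, implicitly_closed(['table', 'tr'], 'tr') reports that the open <tr> should be closed, while implicitly_closed(['tr', 'table'], 'tr') reports nothing to close, because the intervening <table> resets the nesting.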
Differing assumptions about tag nesting rules are a major source of problems with the BeautifulSoup class. If BeautifulSoup does not treat a tag as nestable but your page author does, try writing a subclass.
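The subclassing idea can be illustrated with a toy model. The classes below are hypothetical stand-ins written for this example and are NOT the Rubyful Soup API; consult lib/rubyful_soup.rb for the actual names a real subclass would need to override. The sketch assumes a page author who nests <q> inside <q>.

```ruby
# Toy stand-in for a parser class; NOT the Rubyful Soup API.
class ToyParser
  # Tags this toy parser allows to nest inside themselves.
  def nestable_tags
    %w[blockquote div]
  end

  # True when opening `tag` should implicitly close an already open
  # tag of the same name.
  def closes_previous?(tag)
    !nestable_tags.include?(tag)
  end
end

# Subclass for pages whose author treats <q> as nestable.
class LenientToyParser < ToyParser
  def nestable_tags
    super + %w[q]
  end
end
```

Here ToyParser.new.closes_previous?('q') is true, while LenientToyParser.new.closes_previous?('q') is false; the subclass changes only the nesting assumption and inherits everything else.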