wrose
05-17-2010, 06:35 AM
Hi there,
When SgmlReader is parsing and an element is not closed as it expects, it will put in a closing tag when the parent element is closed. This leads to unfortunate behaviour in a couple of places. The following HTML has some seemingly benign unclosed elements.
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=UTF-8">
<title>Hello!</title>
</head>
<body>
<div>
<img src="test.jpg">
<p>This paragraph follows the image, doesn't it?</p>
</div>
<form>
<select name="hello">
<option value=hi>Hello
<option value=there>There
<option value=p><p>Really shouldn't be here</p>
</select>
</form>
</body>
</html>
When SgmlReader processes this (with the HTML DTD), the META element is not closed until after the TITLE, when its parent HEAD closes, even though META is an empty element. Likewise, the IMG element is not closed until after the following P, when the parent DIV closes, even though IMG is also an empty element.
Similarly, the OPTION elements end up nested inside one another, although the DTD says they may contain only #PCDATA (no other elements), so each one should be closed as soon as the next element's start tag is encountered.
These issues mean that the XML output is not valid XHTML (leaving aside other issues such as the missing alt attribute on the IMG and the missing action attribute on the FORM).
The following is the XML parsed by SgmlReader 1.8.6 for the HTML above:
<html>
<head>
<meta http-equiv='Content-type' content='text/html; charset=UTF-8'>
<title>Hello!</title>
</meta>
</head>
<body>
<div>
<img src='test.jpg'>
<p>This paragraph follows the image, doesn't it?</p>
</img>
</div>
<form>
<select name='hello'>
<option value='hi'>Hello
<option value='there'>There
<option value='p'>
<p>Really shouldn't be here</p>
</option>
</option>
</option>
</select>
</form>
</body>
</html>
The following XML is produced by HTMLTidy when asked to read the same snippet. HTMLTidy complains about the P nested in the OPTION and removes it. It also correctly treats each element as extending only as far as the DTD allows.
<html>
<head>
<meta name="generator"
content="HTML Tidy for Linux/x86 (vers 11 February 2007), see www.w3.org" />
<meta http-equiv="Content-type" content="text/html; charset=us-ascii" />
<title>Hello!</title>
</head>
<body>
<div>
<img src="test.jpg" />
<p>This paragraph follows the image, doesn't it?</p>
</div>
<form>
<select name="hello">
<option value="hi">Hello</option>
<option value="there">There</option>
</select>
</form>
</body>
</html>
Is it possible for SgmlReader to use the DTD to evaluate what should be included in an element as an additional check to ensure that nesting problems are not encountered?
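To illustrate the kind of check I mean, here is a rough sketch in Python (not SgmlReader code, and not how SgmlReader works internally). It assumes two hand-written sets, EMPTY and PCDATA_ONLY, standing in for what a real DTD lookup would return; both sets and the names DtdAwareWriter and to_xml are made up for this example. Elements declared EMPTY are written as self-closed, and an open #PCDATA-only element is closed as soon as another start tag arrives. It does not validate anything else, so the stray P is moved out of the OPTION but still ends up inside the SELECT rather than being dropped as HTMLTidy does.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical stand-ins for a DTD lookup (a subset of the HTML 4 declarations).
EMPTY = {"meta", "img", "br", "hr", "input", "link"}  # declared EMPTY
PCDATA_ONLY = {"option", "title"}                     # content model (#PCDATA)


class DtdAwareWriter(HTMLParser):
    """Re-emits HTML as XML, closing elements where the sets above require it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # names of elements that are currently open
        self.out = []    # pieces of the XML output

    def handle_starttag(self, tag, attrs):
        # A #PCDATA-only element cannot contain another element: close it first.
        if self.stack and self.stack[-1] in PCDATA_ONLY:
            self.out.append("</%s>" % self.stack.pop())
        attr_text = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
        )
        if tag in EMPTY:
            # EMPTY elements never have content, so self-close and do not push.
            self.out.append("<%s%s />" % (tag, attr_text))
        else:
            self.out.append("<%s%s>" % (tag, attr_text))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag (or end tag of an EMPTY element): ignore
        # Close everything down to and including the matching element.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        text = data.strip()  # whitespace is trimmed to keep the output compact
        if text:
            self.out.append(escape(text, quote=False))


def to_xml(html):
    writer = DtdAwareWriter()
    writer.feed(html)
    writer.close()
    # Close anything still open at the end of the input.
    while writer.stack:
        writer.out.append("</%s>" % writer.stack.pop())
    return "".join(writer.out)


if __name__ == "__main__":
    print(to_xml('<div><img src="test.jpg"><p>Hi</p></div>'))
    print(to_xml('<select name="hello"><option value=hi>Hello\n'
                 '<option value=there>There\n</select>'))
```

With these two rules the IMG no longer swallows the following P and the OPTION elements come out as siblings, which is the behaviour I would hope SgmlReader could derive from the DTD itself rather than from hard-coded sets like these.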