Sgml Reader element closing heuristic is wrong

wrose
05-17-2010, 06:35 AM
Hi there,

When SgmlReader encounters an element that is not closed where it expects, it inserts the closing tag only when the parent element closes. This leads to unfortunate behaviour in a couple of places. The following HTML has some seemingly benign unclosed elements:


<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=UTF-8">
<title>Hello!</title>
</head>
<body>
<div>
<img src="test.jpg">
<p>This paragraph follows the image, doesn't it?</p>
</div>
<form>
<select name="hello">
<option value=hi>Hello
<option value=there>There
<option value=p><p>Really shouldn't be here</p>
</select>
</form>
</body>
</html>

When SgmlReader processes this (with the HTML DTD), the META tag is closed after the TITLE, even though META is an empty element. Likewise, the IMG tag is not closed until its parent DIV closes, after the subsequent P, even though IMG is also an empty element.

Similarly, the OPTION tags end up nested, even though the DTD shows that they may contain only #PCDATA (no other elements) and should therefore close when the next element tag is encountered.

These issues mean that the XML output is not XHTML (leaving aside other issues, such as the missing alt attribute on the IMG and action attribute on the FORM).

The following is the XML parsed by SgmlReader 1.8.6 for the HTML above:


<html>
<head>
<meta http-equiv='Content-type' content='text/html; charset=UTF-8'>
<title>Hello!</title>
</meta>
</head>
<body>
<div>
<img src='test.jpg'>
<p>This paragraph follows the image, doesn't it?</p>
</img>
</div>
<form>
<select name='hello'>
<option value='hi'>Hello
<option value='there'>There
<option value='p'>
<p>Really shouldn't be here</p>
</option>
</option>
</option>
</select>
</form>
</body>
</html>

The following XML is produced by HTMLTidy when asked to read the same snippet. HTMLTidy complains about the P nested in the OPTION and removes it. It also correctly treats each element as extending only as far as the DTD allows.


<html>
<head>
<meta name="generator"
content="HTML Tidy for Linux/x86 (vers 11 February 2007), see www.w3.org" />
<meta http-equiv="Content-type" content="text/html; charset=us-ascii" />
<title>Hello!</title>
</head>
<body>
<div>
<img src="test.jpg" />
<p>This paragraph follows the image, doesn't it?</p>
</div>
<form>
<select name="hello">
<option value="hi">Hello</option>
<option value="there">There</option>
</select>
</form>
</body>
</html>

Is it possible for SgmlReader to use the DTD to evaluate what should be included in an element as an additional check to ensure that nesting problems are not encountered?
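
As a rough illustration of the DTD-driven check asked about above, here is a minimal sketch. It is not SgmlReader code: the class name, the pre-split token format, and the two hand-written element tables are all hypothetical stand-ins for what a real implementation would read from the DTD. It only covers the two cases in this post (empty elements and #PCDATA-only elements) and does not check whether the parent actually allows the new child.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class DtdClosingSketch
{
    // Hypothetical tables standing in for DTD declarations (not read from a real DTD).
    static readonly HashSet<string> EmptyElements =
        new HashSet<string> { "meta", "img", "br", "hr", "input", "link" };
    static readonly HashSet<string> TextOnlyElements =
        new HashSet<string> { "option", "title", "textarea" };

    // Token format (an assumption of this sketch): "<name" opens an element,
    // "</name" closes one, anything else is character data.
    public static string Normalize(IEnumerable<string> tokens)
    {
        var output = new StringBuilder();
        var open = new Stack<string>();
        foreach (string token in tokens)
        {
            if (token.StartsWith("</", StringComparison.Ordinal))
            {
                string name = token.Substring(2);
                if (!open.Contains(name)) continue; // stray end tag: ignore it
                // Close everything up to and including the named element.
                while (open.Count > 0)
                {
                    string top = open.Pop();
                    output.Append("</").Append(top).Append(">");
                    if (top == name) break;
                }
            }
            else if (token.StartsWith("<", StringComparison.Ordinal))
            {
                string name = token.Substring(1);
                // A #PCDATA-only element cannot hold child elements, so close it first.
                while (open.Count > 0 && TextOnlyElements.Contains(open.Peek()))
                    output.Append("</").Append(open.Pop()).Append(">");
                if (EmptyElements.Contains(name))
                    output.Append("<").Append(name).Append("/>"); // never left open
                else
                {
                    output.Append("<").Append(name).Append(">");
                    open.Push(name);
                }
            }
            else
            {
                output.Append(token);
            }
        }
        // Close whatever is still open at end of input.
        while (open.Count > 0)
            output.Append("</").Append(open.Pop()).Append(">");
        return output.ToString();
    }

    public static void Main()
    {
        string[] tokens = { "<select", "<option", "Hello", "<option", "There",
                            "</select", "<img", "<p", "Text", "</p" };
        // Prints: <select><option>Hello</option><option>There</option></select><img/><p>Text</p>
        Console.WriteLine(Normalize(tokens));
    }
}
```

With these tables the second OPTION start tag closes the first one, and IMG is emitted as a self-closed element instead of swallowing the following P.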

wrose
06-26-2010, 03:26 AM
I have eventually returned to this and discovered that, once again, PEBKAC. I was using the standard wrapper code to build a document using SgmlReader:


using System.IO;
using System.Xml;

public static class Program
{
    public static void Main(string[] args)
    {
        string html = "<html><head><meta http-equiv=\"Content-type\" content=\"text/html; charset=UTF-8\"><title>Hello!</title></head><body><form><select><option value=\"1\"><b>Hello</b><option value=\"2\">Two</form></body></html>";

        Sgml.SgmlReader sgml = new Sgml.SgmlReader();

        // Ask SgmlReader to parse the input against the HTML DTD.
        sgml.DocType = "HTML";
        sgml.WhitespaceHandling = WhitespaceHandling.All;
        sgml.CaseFolding = Sgml.CaseFolding.ToLower;
        sgml.InputStream = new StringReader(html);

        // Load the reader's output into a DOM without resolving external resources.
        XmlDocument xml = new XmlDocument();
        xml.PreserveWhitespace = false;
        xml.XmlResolver = null;
        xml.Load(sgml);

        /* Do stuff with document */
    }
}


I had put the Html.dtd, htmllat1.ent, HTMLspecial.ent and HTMLsymbol.ent files in my project. But I had forgotten to set the build action to Embedded Resource, so when SgmlReader had attempted to load the DTD it was silently failing. When the DTD is loaded correctly, elements are closed more sensibly.
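
For anyone hitting the same problem, one way to confirm the files really are embedded is to list the assembly's manifest resources. The following is a minimal sketch, not part of SgmlReader: the class and method names are hypothetical, and matching on the file-name suffix is an assumption about how the resources end up being named in your project. It also assumes the relevant assembly is the one that contains the SgmlReader sources.

```csharp
using System;
using System.Linq;
using System.Reflection;

public static class EmbeddedResourceCheck
{
    // True if some embedded resource name ends with the given file name.
    // Suffix matching is an assumption: resource names are normally prefixed
    // with the project's default namespace and folder path.
    public static bool HasEmbeddedResource(Assembly assembly, string fileName)
    {
        return assembly.GetManifestResourceNames()
            .Any(n => n.EndsWith(fileName, StringComparison.OrdinalIgnoreCase));
    }

    public static void Main()
    {
        // In a real project, pass the assembly that contains SgmlReader,
        // e.g. typeof(Sgml.SgmlReader).Assembly, instead of this one.
        Assembly assembly = Assembly.GetExecutingAssembly();
        string[] files = { "Html.dtd", "htmllat1.ent", "HTMLspecial.ent", "HTMLsymbol.ent" };
        foreach (string file in files)
        {
            string status = HasEmbeddedResource(assembly, file) ? "embedded" : "MISSING";
            Console.WriteLine("{0}: {1}", file, status);
        }
    }
}
```

Any file reported as MISSING most likely still has its build action set to something other than Embedded Resource.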

SteveB
07-02-2010, 03:08 PM
Thanks for following up with the solution to your issue. I'm sure others may run into a similar error. Certainly, I've done this countless times.
