Hi
I'm trying to parse the following file using SgmlReader:
Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="pl" lang="pl">
<body>
<p>A&amp;B</p>
</body>
</html>
Here's my code:
Code:
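using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using Sgml;
string html = System.IO.File.ReadAllText("a.htm", Encoding.UTF8);
// Parse the same markup twice: once through SgmlReader, once directly.
SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.InputStream = new StringReader(html);
var xdoc = XDocument.Load(sgmlReader);
var xdoc2 = XDocument.Parse(html);
var ns = XNamespace.Get("http://www.w3.org/1999/xhtml");
// The first line printed comes from SgmlReader, the second from XDocument.Parse.
Console.WriteLine(xdoc.Descendants(ns + "p").Single().Value);
Console.WriteLine(xdoc2.Descendants(ns + "p").Single().Value);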
And here's the output (I've manually added spaces between characters to avoid conflicts with the forum's markup language):
Code:
A & # 3 8 ; B
A & B
Why do I get "A & # 3 8 ; B" with SgmlReader instead of "A & B"? Is there a way to configure SgmlReader to simply return "A&B"?
I've tried setting sgmlReader.DocType = "HTML", but it doesn't help. Using XmlDocument instead of XDocument doesn't help either.
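As a stopgap, the value SgmlReader returns looks like it has been escaped one level too many, so decoding it once more yields the expected text. Below is a minimal sketch of that idea, assuming System.Net.WebUtility is available (.NET 4.0 or later). The class and method names are made up for illustration, and the input string is hard-coded rather than read through SgmlReader. This is not a real fix: it would also decode text that was genuinely meant to contain a literal character reference.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of SgmlReader.
static class DecodeWorkaround
{
    // Decodes one remaining level of character references, e.g. "&#38;" -> "&".
    public static string DecodeOnce(string value)
    {
        return WebUtility.HtmlDecode(value);
    }
}

class Program
{
    static void Main()
    {
        // Value as it comes back via SgmlReader in the example above (hard-coded here).
        string fromSgmlReader = "A&#38;B";
        Console.WriteLine(DecodeWorkaround.DecodeOnce(fromSgmlReader)); // prints "A&B"
    }
}
```

In the original code this would be applied to the Value of the p element loaded through SgmlReader.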
Thanks for any help!
Bartek