Thread: & problem with SgmlReader

  1. #1

    & problem with SgmlReader

    Hi

    I'm trying to parse the following file using SgmlReader:

    Code:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="pl" lang="pl">
    <body>
    	<p>A&amp;B</p>
    </body>
    </html>
    Here's my code:

    Code:
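    // Needs: using System; using System.IO; using System.Linq; using System.Text; using System.Xml.Linq; using Sgml;
    string html = System.IO.File.ReadAllText("a.htm", Encoding.UTF8);

    // Load the same markup twice: once through SgmlReader, once with the plain XML parser.
    SgmlReader sgmlReader = new Sgml.SgmlReader();
    sgmlReader.InputStream = new StringReader(html);
    var xdoc = XDocument.Load(sgmlReader);
    var xdoc2 = XDocument.Parse(html);
    
    // Print the text of the single <p> element from each document.
    var ns = XNamespace.Get("http://www.w3.org/1999/xhtml");
    Console.WriteLine(xdoc.Descendants(ns + "p").Single().Value);
    Console.WriteLine(xdoc2.Descendants(ns + "p").Single().Value);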
    And here's the output (I've manually added spaces between the characters to avoid conflicts with the forum markup language):

    A & # 38 ; B
    A & B
    Why do I get "A & # 38 ; B" instead of "A & B" when using SgmlReader? Is there a way to configure SgmlReader to simply return "A&B"?

    I've tried setting sgmlReader.DocType = "HTML", but it does not help. Using XmlDocument instead of XDocument does not help either.
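
    A stopgap that does not depend on any SgmlReader setting is to run the extracted text through one more decoding pass. The following is a minimal sketch, assuming the leftover text only ever contains references that should have been decoded. EntityFix and DecodeLeftoverEntities are hypothetical names used for illustration; WebUtility.HtmlDecode is the standard System.Net method (available from .NET 4.0).

```csharp
using System;
using System.Net;

static class EntityFix
{
    // Hypothetical helper (not part of SgmlReader): decodes any character
    // or entity references that are still present in a text value.
    public static string DecodeLeftoverEntities(string text)
    {
        return WebUtility.HtmlDecode(text);
    }
}

static class Program
{
    static void Main()
    {
        // The string below stands in for the value returned via SgmlReader.
        Console.WriteLine(EntityFix.DecodeLeftoverEntities("A&#38;B")); // prints "A&B"
    }
}
```

    This would also decode text that is meant to contain a literal character reference, so it is a workaround rather than a real fix.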

    Thanks for any help!

    Bartek

  2. #2

    Although I do not have a solution to your issue, I am experiencing something similar. When I load an SgmlReader into an XmlDocument, special characters do not seem to be handled correctly. For instance, if the document stream contains the entity '&ndash;', then once it is loaded into the XmlDocument and saved, all occurrences are changed to [ndash ] . The same happens with other special characters. I am not sure how to get around this, or whether there is a setting in SgmlReader that I have overlooked.