SgmlReader: Ignore SystemId/DTD?



bhaal
11-16-2009, 11:42 AM
I was wondering, shouldn't the SgmlReader be capable of ignoring the SystemId of a document and parsing it regardless of what the SystemId points to?

I am using the current version in combination with XDocument (System.Xml.Linq) to read various XML and SGML documents.
However, my test cases fail when:
- the input file is SGML (thus the SgmlReader comes into play) and
- the file contains a doctype with a SystemId.
It usually fails with a FileNotFoundException or DirectoryNotFoundException, since it simply cannot find the DTD file inside Entity.Open.
In my case, I just want to open the document, not validate it; hence I don't care whether it is actually valid against the referenced DTD. Usually the DTD just isn't located right next to the document, where SgmlReader expects it; in all other cases the DTD is not available or not known at all.

Is there any way to load those documents anyway? XDocument alone succeeds (for XML), and I think XmlReader also has options to allow for this (not sure about this, though).
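
For the plain XML side, one option that does exist in System.Xml (.NET 4.0 and later) is to create the reader with XmlResolver set to null, so the DOCTYPE is still parsed but the external subset named by the SystemId is never fetched. The following is a minimal sketch under stated assumptions: the class and method names, the in-memory document and the SystemId "missing.dtd" are all made up for illustration. It is not SgmlReader code and does nothing for SGML input.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class NoDtdFetchExample
{
    // Hypothetical helper: parses the DOCTYPE but never resolves external resources.
    public static XDocument LoadWithoutFetchingDtd(TextReader input)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse, // keep the DOCTYPE node
            XmlResolver = null                   // do not try to open the external DTD
        };

        using (var reader = XmlReader.Create(input, settings))
        {
            return XDocument.Load(reader);
        }
    }

    public static void Main()
    {
        // "missing.dtd" is an invented SystemId that does not exist on disk.
        const string xml = "<!DOCTYPE root PUBLIC \"publicId\" \"missing.dtd\"><root/>";

        var doc = LoadWithoutFetchingDtd(new StringReader(xml));

        // The identifiers are still available on the loaded document.
        Console.WriteLine(doc.DocumentType.PublicId);
        Console.WriteLine(doc.DocumentType.SystemId);
    }
}
```

The point of the sketch is only that "load but do not resolve the DTD" is an existing mode for XML readers, which is the behaviour being asked for on the SGML side.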

This is the snippet I call for my tests:

using System;
using System.Xml;
using System.Xml.Linq;

public static XDocument LoadDocument(string fileName)
{
    XDocument ret;

    try
    {
        // First attempt: treat the file as well-formed XML.
        ret = XDocument.Load(fileName);
    }
    catch
    {
        try
        {
            // Fallback: parse as SGML and keep the DOCTYPE in the result.
            var sgmlReader = new Sgml.SgmlReader();
            sgmlReader.Href = fileName;
            sgmlReader.StripDocType = false;
            ret = XDocument.Load(sgmlReader);
        }
        catch (Exception ex)
        {
            // Keep the original failure as the inner exception.
            throw new XmlException("Could not load " + fileName + " as Xml Document", ex);
        }
    }

    return ret;
}

void Test()
{
    // <!DOCTYPE root PUBLIC "publicId" "systemId" [subset]>
    // <root/>
    var doc = LoadDocument("some.xml");
    Assert.AreEqual("publicId", doc.DocumentType.PublicId);
    Assert.AreEqual("systemId", doc.DocumentType.SystemId);
    Assert.AreEqual("subset", doc.DocumentType.InternalSubset);

    // <!DOCTYPE root PUBLIC "publicId" "systemId" [subset]>
    // <root>
    doc = LoadDocument("some.sgm");
    Assert.AreEqual("publicId", doc.DocumentType.PublicId);
    Assert.AreEqual("systemId", doc.DocumentType.SystemId);
    Assert.AreEqual("subset", doc.DocumentType.InternalSubset);
}
Loading "some.sgm" fails with "Unable to find 'current\working\directory\\systemId'", since the file is obviously not there.

Any chance I can get those files to load?

Regards, BhaaL

SteveB
12-01-2009, 10:09 PM
Your best bet is to step into the SgmlReader code and propose a patch to obtain the behavior you need. That's the benefit of it being open source. :)

bhaal
12-02-2009, 09:25 AM
I was hoping that there was some built-in way to do this already...

However, in the meantime, I actually did that.
I added a property that lets me specify whether any DTD should be ignored.
See my attached patch for the changes, in case you're interested in adding this back into the trunk.

It works for my specific case, but any comments on it would certainly be welcome, since I can't know whether making LazyLoadDtd do nothing in that case breaks anything else.

- BhaaL

SteveB
01-20-2010, 05:55 AM
Thanks for the patch. I opened a bug report for it (http://bugs.developer.mindtouch.com/view.php?id=7536). Thanks!

SteveB
02-18-2010, 10:56 PM
I committed the patch to trunk. Thanks again for it!

bhaal
03-18-2010, 09:00 AM
Saw it with the update earlier, thanks!
Just one thing about it: it seems you moved the check for m_IgnoreDtd from LazyLoadDtd to the Dtd property.
This unfortunately breaks what I intended in the first place, since ParseDocType also calls LazyLoadDtd directly - and that's the exact path where the problem occurs.
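
To illustrate why the placement of the check matters, here is a toy model of the two call paths. It is not the SgmlReader source: the class ToyReader, the LoadCount counter, the constructor flag and all method bodies are invented for illustration. Only the names LazyLoadDtd, ParseDocType, the Dtd property and the ignore flag come from this thread.

```csharp
using System;

// Toy model only; the real SgmlReader is structured differently.
public class ToyReader
{
    private readonly bool guardInsideLazyLoad;

    public ToyReader(bool guardInsideLazyLoad)
    {
        this.guardInsideLazyLoad = guardInsideLazyLoad;
    }

    public bool IgnoreDtd { get; set; }

    // Counts how often the (simulated) DTD file would have been opened.
    public int LoadCount { get; private set; }

    // Stand-in for the step that opens the DTD and may throw FileNotFoundException.
    private void LazyLoadDtd()
    {
        if (guardInsideLazyLoad && IgnoreDtd)
        {
            return; // guard covers every caller
        }
        LoadCount++;
    }

    public object Dtd
    {
        get
        {
            if (!guardInsideLazyLoad && IgnoreDtd)
            {
                return null; // guard covers only this property
            }
            LazyLoadDtd();
            return this; // placeholder for a DTD object
        }
    }

    // Second caller that bypasses the Dtd property.
    public void ParseDocType()
    {
        LazyLoadDtd();
    }
}

public static class GuardPlacementDemo
{
    public static void Main()
    {
        var guardInProperty = new ToyReader(false) { IgnoreDtd = true };
        guardInProperty.ParseDocType();
        Console.WriteLine(guardInProperty.LoadCount); // prints 1: the DTD is still loaded

        var guardInLazyLoad = new ToyReader(true) { IgnoreDtd = true };
        guardInLazyLoad.ParseDocType();
        Console.WriteLine(guardInLazyLoad.LoadCount); // prints 0: the load is skipped
    }
}
```

With the check only in the property, any method that calls the loader directly still triggers the load, which matches the failure described above.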

SteveB
03-22-2010, 06:01 PM
Oh, man, how embarrassing. I must have misapplied the patch. :( Ok, round 2.

rush7645
04-27-2010, 09:00 AM
So has a new working patch been released for this yet?

SteveB
04-27-2010, 10:20 AM
Yes, the fix has been applied. You can download the latest trunk build from our svn repo (https://svn.mindtouch.com/source/public/sgmlreader/trunk/dist/).