scottt732
09-17-2009, 03:14 PM
Hi,
I'm not sure whether you'd want me to contribute this to the SgmlReader project or whether I should start it as a separate project of my own.
I've written a Fiddler2 plug-in that intercepts responses from web servers and runs them through SgmlReader. This lets you get a balanced XML document into Firefox from any page you hit in the browser. For non-XHTML sites it tends to break the layout, but that isn't important given the purpose of the plug-in.
With the Inspector feature of the Firebug Firefox extension, you can click on an element and get an XPath expression that points to exactly what you're looking for in the page. Alternatively, you can just walk through the DOM visually with Firebug. Once you know how SgmlReader transforms the page and you have an XPath expression that pulls the data out of that transformed document, you have everything you need to scrape the data in pure code. Because Fiddler2 sits in the middle, you can also see all of the HTTP headers going back and forth and even inspect HTTPS traffic.
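To make the XPath step concrete, here is a minimal sketch in Python rather than C#, using only the standard library. The markup, the `extract` helper, and the XPath expression are all made up for illustration; the string just stands in for the kind of balanced document SgmlReader might produce, and a real page would need whatever expression Firebug gives you.

```python
import xml.etree.ElementTree as ET

# Hypothetical balanced document, standing in for SgmlReader output.
BALANCED = """<html><body>
<div id="content"><table>
<tr><td class="name">Widget</td><td class="price">9.99</td></tr>
<tr><td class="name">Gadget</td><td class="price">19.50</td></tr>
</table></div>
</body></html>"""


def extract(xml_text, xpath):
    """Parse a well-formed document and return the text of each matching element."""
    root = ET.fromstring(xml_text)
    return [(el.text or "").strip() for el in root.findall(xpath)]


# ElementTree supports only a subset of XPath; this expression stays within it.
prices = extract(BALANCED, ".//div[@id='content']/table/tr/td[@class='price']")
print(prices)
```

The point is only that once the document is balanced, a plain XML parser plus one expression is enough to get the data out.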
I'm planning on adding a code generation feature to the plug-in so that it produces C# code that performs the exact same HTTP request (complete with the user-agent header from your browser), with or without a CookieContainer. That would basically let you create a scraping application in a few minutes that performs the same sequence of requests you carried out in your browser, carrying auth cookies and session data between the requests.
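Roughly, the generated code would have the shape sketched below. This is a Python analogue using only the standard library, not the C# the plug-in would emit; `make_opener` and `replay` are hypothetical names, and the cookie jar plays the role that CookieContainer would play in .NET.

```python
import http.cookiejar
import urllib.request


def make_opener(user_agent, use_cookies=True):
    """Build an opener that sends the given User-Agent and optionally keeps cookies."""
    handlers = []
    jar = None
    if use_cookies:
        # The jar carries auth cookies and session data from one request to the next.
        jar = http.cookiejar.CookieJar()
        handlers.append(urllib.request.HTTPCookieProcessor(jar))
    opener = urllib.request.build_opener(*handlers)
    # Replace the default User-Agent with the one captured from the browser.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def replay(opener, urls):
    """Request each URL in order through the same opener and return the raw bodies."""
    bodies = []
    for url in urls:
        with opener.open(url) as resp:
            bodies.append(resp.read())
    return bodies
```

You would call `make_opener` once with the user-agent string captured in Fiddler2, then pass the recorded URLs to `replay` in the order the browser requested them. This only covers simple GET requests; replaying form posts or other headers would need more than this.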
Sooo... new project or part of SgmlReader?
Scott Holodak
www.sholo.net