How to extract the anchor text



roger nie
10-29-2009, 02:59 AM
These days I'm working on refining the content of an HTML document. I extract all the hyperlinks and their corresponding anchor text. To extract the anchor text I currently just use character matching, such as IndexOf(), but it takes a long time.
Do you know of other methods for extracting the anchor text of a hyperlink? Please let me know. Thanks!

rberinger
10-29-2009, 12:24 PM
There are several ways of doing this efficiently (jQuery, XPath from DekiScript). If you provide a more specific "road map" of what you're trying to accomplish, I'm sure we can help. What are you doing with the text once you extract it?
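
To illustrate the general idea behind the parser-based options mentioned above, here is a minimal sketch that walks the HTML once with a real parser instead of scanning it repeatedly with IndexOf(). It is written in Python using only the standard library's html.parser, which is an assumption made purely for illustration; it is not jQuery, DekiScript, or ASP.NET code. The names AnchorTextExtractor and extract_anchors are hypothetical.

```python
from html.parser import HTMLParser


class AnchorTextExtractor(HTMLParser):
    """Collect (href, anchor text) pairs in a single pass over the HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []          # finished (href, text) pairs
        self._in_anchor = False  # True while an <a> element is open
        self._href = None        # href of the currently open <a>, if any
        self._chunks = []        # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors cannot nest, so a new <a> ends any unclosed one.
            if self._in_anchor:
                self._finish()
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self._finish()

    def handle_data(self, data):
        # Text inside nested tags such as <b> is also delivered here.
        if self._in_anchor:
            self._chunks.append(data)

    def _finish(self):
        # Collapse runs of whitespace so the anchor text is a single line.
        text = " ".join("".join(self._chunks).split())
        if self._href is not None:  # skip anchors without an href
            self.links.append((self._href, text))
        self._in_anchor = False
        self._href = None


def extract_anchors(html):
    """Return a list of (href, anchor text) tuples found in the given HTML."""
    parser = AnchorTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling extract_anchors(page_source) returns the links in document order. Because the parser visits each character once, the cost grows linearly with the page size, which is usually where repeated substring searches lose time.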

crb
10-30-2009, 07:31 PM
I guess he's referring to SGMLReader.

roger nie
11-04-2009, 06:36 AM
I'm working on a meta mobile search engine. I've built the search engine and it works. But when it is used on mobile phones, the web pages are too big, so I want to extract the main content of each web page for the search results.
By the way, the engine is built in ASP.NET.

roger nie
11-04-2009, 06:38 AM
I've used SGMLReader to format the HTML documents, but when I came to the next step, a new question came up.