Select all anchor tags with an href attribute that contains one of multiple values via XPath in lxml / Python
I need to automatically scan lots of HTML documents for ad banners that are surrounded by an anchor tag, e.g.:
<a href="http://ad_network.com/abc.html"> <img src="ad_banner.jpg"> </a>
As a newbie with XPath, I can select such anchors via lxml like so:
import lxml.html

text = '''<a href="http://ad_network.com/abc.html"> <img src="ad_banner.jpg"> </a>'''
root = lxml.html.fromstring(text)
print(root.xpath('//a[contains(@href, "ad_network.") or contains(@href, "other_ad_network.")][descendant::img]'))
In this example I check for two different domains: "ad_network." and "other_ad_network.". However, there are around 25 domains to check, and the XPath expression would become terribly long from connecting the contains directives with "or". I also fear the expression would be pretty inefficient in terms of CPU resources. Is there a syntax for checking multiple "contains" values?
I also considered extracting the links via regex in a single line of code. Yet, although the HTML code is normalized by lxml, regex never seems to be a good choice for this kind of work ... Any help appreciated!
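It might not be that bad to have a bunch of 'or's. Build the XPath with Python so you don't get writer's cramp, and precompile it. The actual XPath code runs in libxml and should be fast.
from lxml import etree

sites = ['aaa', 'bbb']
# Quote each site so XPath treats it as a string literal, not an element name
contains = ' or '.join('contains(@href, "%s")' % site for site in sites)
# Compile once, then reuse on every document
anchor_xpath = etree.XPath('//a[%s][descendant::img]' % contains)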
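To show the precompiled expression being applied to a parsed document, here is a minimal end-to-end sketch. The domain list, the sample HTML, and the names build_ad_xpath and find_ad_anchors are made up for illustration, and it assumes none of the domain strings contains a double quote.

```python
import lxml.html
from lxml import etree

# Hypothetical ad-network fragments; substitute the real list of ~25 domains.
AD_DOMAINS = ["ad_network.", "other_ad_network."]


def build_ad_xpath(domains):
    """Return a compiled XPath matching <a> tags that wrap an <img>
    and whose href contains any of the given substrings."""
    # Assumption: no domain contains a double quote, because each value
    # is embedded as a double-quoted XPath string literal.
    conditions = " or ".join('contains(@href, "%s")' % d for d in domains)
    return etree.XPath("//a[%s][descendant::img]" % conditions)


# Compile once, reuse for every document that gets scanned.
find_ad_anchors = build_ad_xpath(AD_DOMAINS)

html = """
<div>
  <a href="http://ad_network.com/abc.html"><img src="ad_banner.jpg"></a>
  <a href="http://example.com/page.html"><img src="photo.jpg"></a>
  <a href="http://other_ad_network.com/x.html">text only</a>
</div>
"""
root = lxml.html.fromstring(html)

# Only the first anchor has both a matching href and an <img> descendant.
ad_links = [a.get("href") for a in find_ad_anchors(root)]
print(ad_links)
```

The compiled object is callable, so the same find_ad_anchors can be applied to each parsed document in a loop without re-parsing the expression.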