Python - Selenium with Scrapy for dynamic page
I'm trying to scrape product information from a webpage using Scrapy. The to-be-scraped webpage looks like this:
- It starts with a product_list page with 10 products.
- A click on the "next" button loads the next 10 products (the URL doesn't change between the 2 pages).
- I use LinkExtractor to follow each product link into the product page and get all the information I need.
I tried to replicate the next-button AJAX call but can't get it working, so I'm giving Selenium a try. I can run Selenium's webdriver in a separate script, but I don't know how to integrate it with Scrapy. Where shall I put the Selenium part in my Scrapy spider?
My spider is pretty standard, like the following:
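
    # Import paths below are the pre-1.0 Scrapy locations for these names.
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.log import INFO


    class ProductSpider(CrawlSpider):
        name = "product_spider"
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/shanghai']
        rules = [
            # Follow every product link found in the product list.
            Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="productlist"]//dl[@class="t2"]//dt'),
                 callback='parse_product'),
        ]

        def parse_product(self, response):
            self.log("parsing product %s" % response.url, level=INFO)
            hxs = HtmlXPathSelector(response)
            # actual data follows
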
Any idea is appreciated. Thank you!
It depends on how you need to scrape the site, and on how and what data you want to get.
Here's an example of how you can follow pagination on eBay using Scrapy + Selenium:
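
    import scrapy
    from selenium import webdriver
    from selenium.common.exceptions import WebDriverException
    from selenium.webdriver.common.by import By


    class ProductSpider(scrapy.Spider):
        name = "product_spider"
        allowed_domains = ['ebay.com']
        start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.tr0.trc0.xpython&_nkw=python&_sacat=0&_from=r40']

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.driver = webdriver.Firefox()

        def parse(self, response):
            # Load the same URL in a real browser so the page's JavaScript runs.
            self.driver.get(response.url)

            while True:
                try:
                    next_link = self.driver.find_element(By.XPATH, '//td[@class="pagn-next"]/a')
                    next_link.click()

                    # get the data and write it to scrapy items
                except WebDriverException:
                    # No "next" link, or the click failed: stop paginating.
                    break

            self.driver.close()
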
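The "get the data" step in the spider above is left open. One possible way to fill it in, offered as a sketch rather than the answer's own method, is to pull the paging loop out into a small generator that yields the rendered HTML of each page. The name `iter_page_sources` and the parameters `max_pages` and `settle_seconds` are hypothetical; the only WebDriver members assumed are `page_source` and `find_elements`, and the fixed sleep is a crude stand-in for a proper explicit wait.

```python
import time


def iter_page_sources(driver, next_xpath, max_pages=20, settle_seconds=1.0):
    """Yield the rendered HTML of each page, clicking "next" in between.

    `driver` is assumed to expose Selenium's WebDriver interface
    (`page_source` and `find_elements`). `max_pages` is a safety cap
    in case the "next" link never disappears.
    """
    for _ in range(max_pages):
        yield driver.page_source

        # find_elements returns an empty list when nothing matches, so no
        # exception handling is needed. "xpath" is the string value of
        # selenium.webdriver.common.by.By.XPATH.
        links = driver.find_elements("xpath", next_xpath)
        if not links:
            break

        links[0].click()
        # Crude pause so the AJAX-loaded products can appear (assumption:
        # a fixed delay is good enough for the target site).
        time.sleep(settle_seconds)
```

Inside `parse`, each yielded string can be wrapped in `scrapy.Selector(text=html)` and queried with the usual XPath expressions to build items, for example `for html in iter_page_sources(self.driver, '//td[@class="pagn-next"]/a'): ...`. Because the generator only relies on two WebDriver members, it can be exercised with a stub object and no browser.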
Here are some examples of "selenium spiders":
- Executing Javascript Submit form functions using scrapy in python
- https://gist.github.com/cheekybastard/4944914
- https://gist.github.com/irfani/1045108
- http://snipplr.com/view/66998/
There is also an alternative to having to use Selenium with Scrapy. In some cases, using the ScrapyJS middleware is enough to handle the dynamic parts of a page. Sample real-world usage:
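As a rough sketch of what wiring ScrapyJS into a project can look like, the fragment below shows the settings and the per-request `meta` as described in the ScrapyJS README. The Splash address `http://localhost:8050` and the helper name `splash_meta` are assumptions for illustration, not part of the original answer, and a running Splash instance is required for the middleware to do anything.

```python
# settings.py fragment (ScrapyJS-era names; assumes Splash runs locally)
SPLASH_URL = "http://localhost:8050"
DOWNLOADER_MIDDLEWARES = {
    "scrapyjs.SplashMiddleware": 725,
}
DUPEFILTER_CLASS = "scrapyjs.SplashAwareDupeFilter"


def splash_meta(wait=0.5):
    """Build the request meta that asks Splash to render a page.

    Hypothetical helper: `wait` is the number of seconds Splash should
    let the page's JavaScript run before returning the rendered HTML.
    """
    return {
        "splash": {
            "endpoint": "render.html",
            "args": {"wait": wait},
        }
    }
```

In a spider, the dict would be passed along with the request, for example `scrapy.Request(url, callback=self.parse, meta=splash_meta())`, so that `response` holds the rendered HTML. ScrapyJS has since been renamed scrapy-splash, where the module path is `scrapy_splash` and the recommended settings differ somewhat.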