python - Selenium with Scrapy for dynamic page


I'm trying to scrape product information from a webpage using Scrapy. The webpage to be scraped looks like this:

  • It starts with a product_list page with 10 products
  • A click on the "next" button loads the next 10 products (the URL doesn't change between the two pages)
  • I use LinkExtractor to follow each product link into the product page, and get all the information I need

I tried to replicate the next-button AJAX call but can't get it working, so I'm giving Selenium a try. I can run Selenium's webdriver in a separate script, but I don't know how to integrate it with Scrapy. Where shall I put the Selenium part in my Scrapy spider?

My spider is pretty standard, like the following:

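    import logging

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class ProductSpider(CrawlSpider):
        name = "product_spider"
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/shanghai']
        # Follow every product link on the list page and parse the product page.
        # LinkExtractor is the current name for the older SgmlLinkExtractor.
        rules = [
            Rule(LinkExtractor(restrict_xpaths='//div[@id="productlist"]//dl[@class="t2"]//dt'),
                 callback='parse_product'),
        ]

        def parse_product(self, response):
            self.log("parsing product %s" % response.url, level=logging.INFO)
            # response.xpath(...) takes the place of the older HtmlXPathSelector(response)
            # actual data follows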

Any ideas are appreciated. Thank you!

It really depends on how you need to scrape the site, and on how and what data you want to get.

Here's an example of how you can follow pagination on eBay using Scrapy+Selenium:

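    import scrapy
    from selenium import webdriver
    from selenium.common.exceptions import WebDriverException
    from selenium.webdriver.common.by import By


    class ProductSpider(scrapy.Spider):
        name = "product_spider"
        allowed_domains = ['ebay.com']
        start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.tr0.trc0.xpython&_nkw=python&_sacat=0&_from=r40']

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # One browser instance is shared by the whole crawl
            self.driver = webdriver.Firefox()

        def parse(self, response):
            self.driver.get(response.url)

            while True:
                try:
                    # Raises once there is no "next" link left (or it cannot be clicked)
                    next_link = self.driver.find_element(By.XPATH, '//td[@class="pagn-next"]/a')
                    next_link.click()

                    # get the data and write it to scrapy items
                except WebDriverException:
                    break

            self.driver.quit()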

Here are some examples of "Selenium spiders":


There is also an alternative to having to use Selenium with Scrapy. In some cases, using the ScrapyJS middleware is enough to handle the dynamic parts of a page. Sample real-world usage:
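For orientation, here is a minimal sketch of the project settings that enable such a middleware. It assumes the `scrapy-splash` package (the successor to ScrapyJS) is installed and that a Splash rendering service is reachable at the address shown. The address is a placeholder, not something taken from the original answer:

```python
# settings.py (fragment)

# Assumed address of a running Splash instance; adjust to your setup.
SPLASH_URL = 'http://localhost:8050'

# Route requests through Splash and keep cookies and compression working.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Avoid storing duplicate Splash arguments with every request.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Make the duplicate filter aware of Splash arguments.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

With these settings in place, a spider would typically yield `SplashRequest(url, callback, args={'wait': 0.5})` from `scrapy_splash` instead of a plain `scrapy.Request`, so that the callback receives the page after its JavaScript has run.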

