我一度知道您需要使用像硒这样的webtoolkit来自动执行抓取.
我将如何能够单击Google Play商店上的下一个按钮,以便仅出于我的大学目的刮取评论!
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from selenium import webdriver
import time
class Product(scrapy.Item):
title = scrapy.Field()
class FooSpider(CrawlSpider):
name = 'foo'
start_urls = ["https://play.google.com/store/apps/details?id=com.gaana&hl=en"]
def __init__(self, *args, **kwargs):
super(FooSpider, self).__init__(*args, **kwargs)
self.download_delay = 0.25
self.browser = webdriver.Chrome(executable_path="C:\chrm\chromedriver.exe")
self.browser.implicitly_wait(60) #
def parse(self,response):
self.browser.get(response.url)
sites = response.xpath('//div[@class="single-review"]/div[@class="review-header"]')
items = []
for i in range(0,200):
time.sleep(20)
button = self.browser.find_element_by_xpath("/html/body/div[4]/div[6]/div[1]/div[2]/div[2]/div[1]/div[2]/button[1]/div[2]/div/div")
button.click()
self.browser.implicitly_wait(30)
for site in sites:
item = Product()
item['title'] = site.xpath('.//div[@class="review-info"]/span[@class="author-name"]/a/text()').extract()
yield item
我已经更新了代码,一次又一次地给了我40个重复项.for循环怎么了?
似乎正在更新的源代码没有传递到xpath,这就是为什么它返回相同的40个项目的原因
解决方法:
我会做这样的事情:
from scrapy import CrawlSpider
from selenium import webdriver
import time
class FooSpider(CrawlSpider):
name = 'foo'
allow_domains = 'foo.com'
start_urls = ['foo.com']
def __init__(self, *args, **kwargs):
super(FooSpider, self).__init__(*args, **kwargs)
self.download_delay = 0.25
self.browser = webdriver.Firefox()
self.browser.implicitly_wait(60)
def parse_foo(self.response):
self.browser.get(response.url) # load response to the browser
button = self.browser.find_element_by_xpath("path") # find
# the element to click to
button.click() # click
time.sleep(1) # wait until the page is fully loaded
source = self.browser.page_source # get source of the loaded page
sel = Selector(text=source) # create a Selector object
data = sel.xpath('path/to/the/data') # select data
...
不过,最好不要等待固定的时间.因此,您可以使用此处描述的方法http://www.obeythetestinggoat.com/how-to-get-selenium-to-wait-for-page-load-after-a-click.html代替time.sleep(1).
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 [email protected] 举报,一经查实,本站将立刻删除。