How to scrape web data (I recently started using selenium and scrapy for web scraping)

优采云 Published: 2021-10-29 20:12

Situation:

I recently started web scraping with selenium and scrapy, and I'm working on a project where I have a csv file containing 42,000 zip codes. My job is to take each zip code, enter it on the website, and collect all of the results.
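For context, feeding the zip codes in from that file is the easy half; a minimal sketch with Python's csv module, assuming a one-column file named zipcodes.csv (the post doesn't give the actual file name):

import csv

# Read one zip code per row from a one-column csv file
# ("zipcodes.csv" is an assumed name, not from the original post).
with open("zipcodes.csv", newline="") as f:
    zip_codes = [row[0].strip() for row in csv.reader(f) if row]

print(len(zip_codes))  # should report 42000 for the file described above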

Problem:

The problem is that while doing this I have to keep clicking the "load more" button until all of the results are displayed, and only once that is done can I collect the data.

That may not sound like a big deal, but each zip code takes about 2 minutes, and I have 42,000 of them; at that rate a full run is roughly 84,000 minutes, or close to two months of continuous scraping.

The Code:

import scrapy
from selenium import webdriver
from selenium.common.exceptions import (
    ElementClickInterceptedException,
    ElementNotInteractableException,
    NoSuchElementException,
    StaleElementReferenceException,
)
from items import CareCreditItem
from datetime import datetime
import os
from scrapy.crawler import CrawlerProcess

pin_code = input("enter pin code")

class CareCredit1Spider(scrapy.Spider):
    name = 'care_credit_1'
    start_urls = ['https://www.carecredit.com/doctor-locator/results/Any-Profession/Any-Specialty//?Sort=D&Radius=75&Page=1']

    def start_requests(self):
        directory = os.getcwd()
        options = webdriver.ChromeOptions()
        options.headless = True
        options.add_experimental_option("excludeSwitches", ["enable-logging"])
        path = os.path.join(directory, "Chromedriver.exe")
        driver = webdriver.Chrome(path, options=options)

        # Results page for the entered pin code
        url = ("https://www.carecredit.com/doctor-locator/results/Any-Profession/Any-Specialty/"
               + pin_code + "/?Sort=D&Radius=75&Page=1")
        driver.maximize_window()
        driver.get(url)
        driver.implicitly_wait(200)

        # Dismiss the cookie banner if it appears
        try:
            cookies = driver.find_element_by_xpath('//*[@id="onetrust-accept-btn-handler"]')
            cookies.click()
        except NoSuchElementException:
            pass

        # Keep clicking "load more" until the button disappears or stops responding
        loadMoreButtonExists = True
        while loadMoreButtonExists:
            try:
                load_more = driver.find_element_by_xpath('//*[@id="next-page"]')
                load_more.click()
                driver.implicitly_wait(30)
            except (ElementNotInteractableException, NoSuchElementException):
                loadMoreButtonExists = False
            except (ElementClickInterceptedException, StaleElementReferenceException):
                pass

        # Jump back to the first page of results before collecting the links
        try:
            previous_page = driver.find_element_by_xpath('//*[@id="previous-page"]')
            previous_page.click()
        except NoSuchElementException:
            pass

        # Collect the detail-page links, close the browser, then hand them to Scrapy
        links = [element.find_element_by_tag_name('a').get_property('href')
                 for element in driver.find_elements_by_class_name('dl-result-item')]
        driver.quit()
        for href in links:
            yield scrapy.Request(href)

    def parse(self, response):
        item = CareCreditItem()
        item['Practise_name'] = response.css('h1 ::text').get()
        item['address'] = response.css('.google-maps-external ::text').get()
        item['phone_no'] = response.css('.dl-detail-phone ::text').get()
        yield item

now = datetime.now()
dt_string = now.strftime("%d-%m-%Y")  # "/" from the original format string is not legal in a filename
dt = now.strftime("%H-%M-%S")
file_name = dt_string + "_" + dt + "zip-code" + pin_code + ".csv"

process = CrawlerProcess(settings={
    'FEED_URI': file_name,
    'FEED_FORMAT': 'csv',
})
process.crawl(CareCredit1Spider)
process.start()
print("CSV File is Ready")

items.py:

import scrapy

class CareCreditItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    Practise_name = scrapy.Field()
    address = scrapy.Field()
    phone_no = scrapy.Field()
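A side note on the runner script above: Scrapy 2.1+ deprecates the FEED_URI and FEED_FORMAT settings in favor of a single FEEDS dictionary. A sketch of the equivalent configuration on a newer Scrapy:

from scrapy.crawler import CrawlerProcess

# Equivalent feed export on Scrapy 2.1+, where FEED_URI/FEED_FORMAT are deprecated
process = CrawlerProcess(settings={
    'FEEDS': {
        file_name: {'format': 'csv'},  # same file_name as built in the script above
    },
})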

Question:

Basically my question is simple: is there any way to optimize this code so that it executes faster? Or is there any other way of processing this data that doesn't take so long?
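For what it's worth, the usual way to make a job like this dramatically faster is to drop the browser entirely and let Scrapy fetch the result pages itself, since Scrapy issues many requests concurrently. A minimal sketch, assuming the Page parameter already visible in the URL is honored server-side without JavaScript (an assumption that needs checking against the site), with the selectors reused from the code above and the hypothetical zipcodes.csv from earlier:

import csv
import scrapy

BASE = "https://www.carecredit.com/doctor-locator/results/Any-Profession/Any-Specialty/{}/?Sort=D&Radius=75&Page={}"

class FastCareCreditSpider(scrapy.Spider):
    name = 'care_credit_fast'
    custom_settings = {'CONCURRENT_REQUESTS': 32}  # assumed value; tune politely

    def start_requests(self):
        # "zipcodes.csv" is an assumed file name, not from the original post
        with open('zipcodes.csv', newline='') as f:
            for row in csv.reader(f):
                zip_code = row[0].strip()
                yield scrapy.Request(BASE.format(zip_code, 1),
                                     cb_kwargs={'zip_code': zip_code, 'page': 1})

    def parse(self, response, zip_code, page):
        results = response.css('.dl-result-item')
        for element in results:
            href = element.css('a::attr(href)').get()
            if href:
                yield response.follow(href, callback=self.parse_detail)
        if results:  # keep paging until a page comes back empty
            yield scrapy.Request(BASE.format(zip_code, page + 1),
                                 cb_kwargs={'zip_code': zip_code, 'page': page + 1})

    def parse_detail(self, response):
        yield {
            'Practise_name': response.css('h1 ::text').get(),
            'address': response.css('.google-maps-external ::text').get(),
            'phone_no': response.css('.dl-detail-phone ::text').get(),
        }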
