Writing an automated scraper (creating a Scrapy project, step by step)

优采云 Published: 2021-11-26 03:01

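Workflow for writing a crawler project:

Create the project: scrapy startproject <project name>

Create the spider: scrapy genspider <spider name> "<allowed domain>"

Define the requirements: write items.py

Write the spider file spiders/xxx.py: handle requests and responses and extract the data (yield item)

Write the pipeline file pipelines.py: process the items returned by the spider, for example by persisting them locally

Write settings.py: enable the pipeline component with ITEM_PIPELINES = {} and configure other middleware settings such as headers

Run the spider: scrapy crawl <spider name>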

1. Create the project

scrapy startproject Tencent

cd Tencent # change into the project directory

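2. Create the spider

scrapy genspider tencent.job "www.tencent.com"

When importing modules or writing paths in the settings, start from the package one level below the project folder (Tencent/Tencent), not from the project folder itself. For example, the items module is imported as Tencent.items.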

3. Open the website and look at the data: #a

4. Decide which fields need to be scraped and define them in Tencent/items.py:

# Tencent/Tencent/items.py
import scrapy


class TencentItem(scrapy.Item):
    # Position name
    positionName = scrapy.Field()
    # Link to the position details
    positionLink = scrapy.Field()
    # Position category
    positionType = scrapy.Field()
    # Number of openings
    peopleNumber = scrapy.Field()
    # Work location
    workLocation = scrapy.Field()
    # Publication date
    publishTime = scrapy.Field()

5. Write the spider, Tencent/spiders/tencent_job.py:

import scrapy


class TencentJobSpider(scrapy.Spider):
    name = "tencent.job"
    allowed_domains = ["tencent.com"]
    base_url = "https://hr.tencent.com/position.php?keywords=python&lid=2218&tid=87&start="
    offset = 0  # the first page is start=0; each page advances start by 10
    # Scrapy reads the attribute start_urls (plural)
    start_urls = [base_url + str(offset)]

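Analyse the page to find the fields we want.

Right-click the information we want and choose Inspect to view the element. The nodes are:

//tr[@class='even']

or //tr[@class='odd']

Combining the two: //tr[@class='even'] | //tr[@class='odd']

Each tr contains a list of td cells, and those cells hold the details of one position.

Position name: //tr[@class='even']/td[1]/a/text()

Note: XPath indexes start at 1.

Link to the position details: //tr[@class='even']/td[1]/a/@href

………………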
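The row-then-cell lookup described above can be tried offline. The following is only a sketch: the markup and hrefs are made up for illustration, and it uses the standard library's ElementTree, which supports a subset of XPath (no `|` union), so a class check stands in for `//tr[@class='even'] | //tr[@class='odd']`.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup; the real page's structure and hrefs may differ.
SAMPLE_HTML = """
<table>
  <tr class="h"><td>Position</td><td>Type</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=1">Python Engineer</a></td><td>Tech</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">Data Engineer</a></td><td>Tech</td></tr>
</table>
"""


def extract_jobs(markup):
    """Return one dict per data row, reading cells relative to the row."""
    root = ET.fromstring(markup)
    jobs = []
    for row in root.iter("tr"):
        # Stand-in for the union of the 'even' and 'odd' row selectors
        if row.get("class") not in ("even", "odd"):
            continue
        # "./" keeps the path relative to this row; td[1] is the FIRST cell
        link = row.find("./td[1]/a")
        jobs.append({
            "positionName": link.text,
            "positionLink": link.get("href"),
            "positionType": row.find("./td[2]").text,
        })
    return jobs


jobs = extract_jobs(SAMPLE_HTML)
```

The header row is skipped because its class is neither 'even' nor 'odd', and `td[1]` picks the first cell because XPath positions count from 1.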

import scrapy

from Tencent.items import TencentItem


class TencentJobSpider(scrapy.Spider):
    name = "tencent.job"
    allowed_domains = ["tencent.com"]
    base_url = "https://hr.tencent.com/position.php?keywords=python&lid=2218&tid=87&start="
    offset = 0  # the first page is start=0; each page advances start by 10
    start_urls = [base_url + str(offset)]

    def parse(self, response):
        # response.xpath() returns a list of selectors that can be queried again
        node_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")
        for node in node_list:
            # xpath() returns selectors; extract() returns their content as a list of strings.
            # The leading "./" makes each td relative to the current node.
            positionName = node.xpath('./td[1]/a/text()').extract()
            positionLink = node.xpath('./td[1]/a/@href').extract()
            positionType = node.xpath('./td[2]/text()').extract()
            peopleNumber = node.xpath('./td[3]/text()').extract()
            workLocation = node.xpath('./td[4]/text()').extract()
            publishTime = node.xpath('./td[5]/text()').extract()

            item = TencentItem()
            # extract() returns a list, so take its first element. On Python 3 the
            # values are already str, so no encode("utf-8") is needed.
            item['positionName'] = positionName[0] if positionName else ""
            item['positionType'] = positionType[0] if positionType else ""
            item['positionLink'] = positionLink[0] if positionLink else ""
            item['peopleNumber'] = peopleNumber[0] if peopleNumber else ""
            item['workLocation'] = workLocation[0] if workLocation else ""
            item['publishTime'] = publishTime[0] if publishTime else ""
            yield item

        if self.offset < 2190:
            self.offset += 10  # 10 records per page
            # offset is an int, so convert it before concatenating
            url = self.base_url + str(self.offset)
            # Build the request for the next page
            yield scrapy.Request(url, callback=self.parse)
            # 1. If the next response has a different layout, write a separate
            #    callback method and pass that instead.
            # 2. What is yielded here is a Request, not an item, so the engine does
            #    not send it to the pipeline; it hands it to the scheduler to be queued.
            # 3. Use yield so parse() keeps producing requests until self.offset >= 2190.
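The offset arithmetic behind this pagination can be checked on its own. A minimal sketch, assuming the last page is start=2190 as in the spider above; `page_urls` is a hypothetical helper name, not part of Scrapy.

```python
BASE_URL = "https://hr.tencent.com/position.php?keywords=python&lid=2218&tid=87&start="


def page_urls(base_url, last_offset=2190, step=10):
    """Yield one URL per results page, from start=0 up to start=last_offset."""
    for offset in range(0, last_offset + 1, step):
        # offset is an int, so it must be converted before concatenation
        yield base_url + str(offset)


urls = list(page_urls(BASE_URL))
```

With a step of 10 and a final offset of 2190, the spider requests 220 pages in total.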

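Points to note about how the pipeline is called:

Each response is handled by one call to parse(); each pass through the for loop produces one item; and every item produced in the loop is passed to the same pipeline process_item() method.

In other words, all items share one pipeline object: the pipeline class is instantiated only once, while process_item() is called once per item. Initialisation therefore happens once, and the output file needs to be opened and closed only once.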
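That call order can be imitated without Scrapy. The sketch below is an assumption-laden stand-in: `DemoPipeline` and `run_fake_crawl` are hypothetical names, an in-memory buffer replaces the output file, and the driver only mimics the order in which the methods are called.

```python
import io
import json


class DemoPipeline:
    instances = 0  # counts how many times the class is instantiated

    def __init__(self):
        DemoPipeline.instances += 1
        self.buffer = io.StringIO()  # stands in for the output file, opened once
        self.items_seen = 0

    def process_item(self, item, spider):
        # Called once per item, always on the same pipeline object
        self.items_seen += 1
        self.buffer.write(json.dumps(item, ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once at the end; keep the text, then close the "file"
        self.output = self.buffer.getvalue()
        self.buffer.close()


def run_fake_crawl(items):
    """Hypothetical driver imitating the call order described above."""
    pipeline = DemoPipeline()
    for item in items:
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    return pipeline


pipeline = run_fake_crawl([{"positionName": "A"}, {"positionName": "B"}, {"positionName": "C"}])
```

Three items pass through a single instance, which is why per-crawl resources such as a file handle belong in the constructor and in close_spider() rather than in process_item().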

6. Write the pipeline file to save the data

1) Write the pipeline class in Tencent/pipelines.py

import json


class TencentPipeline:
    def __init__(self):
        # Opened once, when the pipeline is instantiated
        self.f = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # To keep Chinese text readable, set ensure_ascii to False
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        # The item values are expected to be str here, so no encode("utf-8") is
        # needed; the file's encoding="utf-8" handles the encoding on write.
        self.f.write(content)
        return item

    def close_spider(self, spider):
        # Closed once, when the spider finishes
        self.f.close()
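The effect of ensure_ascii can be seen in isolation. A small sketch with made-up field values:

```python
import json

# Made-up sample values for illustration
item = {"positionName": "后台开发工程师", "workLocation": "深圳"}

escaped = json.dumps(item)                       # non-ASCII characters become \uXXXX escapes
readable = json.dumps(item, ensure_ascii=False)  # characters are written as they are
```

Both strings decode to the same data; ensure_ascii=False only changes how readable the saved file is.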

2) In Tencent/settings.py, enable pipelines and register the pipeline class above:

ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
}
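The number next to each pipeline sets its order: Scrapy passes items through pipelines from the lowest value to the highest, and values are customarily chosen in the 0 to 1000 range. The sketch below illustrates the ordering only; `DedupPipeline` is a hypothetical second pipeline that does not exist in this project.

```python
ITEM_PIPELINES = {
    "Tencent.pipelines.TencentPipeline": 300,
    "Tencent.pipelines.DedupPipeline": 100,  # hypothetical, for illustration only
}

# Lower numbers run first, so sorting by value gives the processing order
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```

Here the hypothetical DedupPipeline would see each item before TencentPipeline writes it to disk.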

3) Make sure the parse() method of the spider class in Tencent/spiders/xx.py yields item objects, so that the pipeline can process them.

7. Run the spider:

scrapy crawl <spider name>

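8. Optimisation:

The example above assumes the total number of pages is known; that approach suits sites without a next-page link.

This site does have a clickable next-page link, and on the last page it can no longer be clicked.

So whether the next-page link can be clicked tells us whether we are on the last page.

On the page, the next-page node is: //a[@id='next']/@href

On the last page, that node's href is "javascript:;" and the tag matches: //a[@class='noative']

However, the previous-page link on the first page also matches: //a[@class='noative']

The ids differ, though: the previous-page link on the first page has id 'prev'.

So combine the two conditions: //a[@class='noative' and @id='next']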

            yield item

        if not response.xpath("//a[@class='noative' and @id='next']"):
            url = response.xpath("//a[@id='next']/@href").extract()[0]
            # Build the request for the next page; urljoin resolves the href
            # against the URL of the current page
            yield scrapy.Request(response.urljoin(url), callback=self.parse)
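The last-page check and the URL resolution can be sketched with the standard library alone. This assumes the next-page href is relative to the current page, which the source does not state; `next_page_url` is a hypothetical helper and the sample hrefs are made up.

```python
from urllib.parse import urljoin


def next_page_url(current_url, next_href):
    """Return the absolute URL of the next page, or None on the last page."""
    # On the last page the link is disabled and its href is "javascript:;"
    if not next_href or next_href.startswith("javascript:"):
        return None
    # Resolve the href against the page it was found on
    return urljoin(current_url, next_href)


current = "https://hr.tencent.com/position.php?keywords=python&start=0"
following = next_page_url(current, "position.php?keywords=python&start=10#a")
finished = next_page_url(current, "javascript:;")
```

Resolving against the current page URL works for both relative and absolute hrefs, whereas appending the href to a fixed prefix only works when the prefix happens to match.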

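Note: conditions on the same tag can be combined inside one XPath predicate with and / or; to combine nodes from different tags, join the path expressions with |.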
