Scraping paginated web pages with Scrapy (Scrapy, a high-level Python crawling framework)

Published by 优采云: 2022-02-07 15:11


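Introduction

Scrapy is a high-level Python crawling framework. Besides the crawling itself, it makes it easy to save the scraped data to files such as CSV or JSON.

It can be used in a range of programs for data mining, information processing, or archiving historical data. It was originally designed for page scraping (more precisely, web scraping), but it can also retrieve data returned by APIs (for example, Amazon Associates Web Services) or serve as a general-purpose web crawler. Typical uses include data mining, monitoring, and automated testing.

Scrapy uses the Twisted asynchronous networking library to handle network communication.

First, we install Scrapy.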

Installation on Linux or macOS

pip3 install scrapy

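Installation on Windows

# Download the Twisted wheel
http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
# The wheel module must be installed before .whl files can be installed
pip3 install wheel
# Install Twisted (pick the wheel that matches your Python version and architecture;
# this one is for CPython 3.6 on 64-bit Windows)
pip3 install Twisted-18.4.0-cp36-cp36m-win_amd64.whl
pip3 install pywin32
# Install Scrapy
pip3 install scrapy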

Usage: creating a project

Format: scrapy startproject <project name>

scrapy startproject spider

Creating the project generates a directory with the following layout:

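<project name>/
    - spiders/          # spider files
        - chouti.py
        - cnblogs.py
        ....
    - items.py          # item definitions (structure of the scraped data)
    - pipelines.py      # item pipelines (persistence)
    - middlewares.py    # middlewares
    - settings.py       # configuration (crawler)
scrapy.cfg              # configuration (deployment)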
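
The layout above names pipelines.py as the place where scraped data is persisted. As a rough illustration, an item pipeline is a plain Python class with a process_item method that Scrapy calls for every item a spider yields. The sketch below appends each item to a JSON Lines file; the class name JsonLinesPipeline, the file name articles.jl, and the item fields are assumptions made for this example, not part of the original project.

```python
import json


class JsonLinesPipeline:
    """Append every scraped item to a JSON Lines file (one JSON object per line)."""

    def __init__(self, path="articles.jl"):
        # Hypothetical output file name, chosen for this sketch
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called by Scrapy once when the spider starts
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called by Scrapy once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields; the item is returned
        # so that any later pipeline can process it as well
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

To activate such a pipeline it would be registered in settings.py, for example ITEM_PIPELINES = {'spider.pipelines.JsonLinesPipeline': 300}. The module path assumes the project is named spider, and the number (conventionally 0 to 1000) sets the order in which pipelines run.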

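Creating a spider

Format:

cd <project name>
scrapy genspider <spider name> <domain to crawl>

cd spider
scrapy genspider chouti chouti.com

After the spider is created, a file named chouti.py is generated in the spiders folder. It contains a skeleton spider class with the name, allowed_domains, and start_urls attributes and an empty parse method.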

Running the spider

scrapy crawl chouti
scrapy crawl chouti --nolog # do not print the log
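
The same commands can also be launched from a Python script, which can be handy when starting the spider from an IDE. This is a minimal sketch, assuming Scrapy is installed for the current interpreter and the script is run from the project directory. The helper names crawl_command and run_spider are made up for this example.

```python
import subprocess
import sys


def crawl_command(spider_name, log=True):
    """Build the command line for `scrapy crawl <spider_name>`."""
    # `python -m scrapy` runs the same entry point as the `scrapy` executable
    cmd = [sys.executable, "-m", "scrapy", "crawl", spider_name]
    if not log:
        cmd.append("--nolog")  # do not print the log
    return cmd


def run_spider(spider_name, log=True):
    """Run the spider in a child process and return its exit code."""
    return subprocess.run(crawl_command(spider_name, log)).returncode
```

Calling run_spider('chouti', log=False) from the project directory corresponds to scrapy crawl chouti --nolog.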

Example

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    '''
    Crawl post information from chouti.com
    '''
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        # Get the parent div of the post list
        content_div = response.xpath('//div[@id="content-list"]')
        # Get the list of post items
        items_list = content_div.xpath('.//div[@class="item"]')
        # Open a file handle so that the extracted data can be written to a file
        with open('articles.log', 'a', encoding='utf-8') as f:
            # Loop over items_list
            for item in items_list:
                # Get the text and the URL of the first <a> tag in each item
                text = item.xpath('.//a/text()').extract_first()
                href = item.xpath('.//a/@href').extract_first()
                # extract_first() returns None when nothing matches; skip such items
                if href is None or text is None:
                    continue
                # print(href, text.strip())
                # print('-'*100)
                f.write(href + '\n')
                f.write(text.strip() + '\n')
                f.write('-' * 100 + '\n')

        # Get the pagination links so that every page gets crawled.
        # extract() returns a list with the href of every <a> in the pagination div
        page_hrefs = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page_href in page_hrefs:
            # Resolve the (possibly relative) href against the URL of the current page
            page_url = response.urljoin(page_href)
            # The important step:
            # instantiate a Request object and yield it. Scrapy then downloads
            # the link given in url and calls the callback with the response.
            # By default Scrapy skips URLs it has already requested, so the
            # repeated page links do not cause an endless loop.
            yield Request(url=page_url, callback=self.parse)
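
The pagination step joins the site root with each href taken from the page links. The sketch below tries that joining offline with the standard library's urljoin, which handles both relative and absolute hrefs and is also what Scrapy's response.urljoin builds on. The helper name build_page_urls and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin


def build_page_urls(base_url, hrefs):
    """Turn extracted hrefs into absolute URLs, dropping repeats but keeping order."""
    seen = set()
    urls = []
    for href in hrefs:
        # urljoin resolves a relative href against base_url and
        # leaves an already absolute href unchanged
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


# Hypothetical hrefs, standing in for what the pagination XPath might return
sample_hrefs = [
    "/all/hot/recent/2",
    "https://dig.chouti.com/all/hot/recent/3",
    "/all/hot/recent/2",
]
page_urls = build_page_urls("https://dig.chouti.com/", sample_hrefs)
```

Plain string concatenation would break on the absolute href in this sample, which is one reason to prefer a URL join. The explicit set is only there to make the offline example tidy; inside a spider, Scrapy's own duplicate filter already drops repeated requests by default.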
