Paginated web scraping with Scrapy (a hands-on tutorial post on crawling a website's product information, including image downloads)

Published by 优采云: 2021-11-18 15:15

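　　Overview

　　This post documents the whole process of scraping a website's product information (including image downloads) with Scrapy. It can also serve as a hands-on Scrapy tutorial.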

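　　First, start from the All Products page and collect the links to every category page.

　　Next, collect the links to the product detail pages from each category page.

　　Finally, parse the response of each product detail page, extract the required data, and download the related images.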

　　Getting started

　　First install Scrapy with pip:

　　pip install scrapy

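　　Creating the project

　　In the PyCharm working directory, create a new directory named scrapy_demo (other Scrapy projects can be placed there later). Open a terminal, cd into scrapy_demo, and create the project with the scrapy command: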

　　scrapy startproject product

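　　Here product is the name of the crawler project; you can change it.

　　The project should now have the standard Scrapy directory structure (products_spider.py is added later).

　　For a detailed explanation of the directory structure, see the official documentation.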
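
　　For reference, the files that scrapy startproject product generates can be listed and checked with a short script. This is a minimal sketch: the helper name missing_files is hypothetical, and spiders/products_spider.py is the file this tutorial adds by hand, not one that Scrapy generates.

```python
from pathlib import Path

# Files generated by `scrapy startproject product`, relative to the outer
# project directory, plus the spider file that this tutorial adds by hand.
EXPECTED_FILES = [
    "scrapy.cfg",
    "product/__init__.py",
    "product/items.py",
    "product/middlewares.py",
    "product/pipelines.py",
    "product/settings.py",
    "product/spiders/__init__.py",
    "product/spiders/products_spider.py",  # added later, not generated
]


def missing_files(project_root):
    """Return the expected files that do not exist under project_root."""
    root = Path(project_root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumes the script is run from scrapy_demo, next to the outer "product" directory.
    for name in missing_files("product"):
        print("missing:", name)
```

　　Right after startproject, only the last entry should be reported as missing.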

　　Initializing the spider

# products_spider.py (inside the project's spiders package)
import scrapy

from ..items import ProductItem


class ProductsSpider(scrapy.Spider):
    """Products Spider"""

    name = "products"  # spider name, needed later to start the crawl
    host = 'http://www.example.com'

    def start_requests(self):
        urls = [
            'http://www.example.com/products.html'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response, **kwargs):
        # TODO: handle the home page response
        pass

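　　How the spider runs: Scrapy first calls start_requests, which yields one Request per URL in the list. Each response is received and handled by the function passed as that request's callback argument. Here, the response to the home page request is handled by parse.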
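
　　The request and callback flow can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Scrapy's actual engine: ToyRequest, run, HOST, and the FAKE_SITE dictionary (including the category URL in it) are hypothetical stand-ins, and no network access takes place.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable

HOST = "http://www.example.com"

# Hypothetical stand-in for the network: URL -> links found on that page.
FAKE_SITE = {
    HOST + "/products.html": ["/category-a.html"],
    HOST + "/category-a.html": [],
}


@dataclass
class ToyRequest:
    url: str
    callback: Callable


def start_requests():
    # Like the spider's start_requests: yield a request for the start URL.
    yield ToyRequest(url=HOST + "/products.html", callback=parse)


def parse(response):
    # A callback may yield follow-up requests with a different callback.
    for href in response["links"]:
        yield ToyRequest(url=HOST + href, callback=parse_category)


def parse_category(response):
    # A callback may also yield scraped items (here a plain dict).
    yield {"category_url": response["url"]}


def run(initial_requests):
    """Process queued requests, handing each response to its request's callback."""
    queue = deque(initial_requests)
    items = []
    while queue:
        request = queue.popleft()
        response = {"url": request.url, "links": FAKE_SITE[request.url]}
        for result in request.callback(response):
            if isinstance(result, ToyRequest):
                queue.append(result)  # follow-up request: schedule it
            else:
                items.append(result)  # anything else is treated as an item
    return items


if __name__ == "__main__":
    print(run(start_requests()))
```

　　The real engine downloads responses asynchronously, but the rule is the same: whatever function a request names as its callback receives that request's response.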

# items.py: defines the ProductItem imported by the spider
import scrapy


class ProductItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()         # product name
    images = scrapy.Field()       # product images, a list
    category = scrapy.Field()     # product category name
    price = scrapy.Field()        # product price
    description = scrapy.Field()  # product description, long text

　　Handling responses

　　Back in products_spider.py, the next step is to handle the responses.

    def parse(self, response, **kwargs):
        # Collect the URL of each category page from the home page
        tree = etree.HTML(response.text)  # requires `from lxml import etree` at the top of the file
        hrefs = tree.xpath("hrefs xpath expression")  # placeholder XPath
        for href in hrefs:
            # Request each category page
            yield scrapy.Request(url=self.host + href, callback=self.parse_category)

    def parse_category(self, response):
        # Collect the product detail page URLs from the category page
        tree = etree.HTML(response.text)
        product_urls = tree.xpath("products url xpath expression")  # placeholder XPath
        category = tree.xpath("category text xpath expression")[0]  # placeholder XPath
        for url in product_urls:
            # Request each product detail page
            yield scrapy.Request(url=self.host + url, callback=self.parse_product)

    def parse_product(self, response):
        # Parse the product detail page and gather the data into an Item
        tree = etree.HTML(response.text)
        item = ProductItem()
        item['name'] = tree.xpath('xxxxx/text()')[0]  # placeholder XPath
        yield item
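
　　The XPath placeholders depend on the target site's markup, which the post does not show. As a minimal sketch, assume a category list shaped like the hypothetical SAMPLE_HTML below. To run without extra packages, the example swaps lxml for the standard library's xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup. It also uses urllib.parse.urljoin to show a more robust alternative to self.host + href, which only works when every href starts with a slash. Inside a Scrapy callback, response.urljoin(href) performs the equivalent resolution against the response's URL.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

PAGE_URL = "http://www.example.com/products.html"

# Hypothetical markup; the real site's structure is not given in the post.
SAMPLE_HTML = """
<html><body>
  <ul class="categories">
    <li><a href="/category/a.html">A</a></li>
    <li><a href="category/b.html">B</a></li>
    <li><a href="http://www.example.com/category/c.html">C</a></li>
  </ul>
</body></html>
"""


def category_urls(html, page_url):
    """Extract category links and resolve them against the page URL."""
    root = ET.fromstring(html)
    anchors = root.findall(".//ul[@class='categories']/li/a")
    return [urljoin(page_url, a.get("href")) for a in anchors]


if __name__ == "__main__":
    for url in category_urls(SAMPLE_HTML, PAGE_URL):
        print(url)
```

　　All three links resolve to URLs under http://www.example.com/category/, whereas plain concatenation would turn the second one into http://www.example.comcategory/b.html.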

　　Passing arguments to callbacks

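    def parse_category(self, response):
        # ... (omitted, same as before)
        # The category name is only known on the category page, so hand it
        # to parse_product through cb_kwargs.
        yield scrapy.Request(url=self.host + url, callback=self.parse_product, cb_kwargs={'cate': category})

    def parse_product(self, response, cate):
        # ... (omitted, same as before)
        item['category'] = cate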
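
　　Scrapy passes each key in cb_kwargs to the callback as a keyword argument, so the key 'cate' must match the parameter name in parse_product. The plain-Python sketch below mimics that calling convention; dispatch is a hypothetical helper, not part of Scrapy, and the dictionary stands in for a real response.

```python
def dispatch(callback, response, cb_kwargs=None):
    """Call the callback with the response first, then cb_kwargs as keyword arguments."""
    return callback(response, **(cb_kwargs or {}))


def parse_product(response, cate):
    # The parameter name "cate" must equal the key used in cb_kwargs.
    return {"name": response["name"], "category": cate}


if __name__ == "__main__":
    fake_response = {"name": "Sample product"}  # stand-in for a real response

    # Matching key: the category reaches the callback.
    print(dispatch(parse_product, fake_response, cb_kwargs={"cate": "Sample category"}))

    # Mismatched key: Python raises TypeError for the unexpected keyword.
    try:
        dispatch(parse_product, fake_response, cb_kwargs={"category": "Sample category"})
    except TypeError as exc:
        print("mismatched key:", exc)
```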

　　Image download (to be continued)

　　Use Scrapy's built-in item pipeline, ImagesPipeline.
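
　　Since this section is unfinished, the following is only a sketch of the settings that the built-in ImagesPipeline typically needs, not the author's configuration. The storage directory images is an assumed value. By default the pipeline reads image URLs from an item field named image_urls and writes the download results to a field named images, so ProductItem would also need an image_urls field (an assumption; the item defined earlier only has images). The pipeline also requires the Pillow package.

```python
# settings.py (fragment)

# Enable the built-in images pipeline; the number sets its order among pipelines.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Directory where downloaded images are stored (assumed value).
IMAGES_STORE = "images"

# Item fields used by the pipeline; these are Scrapy's defaults, spelled out for clarity.
IMAGES_URLS_FIELD = "image_urls"  # input: list of image URLs to download
IMAGES_RESULT_FIELD = "images"    # output: information about the downloaded files
```

　　With settings like these, parse_product would fill item['image_urls'] with absolute image URLs and leave images for the pipeline to populate.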

　　Related resources

　　Official Scrapy documentation
