A Complete Scraping Tool (advanced Python materials and development tutorials; anyone leveling up or wanting to go deeper into Python is welcome!)

优采云, published: 2022-02-28 11:04

　　Scraping website data with Python. This tutorial uses a Scrapy spider.

　　1. Install the Scrapy framework

　　Run on the command line:

　　 pip install scrapy

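　　If Scrapy's dependencies conflict with other Python packages you have already installed, install it inside a virtualenv instead.

　　Once installation finishes, create the crawler project in any folder:

　　scrapy startproject <your_project_name>

　　Project directory layout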

　　[Image: screenshot of the generated project directory, https://img-blog.csdnimg.cn/20201223093154574.png]

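　　Crawling rules (the spiders) go in the spiders directory

　　items.py - the data fields to scrape

　　pipelines.py - saves the scraped data

　　settings.py - configuration

　　middlewares.py - downloader middleware

　　Below is the source code for scraping a novel website.

　　First, define the data to scrape in items.py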

# author 小白
import scrapy


class BookspiderItem(scrapy.Item):
    # Fields collected for each chapter of a book
    i = scrapy.Field()                  # chapter index, used as the file name prefix
    book_name = scrapy.Field()          # book title, used as the folder name
    book_img = scrapy.Field()
    book_author = scrapy.Field()
    book_last_chapter = scrapy.Field()
    book_last_time = scrapy.Field()
    book_list_name = scrapy.Field()     # chapter title, used in the file name
    book_content = scrapy.Field()       # chapter text written to disk

　　Write the scraping rules
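　　The rules live in a spider file inside the spiders directory. What follows is only a minimal sketch of what such a spider might look like, built to fit the item fields defined earlier and the spider name BookSpider used in the crawl command. The start URL, the CSS selectors (h1, #list a, #content) and the method name parse_chapter are assumptions made for illustration and must be replaced with values taken from the target site's real HTML. For brevity it yields plain dicts with the same keys as BookspiderItem, which Scrapy pipelines also accept.

```python
import scrapy


class BookSpider(scrapy.Spider):
    name = "BookSpider"
    # Hypothetical index page of one book; replace with a real URL
    start_urls = ["https://example.com/book/1/"]

    def parse(self, response):
        # Assumed selector for the book title
        book_name = response.css("h1::text").get(default="").strip()
        # Assumed selector for the chapter links on the index page
        for i, link in enumerate(response.css("#list a"), start=1):
            yield response.follow(
                link,
                callback=self.parse_chapter,
                cb_kwargs={
                    "i": i,
                    "book_name": book_name,
                    "book_list_name": link.css("::text").get(default="").strip(),
                },
            )

    def parse_chapter(self, response, i, book_name, book_list_name):
        # Assumed selector for the chapter body; join the non-empty text nodes
        lines = [t.strip() for t in response.css("#content::text").getall()]
        yield {
            "i": i,
            "book_name": book_name,
            "book_list_name": book_list_name,
            "book_content": "\n".join(t for t in lines if t),
        }
```

　　Passing the chapter index and titles through cb_kwargs keeps each chapter request self-contained, so the pipeline receives everything it needs to build the file path from a single item.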

　　Save the data (pipelines.py)

import os


class BookspiderPipeline(object):
    def process_item(self, item, spider):
        # Root folder for saved books ("小说" means "novels"); adjust to your machine
        curPath = 'E:/小说/'
        # One sub-folder per book
        tempPath = str(item['book_name'])
        targetPath = curPath + tempPath
        if not os.path.exists(targetPath):
            os.makedirs(targetPath)
        # File name = chapter index + chapter title
        book_list_name = str(item['i']) + item['book_list_name']
        filename_path = targetPath + '/' + book_list_name + '.txt'
        print('------------')
        print(filename_path)
        # Append mode: re-running the crawl adds to an existing chapter file
        with open(filename_path, 'a', encoding='utf-8') as f:
            f.write(item['book_content'])
        return item
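　　A pipeline only runs once it is registered in settings.py. A minimal fragment is shown here, assuming the project package is named BookSpider (the real name is whatever was passed to scrapy startproject). The value 300 is an arbitrary position in Scrapy's conventional 0 to 1000 ordering, where lower numbers run first.

```python
# settings.py (fragment)
# "BookSpider" is an assumed package name; use your own project's name here.
ITEM_PIPELINES = {
    "BookSpider.pipelines.BookspiderPipeline": 300,
}
```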

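　　Run

　　scrapy crawl BookSpider

　　That completes a program for scraping novels.

　　One recommendation: open the page in the Scrapy shell

　　scrapy shell <URL of the page to scrape>

　　then call response.css('') with your selector to check that the rule matches what you expect.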

