Practical tutorial: a complete guide to scraping a novel website with Python (full code included)

优采云, published 2022-11-01 22:21

This tutorial scrapes website data with Python, using a Scrapy spider.

1. Install the Scrapy framework

Run on the command line:

    pip install scrapy

If Scrapy's dependencies conflict with other Python packages you already have installed, install it inside a virtualenv instead.
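Before installing, it can help to confirm that the interpreter you are about to use really belongs to an isolated environment. A minimal sketch using only the standard library; the helper name `in_virtualenv` is made up for this illustration:

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv/virtualenv, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base installation.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print("Virtual environment active:", sys.prefix)
    else:
        print("System interpreter in use; consider creating a virtualenv first.")
```

Run it with the same `python` that `pip` belongs to, so the answer applies to the environment Scrapy will be installed into.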

Once installation finishes, pick any folder and create the crawler project in it:

    scrapy startproject your_project_name

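Project directory layout

spiders/ – the crawl rules are written in this directory

items.py – the data fields to be scraped

pipelines.py – saves the scraped data

settings.py – configuration

middlewares.py – downloader middleware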
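As an illustration of what goes into settings.py: Scrapy only calls an item pipeline that is registered in `ITEM_PIPELINES`. The fragment below assumes the project was created under the name `BookSpider`, so the module path is a guess to adjust to your own project name. The `ROBOTSTXT_OBEY` and `DOWNLOAD_DELAY` values are likewise illustrative choices, not settings taken from the original project.

```python
# settings.py (fragment)

# Register the pipeline; the number (0-1000) sets the run order when
# several pipelines are enabled, lower values run first.
ITEM_PIPELINES = {
    "BookSpider.pipelines.BookspiderPipeline": 300,  # assumed project name
}

# Illustrative politeness settings
ROBOTSTXT_OBEY = True   # honour the site's robots.txt
DOWNLOAD_DELAY = 0.5    # seconds to wait between requests to the same site
```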

Below is the source code for scraping the novel website.

First, define the data to scrape in items.py:

# author: 小白
import scrapy


class BookspiderItem(scrapy.Item):
    # Fields collected for every chapter that is scraped
    i = scrapy.Field()                  # chapter position within the book
    book_name = scrapy.Field()          # book title
    book_img = scrapy.Field()           # cover image URL
    book_author = scrapy.Field()        # author line
    book_last_chapter = scrapy.Field()  # latest chapter line
    book_last_time = scrapy.Field()     # last update line
    book_list_name = scrapy.Field()     # chapter title
    book_content = scrapy.Field()       # chapter text

Write the crawl rules

# author: 小白
import scrapy

from ..items import BookspiderItem


class Book(scrapy.Spider):
    name = "BookSpider"
    start_urls = [
        'http://www.xbiquge.la/xiaoshuodaquan/'
    ]

    def parse(self, response):
        # Index page: each <li> in the first .novellist block links to one book
        bookAllList = response.css('.novellist:first-child>ul>li')
        for book in bookAllList:
            booklist = book.css('a::attr(href)').extract_first()
            if booklist:  # skip list items that carry no link
                yield scrapy.Request(response.urljoin(booklist), callback=self.list)

    def list(self, response):
        # Book page: collect the book-level fields once, then queue every chapter
        bookInfo = {
            'book_name': response.css('#info>h1::text').extract_first(),
            'book_img': response.css('#fmimg>img::attr(src)').extract_first(),
            'book_author': response.css('#info p:nth-child(2)::text').extract_first(),
            'book_last_chapter': response.css('#info p:last-child::text').extract_first(),
            'book_last_time': response.css('#info p:nth-last-child(2)::text').extract_first(),
        }
        chapter_links = response.css('#list>dl>dd>a::attr(href)').extract()
        # i records each chapter's position in the table of contents, so the
        # data can be saved in order even though responses arrive in any order
        for i, href in enumerate(chapter_links, start=1):
            yield scrapy.Request(
                response.urljoin(href),  # resolves the href against the site URL
                meta={**bookInfo, 'i': i},
                callback=self.info,
            )

    def info(self, response):
        # Chapter page: combine the book-level fields with the chapter text
        self.log(response.meta['book_name'])
        content = response.css('#content::text').extract()
        item = BookspiderItem()
        for key in ('i', 'book_name', 'book_img', 'book_author',
                    'book_last_chapter', 'book_last_time'):
            item[key] = response.meta[key]
        item['book_list_name'] = response.css('.bookname h1::text').extract_first()
        item['book_content'] = ''.join(content)
        yield item
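Chapter hrefs have to be absolute URLs before they can be requested. Prefixing the site root by hand works only for root-relative paths, while `urljoin` from the standard library also covers page-relative and already-absolute links. A small sketch of that resolution step; the sample paths are invented for illustration and are not taken from the real site:

```python
from urllib.parse import urljoin

# Hypothetical book page URL and chapter hrefs, for illustration only
book_page = "http://www.xbiquge.la/0/10/"
hrefs = ["/0/10/5001.html", "5002.html", "http://www.xbiquge.la/0/10/5003.html"]

# urljoin handles root-relative, page-relative and already-absolute links alike
chapter_urls = [urljoin(book_page, href) for href in hrefs]

for url in chapter_urls:
    print(url)
```

Inside a Scrapy callback, `response.urljoin(href)` performs the same resolution against the URL of the current response.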

Save the data

import os


class BookspiderPipeline:
    def process_item(self, item, spider):
        curPath = 'E:/小说/'  # root output folder; change it to suit your machine
        # One folder per book
        targetPath = os.path.join(curPath, str(item['book_name']))
        os.makedirs(targetPath, exist_ok=True)
        # Prefix the chapter title with its index so the files keep reading order
        book_list_name = str(item['i']) + str(item['book_list_name'])
        filename_path = os.path.join(targetPath, book_list_name + '.txt')
        print('------------')
        print(filename_path)
        # Mode 'a' appends, so re-running the crawl adds to existing files
        with open(filename_path, 'a', encoding='utf-8') as f:
            f.write(item['book_content'])
        return item
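Two details of the file naming are worth a closer look: a bare index such as 10 sorts before 2 in most file listings, and chapter titles can contain characters that Windows does not allow in file names. A minimal sketch of one way to handle both; the helper name `chapter_filename` and the four-digit padding width are assumptions made for this example, not part of the original pipeline:

```python
import re

# Characters that Windows rejects in file names
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def chapter_filename(index: int, title: str, width: int = 4) -> str:
    """Build a sortable, filesystem-safe file name for one chapter."""
    safe_title = _INVALID_CHARS.sub("_", title or "").strip()
    # Zero-padding keeps alphabetical order equal to chapter order
    return f"{index:0{width}d}_{safe_title}.txt"


# Sample titles are invented; they arrive out of order, as crawl results do
samples = [(10, "Chapter 10"), (2, "What now?"), (1, "Start: a/b")]
names = sorted(chapter_filename(i, title) for i, title in samples)
print(names)
```

A helper like this could replace the string concatenation that builds `book_list_name` in `process_item`, at the cost of choosing a padding width large enough for the longest book.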

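Run

    scrapy crawl BookSpider

This completes the scraping of the novels.

A tip: open the page you want to scrape with

    scrapy shell <URL of the page to scrape>

and then call response.css('your selector') to check that the rule is correct.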

