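Scraping web page data with a crawler (Scrapy: a fast, high-level framework written in Python, its network communication and overall architecture)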

Published by 优采云: 2021-12-11 03:08

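Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing.

Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows (note: the image comes from the web):

Scrapy mainly consists of the following components: the engine, the scheduler, the downloader, the spiders, the item pipeline, and the downloader and spider middlewares.

With Scrapy, online data collection becomes easy: the framework already does much of the work, so there is no need to develop it yourself.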

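1. Install Python

At the time these notes were written, the latest Scrapy release was 0.22.2. That release requires Python 2.7, so Python 2.7 has to be installed first. I tested on a CentOS server, which ships with its own Python, so the first step is to check the Python version.

Check the Python version:

$ python -V
Python 2.6.6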
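The same check can be made from inside Python. This is a minimal sketch: the helper name meets_minimum is hypothetical (it is not part of Scrapy or the standard library), and the 2.7 minimum is simply the requirement quoted above for Scrapy 0.22.2.

```python
import sys

# Minimum interpreter version quoted above for Scrapy 0.22.2.
REQUIRED = (2, 7)


def meets_minimum(version_info, required=REQUIRED):
    """Return True when the (major, minor) pair is at least `required`."""
    return tuple(version_info[:2]) >= tuple(required)


if __name__ == "__main__":
    # sys.version_info describes the interpreter running this script.
    current = tuple(sys.version_info[:2])
    if meets_minimum(sys.version_info):
        print("Python %d.%d is new enough" % current)
    else:
        print("Python %d.%d is too old; upgrade first" % current)
```

Comparing tuples rather than version strings avoids the pitfall where "2.10" sorts before "2.7" as text.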

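Upgrade to 2.7:

# Python 2.7.6:
$ wget http://python.org/ftp/python/2.7.6/Python-2.7.6.tar.xz
$ tar xf Python-2.7.6.tar.xz
$ cd Python-2.7.6
$ ./configure --prefix=/usr/local --enable-unicode=ucs4 --enable-shared LDFLAGS="-Wl,-rpath /usr/local/lib"
$ make && make altinstall

Create a symlink so that the system default python points to python2.7:

$ mv /usr/bin/python /usr/bin/python2.6.6
$ ln -s /usr/local/bin/python2.7 /usr/bin/python

Note that yum on this system is itself a Python 2.6 script, so after this change the first line of /usr/bin/yum may need to point at /usr/bin/python2.6.6 for the yum commands below to keep working.

Check the Python version again:

$ python -V
Python 2.7.6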

Install setuptools

Here we use wget to fetch the setuptools installer and run it:

$ wget https://bootstrap.pypa.io/ez_setup.py -O - | python

Install zope.interface

$ easy_install zope.interface

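Install Twisted

Scrapy uses the Twisted asynchronous networking library to handle network communication, so Twisted must be installed.

Before installing Twisted, install gcc:

$ yum install gcc -y

Then install Twisted with easy_install:

$ easy_install twisted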

If the following error appears:

$ easy_install twisted
Searching for twisted
Reading https://pypi.python.org/simple/twisted/
Best match: Twisted 14.0.0
Downloading https://pypi.python.org/packages/source/T/Twisted/Twisted-14.0.0.tar.bz2#md5=9625c094e0a18da77faa4627b98c9815
Processing Twisted-14.0.0.tar.bz2
Writing /tmp/easy_install-kYHKjn/Twisted-14.0.0/setup.cfg
Running Twisted-14.0.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-kYHKjn/Twisted-14.0.0/egg-dist-tmp-vu1n6Y
twisted/runner/portmap.c:10:20: error: Python.h: No such file or directory
twisted/runner/portmap.c:14: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token
twisted/runner/portmap.c:31: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token
twisted/runner/portmap.c:45: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘PortmapMethods’
twisted/runner/portmap.c: In function ‘initportmap’:
twisted/runner/portmap.c:55: warning: implicit declaration of function ‘Py_InitModule’
twisted/runner/portmap.c:55: error: ‘PortmapMethods’ undeclared (first use in this function)
twisted/runner/portmap.c:55: error: (Each undeclared identifier is reported only once
twisted/runner/portmap.c:55: error: for each function it appears in.)

the Python headers are missing. Install python-devel and run the command again:

$ yum install python-devel -y
$ easy_install twisted

If the following error appears:

error: Not a recognized archive type: /tmp/easy_install-tVwC5O/Twisted-14.0.0.tar.bz2

download and install the package manually:

$ wget https://pypi.python.org/packages/source/T/Twisted/Twisted-14.0.0.tar.bz2#md5=9625c094e0a18da77faa4627b98c9815
$ tar -vxjf Twisted-14.0.0.tar.bz2
$ cd Twisted-14.0.0
$ python setup.py install

Install pyOpenSSL

First install some dependencies:

$ yum install libffi libffi-devel openssl-devel -y

Then install pyOpenSSL with easy_install:

$ easy_install pyOpenSSL

Install Scrapy

First install some dependencies:

$ yum install libxml2 libxslt libxslt-devel -y

Finally, install Scrapy:

$ easy_install scrapy
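Once the steps above finish, a quick way to confirm that everything landed is to try importing each package. The sketch below assumes the import names zope.interface, twisted, OpenSSL (the module that pyOpenSSL provides), and scrapy; the helper check_imports is a hypothetical name chosen for illustration.

```python
import importlib

# Import names for the packages installed above.
# pyOpenSSL is imported under the module name "OpenSSL".
MODULES = ["zope.interface", "twisted", "OpenSSL", "scrapy"]


def check_imports(names):
    """Map each module name to True if it imports cleanly, else False."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    results = check_imports(MODULES)
    for name in MODULES:
        print("%-15s %s" % (name, "ok" if results[name] else "MISSING"))
```

A module reported as MISSING points back to the corresponding install step.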

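2. Using Scrapy

Once installation succeeds, you can learn Scrapy's basic concepts and usage by studying dirbot, an example Scrapy project.

The dirbot project is hosted online and includes a README file that describes its contents in detail. If you are familiar with Git, you can check out its source code. Alternatively, you can download it as a tarball or zip file by clicking the download link.

The following example shows how to create a crawler project with Scrapy.

Create a project

Before crawling, you need to create a new Scrapy project. Change to the directory where you want to keep the code, then run:

$ scrapy startproject tutorial

This command creates a new directory named tutorial under the current directory, with the following structure: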

.
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

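These files are, in brief: scrapy.cfg, the project configuration file; tutorial/, the project's Python module; tutorial/items.py, the item definitions; tutorial/pipelines.py, the item pipelines; tutorial/settings.py, the project settings; and tutorial/spiders/, the directory that holds the spiders.

Define an Item

Items are containers that will be loaded with the scraped data. They work like Python dicts but offer extra protection: assigning to an undeclared field is rejected, which guards against typos.

An Item is declared by creating a scrapy.item.Item class and defining its attributes as scrapy.item.Field objects, much like an object-relational mapping (ORM).

We model the item to control which data we collect from the site. For example, we want the name, URL, and description of each site, so we define a field for each of these three attributes. To do so, we edit the items.py file in the tutorial directory. Our Item class looks like this: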

from scrapy.item import Item, Field


class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

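This may look a little confusing at first, but defining these items lets you know exactly what your items are when you work with other Scrapy components.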
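To make the "dict with extra protection" idea concrete, here is a plain-Python sketch of a container that rejects undeclared fields. It is not Scrapy's implementation: StrictRecord and SiteRecord are hypothetical stand-ins written only for illustration, and the sketch guards item assignment alone (not the constructor or update()).

```python
class StrictRecord(dict):
    """Dict that only accepts keys listed in `fields` (stand-in for an Item)."""

    fields = ()

    def __setitem__(self, key, value):
        # Reject any key that was not declared up front.
        if key not in self.fields:
            raise KeyError("%s does not support field: %s"
                           % (type(self).__name__, key))
        dict.__setitem__(self, key, value)


class SiteRecord(StrictRecord):
    # The same three fields as the DmozItem class above.
    fields = ("title", "link", "desc")


record = SiteRecord()
record["title"] = "Example site"       # declared field: accepted

try:
    record["tilte"] = "typo"           # misspelled field: rejected
except KeyError as exc:
    print("rejected: %s" % exc)
```

With an ordinary dict the misspelled key would be stored silently and the mistake would surface much later, if at all.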

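Write a Spider

A Spider is a user-written class that scrapes information from a domain (or group of domains). It defines the initial list of URLs to download, how to follow links, and how to parse page contents to extract items.

To create a Spider, subclass scrapy.spider.BaseSpider and define three main mandatory attributes: name, the identifier of the spider; start_urls, the list of URLs where the crawl begins; and parse(), the callback that receives the downloaded response for each start URL.

The parse() method is responsible for parsing the response data, extracting the scraped data (as items), and finding more URLs to follow.

Create DmozSpider.py in the tutorial/spiders directory: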

from scrapy.spider import BaseSpider


class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Name the output file after the last path segment of the URL
        filename = response.url.split("/")[-2]
        # Save the raw page body; the with block closes the file afterwards
        with open(filename, 'wb') as f:
            f.write(response.body)
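The file name that parse() derives from each URL can be checked without running a crawl. The sketch below reuses the expression from the spider above; output_filename is a hypothetical helper name introduced only for this illustration.

```python
START_URLS = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]


def output_filename(url):
    """Second-to-last '/'-separated piece, the same expression parse() uses."""
    # The URL ends with '/', so the last piece is an empty string and the
    # piece before it is the final path segment.
    return url.split("/")[-2]


filenames = [output_filename(url) for url in START_URLS]
for name in filenames:
    print(name)
```

Each start URL therefore ends up in a file named after its final path segment.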

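Run the project

$ scrapy crawl dmoz

This command starts the spider for the dmoz.org domain. The third argument, dmoz, is the value of the name attribute defined in DmozSpider.py.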

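XPath selectors

Scrapy uses a mechanism called XPath selectors, which is based on XPath expressions. To learn more about selectors and other mechanisms, see the Scrapy documentation.

Here are some examples of XPath expressions and their meanings:

/html/head/title: selects the <title> element inside the <head> of the HTML document
/html/head/title/text(): selects the text inside that <title> element
//td: selects all <td> elements
//div[@class="mine"]: selects all <div> elements that have the attribute class="mine"

These are only a few simple examples. XPath is in fact far more powerful, and working through an XPath tutorial is recommended if you want to learn more.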
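A few path expressions can be tried with nothing but the standard library. Note the substitution: the sketch below uses xml.etree.ElementTree, which supports only a limited subset of XPath (relative paths, attribute predicates, no text() or @attr steps), rather than Scrapy's selectors. The sample page is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page.
PAGE = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div class="mine">first</div>
    <div class="other">second</div>
    <table><tr><td>a</td><td>b</td></tr></table>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)  # root is the <html> element

# "head/title": the <title> inside <head>, relative to <html>
title = root.find("head/title").text

# ".//td": every <td> element anywhere below the root
cells = [td.text for td in root.findall(".//td")]

# ".//div[@class='mine']": every <div> whose class attribute equals "mine"
mine = [div.text for div in root.findall(".//div[@class='mine']")]

print("%s %s %s" % (title, cells, mine))
```

ElementTree paths are written relative to the element they are called on, which is why the expressions start with a tag name or ".//" rather than a leading "/".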

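To make XPath easier to use, Scrapy provides the Selector class, which has three methods: xpath(), which returns a list of selectors, each representing the nodes selected by the XPath expression passed as its argument; extract(), which returns the selected data as unicode strings; and re(), which returns a list of unicode strings extracted by applying a regular expression.

Extract the data

We can select every <li> element on the page with:

sel.xpath('//ul/li')

The site descriptions:

sel.xpath('//ul/li/text()').extract()

The site titles:

sel.xpath('//ul/li/a/text()').extract()

The site links:

sel.xpath('//ul/li/a/@href').extract()

As noted above, every xpath() call returns a list of selectors, so we can chain further xpath() calls to reach deeper nodes. We will use that here: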

sites = sel.xpath('//ul/li')
for site in sites:
    title = site.xpath('a/text()').extract()
    link = site.xpath('a/@href').extract()
    desc = site.xpath('text()').extract()
    print title, link, desc
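The nested pattern above (select the rows first, then query inside each row) can be reproduced with the standard library to see what each step yields. This sketch swaps in xml.etree.ElementTree for Scrapy's Selector, so the method names differ, and the listing markup and example.com URLs are made up.

```python
import xml.etree.ElementTree as ET

# Made-up directory listing in the shape the loop above expects.
LISTING = """
<ul>
  <li><a href="http://example.com/a">Site A</a> - first description</li>
  <li><a href="http://example.com/b">Site B</a> - second description</li>
</ul>
""".strip()

root = ET.fromstring(LISTING)  # root is the <ul> element

rows = []
for li in root.findall("li"):        # outer selection: each <li>
    a = li.find("a")                 # inner selection, relative to this <li>
    title = a.text                   # counterpart of a/text()
    link = a.get("href")             # counterpart of a/@href
    desc = (a.tail or "").strip()    # text following </a> inside the <li>
    rows.append((title, link, desc))

for row in rows:
    print(row)
```

The inner queries are relative to the current row, so each title, link, and description stays paired with its own list entry.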

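Use the Item

The interface of scrapy.item.Item is similar to a Python dict, and an Item contains multiple scrapy.item.Field objects. This is similar to Django's models.

An Item is typically used in the Spider's parse method to hold the parsed data.

Finally, modify the spider class to store the data in an Item. The code is as follows: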

from scrapy.spider import Spider
from scrapy.selector import Selector

from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below is a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s([^\n]*?)\\n')
            items.append(item)

        return items
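The regular expression passed to re() in the spider above is easier to read in isolation. The sketch below applies the same pattern (spelled as a raw string) with the standard re module; re.findall stands in for the selector's re() method, and the text nodes are made-up samples.

```python
import re

# Raw-string spelling of the pattern used in the spider above:
# a hyphen, one whitespace character, then everything up to the newline.
DESCRIPTION = re.compile(r'-\s([^\n]*?)\n')

# Made-up text nodes in the shape " - description<newline>".
text_nodes = [
    "\n",
    " - A hands-on introduction to Python.\n",
    " - By Jane Doe; covers the standard library.\n",
]

descriptions = []
for node in text_nodes:
    # findall returns the captured group for every match in the node.
    descriptions.extend(DESCRIPTION.findall(node))

print(descriptions)
```

Text nodes that contain no "- " prefix, such as the bare newline, contribute nothing, which is how the spider skips whitespace between elements.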

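Now you can run the project again to see the results:

$ scrapy crawl dmoz

Use an Item Pipeline

Set ITEM_PIPELINES in settings.py. It defaults to [] and works much like Django's MIDDLEWARE_CLASSES.

The Items returned by the Spider's parse method are processed in turn by each Pipeline class in the ITEM_PIPELINES list.

An Item Pipeline class must implement the following method: process_item(item, spider), which is called for every item and must either return the item or raise DropItem to discard it.

It may additionally implement two methods: open_spider(spider), called when the spider is opened, and close_spider(spider), called when the spider is closed.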
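The flow through a chain of pipelines can be sketched in plain Python. The method names process_item, open_spider, and close_spider follow Scrapy's pipeline interface, but everything else here is an assumption made for illustration: DropItem is a local stand-in for Scrapy's exception of the same name, the two pipeline classes are invented, and run_pipelines is a hypothetical driver that imitates what the framework does.

```python
class DropItem(Exception):
    """Local stand-in for the exception a pipeline raises to discard an item."""


class RequireNamePipeline(object):
    """Discard items that have no name."""

    def process_item(self, item, spider):
        if not item.get("name"):
            raise DropItem("missing name")
        return item


class CountingPipeline(object):
    """Count the items that reach this stage."""

    def open_spider(self, spider):
        self.seen = 0                     # runs once, before any item

    def process_item(self, item, spider):
        self.seen += 1
        return item                       # hand the item to the next stage

    def close_spider(self, spider):
        print("%s: %d items kept" % (spider, self.seen))


def run_pipelines(items, pipelines, spider="dmoz"):
    """Hypothetical driver: feed each item through the pipelines in order."""
    for p in pipelines:
        if hasattr(p, "open_spider"):
            p.open_spider(spider)
    kept = []
    for item in items:
        try:
            for p in pipelines:
                item = p.process_item(item, spider)
            kept.append(item)
        except DropItem:
            pass                          # dropped items skip later stages
    for p in pipelines:
        if hasattr(p, "close_spider"):
            p.close_spider(spider)
    return kept


counter = CountingPipeline()
kept = run_pipelines(
    [{"name": "Site A"}, {"name": ""}, {"name": "Site B"}],
    [RequireNamePipeline(), counter],
)
```

Order matters: because the validation stage runs first, the counting stage never sees the item with the empty name.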

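Save the scraped data

The simplest way to save the scraped data is through Feed exports, with the following command:

$ scrapy crawl dmoz -o items.json -t json

All scraped items are saved in JSON format in the newly generated items.json file.

Besides JSON, the JSON lines, CSV, and XML formats are also supported, and further formats can be added through item exporters.

This approach is sufficient for small projects. For more complex data you may need to write an Item Pipeline to process it.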
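The difference between the JSON and JSON lines formats mentioned above is easy to see with the standard json module. This sketch only illustrates the two layouts; it does not reproduce Scrapy's exporters, and the items are made-up samples.

```python
import json

# Made-up items standing in for scraped data.
items = [
    {"name": "Site A", "url": "http://example.com/a"},
    {"name": "Site B", "url": "http://example.com/b"},
]

# JSON: a single document that holds the whole list.
as_json = json.dumps(items)

# JSON lines: one self-contained JSON object per line.
as_json_lines = "\n".join(json.dumps(item) for item in items)

print(as_json)
print(as_json_lines)
```

Because every line of a JSON lines file parses on its own, such a file can be appended to and read back one record at a time, whereas a JSON document has to be complete before it can be parsed.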

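Recap

The steps above show how to create a crawler project, and you can follow them again for practice. As a further learning example, see the article "scrapy中文教程（爬取cnbeta示例）" (a Chinese-language Scrapy tutorial that crawls cnbeta).

The crawler code from that article is as follows: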

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from cnbeta.items import CnbetaItem


class CBSpider(CrawlSpider):
    name = 'cnbeta'
    allowed_domains = ['cnbeta.com']
    start_urls = ['http://www.cnbeta.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=('/articles/.*\.htm', )),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = CnbetaItem()
        sel = Selector(response)
        item['title'] = sel.xpath('//title/text()').extract()
        item['url'] = response.url
        return item
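The allow pattern in the rule above decides which links the crawler follows. The sketch below tests that pattern against a few made-up URLs with the standard re module, on the assumption that the pattern is searched for anywhere in the URL; it does not call Scrapy's link extractor.

```python
import re

# Raw-string spelling of the allow pattern from the rule above.
ALLOW = re.compile(r'/articles/.*\.htm')

# Made-up URLs for illustration.
urls = [
    "http://www.cnbeta.com/articles/12345.htm",
    "http://www.cnbeta.com/topics/9.htm",
    "http://www.cnbeta.com/articles/",
]

# Keep only the URLs in which the pattern occurs.
followed = [url for url in urls if ALLOW.search(url)]
print(followed)
```

Only links under /articles/ that end in .htm pass the filter, so topic pages and the bare article index are left alone.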

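Points to note:

3. Learning resources

I came to Scrapy because I wanted to crawl some data from Zhihu (知乎). At first I looked for related material and for other people's implementations.

Several people on GitHub have implemented Zhihu crawling to varying degrees. I found the following repositories:

Other resources:

Crawling and interaction examples:

Some points that still need sorting out:

4. Summary

These are my notes and a summary of what I learned while studying Scrapy over the past few days. I drew on a number of articles found online, with thanks to their authors, and I hope this article is helpful to you. If you have any thoughts, please leave a comment. If you like this article, please share it. Thank you!

Originally published at: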
