Scraping web page data with a crawler (Scrapy, a fast, high-level framework written in Python, and its overall architecture)

Published by 优采云: 2021-10-18 06:00

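　　Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring and automated testing.

　　Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows (note: the image comes from the web):

　　Scrapy mainly consists of the following components:

　　With Scrapy, collecting data online is straightforward: it does much of the work for us, so we do not have to build it ourselves.

　　1. Install Python

　　At the time of writing, the latest Scrapy release is 0.22.2. It requires Python 2.7, so Python 2.7 must be installed first. I am testing on a CentOS server; because the system ships with its own Python, check its version first.

　　Check the Python version: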

　　$ python -V

Python 2.6.6

　　Upgrade to 2.7:

　　# Python 2.7.6:

$ wget http://python.org/ftp/python/2.7.6/Python-2.7.6.tar.xz

$ tar xf Python-2.7.6.tar.xz

$ cd Python-2.7.6

$ ./configure --prefix=/usr/local --enable-unicode=ucs4 --enable-shared LDFLAGS="-Wl,-rpath /usr/local/lib"

$ make && make altinstall

　　Create a symbolic link so that the system default python points to python2.7:

　　$ mv /usr/bin/python /usr/bin/python2.6.6

$ ln -s /usr/local/bin/python2.7 /usr/bin/python

　　Check the Python version again:

　　$ python -V

Python 2.7.6
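　　The same check can also be made from inside Python, which is handy at the top of a setup script. This is a minimal sketch that assumes the Python 2.7 minimum stated above; the helper name meets_minimum is hypothetical.

```python
import sys


def meets_minimum(version_info, minimum=(2, 7)):
    """Return True if the interpreter version is at least `minimum`."""
    # Compare only (major, minor); tuples compare element by element.
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    major_minor = tuple(sys.version_info[:2])
    if meets_minimum(sys.version_info):
        print("Python %d.%d is new enough" % major_minor)
    else:
        print("Python %d.%d is too old; install 2.7 first" % major_minor)
```

　　On the stock CentOS interpreter shown earlier (2.6.6) the check fails, and after the upgrade (2.7.6) it passes.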

　　Install setuptools

　　Here we use wget to install setuptools:

　　$ wget https://bootstrap.pypa.io/ez_setup.py -O - | python

　　Install zope.interface

　　$ easy_install zope.interface

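　　Install Twisted

　　Scrapy uses the Twisted asynchronous networking library to handle network communication, so Twisted must be installed.

　　Before installing Twisted, install gcc: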

　　$ yum install gcc -y

　　Then install Twisted with easy_install:

　　$ easy_install twisted

　　If the following error appears:

　　$ easy_install twisted

Searching for twisted

Reading https://pypi.python.org/simple/twisted/

Best match: Twisted 14.0.0

Downloading https://pypi.python.org/packages/source/T/Twisted/Twisted-14.0.0.tar.bz2#md5=9625c094e0a18da77faa4627b98c9815

Processing Twisted-14.0.0.tar.bz2

Writing /tmp/easy_install-kYHKjn/Twisted-14.0.0/setup.cfg

Running Twisted-14.0.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-kYHKjn/Twisted-14.0.0/egg-dist-tmp-vu1n6Y

twisted/runner/portmap.c:10:20: error: Python.h: No such file or directory

twisted/runner/portmap.c:14: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token

twisted/runner/portmap.c:31: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token

twisted/runner/portmap.c:45: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘PortmapMethods’

twisted/runner/portmap.c: In function ‘initportmap’:

twisted/runner/portmap.c:55: warning: implicit declaration of function ‘Py_InitModule’

twisted/runner/portmap.c:55: error: ‘PortmapMethods’ undeclared (first use in this function)

twisted/runner/portmap.c:55: error: (Each undeclared identifier is reported only once

twisted/runner/portmap.c:55: error: for each function it appears in.)

　　install python-devel and run the command again:

　　$ yum install python-devel -y

$ easy_install twisted

　　If the following exception appears:

　　error: Not a recognized archive type: /tmp/easy_install-tVwC5O/Twisted-14.0.0.tar.bz2

　　download and install Twisted manually, using the download address in the command below:

　　$ wget https://pypi.python.org/packages/source/T/Twisted/Twisted-14.0.0.tar.bz2#md5=9625c094e0a18da77faa4627b98c9815

$ tar -vxjf Twisted-14.0.0.tar.bz2

$ cd Twisted-14.0.0

$ python setup.py install

　　Install pyOpenSSL

　　First install some dependencies:

　　$ yum install libffi libffi-devel openssl-devel -y

　　Then install pyOpenSSL with easy_install:

　　$ easy_install pyOpenSSL

　　Install Scrapy

　　First install some dependencies:

　　$ yum install libxml2 libxslt libxslt-devel -y

　　Finally, install Scrapy:

　　$ easy_install scrapy

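　　2. Using Scrapy

　　Once installation succeeds, you can learn Scrapy's basic concepts and usage by studying the dirbot example project.

　　The dirbot project is hosted online. It includes a README file that describes the project in detail. If you are familiar with Git, you can check out its source code; alternatively, you can download it as a tarball or zip file by clicking the download link.

　　The following example shows how to create a crawler project with Scrapy.

　　Create a new project

　　Before you start crawling, you need to create a new Scrapy project. Change into the directory where you want to keep the code and run: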

　　$ scrapy startproject tutorial

　　This command creates a new directory named tutorial in the current directory, with the following structure:

.
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

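　　These files are mainly:

　　Define the Item

　　Items are containers that will be loaded with the scraped data. They work like Python dictionaries but offer extra protection: populating an undeclared field is rejected, which guards against typos.

　　An Item is declared by subclassing scrapy.item.Item and defining its attributes as scrapy.item.Field objects, much like an object-relational mapper (ORM).

　　We model the item to control which site data we collect. For example, we want the name, URL and description of each site, so we define a field for each of these three attributes. To do so, edit items.py in the tutorial directory; our Item class looks like this: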

from scrapy.item import Item, Field


class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

　　This may look complicated at first, but defining the item lets the other Scrapy components know exactly what your item contains.
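　　The protection against undeclared fields can be imitated in plain Python to see why it helps. The sketch below is not Scrapy's implementation: StrictItem, SiteItem and the class-level fields tuple are hypothetical names chosen for illustration.

```python
class StrictItem(dict):
    """Dict-like container that only accepts declared field names."""

    # Subclasses list their allowed keys here (hypothetical convention).
    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            # A misspelled key fails loudly instead of being stored silently.
            raise KeyError("%s does not support field: %s"
                           % (type(self).__name__, key))
        dict.__setitem__(self, key, value)


class SiteItem(StrictItem):
    fields = ("title", "link", "desc")


item = SiteItem()
item["title"] = "Example site"      # declared field: accepted
try:
    item["titel"] = "typo"          # undeclared field: rejected
except KeyError as exc:
    print("rejected:", exc)
```

　　With an ordinary dict, the misspelled key would have been stored without complaint and the mistake would only surface later, when the expected field turned out to be empty.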

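　　Write a Spider

　　A Spider is a user-written class that scrapes information from a domain (or a group of domains). It defines an initial list of URLs to download, how to follow links, and how to parse the contents of those pages to extract items.

　　To create a Spider, subclass scrapy.spider.BaseSpider and define three main, mandatory attributes: name, start_urls and parse().

　　The parse() method is responsible for parsing the response data, extracting the scraped data (as items) and finding more URLs to follow.

　　Create DmozSpider.py in the tutorial/spiders directory: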

from scrapy.spider import BaseSpider


class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Name the output file after the last path segment of the URL.
        filename = response.url.split("/")[-2]
        # Save the raw response body to that file.
        with open(filename, 'wb') as f:
            f.write(response.body)
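　　The file name used in parse() comes from a plain string split, which can be tried on its own. In this small sketch the helper name filename_for is hypothetical; the URLs are the two start URLs of the spider.

```python
def filename_for(url):
    """Return the second-to-last "/"-separated piece of the URL."""
    # A URL ending in "/" splits into [..., "Books", ""], so index -2
    # is the last real path segment.
    return url.split("/")[-2]


start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

for url in start_urls:
    print(filename_for(url))   # prints "Books", then "Resources"
```

　　The expression relies on the trailing slash: for a URL without one, index -2 is the parent segment rather than the last one.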

　　Run the project

　　$ scrapy crawl dmoz

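　　This command starts the spider for the dmoz.org domain; the third argument is the value of the name attribute in DmozSpider.py.

　　XPath selectors

　　Scrapy uses a mechanism called XPath selectors, which is based on XPath expressions. To learn more about selectors and other mechanisms, see the Scrapy documentation.

　　Here are some examples of XPath expressions and their meanings:

　　These are just a few simple examples of what XPath can do; XPath is in fact far more powerful. To learn more about it, we recommend working through an XPath tutorial.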
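　　To try a few path expressions without running a crawl, the standard library's xml.etree.ElementTree can evaluate a limited subset of XPath. It is only a stand-in for Scrapy's Selector: it does not support text() or @href steps, so text and attributes are read from the matched elements instead. The sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """
<html>
  <head><title>Python Books</title></head>
  <body>
    <ul class="directory-url">
      <li><a href="http://example.com/a">Book A</a> - First description</li>
      <li><a href="http://example.com/b">Book B</a> - Second description</li>
    </ul>
    <ul class="other"><li>Unrelated entry</li></ul>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# ".//ul/li": every <li> that is a direct child of any <ul>
all_items = root.findall(".//ul/li")

# ".//ul[@class='directory-url']/li": only <li> under the matching <ul>
site_items = root.findall(".//ul[@class='directory-url']/li")

# ".//ul/li/a": the <a> elements inside those <li> elements
links = root.findall(".//ul/li/a")

# "head/title": a path relative to the root <html> element
title = root.find("head/title")

print(len(all_items), len(site_items))
print([a.text for a in links])
print([a.get("href") for a in links])
print(title.text)
```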

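　　To make XPath easier to use, Scrapy provides the Selector class, which has three methods

　　Extract data

　　We can select every <li> element on the page with the following code:

　　sel.xpath('//ul/li')

　　Then the site descriptions:

　　sel.xpath('//ul/li/text()').extract()

　　The site titles:

　　sel.xpath('//ul/li/a/text()').extract()

　　The site links:

　　sel.xpath('//ul/li/a/@href').extract()

　　Each xpath() call returns a list of selectors, so we can chain further xpath() calls to dig into deeper nodes. We will use that here: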

# Select each <li>, then query relative to it for the individual fields.
sites = sel.xpath('//ul/li')
for site in sites:
    title = site.xpath('a/text()').extract()
    link = site.xpath('a/@href').extract()
    desc = site.xpath('text()').extract()
    print title, link, desc

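　　Using Items

　　scrapy.item.Item exposes an interface similar to a Python dict, and an Item contains multiple scrapy.item.Field objects. This is similar to Django's Model.

　　Items are usually used in the Spider's parse method to hold the parsed data.

　　Finally, modify the spider class to store the data in Items. The code is as follows: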

from scrapy.spider import Spider
from scrapy.selector import Selector

# Website is the Item class from the dirbot example project.
from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below is a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            # One item per listed site, filled from paths relative to its <li>.
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s([^\n]*?)\\n')
            items.append(item)

        return items
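　　The pattern passed to re() in the spider can be checked in isolation with the standard re module. This small sketch assumes the description text has the form " - some text" followed by a newline (the sample string is made up) and uses the raw-string equivalent of the pattern.

```python
import re

# Raw-string form of the pattern used in the spider's re() call.
DESCRIPTION = re.compile(r'-\s([^\n]*?)\n')

# Hypothetical text node that follows a link inside an <li>.
text = " - A guide to web scraping with Python.\n"

match = DESCRIPTION.search(text)
print(match.group(1))             # A guide to web scraping with Python.
print(DESCRIPTION.findall(text))  # a list holding that same string
```

　　The pattern requires a trailing newline, so a text node that ends without one yields no match and the description field stays empty.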

　　Now you can run the project again to see the results:

　　$ scrapy crawl dmoz

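　　Using Item Pipelines

　　Set ITEM_PIPELINES in settings.py. It defaults to [] and is similar to Django's MIDDLEWARE_CLASSES.

　　Items returned by the Spider's parse method are processed in turn by the pipeline classes listed in ITEM_PIPELINES.

　　An Item Pipeline class must implement the following method: process_item(item, spider).

　　It may additionally implement the following two methods: open_spider(spider) and close_spider(spider).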
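　　A pipeline is an ordinary Python class, so its hooks (process_item, plus the optional open_spider and close_spider) can be sketched without importing Scrapy. The class names, the run_pipelines driver and the sample items below are hypothetical; in a real project Scrapy itself calls these methods, and a pipeline takes effect only once it is listed in ITEM_PIPELINES.

```python
import json


class CleanTitlePipeline(object):
    """Strip surrounding whitespace from each item's title."""

    def process_item(self, item, spider):
        item["title"] = item["title"].strip()
        return item  # hand the item on to the next pipeline


class CollectJsonLinesPipeline(object):
    """Collect every item as one JSON line between open and close."""

    def open_spider(self, spider):
        self.lines = []          # set up state when the crawl starts

    def process_item(self, item, spider):
        self.lines.append(json.dumps(item, sort_keys=True))
        return item

    def close_spider(self, spider):
        self.output = "\n".join(self.lines)  # finalise when the crawl ends


def run_pipelines(items, pipelines, spider=None):
    """Hypothetical driver imitating the order in which hooks are called."""
    for p in pipelines:
        if hasattr(p, "open_spider"):
            p.open_spider(spider)
    results = []
    for item in items:
        for p in pipelines:      # each item passes through every pipeline in turn
            item = p.process_item(item, spider)
        results.append(item)
    for p in pipelines:
        if hasattr(p, "close_spider"):
            p.close_spider(spider)
    return results


collector = CollectJsonLinesPipeline()
items = run_pipelines([{"title": "  Book A "}, {"title": "Book B"}],
                      [CleanTitlePipeline(), collector])
print(collector.output)
```

　　Because the cleaning pipeline runs first, the collector sees the already stripped titles; reversing the list order would store the raw values instead.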

　　Saving the scraped data

　　The simplest way to save the scraped data is through feed exports, using the following command:

　　$ scrapy crawl dmoz -o items.json -t json

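　　Besides JSON, the JSON lines, CSV and XML formats are supported, and further formats can be added through the extension interface.

　　This approach is enough for small projects. For more complex data, you may need to write an Item Pipeline.

　　All scraped items are saved in JSON format in the newly generated items.json file.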
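　　To make the difference between the JSON and JSON lines formats concrete, here is a small sketch that uses only the standard json module. The two sample items are made up; the point is the shape of the output, not Scrapy's exact formatting.

```python
import json

# Hypothetical scraped items.
items = [
    {"title": "Book A", "link": "http://example.com/a"},
    {"title": "Book B", "link": "http://example.com/b"},
]

# JSON: the whole result is a single array, so it must be loaded at once.
as_json = json.dumps(items, sort_keys=True)

# JSON lines: one object per line, so a file can be read line by line.
as_json_lines = "\n".join(json.dumps(item, sort_keys=True) for item in items)

print(as_json)
print(as_json_lines)

# Both forms round-trip to the same Python data.
assert json.loads(as_json) == items
assert [json.loads(line) for line in as_json_lines.splitlines()] == items
```

　　For large crawls the line-oriented form is usually easier to process incrementally, since each line is a complete record.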

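　　Summary

　　The above walks through creating a crawler project; you can follow the same steps to practice on your own. As a further learning example, see the article: scrapy中文教程（爬取cnbeta示例） (Scrapy Chinese tutorial: crawling cnbeta example).

　　The spider code from that article is as follows: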

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from cnbeta.items import CnbetaItem


class CBSpider(CrawlSpider):
    name = 'cnbeta'
    allowed_domains = ['cnbeta.com']
    start_urls = ['http://www.cnbeta.com']

    # Follow every link that looks like an article page and parse it.
    rules = (
        Rule(SgmlLinkExtractor(allow=('/articles/.*\.htm', )),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Store the page title and URL of each article.
        item = CnbetaItem()
        sel = Selector(response)
        item['title'] = sel.xpath('//title/text()').extract()
        item['url'] = response.url
        return item
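　　The allow pattern in the rule is a regular expression tested against each link found on a page. A quick way to see which URLs it would accept is to try it with the re module. This sketch assumes the pattern is applied with search semantics (a match anywhere in the URL), and the sample URLs are made up.

```python
import re

ALLOW = re.compile(r'/articles/.*\.htm')

# Hypothetical URLs of the kind a crawl might encounter.
candidates = [
    "http://www.cnbeta.com/articles/12345.htm",
    "http://www.cnbeta.com/articles/",
    "http://www.cnbeta.com/topics/9.htm",
]

followed = [url for url in candidates if ALLOW.search(url)]
print(followed)
```

　　Only the first URL contains both the /articles/ prefix and an .htm suffix, so it is the only one that would be passed to parse_page.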

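　　Points to note:

　　3. Learning resources

　　I started looking into Scrapy because I wanted to crawl some data from Zhihu. I began by looking for related material and other people's implementations.

　　Several people on GitHub have implemented Zhihu crawling to some degree. I found the following repositories:

　　Other resources:

　　Scraping and interaction examples:

　　Some points that need sorting out:

　　4. Summary

　　These are my notes and a summary of what I learned about Scrapy over the past few days. I drew on a number of articles from the web while writing this, and I am very grateful to their authors. I hope this article helps you. If you have any thoughts, please leave a comment; if you like it, please share it. Thank you!

　　Originally published at: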
