Scraping web page data (Python regular expression tutorial, output results, and summary table (figure))

优采云 · Published: 2022-02-25 13:01


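Abstract: This article covers three ways of scraping web page data with Python: regular expressions (re), the BeautifulSoup module, and the lxml module. All code in this article was run under Python 3.5.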

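This article scrapes the headline items on the home page of the National Meteorological Center (中央气象台; the code below uses http://www.nmc.cn):

Its HTML hierarchy is:

For each headline link we scrape the href, the title, and the tag text.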

1. Regular expressions

Copy the outer HTML:

High-temperature warning

Code:

# coding=utf-8
import re
import urllib.request

url = 'http://www.nmc.cn'
html = urllib.request.urlopen(url).read()
html = html.decode('utf-8')  # required in Python 3: read() returns bytes

# Each pattern captures one field of the <a target="_blank" ...> elements
links = re.findall('<a target="_blank" href="(.+?)" title', html)
titles = re.findall('<a target="_blank" .+? title="(.+?)"', html)
tags = re.findall('<a target="_blank" .+? title=.+?>(.+?)</a>', html)

for link, title, tag in zip(links, titles, tags):
    print(tag, url + link, title)

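In these patterns, '.' matches any single character except \n; '+' matches one or more occurrences of the preceding expression; '?' matches zero or one occurrence of the preceding expression, and when it follows '+', as in '.+?', it makes the match non-greedy so that it stops as early as possible. For more information, see a tutorial on regular expressions in Python.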

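Output:

High-temperature warning  The National Meteorological Center continued to issue an orange high-temperature warning at 18:00 on July 13

Flash flood meteorological warning  The Ministry of Water Resources and the China Meteorological Administration jointly issued a flash flood meteorological warning at 18:00 on July 13

Severe convective weather warning  The National Meteorological Center continued to issue a blue severe convective weather warning at 18:00 on July 13

Geological disaster meteorological risk warning  The Ministry of Land and Resources and the China Meteorological Administration jointly issued a geological disaster meteorological risk warning at 18:00 on July 13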
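To make the role of the trailing '?' concrete, here is a minimal sketch on a hypothetical two-link string. The sample markup and its href values are made up for illustration and are not taken from the site. It compares the greedy '.+' with the non-greedy '.+?' used in the patterns above.

```python
import re

# Hypothetical sample markup, shaped like the links matched by the patterns above
sample = ('<a target="_blank" href="/a.html" title="First">One</a>'
          '<a target="_blank" href="/b.html" title="Second">Two</a>')

# Greedy: '.+' runs as far as it can, so one capture spans both links
greedy = re.findall('href="(.+)" title', sample)

# Non-greedy: '.+?' stops at the first point where the rest of the pattern matches
lazy = re.findall('href="(.+?)" title', sample)

print(len(greedy), greedy)
print(len(lazy), lazy)
```

The greedy pattern yields a single capture running from the first href to the last one, while the non-greedy pattern yields one href per link, which is presumably why the scraping code relies on '.+?'.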

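2. The BeautifulSoup module

Beautiful Soup is a very popular Python module. It parses web pages and provides a convenient interface for locating content.

Copy the selector:

#alarmtip > ul > li.waring > a:nth-child(1)

Because we want to scrape every entry here, not just the first one, we change it to:

#alarmtip > ul > li.waring > a

Code: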

from bs4 import BeautifulSoup
import urllib.request

url = 'http://www.nmc.cn'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')  # parse with the lxml parser

# select() returns every element that matches the CSS selector
content = soup.select('#alarmtip > ul > li.waring > a')

for n in content:
    link = n.get('href')
    title = n.get('title')
    tag = n.text
    print(tag, url + link, title)

The output is the same as above.
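The effect of dropping :nth-child(1) from the selector can be checked offline. The sketch below runs both selectors against a small hypothetical fragment: only the id alarmtip and the class waring come from the selector in this section, and the rest of the markup is assumed. It uses Python's built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the id and class names follow the selector above
sample = '''
<div id="alarmtip"><ul>
  <li class="waring">
    <a href="/w1.html" title="First">One</a>
    <a href="/w2.html" title="Second">Two</a>
  </li>
</ul></div>
'''

soup = BeautifulSoup(sample, 'html.parser')

# With :nth-child(1), only an <a> that is the first child of its <li> matches
first_only = soup.select('#alarmtip > ul > li.waring > a:nth-child(1)')

# Without it, every <a> directly inside li.waring matches
all_links = soup.select('#alarmtip > ul > li.waring > a')

print([a.get('href') for a in first_only])
print([a.get('href') for a in all_links])
```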

3. The lxml module

lxml is a Python wrapper around the libxml2 XML parsing library. The module is written in C and parses faster than Beautiful Soup, but it is more complicated to install.

Code:

import urllib.request
import lxml.html

url = 'http://www.nmc.cn'
html = urllib.request.urlopen(url).read()
tree = lxml.html.fromstring(html)

# cssselect() needs the separate cssselect package to be installed
content = tree.cssselect('li.waring > a')

for n in content:
    link = n.get('href')
    title = n.get('title')
    tag = n.text
    print(tag, url + link, title)

The output is the same as above.
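All three versions build the absolute link with url + link, which works only if every href is a site-relative path beginning with '/'. As a hedged alternative, and not what the article's code does, the standard library's urllib.parse.urljoin also copes with hrefs that are already absolute. The href values below are hypothetical.

```python
from urllib.parse import urljoin

url = 'http://www.nmc.cn'

# Hypothetical href values: one site-relative, one already absolute
relative_href = '/publish/example.html'
absolute_href = 'http://example.com/other.html'

# For a site-relative path, concatenation and urljoin agree
print(url + relative_href)
print(urljoin(url, relative_href))

# For an absolute href, concatenation glues two URLs together,
# while urljoin returns the href unchanged
print(url + absolute_href)
print(urljoin(url, absolute_href))
```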

4. Storing the scraped data in a list or dictionary

Using the BeautifulSoup module as an example:

from bs4 import BeautifulSoup
import urllib.request

url = 'http://www.nmc.cn'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
content = soup.select('#alarmtip > ul > li.waring > a')

######### Add to lists
link = []
title = []
tag = []
for n in content:
    link.append(url + n.get('href'))
    title.append(n.get('title'))
    tag.append(n.text)

######## Add to dictionaries
records = []  # one dictionary per entry, so earlier entries are not overwritten
for n in content:
    data = {
        'tag': n.text,
        'link': url + n.get('href'),
        'title': n.get('title')
    }
    records.append(data)
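The two storage styles above can be converted into each other. As a minimal sketch, assuming three parallel lists named link, title and tag as in the list version (the values here are hypothetical placeholders, not scraped data), zip can turn them into one dictionary per entry, and a second dictionary can index those entries by tag.

```python
# Hypothetical values standing in for the scraped link, title and tag lists
link = ['http://www.nmc.cn/w1.html', 'http://www.nmc.cn/w2.html']
title = ['First title', 'Second title']
tag = ['First tag', 'Second tag']

# One dictionary per entry, built from the parallel lists
records = [
    {'tag': t, 'link': href, 'title': text}
    for t, href, text in zip(tag, link, title)
]

# Index the records by tag for direct lookup (assumes tags are unique)
by_tag = {record['tag']: record for record in records}

print(records[0])
print(by_tag['Second tag']['link'])
```

A list of dictionaries keeps each entry's fields together, whereas the parallel lists are convenient when only one field is needed at a time.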

5. Summary

Table 2.1 summarizes the advantages and disadvantages of each scraping method.
