Web Scraping | Scraping Web Page Data with Python

优采云 · Published: 2022-05-25 09:21

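　　I have posted about web scraping before, and today's post returns to the topic. More scraping posts will follow (HTML, requests, bs4, re, ...), possibly interleaved with posts on numpy and pandas. Time permitting, I will also write about the WRF model. Consider this a flag planted for what I will post, though I am not committing to a schedule ==

　　----------- divider ------------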

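　　What do you do when you have no data? Sometimes the data is available directly as a csv file or through an API. Sometimes, however, it exists only on a web page. In that case the only option is to scrape the page and convert the result into a format suitable for analysis.

　　This article uses Python 3 and BeautifulSoup to scrape weather forecast data from a web page and then analyzes it with pandas.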

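　　Components of a Web Page

　　When we visit a web page, the browser sends a request to a web server, usually with the GET method. The server returns a response, and the browser parses it to display the requested page. The files a web server returns are mainly of the following types: HTML (the main content of the page), CSS (styling), JavaScript (interactivity), and images (formats such as JPG and PNG).

　　Once the browser has received all the files, it renders the page and displays it. A lot happens behind the scenes to display a page, but we do not need to understand all of it to scrape data. When scraping we care mainly about the page's main content, so we focus on HTML.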

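　　HTML

　　HTML (HyperText Markup Language) is the language web pages are written in, but it is not a programming language like Python. It is a markup language that tells the browser how to lay out the content of a page. Writing HTML resembles working in a text editor: you can style text (bold, larger, smaller), create paragraphs, and so on.

　　To scrape pages efficiently we first need a quick tour of HTML. HTML consists of a series of tags. The most basic tag is the <html> tag. A tag tells the browser what the page contains. The following tags create the most basic HTML document (note: open a text editor, paste the content below, and save it under any name with the html extension, for example document.html).

<html>
</html>

　　Then open the saved file in a browser. Because the file contains only one pair of tags with nothing inside them, the browser shows an empty page.

　　Next, in addition to the <html> tag, we add <head> and <body> tags. The <body> tag contains the main content of the page, and the <head> tag contains information about the page, such as its title. These three tags are very useful when scraping.

<html>
<head>
</head>
<body>
</body>
</html>

　　Apart from the two extra tags nothing else has been added, so the browser still shows an empty document.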

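　　Now we add some content to the page, marked with the <p> tag. The content of a <p> tag forms one paragraph of the page.

<html>
<head>
</head>
<body>
<p>
Here's a paragraph of text!
</p>
<p>
Here's a second paragraph of text!
</p>
</body>
</html>

　　Opened in a browser, the page shows the following (any coloring in the listing is only for highlighting; the browser renders the text in black):

Here's a paragraph of text!
Here's a second paragraph of text!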

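　　Tags are commonly referred to by names that depend on their position relative to other tags, such as parent, child, and sibling.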
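　　The position-based names become concrete when the nesting is printed. The sketch below is not part of the original walkthrough: it uses Python's standard-library html.parser to list each opening tag with its depth. The sample string mirrors the paragraph example above, and TagOutline and outline_tags are names made up for this illustration.

```python
from html.parser import HTMLParser

# Sample document mirroring the paragraph example (kept in a string for illustration).
SAMPLE_HTML = """
<html>
<head>
</head>
<body>
<p>Here's a paragraph of text!</p>
<p>Here's a second paragraph of text!</p>
</body>
</html>
"""


class TagOutline(HTMLParser):
    """Collect (depth, tag) pairs for every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current depth, then descend one level.
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag returns us to the parent's level.
        self.depth -= 1


def outline_tags(html_text):
    """Hypothetical helper: return the (depth, tag) outline of an HTML string."""
    parser = TagOutline()
    parser.feed(html_text)
    return parser.outline


for depth, tag in outline_tags(SAMPLE_HTML):
    print("  " * depth + tag)
```

　　In the printed outline, html is the parent of head and body, head and body are siblings, and each p is a child of body.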

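　　Attributes can also be added to an HTML document to change its behavior:

<html>
<head>
</head>
<body>
<p>
Here's a paragraph of text!
<a href="...">Learn Data Science Online</a>
</p>
<p>
Here's a second paragraph of text!
<a href="...">Python</a>
</p>
</body>
</html>

　　The page looks like this:

Here's a paragraph of text! Learn Data Science Online
Here's a second paragraph of text! Python

　　This example adds two <a> tags. An <a> tag is a link: it tells the browser that the link leads to another web page. The href attribute gives the link's address, and the string that follows is the text displayed for the link.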

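　　<a> and <p> are both very common HTML tags. Some other tags include <div> (a division, or area, of the page), <b> (bold text), <i> (italic text), <table> (a table), and <form> (an input form).

　　The full list of tags is here [Note 1].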

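　　Before we start scraping, let's look at the class and id attributes. These special attributes give HTML elements names, which makes the elements easier to work with when scraping. An element can have several classes, and a class can be shared by several elements. Each element can have only one id, and an id can be used only once on a page. class and id are optional; not every element has them.

　　A forced analogy: you (the element) can have many friends (classes), and different circles of friends (classes) can overlap in you (sharing), but you have only one ID card (id). When you claim a prize under your own name, the card can be used only once, not to collect several times. Friends and ID cards are optional, since you may have no friends (a lone wanderer) and no ID card (unregistered).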

　　Adding class and id to the example:

Here's a paragraph of text! Learn Data Science Online Here's a second paragraph of text! Python

　　The result looks the same as in the previous example (adding class and id does not affect the page's content or layout):

Here's a paragraph of text! Learn Data Science Online Here's a second paragraph of text! Python

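　　The requests Library

　　The first step in scraping a page is downloading it. We can use the requests library to send a GET request to the web server and download the page content. requests supports several request types, of which GET is one; see the requests documentation for more.

　　Let's try downloading a simple page. First, download it with the requests.get method:

import requests
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")

　　Running the GET request returns a response object. Its status_code attribute shows whether the download succeeded.

page.status_code
200

　　A status code of 200 means the page downloaded successfully. We do not need a full understanding of status codes: in general, a code starting with 2 indicates success, and a code starting with 4 or 5 indicates an error.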
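　　The rule of thumb about leading digits can be written down in a few lines. This is a minimal sketch using only the standard library's http.HTTPStatus; the function name describe_status is made up for illustration and is not part of requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Hypothetical helper: label a status code, e.g. '200 OK (success)'."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirection"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "informational"
    return f"{code} {phrase} ({category})"


for code in (200, 404, 500):
    print(describe_status(code))
```

　　With a real response object, requests offers the same check directly: page.ok is True for codes below 400, and page.raise_for_status() raises an exception for 4xx and 5xx codes.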

　　The content attribute holds the page content:

page.content
b'<!DOCTYPE html>\n<html>\n <head>\n <title>A simple example page</title>\n </head>\n <body>\n <p>Here is some simple content for this page.</p>\n </body>\n</html>'

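　　Parsing the Page with BeautifulSoup

　　With the page downloaded, we parse it with BeautifulSoup and then extract the text of the p tag. Import the library and create an instance to parse the page:

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

　　The prettify method returns the page content as a neatly indented string, which we can print:

print(soup.prettify())
<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>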

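　　Because all the tags are nested, we can move through the structure one level at a time. The children attribute of soup selects all the top-level elements of the page.

　　Note: children returns an iterator, so call the list function to convert it to a list.

list(soup.children)
['html', '\n', <html> <head> <title>A simple example page</title> </head> <body> <p>Here is some simple content for this page.</p> </body> </html>]

　　The result shows two items at the top level of the page: the <!DOCTYPE html> declaration and the <html> tag. A newline character (\n) is also in the list. Let's look at the type of each element in the list:

[type(item) for item in list(soup.children)]
[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]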

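　　Every item is a BeautifulSoup object. The Doctype object holds information about the document type, a NavigableString represents text found in the document, and a Tag object contains other nested tags. The most important object, and the one used most often, is Tag.

　　Tag objects let us navigate an HTML document and extract other tags and text. For more on BeautifulSoup objects see [Note 2].

　　Select the html tag from soup.children:

html = list(soup.children)[2]

　　Every item that children returns is a BeautifulSoup object, so we can use the children attribute on it in turn. Get the child tags of the html tag:

list(html.children)
['\n', <head> <title>A simple example page</title> </head>, '\n', <body> <p>Here is some simple content for this page.</p> </body>, '\n']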

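　　As shown above, the html tag contains two tags: <head> and <body>. To reach the title and p tags we first need their parent tags. For example, to reach the p tag, first select the <body> tag:

body = list(html.children)[3]

　　Because the <body> tag contains only the p tag, the p tag is easy to find:

list(body.children)
['\n', <p>Here is some simple content for this page.</p>, '\n']

　　Select the p tag:

p = list(body.children)[1]

　　With the p tag in hand, the get_text method extracts the text inside it:

p.get_text()
'Here is some simple content for this page.'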
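　　The numeric indexes above ([2], [3], [1]) depend on where the newline strings happen to fall. One possible way to make the same walk less fragile is to keep only the Tag children at each level. The sketch below assumes a stand-in HTML string shaped like the page described above, so that it runs without a network connection; child_tags is a hypothetical helper, not a BeautifulSoup function.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Stand-in for the downloaded page (an assumption made so the example runs offline).
SIMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>"""


def child_tags(node):
    """Hypothetical helper: return only the Tag children, skipping text and the doctype."""
    return [child for child in node.children if isinstance(child, Tag)]


soup = BeautifulSoup(SIMPLE_HTML, "html.parser")
html = child_tags(soup)[0]      # the <html> tag
head, body = child_tags(html)   # its two child tags
p = child_tags(body)[0]         # the only tag inside <body>
print(p.get_text())
```

　　Filtering on Tag works because both the newline strings and the Doctype are string-like objects rather than tags.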

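　　Finding All Instances of a Tag

　　The steps above are useful for understanding how to navigate a page, but they take many commands to accomplish a very simple task. To extract a particular tag, use the find_all method, which finds every instance of that tag on the page:

soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')
[<p>Here is some simple content for this page.</p>]

　　Note: find_all returns a list, so loop over it or index into it to get a specific tag.

　　Once we have the tag, get_text again extracts the text:

soup.find_all('p')[0].get_text()
'Here is some simple content for this page.'

　　If you only want the first instance of a tag rather than all of them, use the find method:

soup.find('p')
<p>Here is some simple content for this page.</p>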
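　　One practical difference between the two methods is what happens when nothing matches: find_all returns an empty list, while find returns None, so calling get_text on the result of find can fail. The following sketch shows one way to guard against that. The HTML string is a stand-in for the downloaded page, and first_text is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Stand-in markup (an assumption so the example runs offline).
SIMPLE_HTML = "<html><body><p>Here is some simple content for this page.</p></body></html>"
soup = BeautifulSoup(SIMPLE_HTML, "html.parser")


def first_text(soup, tag_name, default=None):
    """Hypothetical helper: text of the first matching tag, or `default` if absent."""
    tag = soup.find(tag_name)
    return tag.get_text() if tag is not None else default


# find_all always returns a list, so a comprehension is safe even with no matches.
paragraph_texts = [p.get_text() for p in soup.find_all("p")]
table_texts = [t.get_text() for t in soup.find_all("table")]

print(paragraph_texts)
print(table_texts)
print(first_text(soup, "p"))
print(first_text(soup, "table", default="(no table found)"))
```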

　　Searching for Tags by class and id

　　We introduced class and id earlier but have not yet shown why they are useful. CSS uses class and id to decide which HTML elements a given style applies to. We can use them to scrape specific elements. Take the following page as an example (URL: http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html):

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
First paragraph.
</p>
<p class="inner-text">
Second paragraph.
</p>
</div>
<p class="outer-text first-item" id="second">
First outer paragraph.
</p>
<p class="outer-text">
Second outer paragraph.
</p>
</body>
</html>

　　Create the BeautifulSoup object:

page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup
<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
First paragraph.
</p>
<p class="inner-text">
Second paragraph.
</p>
</div>
<p class="outer-text first-item" id="second">
First outer paragraph.
</p>
<p class="outer-text">
Second outer paragraph.
</p>
</body>
</html>

　　Now use find_all to search by class and id. For example, search for p tags whose class is outer-text:

soup.find_all('p', class_='outer-text')
[<p class="outer-text first-item" id="second"> First outer paragraph. </p>, <p class="outer-text"> Second outer paragraph. </p>]

　　We can also search for any tag whose class is outer-text:

soup.find_all(class_="outer-text")
[<p class="outer-text first-item" id="second"> First outer paragraph. </p>, <p class="outer-text"> Second outer paragraph. </p>]

　　And of course we can search for elements by id:

soup.find_all(id="first")
[<p class="inner-text first-item" id="first"> First paragraph. </p>]
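　　The searches above can be tried offline against a small stand-in document. In the sketch below the HTML string is an assumption modeled on the example page (in particular, the class values on the second and fourth paragraphs are guesses), not a copy of the real file. It also shows that class is multi-valued while id is a single string.

```python
from bs4 import BeautifulSoup

# Stand-in markup modeled on the example page; attribute values are assumptions.
IDS_AND_CLASSES_HTML = """
<html>
<body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second">First outer paragraph.</p>
<p class="outer-text">Second outer paragraph.</p>
</body>
</html>
"""

soup = BeautifulSoup(IDS_AND_CLASSES_HTML, "html.parser")

# class_ matches any tag whose class list contains the given name.
outer = soup.find_all("p", class_="outer-text")
outer_texts = [p.get_text(strip=True) for p in outer]

# One class shared by several elements: both "first-item" paragraphs match.
first_item_ids = [p["id"] for p in soup.find_all(class_="first-item")]

# class is exposed as a list of names; id is a plain string.
first = soup.find(id="first")

print(outer_texts)
print(first_item_ids)
print(first["class"], first["id"])
```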

　　CSS Selectors

　　CSS selectors (the syntax CSS uses to specify which HTML tags to style) can also be used to search. For example, p a selects all a tags inside a p tag, body p a selects all a tags inside a p tag inside a body tag, p.outer-text selects all p tags with the class outer-text, and p#first selects all p tags with the id first.

　　More selectors are listed here [Note 3].

　　BeautifulSoup objects support searching a page with selectors through the select method. Use a selector to get all p tags inside a div tag:

soup.select("div p")
[<p class="inner-text first-item" id="first"> First paragraph. </p>, <p class="inner-text"> Second paragraph. </p>]

　　Note: like find_all, the select method returns a list of BeautifulSoup objects.
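　　A few more selectors, tried on the same kind of stand-in document, may help show how select relates to find_all. The HTML string below is an assumption modeled on the example page rather than the real file. select_one is the single-result counterpart of select.

```python
from bs4 import BeautifulSoup

# Stand-in markup modeled on the example page; attribute values are assumptions.
IDS_AND_CLASSES_HTML = """
<html>
<body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second">First outer paragraph.</p>
<p class="outer-text">Second outer paragraph.</p>
</body>
</html>
"""

soup = BeautifulSoup(IDS_AND_CLASSES_HTML, "html.parser")


def texts(tags):
    """Hypothetical helper: stripped text of each tag in a result list."""
    return [tag.get_text(strip=True) for tag in tags]


inside_div = texts(soup.select("div p"))        # p tags anywhere inside a div
by_class = texts(soup.select("p.outer-text"))   # p tags with class outer-text
direct = texts(soup.select("body > p"))         # p tags that are direct children of body
by_id = soup.select_one("p#first")              # first match only, or None

print(inside_div)
print(by_class)
print(direct)
print(by_id.get_text(strip=True))
```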

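　　Downloading Weather Data

　　We now know how to extract information from a web page. The next step is to choose a page to scrape. As an example we scrape weather information from the US National Weather Service.

　　The page shows a week of forecast information, including the time period, the temperature, and a short description.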

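　　Understanding the Page Structure

　　First, inspect the page layout with Chrome's developer tools; other browsers work too.

　　Press F12 to open the developer tools (the part outlined in red in the figure below).

　　The Elements panel lists all the tags in the page, from which you can work out the page layout.

　　Right-click the part of the page labeled Extended Forecast (outlined in red in the figure below) and choose "Inspect". This jumps to the corresponding tag in the Elements panel (the parent tag of the shaded part).

　　This leads to the tag that holds all the forecast data, which in this example is the <div> tag with the id seven-day-forecast.

　　Expanding this <div> tag reveals each day's forecast: date, temperature, and short description. The green and red boxes in the figure below each mark one forecast item (contained in a <div> tag with the class tombstone-container).

　　Now that we know how to download and parse a page, let's put it into practice: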

page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())
Tonight
Mostly Clear
Low: 49 °F

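　　Extracting Information from the Page

　　Extracting from a Single Item

　　The forecast item tonight holds all the information we need. It contains four pieces: the name of the forecast period (here Tonight), a short description of the conditions, the temperature, and a longer description stored in the title attribute of the img tag.

　　Extract the period name, short description, and temperature:

period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)
Tonight
Mostly Clear
Low: 49 °F

　　Now extract the title attribute from the img tag. Treat the BeautifulSoup object as a dictionary and pass the attribute name as the key:

img = tonight.find("img")
desc = img['title']
print(desc)
Tonight: Mostly clear, with a low around 49. West northwest wind 12 to 17 mph decreasing to 6 to 11 mph after midnight. Winds could gust as high as 23 mph.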
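　　The four lookups can be gathered into one function that also tolerates a missing tag, since find returns None when nothing matches. This is a sketch under stated assumptions: the markup below is a shortened, hypothetical forecast item that only reuses the class names mentioned above, and text_or_none and parse_forecast_item are names invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like one forecast item; values shortened for illustration.
ITEM_HTML = """
<div class="tombstone-container">
<p class="period-name">Tonight</p>
<p><img title="Tonight: Mostly clear, with a low around 49."/></p>
<p class="short-desc">Mostly Clear</p>
<p class="temp">Low: 49 °F</p>
</div>
"""


def text_or_none(parent, css_class):
    """Stripped text of the first tag with the given class, or None if absent."""
    tag = parent.find(class_=css_class)
    return tag.get_text(strip=True) if tag is not None else None


def parse_forecast_item(item):
    """Collect the four fields of one forecast item into a dictionary."""
    img = item.find("img")
    return {
        "period": text_or_none(item, "period-name"),
        "short_desc": text_or_none(item, "short-desc"),
        "temp": text_or_none(item, "temp"),
        # Tag.get returns None instead of raising when the attribute is missing.
        "desc": img.get("title") if img is not None else None,
    }


item = BeautifulSoup(ITEM_HTML, "html.parser").find(class_="tombstone-container")
forecast = parse_forecast_item(item)
print(forecast)
```

　　Applied to the real page, the same function could be mapped over forecast_items, for example [parse_forecast_item(i) for i in forecast_items].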

　　Extracting All the Information

　　Having extracted the information for a single item, we now use CSS selectors and list comprehensions to extract everything at once:

period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods
['Tonight', 'Thursday', 'ThursdayNight', 'Friday', 'FridayNight', 'Saturday', 'SaturdayNight', 'Sunday', 'SundayNight']

　　This gives the period names in order. Now get the other three fields:

short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(short_descs)
print(temps)
print(descs)
['Mostly Clear', 'Sunny', 'Mostly Clear', 'Sunny', 'Slight ChanceRain', 'Rain Likely', 'Rain Likely', 'Rain Likely', 'Chance Rain']
['Low: 49 °F', 'High: 63 °F', 'Low: 50 °F', 'High: 67 °F', 'Low: 57 °F', 'High: 64 °F', 'Low: 57 °F', 'High: 64 °F', 'Low: 55 °F']
['Tonight: Mostly clear, with a low around 49. West northwest wind 12 to 17 mph decreasing to 6 to 11 mph after midnight. Winds could gust as high as 23 mph. ', 'Thursday: Sunny, with a high near 63. North wind 3 to 5 mph. ', 'Thursday Night: Mostly clear, with a low around 50. Light and variable wind becoming east southeast 5 to 8 mph after midnight. ', 'Friday: Sunny, with a high near 67. Southeast wind around 9 mph. ', 'Friday Night: A 20 percent chance of rain after 11pm. Partly cloudy, with a low around 57. South southeast wind 13 to 15 mph, with gusts as high as 20 mph. New precipitation amounts of less than a tenth of an inch possible. ', 'Saturday: Rain likely. Cloudy, with a high near 64. Chance of precipitation is 70%. New precipitation amounts between a quarter and half of an inch possible. ', 'Saturday Night: Rain likely. Cloudy, with a low around 57. Chance of precipitation is 60%.', 'Sunday: Rain likely. Cloudy, with a high near 64.', 'Sunday Night: A chance of rain. Mostly cloudy, with a low around 55.']
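　　Because each field is collected in a separate list, the lists only describe the same forecast items if they have the same length and order. A quick check before going further might look like the sketch below. It is plain Python, to_records is a hypothetical helper, and the three-item lists are a shortened sample of the values shown above.

```python
def to_records(**columns):
    """Hypothetical helper: zip equally long column lists into one dict per row."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # A missing tag in one item would leave that column shorter than the others.
        raise ValueError(f"columns differ in length: {lengths}")
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


# Shortened sample of the scraped lists.
periods = ["Tonight", "Thursday", "ThursdayNight"]
short_descs = ["Mostly Clear", "Sunny", "Mostly Clear"]
temps = ["Low: 49 °F", "High: 63 °F", "Low: 50 °F"]

records = to_records(period=periods, short_desc=short_descs, temp=temps)
for record in records:
    print(record)
```

　　Equal lengths do not prove the rows line up, but unequal lengths are a clear sign that some item lacked one of the tags.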

　　Storing the Data in a DataFrame

　　Next we store the data in a pandas DataFrame and analyze it. A DataFrame holds tabular data and makes analysis easy.

　　Pass the information above to the DataFrame class as a dictionary: each key is a column name and each value is the list of values for that column:

import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs
})
weather
   desc                                                period         short_desc         temp
0  Tonight: Mostly clear, with a low around 49. W...   Tonight        Mostly Clear       Low: 49 °F
1  Thursday: Sunny, with a high near 63. North wi...   Thursday       Sunny              High: 63 °F
2  Thursday Night: Mostly clear, with a low aroun...   ThursdayNight  Mostly Clear       Low: 50 °F
3  Friday: Sunny, with a high near 67. Southeast ...   Friday         Sunny              High: 67 °F
4  Friday Night: A 20 percent chance of rain afte...   FridayNight    Slight ChanceRain  Low: 57 °F
5  Saturday: Rain likely. Cloudy, with a high ne...    Saturday       Rain Likely        High: 64 °F
6  Saturday Night: Rain likely. Cloudy, with a l...    SaturdayNight  Rain Likely        Low: 57 °F
7  Sunday: Rain likely. Cloudy, with a high near...    Sunday         Rain Likely        High: 64 °F
8  Sunday Night: A chance of rain. Mostly cloudy...    SundayNight    Chance Rain        Low: 55 °F

　　Now we can do some simple analysis. For example, use a regular expression with the Series.str.extract method to pull out the numeric temperature values:

temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')
temp_nums
0    49
1    63
2    50
3    67
4    57
5    64
6    57
7    64
8    55
Name: temp_num, dtype: object

　　Then compute the mean temperature:

weather["temp_num"].mean()
58.444444444444443
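　　To see what the named group in the pattern does without pandas, the same extraction can be reproduced with the standard library. This is only an illustrative equivalent, not the method used above: extract_temp is a hypothetical helper, and the list repeats the temperatures printed earlier.

```python
import re
from statistics import mean

# Same pattern as in the str.extract call: one named group matching a run of digits.
TEMP_PATTERN = re.compile(r"(?P<temp_num>\d+)")


def extract_temp(label):
    """Hypothetical helper: first integer in a label such as 'Low: 49 °F', or None."""
    match = TEMP_PATTERN.search(label)
    return int(match.group("temp_num")) if match else None


temps = ["Low: 49 °F", "High: 63 °F", "Low: 50 °F", "High: 67 °F", "Low: 57 °F",
         "High: 64 °F", "Low: 57 °F", "High: 64 °F", "Low: 55 °F"]

temp_nums = [extract_temp(t) for t in temps]
print(temp_nums)        # [49, 63, 50, 67, 57, 64, 57, 64, 55]
print(mean(temp_nums))  # matches the mean computed with pandas above
```

　　Note that \d+ ignores a leading minus sign, so a forecast below zero would need a pattern such as -?\d+.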

　　If you plan to go out one evening, you can look at just the nighttime forecasts:

is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night
0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
Name: temp, dtype: bool
weather[is_night]
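　　The boolean Series acts as a row filter, and it can be inverted with ~ to get the daytime rows. The sketch below uses a four-row sample DataFrame built by hand (an assumption, so that it runs without scraping) to show both selections and a per-group mean.

```python
import pandas as pd

# Small hand-built sample of the scraped table (values taken from the output above).
weather = pd.DataFrame({
    "period": ["Tonight", "Thursday", "ThursdayNight", "Friday"],
    "temp": ["Low: 49 °F", "High: 63 °F", "Low: 50 °F", "High: 67 °F"],
})
weather["temp_num"] = (
    weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False).astype(int)
)

# Rows whose temp label contains "Low" are the nighttime forecasts.
is_night = weather["temp"].str.contains("Low")
nights = weather[is_night]
days = weather[~is_night]

print(nights)
print(nights["temp_num"].mean())
print(days["temp_num"].mean())
```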

　　Next Steps

　　You now know how to scrape a web page and extract data from it. The next step is to pick a website and keep practicing.

　　Just do it!

　　Note 1:

　　Note 2: #kinds-of-objects

　　Note 3:
