Scraping Web Page Data with Python (Introduction: web scraping is one of the most common techniques for understanding data)

优采云 Published: 2021-10-09 14:37


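Introduction

Web scraping is one of the most common techniques for making sense of data on the Internet. It extracts the valuable, needed data from web pages, and that data then serves as input to further computation that produces useful information. In this article we learn how to collect the email addresses published on any web page. We use Python, one of the most popular programming languages, because its rich set of libraries covers everything the task requires.

The following steps show how to find email addresses on any web page.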

Step 1

Import all the libraries the program needs.

# import packages
from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re

Step 2

Choose the URL from which email addresses will be extracted.

# a queue of urls to be crawled
new_urls = deque(['https://www.gtu.ac.in/page.aspx?p=ContactUsA'])

Step 3

Each URL should be processed only once, so keep track of the URLs that have already been handled.

# a set of urls that we have already crawled
processed_urls = set()

Step 4

While crawling a URL we may find several email addresses, some of them more than once, so store them in a set.

# a set of crawled emails
emails = set()
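
The queue from step 2 and the set from step 3 together give a breadth-first traversal that visits each page once. The following is a minimal sketch of that pattern only, run on a small in-memory link graph rather than on real pages. `FAKE_LINKS` and `crawl_order` are hypothetical names introduced for illustration and do not appear in the article's script.

```python
from collections import deque

# Hypothetical in-memory "site": each page maps to the links it contains.
FAKE_LINKS = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["d"],
    "d": [],
}

def crawl_order(start):
    """Return pages in the order a queue-plus-set crawler visits them."""
    queue = deque([start])   # pages waiting to be processed
    processed = set()        # pages already handled
    order = []
    while queue:
        page = queue.popleft()
        processed.add(page)
        order.append(page)
        for link in FAKE_LINKS.get(page, []):
            # enqueue only pages that are neither waiting nor already done
            if link not in queue and link not in processed:
                queue.append(link)
    return order

print(crawl_order("a"))  # ['a', 'b', 'c', 'd']
```

Checking `link not in queue` scans the deque linearly. That is fine for a small crawl, but a larger one would probably keep a second set of enqueued URLs instead.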

Step 5

Now it is time to start crawling. We process every URL in the queue, keep a record of the URLs already crawled, and fetch each page's content. If a request fails, we skip to the next URL.

# process urls one by one until we exhaust the queue
while len(new_urls):
    # move next url from the queue to the set of processed urls
    url = new_urls.popleft()
    processed_urls.add(url)

    # get url's content
    print("Processing %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors
        continue

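Step 6

Next we extract some basic parts of the current URL; they are needed to convert the document's relative links into absolute ones. This code, like the code in steps 7 to 9, continues the body of the while loop from step 5, which is why it is indented:

    # extract base url and path to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/')+1] if '/' in parts.path else url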
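
To make the two values concrete, the sketch below wraps the same expressions in a helper and prints the result for the start URL from step 2 and for a nested page. `base_and_path` is a hypothetical helper name, and the `example.com` URLs are made up for illustration.

```python
from urllib.parse import urlsplit

def base_and_path(url):
    """Return (base_url, path) exactly as computed in step 6."""
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/') + 1] if '/' in parts.path else url
    return base_url, path

print(base_and_path("https://www.gtu.ac.in/page.aspx?p=ContactUsA"))
# ('https://www.gtu.ac.in', 'https://www.gtu.ac.in/')

print(base_and_path("https://example.com/docs/guide/intro.html"))
# ('https://example.com', 'https://example.com/docs/guide/')
```

One limitation worth knowing: `rfind('/')` searches the whole URL, so a slash inside the query string shifts the cut point. For `https://example.com/a/b?next=/x` the computed path is `https://example.com/a/b?next=/`, which is not a usable directory prefix.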

Step 7

Extract the email addresses from the page content and add them to the email set.

    # extract all email addresses and add them into the resulting set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)
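
The pattern can be tried without any network access. The sketch below applies the same regular expression to a small HTML string; `EMAIL_RE` and `sample_html` are hypothetical names, and the addresses are invented for the example.

```python
import re

# Same pattern as in step 7, compiled once with the case-insensitive flag.
EMAIL_RE = re.compile(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", re.I)

sample_html = """
<p>Contact: <a href="mailto:Registrar@example.edu">Registrar@example.edu</a></p>
<p>Support: help.desk@mail.example.org, or visit /contact</p>
"""

# A set removes the duplicate that comes from the mailto link and its text.
found = set(EMAIL_RE.findall(sample_html))
print(sorted(found))
# ['Registrar@example.edu', 'help.desk@mail.example.org']
```

The pattern is deliberately permissive, so it also matches strings that only look like addresses, for example an image file name such as `logo@2x.png`. Results from real pages may need a filtering pass.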

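Step 8

Once the current page is processed, we look for links to other pages and add them to the URL queue (this is what makes it a crawler). Create a BeautifulSoup object to parse the HTML page.

    # create a beautiful soup for the html document; naming the parser
    # explicitly keeps the result the same on every machine
    soup = BeautifulSoup(response.text, "html.parser")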

Step 9

The soup object contains the HTML elements. Now find every anchor tag, read its href attribute, resolve relative links, and enqueue the URLs that have not been seen yet.

    # find and process all the anchors in the document
    for anchor in soup.find_all("a"):
        # extract link url from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''

        # resolve relative links
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link

        # add the new url to the queue if it was not enqueued nor processed yet
        if link not in new_urls and link not in processed_urls:
            new_urls.append(link)
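
The string concatenation above treats every href that does not start with `/` or `http` as a relative path, which includes empty hrefs and `mailto:` links. As an alternative, not the article's method, the standard library's `urljoin` can do the resolution. The sketch below is one possible variant; `resolve_links` is a hypothetical helper and the URLs are invented.

```python
from urllib.parse import urldefrag, urljoin, urlsplit

def resolve_links(page_url, hrefs):
    """Turn raw href values into absolute http(s) URLs, skipping the rest."""
    resolved = []
    for href in hrefs:
        if not href:
            continue  # anchor without a usable href
        # resolve against the current page, then drop any #fragment
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlsplit(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved

page = "https://example.com/docs/index.html"
hrefs = ["/about", "team.html", "https://other.example/x",
         "mailto:info@example.com", "", "#top"]
print(resolve_links(page, hrefs))
# ['https://example.com/about', 'https://example.com/docs/team.html',
#  'https://other.example/x', 'https://example.com/docs/index.html']
```

The last entry is the page itself, produced by the `#top` link. In the crawler it would be dropped by the `processed_urls` check.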

Step 10

After the loop finishes, list all the email addresses that were collected during the crawl.

for email in emails:
    print(email)

Summary

This article explained how to perform web scraping, in particular how to locate data on an HTML page with Python packages such as BeautifulSoup, collections, requests, re, and urllib.parse.
