Web page source code scraping tools (urllib, the requests module, and Beautiful Soup)

Published by 优采云: 2022-01-15 12:30


import urllib.request

# Open the web page to be scraped
response = urllib.request.urlopen('http://www.baidu.com')

# Alternatively:
# from urllib import request
# response = request.urlopen('http://www.baidu.com')

# Print the page source
print(response.read().decode())

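decode() is added so that the page source is not printed as a bytes literal full of hexadecimal escape sequences such as \xe7\x99\xbe.

After decoding with decode(), which assumes UTF-8 by default, the output is readable text.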
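Because decode() assumes UTF-8 by default, it can fail or produce garbled text on pages served in another encoding such as GBK. The following is a minimal sketch of one way to handle that; the helper name `decode_body` and its fallback behaviour are assumptions for illustration, not part of urllib. With a live urllib response, the declared charset could be read with `response.headers.get_content_charset()`.

```python
def decode_body(raw, declared_charset=None, fallback="utf-8"):
    """Decode raw response bytes into str.

    declared_charset is whatever the server announced in Content-Type
    (may be None); fallback is an assumption used when nothing usable
    was declared.
    """
    charset = declared_charset or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name, or bytes that do not match it:
        # decode with the fallback and replace undecodable bytes.
        return raw.decode(fallback, errors="replace")


raw = "百度一下".encode("gbk")    # stand-in for response.read()
text = decode_body(raw, "gbk")

print(raw)    # a bytes literal made of \x.. escape sequences
print(text)   # readable text
```

The same helper returns plain UTF-8 text unchanged when no charset is declared.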

The remaining three are not covered in detail in this article.

The requests module

requests is a third-party module for making HTTP requests in Python. It is much simpler to use than the urllib module, and its API is more user-friendly.

Taking a GET request as an example:

import requests

response = requests.get('http://www.baidu.com/')
print('Status code:', response.status_code)
print('Request URL:', response.url)
print('Headers:', response.headers)
print('Cookies:', response.cookies)
# print('Source as text:', response.text)
# print('Source as bytes:', response.content)

The output is as follows:

Status code: 200
Request URL: http://www.baidu.com/
Headers: {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Sun, 10 May 2020 02:43:33 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:28:23 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
Cookies:

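A note on the difference between response.text and response.content: response.text is the response body decoded to a str, which suits text data, while response.content is the raw body as bytes, which suits images and other files.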

Taking a POST request as an example:

import requests

data = {'word': 'hello'}
response = requests.post('http://www.baidu.com', data=data)
print(response.content)
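To see what the `data=` argument would actually send, a request can be prepared without contacting the server. This is a sketch using `requests.Request(...).prepare()`; the URL is simply the one from the example above, and nothing is transmitted.

```python
import requests

data = {'word': 'hello'}

# Build and prepare the request; no network traffic happens here.
prepared = requests.Request('POST', 'http://www.baidu.com', data=data).prepare()

print(prepared.method)                    # HTTP verb that would be used
print(prepared.headers['Content-Type'])   # how the body is encoded
print(prepared.body)                      # the form-encoded payload
```

A dict passed as `data` is form-encoded. To send a JSON body instead, requests accepts a `json=` argument.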

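Handling request headers

When a page uses anti-scraping measures to block malicious data collection and therefore rejects the request, sending the header information of a real browser can often get around the restriction.

Open the target page in a browser, right-click, choose "Inspect", switch to the "Network" tab, refresh the page, and select the first entry. The Headers panel on the right then shows the request headers.

For example: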

import requests

url = 'https://www.bilibili.com/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.content.decode())
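One reason a site may reject a script is that requests identifies itself with its own default User-Agent. The sketch below compares that default with the browser-like value from the example above. It only prepares the request through a `requests.Session` and sends nothing.

```python
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}

session = requests.Session()

# What requests announces when no User-Agent is supplied.
default_agent = session.headers['User-Agent']

# Merge the custom headers with the session defaults without sending anything.
request = requests.Request('GET', 'https://www.bilibili.com/', headers=headers)
prepared = session.prepare_request(request)

print(default_agent)
print(prepared.headers['User-Agent'])
```

Header names are matched case-insensitively, so the lowercase `user-agent` key replaces the session default rather than adding a second header.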

Network timeouts

When a page does not respond for a long time, the request is treated as timed out and the page cannot be opened.

For example:

import requests

url = 'http://www.baidu.com'

# Send the request 50 times in a loop
for a in range(0, 50):
    try:
        # Choose the timeout value according to your current network speed
        response = requests.get(url, timeout=0.03)  # timeout of 0.03 seconds
        print(response.status_code)
    except Exception as e:
        print('Exception: ' + str(e))  # print the exception message

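Each iteration prints either the status code or, when the request exceeds the timeout, the exception message.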
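The timeout behaviour can also be reproduced without Internet access. The sketch below assumes a hypothetical local test server (`SlowHandler`) that waits 0.5 seconds before answering, plus an illustrative helper `fetch_status`. It catches `requests.exceptions.Timeout` specifically and passes the timeout as a `(connect, read)` tuple.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

import requests


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler that waits before answering."""

    def do_GET(self):
        time.sleep(0.5)  # simulate a slow page
        try:
            self.send_response(200)
            self.send_header('Content-Length', '2')
            self.end_headers()
            self.wfile.write(b'ok')
        except OSError:
            pass  # the client already gave up

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_status(url, timeout):
    """Return the status code, or None if the request timed out."""
    try:
        return requests.get(url, timeout=timeout).status_code
    except requests.exceptions.Timeout:
        return None


server = ThreadingHTTPServer(('127.0.0.1', 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

impatient = fetch_status(url, timeout=(1, 0.1))  # (connect, read) in seconds
patient = fetch_status(url, timeout=(1, 3))
print(impatient, patient)

server.shutdown()
server.server_close()
```

Catching only the timeout exception keeps other failures, such as a refused connection, visible instead of hiding them behind a generic handler.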

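Proxy service

Setting a proxy IP can help when a page that could be scraped a moment ago can no longer be scraped and the request fails with an error such as: "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond."

For example: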

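import requests

# Set the proxy IPs
proxy = {'http': '117.45.139.139:9006',
         'https': '121.36.210.88:8080'
         }

# Send the request
url = 'https://www.baidu.com'
response = requests.get(url, proxies=proxy)

# In other words, use response.text to get text data
# and response.content to get images or other files.

# Print the page source from the byte stream (bytes), decoded as UTF-8
print(response.content.decode())

# Print the page source as text (str)
# response.text is decoded with the charset from the response headers; for text/*
# responses without a declared charset, requests falls back to ISO-8859-1, and
# when no encoding can be determined from the headers it guesses from the content.
print(response.text)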
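The ISO-8859-1 fallback explains why `response.text` can look garbled on a Chinese page while `response.content.decode()` does not. The sketch below illustrates this offline: it asks `requests.utils.get_encoding_from_headers` which encoding would be chosen from a Content-Type header alone, and then decodes a stand-in byte string both ways. The sample bytes are an assumption, not real response data.

```python
from requests.utils import get_encoding_from_headers

# Encoding requests would pick from the Content-Type header alone.
no_charset = get_encoding_from_headers({'content-type': 'text/html'})
with_charset = get_encoding_from_headers({'content-type': 'text/html; charset=utf-8'})
print(no_charset, with_charset)

raw = '百度一下'.encode('utf-8')      # stand-in for response.content

garbled = raw.decode('iso-8859-1')   # what the fallback encoding produces
correct = raw.decode('utf-8')        # what response.content.decode() produces
print(garbled == correct)
```

With a live response, assigning `response.encoding = 'utf-8'` (or `response.apparent_encoding`) before reading `response.text` has the same effect as decoding the bytes explicitly.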

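The Beautiful Soup module

Beautiful Soup is a Python library for extracting data from HTML and XML files. It automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it automatically; in that case you only need to specify the original encoding.

For example: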

from bs4 import BeautifulSoup

html_doc = """
The Dormouse's story
<p class="title">The Dormouse's story</p>
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
"""

# Create the object
soup = BeautifulSoup(html_doc, features='lxml')

# Or create the object from an HTML file to be parsed
# soup = BeautifulSoup(open('index.html'), features='lxml')

print('Source code:', soup)  # print the parsed HTML

The result is as follows:

The Dormouse's story
<p class="title">The Dormouse's story</p>
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

Scraping the title of the Baidu News home page with Beautiful Soup

from bs4 import BeautifulSoup
import requests

response = requests.get('http://news.baidu.com')
soup = BeautifulSoup(response.text, features='lxml')
print(soup.find('title').text)

The result is as follows:

　　百度新闻-海量中文信息平台
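Beyond the page title, the same `find` and `find_all` calls can pull out other elements. The following is a small offline sketch: the HTML snippet is a made-up sample rather than a real page, and it uses the built-in `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, not taken from any real site.
html = """
<html><head><title>Sample page</title></head>
<body>
<p class="intro">Links:</p>
<a href="/first" id="link1">First</a>
<a href="/second" id="link2">Second</a>
</body></html>
"""

# 'html.parser' ships with Python, so lxml is not needed for this sketch.
soup = BeautifulSoup(html, 'html.parser')

title = soup.find('title').text                              # text of the first <title>
links = [(a.text, a['href']) for a in soup.find_all('a')]    # every <a> tag
intro = soup.find('p', class_='intro').text                  # filter by CSS class

print(title)
print(links)
print(intro)
```

`find` returns the first match (or None when nothing matches), while `find_all` returns a list, so code that scrapes real pages may want to check for None before reading `.text`.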

Finally, I hope you found this useful. Please give it a like!
