Scraping Web Page Data with Python (three ways to extract data from a web page)

优采云, published: 2021-09-19 21:11

0. Preface

0.1 Scraping web pages

This article introduces three ways to extract data from a web page: regular expressions, BeautifulSoup, and lxml.
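To make the first of these approaches concrete, here is a minimal sketch of regular-expression extraction. The HTML snippet, the row ids, and the `extract_value` helper are hypothetical and chosen only for illustration; they are not taken from the original article.

```python
import re

# Hypothetical HTML snippet standing in for a downloaded page.
html = '''
<table>
  <tr id="area_row"><td class="label">Area:</td><td class="value">244,820 square kilometres</td></tr>
  <tr id="population_row"><td class="label">Population:</td><td class="value">62,348,447</td></tr>
</table>
'''


def extract_value(html, row_id):
    """Return the text of the 'value' cell in the row with the given id, or None."""
    # Non-greedy matches keep the pattern from running past the first value cell.
    pattern = r'<tr id="{}">.*?<td class="value">(.*?)</td>'.format(re.escape(row_id))
    match = re.search(pattern, html, re.DOTALL)
    return match.group(1) if match else None


area = extract_value(html, 'area_row')
print(area)
```

A pattern like this is tied to the exact markup, so a small change in the page layout can break it. That fragility is the usual reason to compare it against parser-based tools such as BeautifulSoup and lxml.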

For details on the code used to fetch page content, see "Python web crawler - Your First Crawler". The following code downloads an entire web page:

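import requests


def download(url, num_retries=2, user_agent='wswp', proxies=None):
    '''Download the given URL and return the page content.

    Args:
        url (str): URL to download.

    Keyword args:
        user_agent (str): User agent (default: wswp).
        proxies (dict): Proxy mapping. Keys: 'http' / 'https';
            values: strings of the form 'http(s)://IP'.
        num_retries (int): Number of retries on 5xx errors (default: 2).
            A 5xx status is a server error: the server failed to fulfil
            an apparently valid request.
            https://zh.wikipedia.org/wiki/HTTP%E7%8A%B6%E6%80%81%E7%A0%81
    '''
    print('==========================================')
    print('Downloading:', url)
    # Set the User-Agent header; the default one is sometimes blocked by
    # a site's anti-scraping checks.
    headers = {'User-Agent': user_agent}
    try:
        resp = requests.get(url, headers=headers, proxies=proxies)
        html = resp.text  # page content as a string
        if resp.status_code >= 400:  # error handling: 4xx client errors return None
            print('Download error:', resp.text)
            html = None
            if num_retries and 500 <= resp.status_code < 600:
                # The source text is cut off after "if num_retries and 500".
                # Everything from this condition onward is an assumed
                # completion based on the docstring (retry on 5xx errors).
                return download(url, num_retries - 1, user_agent, proxies)
    except requests.exceptions.RequestException as e:
        print('Download error:', e)
        html = None
    return html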
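The retry rule described in the docstring (retry only on 5xx server errors, at most `num_retries` times, and give up immediately on 4xx) can be exercised without a network connection. The following is a minimal sketch under stated assumptions: `download_with_retries`, `make_flaky_fetch`, and the `fetch` callable returning `(status_code, text)` are hypothetical stand-ins for illustration, not part of the original code or of the requests library.

```python
def download_with_retries(fetch, url, num_retries=2):
    """Call fetch(url) and retry while it returns a 5xx status.

    fetch is a hypothetical callable returning (status_code, text).
    Returns the text on success, or None on a 4xx/5xx error.
    """
    status, text = fetch(url)
    if status >= 400:
        if num_retries and 500 <= status < 600:
            # Server error: try again with one fewer retry left.
            return download_with_retries(fetch, url, num_retries - 1)
        return None  # client error, or retries exhausted
    return text


def make_flaky_fetch(statuses):
    """Build a fake fetch that returns the given status codes in order."""
    remaining = list(statuses)
    calls = []

    def fetch(url):
        calls.append(url)
        status = remaining.pop(0)
        return status, 'body for {}'.format(url)

    return fetch, calls


# Two server errors, then success: the third attempt returns the body.
fetch, calls = make_flaky_fetch([503, 500, 200])
result = download_with_retries(fetch, 'http://example.com')
print(result, len(calls))

# A 404 is a client error, so it is not retried.
fetch_404, calls_404 = make_flaky_fetch([404])
result_404 = download_with_retries(fetch_404, 'http://example.com/missing')
print(result_404, len(calls_404))
```

With the default `num_retries=2`, a URL is requested at most three times: the initial attempt plus two retries.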
