Python web scraping (1. Fetching web pages with urllib.request)

优采云, published 2021-09-14 14:08


When you need information from a set of web pages, writing a crawler in Python is a convenient way to collect it.

Reposted from:

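1. Fetching a web page with urllib.request

urllib is an HTTP library built into Python. It lets you collect data in a few simple steps, and combined with an HTML parsing library such as Beautiful Soup it can serve as the basis for large crawlers that gather data from the web.

Note: the sample code is written for Python 3. In Python 3, urllib combines Python 2's urllib and urllib2; Python 2's urllib2 corresponds to urllib.request in Python 3.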

A simple example:

import urllib.request  # import urllib.request

response = urllib.request.urlopen('http://www.zhihu.com')  # open the URL
html = response.read()  # read the body as bytes
html = html.decode('utf-8')  # decode the bytes into a str
print(html)
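The example above assumes the request succeeds and that the page is UTF-8. A slightly more defensive sketch follows; the helper name fetch_html and its default values are assumptions made for illustration, not part of the original article. It closes the connection with a context manager, sets a timeout, and reads the charset from the Content-Type header when the server sends one.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10, default_charset='utf-8'):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            raw = response.read()
            # Prefer the charset declared in the Content-Type header, if any.
            charset = response.headers.get_content_charset() or default_charset
    except urllib.error.URLError as exc:  # HTTPError is a subclass of URLError
        print('request failed:', exc)
        return None
    # errors='replace' keeps going even if a few bytes do not match the charset.
    return raw.decode(charset, errors='replace')
```

It would be called as fetch_html('http://www.zhihu.com'), and the caller checks for None before using the result.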

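2. Spoofing request headers

Sometimes the server rejects requests sent by a crawler. In that case the crawler needs to present itself as a browser used by a human. This is usually done by setting the request headers yourself, for example: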

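import urllib.request

url = 'http://www.zhihu.com'  # same URL as the first example; replace as needed

head = {}
head['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Firefox/45.0'

# Pass the headers by keyword: the second positional argument of Request is the request body
req = urllib.request.Request(url, headers=head)
response = urllib.request.urlopen(req)
html = response.read().decode('utf-8')
print(html)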
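To check which headers a request will carry before sending it, the Request object can be inspected locally. The sketch below is illustrative only: the helper build_request and the Accept-Language and Referer values are assumptions, not something the original article specifies.

```python
import urllib.request

DEFAULT_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) '
                  'Gecko/20100101 Firefox/45.0',
    'Accept-Language': 'en-US,en;q=0.8',  # assumed value, for illustration
}


def build_request(url, extra_headers=None):
    """Return a Request carrying the default headers plus any extras."""
    headers = dict(DEFAULT_HEADERS)
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)


req = build_request('http://www.zhihu.com', {'Referer': 'http://www.zhihu.com/'})
# urllib stores header names via str.capitalize(), hence 'User-agent'.
print(req.get_header('User-agent'))
print(req.header_items())
```

Nothing is sent over the network here; the request only goes out when req is passed to urllib.request.urlopen().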

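3. Spoofing the request body

Some sites require data to be sent to the server with a POST request, so the crawler has to construct the request body itself.

To write a script that uses Youdao Dictionary's online translation, open the developer tools in Chrome and, under Network, find the request whose method is POST. Inspecting its data shows that the 'i' field of the request body holds the URL-encoded text to translate. The request body can therefore be built by hand, for example: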

import urllib.request
import urllib.parse
import json

url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null'

while True:
    content = input('Enter text to translate: ')
    if content == 'exit!':
        break

    # Request body
    data = {}
    data['type'] = "AUTO"
    data['i'] = content
    data['doctype'] = "json"
    data['xmlVersion'] = "1.8"
    data['keyfrom'] = "fanyi.web"
    data['ue'] = "UTF-8"
    data['action'] = "FY_BY_CLICKBUTTON"
    data['typoResult'] = "true"
    # URL-encode the form fields and convert them to bytes
    data = urllib.parse.urlencode(data).encode('utf-8')

    head = {}
    head['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Firefox/45.0'

    req = urllib.request.Request(url, data, head)  # set both the request body and the headers
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')
    target = json.loads(html)
    print('Translation: ', (target['translateResult'][0][0]['tgt']))
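To see what the encoding step actually produces, the body can be built and inspected without sending anything. This is a minimal sketch with a reduced, made-up form (only two of the fields used above) and the bare endpoint path without its query string.

```python
import urllib.parse
import urllib.request

form = {'i': '你好 world', 'doctype': 'json'}

# urlencode percent-encodes non-ASCII text as UTF-8 and turns spaces into '+'.
body = urllib.parse.urlencode(form).encode('utf-8')
print(body)  # b'i=%E4%BD%A0%E5%A5%BD+world&doctype=json'

# Supplying data switches the request method from GET to POST.
req = urllib.request.Request('http://fanyi.youdao.com/translate', data=body)
print(req.get_method())  # POST
```

The data argument must be bytes, which is why the encoded string is passed through .encode('utf-8') before it is handed to Request.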

The request header can also be set with the add_header() method, for example:

import urllib.request
import urllib.parse
import json

url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null'

while True:
    content = input('Enter text to translate (exit!): ')
    if content == 'exit!':
        break

    # Request body
    data = {}
    data['type'] = "AUTO"
    data['i'] = content
    data['doctype'] = "json"
    data['xmlVersion'] = "1.8"
    data['keyfrom'] = "fanyi.web"
    data['ue'] = "UTF-8"
    data['action'] = "FY_BY_CLICKBUTTON"
    data['typoResult'] = "true"
    # URL-encode the form fields and convert them to bytes
    data = urllib.parse.urlencode(data).encode('utf-8')

    req = urllib.request.Request(url, data)
    # Add the header after the Request object has been created
    req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Firefox/45.0')
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')
    target = json.loads(html)
    print('Translation: ', (target['translateResult'][0][0]['tgt']))
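Both scripts index straight into target['translateResult'][0][0]['tgt'], which raises an exception if the server returns something else. A more tolerant extraction might look like the sketch below. The sample payload is made up to match the index path used above; it is not a captured response, and the helper name extract_translations is hypothetical.

```python
import json

# Made-up payload shaped like the path used above; not a real server response.
sample = '{"translateResult": [[{"tgt": "hello"}], [{"tgt": "world"}]]}'


def extract_translations(payload):
    """Return every 'tgt' string under translateResult; [] for invalid JSON or a missing key."""
    try:
        doc = json.loads(payload)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return []
    if not isinstance(doc, dict):
        return []
    results = []
    for paragraph in doc.get('translateResult') or []:
        for segment in paragraph:
            if isinstance(segment, dict) and 'tgt' in segment:
                results.append(segment['tgt'])
    return results


print(extract_translations(sample))      # ['hello', 'world']
print(extract_translations('not json'))  # []
```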

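4. Using a proxy IP

To avoid having your IP address blocked because the crawler sends requests too frequently, you can route requests through a proxy IP, for example: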

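import urllib.request

# The argument is a dict of the form {'scheme': 'proxy_ip:port'};
# 'type' and 'ip:port' are placeholders to fill in
proxy_support = urllib.request.ProxyHandler({'type': 'ip:port'})
# Build a custom opener
opener = urllib.request.build_opener(proxy_support)
# Install the opener so that urllib.request.urlopen() uses it too
urllib.request.install_opener(opener)
# Call the opener directly
opener.open(url)  # url: the page to fetch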
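If only some requests should go through the proxy, one option is to skip install_opener() and keep the opener as a local object. The following is a sketch under assumptions: the helper name build_proxy_opener is hypothetical and '127.0.0.1:8080' is a placeholder address, not a working proxy.

```python
import urllib.request


def build_proxy_opener(proxy_address):
    """Return an opener that sends http and https requests through proxy_address."""
    proxies = {'http': proxy_address, 'https': proxy_address}
    handler = urllib.request.ProxyHandler(proxies)
    return urllib.request.build_opener(handler)


# '127.0.0.1:8080' is a placeholder, not a working proxy.
opener = build_proxy_opener('127.0.0.1:8080')

# Inspect the opener locally; no request is sent here.
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)

# Only this opener uses the proxy. urllib.request.urlopen() is unaffected
# because install_opener() was not called.
```

A page would then be fetched with opener.open(url), while plain urllib.request.urlopen() calls elsewhere in the program keep connecting directly.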

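Note: a crawler that hits the target site too frequently consumes a lot of server resources, and a large distributed crawler aimed at a single site can even amount to a DDoS attack on it. Crawlers should therefore be used responsibly, with the crawl frequency and timing planned sensibly. For example, crawl when the server is relatively idle (such as early morning), and pause for a while after each crawl task completes.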
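The pause between tasks can be built into the crawl loop itself. Here is a minimal sketch, assuming a caller-supplied fetch function; the name crawl_politely, the two-second default and the injectable sleep parameter are illustrative choices, not recommendations from the original article.

```python
import time


def crawl_politely(urls, fetch, delay=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause before every request except the first
        pages[url] = fetch(url)
    return pages


if __name__ == '__main__':
    # str.upper stands in for a real fetch function so the demo needs no network.
    demo = crawl_politely(['page-1', 'page-2'], fetch=str.upper, delay=0.1)
    print(demo)  # {'page-1': 'PAGE-1', 'page-2': 'PAGE-2'}
```

Passing sleep as a parameter keeps the loop easy to exercise without real waiting; in normal use the default time.sleep applies.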

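5. Detecting a page's encoding

Although most web pages are encoded in UTF-8, you will sometimes meet pages that use another encoding, so you need to know a page's encoding before you can decode the downloaded bytes correctly.

chardet is a third-party Python module that can detect a page's encoding automatically.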

Install chardet: pip install chardet

Usage:

import urllib.request
import chardet

url = 'http://www.baidu.com'
html = urllib.request.urlopen(url).read()
print(chardet.detect(html))
# Sample output: {'confidence': 0.99, 'encoding': 'utf-8'}
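The detected encoding is only a guess, so it helps to have a fallback when decoding. The sketch below is one possible arrangement, not a prescribed method: the helper decode_page and its order of attempts (declared charset, then chardet's guess, then UTF-8 with replacement characters) are assumptions. chardet is imported optionally so the function still works where it is not installed.

```python
try:
    import chardet  # third-party; optional in this sketch
except ImportError:
    chardet = None


def decode_page(raw, declared=None, fallback='utf-8'):
    """Decode raw bytes using the declared charset, then chardet's guess, then fallback."""
    candidates = [declared]
    if chardet is not None:
        candidates.append(chardet.detect(raw)['encoding'])  # may be None
    for name in candidates:
        if not name:
            continue
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or wrong guess; try the next one
    # Last resort: never raise, substitute U+FFFD for undecodable bytes.
    return raw.decode(fallback, errors='replace')
```

Here declared would typically be the charset taken from the response's Content-Type header, when the server provides one.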

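6. Getting the redirect URL

Sometimes a page can only be reached after one or more redirects from the original URL, so redirects need to be handled correctly.

The head() function of the requests module can be used to obtain the URL a page redirects to, for example: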

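import requests

url = 'https://unsplash.com/photos/B1amIgaNkwA/download/'
res = requests.head(url)  # head() does not follow redirects by default
# Named redirect_url rather than re, to avoid shadowing the standard re module
redirect_url = res.headers['Location']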
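Reading the Location header directly fails with a KeyError when the response is not a redirect, and the header may contain a relative path. A small helper can cover both cases. This is a sketch using only the standard library; the name redirect_target and the set of status codes treated as redirects are assumptions.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(request_url, status_code, headers):
    """Return the absolute redirect URL, or None if the response is not a redirect."""
    location = headers.get('Location')
    if status_code not in REDIRECT_CODES or not location:
        return None
    # Location may be relative, so resolve it against the URL that was requested.
    return urljoin(request_url, location)
```

With the requests example above it would be called as redirect_target(url, res.status_code, res.headers). For a chain of several redirects, the call is repeated on each new URL until it returns None.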
