Scraping Web Page Data (STM32)

Published by 优采云: 2022-02-18 16:20


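I. Using the urllib library

1. Fetching the full source of a web page with urllib

import urllib.request

url = 'http://www.baidu.com'
response = urllib.request.urlopen(url)
print(type(response))  # <class 'http.client.HTTPResponse'>
print(dir(response))  # dir() lists all of the object's methods and attributes
body = response.read()  # read() consumes the stream, so call it once and keep the bytes
print(type(body))  # <class 'bytes'>
print(body.decode())  # decode() uses UTF-8 by default
print(response.geturl())  # http://www.baidu.com
print(response.getcode())  # 200
print(response.info())  # Bdpagetype: 1...
print(response.headers)  # same as info()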
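
One detail is worth seeing in isolation: read() consumes the response stream, so a second call returns empty bytes. The following is a minimal sketch rather than the article's own code. The helper name `read_text` is hypothetical, and a `data:` URL stands in for a real site so that the example runs without network access.

```python
import urllib.request


def read_text(url, default_charset="utf-8"):
    """Fetch url once and return (decoded text, result of a second read)."""
    with urllib.request.urlopen(url) as response:
        body = response.read()      # the first read returns the whole body
        leftover = response.read()  # the stream is exhausted by now
        # Prefer the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or default_charset
    return body.decode(charset), leftover


# Stand-in for a real page: a data: URL carrying UTF-8 bytes inline.
text, leftover = read_text("data:text/plain;charset=utf-8,%E7%88%AC%E8%99%AB")
print(text)
print(leftover)
```

Reading the charset from the headers, with UTF-8 as a fallback, is one reasonable choice for pages that do not use UTF-8.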

2. A simple GET request with parameters using urllib

import urllib.request
import urllib.parse

url = 'http://www.baidu.com/s'
params = {'wd': 'NBA全明星'}  # dict format
params = urllib.parse.urlencode(params)  # URL-encode the parameters
print(params)  # wd=NBA%E5%85%A8%E6%98%8E%E6%98%9F
url = url + '?' + params  # http://www.baidu.com/s?wd=NBA%E5%85%A8%E6%98%8E%E6%98%9F
response = urllib.request.urlopen(url)
print(response.read().decode())
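
The encoding step can be checked without sending anything. The sketch below wraps it in a hypothetical helper, `build_url`, and then decodes the query string again with `parse_qs` to show that the round trip is lossless.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_url(base, params):
    """Append URL-encoded query parameters to base."""
    return base + "?" + urlencode(params)


url = build_url("http://www.baidu.com/s", {"wd": "NBA全明星"})
print(url)

# Decoding reverses the step: parse_qs maps each key to a list of values.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```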

3. Sending a POST request with urllib

import urllib.request
import urllib.parse

url = 'http://httpbin.org/post'
data = {'username': '恩比德', 'age': 28}
data = urllib.parse.urlencode(data)
print(data)  # username=%E6%81%A9%E6%AF%94%E5%BE%B7&age=28 (decode with urllib.parse.unquote(str))
data = data.encode()  # POST data must be bytes: URL-encode first, then convert to bytes
request = urllib.request.Request(url, data=data)  # supplying data makes this a POST, unless method= says otherwise
response = urllib.request.urlopen(request)  # send the constructed Request object with urlopen
print(response.read().decode())
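
To confirm that the data argument is what switches the method, a Request object can be inspected before it is sent. This is a small offline sketch; the helper name `build_post_request` is hypothetical.

```python
import urllib.parse
import urllib.request


def build_post_request(url, fields):
    """URL-encode fields, convert them to bytes and attach them as the body."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(url, data=body)


post_req = build_post_request("http://httpbin.org/post", {"username": "恩比德", "age": 28})
get_req = urllib.request.Request("http://httpbin.org/get")

print(post_req.get_method())  # method chosen because data is present
print(post_req.data)          # the bytes that would be sent as the body
print(get_req.get_method())   # no data, so the default method applies
```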

4. Sending a request with urllib while posing as a browser

import urllib.request

url = 'http://httpbin.org/get'
# User-Agent identifies the browser the request claims to come from
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'}
request = urllib.request.Request(url, headers=headers)  # build the Request object with the headers argument
response = urllib.request.urlopen(request)  # send the request using the Request object
print(response.read().decode())
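
It can help to see what urllib would send without a custom header. The sketch below, which runs offline, compares the header stored on the Request with the default User-Agent that a freshly built opener adds; the variable names are illustrative only.

```python
import urllib.request

UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36")

request = urllib.request.Request("http://httpbin.org/get", headers={"User-Agent": UA})

# Request normalizes header names with str.capitalize(), hence "User-agent".
sent_ua = request.get_header("User-agent")
print(sent_ua)

# Without an explicit header, the default opener identifies itself as Python-urllib.
default_ua = dict(urllib.request.build_opener().addheaders)["User-agent"]
print(default_ua)
```

A site that filters on User-Agent would see the second value unless the header is overridden as shown in the article.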

5. Sending a request through a proxy IP

import urllib.request
import random

# Free proxy IPs (from zhimahttp.com); free proxies are short-lived, slow, and not very anonymous
proxy_list = [
    {'http': '183.166.180.184:9999'},
    {'http': '221.230.216.169:8888'},
    {'http': '182.240.34.61:9999'},
    {'http': '121.226.154.250:8080'}
]
proxy = random.choice(proxy_list)  # pick one at random
print(proxy)
# 1. Build a ProxyHandler
proxy_handler = urllib.request.ProxyHandler(proxy)
# 2. Use the ProxyHandler to build a custom opener
opener = urllib.request.build_opener(proxy_handler)
# 3. Send the request through the proxy
request = urllib.request.Request('http://www.baidu.com')  # build a Request object and pass it to the opener's open() method
response = opener.open(request)
print(response.read().decode())

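We can route requests through proxy servers and switch to a different proxy at intervals. If one IP gets banned, crawling can continue from another IP, which is an effective way around bans imposed by the site.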
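
The fallback idea can be sketched as a loop that tries proxies in random order until one succeeds. This is only an illustration under assumptions: the function names `open_via_proxy` and `fetch_with_rotation`, the `fetch` parameter, and the 5-second default timeout are all hypothetical, not part of the original code.

```python
import random
import urllib.error
import urllib.request


def open_via_proxy(url, proxy, timeout):
    """Fetch url through one proxy given as {'http': 'host:port'}."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))
    with opener.open(url, timeout=timeout) as response:
        return response.read()


def fetch_with_rotation(url, proxies, fetch=open_via_proxy, timeout=5):
    """Try the proxies in random order and return the first successful body."""
    candidates = list(proxies)
    random.shuffle(candidates)
    for proxy in candidates:
        try:
            return fetch(url, proxy, timeout)
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            print(f"proxy {proxy} failed: {exc}")
    raise RuntimeError("all proxies failed")
```

With the list from the article, a call would look like `fetch_with_rotation('http://www.baidu.com', proxy_list)`. The `fetch` parameter exists so that the rotation logic can be exercised with a stub instead of live proxies.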

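6. Catching URLError exceptions

import urllib.request
import urllib.error

url = 'http://www.whit.edu.cn/net'  # a URL that does not exist
request = urllib.request.Request(url)
try:
    urllib.request.urlopen(request)
except urllib.error.HTTPError as e:
    print(e.code)  # 404
except urllib.error.URLError as e:
    print(e.reason)  # network-level failure, such as a failed DNS lookup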
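
Because HTTPError is a subclass of URLError, the order of the checks matters. The sketch below separates the two cases in a hypothetical helper, `describe_error`, and demonstrates it with hand-built exceptions so that it runs offline; `try_fetch` is likewise an illustrative name.

```python
import io
import urllib.error
import urllib.request


def describe_error(exc):
    """Turn a urllib exception into a short message."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return f"HTTP error {exc.code}: {exc.reason}"
    if isinstance(exc, urllib.error.URLError):
        return f"connection failed: {exc.reason}"
    return f"unexpected error: {exc!r}"


def try_fetch(url, timeout=5):
    """Return (body, None) on success or (None, message) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read(), None
    except urllib.error.URLError as exc:
        return None, describe_error(exc)


# Offline demonstration with hand-built exceptions instead of real requests.
not_found = urllib.error.HTTPError(
    "http://example.com/missing", 404, "Not Found", None, io.BytesIO())
unreachable = urllib.error.URLError("name resolution failed")
print(describe_error(not_found))
print(describe_error(unreachable))
```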

7. Catching timeout exceptions

import urllib.request
import urllib.error

try:
    url = 'http://218.56.132.157:8080'
    # timeout sets the time limit in seconds
    response = urllib.request.urlopen(url, timeout=1)
    result = response.read()
    print(result)
except Exception as error:
    print(error)  # e.g. <urlopen error timed out>
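
Catching a bare Exception hides whether the failure really was a timeout. With urllib, a timeout can surface directly as `socket.timeout` (for example while reading the response) or wrapped as the `reason` of a `URLError` (while connecting). The helper below, `is_timeout`, is a hypothetical sketch of how the two cases might be told apart; it runs offline.

```python
import socket
import urllib.error


def is_timeout(exc):
    """Return True if exc represents a timed-out request."""
    timeout_types = (socket.timeout, TimeoutError)
    if isinstance(exc, timeout_types):
        return True  # raised directly, e.g. while reading the response
    if isinstance(exc, urllib.error.URLError):
        return isinstance(exc.reason, timeout_types)  # wrapped while connecting
    return False


direct = socket.timeout("timed out")
wrapped = urllib.error.URLError(socket.timeout("timed out"))
other = urllib.error.URLError("name resolution failed")
print(is_timeout(direct), is_timeout(wrapped), is_timeout(other))
```

In the article's example, the except branch could call `is_timeout(error)` to decide whether a retry makes sense.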

II. Using the requests library

1. Sending a GET request with parameters using requests

import requests

params = {'wd': '爬虫'}
url = 'http://www.baidu.com/s'
response = requests.get(url, params=params)
print(response.text)
# print(type(response))  # <class 'requests.models.Response'>
# print(response.encoding)  # utf-8
# print(type(response.content))  # <class 'bytes'>
# print(response.content.decode())  # same result as response.text

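Note: when sending a GET request with parameters through requests, pass the parameters as a dict. They do not need to be URL-encoded or appended to the URL by hand. Compare this with item 2 of Part I.

Commonly used attributes of the Response object:

text: the response body as a string

encoding: the encoding of the response body

status_code: the HTTP status code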
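
To see the encoding that requests performs on the params dict, the request can be prepared without being sent. This is a minimal offline sketch that assumes the requests package is installed; it uses `requests.Request(...).prepare()` purely for inspection, which is not how the article sends its requests.

```python
import requests

# Build the request without sending it, to inspect the URL requests would use.
prepared = requests.Request(
    "GET", "http://www.baidu.com/s", params={"wd": "爬虫"}).prepare()
print(prepared.method)
print(prepared.url)
```

The printed URL already contains the percent-encoded query string, which is the step that had to be done by hand with urllib.parse.urlencode in Part I.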

2. Sending a POST request with the requests library

import requests

url = "http://httpbin.org/post"
data = {'name': '库里'}  # the data to send in the POST request
response = requests.post(url, data=data)
print(response.text)

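Note: when sending a POST request with requests, the data is passed as a dict. It does not need to be URL-encoded or converted to bytes. Compare this with item 3 of Part I.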
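
The same inspection trick shows what requests does with a data dict. The sketch below assumes the requests package is installed and prepares the POST request without sending it, so it runs offline.

```python
import requests

# Prepare (but do not send) the POST request to inspect the encoded body.
prepared = requests.Request(
    "POST", "http://httpbin.org/post", data={"name": "库里"}).prepare()
print(prepared.body)
print(prepared.headers["Content-Type"])
```

The body is form-encoded and the Content-Type header is set automatically, which replaces the manual urlencode and encode steps needed with urllib.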
