Python Web Data Scraping (Scraping Web Page Data Through Proxy IPs)

优采云, published 2022-03-15 05:16

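In the era of online marketing, many older approaches no longer fit the internet and often fail to deliver results. Running online marketing well takes a range of tools, one for each step. As with promotion through online Q&A, proxy IP support is indispensable. Find the most effective tools for each stage of the process to improve efficiency and get the most out of your marketing.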

The Python code below fetches web page data through proxy IPs.

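　　<p style="line-height: 2em;">'''
Python 3.x
Description: this demo shows how a crawler requests a web page through (dynamic) proxy IPs, using multiple threads.
Logic: every 5 seconds, fetch IPs from the API endpoint and start one thread per IP to fetch the page source.
'''
import threading
import time

import requests
import urllib3

ips = []


# Thread class that crawls the target page through one proxy
class CrawlThread(threading.Thread):
    def __init__(self, proxyip):
        super().__init__()
        self.proxyip = proxyip

    def run(self):
        # Start timing
        start = time.time()
        # Suppress the warning raised because certificate verification is disabled
        urllib3.disable_warnings()
        # Assumption: the proxy is a plain HTTP proxy, so both entries use http://
        proxy_url = "http://" + self.proxyip
        # Request the URL through the proxy. verify=False skips SSL certificate
        # verification (this avoids SSL errors at the cost of security).
        try:
            html = requests.get(url=targetUrl, proxies={"http": proxy_url, "https": proxy_url}, verify=False, timeout=15).content.decode()
        except (requests.RequestException, UnicodeDecodeError) as exc:
            # A dead or slow proxy is expected; report it and end this thread
            print(self.name + " failed via proxy " + self.proxyip + ": " + str(exc))
            return
        # Stop timing
        end = time.time()
        # Print the result; time.time() differences are in seconds
        print(self.name + " used proxy " + self.proxyip + ", took " + str(end - start) + " seconds, and received this HTML:\n" + html + "\n*************")


# Thread class that fetches proxy IPs
class GetIpThread(threading.Thread):
    def __init__(self, fetchSecond):
        super().__init__()
        self.fetchSecond = fetchSecond

    def run(self):
        global ips
        while True:
            # Fetch the IP list
            res = requests.get(apiUrl, timeout=15).content.decode()
            # Split the response into one IP per line
            ips = res.split('\n')
            # Use each IP
            for proxyip in ips:
                if proxyip.strip():
                    # Start one thread per IP (stripped of stray whitespace such as \r)
                    CrawlThread(proxyip.strip()).start()
            # Sleep until the next fetch
            time.sleep(self.fetchSecond)


if __name__ == '__main__':
    # API endpoint that returns proxy IPs (placeholder)
    apiUrl = "http:xxxx"
    # Target site to crawl
    targetUrl = "http://ip.chinaz.com/getip.aspx"
    # Interval between IP fetches; 5 seconds is recommended
    fetchSecond = 5
    # Start fetching IPs automatically
    GetIpThread(fetchSecond).start()
</p>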
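The demo above starts one thread per IP on every fetch, so the thread count grows with the size of the IP list. As a rough sketch of one alternative, the steps can be split into small helpers and run on a bounded thread pool. The names `parse_ip_list`, `build_proxies`, `crawl_all`, and `fake_fetch` are hypothetical and not part of the original code, and the sketch assumes the API returns one `host:port` per line and that the proxies are plain HTTP proxies. The request itself is passed in as a function so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_ip_list(text):
    """Split an API response into clean "host:port" strings, skipping blanks."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_proxies(proxyip):
    """Build the proxies mapping that requests.get() accepts.

    Assumption: the proxy is a plain HTTP proxy, so both entries use http://.
    """
    proxy_url = "http://" + proxyip
    return {"http": proxy_url, "https": proxy_url}


def crawl_all(proxy_ips, fetch, max_workers=5):
    """Call fetch(proxies) once per proxy on a bounded thread pool.

    Returns a dict mapping each proxy to either the fetched value or the
    exception it raised, so one bad proxy does not hide the other results.
    """
    def attempt(proxyip):
        try:
            return proxyip, fetch(build_proxies(proxyip))
        except Exception as exc:  # a dead proxy is expected, so record it
            return proxyip, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(attempt, proxy_ips))


if __name__ == "__main__":
    # Stand-in for a real request such as
    # requests.get(targetUrl, proxies=proxies, timeout=15).text
    def fake_fetch(proxies):
        return "fetched via " + proxies["http"]

    # Documentation-range addresses, used here only as sample input
    sample = "192.0.2.10:8080\r\n\r\n198.51.100.7:3128\n"
    for ip, result in crawl_all(parse_ip_list(sample), fake_fetch).items():
        print(ip, "->", result)
```

To use it with real requests, replace `fake_fetch` with a function that calls `requests.get` with the `proxies` argument it receives. `max_workers` caps how many requests run at once, whatever the length of the IP list.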

This article introduced a way to scrape web page data through proxy IPs with a Python crawler.
