Scraping web content via AJAX (how to analyze an AJAX interface and simulate AJAX requests to crawl data, illustrated)

优采云, published 2021-10-01 19:24

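Sites that load their content via AJAX can be scraped by analyzing the AJAX interface and reading the JSON it returns. Using 今日头条 (Toutiao) as the example, this post shows how to analyze an AJAX interface and simulate its requests to fetch data.

Take a Toutiao search for 街拍 (street snaps). The page initially shows only part of the results; you have to scroll down to load the rest. Let's analyze the AJAX interface behind it.

Open the browser's developer tools, select the Network tab, and click XHR to filter the list down to AJAX requests. The request carries many parameters; the most self-explanatory one is keyword, which is the term we searched for. Then look at the request's Preview.

The Preview shows many data entries. Expanding one reveals plenty of useful information, such as the street-snap title and the image addresses.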
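To make the shape of that preview concrete, here is a minimal sketch of pulling titles and image addresses out of such a response. The sample payload is invented for illustration; only the field names data, title, image_list, and url come from the code in this post, and the helper name extract_images is hypothetical.

```python
import json

# Hypothetical sample shaped like the preview; all values are invented.
SAMPLE = """
{
  "data": [
    {"title": "Street snap A",
     "image_list": [{"url": "//example.com/a1.jpg"}, {"url": "//example.com/a2.jpg"}]},
    {"title": "No images here"},
    {"title": "Street snap B", "image_list": ["//example.com/b1.jpg"]}
  ]
}
"""


def extract_images(payload):
    """Return (title, image_url) pairs from a parsed response dict."""
    pairs = []
    for item in payload.get("data") or []:
        for image in item.get("image_list") or []:
            # An entry may be a dict with a "url" key or a bare URL string.
            url = image.get("url") if isinstance(image, dict) else image
            if url:
                pairs.append((item.get("title"), url))
    return pairs


pairs = extract_images(json.loads(SAMPLE))
for title, url in pairs:
    print(title, url)
```

Entries without an image_list are simply skipped, so the caller only ever sees results that have something to download.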

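Scrolling down makes one more AJAX request appear in the filtered list, as shown below.

Its offset parameter has changed from 0 to 20, and a closer look at the page shows that it displays exactly 20 results at a time.

Scrolling to the third page changes offset to 40. So offset is 0 for the first page, 20 for the second, and 40 for the third: offset is simply the number of results to skip, and it is what implements paging. We can therefore use urlencode to append these parameters to the URL, issue the AJAX request ourselves, control paging through the offset we pass in, and call response.json() to get the JSON data the server returns.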
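The paging rule can be sketched without touching the network: the offset is the zero-based page number times the page size, and urlencode turns the parameter dict into a query string. The base URL and parameter names are the ones used in this post's code; build_url and PAGE_SIZE are hypothetical names chosen for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://www.toutiao.com/search_content/?"
PAGE_SIZE = 20  # results per request, matching the observed step of 20


def build_url(page, keyword="街拍"):
    """Build the request URL for a zero-based page number."""
    params = {
        "offset": page * PAGE_SIZE,  # 0, 20, 40, ...
        "format": "json",
        "keyword": keyword,
        "autoload": "true",
        "count": str(PAGE_SIZE),
        "cur_tab": "1",
        "from": "search_tab",
    }
    # urlencode percent-encodes non-ASCII values such as the keyword as UTF-8.
    return BASE_URL + urlencode(params)


urls = [build_url(page) for page in range(3)]
for url in urls:
    print(url)
```

Only the offset value differs between the three URLs, which is why a single loop over offsets is enough to walk through the result pages.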

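The plan for the code:
1. Analyze the page's AJAX interface to find out which parameters it needs.
2. Use urlencode to append those parameters to the request URL, and choose which page to fetch through the offset parameter.
3. Generate the requests for the different pages and pull the image URLs out of the returned JSON.
4. Request each image URL to download the image.
5. Save the images to a folder.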

The actual code

import os
from hashlib import md5
from urllib.parse import urlencode, urljoin

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36",
    # Mark the request as AJAX, the way the browser does
    "X-Requested-With": "XMLHttpRequest",
}


def get_page(offset):
    """
    Request one page of search results from the AJAX interface.
    :param offset: offset of the first result; controls paging
    :return: the parsed JSON as a dict, or None if the request fails
    """
    params = {
        "offset": offset,
        "format": "json",
        "keyword": "街拍",
        "autoload": "true",
        "count": "20",
        "cur_tab": "1",
        "from": "search_tab",
    }
    url = "https://www.toutiao.com/search_content/?" + urlencode(params)
    try:
        response = requests.get(url, headers=headers, timeout=5)
        if response.status_code == 200:
            return response.json()
    except (requests.RequestException, ValueError):
        # Network error, timeout, or a body that is not valid JSON
        pass
    return None


def get_image(page_json):
    """
    Yield one dict per image found in a page of results.
    :param page_json: the JSON data returned by get_page
    :return: generator of {"title": ..., "image": ...}
    """
    if not page_json:
        return
    for item in page_json.get("data") or []:
        title = item.get("title")
        images = item.get("image_list")
        if images:
            for image in images:
                # An entry may be a dict with a "url" key or a bare URL string
                src = image.get("url") if isinstance(image, dict) else image
                if src:
                    # Protocol-relative URLs ("//host/path") get an http scheme
                    yield {"title": title, "image": urljoin("http:", src)}


def save_images(item):
    """
    Save an image into a folder named after its title.
    :param item: dict produced by get_image
    :return:
    """
    # Fall back to "Notitle" so the target folder always exists before writing
    title = item.get("title") or "Notitle"
    os.makedirs(title, exist_ok=True)
    try:
        response = requests.get(item.get("image"), timeout=5)
        if response.status_code == 200:
            # Name the file by the MD5 of its content so duplicates are skipped
            file_path = "{0}/{1}.{2}".format(title, md5(response.content).hexdigest(), "jpg")
            if not os.path.exists(file_path):
                with open(file_path, "wb") as f:
                    f.write(response.content)
            else:
                print("Already Downloaded", file_path)
    except requests.RequestException:
        print("Failed Download Image")


def main(offset):
    """
    Main crawl logic for one page.
    :param offset: offset of the page to crawl
    :return:
    """
    page_json = get_page(offset)
    for item in get_image(page_json):
        print(item)
        save_images(item)


# Offsets 0, 20, ..., 180; starting at 0 so the first page is not skipped
groups = [i * 20 for i in range(10)]

if __name__ == '__main__':
    for group in groups:
        main(group)
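The save step names each file by the MD5 digest of its bytes, which is what lets it recognize an image it has already stored. Below is a minimal offline sketch of that idea, using made-up byte strings in place of downloaded images and a temporary directory in place of the title folder; the helper name save_once is hypothetical.

```python
import os
import tempfile
from hashlib import md5


def save_once(folder, content, ext="jpg"):
    """Write content under a name derived from its MD5; return (path, written)."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, "{0}.{1}".format(md5(content).hexdigest(), ext))
    if os.path.exists(path):
        # Identical bytes hash to the same name, so this content is already stored.
        return path, False
    with open(path, "wb") as f:
        f.write(content)
    return path, True


with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "Notitle")
    first = save_once(folder, b"fake image bytes")
    second = save_once(folder, b"fake image bytes")  # same content again
    third = save_once(folder, b"other image bytes")
    saved_files = sorted(os.listdir(folder))

print(first[1], second[1], third[1])  # True False True
print(len(saved_files))               # 2
```

Because the name depends only on the content, the same picture reached through two different URLs is still written once per folder.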

Crawl results

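Analyzing the AJAX interface and simulating its requests is an easier way to crawl than driving a browser with selenium, but the code is less reusable because every site's interface is different. When scraping AJAX-loaded data, whether to simulate a browser with selenium or to request the interface directly is a choice to make according to your own needs.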
