Scraping web page content loaded via AJAX (notes on the DevTools Network panel: XHR tab for AJAX loading, Doc tab for page source)

优采云, published 2021-11-09 11:06


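First, an explanation of what an AJAX dynamic page is. You run into this all the time when scrolling through Weibo: as you keep scrolling down, more data keeps loading and appears on the page, similar to the figure below.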

Changing the offset parameter also shows the page loading more data (the step is 20); you can watch the requests change under the XHR tab.

Then fetch the data.
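To make the offset stepping concrete, here is a minimal sketch that builds the request URL for the first few pages. The endpoint and query parameters follow the script further down in this post; the helper name `build_search_url` is made up for illustration, and the endpoint may no longer respond the way it did when this was written.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.toutiao.com/search_content/?'  # endpoint used in this post
PAGE_SIZE = 20  # the step observed in the XHR requests


def build_search_url(offset, keyword):
    """Hypothetical helper: return the search URL for one page of results."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': str(PAGE_SIZE),
        'cur_tab': 3,
    }
    # urlencode percent-encodes the values and joins them with '&'.
    return BASE_URL + urlencode(params)


# Offsets for the first three pages: 0, 20, 40.
for page in range(3):
    print(build_search_url(page * PAGE_SIZE, '街拍'))
```

Only the `offset` value differs between the printed URLs, which matches what shows up in the XHR tab while scrolling.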

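To be studied further: Chrome DevTools (F12).

In the Network panel:

To see AJAX-loaded data, look under the XHR tab.

To see the page source, look under the Doc tab.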
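The XHR tab matters because what comes back there is usually JSON rather than HTML, so it can be read with a JSON parser instead of an HTML parser. A minimal sketch, assuming a response shaped the way the script in this post expects (`data`, `cell_type`, `title`, `image_list`, `url`); the sample body and the helper name `iter_image_urls` are invented for illustration and are not real Toutiao output.

```python
import json

# Invented sample of what an XHR response body might look like.
SAMPLE_BODY = '''
{
  "data": [
    {"title": "demo article",
     "image_list": [{"url": "//example.com/a.jpg"}, {"url": "//example.com/b.jpg"}]},
    {"cell_type": 26, "title": "flagged entry"},
    {"title": "no images"}
  ]
}
'''


def iter_image_urls(body):
    """Yield (title, absolute image URL) pairs from a JSON response body."""
    payload = json.loads(body)
    for item in payload.get('data') or []:
        if item.get('cell_type') is not None:
            continue  # skip entries that carry a cell_type, as the script does
        for image in item.get('image_list') or []:
            # Protocol-relative URLs ("//host/path") need a scheme in front.
            yield item.get('title'), 'https:' + image['url']


for title, url in iter_image_urls(SAMPLE_BODY):
    print(title, url)
```

A Doc response, by contrast, is an HTML document and would need an HTML parser such as BeautifulSoup.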

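When there is time, the code will be analyzed in detail. For now, here is only the code with which Teacher Cui (崔老师) crawled the data successfully. It uses a process pool.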
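The pool in question is a process pool from `multiprocessing`: a list of offsets is handed to `map`, and each worker process handles one offset at a time. A minimal sketch of that pattern in isolation; `fake_crawl` and `page_offsets` are stand-ins invented here, and nothing touches the network.

```python
from multiprocessing.pool import Pool


def fake_crawl(offset):
    """Stand-in for the real page crawler: just reports which offset it handled."""
    return 'crawled offset %d' % offset


def page_offsets(start, end, step=20):
    """Offsets for page groups start..end inclusive, e.g. 0, 20, 40, ..."""
    return [x * step for x in range(start, end + 1)]


if __name__ == '__main__':
    # The context manager terminates the pool when the block exits;
    # map() has already collected every result by then.
    with Pool(processes=2) as pool:
        results = pool.map(fake_crawl, page_offsets(0, 3))
    print(results)
```

The `if __name__ == '__main__':` guard keeps worker processes from re-running the pool setup when the module is imported on platforms that spawn rather than fork.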

import os
from hashlib import md5  # used to name each saved file by the hash of its content
from multiprocessing.pool import Pool  # process pool
from urllib.parse import urlencode

import requests
from requests import codes  # `from requests import codes` binds the name `codes` directly; `import requests` binds the module

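def get_page_index(offset, keyword):
    # Query parameters of the search XHR request; offset advances in steps of 20.
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'cur_tab': 3
    }
    # This is the address of the page; urlencode turns the dict into a URL-encoded query string.
    url = 'http://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        response = requests.get(url)
        if response.status_code == codes.ok:
            return response.json()
    except requests.ConnectionError:
        print('Failed to request the index page')
    # Reached on a connection error or a non-OK status.
    return None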

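def get_images(page_json):
    # page_json is the parsed response from get_page_index; it is None when the request failed.
    if page_json and page_json.get('data'):
        data = page_json.get('data')
        for item in data:
            # Skip entries that carry a cell_type field.
            if item.get('cell_type') is not None:
                continue
            title = item.get('title')
            # Some entries have no image_list; treat that as an empty list.
            images = item.get('image_list') or []
            for image in images:
                yield {
                    'image': 'https:' + image.get('url'),
                    'title': title
                }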

def save_image(item):
    # One directory per article title, under img/.
    img_path = os.path.join('img', item.get('title'))
    # exist_ok avoids an error when two worker processes create the same directory.
    os.makedirs(img_path, exist_ok=True)
    try:
        resp = requests.get(item.get('image'))
        if codes.ok == resp.status_code:
            # The MD5 of the content is the file name, so identical images are stored once.
            file_path = os.path.join(img_path, '{file_name}.{file_suffix}'.format(
                file_name=md5(resp.content).hexdigest(),
                file_suffix='jpg'))
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(resp.content)
                print('Downloaded image path is %s' % file_path)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to Save Image, item %s' % item)

def main(offset):
    # '街拍' ("street snaps") is the search keyword.
    page_json = get_page_index(offset, '街拍')
    for item in get_images(page_json):
        print(item)
        save_image(item)

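GROUP_START = 0
GROUP_END = 7

if __name__ == '__main__':
    pool = Pool()
    # Offsets 0, 20, ..., 140: one page of results per task.
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    pool.map(main, groups)  # run main once per offset across the worker processes
    pool.close()
    pool.join()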
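Naming each file by the MD5 digest of its bytes is what lets the script notice an image it has already saved. A minimal sketch of that idea with no network access; `save_once` and the temporary directory are illustrative and not part of the original script.

```python
import os
import tempfile
from hashlib import md5


def save_once(content, directory, suffix='jpg'):
    """Write content to <directory>/<md5 of content>.<suffix> unless it exists.

    Returns (file_path, written), where written is False for a duplicate.
    """
    os.makedirs(directory, exist_ok=True)
    file_name = '{}.{}'.format(md5(content).hexdigest(), suffix)
    file_path = os.path.join(directory, file_name)
    if os.path.exists(file_path):
        return file_path, False
    with open(file_path, 'wb') as f:
        f.write(content)
    return file_path, True


with tempfile.TemporaryDirectory() as tmp:
    first = save_once(b'same bytes', tmp)
    second = save_once(b'same bytes', tmp)   # identical content, same file name
    third = save_once(b'other bytes', tmp)
    print(first[1], second[1], third[1])
```

Because the name depends on the content rather than the URL, the same image reached through two different URLs is still stored only once.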
