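Scraping JS-rendered page content (continuing from the previous post with more scraping and processing; not as handy as BeautifulSoup, of course)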

Published by 优采云: 2022-01-03 22:00

Picking up where the previous post left off, let's keep exploring.

Scraping JS-rendered dynamic page content

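The previous post scraped and processed only the first page. That much could also be done with the urllib2 library and regular expressions (though not as conveniently as with BeautifulSoup), so it does not yet show what selenium + PhantomJS is really for. This post simulates turning pages on a page whose pagination is rendered by JS.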

Locating page elements

Looking at the pagination toolbar on this page, there are three possible ways to turn pages:

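1. Click a page-number button directly to jump to that page.
2. Click the "next page" or "previous page" button to move one page at a time.
3. Type a page number into the box on the right and click "OK" to jump to that page.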

I chose the third option, mainly because it is the easiest to drive from a loop.

The corresponding HTML source was located as follows:

Locating the elements in code

Total page count

driver.find_element_by_xpath('//div[@class="paginations"]/span[@class="skip-wrap"]/em').text

Page-number input box

driver.find_element_by_xpath('//input[@aria-label="页码输入框"]')

The "OK" button

driver.find_element_by_xpath('//button[@aria-label="确定跳转"]')
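To make it easier to see what the three XPath expressions select, here is a minimal sketch that evaluates them against a small stand-in fragment using only the Python 3 standard library. The markup in `FRAGMENT` is hypothetical: it is reconstructed from the XPath expressions alone, and the page count of 15 is made up. `xml.etree.ElementTree` supports only a subset of XPath and needs a `.//` prefix where Selenium accepts `//`, so this illustrates the selectors rather than replacing the Selenium calls.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the pagination toolbar.
# Only the class names and aria-label values come from the XPath
# expressions in the post; everything else is assumed.
FRAGMENT = """
<body>
  <div class="paginations">
    <span class="skip-wrap">
      <em>15</em>
      <input aria-label="页码输入框" value="1" />
      <button aria-label="确定跳转">确定</button>
    </span>
  </div>
</body>
""".strip()

root = ET.fromstring(FRAGMENT)

# Total page count: the text of the <em> inside span.skip-wrap
total_pages = int(
    root.find('.//div[@class="paginations"]/span[@class="skip-wrap"]/em').text
)

# Page-number input box and "OK" button, both matched by aria-label
page_input = root.find('.//input[@aria-label="页码输入框"]')
ok_button = root.find('.//button[@aria-label="确定跳转"]')

print(total_pages, page_input.tag, ok_button.tag)
```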

Simulating page turns
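The third option (type a page number, then click "OK") can be wrapped in one small function. The sketch below is written for Python 3 and makes some assumptions: `go_to_page`, `RecordingDriver` and `RecordingElement` are hypothetical names, and the two recording classes are stand-ins so the example runs without a browser. The function calls `find_element_by_xpath`, as the rest of this post does. Newer Selenium releases replace that method with `find_element(By.XPATH, ...)`.

```python
PAGE_INPUT_XPATH = '//input[@aria-label="页码输入框"]'
OK_BUTTON_XPATH = '//button[@aria-label="确定跳转"]'


def go_to_page(driver, page_number):
    """Type page_number into the page box and click the OK button."""
    page_input = driver.find_element_by_xpath(PAGE_INPUT_XPATH)
    page_input.clear()
    page_input.send_keys(str(page_number))
    driver.find_element_by_xpath(OK_BUTTON_XPATH).click()


class RecordingElement:
    """Stand-in for a web element; it only records the calls made on it."""

    def __init__(self, log):
        self.log = log

    def clear(self):
        self.log.append(('clear',))

    def send_keys(self, text):
        self.log.append(('send_keys', text))

    def click(self):
        self.log.append(('click',))


class RecordingDriver:
    """Stand-in for a webdriver; it only records element lookups."""

    def __init__(self):
        self.log = []

    def find_element_by_xpath(self, xpath):
        self.log.append(('find', xpath))
        return RecordingElement(self.log)


demo_driver = RecordingDriver()
go_to_page(demo_driver, 3)
for entry in demo_driver.log:
    print(entry)
```

With a real driver the same function would be called once per loop iteration, after the current page has been parsed.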

Problems encountered

# Sleep 2 seconds so the page finishes loading before reading its HTML
# http://www.tuicool.com/articles/22eY7vQ
time.sleep(2)
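A fixed two-second sleep is the simplest fix, but it waits too long on a fast page and not long enough on a slow one. One alternative, which is not what this post's script does, is to poll for a condition until it holds or a timeout expires. Selenium ships `WebDriverWait` for this purpose. The sketch below shows the same idea in plain Python 3 so it runs without a browser; `wait_until` and its parameters are hypothetical names.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f s' % timeout)
        time.sleep(interval)


# Demo with a stand-in condition that becomes true on the third check.
attempts = {'count': 0}


def page_ready():
    attempts['count'] += 1
    return attempts['count'] >= 3


wait_until(page_ready, timeout=2.0, interval=0.01)
print('ready after', attempts['count'], 'checks')
```

Against a real page, the predicate could check, for example, that the list of items in `driver.page_source` has changed since the previous page.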

# http://www.jianshu.com/p/9d408e21dc3a
# Previously this used driver.close(), but that does not guarantee phantomjs.exe is shut down,
# which leaves it holding on to memory indefinitely
driver.quit()
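Calling `driver.quit()` at the end of the script only helps if the script reaches that line. If an exception is raised while scraping, the browser process is left running. A minimal Python 3 sketch of one way to guard against this is a small context manager that always calls `quit()`. The names `quitting` and `StubDriver` are hypothetical, and `StubDriver` stands in for a real webdriver so the example runs on its own.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield the driver and call driver.quit() on exit, even after an error."""
    try:
        yield driver
    finally:
        driver.quit()


class StubDriver:
    """Stand-in for a webdriver; it only remembers whether quit() was called."""

    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True


stub = StubDriver()
try:
    with quitting(stub) as drv:
        raise RuntimeError('simulated scraping failure')
except RuntimeError:
    pass

print('quit called:', stub.quit_called)
```

`contextlib.closing` is not a drop-in substitute here, because it calls `close()` rather than `quit()`.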

Third version: the result

This version can download the photos from every page.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date : 2017-06-18 22:32:26
# @Author : kk (zwk.patrick@foxmail.com)
# @Link : blog.csdn.net/PatrickZheng

# Note: this script targets Python 2 (print statements, urllib2).
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from bs4 import BeautifulSoup
import urllib2
import os.path
import time

# Set the headers
# https://www.zhihu.com/question/35547395
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'
# Assumption: `headers` is used further down but never defined in the listing as published;
# reusing the same User-Agent is the most likely intent.
headers = {'User-Agent': user_agent}

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = user_agent
# Raw string so the backslashes in the Windows path are taken literally
driver = webdriver.PhantomJS(executable_path=r'D:\workplace\spider\phantomjs-2.1.1-windows\phantomjs.exe', desired_capabilities=dcap)
driver.get('https://www.taobao.com/markets/mm/mmku')

# Get the total number of pages
pages = int(driver.find_element_by_xpath('//div[@class="paginations"]/span[@class="skip-wrap"]/em').text)
print 'Total pages: %d' % pages

# range(1, 3) visits only the first two pages; widen it toward `pages` to visit more
for i in range(1, 3):
    soup = BeautifulSoup(driver.page_source, 'lxml')
    print 'Page %d:\n' % i

    # Each MM entry sits in a div with class=cons_li
    cons_li_list = soup.select('.cons_li')
    lenOfList = len(cons_li_list)
    print lenOfList

    for cons_li in cons_li_list:
        name = cons_li.select('.item_name')[0].get_text().strip('\n')
        print name

        # Lazy-loaded images keep their URL in data-ks-lazyload instead of src
        img_src = cons_li.select('.item_img img')[0].get('src')
        if img_src is None:
            img_src = cons_li.select('.item_img img')[0].get('data-ks-lazyload')
        print img_src

        filename = name + os.path.splitext(img_src)[1]
        with open(filename, 'wb') as f:
            # urllib2 can attach headers to the request
            # http://www.jianshu.com/p/6094ff96536d
            request = urllib2.Request(img_src if img_src.startswith('http') else 'http:'+img_src, None, headers)
            response = urllib2.urlopen(request)
            f.write(response.read())

    # Find the page-number input box
    pageInput = driver.find_element_by_xpath('//input[@aria-label="页码输入框"]')
    pageInput.clear()
    pageInput.send_keys(str(i+1))

    # Find the "OK" button and click it
    ok_button = driver.find_element_by_xpath('//button[@aria-label="确定跳转"]')
    ok_button.click()

    # Sleep 2 seconds so the page finishes loading before reading its HTML
    # http://www.tuicool.com/articles/22eY7vQ
    time.sleep(2)

# http://www.jianshu.com/p/9d408e21dc3a
# Previously this used driver.close(), but that does not guarantee phantomjs.exe is shut down,
# which leaves it holding on to memory indefinitely
driver.quit()
print 'done.'
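The image-handling logic in the script combines three small decisions: fall back to the lazy-load attribute when `src` is missing, prepend `http:` to protocol-relative URLs, and reuse the URL's extension for the file name. As a sketch, here they are pulled out into separate Python 3 functions that can be checked without a browser or network access. The function names and the `example.com` URLs are hypothetical; the logic mirrors the expressions in the script.

```python
import os.path


def pick_img_src(attrs):
    """Return the 'src' attribute, or 'data-ks-lazyload' when src is absent."""
    src = attrs.get('src')
    if src is None:
        src = attrs.get('data-ks-lazyload')
    return src


def normalize_img_url(img_src):
    """Prefix protocol-relative URLs ('//host/path') with 'http:'."""
    return img_src if img_src.startswith('http') else 'http:' + img_src


def build_filename(name, img_src):
    """Combine the display name with the extension taken from the image URL."""
    return name + os.path.splitext(img_src)[1]


# Stand-in for the attributes of a lazy-loaded <img> tag
lazy_attrs = {'data-ks-lazyload': '//img.example.com/photos/a.jpg'}
src = pick_img_src(lazy_attrs)
print(normalize_img_url(src))
print(build_filename('kk', src))
```

Note that `os.path.splitext` looks only at the text after the last dot, so a URL with a query string would carry that query string into the file name. The script above does not handle that case either.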

The source code above is on Patrick-kk's GitHub; feedback and discussion are welcome.
