Scraping dynamic web pages with Python (an example of fetching site data dynamically with Selenium + PhantomJS)

优采云 | Published: 2022-01-31 16:26


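  While I was just starting to learn web scraping, I collected material online and wrote an example that uses Selenium + PhantomJS to fetch site data dynamically. Selenium and PhantomJS have to be installed first; for details, see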

  Selenium download:

  PhantomJS usage reference and official site:
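  Before running the example, it can help to confirm that both prerequisites are visible to Python. The following is a minimal sketch, not part of the original post: `check_prerequisites` is a hypothetical helper, and it assumes the PhantomJS executable is named `phantomjs` and is on the `PATH`. It uses only the standard library (`importlib.util.find_spec` and `shutil.which`), so it runs even when nothing is installed yet.

```python
import importlib.util
import shutil


def check_prerequisites(binary_name="phantomjs"):
    """Report whether the selenium package and a PhantomJS binary can be found.

    binary_name is an assumption: adjust it if the executable is named
    differently on your system.
    """
    return {
        # find_spec returns None when a top-level package is not installed
        "selenium": importlib.util.find_spec("selenium") is not None,
        # which returns None when no such executable is on the PATH
        "phantomjs": shutil.which(binary_name) is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(name, "found" if found else "missing")
```

  If either entry is reported as missing, install it from the sources listed above before continuing.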

  The source code is as follows:

# coding=utf-8
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.action_chains import ActionChains
import time
import re
import os


class Crawler:
    def __init__(self, firstUrl="https://list.jd.com/list.html?cat=9987,653,655",
                 nextUrl="https://list.jd.com/list.html?cat=9987,653,655&page=%d&sort=sort_rank_asc&trans=1&JL=6_0_0#J_main"):
        self.firstUrl = firstUrl
        self.nextUrl = nextUrl

    def getDetails(self, pageIndex, id="plist"):
        '''
        Get the detailed information.
        :param pageIndex: page index
        :param id: id of the element that holds the data
        :return:
        '''
        element = self.driver.find_element_by_id(id)
        txt = element.text
        # Each product entry starts with a price, so split on the currency sign
        items = txt.split('¥')
        for item in items:
            if len(item) > 0:
                details = item.split('\n')
                print('¥' + item)
                # print('Unit price: ¥' + details[0])
                # print('Brand: ' + details[1])
                # print('Reviews: ' + details[2])
                # print('Shop: ' + details[3])
                print(' ')
        print('Page ' + str(pageIndex))

    def CatchData(self, id="plist", totalpageCountLable="//span[@class='p-skip']/em/b"):
        '''
        Scrape the data.
        :param id: id of the element to read data from
        :param totalpageCountLable: XPath of the element that shows the total page count
        :return:
        '''
        start = time.perf_counter()
        # webdriver.PhantomJS is only available in Selenium releases before 4.0
        self.driver = webdriver.PhantomJS()
        wait = ui.WebDriverWait(self.driver, 10)
        self.driver.get(self.firstUrl)
        # Wait (up to 10 seconds) for the page-count element to load before continuing
        wait.until(lambda driver: driver.find_element_by_xpath(totalpageCountLable))
        # Get the total number of pages
        pcount = self.driver.find_element_by_xpath(totalpageCountLable)
        txt = pcount.text
        print('Total pages: ' + txt)
        print('Page 1')
        print(' ')
        pageNum = int(txt)
        pageNum = 3  # only run three times
        i = 2
        while (i
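
  The listing breaks off at the start of the paging loop, so the rest of the author's code is not available. As a hedged sketch of the two pieces of logic that are visible, the Python 3 snippet below splits listing text on the `¥` marker in the spirit of `getDetails`, and fills the `%d` placeholder of a `nextUrl`-style template. `split_items`, `page_urls`, the sample text, and the `example.com` URL are illustrative assumptions; the inclusive range from page 2 to the last page is also a guess based on `i = 2` and `pageNum`, not something the source confirms.

```python
def split_items(text, marker="¥"):
    """Split text on the marker and return the non-empty lines of each chunk.

    Chunks that contain no text are dropped. Text before the first marker,
    if any, is kept as its own record.
    """
    records = []
    for chunk in text.split(marker):
        lines = [line.strip() for line in chunk.split("\n") if line.strip()]
        if lines:
            records.append(lines)
    return records


def page_urls(template, last_page, first_page=2):
    """Fill the single %d placeholder for pages first_page..last_page (inclusive)."""
    return [template % page for page in range(first_page, last_page + 1)]


# Made-up sample text, laid out as the commented-out prints in the
# listing suggest: price, brand, review count, shop.
sample = (
    "¥1999.00\nPhone A\n1000+ reviews\nShop A\n"
    "¥2999.00\nPhone B\n500+ reviews\nShop B"
)

for record in split_items(sample):
    print(record)

# Hypothetical URL template; the real one is the nextUrl default in Crawler.
for url in page_urls("https://example.com/list?page=%d", 3):
    print(url)
```

  Keeping the parsing and URL building in plain functions like these makes them easy to check without starting a browser; the Selenium calls then only need to supply `element.text` and load each URL.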
