Dynamic web page scraping (Analyzing dynamic web pages with Beyond Compare and methods for handling them (Part 1) - 乐题库)

优采云 Published: 2022-02-10 18:21

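I. Analyzing dynamic web pages

1. Analysis tools

Use Beyond Compare to find the dynamic parts of a web page, for example by diffing the HTML returned by a plain request against the page as the browser renders it.

2. Checking directly in Python

Find the content you need and try to scrape it in the usual way. If it cannot be scraped, consider whether the page is dynamic.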
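This check can be made concrete by fetching the page without a browser and looking for a piece of text that is visible in the rendered page. The sketch below is only an illustration: the helper names `fetch_html` and `appears_in_static_html`, the User-Agent string, and the sample HTML strings are assumptions, not part of the original workflow.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Fetch the raw HTML exactly as the server returns it (no JavaScript runs)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def appears_in_static_html(html, marker):
    """Return True if text visible in the browser is already in the raw HTML."""
    return marker in html


# Offline demonstration with two hypothetical responses.
static_page = "<ul><li class='item'>Spicy noodles</li></ul>"
dynamic_page = "<div id='app'></div><script src='bundle.js'></script>"

for name, html in (("static", static_page), ("dynamic", dynamic_page)):
    if appears_in_static_html(html, "Spicy noodles"):
        print(name, "-> present in the raw HTML, plain scraping is probably enough")
    else:
        print(name, "-> missing from the raw HTML, probably loaded dynamically")
```

In real use, pass the result of `fetch_html(url)` as `html` and pick a marker string copied from the page as seen in the browser.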

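II. Common solutions

1. Finding the JS files

One solution I learned earlier is to find the JS files that the dynamic page loads. It is simple, but it has a drawback: locating the loaded JS files and working out the pattern behind these dynamic pages has to be done by hand.

Recommended tutorial: Scraping JS dynamic pages with Python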
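Once the request that delivers the data has been located by hand, its response can often be parsed directly, without rendering the page. Such responses are frequently JSON or JSONP (JSON wrapped in a callback call). The sketch below shows one way to unwrap them; the function name `parse_jsonp` and the sample payload are hypothetical and only illustrate the idea.

```python
import json
import re


def parse_jsonp(text):
    """Strip a JSONP wrapper such as `callback({...});` and parse the JSON inside."""
    match = re.match(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", text, re.DOTALL)
    payload = match.group(1) if match else text  # plain JSON passes through unchanged
    return json.loads(payload)


# Hypothetical response body, shaped like what a JS-loaded data endpoint might return.
sample = 'jsonp123({"items": [{"title": "demo", "price": "9.90"}], "total": 1});'
data = parse_jsonp(sample)
print(data["items"][0]["title"])
```

The pattern behind the page (for example, which query parameter selects the page number) still has to be worked out manually, as noted above.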

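2. A browser engine driven from Python

Installation:

Installing selenium is simple:

    pip install selenium

Installing phantomjs is slightly more involved:

First download and install nodejs, which is simple.

If you want to drive a visible browser, you also need the matching browser driver:

See the chromedriver tutorial

selenium + chrome/phantomjs tutorial

Here is the code. It is commented in detail, and anything the comments do not cover is explained afterwards: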

    import re

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException
    from pyquery import PyQuery as pq
    import pymongo

    client = pymongo.MongoClient('localhost')
    db = client['tbmeishi']

    # webdriver.PhantomJS was removed in Selenium 4, so this line needs selenium < 4.
    driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--load-images=false', '--disk-cache=true'])
    driver.set_window_size(1280, 2400)  # a window size must be set when running headless
    # driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 10)


    def search():
        try:
            driver.get('https://www.taobao.com/')  # load the Taobao home page
            # wait for the search box to appear
            search_box = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#q")))
            # wait for the search button to become clickable
            submit = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#J_TSearchForm > div.search-button > button")))
            search_box.send_keys('美食')  # type the keyword '美食' (food) into the search box
            submit.click()  # click the search button
            # wait for the results page to load (the pager's total-pages element)
            total = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-pager > div > div > div > div.total')))
            # once the first page has loaded, extract its items
            get_products()
            return total.text
        except TimeoutException:
            return search()  # retry on timeout (note: there is no retry limit)


    def next_page(page_number):
        try:
            # wait for the "next page" link to become clickable
            submit = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > ul > li.item.next > a")))
            submit.click()  # go to the next page
            # wait until the highlighted page number equals page_number
            wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#mainsrp-pager > div > div > div > ul > li.item.active > span'), str(page_number)))
            # once page i has loaded, extract its items
            get_products()
        except TimeoutException:
            return next_page(page_number)  # retry on timeout (note: there is no retry limit)


    def get_products():
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-itemlist .items .item')))
        html = driver.page_source
        doc = pq(html)
        items = doc('#mainsrp-itemlist .items .item').items()
        for item in items:
            product = {
                'image': item.find('.pic .img').attr('src'),
                'price': item.find('.price').text(),
                'deal': item.find('.deal-cnt').text()[:-3],  # drop the 3-character suffix after the count
                'title': item.find('.title').text(),
                'shop': item.find('.shop').text(),
                'location': item.find('.location').text()
            }
            print(product)
            # save_to_mongo(product)


    def save_to_mongo(result):
        try:
            # insert_one replaces the old Collection.insert, which PyMongo 4 removed
            if db['product'].insert_one(result):
                print('Saved to MongoDB', result)
        except Exception:
            print('Failed to save to MongoDB', result)


    def main():
        try:
            total = search()
            # the pager text contains the total page count; extract the number
            total = int(re.compile(r'(\d+)').search(total).group(1))
            for i in range(2, total + 1):
                print('Page %d' % i)
                next_page(i)
        except Exception as e:
            print('error!!!', e)
        finally:
            driver.quit()  # quit() ends the session and the phantomjs process


    if __name__ == '__main__':
        main()
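In the code above, search() and next_page() call themselves again whenever a wait times out, so a selector that never matches would make them recurse without end. A bounded retry is one possible alternative. The sketch below is an assumption about how that could look, not the author's approach; the helper name `retry` and its parameters are hypothetical.

```python
import time


def retry(action, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call action() until it succeeds or `attempts` tries have failed."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay)  # brief pause before trying again
    raise last_error  # every attempt failed: surface the last error
```

With the inner `except TimeoutException` branches removed from the two functions, a call such as `retry(search, attempts=3, exceptions=(TimeoutException,))` would give up after three timeouts instead of looping forever.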

When working with selenium, it is best to consult the **selenium for Python API**, which documents many of its usages.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException

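When running headless, be sure to set the window size, and make it generous. With a small window, content further down the page is only rendered if we simulate scrolling with JavaScript scroll commands, so it is simpler to render in a larger window.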

    driver.set_window_size(1280, 2400)
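Where a large window is not enough and the page keeps loading content as it is scrolled, the JavaScript scroll commands mentioned above can be issued through the driver. The following is a minimal sketch under that assumption; the helper name `scroll_to_bottom` and its `pause` and `max_rounds` parameters are hypothetical, while `execute_script` is the regular WebDriver method.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll down until the page height stops growing or max_rounds is reached."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we are at the bottom
        last_height = new_height
    return last_height
```

Calling `scroll_to_bottom(driver)` before reading `driver.page_source` gives the page a chance to load everything it would show a scrolling visitor.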

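Selenium implements XPath-like lookups, so the elements we want can be fetched directly through the driver's own lookup methods (the find_element family).

That approach is slow, however, and we generally avoid it. Instead we take the page source from the driver with html = driver.page_source and parse it with lxml + xpath or BeautifulSoup.

Besides these, there is another parsing option: pyquery.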


The code below does the parsing with pyquery, and it really is simple.

    def get_products():
        # wait until at least one result item is present on the page
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-itemlist .items .item')))
        html = driver.page_source
        doc = pq(html)
        items = doc('#mainsrp-itemlist .items .item').items()
        for item in items:
            product = {
                'image': item.find('.pic .img').attr('src'),
                'price': item.find('.price').text(),
                'deal': item.find('.deal-cnt').text()[:-3],  # drop the 3-character suffix after the count
                'title': item.find('.title').text(),
                'shop': item.find('.shop').text(),
                'location': item.find('.location').text()
            }
            print(product)

Selenium also provides many other methods.

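Note:

After the run, be sure to call driver.quit() to shut phantomjs down (driver.close() only closes the current window); otherwise phantomjs keeps occupying memory.

driver.service.process.send_signal(signal.SIGTERM) is recommended (it needs import signal), because it terminates the process outright. For Windows, look up the equivalent online.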

On Linux:

    ps aux | grep phantomjs       # list the phantomjs processes

    pgrep phantomjs | xargs kill  # kill all phantomjs processes
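To avoid leftover processes in the first place, the driver can be wrapped so that quit() runs even when the scraping code raises. The sketch below is one possible way to do that with a context manager; the name `managed_driver` and the `factory` parameter are hypothetical, not something the article uses.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always call quit() on it afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # ends the session and the browser/driver process


# Usage sketch (assumes selenium is installed and a browser driver is available):
#
#     from selenium import webdriver
#     with managed_driver(webdriver.Chrome) as driver:
#         driver.get('https://www.taobao.com/')
```

Because the cleanup sits in a `finally` clause, it also runs when an exception escapes the `with` block.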

PhantomJS configuration

    driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--load-images=false', '--disk-cache=true'])

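    --ignore-ssl-errors=[true|false]  # whether to ignore SSL errors such as failed CA certificate checks (a security trade-off)

    --load-images=[true|false]        # whether to load images; usually turned off to save time

    --disk-cache=[true|false]         # whether to use the disk cache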
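PhantomJS is no longer maintained and Selenium 4 removed `webdriver.PhantomJS`, so readers on a current Selenium may want a headless Chrome setup instead. The sketch below is an assumption about a roughly equivalent configuration, not the article's method: the helper names are hypothetical, the Chrome flags are approximate counterparts of the PhantomJS options above, `--headless=new` needs a recent Chrome (older versions use `--headless`), and no counterpart for `--disk-cache` is included.

```python
def chrome_arguments(width=1280, height=2400, load_images=False, ignore_ssl_errors=True):
    """Build Chrome command-line flags that roughly mirror the PhantomJS options."""
    args = ["--headless=new", f"--window-size={width},{height}"]
    if ignore_ssl_errors:
        args.append("--ignore-certificate-errors")
    if not load_images:
        args.append("--blink-settings=imagesEnabled=false")
    return args


def make_headless_chrome(**kwargs):
    """Create a headless Chrome driver (needs selenium and a local Chrome install)."""
    from selenium import webdriver  # imported lazily so chrome_arguments has no dependencies

    options = webdriver.ChromeOptions()
    for argument in chrome_arguments(**kwargs):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)


print(chrome_arguments())
```

`make_headless_chrome()` would then stand in for the `webdriver.PhantomJS(...)` call, and the window size is already set through the `--window-size` flag.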

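To sum up: conventional scraping is easier to work with, and selenium's methods in particular can feel laborious to beginners. selenium + phantomjs is also slower, because it behaves like a person visiting the page and has to wait for the page to load, whereas conventional scraping fetches the page source directly and is faster. That said, selenium + phantomjs is sometimes much simpler. Because it imitates a person browsing, anti-scraping measures are less likely to detect it, and some pages contain traps that are troublesome to handle with conventional methods. The slowness can be addressed with multithreading.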
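The multithreading idea can be sketched with the standard library's thread pool. This is only an illustration under stated assumptions: `scrape_all` and `fake_scrape` are hypothetical names, and `fake_scrape` stands in for real work. In a real scraper each worker would create and quit its own driver, since a single WebDriver instance should not be shared between threads.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(pages, scrape_page, max_workers=4):
    """Run scrape_page(page) for every page in parallel threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape_page, pages))


def fake_scrape(page_number):
    # Stand-in for real work such as loading a page and extracting its items.
    return {"page": page_number, "items": page_number * 2}


results = scrape_all(range(1, 6), fake_scrape)
print(results[0])
```

Threads help here because most of the time is spent waiting for pages to load rather than computing.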

In short, it depends on the specific case.
