Scraping dynamic web pages with Python (requires the XPath Checker and Firebug plugins, used for XPath location)
优采云 · Published: 2022-04-06 06:16
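On an ordinary static page, the information we want to scrape is written directly in the page source, so it can be extracted easily with a regular expression. For example: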
rr.firstInit({"data":[{"author":"袁莉,翟坤","change":"第一次","companyCode":"80116848","datetime":"2016-01-28T08:13:29","infoCode":"APPH2FEzZ2tFASearchReport","insCode":"80000031","insName":"东吴证券","insStar":"3","jlrs":["206000000","259000000","352000000","",""],"rate":"Accumulation","secuFullCode":"002322.SZ","secuName":"科技监测","sratingName":"加","sy":"","syls":["24.4","19.37","14.19","",""],"sys":["0.5","0.63","0.86","",""],"title":"业绩有望触底,收购整合加速","profitYear":"2014","type":"1","newPrice":"16.17"},
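As a minimal sketch of the regular-expression approach, the snippet below pulls the JSON argument out of an `rr.firstInit(...)` call and parses it. The `page_source` string is a hypothetical, shortened stand-in for a real page (only three of the fields above are kept), and the pattern assumes the call ends with `});`.

```python
import json
import re

# Hypothetical, shortened page source with the payload embedded in a script tag.
page_source = '''
<script>
rr.firstInit({"data":[{"insName":"东吴证券","secuFullCode":"002322.SZ","newPrice":"16.17"}]});
</script>
'''

# Capture the JSON object passed to rr.firstInit(...).
# Assumption: the call is terminated by "});" as in this sample.
match = re.search(r"rr\.firstInit\((\{.*?\})\);", page_source, re.S)
if match is None:
    raise ValueError("rr.firstInit(...) payload not found")

payload = json.loads(match.group(1))  # parse the captured text as JSON
reports = payload["data"]
first = reports[0]
print(first["insName"], first["secuFullCode"], first["newPrice"])
```

Parsing the captured text with `json.loads` rather than writing one regular expression per field keeps the extraction robust when fields are reordered.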
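For dynamic pages generated by JavaScript, however, the data is not in the raw source, so we have to simulate a browser loading the page and scrape the rendered result.
What we need is python3 + selenium + firefox, with the XPath Checker and Firebug plugins installed in Firefox (used to find XPath expressions).
First, load the page with: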
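import time
from selenium import webdriver

browser = webdriver.Firefox()  # Start a local Firefox session
browser.get("https://www.baidu.com/")  # Load the page
time.sleep(5)  # Wait after loading so the JavaScript can render the content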
This loads the page. Next, use the XPath plugin in the browser to find the XPath of the information we want to scrape.
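Finally, use
from selenium.webdriver.common.by import By  # find_element_by_xpath was removed in Selenium 4
elem = browser.find_element(By.XPATH, "xpath")  # replace "xpath" with the expression found above
to get the information we want to scrape.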
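A fixed `time.sleep` either wastes time or is too short when the page is slow. One possible alternative, sketched below, is to poll until the XPath matches something. The helper name `wait_for_elements` and its parameters are hypothetical; Selenium also ships its own explicit-wait helper (`WebDriverWait`), which this hand-rolled loop only imitates. The string `"xpath"` is the locator strategy that `By.XPATH` stands for, so `browser` can be any Selenium WebDriver.

```python
import time


def wait_for_elements(browser, xpath, timeout=10.0, poll=0.5):
    """Poll until at least one element matches `xpath`, or raise TimeoutError.

    `browser` is assumed to expose Selenium's find_elements(by, value) method.
    """
    deadline = time.monotonic() + timeout
    while True:
        # find_elements returns an empty list (no exception) when nothing matches.
        elems = browser.find_elements("xpath", xpath)
        if elems:
            return elems
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no element matched {xpath!r} within {timeout}s")
        time.sleep(poll)
```

With a real session this might be called as `wait_for_elements(browser, "xpath")` right after `browser.get(...)`, using the XPath found with the plugin.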
Last, locate the next-page and jump-to-page controls so that every page can be scraped.
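The paging step can be sketched as a loop that scrapes the current page, clicks the next-page control, and repeats. Everything named here (`scrape_all_pages`, `item_xpath`, `next_xpath`, `max_pages`, `pause`) is hypothetical, and the sketch assumes the next-page control is absent from the last page; sites that merely disable the button need a different stop condition.

```python
import time


def scrape_all_pages(browser, item_xpath, next_xpath, max_pages=100, pause=2.0):
    """Collect the text of every element matching `item_xpath` across pages.

    `browser` is assumed to be a Selenium WebDriver (or anything exposing
    find_elements(by, value) whose elements have .text and .click()).
    """
    results = []
    for _ in range(max_pages):  # hard cap so a broken page cannot loop forever
        for elem in browser.find_elements("xpath", item_xpath):
            results.append(elem.text)
        next_buttons = browser.find_elements("xpath", next_xpath)
        if not next_buttons:
            break  # assumption: no next-page control means this is the last page
        next_buttons[0].click()
        time.sleep(pause)  # crude wait for the next page to render
    return results
```

The two XPath arguments are the ones found with the browser plugin: one for the data items and one for the next-page control.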