Scraping dynamic web pages with Python (requires the XPath Checker and Firebug plugins, used for XPath locating)

优采云 Published: 2022-04-06 06:16


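  In an ordinary static web page, the information we want to scrape is written directly in the page source, so we can extract it easily with a regular expression. For example: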

  rr.firstInit({"data":[{"author":"袁莉,翟坤","change":"第一次","companyCode":"80116848","datetime":"2016-01-28T08:13:29","infoCode":"APPH2FEzZ2tFASearchReport","insCode":"80000031","insName":"东吴证券","insStar":"3","jlrs":["206000000","259000000","352000000","",""],"rate":"Accumulation","secuFullCode":"002322.SZ","secuName":"科技监测","sratingName":"加","sy":"","syls":["24.4","19.37","14.19","",""],"sys":["0.5","0.63","0.86","",""],"title":"业绩有望触底,收购整合加速","profitYear":"2014","type":"1","newPrice":"16.17"},
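As a minimal sketch of the regular-expression approach, the snippet below pulls the JSON argument out of an `rr.firstInit(...)` call and parses it. The `page_source` string is a shortened, hypothetical stand-in for the real page source (the sample above is truncated, so the closing `]});` is an assumption), and only a few of the fields are kept.

```python
import json
import re

# Hypothetical, shortened page source modelled on the sample above.
# The closing "]});" is assumed, because the original sample is cut off.
page_source = """
<script>
rr.firstInit({"data":[{"author":"袁莉,翟坤","insName":"东吴证券","secuFullCode":"002322.SZ","newPrice":"16.17"}]});
</script>
"""

# Capture the JSON object passed to rr.firstInit(...).
# The lazy match stops at the first "}" that is followed by ");".
match = re.search(r"rr\.firstInit\((\{.*?\})\);", page_source, re.S)
if match is None:
    raise ValueError("rr.firstInit(...) call not found in page source")

payload = json.loads(match.group(1))

for report in payload["data"]:
    print(report["insName"], report["secuFullCode"], report["newPrice"])
```

Parsing the captured text with `json.loads` instead of writing one regular expression per field keeps the pattern short and leaves the quoting rules to the JSON parser.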

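  For dynamic pages generated by JavaScript, however, the data is not in the initial source, so we have to simulate a browser loading the page and then scrape the rendered result.

  What we need is therefore Python 3 + Selenium + Firefox, with the XPath Checker and Firebug plugins installed in Firefox (used to find the XPath of page elements).

  First, we load the page with: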

  from selenium import webdriver
  import time

  browser = webdriver.Firefox()  # Get local session of Firefox
  browser.get("https://www.baidu.com/")  # Load page
  time.sleep(5)  # Give the page's JavaScript time to finish rendering

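  This loads the page. Then use the XPath plugin in the browser to work out the XPath of the information we want to scrape.

  Finally, use

  from selenium.webdriver.common.by import By
  elem = browser.find_element(By.XPATH, "xpath")  # "xpath" is a placeholder; the older find_element_by_xpath(...) spelling was removed in Selenium 4.3

  to retrieve the information we want to scrape.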
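A fixed `time.sleep` can be too short or needlessly long, so one option is to poll until the element actually appears. The helper below is only a sketch: `wait_for_text` is a hypothetical name, and it assumes `driver` is any object with a Selenium-style `find_elements(by, value)` method. The string `"xpath"` is the locator strategy that Selenium's `By.XPATH` constant stands for.

```python
import time


def wait_for_text(driver, xpath, timeout=10.0, poll=0.5):
    """Poll until an element matching `xpath` exists, then return its text.

    `driver` is assumed to expose a Selenium-style find_elements(by, value).
    """
    deadline = time.monotonic() + timeout
    while True:
        # find_elements returns an empty list (no exception) when nothing matches.
        elements = driver.find_elements("xpath", xpath)
        if elements:
            return elements[0].text
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no element matched {xpath!r} within {timeout}s")
        time.sleep(poll)
```

With a real browser session this would be called as `wait_for_text(browser, "xpath")` right after `browser.get(...)`. Selenium also ships its own `WebDriverWait` and `expected_conditions` helpers for the same purpose, which are probably the better choice in production code.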

  Last, locate the pagination and jump-to-page controls so that every page can be scraped in turn.
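The pagination step can be sketched as a loop that reads the items on the current page, clicks the next-page control, and stops when that control is no longer found. Everything here is illustrative: `scrape_all_pages`, `item_xpath` and `next_xpath` are hypothetical names, `driver` is assumed to be a Selenium-style object, and the sketch assumes the site removes the next-page control on the last page (some sites only disable it, which would need a different stop condition).

```python
import time


def scrape_all_pages(driver, item_xpath, next_xpath, max_pages=100, pause=1.0):
    """Collect the text of every item, following the next-page control.

    `driver` is assumed to expose a Selenium-style find_elements(by, value);
    "xpath" is the locator strategy behind Selenium's By.XPATH.
    """
    results = []
    for _ in range(max_pages):  # hard cap so a broken page cannot loop forever
        # Re-locate the items on every page; old element handles go stale.
        for element in driver.find_elements("xpath", item_xpath):
            results.append(element.text)

        next_buttons = driver.find_elements("xpath", next_xpath)
        if not next_buttons:
            break  # no next-page control: assume this was the last page
        next_buttons[0].click()
        time.sleep(pause)  # crude wait for the next page to render
    return results
```

A call such as `scrape_all_pages(browser, "xpath of one item", "xpath of the next-page button")` would return the text of all items across pages as a single list.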
