Scraping multi-page web data into Excel (the Chrome extension Web Scraper makes web scraping easy)

优采云 Published: 2022-01-07 01:18

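  The Chrome extension Web Scraper makes it easy to scrape data from web pages without having to handle the complications a hand-written crawler runs into, such as logins, CAPTCHAs, and asynchronously loaded content.

  First, here is the sitemap used to scrape data from 58: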

  {"_id":"hefeitongcheng","startUrl":[";ClickID=1"],"selectors":[{"id":"click","type":"SelectorElementClick","parentSelectors":["_root"],"selector":".list-main-style li","multiple":true,"delay":"5000","clickElementSelector":"strong span","clickType":"clickMore","discardInitialElements":"do-not-discard","clickElementUniquenessType":"uniqueText"},{"id":"link","type":"SelectorLink","parentSelectors":["click"],"selector":".title a","multiple":false,"delay":0},{"id":"name","type":"SelectorText","parentSelectors":["link"],"selector":"h1","multiple":false,"regex":"","delay":0},{"id":"jiage","type":"SelectorText","parentSelectors":["link"],"selector":".house_basic_title_money span","multiple":false,"regex":"","delay":0},{"id":"add","type":"SelectorText","parentSelectors":["link"],"selector":"p.p_2","multiple":false,"regex":"","delay":0}]}
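A sitemap like this is a tree: every selector names its parents in `parentSelectors`, and `_root` stands for the start page. As a minimal Python sketch of how to read that hierarchy, the code below prints it as an indented outline. The `SITEMAP` string is a shortened, hypothetical stand-in that keeps only the ids, types and parents, and the helper name `selector_tree` is made up for illustration.

```python
import json

# Hypothetical, shortened stand-in for the sitemap (ids, types and parents only).
SITEMAP = """
{"_id": "hefeitongcheng",
 "selectors": [
  {"id": "click", "type": "SelectorElementClick", "parentSelectors": ["_root"]},
  {"id": "link", "type": "SelectorLink", "parentSelectors": ["click"]},
  {"id": "name", "type": "SelectorText", "parentSelectors": ["link"]},
  {"id": "jiage", "type": "SelectorText", "parentSelectors": ["link"]},
  {"id": "add", "type": "SelectorText", "parentSelectors": ["link"]}
 ]}
"""

def selector_tree(sitemap, parent="_root", depth=0):
    """Return indented lines describing the selectors below `parent`.

    Assumes no selector lists itself (directly or indirectly) as its own
    parent; such a cycle would make this simple recursion loop forever.
    """
    lines = []
    for sel in sitemap["selectors"]:
        if parent in sel["parentSelectors"]:
            lines.append("  " * depth + f'{sel["id"]} ({sel["type"]})')
            lines.extend(selector_tree(sitemap, sel["id"], depth + 1))
    return lines

tree_lines = selector_tree(json.loads(SITEMAP))
print("\n".join(tree_lines))
```

Read top to bottom, the outline says: click through the list items, follow each item's link, then read the three text fields on the detail page.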

  

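  The Web Scraper workflow and key points:

  Once the Web Scraper extension is installed, a scrape takes three steps:

  1. Create a new sitemap (that is, a scraping project)

  2. Select the content to scrape on the page, which is done entirely by pointing and clicking

  3. Start the scrape and download the data as CSV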
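The CSV from step 3 can be opened in Excel directly, but a small tidy-up pass beforehand is often useful. The following Python sketch drops bookkeeping columns and exact duplicate rows. The column names `web-scraper-order` and `web-scraper-start-url`, the sample rows, and the helper names are assumptions for illustration; check the header row of your own export before relying on them.

```python
import csv
import io

# Assumed names of the bookkeeping columns in the export; adjust as needed.
META_COLUMNS = {"web-scraper-order", "web-scraper-start-url"}

def clean_rows(csv_text):
    """Drop bookkeeping columns and exact duplicate rows, keeping first-seen order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen, rows = set(), []
    for row in reader:
        data = {k: (v or "").strip() for k, v in row.items() if k not in META_COLUMNS}
        key = tuple(data.items())
        if key not in seen:
            seen.add(key)
            rows.append(data)
    return rows

def to_csv_text(rows):
    """Serialize the cleaned rows back to CSV text (expects at least one row)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Made-up sample data standing in for a downloaded export.
SAMPLE = (
    "web-scraper-order,web-scraper-start-url,name,jiage\n"
    "1-1,https://example.com/list,Flat A,1500\n"
    "1-2,https://example.com/list,Flat B,2300\n"
    "1-3,https://example.com/list,Flat A,1500\n"
)

cleaned = clean_rows(SAMPLE)
print(to_csv_text(cleaned))
```

When writing the result to disk for Excel, opening the file with `encoding="utf-8-sig"` and `newline=""` is one way to keep Chinese text readable.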

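  Step 2 is the crucial one, and it comes down to two points:

  First, select the data block with an Element selector. This captures each of the repeating blocks on the page, so tick Multiple on the Element selector, then pick the data fields you need inside each block (these become the columns of the resulting Excel sheet).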
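In sitemap terms, this is one Element selector with `multiple` set to true, with the field selectors attached to it as children. The Python sketch below builds such a sitemap as JSON. The key names follow the sitemaps shown in this post; the id `listing`, the start URL, the CSS selectors, and the helper names are hypothetical placeholders.

```python
import json

def text_field(field_id, css, parent):
    """One column of the result table: a text selector nested under `parent`."""
    return {"id": field_id, "type": "SelectorText", "parentSelectors": [parent],
            "selector": css, "multiple": False, "regex": "", "delay": 0}

def build_sitemap(sitemap_id, start_url, block_css, fields):
    """Element selector for the repeating block, plus one child selector per field."""
    block = {"id": "listing", "type": "SelectorElement", "parentSelectors": ["_root"],
             "selector": block_css, "multiple": True, "delay": 0}
    children = [text_field(fid, css, "listing") for fid, css in fields.items()]
    return {"_id": sitemap_id, "startUrl": [start_url], "selectors": [block] + children}

# Placeholder values: one repeating block per item, two fields per block.
sitemap = build_sitemap(
    "demo",
    "https://example.com/list",
    "div.item",
    {"name": "h2", "jiage": "span.price"},
)
sitemap_json = json.dumps(sitemap, ensure_ascii=False)
print(sitemap_json)
```

Each matched block then yields one row, and each child selector yields one column in that row.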

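  Second, the key to scraping large amounts of data is controlling pagination.

  Pagination falls into three cases:

  1. URL parameter pagination (the most regular case): ?page=2 or ?page=[1-27388]

  2. Scrolling down or clicking "Load more" to load further data: use the Element scroll down selector

  3. Clicking the page-number tabs (including the "Next page" tab): use a Link selector or the Element click selector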
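For case 1, the bracket notation in the start URL stands for a range of page numbers. The Python sketch below expands such a pattern into the individual URLs, purely to show what the notation means; it is not Web Scraper's own code. It assumes the forms `[1-100]`, zero-padded `[001-100]`, and stepped `[0-100:10]`, and `example.com` and the function name are placeholders.

```python
import re

# Matches [first-last] with an optional :step, e.g. [1-5], [001-100], [0-100:10].
RANGE = re.compile(r"\[(\d+)-(\d+)(?::(\d+))?\]")

def expand_start_url(pattern):
    """Expand the first [first-last(:step)] range in `pattern` into a list of URLs."""
    m = RANGE.search(pattern)
    if m is None:
        return [pattern]  # no range: the pattern is already a single URL
    first, last, step = m.group(1), m.group(2), int(m.group(3) or 1)
    # Keep zero padding when the lower bound is written with leading zeros.
    width = len(first) if first.startswith("0") and len(first) > 1 else 0
    urls = []
    for n in range(int(first), int(last) + 1, step):
        urls.append(pattern[:m.start()] + str(n).zfill(width) + pattern[m.end():])
    return urls

urls = expand_start_url("https://example.com/list?page=[1-5]")
for url in urls:
    print(url)
```

A range as large as `[1-27388]` expands to one page load per number, so it is worth trying a short range first to confirm the selectors work.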

  Further example A: scraping Huawei P30 price information from JD

  {"_id":"huaweip30","startUrl":[";enc=utf-8&wq=%E5%8D%8E%E4%B8%BAp30%20512&pvid=ed449bf16e44461fac90ff6fae2e66cd"],"selectors":[{"id":"element","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"div.gl-i-wrap","multiple":true,"delay":"1500","clickElementSelector":".p-num a:nth-of-type(3)","clickType":"clickOnce","discardInitialElements":"do-not-discard","clickElementUniquenessType":"uniqueText"},{"id":"name","type":"SelectorText","parentSelectors":["element"],"selector":"a em","multiple":false,"regex":"","delay":0},{"id":"jiage","type":"SelectorText","parentSelectors":["element"],"selector":"div.p-price","multiple":false,"regex":"","delay":0}]}
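A sitemap copied from a web page can easily lose a bracket or pick up stray spaces, so a quick check before importing can save time. The Python sketch below assumes only the structure visible in the examples in this post: a `selectors` list whose entries carry `id`, `type` and `parentSelectors`. The function name `check_sitemap` and the sample string are hypothetical.

```python
import json

def check_sitemap(text):
    """Return a list of problems found in a sitemap JSON string (empty if none)."""
    try:
        sitemap = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg} (character {err.pos})"]
    selectors = sitemap.get("selectors") if isinstance(sitemap, dict) else None
    if not isinstance(selectors, list):
        return ["no 'selectors' list found"]
    problems = []
    known_ids = {"_root"} | {s.get("id") for s in selectors}
    for s in selectors:
        # Stray spaces inside values such as "SelectorText " are easy to miss.
        for key in ("id", "type", "clickType"):
            value = s.get(key)
            if isinstance(value, str) and value != value.strip():
                problems.append(f"{s.get('id')}: stray whitespace in {key!r}")
        # Every parent must be _root or the id of another selector.
        for parent in s.get("parentSelectors", []):
            if parent not in known_ids:
                problems.append(f"{s.get('id')}: unknown parent {parent!r}")
    return problems

# Made-up example of a sitemap that lost its "selectors" key while being pasted.
broken = '{"_id":"demo","startUrl":["https://example.com"]["id":"element"}]}'
print(check_sitemap(broken))
```

An empty list only means these particular checks passed; it does not show that the CSS selectors still match the live page.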

  Further example B: scraping Baidu keyword search results

  {"_id":"wailaizhu","startUrl":[";pn=0&oq=wailaizhu%20h0101&tn=baiduhome_pg&ie=utf-8&rsv_idx=2&rsv_pq=f62d1151tv_f0 5b15EoMWRlm3%2BeroyWXBKI%2FDZ3H0BlGKJ6lNa6mmYBo4nNDUeJNeeN8BvgiE9S9Orivd"],"selectors":[{"id":"element","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"div…","multiple":true,"delay":"1500","clickElementSelector":"a span.pc","clickType":"clickOnce","discardInitialElements":"do-not-discard","clickElementUniquenessType":"uniqueText"},{"id":"name","type":"SelectorText","parentSelectors":["element"],"selector":"a","multiple":false,"regex":"","delay":0},{"id":"body","type":"SelectorText","parentSelectors":["element"],"selector":"_parent_","multiple":false,"regex":"","delay":0}]}
