Web data scraping tools (Web Scraper: Chrome Web Store, Baidu Netdisk idqg, scraping the Douban Movie Top 250)

优采云 Published: 2022-01-09 00:05


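  Likewise, this time I used Baidu's ECharts 3, as a prebuilt JS bundle downloaded to my machine. You can also use an online Excel-to-JSON converter and substitute your own data into the online examples, but the best thing about ECharts is that its interactive visualizations can be embedded directly in a web page.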
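  The Excel-to-JSON step can also be done locally. The following is a minimal sketch, assuming the scraped data has been saved as CSV; the function name `csv_to_echarts_option`, the column names, and the sample rows are hypothetical, and the output is only the basic bar-chart shape of an ECharts option object.

```python
import csv
import io
import json


def csv_to_echarts_option(csv_text, label_column, value_column):
    """Build a minimal ECharts bar-chart option dict from CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    labels = [row[label_column] for row in rows]
    values = [float(row[value_column]) for row in rows]
    return {
        "xAxis": {"type": "category", "data": labels},
        "yAxis": {"type": "value"},
        "series": [{"type": "bar", "data": values}],
    }


# Hypothetical sample rows, shaped like a CSV export of scraped data.
sample = "name,rating\nMovie A,9.7\nMovie B,9.6\n"
option = csv_to_echarts_option(sample, "name", "rating")

# The JSON string can be pasted in place of the data in an ECharts example.
print(json.dumps(option, ensure_ascii=False))
```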

  

  Web Scraper

  Chrome Web Store

  Baidu Netdisk idqg

  Scraping the Douban Movie Top 250

  Douban Top 250 sitemap


  {"startUrl":"https://movie.douban.com/top250?start=[0-225:25]&filter=","selectors":[{"parentSelectors":["_root"],"type":"SelectorElement","multiple":true,"id":"moive","selector":"div.item","delay":""},{"parentSelectors":["moive"],"type":"SelectorText","multiple":false,"id":"number","selector":"em","regex":"","delay":""},{"parentSelectors":["moive"],"type":"SelectorText","multiple":false,"id":"name","selector":"div.hd a","regex":"","delay":""},{"parentSelectors":["moive"],"type":"SelectorText","multiple":false,"id":"rating","selector":"span.rating_num","regex":"","delay":""},{"parentSelectors":["moive"],"type":"SelectorText","multiple":false,"id":"people","selector":"span:nth-of-type(4)","regex":"","delay":""},{"parentSelectors":["moive"],"type":"SelectorText","multiple":false,"id":"detail","selector":"p:nth-of-type(1)","regex":"","delay":""}],"_id":"douban250"}
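  The `[0-225:25]` part of the start URL is a numeric range with a step of 25, so the sitemap visits ten pages. The sketch below shows how such a pattern expands; `expand_range_url` is a hypothetical helper written for illustration, not part of Web Scraper, and it ignores zero-padded ranges.

```python
import re

# Matches [start-end] or [start-end:step] inside a URL.
RANGE_PATTERN = re.compile(r"\[(\d+)-(\d+)(?::(\d+))?\]")


def expand_range_url(url):
    """Expand the first [start-end:step] range in a URL into concrete URLs."""
    match = RANGE_PATTERN.search(url)
    if match is None:
        return [url]
    start = int(match.group(1))
    end = int(match.group(2))
    step = int(match.group(3)) if match.group(3) else 1
    prefix = url[:match.start()]
    suffix = url[match.end():]
    # The end value is inclusive, hence end + 1.
    return [f"{prefix}{value}{suffix}" for value in range(start, end + 1, step)]


pages = expand_range_url("https://movie.douban.com/top250?start=[0-225:25]&filter=")
for page in pages:
    print(page)
```

  With 25 films per page, the ten URLs (start=0 through start=225) cover all 250 entries.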

  

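  Irregular pagination: click to load more (IT橘子) and scroll to load more (maitao)

  URL: %E5%9C%B0%E5%9B%BE

  For the first-level selector in the sitemap, the main step is to select the "load more" area (Element click)

  

  Element scroll down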
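  As a rough sketch, the two first-level selectors could be generated as sitemap JSON like this. The field names for the click selector follow the 知乎 sitemap later in this post; the type name `SelectorElementScroll`, the CSS selectors, the start URL, and the sitemap id are assumptions for illustration only.

```python
import json


def element_click_selector(selector_id, item_css, button_css, delay_ms):
    """First-level selector that keeps clicking a 'load more' button."""
    return {
        "parentSelectors": ["_root"],
        "type": "SelectorElementClick",
        "multiple": True,
        "id": selector_id,
        "selector": item_css,
        "clickElementSelector": button_css,
        "clickElementUniquenessType": "uniqueText",
        "clickType": "clickMore",
        "discardInitialElements": False,
        "delay": str(delay_ms),
    }


def element_scroll_selector(selector_id, item_css, delay_ms):
    """First-level selector that scrolls down to trigger lazy loading."""
    return {
        "parentSelectors": ["_root"],
        "type": "SelectorElementScroll",  # assumed type name
        "multiple": True,
        "id": selector_id,
        "selector": item_css,
        "delay": str(delay_ms),
    }


# Hypothetical page structure: div.item rows and an a.load-more button.
sitemap = {
    "startUrl": "https://example.com/list",
    "selectors": [
        element_click_selector("items_click", "div.item", "a.load-more", 2000),
        element_scroll_selector("items_scroll", "div.item", 2000),
    ],
    "_id": "loadmore_demo",
}
print(json.dumps(sitemap))
```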

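  Following links to collect second- and third-level pages (prospectus downloads)

  This is a case of multi-level nested selectors; the example simply scrapes all of the download links

  Set the Detail selector's type to Link and select the element that is clicked to navigate

  

  On the page reached after the click, create a new selector that captures the download links, export all of the links, and then batch-download them with 迅雷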
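  Once the scrape is exported as CSV, the download links can be pulled into a plain list for a batch downloader. This is a minimal sketch, assuming the export stores each link's URL in a column named `<selector id>-href`; the column name `file-href` and the sample rows are hypothetical.

```python
import csv
import io


def collect_links(csv_text, href_column):
    """Return the unique, non-empty URLs from one column, in file order."""
    seen = set()
    links = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        href = (row.get(href_column) or "").strip()
        if href and href not in seen:
            seen.add(href)
            links.append(href)
    return links


# Hypothetical export: a Detail link selector with a nested "file" link selector.
sample_export = (
    "Detail,Detail-href,file-href\n"
    "Report A,https://example.com/a,https://example.com/files/a.pdf\n"
    "Report B,https://example.com/b,https://example.com/files/b.pdf\n"
    "Report B,https://example.com/b,https://example.com/files/b.pdf\n"
)
links = collect_links(sample_export, "file-href")

# One URL per line is a format most batch downloaders accept.
print("\n".join(links))
```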

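  Table scraping (优采云 remaining-ticket query)

  The key step is setting the selector type to Table

  Then select the table, its header row, and its data rows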

  

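  Anti-scraping

  Common advanced anti-scraping measures and countermeasures:

  Request headers: User-Agent checks

  Dynamic loading (AJAX, JavaScript, etc.)

  User behavior (cookies + interval between requests)

  Human verification (CAPTCHA)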
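  Outside of Web Scraper, the User-Agent check and the request-interval check are usually handled by sending an explicit header and pausing between requests. The sketch below uses only the Python standard library; `build_request` and `fetch_all` are hypothetical helper names, and the User-Agent string and delay are placeholders to adjust per site.

```python
import time
import urllib.request


def build_request(url, user_agent):
    """Create a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_all(urls, user_agent, delay_seconds,
              opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch URLs one by one, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # spacing requests keeps the load on the site low
        with opener(build_request(url, user_agent), timeout=30) as response:
            pages.append(response.read())
    return pages


# Example call (performs real network requests, so it is left commented out):
# pages = fetch_all(["https://example.com/"], "Mozilla/5.0 (placeholder)", 2.0)
```

  The `opener` and `sleep` parameters exist only so the function can be exercised without network access or real waiting.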

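  美团 (CSS sprites): small icons and background images are combined into a single image, and CSS background positioning displays only part of it. Workaround: scrape the mobile version of the page instead

  去哪儿 (element offsets, TTF font substitution): the data scraped at the first layer is fake

  The results are rendered as images (image recognition is required)

  Anti-scraping tactics on the major sites keep getting more sophisticated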

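  Completed exercise: scrape all answers from a 知乎 user. Create a new sitemap and note the pagination format in the start URL. Create a loadmore selector, used mainly to click "more" and load the full content, with its delay set to 100. Then create an Answers selector with the specific fields beneath it, with its delay set to 10000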


  {"startUrl":"https://www.zhihu.com/people/giscafer/answers?page=[1-5]","selectors":[{"parentSelectors":["_root"],"type":"SelectorElementClick","multiple":true,"id":"loadmore","selector":"div.List-item","clickElementSelector":"div.RichContent-inner button.Button","clickElementUniquenessType":"uniqueText","clickType":"clickMore","discardInitialElements":false,"delay":"100"},{"parentSelectors":["_root"],"type":"SelectorElement","multiple":true,"id":"Answers","selector":"div.List-item","delay":"10000"},{"parentSelectors":["Answers"],"type":"SelectorText","multiple":false,"id":"title","selector":"h2.ContentItem-title a","regex":"","delay":""},{"parentSelectors":["Answers"],"type":"SelectorText","multiple":false,"id":"Like","selector":"button.Button.VoteButton--up","regex":"","delay":""},{"parentSelectors":["Answers"],"type":"SelectorText","multiple":false,"id":"content","selector":"div.RichContent-inner","regex":"","delay":""},{"parentSelectors":["Answers"],"type":"SelectorLink","multiple":false,"id":"Link","selector":"h2.ContentItem-title a","delay":""},{"parentSelectors":["Link"],"type":"SelectorText","multiple":false,"id":"guanzhu","selector":"button.Button div.NumberBoard-value","regex":"","delay":""},{"parentSelectors":["Link"],"type":"SelectorText","multiple":false,"id":"liulan","selector":"div.NumberBoard-item div.NumberBoard-value","regex":"","delay":""}],"_id":"giscafe2"}
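  The nesting in a sitemap like this is easier to see as a tree built from each selector's `parentSelectors`. The following sketch prints that tree; `selector_tree` and `format_tree` are hypothetical helpers, the embedded JSON is a trimmed stand-in rather than the full sitemap, and the code assumes no selector is its own ancestor.

```python
import json
from collections import defaultdict


def selector_tree(sitemap):
    """Map each parent id to the ids of its child selectors."""
    children = defaultdict(list)
    for selector in sitemap["selectors"]:
        for parent in selector["parentSelectors"]:
            children[parent].append(selector["id"])
    return dict(children)


def format_tree(children, node="_root", depth=0):
    """Return indented lines for every selector below the given node."""
    lines = []
    for child in children.get(node, []):
        lines.append("  " * depth + child)
        lines.extend(format_tree(children, child, depth + 1))
    return lines


# Trimmed stand-in for a sitemap: Answers -> Link -> guanzhu.
sitemap_json = (
    '{"_id":"demo","startUrl":"https://example.com","selectors":['
    '{"id":"Answers","type":"SelectorElement","parentSelectors":["_root"]},'
    '{"id":"Link","type":"SelectorLink","parentSelectors":["Answers"]},'
    '{"id":"guanzhu","type":"SelectorText","parentSelectors":["Link"]}]}'
)
tree = selector_tree(json.loads(sitemap_json))
print("\n".join(format_tree(tree)))
```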
