Operating Web Pages with a selenium Crawler (Hands-On)

Published by 优采云: 2022-05-08 00:23


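Earlier we ran into a crawling problem and chose to [set up a selenium crawling environment in R](). That step only installed and configured selenium for use from R. Opening a dynamic page driven by JavaScript is just the start of crawling; next we need to interact with the page in various ways.

First, some learning links:

View the source code

If you open the page yourself in Google Chrome, you will find that its source code contains none of the "next page" or other buttons. They are all hidden inside JavaScript like the following: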

document.write(''); document.write(''); document.write(''); document.write(''); document.write(''); document.write('');

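The site's developers do not expose these JavaScript scripts, so we cannot inspect the functions inside them. We can, however, work with the page as rendered in the browser that selenium opens. The code is as follows: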

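library(RSelenium)  # drives a real browser so JavaScript-rendered content is available
library(rvest)      # read_html(), html_nodes(), html_attr()
library(stringr)

# Connect to the selenium server and open a browser.
# Server address, port and browser are assumptions; adjust to your setup.
remDr <- remoteDriver(remoteServerAddr = "127.0.0.1", port = 4444L, browserName = "chrome")
remDr$open()

# Placeholder: set this to the URL of the plasmid list page.
list_url <- "<URL of the plasmid list page>"
remDr$navigate(list_url)

# Links on the first page
htmls <- remDr$getPageSource()[[1]]
links <- htmls %>% read_html() %>% html_nodes("a") %>% html_attr("href")
links

# Click "next page" repeatedly and collect the links from each page.
# n_pages and the XPath of the button are assumptions; check them against the rendered page.
n_pages <- 100
i <- 1
while (i < n_pages) {
  i <- i + 1
  webElem <- remDr$findElement(using = "xpath", value = '//*[@class="next"]')
  webElem$clickElement()
  htmls <- remDr$getPageSource()[[1]]
  lks <- htmls %>% read_html() %>% html_nodes("a") %>% html_attr("href")
  print(lks)
  links <- c(links, lks)
}
links <- unique(links)
links
save(links, file = 'plasmid_detail_links.Rdata')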
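The loop above clicks the next-page button a fixed number of times. One possible way to avoid guessing the page count is to stop as soon as a page contributes no link that has not been seen before. The sketch below illustrates only that stopping rule. `collect_links`, `fetch_page` and the sample hrefs are hypothetical: `fetch_page(i)` stands in for "click next, then extract the hrefs", and a fake fetcher is used so the logic can be checked without a browser.

```r
# Hypothetical helper: gather links page by page, stopping when a page
# adds nothing new or when max_pages is reached.
collect_links <- function(fetch_page, max_pages = 1000L) {
  links <- character(0)
  for (i in seq_len(max_pages)) {
    page_links <- fetch_page(i)
    new_links <- setdiff(page_links, links)
    if (length(new_links) == 0L) break  # nothing new: assume the last page was reached
    links <- c(links, new_links)
  }
  links
}

# Stand-in for the browser: three pages of made-up hrefs; asking for a page
# past the end returns the last page again, as a disabled "next" button would.
fake_pages <- list(
  c("plasmid_detail.html?id=1", "plasmid_detail.html?id=2"),
  c("plasmid_detail.html?id=3", "index.html"),
  c("plasmid_detail.html?id=4")
)
fake_fetch <- function(i) fake_pages[[min(i, length(fake_pages))]]

all_links <- collect_links(fake_fetch)
print(all_links)
```

With a real browser, `fetch_page` would wrap the click and the `getPageSource()` call from the loop above. `max_pages` stays as a safety cap in case the site keeps producing new links.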

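Of course, the links themselves are only a starting point: each one needs a second visit to extract the information on the page behind it.

Second pass: visiting each page

The links variable from the previous step stores the URLs of all the plasmid description pages.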
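Hrefs collected with `html_attr("href")` are often relative, and anchors without an href come back as `NA`, while `remDr$navigate()` needs a full URL. Whether that applies to this site is an assumption, since the collected links are not shown. A minimal sketch of filtering and completing the links follows; `to_detail_urls`, the `example.org` base URL and the sample hrefs are all made up for illustration.

```r
# Hypothetical helper: keep only detail-page hrefs and make them absolute.
to_detail_urls <- function(hrefs, base_url) {
  hrefs <- unique(hrefs[!is.na(hrefs)])  # drop anchors that had no href
  hrefs <- hrefs[grepl("plasmid_detail.html", hrefs, fixed = TRUE)]
  is_absolute <- grepl("^https?://", hrefs)
  base_url <- sub("/+$", "", base_url)   # drop trailing slashes from the base
  hrefs[!is_absolute] <- paste0(base_url, "/", sub("^/+", "", hrefs[!is_absolute]))
  hrefs
}

hrefs <- c("plasmid_detail.html?id=1", NA, "index.html",
           "plasmid_detail.html?id=1",
           "https://example.org/plasmid_detail.html?id=2")
detail_urls <- to_detail_urls(hrefs, "https://example.org/plasmid/")
print(detail_urls)
```

Joining strings this way is only a rough approximation of URL resolution. For hrefs containing `../` or similar, `xml2::url_absolute()` is likely the safer choice.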

Next, loop over every plasmid page and collect its information. The code is as follows:

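load(file = 'plasmid_detail_links.Rdata')
links
kp <- grepl('plasmid_detail.html', links)  # keep only links to detail pages
links <- links[kp]
length(links)

# Server address, port and browser are assumptions; adjust to your setup.
remDr <- remoteDriver(remoteServerAddr = "127.0.0.1", port = 4444L, browserName = "chrome")
remDr$open()

# Placeholder: prefix to prepend if the stored hrefs are relative (assumption).
base_url <- "<base URL of the site>"

for (i in seq_along(links)) {
  print(i)
  remDr$navigate(paste0(base_url, links[i]))
  htmls <- remDr$getPageSource()[[1]]

  # <div class="panel-body">: the first three panels are assumed to hold the text
  bodys <- htmls %>% read_html() %>% html_nodes('.panel-body')
  c1 <- bodys[[1]] %>% html_text()
  c2 <- bodys[[2]] %>% html_text()
  c3 <- bodys[[3]] %>% html_text()
  c1 <- gsub('\t', '', c1); c1 <- gsub('\n', '', c1)
  c2 <- gsub('\t', '', c2); c2 <- gsub('\n', '', c2)
  c3 <- gsub('\t', '', c3); c3 <- gsub('\n', '', c3)

  # id="plasmidName"
  plasmidName <- htmls %>% read_html() %>% html_nodes('#plasmidName') %>% html_text()
  # id="plasmid_identification"
  plasmid_identification <- htmls %>% read_html() %>% html_nodes('#plasmid_identification') %>% html_text()

  info <- data.frame(plasmidName, plasmid_identification, c1, c2, c3)
  rm(htmls)
  # Append one row per page, without a header
  write.table(info, file = 'info1.txt', col.names = FALSE, row.names = FALSE, append = TRUE)
}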
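The loop strips tabs and newlines from each panel with two `gsub()` calls per field. As a small sketch, the same cleanup can be folded into one helper. `clean_text` is a hypothetical name, and the sample panel strings are invented for illustration, not taken from the site.

```r
# Hypothetical helper: remove tabs and line breaks, then trim surrounding spaces.
clean_text <- function(x) {
  trimws(gsub("[\t\n\r]+", "", x))
}

# Made-up examples of what html_text() might return for three panels
panels <- c("\n\t\tHost: E. coli\n",
            "\tResistance:\tAmp\n",
            "\n\nLength: 5000 bp\t")
cleaned <- clean_text(panels)
print(cleaned)
```

Like the original `gsub()` calls, this deletes tabs rather than replacing them with a space, so words separated only by a tab end up joined. Replacing with `" "` instead may be preferable, depending on the page layout.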

The results are saved to the file info1.txt.
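Because each row is appended without a header, reading the file back requires supplying the column names by hand. The sketch below writes two invented records the same way the loop does and then reads them back. A temporary file stands in for `info1.txt`, and the sample values are made up.

```r
out_file <- tempfile(fileext = ".txt")  # stands in for 'info1.txt'

# Two made-up records with the same columns as the scraping loop
rows <- list(
  data.frame(plasmidName = "pDemo1", plasmid_identification = "ID 001",
             c1 = "panel one", c2 = "panel two", c3 = "panel three"),
  data.frame(plasmidName = "pDemo2", plasmid_identification = "ID 002",
             c1 = "a", c2 = "b", c3 = "c")
)
for (info in rows) {
  write.table(info, file = out_file,
              col.names = FALSE, row.names = FALSE, append = TRUE)
}

# write.table() quotes text fields by default, so values containing spaces
# are read back as single columns.
info_all <- read.table(out_file, header = FALSE, stringsAsFactors = FALSE,
                       col.names = c("plasmidName", "plasmid_identification",
                                     "c1", "c2", "c3"))
print(dim(info_all))
```

Since the loop appends on every run, deleting or renaming an old `info1.txt` before rerunning avoids duplicated rows.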

For more complex cases, there are tutorials on using RSelenium + rvest to scrape dynamic or login-protected pages, as well as other material on crawler basics.
