Scraping web page data in real time (continuing the previous post on scraping with the RSelenium package)

优采云 · Published: 2022-02-14 21:16

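Continued

The previous post used the RSelenium package to scrape Lianjia (链家网), covering simulated clicks and page scraping, with the focus on automating clicks on the page. That code can scrape the complete data set, but only if no error or warning interrupts the run. Because LinkinfoFunc and HouseinfoFunc are both self-contained wrapper functions, once a run is interrupted, the data captured before the interruption cannot be written to a data frame or list.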
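To make the failure mode concrete, here is a minimal sketch in base R. The function `scrape_all()`, its `ids` argument and the simulated failure on item 3 are all hypothetical stand-ins, not the author's LinkinfoFunc or HouseinfoFunc; the only assumption carried over is that the wrapped function collects rows internally and returns them at the very end.

```r
# Hypothetical stand-in for a wrapped scraping function: it collects rows
# internally and hands them back only after the whole loop has finished.
scrape_all <- function(ids) {
  rows <- list()
  for (id in ids) {
    if (id == 3) stop("simulated network failure")  # assumed failure point
    rows[[length(rows) + 1]] <- data.frame(id = id, price = id * 100)
  }
  do.call(rbind, rows)
}

# The call fails on the third item, so `result` is never assigned. The two
# rows collected before the failure disappear with the function's local scope.
outcome <- tryCatch(
  {
    result <- scrape_all(1:5)
    "ok"
  },
  error = function(e) conditionMessage(e)
)

print(outcome)
print(exists("result", inherits = FALSE))
```

Nothing collected inside `scrape_all()` survives the error, which is the situation described above.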

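When the amount of data to scrape is large and the run takes a long time, problems such as dropped network connections are hard to avoid. This post therefore builds on the previous one by adding a for loop and introducing the tryCatch function for simple error handling. The two versions compare as follows: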

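Step 1 (page preparation): the code is unchanged.
Step 2: the result data frame is removed, and the job of storing the data is handed to Step 3.
Step 3: a for loop is added and the tryCatch function is introduced.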
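The for loop plus tryCatch arrangement can be sketched roughly as follows. This is an illustration under assumptions, not the article's Step 3 code: `scrape_one()`, `ids`, `results`, `failed` and `house_data` are hypothetical names, and the failure on item 3 is simulated. The point is only that storage lives outside the scraping function, so one failed item does not discard the others.

```r
# Hypothetical per-item scraper; item 3 fails to imitate a dropped connection.
scrape_one <- function(id) {
  if (id == 3) stop("simulated network failure")
  data.frame(id = id, price = id * 100)
}

ids <- 1:5
results <- vector("list", length(ids))  # storage lives outside the scraper
failed <- integer(0)                    # ids that raised an error

for (i in seq_along(ids)) {
  # tryCatch turns an error into NULL instead of aborting the whole loop.
  row <- tryCatch(
    scrape_one(ids[i]),
    error = function(e) {
      message("item ", ids[i], " failed: ", conditionMessage(e))
      NULL
    }
  )
  if (is.null(row)) {
    failed <- c(failed, ids[i])
  } else {
    # Assign only non-NULL values: `results[[i]] <- NULL` would delete the slot.
    results[[i]] <- row
  }
}

# Combine whatever was captured; slots left empty by failed items are dropped.
house_data <- do.call(rbind, Filter(Negate(is.null), results))
print(house_data)
print(failed)
```

Keeping the failed ids in a separate vector makes it easy to retry just those items later instead of restarting the whole run.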

Also, code repeated from the previous post is not annotated again.

library(rvest)
library(stringr)
library(RSelenium)

# The listing is truncated in the source: the text between each "<-" and the
# following "%>%" was lost. Every "..." below marks such a gap.
remDr <- ...
... %>%
  unlist() %>% gsub(":", "", .)
totalpage <- ...

... %>%
  html_text() %>% paste(., unit, sep = "")
downpayment <- ... %>% html_nodes(".taxtext span") %>% html_text() %>% .[1]
persquare <- ... %>% html_nodes("span.unitPriceValue") %>% html_text()
area <- ... %>% html_nodes(".area .mainInfo") %>% html_text()
title <- ... %>% html_nodes(".title h1") %>% html_text()
subtitle <- ... %>% html_nodes(".title div.sub") %>% html_text()
room <- ... %>% html_nodes(".room .mainInfo") %>% html_text()
floor <- ... %>% html_nodes(".room .subInfo") %>% html_text()
data ...
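Because the page object in the listing above did not survive, the selector pattern can still be tried offline. The sketch below assumes the rvest package is installed and runs the same kind of `html_nodes()` / `html_text()` pipeline against a small made-up HTML fragment. The variable `page` and the fragment's contents are hypothetical; only the CSS selectors are taken from the listing, and the real Lianjia markup may differ.

```r
library(rvest)

# Made-up listing markup; only the class names mirror the selectors above.
html <- paste0(
  '<div class="title"><h1>Sample listing</h1>',
  '<div class="sub">Near the subway</div></div>',
  '<div class="room"><div class="mainInfo">2 rooms</div>',
  '<div class="subInfo">Mid floor</div></div>',
  '<div class="taxtext"><span>Down payment 100</span><span>other</span></div>'
)

# read_html() accepts literal HTML as well as a URL or file path.
page <- read_html(html)

title       <- page %>% html_nodes(".title h1") %>% html_text()
subtitle    <- page %>% html_nodes(".title div.sub") %>% html_text()
room        <- page %>% html_nodes(".room .mainInfo") %>% html_text()
floor       <- page %>% html_nodes(".room .subInfo") %>% html_text()
# .[1] keeps only the first matching <span>, as in the listing above.
downpayment <- page %>% html_nodes(".taxtext span") %>% html_text() %>% .[1]

print(c(title, subtitle, room, floor, downpayment))
```

Testing selectors against a saved or hand-written fragment like this avoids hitting the live site while the extraction logic is still being worked out.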
