Scraping dynamic web pages (the code is also available from my open-source project HtmlExtractor)

Published on 优采云: 2021-09-28 07:27


The code is also available from my open-source project HtmlExtractor.

When scraping data, how should we handle a target site that generates its content dynamically with JS and pages it by scrolling?

Take the Toutiao (今日头条) news site as an example: http://toutiao.com/

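We can use Selenium for this. Although Selenium was designed for automated testing of web applications, it is well suited to data scraping. It can get around many anti-crawler restrictions because it drives a real browser, just as a real user would.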

With Selenium we can scrape not only pages whose data is generated dynamically by JS, but also pages that load more content as the user scrolls.

First, add the Selenium dependency with Maven:

<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>2.47.1</version>
</dependency>

Next, write the scraping code:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

import java.util.List;
import java.util.Random;

/**
 * How to scrape a page whose data is generated by JS and paged by scrolling.
 * Toutiao is used as the example: http://toutiao.com/
 * Created by ysc on 10/13/15.
 */
public class Toutiao {
    public static void main(String[] args) throws Exception {
        // Time to wait for data to load.
        // To avoid being blocked by the server, the wait should imitate
        // human behavior: random and not too short.
        long waitLoadBaseTime = 3000;
        int waitLoadRandomTime = 3000;
        Random random = new Random(System.currentTimeMillis());
        // Firefox browser
        WebDriver driver = new FirefoxDriver();
        // The page to scrape
        driver.get("http://toutiao.com/");
        // Wait for the page to finish loading dynamically
        Thread.sleep(waitLoadBaseTime + random.nextInt(waitLoadRandomTime));
        // How many pages of data to load
        int pages = 5;
        for (int i = 0; i < pages; i++) {
            // (the source listing is truncated here; the loop body is not available)
        }
    }
}
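The listing above waits a random, not-too-short time after each load and repeats this for a fixed number of pages. Since the loop body is not available, the following is only a minimal sketch of that "scroll, wait, repeat" pattern, written without any Selenium dependency so that it runs on its own. The names `ScrollPager`, `randomDelayMillis` and `loadPages` are hypothetical, and stopping early when no new items appear is an added assumption rather than something the original code is known to do. With Selenium, `scrollAction` could wrap a call such as `((JavascriptExecutor) driver).executeScript("window.scrollTo(0, document.body.scrollHeight)")`, and `itemCounter` could wrap `driver.findElements(...).size()` with a selector for the article items, which the source does not show.

```java
import java.util.Random;
import java.util.function.IntSupplier;

/** Browser-independent sketch of "scroll, wait, repeat" paging (hypothetical helper). */
final class ScrollPager {

    /**
     * Returns a delay in [baseMillis, baseMillis + jitterMillis) so that waits
     * are random and never shorter than the base. jitterMillis must be positive.
     */
    static long randomDelayMillis(long baseMillis, int jitterMillis, Random random) {
        return baseMillis + random.nextInt(jitterMillis);
    }

    /**
     * Triggers up to maxPages loads and returns how many of them added new items.
     * Stops early once a load adds nothing (an assumption, not from the source).
     */
    static int loadPages(int maxPages, Runnable scrollAction, IntSupplier itemCounter,
                         long baseMillis, int jitterMillis, Random random)
            throws InterruptedException {
        int loaded = 0;
        int previousCount = itemCounter.getAsInt();
        for (int i = 0; i < maxPages; i++) {
            scrollAction.run();
            // Give the page time to load, with a randomized delay
            Thread.sleep(randomDelayMillis(baseMillis, jitterMillis, random));
            int currentCount = itemCounter.getAsInt();
            if (currentCount == previousCount) {
                break; // nothing new appeared, so assume the feed is exhausted
            }
            previousCount = currentCount;
            loaded++;
        }
        return loaded;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated feed: each scroll adds 10 items until 30 are present.
        int[] items = {10};
        int pages = loadPages(5, () -> items[0] = Math.min(items[0] + 10, 30),
                () -> items[0], 1L, 5, new Random());
        System.out.println("pages loaded: " + pages + ", items: " + items[0]);
    }
}
```

Keeping the browser calls behind `Runnable` and `IntSupplier` makes the loop easy to try without launching Firefox. In a real scraper the delays would use values like the 3000 ms base and 3000 ms jitter from the listing above rather than the tiny values in this demo.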
