Scraping web pages with htmlunit (a brief look at htmlunit in Gecco)

Published on 优采云: 2021-12-13 18:04


　　2021SC@SDUSC

A brief explanation of htmlunit in Gecco

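HtmlUnit is an open-source Java tool for analyzing web pages. Once it has loaded a page, you can use HtmlUnit to work with the content of that page. The project simulates how a browser operates and is often described as an open-source Java implementation of a browser. Because it has no user interface, this headless browser runs quickly. HtmlUnit uses Rhino as its JavaScript engine.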

Download

<dependency>
    <groupId>com.geccocrawler</groupId>
    <artifactId>gecco-htmlunit</artifactId>
    <version>x.x.x</version>
</dependency>

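On JD.com product detail pages, the price is loaded through an asynchronous Ajax request, which was previously handled with the @Ajax annotation. Here, htmlunit is used instead so that the Ajax request is performed automatically.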

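import com.geccocrawler.gecco.GeccoEngine;
import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.RequestParameter;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.request.HttpGetRequest;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.spider.HtmlBean;

// The downloader name "htmlUnitDownloder" is spelled this way on purpose; keep it as is.
@Gecco(matchUrl="http://item.jd.com/{code}.html", pipelines="consolePipeline", downloader="htmlUnitDownloder")
public class JDDetail implements HtmlBean {

    private static final long serialVersionUID = -377053120283382723L;

    // Filled from the {code} placeholder in matchUrl
    @RequestParameter
    private String code;

    // Price, loaded via Ajax on the real page
    @Text
    @HtmlField(cssPath=".p-price")
    private String price;

    @Text
    @HtmlField(cssPath="#name > h1")
    private String title;

    @Text
    @HtmlField(cssPath="#p-ad")
    private String jdAd;

    // No @Text here, so the matched element's HTML is kept
    @HtmlField(cssPath="#product-detail-2")
    private String detail;

    public String getPrice() {
        return price;
    }

    public void setPrice(String price) {
        this.price = price;
    }

    public String getJdAd() {
        return jdAd;
    }

    public void setJdAd(String jdAd) {
        this.jdAd = jdAd;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getDetail() {
        return detail;
    }

    public void setDetail(String detail) {
        this.detail = detail;
    }

    public String getCode() {
        return code;
    }

    public void setCode(String code) {
        this.code = code;
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request = new HttpGetRequest("http://item.jd.com/1455427.html");
        request.setCharset("GBK");
        GeccoEngine.create()
            .classpath("com.geccocrawler.gecco.htmlunit")
            // URL of the page where crawling starts
            .start(request)
            // Number of crawler threads to start
            .thread(1)
            .timeout(1000)
            .run();
    }
}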
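To make the relationship between the `{code}` placeholder in `matchUrl` and the `@RequestParameter` field more concrete, here is a minimal sketch of how a URL template could be matched and its placeholder extracted. It uses only the JDK; `UrlTemplate` and `extract` are hypothetical names introduced for illustration, and this is not Gecco's actual implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Gecco.
class UrlTemplate {

    // Matches placeholders such as {code}
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    /** Returns the placeholder values if url matches template, or null otherwise. */
    static Map<String, String> extract(String template, String url) {
        List<String> names = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            // Literal text between placeholders is quoted so '.' and '/' match themselves
            regex.append(Pattern.quote(template.substring(last, m.start())));
            // Each placeholder captures one path segment (no '/')
            regex.append("([^/]+?)");
            names.add(m.group(1));
            last = m.end();
        }
        regex.append(Pattern.quote(template.substring(last)));

        Matcher urlMatcher = Pattern.compile(regex.toString()).matcher(url);
        if (!urlMatcher.matches()) {
            return null;
        }
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            params.put(names.get(i), urlMatcher.group(i + 1));
        }
        return params;
    }

    public static void main(String[] args) {
        // prints {code=1455427}
        System.out.println(extract("http://item.jd.com/{code}.html",
                                   "http://item.jd.com/1455427.html"));
    }
}
```

With the start URL from the example, the extracted value `1455427` corresponds to what the `code` field of `JDDetail` would be expected to hold.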

Pros and cons of using htmlunit

Using htmlunit does save a lot of work, but it also has a number of drawbacks:

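1. Low efficiency. With htmlunit, the downloader has to download all of the page's JavaScript and execute all of it. Downloading a single page can sometimes take 5 to 10 seconds.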

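2. Rhino's compatibility with JavaScript. Rhino still has many compatibility problems, which show up as error logs. If you do not want to see this log output while crawling, you can configure log4j: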

　　log4j.logger.com.gargoylesoftware.htmlunit=OFF
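If log4j is not on the classpath, HtmlUnit's logging (which goes through Apache Commons Logging) may end up on `java.util.logging` instead. Assuming that is the case in your setup, a rough programmatic equivalent of the property above could look like the following sketch. The class name `QuietHtmlUnitLogs` is hypothetical; only the logger name comes from the configuration line above.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper; assumes HtmlUnit logs via java.util.logging in this setup.
class QuietHtmlUnitLogs {

    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so the level set on an unreferenced logger can be lost to garbage collection.
    private static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware.htmlunit");

    /** Turns off all log output under the com.gargoylesoftware.htmlunit namespace. */
    static void silence() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silence();
        // prints OFF
        System.out.println(HTMLUNIT_LOGGER.getLevel());
    }
}
```

Calling `QuietHtmlUnitLogs.silence()` before starting the crawler would suppress the Rhino error output in that configuration; with log4j present, the property file setting remains the way to go.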
