Scraping web pages with HtmlUnit (a network tool that works like a browser without a user interface)

优采云, published: 2021-12-06 11:02

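1: Background

I wanted to scrape a page with jsoup, but the data I got back was always incomplete. It turned out that some of the data is rendered into the page only after the page's JavaScript has run, so it is not present in the HTML that the server returns. jsoup is an HTML parser and does not execute JavaScript, so it never sees that data.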
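To make the problem concrete, here is a minimal sketch that uses only the Java standard library. The HTML string and the class name `StaticHtmlDemo` are made up for illustration, and the regular expressions are only a rough stand-in for a real parser such as jsoup. The point is that the markup a server returns can contain an empty table plus a script, and a tool that does not run the script finds no rows.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StaticHtmlDemo {
    private StaticHtmlDemo() {}

    // Hypothetical response body: the table is empty, the script fills it later in a browser.
    static final String RAW_HTML =
          "<table class=\"aggregate_table\"><tbody id=\"rows\"></tbody></table>"
        + "<script>document.getElementById('rows').innerHTML = '<tr><td>1</td></tr>';</script>";

    private static final Pattern ROW_TAG = Pattern.compile("<tr\\b", Pattern.CASE_INSENSITIVE);

    /** Counts <tr> tags in the static markup, ignoring anything inside <script> elements. */
    static int countStaticRows(String html) {
        // Script bodies are text, not markup, so drop them before counting.
        String withoutScripts = html.replaceAll("(?is)<script.*?</script>", "");
        Matcher m = ROW_TAG.matcher(withoutScripts);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

Calling `StaticHtmlDemo.countStaticRows(StaticHtmlDemo.RAW_HTML)` finds no rows, even though a browser would show one row after running the script.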

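2: Solution

After some searching I found HtmlUnit, a tool that can execute JavaScript and is effectively a browser without a user interface (a headless browser). The approach is to send the request with HtmlUnit and let it run the page's JavaScript, then convert the resulting page into a jsoup Document, and finally use jsoup to parse the page and read the data.

3: Sending the request with HtmlUnit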

import java.io.IOException;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// HtmlUnit 2.x package names; HtmlUnit 3.x and later use org.htmlunit instead
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public static Document getDocument() throws IOException, InterruptedException {
    String url = "https://www.marklines.com/cn/vehicle_sales/search_country/search/?searchID=587200";
    String cookie = "PLAY_LANG=cn; _plh=b9289d0a863a8fc9c79fb938f15372f7731d13fb; PLATFORM_SESSION=39034d07000717c664134556ad39869771aabc04-_ldi=520275&_lsh=8cf91cdbcbbb255adff5cba6061f561b642f5157&csrfToken=209f20c8473bc0518413c226f898ff79cd69c3ff-1539926671235-b853a6a63c77dd8fcc364a58&_lpt=%2Fcn%2Fvehicle_sales%2Fsearch&_lsi=1646321; _ga=GA1.2.2146952143.1539926675; _gid=GA1.2.1032787565.1539926675; _plh_notime=8cf91cdbcbbb255adff5cba6061f561b642f5157";

    /* First attempt with plain jsoup; the JavaScript-rendered data was missing:
    Connection connect = Jsoup.connect(url).userAgent("")
            .header("Cookie", cookie)
            .timeout(360000000);
    Document document = connect.get();*/

    // try-with-resources closes the WebClient and its background threads
    try (WebClient wc = new WebClient(BrowserVersion.CHROME)) {
        // Accept insecure (for example self-signed) SSL certificates
        wc.getOptions().setUseInsecureSSL(true);
        // Enable the JavaScript interpreter (true by default)
        wc.getOptions().setJavaScriptEnabled(true);
        // Disable CSS processing
        wc.getOptions().setCssEnabled(false);
        // Do not throw an exception when a script fails
        wc.getOptions().setThrowExceptionOnScriptError(false);
        // Do not throw an exception on a failing HTTP status code
        wc.getOptions().setThrowExceptionOnFailingStatusCode(false);
        // Do not use native ActiveX (older HtmlUnit 2.x only; delete this line if it does not compile)
        wc.getOptions().setActiveXNative(false);
        // Resynchronize Ajax calls so that asynchronous requests are supported
        wc.setAjaxController(new NicelyResynchronizingAjaxController());
        // Connection and read timeout, in milliseconds
        wc.getOptions().setTimeout(1000000);
        // Do not send the Do-Not-Track header
        wc.getOptions().setDoNotTrackEnabled(false);

        WebRequest request = new WebRequest(new URL(url));
        request.setAdditionalHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0");
        request.setAdditionalHeader("Cookie", cookie);

        // Open the target URL as a browser would
        HtmlPage htmlPage = wc.getPage(request);
        // Wait for background JavaScript started by the page. This must come after getPage():
        // it returns as soon as no jobs remain, or after at most 600 seconds.
        wc.waitForBackgroundJavaScript(600 * 1000);
        // Get the rendered page as XML text
        String xml = htmlPage.asXml();
        // System.out.println(xml.contains("结果.xls")); // false
        // Convert it to a jsoup Document and return it
        return Jsoup.parse(xml);
    } catch (FailingHttpStatusCodeException | IOException e) {
        e.printStackTrace();
    }
    return null;
}
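How long to wait for the page's JavaScript in the code above is a guess, and a guess that is too short gives incomplete data again. One hedged alternative is to poll for a condition that signals "the data is there" and stop as soon as it holds. The sketch below uses only the standard library; `PollingWait` and `waitUntil` are names invented for this illustration and are not part of HtmlUnit.

```java
import java.util.function.BooleanSupplier;

final class PollingWait {
    private PollingWait() {}

    /**
     * Checks the condition repeatedly until it is true or the timeout elapses.
     * Returns true if the condition became true, false on timeout or if the
     * waiting thread was interrupted.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag so callers can still see it.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In `getDocument()` this could be used as `PollingWait.waitUntil(() -> htmlPage.asXml().contains("aggregate_table"), 30_000, 500)`, assuming that this class name only appears in the page once the table has been rendered; the timeout and poll interval here are arbitrary example values.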

4: Handing the returned Document to jsoup

Here I only print a simple result, to check whether all of the data had been rendered.

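import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = getDocument();
if (doc == null) {
    throw new IllegalStateException("The page could not be loaded");
}
// Locate the result table and its body; either may be missing if rendering failed
Element table = doc.selectFirst("table.table.table-bordered.aggregate_table");
Element tableContext = table == null ? null : table.selectFirst("tbody");
if (tableContext == null) {
    throw new IllegalStateException("Result table not found; the data may not have been rendered");
}
Elements contextTrs = tableContext.getElementsByTag("tr");
// Number of rows in the table body
System.out.println(contextTrs.size());

// Dump the whole document to a file for inspection; the writer is closed automatically
String context = doc.toString();
try (Writer pw = new OutputStreamWriter(new FileOutputStream("D:/test.txt"), "GBK")) {
    pw.write(context);
}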
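The dump above is written in GBK. Any character that the chosen charset cannot encode is silently replaced (typically by `?`), so the file may differ from the document. The sketch below, using only the standard library, makes the charset an explicit parameter and adds a round-trip check. `TextDump` and its methods are hypothetical helper names, not part of jsoup or HtmlUnit.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

final class TextDump {
    private TextDump() {}

    /** Writes text to the file in the given charset, replacing any existing content. */
    static void save(Path file, String text, Charset charset) {
        try {
            Files.write(file, text.getBytes(charset));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole file back as text using the given charset. */
    static String load(Path file, Charset charset) {
        try {
            return new String(Files.readAllBytes(file), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Saves text to a temporary file, reads it back, and deletes the file again. */
    static String roundTrip(String text, Charset charset) {
        try {
            Path tmp = Files.createTempFile("dump", ".txt");
            try {
                save(tmp, text, charset);
                return load(tmp, charset);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the dump in this section the call would be `TextDump.save(Paths.get("D:/test.txt"), doc.toString(), Charset.forName("GBK"))`. Comparing `roundTrip(text, charset)` with the original `text` shows whether the charset loses any characters.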
