Scraping dynamic web pages with a Java crawler (how do you parse and scrape the information on a page? - 八维教育)

优采云 Published: 2022-02-27 06:21

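To parse static web pages, we usually use Jsoup.

For dynamically loaded pages, though, Jsoup alone is not enough: it only parses the HTML it is given and does not run the JavaScript that fills in the content.

So how do we parse and scrape that information?

After reading discussions online, I decided to simulate a browser and then obtain the updated page content by driving that browser.

In the end I chose Selenium to simulate the browser.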

Selenium is actually a browser automation tool built for testing web applications, so it is somewhat overkill for a crawler.

For installing and using Selenium, see the official Selenium website.

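We usually use the Selenium RC toolkit to drive the browser. (Selenium RC is the older remote-control API; later Selenium releases replace it with WebDriver.)

Once the package is installed, here is a small example: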

    // The package our tests live in.
    package com.example.tests;

    // The driver's import: used to instantiate a browser and make it do
    // what you need.
    import com.thoughtworks.selenium.*;
    // Selenium IDE adds Pattern because it is sometimes used for regex
    // validations. You can remove this import if your script does not use it.
    import java.util.regex.Pattern;

    // Our Selenium test case.
    public class NewTest extends SeleneseTestCase {

        // Instantiate and start the browser.
        public void setUp() throws Exception {
            setUp("http://www.google.com/", "*firefox");
        }

        // The actual test steps.
        public void testNew() throws Exception {
            selenium.open("/");
            selenium.type("q", "selenium rc");
            selenium.click("btnG");
            selenium.waitForPageToLoad("30000");
            assertTrue(selenium.isTextPresent("Results * for selenium rc"));
        }
    }

That is the test class the example uses.

Building on it, we can write a main program as follows:

    package Test1;

    // The driver's import: used to instantiate a browser and make it do
    // what you need.
    import com.thoughtworks.selenium.*;
    // Only needed if the commented-out MongoDB lines in gettips are enabled.
    // import com.mongodb.BasicDBObject;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    @SuppressWarnings("deprecation")
    public class NewTest extends SeleneseTestCase {

        public String url;

        // Instantiate and start the browser. In Selenium RC, "*chrome"
        // launches Firefox in its chrome-privileged mode, not Google Chrome.
        public void setUp() throws Exception {
            setUp("https://foursquare.com/v/singapore-zoo/4b05880ef964a520b8ae22e3", "*chrome");
            // selenium.waitForPageToLoad("30000");
        }

        public void testNew() throws Exception {
            selenium.open("https://foursquare.com/v/singapore-zoo/4b05880ef964a520b8ae22e3");
            selenium.windowMaximize();
            // Assumed step (the original listing breaks off here): hand the
            // browser-rendered HTML to Jsoup and extract the tips.
            Document doc = Jsoup.parse(selenium.getHtmlSource());
            gettips(doc);
        }

        public static void print(String msg, Object... args) {
            System.out.println(String.format(msg, args));
        }

        // Print the text of every element with the CSS class "tipText".
        public static void gettips(Document doc) {
            Elements tips = doc.select(".tipText");
            int count = 0;
            // BasicDBObject document4 = new BasicDBObject();
            for (Element link : tips) {
                String text = link.text();
                count++;
                String tempint = String.valueOf(count);
                // document4.put(tempint, text);
                print("%s \r\n", text);
            }
        }
    }
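The commented-out lines in gettips suggest that each tip was meant to be stored under its running index. The following is only a sketch of that idea, using a plain LinkedHashMap as a stand-in for the MongoDB BasicDBObject; the class name TipIndexer, the method indexTips, and the sample tip strings are hypothetical and do not come from the original program.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TipIndexer {

    // Stores each tip under its 1-based position ("1", "2", ...),
    // mirroring the count / tempint bookkeeping in gettips.
    // LinkedHashMap keeps the tips in the order they were found.
    public static Map<String, String> indexTips(List<String> tips) {
        Map<String, String> indexed = new LinkedHashMap<>();
        int count = 0;
        for (String tip : tips) {
            count++;
            indexed.put(String.valueOf(count), tip);
        }
        return indexed;
    }

    public static void main(String[] args) {
        // Placeholder tips; in the crawler these would be the link.text() values.
        List<String> sampleTips = Arrays.asList("Go early to avoid the crowds", "Bring water");
        for (Map.Entry<String, String> entry : indexTips(sampleTips).entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

Inside gettips, the texts could be collected into a list first and passed to a helper like this, which keeps the extraction and the storage format separate.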

When you run the test class, the program opens the Firefox browser and then automatically carries out the steps you scripted.
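Because the tips are filled in by JavaScript, the page may not contain them yet at the moment the browser reports it has opened; the listing's own waitForPageToLoad call is commented out. One possible way to handle this is to poll for a condition before reading the page. The helper below is a generic sketch in plain Java, not part of Selenium; the class name PollingWait, the method waitUntil, and the timing values are assumptions made for illustration.

```java
import java.util.function.BooleanSupplier;

public class PollingWait {

    // Re-checks the condition every pollMillis until it holds or
    // timeoutMillis has elapsed. Returns true if the condition was met,
    // false on timeout or if the waiting thread was interrupted.
    public static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition that becomes true roughly 200 ms after start.
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 200, 2000, 50);
        System.out.println("content ready: " + ready);
    }
}
```

In the crawler, the condition might be something like `() -> selenium.isElementPresent("css=.tipText")`, checked before the page source is read and handed to Jsoup. Whether that locator matches depends on the target page's markup, so treat it as an assumption to verify.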
