Scraping dynamic web pages with a Java crawler (what WebMagic is and how to use it: crawling journal page-layout content)

优采云 · Published: 2022-02-01 06:15


1. What is WebMagic?

To use WebMagic, you first need to understand what it is.

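WebMagic is an open-source vertical crawler framework for Java. Its goal is to simplify crawler development so that developers can concentrate on their business logic. It is built from four components: the Downloader, the PageProcessor (parser), the Scheduler, and the Pipeline.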
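To make the division of labour among the four components concrete, here is a minimal sketch of a crawl loop in plain Java. It is not WebMagic code: the `Downloader`, `Pipeline`, and `Scheduler` types, the in-memory `fakeWeb` map, and the `/a` and `/b` URLs are all hypothetical stand-ins chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy crawl loop. All types here are hypothetical stand-ins, not WebMagic classes. */
final class MiniCrawler {

    /** Downloader role: turns a URL into page text. */
    interface Downloader { String download(String url); }

    /** Pipeline role: persists one extracted item. */
    interface Pipeline { void save(String item); }

    /** Scheduler role: a FIFO queue plus a seen-set so each URL is fetched once. */
    static final class Scheduler {
        private final Queue<String> queue = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();
        void push(String url) { if (seen.add(url)) queue.add(url); }
        String poll() { return queue.poll(); }
    }

    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");
    private static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");

    /** Processor role: hands the title to the pipeline and new links to the scheduler. */
    static void process(String body, Pipeline pipeline, Scheduler scheduler) {
        Matcher title = TITLE.matcher(body);
        if (title.find()) pipeline.save(title.group(1));
        Matcher link = LINK.matcher(body);
        while (link.find()) scheduler.push(link.group(1));
    }

    /** Runs the loop against an in-memory "web" and returns what the pipeline stored. */
    static List<String> crawl(String startUrl, Map<String, String> fakeWeb) {
        List<String> stored = new ArrayList<>();
        Downloader downloader = url -> fakeWeb.getOrDefault(url, "");
        Pipeline pipeline = stored::add;
        Scheduler scheduler = new Scheduler();
        scheduler.push(startUrl);
        String url;
        while ((url = scheduler.poll()) != null) {
            process(downloader.download(url), pipeline, scheduler);
        }
        return stored;
    }

    public static void main(String[] args) {
        Map<String, String> web = Map.of(
                "/a", "<title>A</title><a href=\"/b\">b</a><a href=\"/a\">self</a>",
                "/b", "<title>B</title><a href=\"/a\">back</a>");
        System.out.println(crawl("/a", web)); // prints [A, B]
    }
}
```

The seen-set in the scheduler is what keeps the two pages from linking to each other forever; a real framework adds threading, politeness delays, and error handling on top of the same loop.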

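WebMagic has a fully modular design that covers the whole crawler life cycle (link extraction, page download, content extraction, persistence). It supports multi-threaded and distributed crawling, along with features such as automatic retries and custom User-Agent and cookie settings.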
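As an illustration of what "automatic retry" means in practice, the following sketch retries a failing task a fixed number of times with a pause between attempts. It is a generic helper written for this article, not WebMagic's implementation; the `Retry` class and its parameter names are assumptions, and WebMagic's own retry behaviour may differ in detail.

```java
import java.util.function.Supplier;

/** Hypothetical retry helper; illustrates the idea only, not WebMagic's internals. */
final class Retry {

    /**
     * Calls the task once, then up to retryTimes more times if it throws,
     * sleeping sleepMillis between attempts. Rethrows the last failure.
     */
    static <T> T withRetries(Supplier<T> task, int retryTimes, long sleepMillis) {
        if (retryTimes < 0) {
            throw new IllegalArgumentException("retryTimes must be >= 0");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt <= retryTimes; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
            if (attempt < retryTimes) {
                sleep(sleepMillis);
            }
        }
        throw last;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated download that fails twice before succeeding.
        String result = withRetries(() -> {
            if (++calls[0] < 3) throw new IllegalStateException("transient failure");
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```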

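WebMagic includes page-extraction facilities: developers can extract links and content with CSS selectors, XPath, and regular expressions, and selectors can be chained.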
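The idea of chaining selectors (first narrow the page with XPath, then filter the result with a regular expression) can be sketched with the JDK's built-in XML and regex APIs. This is not WebMagic's selector API; the `LinkExtractor` class, the sample markup, and the `/issue/...` paths are made up for illustration, and the JDK parser used here requires well-formed XHTML, which real pages often are not.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: select attribute values by XPath, then keep those matching a regex. */
final class LinkExtractor {

    static List<String> extract(String xhtml, String xpath, String regex) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Step 1: XPath narrows the document to the candidate nodes.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            // Step 2: the regex filters the values selected in step 1.
            Pattern pattern = Pattern.compile(regex);
            List<String> matches = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String value = nodes.item(i).getNodeValue();
                if (value != null && pattern.matcher(value).find()) {
                    matches.add(value);
                }
            }
            return matches;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div id=\"m1\"><a href=\"/issue/1\"><b>A1</b></a><a href=\"/about\">About</a></div>"
                + "<a href=\"/issue/2\">outside</a>"
                + "</body></html>";
        // Only links inside div#m1 whose href starts with /issue/ survive both steps.
        System.out.println(extract(page, "//div[@id='m1']/a/@href", "^/issue/"));
    }
}
```

The processor later in this article applies the same two ideas through WebMagic's own calls, using `page.getHtml().xpath(...)` for content and `page.getUrl().regex(...)` for the URL check.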

2. Sample code

(1) Add the new dependencies to pom.xml

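<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.5.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.5.3</version>
</dependency>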

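(2) Scheduling: the timed-task configuration, which can also be called the scheduler. Note that this is a Spring scheduled task that launches the crawl, not WebMagic's own Scheduler component.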

import javax.annotation.Resource;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

import com.zhibo.xmt.common.webmagic.xpager.popeline.XpaperZgtcbPopeline;
import com.zhibo.xmt.common.webmagic.xpager.processor.XpaperZgtcbProcessor;

import us.codecraft.webmagic.Spider;

/**
 * Crawls page-layout data from xpaper: http://i.xpaper.net/cnsports
 * New issues are published every Tuesday, Thursday, and Sunday.
 *
 * @author Bruce
 */
@Component
@EnableScheduling
public class XpaperWebmagicSchedulingConfig {

    private final Logger logger = LoggerFactory.getLogger(XpaperWebmagicSchedulingConfig.class);

    public static final String BASE_URL = "http://i.xpaper.net/cnsports";

    @Resource
    private XpaperZgtcbPopeline xpaperZgtcbPopeline;

    /**
     * Crawls the page-layout content of 中国体彩报 (xpaper all-media digital edition).
     *
     * Cron examples:
     * "0 0/1 18 * * ?" runs once a minute from 18:00 to 18:59 every day
     * "0 10 4 ? * *"   fires at 04:10 every day
     */
    @Transactional
    @Scheduled(cron = "0 10 4 ? * *")
    public void createLotteryInfo() {
        System.out.println("Crawling 中国体彩报 xpaper digital-edition page content");
        long startTime, endTime;
        System.out.println("[Crawler started]");
        startTime = System.currentTimeMillis();
        logger.info("Crawl URL: " + BASE_URL);
        try {
            Spider spider = Spider.create(new XpaperZgtcbProcessor());
            spider.addUrl(BASE_URL);
            spider.addPipeline(xpaperZgtcbPopeline);
            spider.thread(5);
            spider.setExitWhenComplete(true);
            // run() blocks until the crawl finishes. The original called start()
            // followed immediately by stop(); start() is asynchronous, so that
            // could stop the spider before it had fetched anything and made the
            // elapsed time below meaningless.
            spider.run();
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
        }
        endTime = System.currentTimeMillis();
        System.out.println("[Crawler finished]");
        System.out.println("Crawling 中国体彩报 xpaper page content took about "
                + ((endTime - startTime) / 1000) + " seconds; results saved to the database.");
    }
}
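The two cron strings quoted in the listing use Spring's six-field format: second, minute, hour, day of month, month, day of week. The sketch below simply labels each field so the expressions are easier to read. The `CronFields` class is a hypothetical helper for this article and does not validate or evaluate the expression.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that labels the six fields of a Spring-style cron expression. */
final class CronFields {

    private static final String[] NAMES = {
        "second", "minute", "hour", "day-of-month", "month", "day-of-week"
    };

    /** Splits the expression on whitespace and pairs each part with its field name. */
    static Map<String, String> describe(String cron) {
        String[] parts = cron.trim().split("\\s+");
        if (parts.length != NAMES.length) {
            throw new IllegalArgumentException(
                    "expected " + NAMES.length + " fields but got " + parts.length);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < NAMES.length; i++) {
            fields.put(NAMES[i], parts[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // "0 10 4 ? * *": second 0, minute 10, hour 4, every day.
        System.out.println(describe("0 10 4 ? * *"));
        // "0 0/1 18 * * ?": every minute (0/1) during hour 18.
        System.out.println(describe("0 0/1 18 * * ?"));
    }
}
```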

(3) XpaperZgtcbProcessor: the page processor that parses the pages to be crawled

import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

import com.zhibo.xmt.common.enums.common.EnumCommonStatus;
import com.zhibo.xmt.common.util.DateUtil;
import com.zhibo.xmt.common.vo.pagesub.Journal;
import com.zhibo.xmt.common.vo.pagesub.JournalPage;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

/**
 * Crawls the page-layout content of 中国体彩报 (xpaper all-media digital edition).
 * http://i.xpaper.net/cnsports
 *
 * @author Bruce
 */
@Component
public class XpaperZgtcbProcessor implements PageProcessor {

    private static Logger logger = LoggerFactory.getLogger(XpaperZgtcbProcessor.class);

    // Regex note: in a Java literal such as "\\.", the first backslash escapes
    // the second for Java, and the resulting \. escapes the dot for the regex engine.

    // Base URL of the site
    public static final String BASE_URL = "http://i.xpaper.net/cnsports";

    private Site site = Site.me()
            .setDomain(BASE_URL)
            .setSleepTime(1000)
            .setRetryTimes(30)
            .setCharset("utf-8")
            .setTimeOut(30000)
            .setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        if (page.getUrl().regex(BASE_URL).match()) {
            String contentTitle = page.getHtml().xpath("//title/text()").toString();
            /*
             * Example values:
             *   contentTitle: 中国体彩报 - 第1151期 - 第01版 - A1
             *   issue:        1151
             *   issueDesc:    中国体彩报 - 第1151期
             */
            String[] contentTitles = contentTitle.trim().split("-");
            // "\\s" restores the backslash that Java string literals require.
            String issueStr = contentTitles[1].replaceAll("第", "").replaceAll("期", "").replaceAll(" ", "").trim().replaceAll("\\s*", "");
            String issue = issueStr;
            // The title contains several kinds of whitespace, so it is stripped in more than one way.
            Pattern p = Pattern.compile("\\s*|\t|\r|\n");
            Matcher m = p.matcher(issue);
            issue = m.replaceAll("");
            // Non-breaking spaces are not matched by \s, so remove them explicitly.
            issue = issue.replaceAll("\u00A0", "");
            String issueDesc = contentTitles[0] + "-" + contentTitles[1];

            Journal journal = new Journal();
            journal.setTitle(issueDesc);
            journal.setTitleDesc(contentTitle);
            journal.setIssue(issue);
            journal.setDate(DateUtil.getDateByFormat(DateUtil.getDateByFormat(new Date(), "yyyy-MM-dd"), "yyyy-MM-dd"));
            journal.setDateStr(DateUtil.getDateByFormat(new Date(), "yyyy-MM-dd"));
            journal.setType(1);
            journal.setStatus(EnumCommonStatus.NORMAL.getValue());
            journal.setGrabDate(new Date());
            journal.setCreatedAt(new Date());
            journal.setUpdatedAt(new Date());
            logger.info("Journal data: " + journal.toString());

            // Each <a> under div#m1 is one page of the issue.
            List<Selectable> list = page.getHtml().xpath("//div[@id='m1']/a").nodes();
            if (list != null && list.size() > 0) {
                List<JournalPage> journalPages = new ArrayList<>();
                for (int i = 0; i < list.size(); i++) {
                    Selectable s = list.get(i);
                    String link = s.links().toString();
                    String titleStr = s.xpath("//b/text()").toString();
                    if (StringUtils.isBlank(titleStr)) {
                        titleStr = s.toString().split(">")[1].replaceAll("
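The title handling in `process` removes "第", "期", and several kinds of whitespace in separate passes. One reason the extra pass for `\u00A0` is needed is that Java's `\s` does not match a non-breaking space by default. As a possible alternative, offered as a sketch rather than the author's method, the issue number can be captured with a single regular expression. The `IssueParser` class is hypothetical; the sample title is the one quoted in the listing's comment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls the issue number out of a title such as "... - 第1151期 - ...". */
final class IssueParser {

    // Digits between "第" and "期"; \D* tolerates any non-digit padding around
    // the number, including spaces, tabs, and non-breaking spaces.
    private static final Pattern ISSUE = Pattern.compile("第\\D*(\\d+)\\D*期");

    /** Returns the issue digits, or null when the title has no "第...期" segment. */
    static String parseIssue(String title) {
        Matcher m = ISSUE.matcher(title);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(parseIssue("中国体彩报 - 第1151期 - 第01版 - A1")); // prints 1151
    }
}
```

Because the pattern captures only the digits, no follow-up whitespace cleanup is needed, and a title without an issue segment yields `null` instead of an `ArrayIndexOutOfBoundsException` from the split-based approach.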
