WebMagic (crawler) example: scraping Sina Blog

Published by 优采云: 2020-05-19 08:00

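The WebMagic framework focuses on practical content extraction. This post walks through an example of using the WebMagic crawler framework to scrape Sina Blog.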

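We use the author's Sina blog as the example. Here we want to extract the title, content, date and other fields from each blog post page, and also collect the post links from the list pages, so that we end up with every article in the blog.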

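List pages have URLs of the form "http://blog.sina.com.cn/s/articlelist_1487828712_0_1.html", where the "1" in "0_1" is the page number and varies.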

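Article pages have URLs matching http://blog\\.sina\\.com\\.cn/s/blog_\\w+\\.html (the URL_POST pattern in the code), where the part after "blog_", for example "95b4e3010102xsua", is the variable string.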

From the analysis above, we first need to find the article URLs and then fetch the articles by URL. So discovering every article address in this blog is the crawler's first step.

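We can use the regular expression http://blog\\.sina\\.com\\.cn/s/blog_\\w+\\.html for a first rough filter on URLs. The complication is that this pattern is too broad and may pick up posts from other blogs, so we must take URLs only from a specific region of the list page.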
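To see why the pattern alone is too broad, here is a minimal plain-Java sketch of the same filter, using java.util.regex rather than WebMagic's own selectors. The class name PostUrlFilter and the third sample URL are made up for illustration; the first two URLs come from this article.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class PostUrlFilter {
    // Same pattern as in the article; the dots are escaped so they match literally.
    static final Pattern POST_URL =
            Pattern.compile("http://blog\\.sina\\.com\\.cn/s/blog_\\w+\\.html");

    /** Keeps only the URLs that match the article-page pattern in full. */
    static List<String> keepPostUrls(List<String> urls) {
        return urls.stream()
                .filter(url -> POST_URL.matcher(url).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList(
                "http://blog.sina.com.cn/s/blog_95b4e3010102xsua.html",
                "http://blog.sina.com.cn/s/articlelist_1487828712_0_1.html",
                // Hypothetical post that could belong to any other blog.
                "http://blog.sina.com.cn/s/blog_abc123.html");
        System.out.println(keepPostUrls(candidates));
    }
}
```

The filter drops the list page but keeps both post URLs: nothing in the pattern ties a post to one particular blog, which is why the links have to be taken from a specific region of the list page.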

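Here we use the XPath //div[@class="articleList"] to select the whole list region, then links() or the XPath //a/@href to get all the links inside it, and finally the regular expression http://blog\\.sina\\.com\\.cn/s/blog_\\w+\\.html to filter the URLs and drop links such as "Edit" or "More". So we can write: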

page.addTargetRequests(
        page.getHtml().xpath("//div[@class=\"articleList\"]")
                .links()
                .regex("http://blog\\.sina\\.com\\.cn/s/blog_\\w+\\.html")
                .all());

At the same time, we need to add every list page we find to the queue of URLs to download:

page.addTargetRequests(
        page.getHtml().links()
                .regex("http://blog\\.sina\\.com\\.cn/s/articlelist_1487828712_0_\\d+\\.html")
                .all());

Extracting the information from an article page is fairly simple: we just write the matching XPath expressions.

page.putField("title", page.getHtml().xpath("//div[@class='articalTitle']/h2"));
page.putField("content", page.getHtml().xpath(
        "//div[@id='articlebody']//div[@class='articalContent']"));
page.putField("date", page.getHtml().xpath(
        "//div[@id='articlebody']//span[@class='time SG_txtc']").regex("\\((.*)\\)"));
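The date field depends on a capture group: the regular expression \\((.*)\\) keeps only the text between the parentheses. Below is a minimal plain-Java sketch of that step, using java.util.regex instead of WebMagic's regex() selector. The class name DateInParentheses and the sample timestamp in the usage note are assumptions for illustration, not values taken from the blog.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DateInParentheses {
    // Literal "(", then a captured group, then literal ")".
    private static final Pattern IN_PARENS = Pattern.compile("\\((.*)\\)");

    /** Returns the text between the parentheses, or null if there are none. */
    static String extract(String text) {
        Matcher m = IN_PARENS.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

For a made-up span such as `<span class="time SG_txtc">(2013-05-20 10:00:00)</span>`, extract returns the timestamp without the surrounding parentheses or markup.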

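We have now defined how to process list pages and target pages; next we need to tell them apart during processing. In this example that is easy: list pages and target pages have different URL formats, so we can distinguish them directly by URL.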
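The URL-based distinction can be sketched in plain Java as follows. The two patterns are the ones used in this article; the class PageKind, the enum, and the extra OTHER case are hypothetical additions for illustration (the article's own code simply treats every non-list page as an article page).

```java
import java.util.regex.Pattern;

final class PageKind {
    enum Kind { LIST, POST, OTHER }

    static final Pattern URL_LIST = Pattern.compile(
            "http://blog\\.sina\\.com\\.cn/s/articlelist_1487828712_0_\\d+\\.html");
    static final Pattern URL_POST = Pattern.compile(
            "http://blog\\.sina\\.com\\.cn/s/blog_\\w+\\.html");

    /** Classifies a URL by checking it against the list and post patterns in turn. */
    static Kind classify(String url) {
        if (URL_LIST.matcher(url).matches()) {
            return Kind.LIST;
        }
        if (URL_POST.matcher(url).matches()) {
            return Kind.POST;
        }
        return Kind.OTHER; // assumption: anything else is neither a list nor a post
    }
}
```

Because the two URL formats never overlap, the order of the checks does not matter here; with overlapping patterns the more specific one would have to be tested first.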

The complete code for this example is as follows:

package us.codecraft.webmagic.samples;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class SinaBlogProcessor implements PageProcessor {

    public static final String URL_LIST = "http://blog\\.sina\\.com\\.cn/s/articlelist_1487828712_0_\\d+\\.html";

    public static final String URL_POST = "http://blog\\.sina\\.com\\.cn/s/blog_\\w+\\.html";

    private Site site = Site.me()
            .setDomain("blog.sina.com.cn")
            .setSleepTime(3000)
            .setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");

    @Override
    public void process(Page page) {
        if (page.getUrl().regex(URL_LIST).match()) {
            // List page: queue the post links found in the article list, plus the other list pages.
            page.addTargetRequests(page.getHtml().xpath("//div[@class=\"articleList\"]").links().regex(URL_POST).all());
            page.addTargetRequests(page.getHtml().links().regex(URL_LIST).all());
        } else {
            // Article page: extract title, content and date.
            page.putField("title", page.getHtml().xpath("//div[@class='articalTitle']/h2"));
            page.putField("content", page.getHtml().xpath("//div[@id='articlebody']//div[@class='articalContent']"));
            page.putField("date", page.getHtml().xpath("//div[@id='articlebody']//span[@class='time SG_txtc']").regex("\\((.*)\\)"));
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new SinaBlogProcessor())
                .addUrl("http://blog.sina.com.cn/s/articlelist_1487828712_0_1.html")
                .run();
    }
}

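This example relies on a few main techniques: discovering links within a specific region of the page, filtering them with a regular expression, and handling two kinds of pages in a single PageProcessor by distinguishing them by URL.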

If you find if-else inconvenient for separating the different kinds of processing, you can use SubPageProcessor to solve the problem.
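The idea of replacing the if-else with one handler per page type can be sketched in plain Java. The UrlHandler interface, RegexHandler and Dispatcher below are hypothetical stand-ins written for this illustration; they are not WebMagic's SubPageProcessor API, whose actual interfaces should be checked in the webmagic-extension documentation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical handler: decides whether it applies to a URL, then handles it. */
interface UrlHandler {
    boolean matches(String url);

    String handle(String url);
}

/** Handler that applies when the whole URL matches a regular expression. */
final class RegexHandler implements UrlHandler {
    private final Pattern pattern;
    private final String label;

    RegexHandler(String regex, String label) {
        this.pattern = Pattern.compile(regex);
        this.label = label;
    }

    @Override
    public boolean matches(String url) {
        return pattern.matcher(url).matches();
    }

    @Override
    public String handle(String url) {
        // A real handler would extract fields or queue links; this one only reports its label.
        return label;
    }
}

final class Dispatcher {
    private final List<UrlHandler> handlers = new ArrayList<>();

    Dispatcher register(UrlHandler handler) {
        handlers.add(handler);
        return this;
    }

    /** Runs the first handler that matches; returns null when none does. */
    String dispatch(String url) {
        for (UrlHandler handler : handlers) {
            if (handler.matches(url)) {
                return handler.handle(url);
            }
        }
        return null;
    }
}
```

Registering one RegexHandler for the list-page pattern and one for the post pattern gives the same routing as the if-else in process(), but each page type lives in its own class and new page types can be added without touching the existing ones.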
