文章采集接口 [开源
优采云 发布时间: 2020-08-18 20:00文章采集接口 [开源
[开源 .NET 跨平台 Crawler 数据采集 爬虫框架: DotnetSpider] [四] JSON数据解析
[DotnetSpider 系列目录]
场景模拟
接上一篇，假设因为疏漏没有保存 JD SKU 对应的店铺信息。这时我们需要重新完整采集所有的 SKU 数据吗？补爬的话历史数据就用不了了。因此，去京东页面上找找看是否有提供相关的接口。
查找 API 请求接口
安装 Fiddler, 并打开
在谷歌浏览器中访问该接口（URL 形如 http://chat1.jd.com/api/checkChat?my=list&pidList=...,1343,9719&callback=json，其中 pidList 为要查询的 SKU 列表）
在 Fiddler 中逐条查看访问记录，找到我们想要的接口
编写爬虫
分析返回的数据结果，我们可以先写出数据对象的定义（观察 Expression 的值已经是 JsonPath 查询表达式了，同时 Type 必须设置为 Type = SelectorType.JsonPath）。另外需要注意的是，这次的爬虫是更新型爬虫，就是说采集到的数据要补充回原表，那么就一定要设置主键字段是什么，即在数据类上添加主键字段的定义
// Entity for the update-mode crawl: rows are written back into the existing
// jd.sku_v2_{Monday} table (TableSuffix.Monday suffixes the table name with the
// Monday run id), matched on the "sku" primary index rather than inserted.
// EntitySelector iterates every element of the returned JSON array ($.[*]);
// every selector here is a JsonPath expression, so Type must be SelectorType.JsonPath.
[Schema("jd", "sku_v2", TableSuffix.Monday)]
[EntitySelector(Expression = "$.[*]", Type = SelectorType.JsonPath)]
[Indexes(Primary = "sku")]
public class ProductUpdater : ISpiderEntity
{
// SKU id, taken from the "pid" field of each JSON item; primary key used to match the row to update.
[StoredAs("sku", DataType.String, 25)]
[PropertySelector(Expression = "$.pid", Type = SelectorType.JsonPath)]
public string Sku { get; set; }
// Shop display name, taken from the "seller" field.
[StoredAs("shopname", DataType.String, 100)]
[PropertySelector(Expression = "$.seller", Type = SelectorType.JsonPath)]
public string ShopName { get; set; }
// Shop id, taken from the "shopId" field.
[StoredAs("shopid", DataType.String, 25)]
[PropertySelector(Expression = "$.shopId", Type = SelectorType.JsonPath)]
public string ShopId { get; set; }
}
定义Pipeline的类型为Update
// Register the MySQL pipeline in Update mode: scraped values are merged into the
// existing table rows (matched by the entity's primary index) instead of inserted.
// NOTE(review): credentials are hardcoded and "Data Source= " is empty in this
// snippet (the full sample further down uses localhost) — move the connection
// string to configuration and fill in the host before running.
context.AddEntityPipeline(new MySqlEntityPipeline
{
ConnectString = "Database='taobao';Data Source= ;User ID=root;Password=1qazZAQ!;Port=4306",
Mode = PipelineMode.Update
});
由于返回的数据外面还包了一层 json(...) 这样的 JSONP 包装（padding），所以需要先做一个截取操作。框架提供了 PageHandler 接口，并且我们实现了大量常用的 Handler，用于 HTML 解析前的一些预处理操作，因此完整的代码如下
// Complete update spider: re-crawls only the SKUs whose shop columns are still
// null and merges the results back into jd.sku_v2_{Monday} via an Update-mode
// MySQL pipeline.
public class JdShopDetailSpider : EntitySpiderBuilder
{
protected override EntitySpider GetEntitySpider()
{
var context = new EntitySpider(new Site())
{
TaskGroup = "JD SKU Weekly",
// Identity embeds the Monday run id so each weekly run is a distinct task.
Identity = "JD Shop details " + DateTimeUtils.MondayRunId,
CachedSize = 1,
ThreadNum = 8,
Downloader = new HttpClientDownloader
{
// The endpoint returns JSONP: json(...);. SubContentHandler cuts the body
// between Start and End before parsing; StartOffset = 5 skips the five
// characters of "json(" so only the raw JSON array remains.
DownloadCompleteHandlers = new IDownloadCompleteHandler[]
{
new SubContentHandler
{
Start = "json(",
End = ");",
StartOffset = 5,
EndOffset = 0
}
}
},
// Seed URLs come from the database: select SKUs with missing shop info and
// substitute each "sku" column value into the URL template below.
PrepareStartUrls = new PrepareStartUrls[]
{
new BaseDbPrepareStartUrls()
{
Source = DataSource.MySql,
// NOTE(review): credentials hardcoded in source — move to configuration.
ConnectString = "Database='test';Data Source= localhost;User ID=root;Password=1qazZAQ!;Port=3306",
QueryString = $"SELECT * FROM jd.sku_v2_{DateTimeUtils.MondayRunId} WHERE shopname is null or shopid is null order by sku",
Columns = new [] {new DataColumn { Name = "sku"} },
// "FormateStrings" is the framework's (misspelled) property name; {0} is
// replaced by the selected sku value to build each start URL.
FormateStrings = new List { "http://chat1.jd.com/api/checkChat?my=list&pidList={0}&callback=json" }
}
}
};
// Update mode: rows matched by the "sku" primary index are updated, not inserted.
// NOTE(review): note this pipeline targets Port=4306 while the seed query above
// uses 3306 — confirm both connection strings are intentional.
context.AddEntityPipeline(new MySqlEntityPipeline
{
ConnectString = "Database='taobao';Data Source=localhost ;User ID=root;Password=1qazZAQ!;Port=4306",
Mode = PipelineMode.Update
});
// NOTE(review): this TargetUrlExtractor (XPath pager region + &page=N& pattern)
// appears copied from the HTML list-page spider of the previous article; the
// response here is JSON, so confirm the extractor is actually needed.
context.AddEntityType(typeof(ProductUpdater), new TargetUrlExtractor
{
Region = new Selector { Type = SelectorType.XPath, Expression = "//*[@id=\"J_bottomPage\"]" },
Patterns = new List { @"&page=[0-9]+&" }
});
return context;
}
// Entity written back into jd.sku_v2_{Monday}; one instance is extracted per
// element of the JSON array ($.[*]); all selectors are JsonPath, keyed on "sku".
[Schema("jd", "sku_v2", TableSuffix.Monday)]
[EntitySelector(Expression = "$.[*]", Type = SelectorType.JsonPath)]
[Indexes(Primary = "sku")]
public class ProductUpdater : ISpiderEntity
{
// SKU id from the JSON "pid" field; matches the existing row to update.
[StoredAs("sku", DataType.String, 25)]
[PropertySelector(Expression = "$.pid", Type = SelectorType.JsonPath)]
public string Sku { get; set; }
// Shop display name from the "seller" field.
[StoredAs("shopname", DataType.String, 100)]
[PropertySelector(Expression = "$.seller", Type = SelectorType.JsonPath)]
public string ShopName { get; set; }
// Shop id from the "shopId" field.
[StoredAs("shopid", DataType.String, 25)]
[PropertySelector(Expression = "$.shopId", Type = SelectorType.JsonPath)]
public string ShopId { get; set; }
}
}
posted @ 2017-04-14 10:26 网络蚂蚁 阅读(1417) 评论(0) 编辑