High-performance asynchronous concurrent crawler! - Scraping articles automatically with colly

优采云 Published: 2022-06-12 08:31


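  colly plays much the same role in Go that Scrapy plays in Python: both are heavyweights among crawling frameworks. This article uses colly to scrape a list of blog posts, covering collector configuration, goquery-based extraction of DOM node data, automatic pagination, CSV persistence, and JSON console output. Each step is short and easy to follow.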

  

  Code

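  The crawl starts from one user's blog list page on the community site, for example the URL passed to c.Visit in the code below.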

    package main

    import (
        "encoding/csv"
        "encoding/json"
        "log"
        "os"
        "regexp"
        "strconv"
        "strings"

        "github.com/gocolly/colly"
    )

    // Article holds the data scraped for one blog post.
    type Article struct {
        ID       int    `json:"id,omitempty"`
        Title    string `json:"title,omitempty"`
        URL      string `json:"url,omitempty"`
        Created  string `json:"created,omitempty"`
        Reads    string `json:"reads,omitempty"`
        Comments string `json:"comments,omitempty"`
        Feeds    string `json:"feeds,omitempty"`
    }

    // idPattern matches the first run of digits in an article URL.
    // It is compiled once instead of on every callback.
    var idPattern = regexp.MustCompile(`\d+`)

    // csvSave persists the scraped articles to a CSV file.
    func csvSave(fName string, data []Article) error {
        file, err := os.Create(fName)
        if err != nil {
            return err
        }
        defer file.Close()

        writer := csv.NewWriter(file)
        writer.Write([]string{"ID", "Title", "URL", "Created", "Reads", "Comments", "Feeds"})
        for _, v := range data {
            writer.Write([]string{strconv.Itoa(v.ID), v.Title, v.URL, v.Created, v.Reads, v.Comments, v.Feeds})
        }
        // Flush the buffered rows, then report any error from Write or Flush.
        writer.Flush()
        return writer.Error()
    }

    func main() {
        articles := make([]Article, 0, 200)
        // 1. Prepare the collector instance
        c := colly.NewCollector(
            // Enable local debugging
            // colly.Debugger(&debug.LogDebugger{}),
            colly.AllowedDomains("learnku.com"),
            // Cache responses so pages are not downloaded twice
            // colly.CacheDir("./learnku_cache"),
        )

        // 2. Parse the page data
        c.OnHTML("div.blog-article-list > .event", func(e *colly.HTMLElement) {
            article := Article{
                Title: e.ChildText("div.content > div.summary"),
                URL:   e.ChildAttr("div.content a.title", "href"),
                Feeds: e.ChildText("div.item-meta > a:first-child"),
            }
            // Pick different items out of the same set of nodes by index
            e.ForEach("div.content > div.meta > div.date>a", func(i int, el *colly.HTMLElement) {
                switch i {
                case 1:
                    article.Created = el.Attr("data-tooltip")
                case 2:
                    // Split on whitespace and take the second field, if present
                    if f := strings.Fields(el.Text); len(f) > 1 {
                        article.Reads = f[1]
                    }
                case 3:
                    if f := strings.Fields(el.Text); len(f) > 1 {
                        article.Comments = f[1]
                    }
                }
            })
            // Extract the numeric ID from the URL and convert it to an int.
            // FindString returns "" when nothing matches, so ID stays 0 instead of panicking.
            article.ID, _ = strconv.Atoi(idPattern.FindString(article.URL))
            articles = append(articles, article)
        })

        // Next page: follow every pagination link
        c.OnHTML("a[href].page-link", func(e *colly.HTMLElement) {
            e.Request.Visit(e.Attr("href"))
        })

        // Start the crawl
        if err := c.Visit("https://learnku.com/blog/pardon"); err != nil {
            log.Fatalf("Cannot visit start page: %s", err)
        }

        // Output
        if err := csvSave("pardon.csv", articles); err != nil {
            log.Fatalf("Cannot write %q: %s", "pardon.csv", err)
        }
        enc := json.NewEncoder(os.Stdout)
        enc.SetIndent("", "  ")
        enc.Encode(articles)

        // Print the collector's statistics
        log.Println(c)
    }
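  The two string-handling steps in the callback (pulling the numeric ID out of the URL and taking the second whitespace-separated field of a label) can be tried offline, without touching the network. The sketch below is one possible way to isolate them; the helper names articleID and secondField are hypothetical, and the label text "Reads 650" is an assumed format, not taken from the real page.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// idPattern matches the first run of digits in a URL.
var idPattern = regexp.MustCompile(`\d+`)

// articleID is a hypothetical helper: it returns the first number found in
// rawURL, or 0 when the URL contains no usable digits.
func articleID(rawURL string) int {
	id, err := strconv.Atoi(idPattern.FindString(rawURL))
	if err != nil {
		return 0
	}
	return id
}

// secondField is a hypothetical helper: it splits text on whitespace and
// returns the second field, or "" when there are fewer than two fields.
func secondField(text string) string {
	fields := strings.Fields(text)
	if len(fields) < 2 {
		return ""
	}
	return fields[1]
}

func main() {
	fmt.Println(articleID("https://learnku.com/articles/30604"))
	// "Reads 650" is an assumed label format used only for illustration.
	fmt.Println(secondField("Reads 650"))
	// A label with a single field yields an empty string instead of a panic.
	fmt.Println(secondField("650") == "")
}
```

  Keeping these steps in small functions means a change in the page markup shows up as empty fields rather than an index-out-of-range panic in the middle of a crawl.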

  Output

  Console output

    ....
        "id": 30604,
        "title": "教程: TodoMVC 与 director 路由",
        "url": "https://learnku.com/articles/30604",
        "created": "2019-07-01 12:42:01",
        "reads": "650",
        "comments": "0",
        "feeds": "0"
      },
      {
        "id": 30579,
        "title": "flaskr 进阶笔记",
        "url": "https://learnku.com/articles/30579",
        "created": "2019-06-30 19:01:04",
        "reads": "895",
        "comments": "0",
        "feeds": "0"
      },
      {
        "id": 30542,
        "title": "教程 Redis+ flask+vue 在线聊天",
        "url": "https://learnku.com/articles/30542",
        "created": "2019-06-29 12:19:45",
        "reads": "2760",
        "comments": "1",
        "feeds": "2"
      }
    ]
    2019/12/20 15:50:14 Requests made: 5 (5 responses) | Callbacks: OnRequest: 0, OnHTML: 2, OnResponse: 0, OnError: 0

  CSV file output

    ID,Title,URL,Created,Reads,Comments,Feeds
    37991,ferret 爬取动态网页,https://learnku.com/articles/37991,2019-12-15 10:43:03,219,0,3
    37803,匿名类 与 索引重建,https://learnku.com/articles/37803,2019-12-09 19:35:09,323,1,0
    37476,大话并发,https://learnku.com/articles/37476,2019-12-08 21:17:55,612,0,4
    37738,三元运算符,https://learnku.com/articles/37738,2019-12-08 09:44:36,606,0,0
    37719,笔试之 模板变量替换,https://learnku.com/articles/37719,2019-12-07 18:30:42,843,0,0
    37707,笔试之 连续数增维,https://learnku.com/articles/37707,2019-12-07 13:50:17,872,0,0
    37616,笔试之 一行代码求重,https://learnku.com/articles/37616,2019-12-05 12:10:24,792,0,0
    ....
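  To check the persisted data, the file can be read back with the same encoding/csv package. The following is a minimal sketch under a few assumptions: it parses an in-memory string holding two of the rows shown above rather than opening pardon.csv, and the Row type and parseRows function are hypothetical names introduced for illustration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// Row is a hypothetical, reduced view of one CSV record.
type Row struct {
	ID    int
	Title string
	Reads int
}

// parseRows reads CSV text in the column layout shown above
// (ID, Title, URL, Created, Reads, Comments, Feeds) and skips the header.
func parseRows(text string) ([]Row, error) {
	records, err := csv.NewReader(strings.NewReader(text)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, nil
	}
	rows := make([]Row, 0, len(records)-1)
	for _, rec := range records[1:] {
		if len(rec) < 5 {
			return nil, fmt.Errorf("expected at least 5 columns, got %d", len(rec))
		}
		id, err := strconv.Atoi(rec[0])
		if err != nil {
			return nil, err
		}
		reads, err := strconv.Atoi(rec[4])
		if err != nil {
			return nil, err
		}
		rows = append(rows, Row{ID: id, Title: rec[1], Reads: reads})
	}
	return rows, nil
}

func main() {
	sample := "ID,Title,URL,Created,Reads,Comments,Feeds\n" +
		"37991,ferret 爬取动态网页,https://learnku.com/articles/37991,2019-12-15 10:43:03,219,0,3\n" +
		"37476,大话并发,https://learnku.com/articles/37476,2019-12-08 21:17:55,612,0,4\n"
	rows, err := parseRows(sample)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for _, r := range rows {
		fmt.Printf("%d %s reads=%d\n", r.ID, r.Title, r.Reads)
	}
}
```

  To read the real file, the strings.NewReader call could be replaced with a file opened via os.Open.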

