Web Scraping (AngleSharp): Using AngleSharp for HTML Parsing
```csharp
public static async Task<string> GetHtmlSourceCodeAsync(string uri)
{
    var httpClient = new HttpClient();
    try
    {
        var htmlSource = await httpClient.GetStringAsync(uri);
        return htmlSource;
    }
    catch (HttpRequestException e)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine($"{nameof(HttpRequestException)}: {e.Message}");
        return null;
    }
}
```
CSS is a boon for web scrapers. On the sample page, elements such as `<span class="green">` and `<span class="red">` appear many times.
We can use AngleSharp's QuerySelectorAll() method to find every element that matches and get them back in a result collection.
```csharp
public static async Task FindGreenClassAsync()
{
    const string url = "http://www.pythonscraping.com/pages/warandpeace.html";
    var html = await GetHtmlSourceCodeAsync(url);
    if (!string.IsNullOrWhiteSpace(html))
    {
        var parser = new HtmlParser();
        var document = await parser.ParseAsync(html);
        var nameList = document.QuerySelectorAll("span.green");

        Console.WriteLine("Green names are:");
        Console.ForegroundColor = ConsoleColor.Green;
        foreach (var item in nameList)
        {
            Console.WriteLine(item.TextContent);
        }
    }
    else
    {
        Console.WriteLine("No html source code returned.");
    }
}
```
Very simple; it works just like the standard DOM.
If you only need an element's text, its TextContent property is all you need.
Let's look at another example:
1. Find all of the h1, h2, h3, h4, h5, and h6 elements on the page.
2. Find the span elements whose class is green or red.
```csharp
public static async Task FindByAttributeAsync()
{
    const string url = "http://www.pythonscraping.com/pages/warandpeace.html";
    var html = await GetHtmlSourceCodeAsync(url);
    if (!string.IsNullOrWhiteSpace(html))
    {
        var parser = new HtmlParser();
        var document = await parser.ParseAsync(html);

        var headers = document.QuerySelectorAll("*")
            .Where(x => new[] { "h1", "h2", "h3", "h4", "h5", "h6" }.Contains(x.TagName.ToLower()));
        Console.WriteLine("Headers are:");
        PrintItemsText(headers);

        var greenAndRed = document.All
            .Where(x => x.TagName.ToLower() == "span" && (x.ClassList.Contains("green") || x.ClassList.Contains("red")));
        Console.WriteLine("Green and Red spans are:");
        PrintItemsText(greenAndRed);

        var thePrinces = document.QuerySelectorAll("*").Where(x => x.TextContent == "the prince");
        Console.WriteLine(thePrinces.Count());
    }
    else
    {
        Console.WriteLine("No html source code returned.");
    }

    void PrintItemsText(IEnumerable<IElement> elements)
    {
        foreach (var item in elements)
        {
            Console.WriteLine(item.TextContent);
        }
    }
}
```
Here we can see that the result of QuerySelectorAll() can be filtered further with LINQ's Where method, which is very powerful.
The TagName property is simply the element's tag name.
There is also document.All: the All property is a collection of every element in the document, and it supports LINQ as well.
(The method above also uses a local function.)
With both CSS selectors and LINQ available, extracting elements becomes much easier.
Navigating the tree
A page's structure might look like this: on the sample page http://www.pythonscraping.com/pages/page3.html, for example, the body contains a table with id giftList, the table contains tr rows, and the rows contain th and td cells, some of which in turn hold span and img elements.
There are a few concepts here:
Child tags and descendant tags.
A child tag sits one level directly below its parent, while descendant tags are all of the tags at any level below the parent.
tr is a child of table, while tr, th, td, and img are all descendants of table.
With AngleSharp, child tags are found through the .Children property, and descendant tags through a CSS selector.
Sibling tags
The previous sibling is available through the .PreviousElementSibling property, and the next sibling through .NextElementSibling.
Parent tags
The .ParentElement property gives the parent tag.
```csharp
public static async Task FindDescendantAsync()
{
    const string url = "http://www.pythonscraping.com/pages/page3.html";
    var html = await GetHtmlSourceCodeAsync(url);
    if (!string.IsNullOrWhiteSpace(html))
    {
        var parser = new HtmlParser();
        var document = await parser.ParseAsync(html);

        var tableChildren = document.QuerySelector("table#giftList > tbody").Children;
        Console.WriteLine("Table's children are:");
        foreach (var child in tableChildren)
        {
            Console.WriteLine(child.LocalName);
        }

        var descendants = document.QuerySelectorAll("table#giftList > tbody *");
        Console.WriteLine("Table's descendants are:");
        foreach (var item in descendants)
        {
            Console.WriteLine(item.LocalName);
        }

        var siblings = document.QuerySelectorAll("table#giftList > tbody > tr").Select(x => x.NextElementSibling);
        Console.WriteLine("Table rows' next siblings are:");
        foreach (var item in siblings)
        {
            Console.WriteLine(item?.LocalName);
        }

        var parentSibling = document.All.SingleOrDefault(x => x.HasAttribute("src") && x.GetAttribute("src") == "../img/gifts/img1.jpg")
            ?.ParentElement.PreviousElementSibling;
        if (parentSibling != null)
        {
            Console.WriteLine($"Parent's previous sibling is: {parentSibling.TextContent}");
        }
    }
    else
    {
        Console.WriteLine("No html source code returned.");
    }
}
```
The output lists the table's children, its descendants, each row's next sibling, and finally the text of the parent's previous sibling.
Using regular expressions
"If you have a problem that you plan to solve with regular expressions, then you now have two problems."
Here is a website for testing regular expressions:
So far, AngleSharp lets us find elements with CSS selectors and filter them with LINQ; regular expressions can of course also be combined with these, in several ways, for more complex matching.
I won't introduce regular expressions themselves here; let's go straight to an example.
I want to find all images on the page whose src starts with ../img/gifts/img, is followed by digits, and ends with .jpg.
```csharp
public static async Task FindByRegexAsync()
{
    const string url = "http://www.pythonscraping.com/pages/page3.html";
    var html = await GetHtmlSourceCodeAsync(url);
    if (!string.IsNullOrWhiteSpace(html))
    {
        var parser = new HtmlParser();
        var document = await parser.ParseAsync(html);

        var images = document.QuerySelectorAll("img")
            .Where(x => x.HasAttribute("src") && Regex.Match(x.Attributes["src"].Value, @"\.\.\/img\/gifts/img.*\.jpg").Success);
        foreach (var item in images)
        {
            Console.WriteLine(item.Attributes["src"].Value);
        }

        var elementsWith2Attributes = document.All.Where(x => x.Attributes.Length == 2);
        foreach (var item in elementsWith2Attributes)
        {
            Console.WriteLine(item.LocalName);
            foreach (var attr in item.Attributes)
            {
                Console.WriteLine($"\t{attr.Name} - {attr.Value}");
            }
        }
    }
    else
    {
        Console.WriteLine("No html source code returned.");
    }
}
```
There is really nothing difficult about this.
What the example does show is that you can check whether an element has an attribute with HasAttribute("xxx"), access its attributes through the .Attributes indexer, and read an attribute's value with .Attributes["xxx"].Value.
If you are not comfortable with regular expressions, I believe that writing a bit more LINQ filtering code can get you roughly the same result, as the sketch below shows.
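For example, here is a minimal sketch of the same image filter written with plain string checks instead of a regex. It is my own illustration, not code from the original project: the method name PrintGiftImagesWithoutRegex is made up, and `document` is assumed to be parsed the same way as in FindByRegexAsync.

```csharp
// Hypothetical helper: the same "../img/gifts/img<digits>.jpg" filter, but without a regex.
// Assumes: using System; using System.Linq; using AngleSharp.Dom;
private static void PrintGiftImagesWithoutRegex(IDocument document)
{
    const string prefix = "../img/gifts/img";
    const string suffix = ".jpg";

    var sources = document.QuerySelectorAll("img")
        .Where(x => x.HasAttribute("src"))
        .Select(x => x.GetAttribute("src"))
        .Where(src => src.StartsWith(prefix)
                      && src.EndsWith(suffix)
                      && src.Length > prefix.Length + suffix.Length
                      // everything between the prefix and ".jpg" must be digits
                      && src.Substring(prefix.Length, src.Length - prefix.Length - suffix.Length)
                            .All(char.IsDigit));

    foreach (var src in sources)
    {
        Console.WriteLine(src);
    }
}
```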
Traversing a single domain
These are just a few application examples, so I'll paste the code directly.
Print every hyperlink address on a page:
```csharp
public static async Task TraversingASingleDomainAsync()
{
    var httpClient = new HttpClient();
    var htmlSource = await httpClient.GetStringAsync("http://en.wikipedia.org/wiki/Kevin_Bacon");

    var parser = new HtmlParser();
    var document = await parser.ParseAsync(htmlSource);
    var links = document.QuerySelectorAll("a");
    foreach (var link in links)
    {
        Console.WriteLine(link.Attributes["href"]?.Value);
    }
}
```
Find only the hyperlinks that meet these conditions (as implemented in the code below): they sit inside div#bodyContent, their href starts with /wiki/, and the href contains no colon.
```csharp
public static async Task FindSpecificLinksAsync()
{
    var httpClient = new HttpClient();
    var htmlSource = await httpClient.GetStringAsync("http://en.wikipedia.org/wiki/Kevin_Bacon");

    var parser = new HtmlParser();
    var document = await parser.ParseAsync(htmlSource);
    var links = document.QuerySelector("div#bodyContent").QuerySelectorAll("a")
        .Where(x => x.HasAttribute("href") && Regex.Match(x.Attributes["href"].Value, @"^(/wiki/)((?!:).)*$").Success);
    foreach (var link in links)
    {
        Console.WriteLine(link.Attributes["href"]?.Value);
    }
}
```
Pick a random link from the page, then call the same method recursively until you stop it yourself:
```csharp
private static async Task<IEnumerable<IElement>> GetLinksAsync(string uri)
{
    var httpClient = new HttpClient();
    var htmlSource = await httpClient.GetStringAsync($"http://en.wikipedia.org{uri}");
    var parser = new HtmlParser();
    var document = await parser.ParseAsync(htmlSource);

    var links = document.QuerySelector("div#bodyContent").QuerySelectorAll("a")
        .Where(x => x.HasAttribute("href") && Regex.Match(x.Attributes["href"].Value, @"^(/wiki/)((?!:).)*$").Success);
    return links;
}

public static async Task GetRandomNestedLinksAsync()
{
    var random = new Random();
    var links = (await GetLinksAsync("/wiki/Kevin_Bacon")).ToList();
    while (links.Any())
    {
        var newArticle = links[random.Next(0, links.Count)].Attributes["href"].Value;
        Console.WriteLine(newArticle);
        links = (await GetLinksAsync(newArticle)).ToList();
    }
}
```
Crawling an entire site
First, a few concepts to understand:
Surface web: the portion of the Internet that search engines can crawl directly.
The opposite of the surface web is the deep web: roughly 90% of the Internet is deep web.
Darknet (dark web / dark internet): that is a completely different beast.....
Compared with the dark web, the deep web is still relatively easy to scrape.
There are two benefits to crawling an entire site:
Because of a site's size and depth, many of the hyperlinks we collect will be duplicates. We therefore need to deduplicate links, and a Set-style collection is a good fit:
```csharp
private static readonly HashSet<string> LinkSet = new HashSet<string>();
private static readonly HttpClient HttpClient = new HttpClient();
private static readonly HtmlParser Parser = new HtmlParser();

public static async Task GetUniqueLinksAsync(string uri = "")
{
    var htmlSource = await HttpClient.GetStringAsync($"http://en.wikipedia.org{uri}");
    var document = await Parser.ParseAsync(htmlSource);

    var links = document.QuerySelectorAll("a")
        .Where(x => x.HasAttribute("href") && Regex.Match(x.Attributes["href"].Value, @"^(/wiki/)").Success);

    foreach (var link in links)
    {
        if (!LinkSet.Contains(link.Attributes["href"].Value))
        {
            var newPage = link.Attributes["href"].Value;
            Console.WriteLine(newPage);
            LinkSet.Add(newPage);
            await GetUniqueLinksAsync(newPage);
        }
    }
}
```
(Keep an eye on the recursion depth, otherwise this can sometimes crash; one non-recursive alternative is sketched below.)
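If recursion depth becomes a problem, one option is to drive the crawl with an explicit work queue instead. This is my own sketch, not code from the original project: the method name is made up, and it reuses the static HttpClient and Parser fields declared above.

```csharp
// Hypothetical iterative variant of GetUniqueLinksAsync: a work queue instead of recursion,
// so there is no call-stack depth to worry about. Reuses the HttpClient/Parser fields above.
public static async Task GetUniqueLinksIterativelyAsync(string startUri = "")
{
    var visited = new HashSet<string>();
    var queue = new Queue<string>();
    queue.Enqueue(startUri);

    while (queue.Count > 0)
    {
        var current = queue.Dequeue();
        var htmlSource = await HttpClient.GetStringAsync($"http://en.wikipedia.org{current}");
        var document = await Parser.ParseAsync(htmlSource);

        var hrefs = document.QuerySelectorAll("a")
            .Where(x => x.HasAttribute("href") && Regex.IsMatch(x.Attributes["href"].Value, @"^(/wiki/)"))
            .Select(x => x.Attributes["href"].Value);

        foreach (var href in hrefs)
        {
            // HashSet<string>.Add returns false for links we have already seen.
            if (visited.Add(href))
            {
                Console.WriteLine(href);
                queue.Enqueue(href);
            }
        }
    }
}
```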
Collecting data across an entire site
This example runs against the whole site, and it also collects some of the page text and handles exceptions:
```csharp
private static readonly HashSet<string> LinkSet = new HashSet<string>();
private static readonly HttpClient HttpClient = new HttpClient();
private static readonly HtmlParser Parser = new HtmlParser();

public static async Task GetLinksWithInfoAsync(string uri = "")
{
    var htmlSource = await HttpClient.GetStringAsync($"http://en.wikipedia.org{uri}");
    var document = await Parser.ParseAsync(htmlSource);

    try
    {
        var title = document.QuerySelector("h1").TextContent;
        Console.ForegroundColor = ConsoleColor.Green;
        Console.WriteLine(title);

        var contentElement = document.QuerySelector("#mw-content-text").QuerySelectorAll("p").FirstOrDefault();
        if (contentElement != null)
        {
            Console.WriteLine(contentElement.TextContent);
        }

        var alink = document.QuerySelector("#ca-edit").QuerySelectorAll("span a").SingleOrDefault(x => x.HasAttribute("href"))?.Attributes["href"].Value;
        Console.WriteLine(alink);
    }
    catch (NullReferenceException)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine("Cannot find the tag!");
    }

    var links = document.QuerySelectorAll("a")
        .Where(x => x.HasAttribute("href") && Regex.Match(x.Attributes["href"].Value, @"^(/wiki/)").Success).ToList();
    foreach (var link in links)
    {
        if (!LinkSet.Contains(link.Attributes["href"].Value))
        {
            var newPage = link.Attributes["href"].Value;
            Console.WriteLine(newPage);
            LinkSet.Add(newPage);
            await GetLinksWithInfoAsync(newPage);
        }
    }
}
```
Not knowing how deep the water ahead is
The first example: following random external links:
```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using AngleSharp.Parser.Html;

namespace WebScrapingWithDotNetCore.Chapter03
{
    public class CrawlingAcrossInternet
    {
        private static readonly Random Random = new Random();
        private static readonly HttpClient HttpClient = new HttpClient();
        private static readonly HashSet<string> InternalLinks = new HashSet<string>();
        private static readonly HashSet<string> ExternalLinks = new HashSet<string>();
        private static readonly HtmlParser Parser = new HtmlParser();

        public static async Task FollowExternalOnlyAsync(string startingSite)
        {
            var externalLink = await GetRandomExternalLinkAsync(startingSite);
            if (externalLink != null)
            {
                Console.WriteLine($"External Links is: {externalLink}");
                await FollowExternalOnlyAsync(externalLink);
            }
            else
            {
                Console.WriteLine("Random External link is null, Crawling terminated.");
            }
        }

        private static async Task<string> GetRandomExternalLinkAsync(string startingPage)
        {
            try
            {
                var htmlSource = await HttpClient.GetStringAsync(startingPage);
                var externalLinks = (await GetExternalLinksAsync(htmlSource, SplitAddress(startingPage)[0])).ToList();
                if (externalLinks.Any())
                {
                    return externalLinks[Random.Next(0, externalLinks.Count)];
                }

                var internalLinks = (await GetInternalLinksAsync(htmlSource, startingPage)).ToList();
                if (internalLinks.Any())
                {
                    return await GetRandomExternalLinkAsync(internalLinks[Random.Next(0, internalLinks.Count)]);
                }

                return null;
            }
            catch (HttpRequestException e)
            {
                Console.WriteLine($"Error requesting: {e.Message}");
                return null;
            }
        }

        private static string[] SplitAddress(string address)
        {
            var addressParts = address.Replace("http://", "").Replace("https://", "").Split("/");
            return addressParts;
        }

        private static async Task<IEnumerable<string>> GetInternalLinksAsync(string htmlSource, string includeUrl)
        {
            var document = await Parser.ParseAsync(htmlSource);
            var links = document.QuerySelectorAll("a")
                .Where(x => x.HasAttribute("href") && Regex.Match(x.Attributes["href"].Value, $@"^(/|.*{includeUrl})").Success)
                .Select(x => x.Attributes["href"].Value);
            foreach (var link in links)
            {
                if (!string.IsNullOrEmpty(link) && !InternalLinks.Contains(link))
                {
                    InternalLinks.Add(link);
                }
            }
            return InternalLinks;
        }

        private static async Task<IEnumerable<string>> GetExternalLinksAsync(string htmlSource, string excludeUrl)
        {
            var document = await Parser.ParseAsync(htmlSource);

            var links = document.QuerySelectorAll("a")
                .Where(x => x.HasAttribute("href") && Regex.Match(x.Attributes["href"].Value, $@"^(http|www)((?!{excludeUrl}).)*$").Success)
                .Select(x => x.Attributes["href"].Value);
            foreach (var link in links)
            {
                if (!string.IsNullOrEmpty(link) && !ExternalLinks.Contains(link))
                {
                    ExternalLinks.Add(link);
                }
            }
            return ExternalLinks;
        }

        private static readonly HashSet<string> AllExternalLinks = new HashSet<string>();
        private static readonly HashSet<string> AllInternalLinks = new HashSet<string>();

        public static async Task GetAllExternalLinksAsync(string siteUrl)
        {
            try
            {
                var htmlSource = await HttpClient.GetStringAsync(siteUrl);
                var internalLinks = await GetInternalLinksAsync(htmlSource, SplitAddress(siteUrl)[0]);
                var externalLinks = await GetExternalLinksAsync(htmlSource, SplitAddress(siteUrl)[0]);
                foreach (var link in externalLinks)
                {
                    if (!AllExternalLinks.Contains(link))
                    {
                        AllExternalLinks.Add(link);
                        Console.WriteLine(link);
                    }
                }

                foreach (var link in internalLinks)
                {
                    if (!AllInternalLinks.Contains(link))
                    {
                        Console.WriteLine($"The link is: {link}");
                        AllInternalLinks.Add(link);
                        await GetAllExternalLinksAsync(link);
                    }
                }
            }
            catch (HttpRequestException e)
            {
                Console.WriteLine(e);
                Console.WriteLine($"Request error: {e.Message}");
            }
        }
    }
}
```
The program has a bug; feel free to fix it yourself (one guess at the cause is sketched below)......
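I can't be certain which bug the author means, but one likely suspect is that GetAllExternalLinksAsync recursively passes collected internal links such as /wiki/... straight to GetStringAsync, which needs an absolute URL (another candidate is that the internal-links collection is modified while it is being iterated). Below is a minimal sketch of handling the relative-URL case; the helper name is my own invention.

```csharp
// Hypothetical helper: turn a possibly relative href (e.g. "/wiki/...") into an absolute URL
// before handing it to HttpClient.GetStringAsync. `baseUrl` must itself be absolute.
private static string ResolveUrl(string baseUrl, string href)
{
    if (href.StartsWith("http://") || href.StartsWith("https://"))
    {
        return href; // already absolute, use as-is
    }
    return new Uri(new Uri(baseUrl), href).ToString();
}

// Possible usage inside GetAllExternalLinksAsync (sketch):
//     await GetAllExternalLinksAsync(ResolveUrl(siteUrl, link));
```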
That's it for part one.... The main library used here is AngleSharp. AngleSharp can do far more than what is shown; it is very powerful, so please see its documentation for details.
Since the next part of the book uses Python's Scrapy, in the next article I will probably switch to DotNetSpider, a crawler library developed in China....
The project code is at: