Web Data Scraping (Scraping Website Data Is No Longer Hard (Actually, It Makes You Want to Die!))

优采云 · Published: 2022-04-17 22:37

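Let's start with the title: why is scraping website data no longer hard (when in truth it is quite hard)? SO EASY!!! Fizzler makes all of this possible. I believe most people and companies have some experience scraping data from other people's websites. For example, every article we publish on cnblogs (博客园) gets scraped by other sites; look around and you will see for yourself. Some people also harvest email addresses, phone numbers, QQ numbers, and other useful details from websites. That information can certainly be sold or put to other uses, which is why we all receive spam text messages and emails every day. You know the feeling, O(∩_∩)O haha~.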

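Some time ago I wrote two programs: one collected data from a lottery site (双色球, the two-color ball lottery), and the other collected data from job sites (Liepin, 武城武城, Zhaopin, and so on). Writing them proved especially tricky; staring at piles of HTML tags made me want to die. First, let's review how I used to parse HTML, which is a very common approach: fetch the HTML with WebRequest, then cut out the content you want step by step, using the HTML tags as markers. The code below extracts the red balls and the blue ball of the lottery. As soon as the site's markup changes even slightly, the program may have to be rewritten, which makes this approach very inconvenient.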

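Below is the code I used to parse the red and blue balls. Mostly it extracts the content of specific tags with regular expressions. The code may not look very complicated, but only because the data being extracted is limited and follows a regular pattern, so the rules are relatively simple.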

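    #region * Parse the TDs within one TR to get the numbers of one draw
    /// <summary>
    /// Parse the TDs within one TR to get the numbers of one draw
    /// </summary>
    /// <param name="wn"></param>
    /// <param name="trContent"></param>
    private void ResolveTd(ref WinNo wn, string trContent)
    {
        List redBoxList = null; // element type lost in the source listing
        // Regular expression that matches the draw (issue) number
        string patternQiHao = "...
        // ... (the pattern and the remainder of this method are cut off in the source)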
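To make the regex-style approach described above concrete, here is a minimal, self-contained sketch. It is not the original `ResolveTd` body: the `ExtractCells` helper, the pattern, and the sample row are all assumptions made up for illustration, and the real lottery page's markup is not known from this article.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexScrapeSketch
{
    // Returns the text content of every <td> found in one <tr> fragment.
    public static List<string> ExtractCells(string trContent)
    {
        var cells = new List<string>();
        // Hypothetical pattern: match <td ...>content</td> and capture the content.
        MatchCollection matches = Regex.Matches(
            trContent,
            @"<td[^>]*>(.*?)</td>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        foreach (Match m in matches)
        {
            // Strip any nested tags from the captured content, then trim whitespace.
            string text = Regex.Replace(m.Groups[1].Value, "<[^>]+>", "").Trim();
            cells.Add(text);
        }
        return cells;
    }

    public static void Main()
    {
        // Made-up sample row: an issue number followed by ball numbers.
        string tr = "<tr><td>2014001</td><td class=\"red\">05</td>"
                  + "<td class=\"red\">12</td><td class=\"blue\"><b>09</b></td></tr>";
        foreach (string cell in ExtractCells(tr))
        {
            Console.WriteLine(cell);
        }
    }
}
```

Even in this small form the fragility is visible: the pattern silently depends on how the site happens to write its `<td>` tags, so a markup change on the site usually means editing and retesting the expression.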

            // ... (the start of this second, Fizzler-based listing is cut off in the source; it resumes here)
            if (NodesMainContent1.Any())
            {
                info.Position = NodesMainContent1.First().InnerText;
            }
            //-- Company name
            IEnumerable<HtmlNode> NodesMainContent2 = AnalyzeHTML.GetHtmlInfo(html, "div.title-info h3");
            if (NodesMainContent2.Any())
            {
                info.Company = NodesMainContent2.First().InnerText;
            }
            //-- Company type / company size
            IEnumerable<HtmlNode> NodesMainContent4 = AnalyzeHTML.GetHtmlInfo(html, "div.content.content-word ul li");
            if (NodesMainContent4.Any())
            {
                foreach (var item in NodesMainContent4)
                {
                    if (item.InnerHtml.Contains("企业性质")) // "company type" label on the page
                    {
                        string nature = item.InnerText;
                        nature = nature.Replace("企业性质：", "");
                        info.Nature = nature;
                    }
                    if (item.InnerHtml.Contains("企业规模")) // "company size" label on the page
                    {
                        string scale = item.InnerText;
                        scale = scale.Replace("企业规模：", "");
                        info.Scale = scale;
                    }
                }
            }
            else // Second attempt at parsing company type and size (alternative page layout)
            {
                IEnumerable<HtmlNode> NodesMainContent4_1 = AnalyzeHTML.GetHtmlInfo(html, "div.right-post-top div.content.content-word");
                if (NodesMainContent4_1.Any())
                {
                    foreach (var item_1 in NodesMainContent4_1)
                    {
                        // Split the block into non-empty lines
                        string[] arr = item_1.InnerText.Split("\r\n".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
                        if (arr != null && arr.Length > 0)
                        {
                            foreach (string str in arr)
                            {
                                if (str.Trim().Contains("性质")) // "type"
                                {
                                    info.Nature = str.Replace("性质：", "").Trim();
                                }
                                if (str.Trim().Contains("规模")) // "size"
                                {
                                    info.Scale = str.Replace("规模：", "").Trim();
                                }
                            }
                        }
                    }
                }
            }
            //-- Work experience
            IEnumerable<HtmlNode> NodesMainContent5 = AnalyzeHTML.GetHtmlInfo(html, "div.resume.clearfix span.noborder");
            if (NodesMainContent5.Any())
            {
                info.Experience = NodesMainContent5.First().InnerText;
            }
            //-- Company address / minimum education level
            IEnumerable<HtmlNode> NodesMainContent6 = AnalyzeHTML.GetHtmlInfo(html, "div.resume.clearfix");
            if (NodesMainContent6.Any())
            {
                foreach (var item in NodesMainContent6)
                {
                    // Remove all whitespace from the inner HTML
                    string label = Regex.Replace(item.InnerHtml, "\\s", "");
                    // NOTE: the HTML-tag string literals in the next two calls were lost from the source listing
                    label = label.Replace("", "");
                    string[] arr = label.Split("".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
                    if (arr != null && arr.Length > 2)
                    {
                        info.Address = arr[0];   // company address
                        info.Education = arr[1]; // minimum education level
                    }
                }
            }
            //-- Monthly salary
            IEnumerable<HtmlNode> NodesMainContent7 = AnalyzeHTML.GetHtmlInfo(html, "div.job-title-left p.job-main-title");
            if (NodesMainContent7.Any())
            {
                info.Salary = NodesMainContent7.First().InnerText;
            }
            //-- Publication time
            IEnumerable<HtmlNode> NodesMainContent8 = AnalyzeHTML.GetHtmlInfo(html, "div.job-title-left p.release-time em");
            if (NodesMainContent8.Any())
            {
                info.Time = NodesMainContent8.First().InnerText;
            }
            //-- Hand the parsed job to the GetJobEnd callback, if one is attached
            if (GetJobEnd != null)
            {
                GetJobEnd("", info);
            }
        }
        catch (Exception exMsg)
        {
            // Pass the original exception along as InnerException so the stack trace is not lost
            throw new Exception(exMsg.Message, exMsg);
        }
    }

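The method above also parses the tagged content of a job site, but the complicated regular expressions for cutting out HTML tags are gone, which makes the code leaner and simpler. Add a configuration page for the selectors and it can also cope with the scraped site changing its tags frequently. So it seems that scraping other people's website data is really quite simple after all, O(∩_∩)O haha~ isn't it!!!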
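The article never shows `AnalyzeHTML.GetHtmlInfo` itself or the configuration idea, so the following is only a sketch of how they might look. It assumes the HtmlAgilityPack and Fizzler.Systems.HtmlAgilityPack NuGet packages; the body of `GetHtmlInfo`, the `SelectorConfigSketch` class, the field-to-selector table, and the sample HTML are all hypothetical, with the two selectors borrowed from the listing above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Fizzler.Systems.HtmlAgilityPack; // provides the QuerySelectorAll extension method
using HtmlAgilityPack;

public static class AnalyzeHTML
{
    // Hypothetical reconstruction of the helper called in the listing:
    // load the HTML into a document, then run a CSS selector over it.
    public static IEnumerable<HtmlNode> GetHtmlInfo(string html, string cssSelector)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);
        return document.DocumentNode.QuerySelectorAll(cssSelector);
    }
}

public static class SelectorConfigSketch
{
    // Field name -> CSS selector. This table is the part that could be
    // loaded from a configuration file or page instead of being hard-coded.
    public static readonly Dictionary<string, string> Selectors =
        new Dictionary<string, string>
        {
            { "Company", "div.title-info h3" },
            { "Salary", "div.job-title-left p.job-main-title" },
        };

    // Returns the trimmed text of the first match for each configured field.
    public static Dictionary<string, string> Extract(string html)
    {
        var result = new Dictionary<string, string>();
        foreach (var entry in Selectors)
        {
            HtmlNode first = AnalyzeHTML.GetHtmlInfo(html, entry.Value).FirstOrDefault();
            if (first != null)
            {
                result[entry.Key] = first.InnerText.Trim();
            }
        }
        return result;
    }

    public static void Main()
    {
        // Made-up HTML shaped to match the two selectors above.
        string html =
            "<div class=\"title-info\"><h3>Example Co.</h3></div>" +
            "<div class=\"job-title-left\"><p class=\"job-main-title\">10k-15k</p></div>";
        foreach (var pair in Extract(html))
        {
            Console.WriteLine(pair.Key + ": " + pair.Value);
        }
    }
}
```

With this shape, a change in the site's markup would mean editing one selector string rather than rewriting parsing code. The sketch re-parses the HTML on every call to keep the helper's signature identical to the one used in the listing; parsing once and reusing the document would be the obvious optimization.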

The above is only my personal opinion!!!
