Scraping web page data with C# (several ways to fetch web page content)

优采云, published: 2022-03-06 01:03

There are many ways to fetch web page data. This article covers three of them: WebClient, the WebBrowser control, and HttpWebRequest/HttpWebResponse.

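Each of these returns the full page source, markup included. If you only need certain pieces of data, you have to write your own logic to pick them out and discard the rest. The usual approach is to filter out the content you need based on the structure of the page source.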
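As a rough illustration of filtering content out of the page source, here is a minimal sketch that uses regular expressions. The class name `HtmlFilter` and its two methods are hypothetical names introduced for this example, not part of the original article, and regular expressions only cope with simple, predictable markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for this sketch; not from the original article.
public static class HtmlFilter
{
    // Returns the trimmed text of the first <title> element, or "" if there is none.
    public static string ExtractTitle(string html)
    {
        Match m = Regex.Match(html, @"<title[^>]*>(?<title>.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups["title"].Value.Trim() : string.Empty;
    }

    // Returns the href value of every <a> tag whose href attribute is quoted.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        MatchCollection matches = Regex.Matches(html,
            @"<a\b[^>]*?\bhref\s*=\s*[""'](?<url>[^""']*)[""']",
            RegexOptions.IgnoreCase);
        foreach (Match m in matches)
        {
            links.Add(m.Groups["url"].Value);
        }
        return links;
    }
}
```

Calling `HtmlFilter.ExtractTitle(pageHtml)` on a downloaded page would return its title text. For anything beyond simple patterns, a real HTML parser is likely a better fit than regular expressions.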

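1. Fetching page content with WebClient

This is a very simple approach, though the others are simple too. One thing to note first: if efficiency matters in a real project, consider allocating a memory buffer inside the function and reading the response in chunks. Roughly like this: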

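    // MemoryStream is a stream whose backing store is memory.
    // Requires: using System.IO; using System.Text; using System.Text.RegularExpressions;
    // "reader" is the response stream to read from.
    static string ReadHtml(Stream reader)
    {
        byte[] buffer = new byte[1024];
        using (MemoryStream memory = new MemoryStream())
        {
            int count, sum = 0;
            // Read at most 100 KB, stopping early at the end of the stream.
            while (sum < 100 * 1024 && (count = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                memory.Write(buffer, 0, count);
                sum += count;
            }
            byte[] data = memory.ToArray();
            // Pages are usually encoded as utf-8 or gb2312; try gb2312 first.
            // (On .NET Core / .NET 5+, gb2312 is only available after registering CodePagesEncodingProvider.)
            string html = Encoding.GetEncoding("gb2312").GetString(data);
            if (string.IsNullOrEmpty(html))
            {
                return html;
            }
            // Look for the charset that the page itself declares.
            Regex re = new Regex(@"charset=[""']?(?<charset>[\w-]+)");
            Match m = re.Match(html.ToLower());
            string encoding = m.Groups["charset"].Value;
            if (string.IsNullOrEmpty(encoding) || string.Equals(encoding, "gb2312"))
            {
                return html;
            }
            // Otherwise decode the same bytes again with the declared charset.
            return Encoding.GetEncoding(encoding).GetString(data);
        }
    }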

Now to the main topic. The code for fetching page data with WebClient is as follows:

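    // using System; using System.IO; using System.Net; using System.Text;
    try
    {
        WebClient webClient = new WebClient();
        // Gets or sets the network credentials used to authenticate requests to Internet resources.
        webClient.Credentials = CredentialCache.DefaultCredentials;
        byte[] pageData = webClient.DownloadData("http://www.360doc.com/content/11/0427/03/1947337_112596569.shtml");
        // If the page is encoded as GB2312, use this line instead. On .NET Framework,
        // Encoding.Default is the system ANSI code page, which is GB2312 only on a
        // Simplified Chinese Windows installation.
        //string pageHtml = Encoding.Default.GetString(pageData);
        // If the page is encoded as UTF-8, use this line.
        string pageHtml = Encoding.UTF8.GetString(pageData);
        // Write the fetched content to a text file.
        using (StreamWriter sw = new StreamWriter("e:\\ouput.txt"))
        {
            sw.Write(pageHtml);
        }
    }
    catch (WebException webEx)
    {
        Console.WriteLine(webEx.Message);
    }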
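Rather than hard-coding UTF-8 or GB2312, the encoding can sometimes be taken from the Content-Type response header. The following is a minimal sketch under that assumption. `ContentTypeHelper` and `EncodingFromContentType` are hypothetical names introduced here, and many servers omit the charset, so a fallback encoding is still needed.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for this sketch; not from the original article.
public static class ContentTypeHelper
{
    // Parses "charset=..." out of a Content-Type header value.
    // Returns the fallback when the header is missing, has no charset,
    // or names an encoding this runtime does not know.
    public static Encoding EncodingFromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return fallback;
        }
        Match m = Regex.Match(contentType,
            @"charset\s*=\s*[""']?(?<charset>[\w-]+)", RegexOptions.IgnoreCase);
        if (!m.Success)
        {
            return fallback;
        }
        try
        {
            return Encoding.GetEncoding(m.Groups["charset"].Value);
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name.
            return fallback;
        }
    }
}
```

After a call such as `DownloadData`, the header value should be available as `webClient.ResponseHeaders["Content-Type"]`, so the page could be decoded with `ContentTypeHelper.EncodingFromContentType(header, Encoding.UTF8).GetString(pageData)`.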

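2. Fetching page content with the WebBrowser control

Relatively speaking, this is the simplest approach. Drag a WebBrowser control onto the form and pair it with the following code: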

    // using System.IO; using System.Windows.Forms;
    WebBrowser web = new WebBrowser();
    // Subscribe before navigating so the completion event cannot be missed.
    web.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(web_DocumentCompleted);
    web.Navigate("http://www.163.com");

    // Runs once the document has finished loading.
    void web_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        WebBrowser web = (WebBrowser)sender;
        // Collect every <table> element and append its text to a file.
        HtmlElementCollection ElementCollection = web.Document.GetElementsByTagName("Table");
        foreach (HtmlElement item in ElementCollection)
        {
            File.AppendAllText("Kaijiang_xj.txt", item.InnerText);
        }
    }
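The introduction also names HttpWebRequest/HttpWebResponse as a third approach. A minimal sketch of it might look like the following. The class name `HttpFetcher`, the method `Fetch`, and the 10-second timeout are assumptions made for this example. On .NET 6 and later these types are marked obsolete in favor of HttpClient, although they still work.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper for this sketch; not from the original article.
public static class HttpFetcher
{
    // Sends a GET request to the URL and returns the response body
    // decoded with the given encoding.
    public static string Fetch(string url, Encoding encoding)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.Timeout = 10000; // milliseconds; an arbitrary value chosen for this sketch

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A call such as `HttpFetcher.Fetch("http://www.163.com", Encoding.UTF8)` would return the page source as a string. The encoding passed in is the caller's guess, and the call should be wrapped in a `try`/`catch (WebException)` block as in the WebClient example.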
