Scraping web page data (when I scrape a page I can't get its complete original source, i.e. what right-click > View Source shows)
优采云 Published: 2021-10-07 15:09
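When I scrape a web page, I can't get its complete original source. I searched online for a long time and found no workable approach. I hope someone in the community can help. Thanks in advance!!
For example, the URL I want to extract is:
I want to get its original source code (that is, every character shown by right-click > View Source).
When I extract it with the following code: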
// Requires: using System; using System.IO; using System.IO.Compression; using System.Net; using System.Text;
private string GetHtmlCode(string url)
{
    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.Timeout = 30000;
    webRequest.Method = "GET";
    webRequest.UserAgent = "Mozilla/4.0";
    // Only gzip is decoded below, so only gzip is advertised to the server.
    webRequest.Headers.Add("Accept-Encoding", "gzip");

    using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
    using (Stream body = webResponse.GetResponseStream())
    {
        bool isGzip = string.Equals(webResponse.ContentEncoding, "gzip", StringComparison.OrdinalIgnoreCase);
        // Decompress the body when the server sent it gzip-compressed.
        Stream content = isGzip ? new GZipStream(body, CompressionMode.Decompress) : body;
        // Encoding.Default may not match the charset the page declares.
        using (StreamReader sr = new StreamReader(content, Encoding.Default))
        {
            return sr.ReadToEnd();
        }
    }
}
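The extracted data is incomplete: the code inside the iframe does not appear. I thought some of the data might be loaded via AJAX, so I switched to the following extraction code: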
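// Requires: using System; using System.Threading; using System.Windows.Forms;
private void button1_Click(object sender, EventArgs e)
{
    WebBrowser web = new WebBrowser();
    // Subscribe before navigating so the completion event cannot be missed.
    web.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(web_DocumentCompleted);
    web.Navigate(this.rtb_Url.Text);
    // Navigate returns immediately; pump messages until the page has finished loading.
    while (web.ReadyState != WebBrowserReadyState.Complete)
    {
        Application.DoEvents();
        Thread.Sleep(100);
    }
}

void web_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser web = (WebBrowser)sender;
    // Body.OuterHtml covers only the <body> element of the top-level document.
    string mystr = web.Document.Body.OuterHtml;
}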
The source extracted this way is always incomplete. I hope fellow members can give me some advice. Thanks!
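One possible reading of the symptom, offered as a sketch and not a confirmed diagnosis: an iframe's content is a separate document that the browser loads from the tag's src attribute, so the top-level HTML contains only the iframe tag itself. Under that assumption, the frame URLs can be pulled out of the fetched HTML and each one downloaded in turn. The class name FrameSourceFinder, the method ExtractFrameUrls, and the example.com URL in the usage note are hypothetical names introduced here for illustration. The regular expression is a simplification and will miss unquoted src values.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class FrameSourceFinder
{
    // Matches an <iframe ...> or <frame ...> tag and captures a quoted src value.
    // "\s" before "src" avoids matching attributes such as data-src.
    private static readonly Regex FrameSrc = new Regex(
        @"<i?frame\b[^>]*?\ssrc\s*=\s*[""']([^""']*)[""']",
        RegexOptions.IgnoreCase);

    // Returns the absolute URL of every frame referenced in html,
    // resolving relative src values against the URL of the containing page.
    public static List<string> ExtractFrameUrls(string html, Uri pageUri)
    {
        var urls = new List<string>();
        foreach (Match m in FrameSrc.Matches(html))
        {
            // Attribute values are HTML-encoded (for example &amp; for &).
            string src = WebUtility.HtmlDecode(m.Groups[1].Value.Trim());
            Uri absolute;
            if (src.Length > 0 && Uri.TryCreate(pageUri, src, out absolute))
            {
                urls.Add(absolute.AbsoluteUri);
            }
        }
        return urls;
    }
}
```

Each returned URL could then be passed to GetHtmlCode, for example by looping over FrameSourceFinder.ExtractFrameUrls(html, new Uri("http://example.com/news/index.html")). Content that scripts generate after the page loads would still not show up this way.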