Scraping web page data in C# (one of several ways to get the current page content from the IE browser)

优采云 Published: 2021-10-02 11:36

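  Getting the current page content from the IE browser in C#

  Updated: 2021-06-24 09:52:49 Author: 迈克尔·大卫

  There may be many ways to get the current page content from the IE browser; this article introduces one of them. The basic idea: when the mouse clicks on the current IE page, read the mouse coordinates, use them to find the window handle of the page under the cursor, and then call Win32 functions with that handle to obtain the page content. Interested readers can use this article as a reference.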

  

private void timer1_Tick(object sender, EventArgs e)
{
    lock (currentLock)
    {
        System.Drawing.Point MousePoint = System.Windows.Forms.Form.MousePosition;
        if (_leftClick)
        {
            // A click was detected: stop polling and resolve the window under the cursor.
            timer1.Stop();
            _leftClick = false;
            _lastDocument = GetHTMLDocumentFormHwnd(GetPointControl(MousePoint, false));
            if (_lastDocument != null)
            {
                // _getDocument guards against re-entry while a page is being read.
                // The original tested "if (_getDocument)", which could never run this block.
                if (!_getDocument)
                {
                    _getDocument = true;
                    try
                    {
                        string url = _lastDocument.url;
                        string html = _lastDocument.documentElement.outerHTML;
                        string cookie = _lastDocument.cookie;
                        string domain = _lastDocument.domain;
                        var resolveParams = new ResolveParam
                        {
                            Url = new Uri(url),
                            Html = html,
                            PageCookie = cookie,
                            Domain = domain
                        };
                        RequetResove(resolveParams);
                    }
                    catch (Exception ex)
                    {
                        System.Windows.MessageBox.Show(ex.Message);
                        Console.WriteLine(ex.Message);
                        Console.WriteLine(ex.StackTrace);
                    }
                }
            }
            else
            {
                new MessageTip().Show("xx", "The current page is not an IE browser page, or it is shown in a non-IE-engine browser such as Firefox or Sogou. Please open the page in IE.");
            }
            _getDocument = false;
        }
        else
        {
            // No click yet: keep the pointer form next to the cursor.
            _pointFrm.Left = MousePoint.X + 10;
            _pointFrm.Top = MousePoint.Y + 10;
        }
    }
}
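The handler passes its results to a `ResolveParam` object whose definition the article does not show. The following is only a minimal sketch of what such a type might look like, inferred from the four properties set in the object initializer; the real class may carry more members or different types.

```csharp
using System;

// Hypothetical shape of ResolveParam, inferred only from the object
// initializer in timer1_Tick. This is not the author's actual class.
public class ResolveParam
{
    public Uri Url { get; set; }            // address of the captured page
    public string Html { get; set; }        // outerHTML of the document element
    public string PageCookie { get; set; }  // document.cookie at capture time
    public string Domain { get; set; }      // document.domain at capture time
}
```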

  Breaking down the call GetHTMLDocumentFormHwnd(GetPointControl(MousePoint, false)) from the handler above: first, get the page's window handle from the mouse coordinates:

  

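public static IntPtr GetPointControl(System.Drawing.Point p, bool allControl)
{
    // Top-level (or deepest enabled) window under the screen point.
    IntPtr handle = Win32APIsFull.WindowFromPoint(p);
    if (handle != IntPtr.Zero)
    {
        System.Drawing.Rectangle rect = default(System.Drawing.Rectangle);
        if (Win32APIsFull.GetWindowRect(handle, out rect))
        {
            // Convert the screen point to an offset from the window's top-left corner.
            // Note: ChildWindowFromPointEx expects client coordinates, so this offset
            // can be slightly off for windows that have a title bar or borders.
            return Win32APIsFull.ChildWindowFromPointEx(handle, new System.Drawing.Point(p.X - rect.X, p.Y - rect.Y), allControl ? Win32APIsFull.CWP.ALL : Win32APIsFull.CWP.SKIPINVISIBLE);
        }
    }
    return IntPtr.Zero;
}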
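`GetPointControl` relies on three user32.dll functions wrapped in a `Win32APIsFull` class that the article does not list. A possible set of P/Invoke declarations is sketched below. The class layout and the `CWP` enum names are assumptions reconstructed from the call sites; only the native function names and flag values come from the Windows SDK.

```csharp
using System;
using System.Drawing;
using System.Runtime.InteropServices;

// Hypothetical reconstruction of the parts of Win32APIsFull used by
// GetPointControl. This is not the author's actual class.
public static class Win32APIsFull
{
    // Flags for ChildWindowFromPointEx (CWP_* values from the Windows SDK).
    [Flags]
    public enum CWP : uint
    {
        ALL = 0x0000,
        SKIPINVISIBLE = 0x0001,
        SKIPDISABLED = 0x0002,
        SKIPTRANSPARENT = 0x0004
    }

    // POINT is passed by value; System.Drawing.Point has the same two-int layout.
    [DllImport("user32.dll")]
    public static extern IntPtr WindowFromPoint(Point p);

    // The native RECT is (left, top, right, bottom). When it is marshaled into a
    // System.Drawing.Rectangle, as the article's call site does, only X and Y are
    // meaningful: Width and Height receive the right and bottom edges instead.
    [DllImport("user32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    public static extern bool GetWindowRect(IntPtr hWnd, out Rectangle lpRect);

    // pt is expected in client coordinates of hwndParent.
    [DllImport("user32.dll")]
    public static extern IntPtr ChildWindowFromPointEx(IntPtr hwndParent, Point pt, CWP uFlags);
}
```

If the width or height of the window were needed, a dedicated struct with `Left`, `Top`, `Right` and `Bottom` fields would be the safer choice than `Rectangle`.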

  Next, get the page content from the handle:

  

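public static HTMLDocument GetHTMLDocumentFormHwnd(IntPtr hwnd)
{
    // SendMessageTimeout writes a pointer-sized result, so allocate IntPtr.Size bytes
    // (4 bytes would be overrun in a 64-bit process) and zero them in case the call fails.
    IntPtr result = Marshal.AllocHGlobal(IntPtr.Size);
    Marshal.WriteIntPtr(result, IntPtr.Zero);
    Object obj = null;
    try
    {
        // fuFlags 2 = SMTO_ABORTIFHUNG; wait at most 1000 ms for the reply.
        Console.WriteLine(Win32APIsFull.SendMessageTimeoutA(hwnd, HTML_GETOBJECT_mid, 0, 0, 2, 1000, result));
        // ObjectFromLresult is declared with an int lResult, so only the low 32 bits are read.
        int lresult = Marshal.ReadInt32(result);
        if (lresult != 0)
        {
            Console.WriteLine(Win32APIsFull.ObjectFromLresult(lresult, ref IID_IHTMLDocument, 0, out obj));
        }
    }
    finally
    {
        // Always release the unmanaged buffer, even if a call throws.
        Marshal.FreeHGlobal(result);
    }
    return obj as HTMLDocument;
}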

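  How it works:

  Send a message to the IE window to obtain a reference to an (unmanaged) object inside the IE browser process, then use that reference to obtain the HTMLDocument object.

  This method involves two Win32 functions: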

  

// Declared with 32-bit wParam, lParam and return types, as in the original;
// these sizes match the native signature only in a 32-bit (x86) process.
[System.Runtime.InteropServices.DllImportAttribute("user32.dll", EntryPoint = "SendMessageTimeoutA")]
public static extern int SendMessageTimeoutA(
    [InAttribute()] System.IntPtr hWnd,
    uint Msg, uint wParam, int lParam,
    uint fuFlags,
    uint uTimeout,
    System.IntPtr lpdwResult);

// Returns an HRESULT; on success pObject holds the requested COM interface.
[System.Runtime.InteropServices.DllImportAttribute("oleacc.dll", EntryPoint = "ObjectFromLresult")]
public static extern int ObjectFromLresult(
    int lResult,
    ref Guid riid,
    int wParam,
    [MarshalAs(UnmanagedType.IDispatch), Out]
    out Object pObject
);
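The code above uses `HTML_GETOBJECT_mid` and `IID_IHTMLDocument` without showing where they come from. A common approach is to register the `WM_HTML_GETOBJECT` window message and pass the IID of `IHTMLDocument`. The sketch below assumes that is what the author did, and it also uses pointer-sized parameter types so the same code works in 32-bit and 64-bit processes. The class name `IeDocumentInterop` and the method name `GetDocumentObject` are hypothetical, not taken from the article.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper; the names are illustrative and not from the article.
public static class IeDocumentInterop
{
    // IID of the IHTMLDocument interface, as published in the MSHTML headers.
    private static readonly Guid IID_IHTMLDocument =
        new Guid("626FC520-A41E-11CF-A731-00A0C9082637");

    private const uint SMTO_ABORTIFHUNG = 0x0002;

    [DllImport("user32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern uint RegisterWindowMessage(string lpString);

    [DllImport("user32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern IntPtr SendMessageTimeout(
        IntPtr hWnd, uint Msg, IntPtr wParam, IntPtr lParam,
        uint fuFlags, uint uTimeout, out IntPtr lpdwResult);

    [DllImport("oleacc.dll")]
    private static extern int ObjectFromLresult(
        IntPtr lResult, ref Guid riid, IntPtr wParam,
        [MarshalAs(UnmanagedType.IDispatch)] out object ppvObject);

    // Returns the document object hosted by hwnd, or null on failure.
    // The caller would cast the result to mshtml.HTMLDocument.
    public static object GetDocumentObject(IntPtr hwnd)
    {
        // Assumption: this is how the article's HTML_GETOBJECT_mid is obtained.
        uint msg = RegisterWindowMessage("WM_HTML_GETOBJECT");
        if (msg == 0) return null;

        IntPtr lres;
        IntPtr sent = SendMessageTimeout(hwnd, msg, IntPtr.Zero, IntPtr.Zero,
                                         SMTO_ABORTIFHUNG, 1000, out lres);
        if (sent == IntPtr.Zero || lres == IntPtr.Zero) return null;

        Guid iid = IID_IHTMLDocument;   // local copy, because ref needs a variable
        object doc;
        int hr = ObjectFromLresult(lres, ref iid, IntPtr.Zero, out doc);
        return hr >= 0 ? doc : null;    // a negative HRESULT means failure
    }
}
```

The message is generally only answered by IE's rendering window (window class `Internet Explorer_Server`), so the handle passed in should be that child window rather than the browser frame.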

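  That covers the details of getting the current page content from the IE browser. For more on getting browser page content in C#, see other related articles.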
