excel vba 网页数据抓取(优点是速度快,灵活,可以实现GET、POST、Header等诸多细节)

优采云 发布时间: 2021-09-13 21:06

  excel vba 网页数据抓取(优点是速度快,灵活,可以实现GET、POST、Header等诸多细节)

  优点是速度快,灵活,可以实现GET、POST、Header、Cookie等很多细节。缺点是比Webbrowser麻烦一点,调试不直观。为了方便引用网页中的信息,不妨将XMLhttp的responsetext放入一个HTMLfile对象中,这样就可以像Webbrowser一样进行检索了。 XMLhttp的用法和Webbrowser类似,最简单的HTTP GET代码:

  

Dim oHTTP, oHTML as Object

Set oHTTP = CreateObject("msxml2.xmlhttp.6.0")

Set oHTML = CreateObject("HTMLfile")

strURL = "http://foo.com/search.do?keyword=" & strKeyword

'True是异步模式,访问后需要写个循环等它完成,异步的好处后面说。改成False就是同步模式,Send后不用专门等它

oHTTP.Open "get", strURL, True

oHTTP.Send

Do While oHTTP.ReadyState 4

DoEvents

Loop

'用返回的文本建立一个HTML文档便于查找数据

oHTML.body.innerhtml = oHTTP.responsetext

Thisworkbook.Sheets("Sheet1").Cells(nRow, nColumn) = oHTML.getElementById("Name").Value

Set oHTTP = Nothing

Set oHTML = Nothing

  2.解决登录问题

  模拟登录的大概代码如下,一目了然:

  

oIE.Navigate "http://foo.com/login.do"

Do While oIE.Busy Or (oIE.ReadyState 4)

DoEvents

Loop

oIE.Document.getElementById("userName").Value = userName

oIE.Document.getElementById("userPassword").Value = passWord

oIE.Document.getElementById("submitBtn").Click

Do While oIE.Busy Or (oIE.ReadyState 4) Or oIE.LocationURL "http://foo.com/mainindex.do?method=login&status=1"

DoEvents

Loop

  主要内容如下:

  3. 使用异步加速

  等待网页一个个返回太慢了,所以我们不是一个个同步发送,一个个等待,而是使用异步,一次发送一批请求,统一等待。初衷当然很好,但是VBA不支持多线程,所以这里的速度提升比较有限。一次发送20个请求只能提高2倍左右的速度。再多好像也没用。 nThread 值的选择很大程度上取决于网站 爬取的速度。建议多试几次再决定。

  

'一共nThread个请求

For i = 1 To nThread

Set oHTTP(i) = CreateObject("msxml2.xmlhttp.6.0")

Set oHTML(i) = CreateObject("HTMLfile")

Next i

For m = 2 To Application.WorksheetFunction.CountA(ThisWorkbook.Sheets("Sheet1").Range("A:A")) - 1 Step nThread

For i = 1 To nThread

URL(i) = "http://foo.com/search.do&keyword=" & ThisWorkbook.Sheets("Sheet1").Cells(m + i - 1, 1).Value

oHTTP(i).Open "get", URL(i), True

oHTTP(i).Send

errflag(i) = False

Next i

'发送后一起等待

For i = 1 To nThread

Do While oHTTP(i).ReadyState 4

DoEvents

Loop

Next i

For i = 1 To nThread

oHTML(i).body.innerhtml = oHTTP(i).responsetext

'简单的出错处理

If InStr(1, oHTML(i).body.outerhtml, "Error") 0 Then

errflag(i) = True

If target(i) "" Then

ThisWorkbook.Sheets("sheet1").Cells(m + i - 1, 2).Value = "Error"

End If

Else

errflag(i) = False

ThisWorkbook.Sheets("sheet1").Cells(m + i - 1, 2).Value = oHTML(i).getElementById("Name").Value

End If

Next i

Next m

  4. 看似不可能的多线程实现

  可能很多人都告诉过你,VBA 不支持多线程。是的,它不支持,使用API​​ 极其麻烦且不稳定。但是,Windows 操作系统支持多线程,我们用它来绕过 VBA 的限制。不仅有方法,还有三种。

  4.1 使用 VBScript 添加应用程序

  保存n份收录宏的工作簿,生成n个VBScript脚本文件,每个脚本使用Excel.Application对象打开一个工作簿,在每个工作簿中运行VBA爬虫,将爬取的结果统一写入返回主要的Excel。这种方法有两个优点:一是使用字符串的VBScript代码比较简洁,二是各个线程都可以方便地使用Webbrowser控件登录。缺点是打开一批Excel,系统负担较重。

  

For nWorker = 1 To cmbWorkers.Value 'cmbWorkers复合框保存了总线程数

'保存当前工作簿的拷贝

WorkerFileName = ThisWorkbook.Path & "\~Worker_" & cmbWorkers.Text & "_" & nWorker & ".xlsx"

Call ThisWorkbook.SaveCopyAs(WorkerFileName)

'写VBS脚本。脚本中调用每个工作簿拷贝里面的宏searchWorker,用参数做好线程之间的分工和数据传递

s = "Set objExcel = CreateObject(""Excel.Application"")" & vbCrLf

s = s & "Set objWorkbook = objExcel.Workbooks.Open(""" & WorkerFileName & """)" & vbCrLf

s = s & "objExcel.Application.Visible = False" & vbCrLf

s = s & "objExcel.Application.Run ""~Worker_" & cmbWorkers.Value & "_" & nWorker & ".xlsx!searchWorker"" ," & nWorker & "," & cmbWorkers.Text & ",""" & ThisWorkbook.Name & """," & txtStart.Text & ",""" & txtUserName.Text & """,""" & txtPassword.Text & """" & vbCrLf

s = s & "objExcel.ActiveWorkbook.Close" & vbCrLf

s = s & "objExcel.Application.Quit" & vbCrLf

s = s & "Set objExcel = Nothing" & vbCrLf

'保存VBS脚本文件

scriptFileName = ThisWorkbook.Path & "\~Worker_" & cmbWorkers.Text & "_" & nWorker & ".vbs"

Open scriptFileName For Output As #1

Print #1, s

Close #1

'异步执行VBS脚本

Set wsh = VBA.CreateObject("WScript.Shell")

wsh.Run """" & scriptFileName & """"

Set wsh = Nothing

Next nWorker

  在searchWorker进程中创建了一个Excel对象,爬取到的数据通过工作簿名workbookName写回到原来的工作簿中。 SearchWorker 代码示例:

  

Const CThread = 20 '同时发送请求数

Public Sub searchWorker(nWorker As Integer, maxWorkers As Integer, workbookName As String, nRowStart As Long, userName As String, passWord As String)

Application.ScreenUpdating = False

Application.EnableEvents = False

If userName = "" Or passWord = "" Then

MsgBox "Login information required."

Exit Sub

End If

'利用Webbrowser登录

fmUI.oIE.Navigate "http://foo.com/login.do"

Do While fmUI.oIE.Busy Or (fmUI.oIE.ReadyState 4)

DoEvents

Loop

fmUI.oIE.Document.getElementById("userName").Value = userName

fmUI.oIE.Document.getElementById("userPassword").Value = passWord

fmUI.oIE.Document.getElementById("submitBtn").Click

Do While fmUI.oIE.Busy Or (fmUI.oIE.ReadyState 4) Or fmUI.oIE.LocationURL "http://foo.com/mainindex.do?method=login&status=1"

DoEvents

Loop

Dim oXL As Object

Set oXL = GetObject(, "Excel.Application")

Dim target(1 To CThread) As String '查询目标

Dim URL(1 To CThread) As String 'url

Dim errflag(1 To CThread) As Boolean '错误标识

Dim oHTTP(1 To CThread) As Object 'xmlhttp

Dim oHTML(1 To CThread) As Object 'html文档对象

nThread = CThread

n = Application.WorksheetFunction.CountA(ThisWorkbook.Sheets("Sheet1").Range("A:A")) - 1

errmsg = "错误信息"

For i = 1 To nThread

Set oHTTP(i) = CreateObject("msxml2.xmlhttp.6.0")

Set oHTML(i) = CreateObject("htmlfile")

Next i

For m = nRowStart To n Step nThread * maxWorkers

For i = 1 To nThread

target(i) = ThisWorkbook.Sheets("sheet1").Cells(m + (i - 1) * maxWorkers + (nWorker - 1), 1).Value

URL(i) = "http://foo.com/search.do&keyword=" & target(i)

oHTTP(i).Open "get", URL(i), True

oHTTP(i).Send

errflag(i) = False

Next i

For i = 1 To nThread

Do While oHTTP(i).ReadyState 4

DoEvents

Loop

Next i

For i = 1 To nThread

oHTML(i).body.innerhtml = oHTTP(i).responsetext

If InStr(1, oHTML(i).body.outerhtml, errmsg) 0 Then

errflag(i) = True

msg = "错误"

If target(i) "" Then

oXL.Workbooks(workbookName).Sheets("sheet1").Cells(m + (i - 1) * maxWorkers + (nWorker - 1), j).Value = msg

End If

Else

errflag(i) = False

oXL.Workbooks(workbookName).Sheets("sheet1").Cells(m + (i - 1) * maxWorkers + (nWorker - 1), 2).Value = Trim(oHTML(i).getElementsByTagName("td")(5).innertext)

End If

Next i

Next m

Set oXL = Nothing

For i = 1 To nThread

Set oHTTP(i) = Nothing

Set oHTML(i) = Nothing

Next i

End Sub

  4.2 只用VBScript实现多线程

  通过上一节的例子,很容易构建一个合适的VBScript文件,直接在文件中抓取数据,我就不放代码了。与添加VBScript和Application的方法相比,只用VBScript拼字符串比较麻烦,但是程序执行起来很轻量级,所以如果你要抓取的网站没有复杂的登录过程,而且你不怕代码麻烦,那你可以考虑使用VBScript。例子可以在这里找到,代码很乱而且很长:Multi-threaded VBA

  4.3 使用ActiveX EXE实现多线程

  这是前辈写的。优点是资源消耗适中,缺点是需要Visual Basic环境,实现起来比较复杂。参见:VBA异步多线程网页捕捉教程-excelhome

  总结

  我个人推荐VBScript+Application的多线程方案,它的通用性更强,现在电脑不太在意占用更多内存。与本文前面使用XMLhttp批量异步发送的方法相比,VBS+Application方案通过创建8个线程可以提速5倍左右,非常高效。测试电脑为i7台式机,4核8线程,8G内存。爬行时,每个WPS ET线程大概占用不到100M内存,机器可以承受。

  做爬虫可能会遇到很多问题,比如翻页、动态网页、json解析、保存附件等,有时候为了避免被网站屏蔽,会加一些延时。具体问题只能在爬行过程中单独解决。祝大家好运。

  以上。

  修复做不到的老狼

0 个评论

要回复文章请先登录注册


官方客服QQ群

微信人工客服

QQ人工客服


线