excel vba 网页数据抓取(优点是速度快,灵活,可以实现GET、POST、Header等诸多细节)
优采云 发布时间: 2021-09-13 21:06excel vba 网页数据抓取(优点是速度快,灵活,可以实现GET、POST、Header等诸多细节)
优点是速度快,灵活,可以实现GET、POST、Header、Cookie等很多细节。缺点是比Webbrowser麻烦一点,调试不直观。为了方便引用网页中的信息,不妨将XMLhttp的responsetext放入一个HTMLfile对象中,这样就可以像Webbrowser一样进行检索了。 XMLhttp的用法和Webbrowser类似,最简单的HTTP GET代码:
Dim oHTTP, oHTML as Object
Set oHTTP = CreateObject("msxml2.xmlhttp.6.0")
Set oHTML = CreateObject("HTMLfile")
strURL = "http://foo.com/search.do?keyword=" & strKeyword
'True是异步模式,访问后需要写个循环等它完成,异步的好处后面说。改成False就是同步模式,Send后不用专门等它
oHTTP.Open "get", strURL, True
oHTTP.Send
Do While oHTTP.ReadyState 4
DoEvents
Loop
'用返回的文本建立一个HTML文档便于查找数据
oHTML.body.innerhtml = oHTTP.responsetext
Thisworkbook.Sheets("Sheet1").Cells(nRow, nColumn) = oHTML.getElementById("Name").Value
Set oHTTP = Nothing
Set oHTML = Nothing
2.解决登录问题
模拟登录的大概代码如下,一目了然:
oIE.Navigate "http://foo.com/login.do"
Do While oIE.Busy Or (oIE.ReadyState 4)
DoEvents
Loop
oIE.Document.getElementById("userName").Value = userName
oIE.Document.getElementById("userPassword").Value = passWord
oIE.Document.getElementById("submitBtn").Click
Do While oIE.Busy Or (oIE.ReadyState 4) Or oIE.LocationURL "http://foo.com/mainindex.do?method=login&status=1"
DoEvents
Loop
主要内容如下:
3. 使用异步加速
等待网页一个个返回太慢了,所以我们不是一个个同步发送,一个个等待,而是使用异步,一次发送一批请求,统一等待。初衷当然很好,但是VBA不支持多线程,所以这里的速度提升比较有限。一次发送20个请求只能提高2倍左右的速度。再多好像也没用。 nThread 值的选择很大程度上取决于网站 爬取的速度。建议多试几次再决定。
'一共nThread个请求
For i = 1 To nThread
Set oHTTP(i) = CreateObject("msxml2.xmlhttp.6.0")
Set oHTML(i) = CreateObject("HTMLfile")
Next i
For m = 2 To Application.WorksheetFunction.CountA(ThisWorkbook.Sheets("Sheet1").Range("A:A")) - 1 Step nThread
For i = 1 To nThread
URL(i) = "http://foo.com/search.do&keyword=" & ThisWorkbook.Sheets("Sheet1").Cells(m + i - 1, 1).Value
oHTTP(i).Open "get", URL(i), True
oHTTP(i).Send
errflag(i) = False
Next i
'发送后一起等待
For i = 1 To nThread
Do While oHTTP(i).ReadyState 4
DoEvents
Loop
Next i
For i = 1 To nThread
oHTML(i).body.innerhtml = oHTTP(i).responsetext
'简单的出错处理
If InStr(1, oHTML(i).body.outerhtml, "Error") 0 Then
errflag(i) = True
If target(i) "" Then
ThisWorkbook.Sheets("sheet1").Cells(m + i - 1, 2).Value = "Error"
End If
Else
errflag(i) = False
ThisWorkbook.Sheets("sheet1").Cells(m + i - 1, 2).Value = oHTML(i).getElementById("Name").Value
End If
Next i
Next m
4. 看似不可能的多线程实现
可能很多人都告诉过你,VBA 不支持多线程。是的,它不支持,使用API 极其麻烦且不稳定。但是,Windows 操作系统支持多线程,我们用它来绕过 VBA 的限制。不仅有方法,还有三种。
4.1 使用 VBScript 添加应用程序
保存n份收录宏的工作簿,生成n个VBScript脚本文件,每个脚本使用Excel.Application对象打开一个工作簿,在每个工作簿中运行VBA爬虫,将爬取的结果统一写入返回主要的Excel。这种方法有两个优点:一是使用字符串的VBScript代码比较简洁,二是各个线程都可以方便地使用Webbrowser控件登录。缺点是打开一批Excel,系统负担较重。
For nWorker = 1 To cmbWorkers.Value 'cmbWorkers复合框保存了总线程数
'保存当前工作簿的拷贝
WorkerFileName = ThisWorkbook.Path & "\~Worker_" & cmbWorkers.Text & "_" & nWorker & ".xlsx"
Call ThisWorkbook.SaveCopyAs(WorkerFileName)
'写VBS脚本。脚本中调用每个工作簿拷贝里面的宏searchWorker,用参数做好线程之间的分工和数据传递
s = "Set objExcel = CreateObject(""Excel.Application"")" & vbCrLf
s = s & "Set objWorkbook = objExcel.Workbooks.Open(""" & WorkerFileName & """)" & vbCrLf
s = s & "objExcel.Application.Visible = False" & vbCrLf
s = s & "objExcel.Application.Run ""~Worker_" & cmbWorkers.Value & "_" & nWorker & ".xlsx!searchWorker"" ," & nWorker & "," & cmbWorkers.Text & ",""" & ThisWorkbook.Name & """," & txtStart.Text & ",""" & txtUserName.Text & """,""" & txtPassword.Text & """" & vbCrLf
s = s & "objExcel.ActiveWorkbook.Close" & vbCrLf
s = s & "objExcel.Application.Quit" & vbCrLf
s = s & "Set objExcel = Nothing" & vbCrLf
'保存VBS脚本文件
scriptFileName = ThisWorkbook.Path & "\~Worker_" & cmbWorkers.Text & "_" & nWorker & ".vbs"
Open scriptFileName For Output As #1
Print #1, s
Close #1
'异步执行VBS脚本
Set wsh = VBA.CreateObject("WScript.Shell")
wsh.Run """" & scriptFileName & """"
Set wsh = Nothing
Next nWorker
在searchWorker进程中创建了一个Excel对象,爬取到的数据通过工作簿名workbookName写回到原来的工作簿中。 SearchWorker 代码示例:
Const CThread = 20 '同时发送请求数
Public Sub searchWorker(nWorker As Integer, maxWorkers As Integer, workbookName As String, nRowStart As Long, userName As String, passWord As String)
Application.ScreenUpdating = False
Application.EnableEvents = False
If userName = "" Or passWord = "" Then
MsgBox "Login information required."
Exit Sub
End If
'利用Webbrowser登录
fmUI.oIE.Navigate "http://foo.com/login.do"
Do While fmUI.oIE.Busy Or (fmUI.oIE.ReadyState 4)
DoEvents
Loop
fmUI.oIE.Document.getElementById("userName").Value = userName
fmUI.oIE.Document.getElementById("userPassword").Value = passWord
fmUI.oIE.Document.getElementById("submitBtn").Click
Do While fmUI.oIE.Busy Or (fmUI.oIE.ReadyState 4) Or fmUI.oIE.LocationURL "http://foo.com/mainindex.do?method=login&status=1"
DoEvents
Loop
Dim oXL As Object
Set oXL = GetObject(, "Excel.Application")
Dim target(1 To CThread) As String '查询目标
Dim URL(1 To CThread) As String 'url
Dim errflag(1 To CThread) As Boolean '错误标识
Dim oHTTP(1 To CThread) As Object 'xmlhttp
Dim oHTML(1 To CThread) As Object 'html文档对象
nThread = CThread
n = Application.WorksheetFunction.CountA(ThisWorkbook.Sheets("Sheet1").Range("A:A")) - 1
errmsg = "错误信息"
For i = 1 To nThread
Set oHTTP(i) = CreateObject("msxml2.xmlhttp.6.0")
Set oHTML(i) = CreateObject("htmlfile")
Next i
For m = nRowStart To n Step nThread * maxWorkers
For i = 1 To nThread
target(i) = ThisWorkbook.Sheets("sheet1").Cells(m + (i - 1) * maxWorkers + (nWorker - 1), 1).Value
URL(i) = "http://foo.com/search.do&keyword=" & target(i)
oHTTP(i).Open "get", URL(i), True
oHTTP(i).Send
errflag(i) = False
Next i
For i = 1 To nThread
Do While oHTTP(i).ReadyState 4
DoEvents
Loop
Next i
For i = 1 To nThread
oHTML(i).body.innerhtml = oHTTP(i).responsetext
If InStr(1, oHTML(i).body.outerhtml, errmsg) 0 Then
errflag(i) = True
msg = "错误"
If target(i) "" Then
oXL.Workbooks(workbookName).Sheets("sheet1").Cells(m + (i - 1) * maxWorkers + (nWorker - 1), j).Value = msg
End If
Else
errflag(i) = False
oXL.Workbooks(workbookName).Sheets("sheet1").Cells(m + (i - 1) * maxWorkers + (nWorker - 1), 2).Value = Trim(oHTML(i).getElementsByTagName("td")(5).innertext)
End If
Next i
Next m
Set oXL = Nothing
For i = 1 To nThread
Set oHTTP(i) = Nothing
Set oHTML(i) = Nothing
Next i
End Sub
4.2 只用VBScript实现多线程
通过上一节的例子,很容易构建一个合适的VBScript文件,直接在文件中抓取数据,我就不放代码了。与添加VBScript和Application的方法相比,只用VBScript拼字符串比较麻烦,但是程序执行起来很轻量级,所以如果你要抓取的网站没有复杂的登录过程,而且你不怕代码麻烦,那你可以考虑使用VBScript。例子可以在这里找到,代码很乱而且很长:Multi-threaded VBA
4.3 使用ActiveX EXE实现多线程
这是前辈写的。优点是资源消耗适中,缺点是需要Visual Basic环境,实现起来比较复杂。参见:VBA异步多线程网页捕捉教程-excelhome
总结
我个人推荐VBScript+Application的多线程方案,它的通用性更强,现在电脑不太在意占用更多内存。与本文前面使用XMLhttp批量异步发送的方法相比,VBS+Application方案通过创建8个线程可以提速5倍左右,非常高效。测试电脑为i7台式机,4核8线程,8G内存。爬行时,每个WPS ET线程大概占用不到100M内存,机器可以承受。
做爬虫可能会遇到很多问题,比如翻页、动态网页、json解析、保存附件等,有时候为了避免被网站屏蔽,会加一些延时。具体问题只能在爬行过程中单独解决。祝大家好运。
以上。
修复做不到的老狼