Scraping Dynamic Web Pages with Python (Crawlers: the Perfect Combination for Efficiently Handling Dynamic Web Pages)

优采云, published: 2022-04-06 06:13

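A crawler is a very effective way to quickly obtain the data we need, and its first step is to ask a remote server to return the web page we want. Normally, entering the correct Uniform Resource Locator (URL), that is, the page address, is enough to open the desired page in a browser. Likewise, when writing a Python crawler, we can call the appropriate library with suitable parameters to connect to the network and handle the HTTP protocol. For static pages, commonly used libraries include urllib, urllib2, and requests, which make it easy to request the content at a given address. However, for dynamic pages whose content is loaded by JavaScript, these methods often fail to return the result we want. In that case we can call on Selenium, a powerful automated testing tool, together with its companion PhantomJS, and tackle the problem with both.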

(1) urllib and urllib2. Straight to an example:

# Python 2 only: urllib2 was split into urllib.request and urllib.error in Python 3
import urllib2

response = urllib2.urlopen("http://www.baidu.com")
print response.read()

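Given just a URL, such as Baidu's, a single call to urllib2 is enough to read the source of the corresponding page, and the code is very concise. In a real crawler we also have to account for the target site's anti-crawling measures, network response time, and requests that must carry extra information, so a few more lines are needed. The aim is to make the server believe, as far as possible, that the request comes from a normal visitor. To keep the program logic clear, we can build a request object and pass it to urlopen, for example: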

import urllib
import urllib2

# Set the target URL
url = 'xxx'
request = urllib2.Request(url)

# To imitate a browser and avoid being identified as a crawler, add headers.
# The User-Agent below sets the identity of the request:
user_agent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
headers = {'User-Agent': user_agent}
# Pass headers by keyword: the second positional argument of Request is data
request = urllib2.Request(url, headers=headers)

# Some sites require extra information, such as a username and password:
values = {"username": "yourname", "password": "????"}
data = urllib.urlencode(values)
request = urllib2.Request(url, data, headers)

# On a slow network, timeout sets how many seconds to wait before giving up
response = urllib2.urlopen(request, timeout=18)
print response.read()

For more on this topic, see the urllib and urllib2 documentation.
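The urllib2 module exists only in Python 2. As a rough sketch of how the same steps might look on Python 3, the version below uses urllib.request and urllib.parse from the standard library. The names build_request and fetch, the USER_AGENT constant, and the example.com address in the usage note are assumptions made for illustration, not part of the original article.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Same browser-like identity string as in the example above
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
)


def build_request(url, form=None, user_agent=USER_AGENT):
    """Build a Request object; supplying form data turns it into a POST."""
    # In Python 3 the request body must be bytes, so encode the form explicitly
    data = urlencode(form).encode("utf-8") if form else None
    return Request(url, data=data, headers={"User-Agent": user_agent})


def fetch(url, form=None, timeout=18):
    """Send the request and return the response body as text."""
    with urlopen(build_request(url, form), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call such as fetch("http://example.com/") would then return the page source as a string. Building the request in its own function keeps the header and form handling separate from the network call, which makes it easy to inspect the request before sending it.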

(2) requests, a simple, elegant, and friendly third-party library.

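All of the functionality of requests can be accessed through the following 7 methods. Each of them returns an instance of the Response object.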

# Create and send a request
requests.request(method, url, **kwargs)

Parameters:

method -- method for the new Request object.

url -- URL for the new Request object.

params -- (optional) Dictionary or bytes to be sent in the query string for the Request.

data -- (optional) Dictionary, bytes, or file-like object to send in the body of the Request.

json -- (optional) json data to send in the body of the Request.

headers -- (optional) Dictionary of HTTP Headers to send with the Request.

cookies -- (optional) Dict or CookieJar object to send with the Request.

files -- (optional) Dictionary of 'name': file-like-objects (or {'name': file-tuple}) for multipart encoding upload. file-tuple can be a 2-tuple ('filename', fileobj), 3-tuple ('filename', fileobj, 'content_type') or a 4-tuple ('filename', fileobj, 'content_type', custom_headers), where 'content-type' is a string defining the content type of the given file and custom_headers a dict-like object containing additional headers to add for the file.

auth -- (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.

timeout (float or tuple) -- (optional) How long to wait for the server to send data before giving up, as a float, or a (connect timeout, read timeout) tuple.

allow_redirects (bool) -- (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.

proxies -- (optional) Dictionary mapping protocol to the URL of the proxy.

verify -- (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to True.

stream -- (optional) if False, the response content will be immediately downloaded.

cert -- (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.

# For example:
import requests

url = 'xxxx'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
# The proxy URL should include its scheme
proxies = {'http': 'http://127.0.0.1:8118'}
response = requests.request('GET', url, timeout=20, proxies=proxies, headers=headers)

Return type: requests.Response

In addition:

# Send a HEAD request
requests.head(url, **kwargs)

# Send a GET request
requests.get(url, params=None, **kwargs)

# Send a POST request
requests.post(url, data=None, json=None, **kwargs)

# Send a PUT request
requests.put(url, data=None, **kwargs)

# Send a PATCH request
requests.patch(url, data=None, **kwargs)

# Send a DELETE request
requests.delete(url, **kwargs)

For more details, see the official requests documentation.
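To make the parameters above more concrete, here is a minimal sketch of how params, data, and timeout might be used in practice. The helper names prepare and fetch_text and the example.com addresses are assumptions for illustration only. The prepare helper builds a request without sending it, which is one way to see what requests would put on the wire.

```python
import requests


def prepare(method, url, params=None, data=None, headers=None):
    """Build the request that would be sent, without touching the network."""
    request = requests.Request(method, url, params=params, data=data, headers=headers)
    return request.prepare()


def fetch_text(url, params=None, headers=None, timeout=20):
    """GET a page and return its text, raising on network or HTTP errors."""
    response = requests.get(url, params=params, headers=headers, timeout=timeout)
    # Turn 4xx/5xx status codes into requests.exceptions.HTTPError
    response.raise_for_status()
    return response.text


# params ends up in the query string, data in the form-encoded body
search = prepare("GET", "http://example.com/search", params={"q": "python"})
login = prepare("POST", "http://example.com/login", data={"username": "yourname"})
print(search.url)
print(login.body)
```

In a crawler, a call to fetch_text would usually be wrapped in a try/except for requests.exceptions.RequestException, so that a single failed page does not stop the whole run.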

(3) Selenium + PhantomJS, the perfect combination for efficiently handling dynamic web pages.

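With the methods above we can easily get a page's HTML. Things become troublesome when the content has to be rendered by JavaScript, so we need a tool that, like a browser, can process pages rendered by JS. PhantomJS is a WebKit-based headless tool for interacting with web pages; its JavaScript API supports automated browsing, screenshots, and other browser functions. Selenium is an automated testing tool that supports mainstream browsers such as Firefox, Chrome, and Safari. With Selenium we can simulate the actions a person performs on a page, such as opening the browser, entering information, clicking, and turning pages. Does Selenium work with PhantomJS, a browser that has no interface? It does, and PhantomJS not only provides a browser's functionality but is also relatively more efficient. For example: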

# Requires Selenium 3 or earlier: webdriver.PhantomJS was removed in Selenium 4
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

phantomjs_path = '/data/opt/brew/lib/node_modules/phantomjs/lib/phantom/bin/phantomjs'
driver = webdriver.PhantomJS(executable_path=phantomjs_path)

url = 'xxxx'
driver.get(url)

# Easily get the source of the JS-rendered page
page_source = driver.page_source.encode('utf8')
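Because PhantomJS support was removed in Selenium 4, the same idea can be tried on a current Selenium release with headless Chrome. The sketch below swaps in a different browser, not a different technique, and is not the article's own method. It assumes Chrome and a matching driver are available on the machine; the names chrome_arguments and fetch_rendered_html and the wait_css parameter are hypothetical.

```python
def chrome_arguments(window_size=(1280, 800)):
    """Command-line switches for a headless Chrome session."""
    width, height = window_size
    return ["--headless", "--disable-gpu", f"--window-size={width},{height}"]


def fetch_rendered_html(url, wait_css="body", timeout=15):
    """Load url in headless Chrome and return the HTML after JavaScript has run."""
    # Imported lazily so this module can be loaded on machines without Selenium
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    for argument in chrome_arguments():
        options.add_argument(argument)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until an element matching wait_css exists, up to timeout seconds
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if loading or waiting fails
        driver.quit()
```

Passing a selector for an element that only appears after the scripts have run (as wait_css) is what makes this useful for dynamic pages; with the default "body" the function returns as soon as the document body exists.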

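Sometimes we need to interact with the page, simulating clicks, text input, mouse movement, and other actions just as in a browser. To do so, we first have to locate page elements. WebDriver provides several methods for locating elements: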

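# Methods that locate a single element
# (Selenium 3 API; Selenium 4 uses find_element(By.<strategy>, value) instead)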

find_element_by_id

find_element_by_name

find_element_by_xpath

find_element_by_link_text

find_element_by_partial_link_text

find_element_by_tag_name

find_element_by_class_name

find_element_by_css_selector

# Methods that locate multiple elements and return a list

find_elements_by_name

find_elements_by_xpath

find_elements_by_link_text

find_elements_by_partial_link_text

find_elements_by_tag_name

find_elements_by_class_name

find_elements_by_css_selector

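# For example, suppose the page source contains a form with id "loginForm"
# that holds input elements named "username" and "password".

# The form can be located like this
login_form = driver.find_element_by_id('loginForm')

# The username and password elements can be located as follows
username = driver.find_element_by_name('username')
password = driver.find_element_by_name('password')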

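# With XPath, any of the following locates the username input.
# The first expression selects the input itself: //form[input/@name='username']
# would match the enclosing form instead.
username = driver.find_element_by_xpath("//form/input[@name='username']")
username = driver.find_element_by_xpath("//form[@id='loginForm']/input[1]")
username = driver.find_element_by_xpath("//input[@name='username']")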

For more information, see the selenium_python and PhantomJS documentation.
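To check what a locator expression matches without starting a browser, the expressions can be tried against a small piece of markup using the standard library. This is only a sketch: the SAMPLE_PAGE markup is a hypothetical stand-in for the login page, and xml.etree.ElementTree supports just a subset of XPath, requires well-formed markup, and expects paths relative to the root (hence the leading "."). Expressions that work here may still behave differently in a real browser.

```python
import xml.etree.ElementTree as ET

# Hypothetical page source (an assumption): a login form with three inputs
SAMPLE_PAGE = """
<html>
  <body>
    <form id="loginForm">
      <input name="username" type="text" />
      <input name="password" type="password" />
      <input name="continue" type="submit" value="Login" />
    </form>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_PAGE)

# Locate the form by its id attribute
login_form = root.find(".//form[@id='loginForm']")

# Two ways to reach the username input: by position and by attribute
by_position = root.find(".//form[@id='loginForm']/input[1]")
by_attribute = root.find(".//input[@name='username']")

# findall returns a list, like the find_elements_* methods
all_inputs = root.findall(".//input")

print(by_position.get("name"), by_attribute.get("name"))
print([element.get("name") for element in all_inputs])
```

Both lookups return the same element object, which is a quick way to confirm that two differently written expressions point at the same node.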

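This article briefly introduced several ways to obtain a page's source: urllib, urllib2, and requests, which are commonly used for static pages, and the combination of Selenium and PhantomJS. When crawling, we often need to extract and keep the useful information in a page, so the next article will cover how to extract that information from the fetched source.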
