Mastering HttpClient scraping: an essential skill for Java web crawlers!
优采云 Published: 2023-04-22 14:58

Web crawlers are an important tool in data mining, search engines, deep learning, and other fields. HttpClient is a powerful Java HTTP client library from the Apache HttpComponents project that can be used to build crawlers. This article analyzes nine topics in detail: basic usage, request configuration, response parsing, exception handling, proxy settings, cookie management, SSL authentication, file upload and download, and a crawler example, to help readers quickly master fetching web pages with HttpClient.
1. Basic HttpClient usage
HttpClient is a Java HTTP client library developed by the Apache Software Foundation that supports all standard HTTP methods (GET/POST/PUT/DELETE/HEAD/OPTIONS/PATCH). To use it, add the dependency:
xml
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.13</version>
</dependency>
Sending a GET request with HttpClient is straightforward:
java
CloseableHttpClient httpClient = HttpClients.createDefault();
HttpGet httpGet = new HttpGet("https://www.ucaiyun.com");
CloseableHttpResponse response = httpClient.execute(httpGet);
String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
System.out.println(responseBody);
response.close();
httpClient.close();
The code above creates a CloseableHttpClient with default configuration, sends a GET request to https://www.ucaiyun.com, converts the response entity to a string with EntityUtils, and prints it.
2. Configuring requests
Besides GET, HttpClient supports POST, PUT, and other request methods; the request body is set with setEntity:
java
HttpPost httpPost = new HttpPost("https://www.ucaiyun.com/login");
List<NameValuePair> params = new ArrayList<>();
params.add(new BasicNameValuePair("username", "admin"));
params.add(new BasicNameValuePair("password", "123456"));
httpPost.setEntity(new UrlEncodedFormEntity(params, StandardCharsets.UTF_8));
The code above creates an HttpPost and sets its form parameters. You can also configure request headers, timeouts, retry behavior, and other options:
java
RequestConfig requestConfig = RequestConfig.custom()
        .setConnectTimeout(5000)
        .setConnectionRequestTimeout(5000)
        .setSocketTimeout(5000)
        .build();
HttpPost httpPost = new HttpPost("https://www.ucaiyun.com/login");
httpPost.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36");
httpPost.setConfig(requestConfig);
The code above builds a custom RequestConfig that sets the connection timeout, the timeout for leasing a connection from the connection pool (setConnectionRequestTimeout), and the socket read timeout (setSocketTimeout). It also sets a User-Agent header to mimic the Chrome browser.
3. Parsing responses
After receiving a response, it needs to be parsed; EntityUtils is usually used to convert the response entity to a string. For a large response body, a BufferedReader can read it line by line instead:
java
HttpGet httpGet = new HttpGet("https://www.ucaiyun.com");
CloseableHttpResponse response = httpClient.execute(httpGet);
BufferedReader reader = new BufferedReader(new InputStreamReader(response.getEntity().getContent(), StandardCharsets.UTF_8));
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}
reader.close();
response.close();
httpClient.close();
The code above reads the response entity line by line with a BufferedReader and prints each line to the console.
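Before parsing a body at all, it is good practice to check the HTTP status code (in HttpClient 4.x, `response.getStatusLine().getStatusCode()`). A minimal, dependency-free sketch of classifying status codes; the category names and the suggested actions in the comments are illustrative, not part of the HttpClient API:

```java
// Sketch: classify an HTTP status code before deciding how to handle a response.
// With HttpClient 4.x the code would come from response.getStatusLine().getStatusCode().
public class StatusCheck {
    static String classify(int status) {
        if (status >= 200 && status < 300) return "success";      // safe to parse the body
        if (status >= 300 && status < 400) return "redirect";     // follow the Location header
        if (status >= 400 && status < 500) return "client-error"; // e.g. 403/404: skip the URL
        if (status >= 500) return "server-error";                 // transient; consider retrying
        return "informational";
    }

    public static void main(String[] args) {
        System.out.println(classify(200)); // success
        System.out.println(classify(404)); // client-error
        System.out.println(classify(503)); // server-error
    }
}
```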
4. Exception handling
Various exceptions can occur when using HttpClient, such as connection timeouts, socket timeouts, and other network errors. To keep the program stable and reliable, handle them explicitly:
java
try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
    HttpGet httpGet = new HttpGet("https://www.ucaiyun.com");
    try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
        String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
        System.out.println(responseBody);
    }
} catch (IOException e) {
    e.printStackTrace();
}
The code above uses try-with-resources to close both the client and the response automatically, and catches any IOException that may occur.
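For transient failures, HttpClient 4.x can also retry requests automatically, e.g. `HttpClients.custom().setRetryHandler(new DefaultHttpRequestRetryHandler(3, true)).build()`. When finer control is needed, retries can be done manually. A dependency-free sketch of a retry loop with exponential backoff, where the `Callable` stands in for a call such as `httpClient.execute(httpGet)` and the delays (500 ms doubling each attempt) are arbitrary choices:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Sketch: manual retry with exponential backoff. `task` stands in for a network
// call such as httpClient.execute(httpGet).
public class Retry {
    static <T> T withRetry(Callable<T> task, int maxAttempts) throws Exception {
        long delayMs = 500;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) throw e; // give up after maxAttempts
                Thread.sleep(delayMs);
                delayMs *= 2; // exponential backoff: 500ms, 1s, 2s, ...
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Fails twice, then succeeds -- simulates a flaky connection.
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new IOException("timeout");
            return "ok";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```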
5. Proxy settings
Some sites limit the request rate or number of visits from a single IP address. To avoid getting your IP banned, you can route requests through a proxy server, which HttpClient supports via its proxy settings:
java
HttpHost proxy = new HttpHost("127.0.0.1", 1080, "http");
RequestConfig requestConfig = RequestConfig.custom().setProxy(proxy).build();
CloseableHttpClient httpClient = HttpClients.custom().setDefaultRequestConfig(requestConfig).build();
HttpGet httpGet = new HttpGet("https://www.ucaiyun.com");
CloseableHttpResponse response = httpClient.execute(httpGet);
String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
System.out.println(responseBody);
response.close();
httpClient.close();
The code above defines an HTTP proxy at 127.0.0.1 port 1080, applies it through RequestConfig, and builds a custom CloseableHttpClient with that default configuration.
6. Cookie management
Some sites require login before their pages can be accessed, and cookies are needed to keep the session alive. HttpClient manages cookies through CookieStore and CookieSpec:
java
CookieStore cookieStore = new BasicCookieStore();
CloseableHttpClient httpClient = HttpClients.custom().setDefaultCookieStore(cookieStore).build();
HttpGet httpGet1 = new HttpGet("https://www.ucaiyun.com/login");
httpGet1.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36");
httpClient.execute(httpGet1).close();
HttpPost httpPost = new HttpPost("https://www.ucaiyun.com/login");
List<NameValuePair> params = new ArrayList<>();
params.add(new BasicNameValuePair("username", "admin"));
params.add(new BasicNameValuePair("password", "123456"));
httpPost.setEntity(new UrlEncodedFormEntity(params, StandardCharsets.UTF_8));
httpPost.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36");
httpClient.execute(httpPost).close();
HttpGet httpGet2 = new HttpGet("https://www.ucaiyun.com/dashboard");
httpGet2.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36");
CloseableHttpResponse response = httpClient.execute(httpGet2);
String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
System.out.println(responseBody);
response.close();
httpClient.close();
The code above stores cookies in a BasicCookieStore: it first loads the login page, then submits the login form, and finally requests a page that requires authentication.
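Under the hood, the CookieStore is populated from the Set-Cookie headers in responses. A minimal, stdlib-only sketch of splitting such a header value into its name/value pair and attributes; real cookie parsing, as done by HttpClient's CookieSpec implementations, handles many more cases (expiry dates, quoting, RFC 6265 rules), so this is for illustration only:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: naive parsing of a Set-Cookie header value into its parts.
// HttpClient's CookieSpec implementations do this properly.
public class CookieSketch {
    static Map<String, String> parse(String header) {
        Map<String, String> parts = new LinkedHashMap<>();
        for (String piece : header.split(";")) {
            String[] kv = piece.trim().split("=", 2);
            // Attributes without a value (e.g. HttpOnly) map to an empty string.
            parts.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
        return parts;
    }

    public static void main(String[] args) {
        Map<String, String> c = parse("session=abc123; Path=/; HttpOnly");
        System.out.println(c); // {session=abc123, Path=/, HttpOnly=}
    }
}
```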
7. SSL authentication
Sites served over HTTPS encrypt traffic, and the client must trust the server's certificate. HttpClient lets you customize this through SSLContext and a TrustStrategy:
java
SSLContext sslContext = SSLContexts.custom().loadTrustMaterial(null, new TrustStrategy() {
    @Override
    public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
        return true;
    }
}).build();
CloseableHttpClient httpClient = HttpClients.custom().setSSLContext(sslContext).build();
HttpGet httpGet = new HttpGet("https://www.ucaiyun.com");
CloseableHttpResponse response = httpClient.execute(httpGet);
String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
System.out.println(responseBody);
response.close();
httpClient.close();
The code above uses a TrustStrategy that trusts every SSL certificate and builds a custom CloseableHttpClient with it. Note that this disables certificate validation entirely and opens the door to man-in-the-middle attacks, so it should only be used for testing, never in production.
8. File upload and download
HttpClient can also upload and download files; uploads use the multipart/form-data format. Here is an upload example (MultipartEntityBuilder lives in the separate httpmime artifact of HttpComponents):
java
HttpPost httpPost = new HttpPost("https://www.ucaiyun.com/upload");
MultipartEntityBuilder builder = MultipartEntityBuilder.create();
builder.addBinaryBody("file", new File("D:\\test.txt"));
HttpEntity entity = builder.build();
httpPost.setEntity(entity);
CloseableHttpResponse response = httpClient.execute(httpPost);
String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
System.out.println(responseBody);
response.close();
httpClient.close();
The code above creates an HttpPost and uses MultipartEntityBuilder to build the request body, adding a binary file part named file. Downloading works the other way around, with an HttpGet:
java
HttpGet httpGet = new HttpGet("https://www.ucaiyun.com/download?file=test.txt");
CloseableHttpResponse response = httpClient.execute(httpGet);
try (FileOutputStream fos = new FileOutputStream("D:\\test.txt")) {
    response.getEntity().writeTo(fos);
}
response.close();
httpClient.close();
The code above sends a GET request and streams the response entity into a local file.
9. A crawler example
Finally, a crawler example built on HttpClient. Suppose we want to scrape questions and answers about "machine learning" (机器学习) from Zhihu:
java
CloseableHttpClient httpClient = HttpClients.createDefault();
HttpGet httpGet = new HttpGet("https://www.zhihu.com/search?type=content&q=%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0");
CloseableHttpResponse response = httpClient.execute(httpGet);
String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
Document document = Jsoup.parse(responseBody);
Elements elements = document.select(".SearchItem");
for (Element element : elements) {
    Element link = element.selectFirst(".ContentItem-title a");
    if (link == null) continue; // skip items without a title link
    System.out.println(link.text() + " " + link.attr("href"));
}
response.close();
httpClient.close();
The code above fetches Zhihu's search results for "机器学习" with HttpClient and parses the HTML with Jsoup (the org.jsoup:jsoup artifact) to extract each question's title and link. Note that the CSS selectors depend on Zhihu's current markup and may break when the site changes; the site may also require login or render results with JavaScript, in which case a plain HTTP fetch will not see them.
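The percent-encoded q parameter in the URL above is simply the UTF-8 URL encoding of the query string; it can be produced with the JDK's URLEncoder rather than written by hand:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Build the search URL by percent-encoding the query with the JDK's URLEncoder.
public class EncodeQuery {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String q = URLEncoder.encode("机器学习", "UTF-8");
        System.out.println(q); // %E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0
        System.out.println("https://www.zhihu.com/search?type=content&q=" + q);
    }
}
```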
With the discussion above, you should now have a solid grasp of fetching web pages with HttpClient. It is a powerful, easy-to-use HTTP library that is widely applicable to crawling, API testing, data collection, and more. In real projects, configure it according to your specific needs, and remember to respect each site's robots.txt rules and applicable laws. For more help with web crawling, SEO, and related topics, visit the 优采云 website at www.ucaiyun.com.