Mastering HttpClient scraping: an essential skill for Java web crawlers!
优采云 Published: 2023-04-22 14:58

Web crawlers are an important tool in data mining, search engines, deep learning, and other fields. HttpClient is a powerful Java HTTP client library from the Apache HttpComponents project that can be used to build crawlers. This article analyzes nine topics in detail: basic usage, request configuration, response parsing, exception handling, proxy settings, cookie management, SSL authentication, file upload and download, and a crawler example, to help readers quickly master fetching web pages with HttpClient.
1. Basic HttpClient usage
HttpClient is a Java HTTP client library developed by the Apache Software Foundation that supports all standard HTTP methods (GET/POST/PUT/DELETE/HEAD/OPTIONS/PATCH). To use it, add the dependency:
xml
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.13</version>
</dependency>
Sending a GET request with HttpClient is straightforward:
java
CloseableHttpClient httpClient = HttpClients.createDefault();
HttpGet httpGet = new HttpGet("https://www.ucaiyun.com");
CloseableHttpResponse response = httpClient.execute(httpGet);
String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
System.out.println(responseBody);
response.close();
httpClient.close();
The code above creates a CloseableHttpClient with default configuration, sends a GET request to https://www.ucaiyun.com, converts the response entity to a string with EntityUtils, and prints it.
2. Configuring requests
Besides GET, HttpClient supports POST, PUT, and other request methods; the request body is set with setEntity:
java
HttpPost httpPost = new HttpPost("https://www.ucaiyun.com/login");
List<NameValuePair> params = new ArrayList<>();
params.add(new BasicNameValuePair("username", "admin"));
params.add(new BasicNameValuePair("password", "123456"));
httpPost.setEntity(new UrlEncodedFormEntity(params, StandardCharsets.UTF_8));
The code above creates an HttpPost and sets its form parameters. You can also configure request headers, timeouts, retry behavior, and other options:
java
RequestConfig requestConfig = RequestConfig.custom()
        .setConnectTimeout(5000)
        .setConnectionRequestTimeout(5000)
        .setSocketTimeout(5000)
        .build();
HttpPost httpPost = new HttpPost("https://www.ucaiyun.com/login");
httpPost.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36");
httpPost.setConfig(requestConfig);
The code above builds a custom RequestConfig that sets the connection timeout, the timeout for leasing a connection from the connection pool (setConnectionRequestTimeout), and the socket read timeout (setSocketTimeout). It also sets a User-Agent header to mimic the Chrome browser.
3. Parsing responses
After receiving a response, it needs to be parsed; EntityUtils is usually used to convert the response entity to a string. For a large response body, a BufferedReader can read it line by line instead:
java
HttpGet httpGet = new HttpGet("https://www.ucaiyun.com");
CloseableHttpResponse response = httpClient.execute(httpGet);
BufferedReader reader = new BufferedReader(new InputStreamReader(response.getEntity().getContent(), StandardCharsets.UTF_8));
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}
reader.close();
response.close();
httpClient.close();
The code above reads the response entity line by line with a BufferedReader and prints each line to the console.
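Before parsing a body at all, it is good practice to check the HTTP status code (in HttpClient 4.x, `response.getStatusLine().getStatusCode()`). A minimal, dependency-free sketch of classifying status codes; the category names and the suggested actions in the comments are illustrative, not part of the HttpClient API:

```java
// Sketch: classify an HTTP status code before deciding how to handle a response.
// With HttpClient 4.x the code would come from response.getStatusLine().getStatusCode().
public class StatusCheck {
    static String classify(int status) {
        if (status >= 200 && status < 300) return "success";      // safe to parse the body
        if (status >= 300 && status < 400) return "redirect";     // follow the Location header
        if (status >= 400 && status < 500) return "client-error"; // e.g. 403/404: skip the URL
        if (status >= 500) return "server-error";                 // transient; consider retrying
        return "informational";
    }

    public static void main(String[] args) {
        System.out.println(classify(200)); // success
        System.out.println(classify(404)); // client-error
        System.out.println(classify(503)); // server-error
    }
}
```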
4. Exception handling
Various exceptions can occur when using HttpClient, such as connection timeouts, socket timeouts, and other network errors. To keep the program stable and reliable, handle them explicitly:
java
try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
    HttpGet httpGet = new HttpGet("https://www.ucaiyun.com");
    try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
        String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
        System.out.println(responseBody);
    }
} catch (IOException e) {
    e.printStackTrace();
}
The code above uses try-with-resources to close both the client and the response automatically, and catches any IOException that may occur.
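For transient failures, HttpClient 4.x can also retry requests automatically, e.g. `HttpClients.custom().setRetryHandler(new DefaultHttpRequestRetryHandler(3, true)).build()`. When finer control is needed, retries can be done manually. A dependency-free sketch of a retry loop with exponential backoff, where the `Callable` stands in for a call such as `httpClient.execute(httpGet)` and the delays (500 ms doubling each attempt) are arbitrary choices:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Sketch: manual retry with exponential backoff. `task` stands in for a network
// call such as httpClient.execute(httpGet).
public class Retry {
    static <T> T withRetry(Callable<T> task, int maxAttempts) throws Exception {
        long delayMs = 500;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) throw e; // give up after maxAttempts
                Thread.sleep(delayMs);
                delayMs *= 2; // exponential backoff: 500ms, 1s, 2s, ...
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Fails twice, then succeeds -- simulates a flaky connection.
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new IOException("timeout");
            return "ok";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```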
5. Proxy settings
Some sites limit the request rate or number of visits from a single IP address. To avoid getting your IP banned, you can route requests through a proxy server, which HttpClient supports via its proxy settings:
java
HttpHost proxy = new HttpHost("127.0.0.1", 1080, "http");
RequestConfig requestConfig = RequestConfig.custom().setProxy(proxy).build();
CloseableHttpClient httpClient = HttpClients.custom().setDefaultRequestConfig(requestConfig).build();
HttpGet httpGet = new HttpGet("https://www.ucaiyun.com");
CloseableHttpResponse response = httpClient.execute(httpGet);
String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
System.out.println(responseBody);
response.close();
httpClient.close();
The code above defines an HTTP proxy at 127.0.0.1 port 1080, applies it through RequestConfig, and builds a custom CloseableHttpClient with that default configuration.
6. Cookie management
Some sites require login before their pages can be accessed, and cookies are needed to keep the session alive. HttpClient manages cookies through CookieStore and CookieSpec:
java
CookieStore cookieStore = new BasicCookieStore();
CloseableHttpClient httpClient = HttpClients.custom().setDefaultCookieStore(cookieStore).build();
HttpGet httpGet1 = new HttpGet("https://www.ucaiyun.com/login");
httpGet1.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36");
httpClient.execute(httpGet1).close();
HttpPost httpPost = new HttpPost("https://www.ucaiyun.com/login");
List<NameValuePair> params = new ArrayList<>();
params.add(new BasicNameValuePair("username", "admin"));
params.add(new BasicNameValuePair("password", "123456"));
httpPost.setEntity(new UrlEncodedFormEntity(params, StandardCharsets.UTF_8));
httpPost.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36");
httpClient.execute(httpPost).close();
HttpGet httpGet2 = new HttpGet("https://www.ucaiyun.com/dashboard");
httpGet2.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36");
CloseableHttpResponse response = httpClient.execute(httpGet2);
String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
System.out.println(responseBody);
response.close();
httpClient.close();
The code above stores cookies in a BasicCookieStore: it first loads the login page, then submits the login form, and finally requests a page that requires authentication.
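Under the hood, the CookieStore is populated from the Set-Cookie headers in responses. A minimal, stdlib-only sketch of splitting such a header value into its name/value pair and attributes; real cookie parsing, as done by HttpClient's CookieSpec implementations, handles many more cases (expiry dates, quoting, RFC 6265 rules), so this is for illustration only:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: naive parsing of a Set-Cookie header value into its parts.
// HttpClient's CookieSpec implementations do this properly.
public class CookieSketch {
    static Map<String, String> parse(String header) {
        Map<String, String> parts = new LinkedHashMap<>();
        for (String piece : header.split(";")) {
            String[] kv = piece.trim().split("=", 2);
            // Attributes without a value (e.g. HttpOnly) map to an empty string.
            parts.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
        return parts;
    }

    public static void main(String[] args) {
        Map<String, String> c = parse("session=abc123; Path=/; HttpOnly");
        System.out.println(c); // {session=abc123, Path=/, HttpOnly=}
    }
}
```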
7. SSL authentication
Sites served over HTTPS encrypt traffic, and the client must trust the server's certificate. HttpClient lets you customize this through SSLContext and a TrustStrategy:
java
SSLContext sslContext = SSLContexts.custom().loadTrustMaterial(null, new TrustStrategy() {
    @Override
    public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
        return true;
    }
}).build();
CloseableHttpClient httpClient = HttpClients.custom().setSSLContext(sslContext).build();
HttpGet httpGet = new HttpGet("https://www.ucaiyun.com");
CloseableHttpResponse response = httpClient.execute(httpGet);
String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
System.out.println(responseBody);
response.close();
httpClient.close();
The code above uses a TrustStrategy that trusts every SSL certificate and builds a custom CloseableHttpClient with it. Note that this disables certificate validation entirely and opens the door to man-in-the-middle attacks, so it should only be used for testing, never in production.
8. File upload and download
HttpClient can also upload and download files; uploads use the multipart/form-data format. Here is an upload example (MultipartEntityBuilder lives in the separate httpmime artifact of HttpComponents):
java
HttpPost httpPost = new HttpPost("https://www.ucaiyun.com/upload");
MultipartEntityBuilder builder = MultipartEntityBuilder.create();
builder.addBinaryBody("file", new File("D:\\test.txt"));
HttpEntity entity = builder.build();
httpPost.setEntity(entity);
CloseableHttpResponse response = httpClient.execute(httpPost);
String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
System.out.println(responseBody);
response.close();
httpClient.close();
The code above creates an HttpPost and uses MultipartEntityBuilder to build the request body, adding a binary file part named file. Downloading works the other way around, with an HttpGet:
java
HttpGet httpGet = new HttpGet("https://www.ucaiyun.com/download?file=test.txt");
CloseableHttpResponse response = httpClient.execute(httpGet);
try (FileOutputStream fos = new FileOutputStream("D:\\test.txt")) {
    response.getEntity().writeTo(fos);
}
response.close();
httpClient.close();
The code above sends a GET request and streams the response entity into a local file.
9. A crawler example
Finally, a crawler example built on HttpClient. Suppose we want to scrape questions and answers about "machine learning" (机器学习) from Zhihu:
java
CloseableHttpClient httpClient = HttpClients.createDefault();
HttpGet httpGet = new HttpGet("https://www.zhihu.com/search?type=content&q=%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0");
CloseableHttpResponse response = httpClient.execute(httpGet);
String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
Document document = Jsoup.parse(responseBody);
Elements elements = document.select(".SearchItem");
for (Element element : elements) {
    Element link = element.selectFirst(".ContentItem-title a");
    if (link == null) continue; // skip items without a title link
    System.out.println(link.text() + " " + link.attr("href"));
}
response.close();
httpClient.close();
The code above fetches Zhihu's search results for "机器学习" with HttpClient and parses the HTML with Jsoup (the org.jsoup:jsoup artifact) to extract each question's title and link. Note that the CSS selectors depend on Zhihu's current markup and may break when the site changes; the site may also require login or render results with JavaScript, in which case a plain HTTP fetch will not see them.
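The percent-encoded q parameter in the URL above is simply the UTF-8 URL encoding of the query string; it can be produced with the JDK's URLEncoder rather than written by hand:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Build the search URL by percent-encoding the query with the JDK's URLEncoder.
public class EncodeQuery {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String q = URLEncoder.encode("机器学习", "UTF-8");
        System.out.println(q); // %E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0
        System.out.println("https://www.zhihu.com/search?type=content&q=" + q);
    }
}
```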
With the discussion above, you should now have a solid grasp of fetching web pages with HttpClient. It is a powerful, easy-to-use HTTP library that is widely applicable to crawling, API testing, data collection, and more. In real projects, configure it according to your specific needs, and remember to respect each site's robots.txt rules and applicable laws. For more help with web crawling, SEO, and related topics, visit the 优采云 website at www.ucaiyun.com.