Master HttpClient Scraping: A Must-Have Skill for Java Web Crawlers!

优采云  Published: 2023-04-22 14:58

  Web crawlers are an important tool in data mining, search engines, deep learning, and related fields, and HttpClient is a powerful Java HTTP client library that can be used to build them. This article examines HttpClient from nine angles: basic usage, request configuration, response parsing, exception handling, proxy settings, cookie management, SSL authentication, file upload and download, and a crawler example, helping readers quickly master page scraping with HttpClient.

  1. Basic HttpClient Usage

  HttpClient is a Java HTTP client library developed by the Apache Software Foundation that supports all of the standard HTTP methods (GET/POST/PUT/DELETE/HEAD/OPTIONS/PATCH). To use it, add the dependency:

```xml
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.13</version>
</dependency>
```

  Sending a GET request with HttpClient is straightforward:

```java
CloseableHttpClient httpClient = HttpClients.createDefault();
HttpGet httpGet = new HttpGet("https://www.ucaiyun.com");
CloseableHttpResponse response = httpClient.execute(httpGet);
String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
System.out.println(responseBody);
response.close();
httpClient.close();
```

  The code above creates a CloseableHttpClient with default configuration, sends a GET request to https://www.ucaiyun.com, and uses EntityUtils to convert the response entity to a string before printing it.

  2. Request Configuration

  Besides GET, HttpClient also supports POST, PUT, and other request methods; the request body is set with setEntity:

```java
HttpPost httpPost = new HttpPost("https://www.ucaiyun.com/login");
List<NameValuePair> params = new ArrayList<>();
params.add(new BasicNameValuePair("username", "admin"));
params.add(new BasicNameValuePair("password", "123456"));
httpPost.setEntity(new UrlEncodedFormEntity(params, StandardCharsets.UTF_8));
```

  The code above creates an HttpPost object and sets its form parameters. You can also configure request headers, timeouts, retry counts, and other settings:

```java
RequestConfig requestConfig = RequestConfig.custom()
        .setConnectTimeout(5000)
        .setConnectionRequestTimeout(5000)
        .setSocketTimeout(5000)
        .build();
HttpPost httpPost = new HttpPost("https://www.ucaiyun.com/login");
httpPost.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36");
httpPost.setConfig(requestConfig);
```

  The code above builds a customized RequestConfig that sets the connect timeout, the connection-request timeout (how long to wait for a connection from the pool), and the socket (read) timeout. It also sets a User-Agent header to mimic a Chrome browser.
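
  The retry count mentioned above is not part of RequestConfig; in HttpClient 4.x it is configured on the client builder instead. Below is a minimal sketch, assuming three retries of transient I/O failures are acceptable (the count is illustrative):

```java
// DefaultHttpRequestRetryHandler lives in org.apache.http.impl.client
CloseableHttpClient httpClient = HttpClients.custom()
        .setRetryHandler(new DefaultHttpRequestRetryHandler(3, false)) // retry up to 3 times
        .setDefaultRequestConfig(requestConfig)
        .build();
```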

  3. Response Parsing

  After HttpClient returns a response, it needs to be parsed. The EntityUtils helper is typically used to convert the response entity to a string; for large entities, you can read the stream line by line with BufferedReader:

```java
HttpGet httpGet = new HttpGet("https://www.ucaiyun.com");
CloseableHttpResponse response = httpClient.execute(httpGet);
BufferedReader reader = new BufferedReader(
        new InputStreamReader(response.getEntity().getContent(), StandardCharsets.UTF_8));
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}
reader.close();
response.close();
httpClient.close();
```

  The code above reads the response entity line by line with BufferedReader and prints it to the console.
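
  Before parsing, it is also worth checking the status code so that error pages are not treated as normal content. A small sketch of that check:

```java
CloseableHttpResponse response = httpClient.execute(httpGet);
int statusCode = response.getStatusLine().getStatusCode();
if (statusCode == HttpStatus.SC_OK) {
    // Only parse the body for a successful response
    String body = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
    System.out.println(body.length());
} else {
    System.err.println("Unexpected status: " + statusCode);
    EntityUtils.consume(response.getEntity()); // release the connection
}
response.close();
```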

  4. Exception Handling

  Various exceptions can occur when using HttpClient, such as connect timeouts, connection-request timeouts, read timeouts, and other network errors. To keep the program stable and reliable, they need to be handled:

```java
try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
    HttpGet httpGet = new HttpGet("https://www.ucaiyun.com");
    try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
        String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
        System.out.println(responseBody);
    } catch (IOException e) {
        e.printStackTrace();
    }
} catch (IOException e) {
    e.printStackTrace();
}
```

  The code above uses try-with-resources to close resources automatically and catches any IOException that may be thrown.
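
  The timeouts configured earlier surface as different exception types, and catching them separately makes failures easier to diagnose. A minimal sketch (ConnectTimeoutException comes from org.apache.http.conn, SocketTimeoutException from java.net):

```java
try (CloseableHttpClient httpClient = HttpClients.createDefault();
     CloseableHttpResponse response = httpClient.execute(new HttpGet("https://www.ucaiyun.com"))) {
    System.out.println(EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8));
} catch (ConnectTimeoutException e) {
    // The TCP connection could not be established within the connect timeout
    System.err.println("Connect timeout: " + e.getMessage());
} catch (SocketTimeoutException e) {
    // The server accepted the connection but did not respond within the socket timeout
    System.err.println("Read timeout: " + e.getMessage());
} catch (IOException e) {
    // Any other network or protocol error
    e.printStackTrace();
}
```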

  5. Proxy Settings

  

  Some sites limit how often a single IP address may visit. To avoid getting your IP banned, requests can be routed through a proxy server, which HttpClient supports via its proxy settings:

```java
HttpHost proxy = new HttpHost("127.0.0.1", 1080, "http");
RequestConfig requestConfig = RequestConfig.custom().setProxy(proxy).build();
CloseableHttpClient httpClient = HttpClients.custom().setDefaultRequestConfig(requestConfig).build();
HttpGet httpGet = new HttpGet("https://www.ucaiyun.com");
CloseableHttpResponse response = httpClient.execute(httpGet);
String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
System.out.println(responseBody);
response.close();
httpClient.close();
```

  The code above configures an HTTP proxy at 127.0.0.1:1080, applies it through RequestConfig, and builds a custom CloseableHttpClient with that configuration.
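
  If the proxy requires authentication, credentials can be supplied through a CredentialsProvider. A sketch reusing the proxy HttpHost from the snippet above and assuming HTTP basic authentication (the credentials are illustrative):

```java
CredentialsProvider credsProvider = new BasicCredentialsProvider();
credsProvider.setCredentials(
        new AuthScope("127.0.0.1", 1080),
        new UsernamePasswordCredentials("proxyUser", "proxyPass")); // illustrative credentials
CloseableHttpClient httpClient = HttpClients.custom()
        .setDefaultCredentialsProvider(credsProvider)
        .setDefaultRequestConfig(RequestConfig.custom().setProxy(proxy).build())
        .build();
```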

  6. Cookie Management

  Some sites require login before their pages can be accessed. To maintain the login session, cookies are needed; HttpClient manages them through CookieStore and CookieSpec:

```java
CookieStore cookieStore = new BasicCookieStore();
CloseableHttpClient httpClient = HttpClients.custom().setDefaultCookieStore(cookieStore).build();

// 1. Load the login page so the server can set its initial cookies
HttpGet httpGet1 = new HttpGet("https://www.ucaiyun.com/login");
httpGet1.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36");
httpClient.execute(httpGet1).close();

// 2. Submit the login form
HttpPost httpPost = new HttpPost("https://www.ucaiyun.com/login");
List<NameValuePair> params = new ArrayList<>();
params.add(new BasicNameValuePair("username", "admin"));
params.add(new BasicNameValuePair("password", "123456"));
httpPost.setEntity(new UrlEncodedFormEntity(params, StandardCharsets.UTF_8));
httpPost.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36");
httpClient.execute(httpPost).close();

// 3. Access a page that requires an authenticated session
HttpGet httpGet2 = new HttpGet("https://www.ucaiyun.com/dashboard");
httpGet2.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36");
CloseableHttpResponse response = httpClient.execute(httpGet2);
String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
System.out.println(responseBody);
response.close();
httpClient.close();
```

  The code above stores cookies in a BasicCookieStore: it first loads the login page, then submits the login form, and finally requests a page that is only available after logging in.
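
  For the CookieSpec side mentioned above, a cookie policy can be selected through RequestConfig, and the cookies the server has set can be inspected from the CookieStore. A small sketch using the standard RFC 6265 policy (Cookie here is org.apache.http.cookie.Cookie):

```java
RequestConfig cookieConfig = RequestConfig.custom()
        .setCookieSpec(CookieSpecs.STANDARD) // RFC 6265 cookie handling
        .build();
CloseableHttpClient httpClient = HttpClients.custom()
        .setDefaultCookieStore(cookieStore)
        .setDefaultRequestConfig(cookieConfig)
        .build();

// After the login requests, the stored cookies can be examined
for (Cookie cookie : cookieStore.getCookies()) {
    System.out.println(cookie.getName() + "=" + cookie.getValue());
}
```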

  7. SSL Authentication

  Some sites serve content over HTTPS, which requires SSL negotiation. HttpClient handles this through SSLContext and TrustStrategy:

```java
// Trust every certificate (skips verification; suitable only for testing)
SSLContext sslContext = SSLContexts.custom().loadTrustMaterial(null, new TrustStrategy() {
    @Override
    public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
        return true;
    }
}).build();
CloseableHttpClient httpClient = HttpClients.custom().setSSLContext(sslContext).build();
HttpGet httpGet = new HttpGet("https://www.ucaiyun.com");
CloseableHttpResponse response = httpClient.execute(httpGet);
String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
System.out.println(responseBody);
response.close();
httpClient.close();
```

  The code above uses a TrustStrategy that accepts every SSL certificate (which disables verification, so it should only be used for testing) and builds a custom CloseableHttpClient with it.
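
  If disabling verification is not acceptable, the same SSLContexts builder can load a real truststore instead. A sketch assuming a JKS truststore that contains the site's certificate (the path and password are illustrative):

```java
SSLContext sslContext = SSLContexts.custom()
        .loadTrustMaterial(new File("truststore.jks"), "changeit".toCharArray()) // illustrative truststore
        .build();
CloseableHttpClient httpClient = HttpClients.custom()
        .setSSLContext(sslContext)
        .build();
```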

  8. File Upload and Download

  HttpClient can also upload and download files, using the multipart/form-data format for uploads. Here is an upload example:

```java
HttpPost httpPost = new HttpPost("https://www.ucaiyun.com/upload");
MultipartEntityBuilder builder = MultipartEntityBuilder.create();
builder.addBinaryBody("file", new File("D:\\test.txt"));
HttpEntity entity = builder.build();
httpPost.setEntity(entity);
CloseableHttpResponse response = httpClient.execute(httpPost);
String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
System.out.println(responseBody);
response.close();
httpClient.close();
```

  The code above creates an HttpPost object and uses MultipartEntityBuilder to build the request body, adding a binary part named file. Similarly, a file can be downloaded with HttpGet:

```java
HttpGet httpGet = new HttpGet("https://www.ucaiyun.com/download?file=test.txt");
CloseableHttpResponse response = httpClient.execute(httpGet);
FileOutputStream fos = new FileOutputStream(new File("D:\\test.txt"));
IOUtils.copy(response.getEntity().getContent(), fos); // IOUtils comes from Apache Commons IO
fos.close();
response.close();
httpClient.close();
```

  The code above creates an HttpGet object and writes the response entity to a local file.
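
  For the upload shown earlier, MultipartEntityBuilder can also set the part's content type and filename explicitly and carry ordinary form fields alongside the file. A small sketch (the field names are illustrative):

```java
MultipartEntityBuilder builder = MultipartEntityBuilder.create();
builder.addBinaryBody("file", new File("D:\\test.txt"),
        ContentType.APPLICATION_OCTET_STREAM, "test.txt"); // explicit content type and filename
builder.addTextBody("description", "sample upload", ContentType.TEXT_PLAIN); // extra form field
httpPost.setEntity(builder.build());
```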

  9. A Crawler Example

  Finally, let's look at a crawler example built on HttpClient. Suppose we want to scrape Zhihu questions and answers about "machine learning":

```java
CloseableHttpClient httpClient = HttpClients.createDefault();
HttpGet httpGet = new HttpGet("https://www.zhihu.com/search?type=content&q=%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0");
CloseableHttpResponse response = httpClient.execute(httpGet);
String responseBody = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);

// Parse the HTML with Jsoup and pull out each search result's title and link
Document document = Jsoup.parse(responseBody);
Elements elements = document.select(".SearchItem");
for (Element element : elements) {
    String title = element.selectFirst(".ContentItem-title a").text();
    String url = element.selectFirst(".ContentItem-title a").attr("href");
    System.out.println(title + " " + url);
}
response.close();
httpClient.close();
```

  The code above uses HttpClient to fetch Zhihu's search results for "machine learning" and parses the HTML with Jsoup to extract each question's title and link (note that if the results are rendered client-side by JavaScript, these selectors may not match the raw HTML).
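
  To turn this snippet into a small crawler, the extracted links can be fetched in the same loop with a delay between requests. A minimal sketch, assuming the href values are absolute URLs (the one-second delay is illustrative):

```java
for (Element element : elements) {
    String url = element.selectFirst(".ContentItem-title a").attr("href");
    HttpGet detailGet = new HttpGet(url);
    try (CloseableHttpResponse detailResponse = httpClient.execute(detailGet)) {
        String detailHtml = EntityUtils.toString(detailResponse.getEntity(), StandardCharsets.UTF_8);
        System.out.println(url + " -> " + detailHtml.length() + " characters");
    }
    try {
        Thread.sleep(1000); // throttle requests to avoid hammering the site
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
    }
}
```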

  With the discussion above, you should now have a working grasp of scraping pages with HttpClient. It is a powerful and easy-to-use HTTP library that is widely applicable to crawlers, API testing, data collection, and more. In real projects, configure it according to your specific needs, and remember to respect each site's robots.txt and the relevant laws and regulations. For more help with web crawling, SEO optimization, and related topics, visit the 优采云 website at www.ucaiyun.com.
