Mastering Cookie Collection with HttpClient Made Easy

优采云 Published: 2023-03-13 14:16

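  In the internet era, cookies have become an essential part of how websites work. For a crawler, collecting cookies makes it possible to imitate browser behavior more faithfully, which improves both crawl efficiency and success rate. This article explains in detail how to collect cookies with HttpClient.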

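  1. Understanding cookies

  A cookie is an HTTP mechanism for storing data on the client and passing it between client and server. Typically, the server attaches a Set-Cookie header to its response, telling the client what data to store along with attributes such as the expiry time. A browser-like client saves this information when it receives the response and sends it back to the server in the Cookie header of later matching requests.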
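
  To make the Set-Cookie / Cookie round trip concrete, here is a minimal sketch that uses only the JDK's java.net.HttpCookie class rather than HttpClient. The class name CookieHeaderSketch, its two helper methods, and the sample header value are assumptions made up for illustration.

```java
import java.net.HttpCookie;
import java.util.List;
import java.util.stream.Collectors;

public class CookieHeaderSketch {

    // Parse the value of one Set-Cookie response header into cookie objects.
    public static List<HttpCookie> parseSetCookie(String headerValue) {
        return HttpCookie.parse(headerValue);
    }

    // Build the value of the Cookie request header: "name=value; name2=value2".
    public static String toCookieHeader(List<HttpCookie> cookies) {
        return cookies.stream()
                .map(c -> c.getName() + "=" + c.getValue())
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        // Hypothetical header value, as a server might send it after login
        List<HttpCookie> cookies =
                parseSetCookie("sessionId=abc123; Path=/; Max-Age=3600; HttpOnly");

        for (HttpCookie c : cookies) {
            System.out.println(c.getName() + " = " + c.getValue()
                    + " (path=" + c.getPath() + ", maxAge=" + c.getMaxAge() + "s)");
        }

        // What the client would send back on the next matching request
        System.out.println("Cookie: " + toCookieHeader(cookies));
    }
}
```

  Note that only the name and value travel back to the server; attributes such as Path and Max-Age stay on the client and decide when the cookie is sent.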

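  2. HttpClient basics

  HttpClient is an open-source Apache project: a powerful, easy-to-use HTTP client toolkit. The 4.x line used in this article (the org.apache.http packages) supports HTTP/1.1; HTTP/2 support was added in the 5.x line. It also offers features such as mutual (two-way) TLS authentication and connection pooling. Here is a simple usage example: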

  java

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpClientExample {
    public static void main(String[] args) throws Exception {
        HttpGet httpGet = new HttpGet("http://www.ucaiyun.com");
        // try-with-resources closes the response and the client when done
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            // Read the whole response body as a string
            String body = EntityUtils.toString(response.getEntity());
            System.out.println(body);
        }
    }
}

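  3. Getting cookies

  A client built with HttpClients.createDefault() does keep cookies, but in an internal in-memory store that your own code cannot reach. To read or pre-populate cookies, create a CookieStore yourself (and attach it to the client, as section 4 shows). The following simple example demonstrates the CookieStore API on its own: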

  java

import java.util.List;

import org.apache.http.client.CookieStore;
import org.apache.http.cookie.Cookie;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.cookie.BasicClientCookie;

public class CookieExample {
    public static void main(String[] args) throws Exception {
        // In-memory store that holds the cookies
        CookieStore cookieStore = new BasicCookieStore();

        // Create a cookie by hand and scope it to a domain and path
        BasicClientCookie cookie = new BasicClientCookie("name", "value");
        cookie.setDomain(".ucaiyun.com");
        cookie.setPath("/");
        cookieStore.addCookie(cookie);

        // List everything currently in the store
        List<Cookie> cookies = cookieStore.getCookies();
        for (Cookie c : cookies) {
            System.out.println(c.getName() + ":" + c.getValue());
        }
    }
}
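
  The domain and path set on a cookie decide which requests it is attached to. The sketch below is a simplified, JDK-only version of the matching rules described in RFC 6265; it is not HttpClient's own implementation, which depends on the configured cookie spec and may differ in details. The class CookieMatchSketch and its method names are hypothetical, and IP-address hosts are deliberately ignored.

```java
import java.util.Locale;

public class CookieMatchSketch {

    // Simplified domain match: the host equals the cookie domain,
    // or is a subdomain of it. A leading dot on the cookie domain is ignored.
    public static boolean domainMatches(String cookieDomain, String host) {
        String domain = cookieDomain.toLowerCase(Locale.ROOT);
        if (domain.startsWith(".")) {
            domain = domain.substring(1);
        }
        String h = host.toLowerCase(Locale.ROOT);
        return h.equals(domain) || h.endsWith("." + domain);
    }

    // Simplified path match: the cookie path is a prefix of the request path
    // that ends on a "/" boundary.
    public static boolean pathMatches(String cookiePath, String requestPath) {
        if (requestPath.equals(cookiePath)) {
            return true;
        }
        if (!requestPath.startsWith(cookiePath)) {
            return false;
        }
        return cookiePath.endsWith("/")
                || requestPath.charAt(cookiePath.length()) == '/';
    }

    public static void main(String[] args) {
        // Same domain and path as the cookie created in the example above
        String cookieDomain = ".ucaiyun.com";
        String cookiePath = "/";

        String[][] requests = {
            {"www.ucaiyun.com", "/dashboard"},
            {"ucaiyun.com", "/"},
            {"www.example.com", "/dashboard"},
        };
        for (String[] r : requests) {
            boolean sent = domainMatches(cookieDomain, r[0]) && pathMatches(cookiePath, r[1]);
            System.out.println(r[0] + r[1] + " -> cookie sent: " + sent);
        }
    }
}
```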

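  4. Managing cookies automatically

  To avoid handling cookies by hand, attach the CookieStore to the client with setDefaultCookieStore: HttpClient then saves cookies from Set-Cookie responses into the store and sends matching ones on later requests. The CookieSpecProvider registered in the example determines which rules are used to parse and match cookies. Here is a simple example of automatic cookie management: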

  java

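import java.util.List;

import org.apache.http.client.CookieStore;
import org.apache.http.client.config.CookieSpecs;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.cookie.Cookie;
import org.apache.http.cookie.CookieSpecProvider;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.cookie.BasicClientCookie;
import org.apache.http.impl.cookie.DefaultCookieSpecProvider;
import org.apache.http.util.EntityUtils;

public class AutoCookieExample {
    public static void main(String[] args) throws Exception {
        CookieStore cookieStore = new BasicCookieStore();

        // Optionally seed the store with a cookie before the first request
        BasicClientCookie cookie = new BasicClientCookie("name", "value");
        cookie.setDomain(".ucaiyun.com");
        cookie.setPath("/");
        cookieStore.addCookie(cookie);

        // Attach the store and register the cookie spec used for parsing and matching
        CloseableHttpClient httpClient = HttpClients.custom()
                .setDefaultCookieSpecRegistry(RegistryBuilder.<CookieSpecProvider>create()
                        .register(CookieSpecs.DEFAULT, new DefaultCookieSpecProvider())
                        .build())
                .setDefaultCookieStore(cookieStore)
                .build();

        HttpGet httpGet = new HttpGet("http://www.ucaiyun.com");
        try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
            System.out.println(EntityUtils.toString(response.getEntity()));
        }

        // The store now also holds any cookies the server set in its response
        List<Cookie> cookies = cookieStore.getCookies();
        for (Cookie c : cookies) {
            System.out.println(c.getName() + ":" + c.getValue());
        }
        httpClient.close();
    }
}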

  5. Simulating login

  Once HttpClient is collecting and managing cookies, we can simulate a login. Here is a simple example:

  

  java

import java.util.ArrayList;
import java.util.List;

import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class LoginExample {
    public static void main(String[] args) throws Exception {
        // The same client is reused so the session cookie carries over between requests
        CloseableHttpClient httpClient = HttpClients.createDefault();

        // Build the login request
        HttpPost httpPost = new HttpPost("http://www.ucaiyun.com/login");
        List<NameValuePair> formParams = new ArrayList<>();
        formParams.add(new BasicNameValuePair("username", "your_username"));
        formParams.add(new BasicNameValuePair("password", "your_password"));
        UrlEncodedFormEntity entity = new UrlEncodedFormEntity(formParams, "UTF-8");

        // Set the request headers
        httpPost.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36");
        httpPost.setHeader("Content-Type", "application/x-www-form-urlencoded");

        // Attach the form parameters
        httpPost.setEntity(entity);

        // Send the login request and release the connection
        CloseableHttpResponse response = httpClient.execute(httpPost);
        EntityUtils.consume(response.getEntity());
        response.close();

        // Request a page that is only available after login
        HttpGet httpGet = new HttpGet("http://www.ucaiyun.com/dashboard");
        response = httpClient.execute(httpGet);
        String htmlContent = EntityUtils.toString(response.getEntity());
        System.out.println(htmlContent);

        response.close();
        httpClient.close();
    }
}

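  6. Avoiding being blocked

  When a crawler collects data, the server may block it. To reduce that risk, set headers such as User-Agent and Referer so that requests resemble a browser's. Also keep the crawl rate moderate so that the server does not ban your IP.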
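
  As a rough illustration of both points, the JDK-only sketch below computes how long to pause so that consecutive requests stay a minimum interval apart, and collects browser-like headers in one place. The class PoliteCrawlSketch, its method names, the 200 ms interval, the User-Agent string, and the page URLs are all assumptions chosen for the demo; a suitable interval depends on the target site.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PoliteCrawlSketch {

    // Milliseconds still to wait so that two requests are at least minIntervalMs apart.
    public static long waitMillis(long lastRequestAtMs, long nowMs, long minIntervalMs) {
        long elapsed = nowMs - lastRequestAtMs;
        return Math.max(0L, minIntervalMs - elapsed);
    }

    // Headers that make a request look like it came from a browser.
    public static Map<String, String> browserLikeHeaders(String referer) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36");
        headers.put("Referer", referer);
        return headers;
    }

    public static void main(String[] args) throws InterruptedException {
        long minIntervalMs = 200; // demo value only; real crawls usually need more
        String[] urls = {
            "http://www.ucaiyun.com/page1", // hypothetical pages
            "http://www.ucaiyun.com/page2",
            "http://www.ucaiyun.com/page3",
        };

        long lastRequestAt = System.currentTimeMillis() - minIntervalMs; // no wait before the first request
        for (String url : urls) {
            Thread.sleep(waitMillis(lastRequestAt, System.currentTimeMillis(), minIntervalMs));
            lastRequestAt = System.currentTimeMillis();
            // A real crawler would copy these headers onto its HttpGet and execute it here
            Map<String, String> headers = browserLikeHeaders("http://www.ucaiyun.com/");
            System.out.println("fetching " + url + " with " + headers.keySet());
        }
    }
}
```

  With HttpClient, each entry of the map could be applied to a request through setHeader, as the login example does for User-Agent.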

  7. Summary

  With the HttpClient techniques described in this article, we can imitate browser behavior more conveniently when collecting data, and have cookies stored and reused automatically once they have been obtained. We also covered practical techniques such as simulating login and avoiding being blocked by the server.
