Fetching web pages with HttpClient
优采云  Published: 2021-10-20 15:04
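1. GET method
Step 1: create a client, which is similar to opening a web page in a browser.
HttpClient httpClient = new HttpClient();
Step 2: create a GET method with the URL of the page you want to fetch.
GetMethod getMethod = new GetMethod("");
Step 3: execute the request and read the response status code; 200 means the request succeeded.
int statusCode = httpClient.executeMethod(getMethod);
Step 4: read the page source.
byte[] responseBody = getMethod.getResponseBody();
These are the four main steps. There are other details to handle as well, such as the page's character encoding: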
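// Requires Commons HttpClient 3.x on the classpath.
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

public static String spiderHtml() throws Exception {
    HttpClient client = new HttpClient();
    GetMethod method = new GetMethod("http://top.baidu.com/buzz?b=1");
    try {
        int statusCode = client.executeMethod(method);
        if (statusCode != HttpStatus.SC_OK) {
            System.err.println("Method failed: " + method.getStatusLine());
        }
        // Decode with the page's encoding (GBK here) rather than the platform default.
        byte[] body = method.getResponseBody();
        String html = new String(body, "gbk");
        return html;
    } finally {
        // Always release the connection when done.
        method.releaseConnection();
    }
}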
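The example above hard-codes "gbk", which only works when you already know the page's encoding. A minimal sketch of one alternative, using only the JDK: read the charset parameter from the Content-Type header value and fall back to a default when it is missing or unknown. The class and method names (CharsetSniffer, fromContentType, decode) are hypothetical, and the header string is assumed to come from the response (in Commons HttpClient 3.x, for instance, via getResponseHeader("Content-Type")). Pages that declare their encoding only in an HTML meta tag are not handled here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Commons HttpClient.
class CharsetSniffer {

    /** Returns the charset named in a Content-Type value, or fallback if absent or unknown. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // A Content-Type value looks like: text/html; charset=GBK
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Malformed or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    /** Decodes the response bytes with the header's charset, or with fallback. */
    static String decode(byte[] body, String contentType, Charset fallback) {
        return new String(body, fromContentType(contentType, fallback));
    }

    public static void main(String[] args) {
        byte[] body = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        String text = decode(body, "text/html; charset=ISO-8859-1", StandardCharsets.UTF_8);
        System.out.println(text.length()); // 4 characters once decoded correctly
    }
}
```

With the GET example, the call would look like decode(body, headerValue, Charset.forName("GBK")), so GBK remains the fallback for this particular site.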
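2. POST method
// Requires Commons HttpClient 3.x on the classpath.
import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

HttpClient httpClient = new HttpClient();
// UrlPath is the target URL string, defined elsewhere.
PostMethod postMethod = new PostMethod(UrlPath);
// Use the default retry handler for recoverable I/O failures.
postMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, new DefaultHttpMethodRetryHandler());
// Form fields sent in the request body.
NameValuePair[] postData = new NameValuePair[2];
postData[0] = new NameValuePair("username", "xkey");
postData[1] = new NameValuePair("userpass", "********");
postMethod.setRequestBody(postData);
try {
    int statusCode = httpClient.executeMethod(postMethod);
    if (statusCode == HttpStatus.SC_OK) {
        byte[] responseBody = postMethod.getResponseBody();
        // Uses the platform default charset; pass a charset name if the page uses another encoding.
        String html = new String(responseBody);
        System.out.println(html);
    }
} catch (Exception e) {
    System.err.println("Page could not be accessed");
} finally {
    // Always release the connection when done.
    postMethod.releaseConnection();
}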
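To make it clearer what a form POST like the one above puts on the wire, here is a minimal JDK-only sketch of application/x-www-form-urlencoded encoding for name/value pairs. It illustrates the format and is not HttpClient's own implementation. The class name FormBodySketch and the sample password value are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that shows the form-encoding format only.
class FormBodySketch {

    /** Joins fields as name=value pairs separated by '&', percent-encoding each part. */
    static String encode(Map<String, String> fields, String charsetName) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : fields.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), charsetName))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), charsetName));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", "xkey");
        fields.put("userpass", "p@ss word"); // made-up value for illustration
        System.out.println(encode(fields, "UTF-8"));
        // prints: username=xkey&userpass=p%40ss+word
    }
}
```

Reserved characters such as '@' are percent-encoded and spaces become '+', which is why raw values should never be concatenated into a form body by hand.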
Related links: http://blog.csdn.net/acceptedxukai/article/details/7030700
http://www.cnblogs.com/modou/articles/1325569.html