c httpclient: fetching web pages (the simplest crawler, with no proxy server to configure; how is it done?)

Published by 优采云: 2021-11-05 12:16


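The simplest crawler needs no proxy server, no cookies, and no HTTP connection pool. It sends a request with HttpGet and only has to retrieve the HTML source.

A crawler that meets just these requirements is about as basic as a crawler gets. It is also the foundation for more complex crawlers.

The code uses the HttpClient 4 API. Don't tell me that much of the code online is written for HttpClient 3 and raises compatibility problems; the two versions do not differ much, but we should still choose the newer interface that works.

There are of course many details to watch, such as character encoding (I usually force UTF-8).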
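To make the encoding remark concrete, here is a minimal sketch using only the Java standard library. The class `EncodingDemo` and its two methods are hypothetical names chosen for illustration; they are not part of the article's code. The sketch assumes the page bytes are UTF-8 and shows what happens when they are decoded as ISO-8859-1 by mistake, and how that particular mistake can be undone.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative and not from the article.
class EncodingDemo {

    // Decode raw bytes as UTF-8 explicitly instead of relying on the platform default.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Repair text that was wrongly decoded as ISO-8859-1 although the bytes were UTF-8.
    // ISO-8859-1 maps every byte to exactly one char, so the original bytes are recoverable.
    static String repairLatin1Mojibake(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character Chinese word for "Chinese text",
        // written with escapes so the source file encoding does not matter.
        String text = "\u4e2d\u6587";
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);

        // Simulate the bug: UTF-8 bytes decoded with the wrong charset.
        String garbled = new String(raw, StandardCharsets.ISO_8859_1);

        System.out.println(decodeUtf8(raw).equals(text));
        System.out.println(garbled.equals(text));
        System.out.println(repairLatin1Mojibake(garbled).equals(text));
    }
}
```

The repair step only works when the wrong decoding was ISO-8859-1, because that charset is lossless over bytes. If the bytes were decoded with some other charset, the original data may already be lost, so decoding with the right charset in the first place is the safer habit.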

The code:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

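public class Easy {

    // Convert an input stream to a String, decoding the bytes as UTF-8
    public static String inputStream2String(InputStream is) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        int i = -1;
        while ((i = is.read()) != -1) {
            baos.write(i);
        }
        // Decode explicitly as UTF-8 rather than with the platform default charset
        return baos.toString("UTF-8");
    }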

    // Core method that fetches the page
    public static void doGrab() throws Exception {
        // httpclient can be thought of as a simulated browser
        CloseableHttpClient httpclient = HttpClients.createDefault();
        try {
            // URL of the target page
            String targetUrl = "http://chriszz.sinaapp.com";
            // Request the page with GET; more complex cases can switch to POST
            HttpGet httpGet = new HttpGet(targetUrl);
            CloseableHttpResponse response1 = httpclient.execute(httpGet);

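            try {
                // Use the status code to check whether the request succeeded; 200 means the page was fetched
                int statusCode = response1.getStatusLine().getStatusCode();
                if (statusCode == 200) {
                    System.out.println("This page can be fetched normally!");
                } else {
                    // Anything else: print the status line so the problem is visible
                    System.out.println(response1.getStatusLine());
                }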

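                //System.out.println(response1.getStatusLine());
                HttpEntity entity1 = response1.getEntity();
                // do something useful with the response body
                // and ensure it is fully consumed
                InputStream input = entity1.getContent();
                String rawHtml = inputStream2String(input);
                System.out.println(rawHtml);
                // Chinese text sometimes comes out garbled. That depends on the encoding configured for
                // your Eclipse Java project, the encoding of this .java file, and the encoding of the fetched page.
                // For example, you can convert the encoding with String.getBytes()
                // (this only helps when the text was wrongly decoded as ISO-8859-1):
                //String html = new String(rawHtml.getBytes("ISO-8859-1"),"UTF-8"); // the converted result
                EntityUtils.consume(entity1);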

            } finally {
                response1.close(); // remember to close the response
            }
        } finally {
            httpclient.close(); // the client has to be closed too!
        }
    }

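    /*
     * The simplest Java crawler: fetching the Baidu home page
     * memo:
     * 0. What gets fetched is the Baidu home page, which corresponds to one HTML page.
     *    (As for why we request http://www.baidu.com rather than http://www.baidu.com/xxx.html: that is
     *    configured on Baidu's side. Either way we reach the page that contains the HTML.)
     * 1. The HTTP GET method is enough. (Later, more complex cases can use POST, set cookies, or even set up
     *    an HTTP connection pool; fetching JSON data, images and so on works in a similar way.)
     * 2. The code is written against the httpclient packages (httpclient4); the corresponding jar files
     *    have to be downloaded and added to the build path.
     * 3. The code mainly follows the tutorial PDF that ships with the httpclient (http://hc.apache.org/) package.
     */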

    public static void main(String[] args) throws Exception {
        Easy.doGrab(); // for simplicity doGrab() is declared static, so calling Easy.doGrab() directly is enough
    }
}
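The listing forces a single encoding, while its comments note that the right choice also depends on the fetched page. As a rough sketch of one way to handle that, the helper below picks the charset from a Content-Type header value and falls back to UTF-8. `BodyDecoder`, `charsetOf`, and `readAll` are hypothetical names, not part of the article or of HttpClient, and the sketch uses only the standard library so it can be run without any jar files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; class and method names are illustrative, not from the article.
class BodyDecoder {

    // Extract the charset parameter from a Content-Type value such as
    // "text/html; charset=GBK". Falls back to UTF-8 when it is absent or unknown.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                // Case-insensitive check for the "charset=" prefix (8 characters).
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Unknown or malformed charset name: use the fallback below.
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Read the whole stream in 4 KB chunks and decode it with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a response body; a real crawler would pass entity.getContent() instead.
        byte[] body = "<html>ok</html>".getBytes(StandardCharsets.ISO_8859_1);
        Charset cs = charsetOf("text/html; charset=ISO-8859-1");
        System.out.println(readAll(new ByteArrayInputStream(body), cs));
    }
}
```

In the crawler above, the header value could presumably be taken from `entity1.getContentType()` (which may be null) before calling `entity1.getContent()`. HttpClient 4 also offers `EntityUtils.toString(entity, "UTF-8")`, which reads and decodes the body in one call with UTF-8 as the default charset, and may be the simpler choice when no custom handling is needed.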
