How to scrape web page data (fetching a page's data to look up my school grades)

Published by 优采云: 2021-12-19 07:01

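I want to try writing a program that logs in and shows my school grades directly. I have never done anything like this, and it has been more than a year since I studied computer networking, so I honestly do not remember much. My rough plan is to first fetch a page's data, then use Fiddler to capture packets and find out what needs to be sent. Here I try the first step: how to fetch a page's data.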
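To make the later "find out what needs to be sent" step more concrete, here is a minimal sketch of what sending a login form might look like once Fiddler has revealed the request. Everything specific in it is an assumption: the field names `username` and `password`, their values, and the URL `http://example.edu/login` are hypothetical placeholders, and a real school site may also require cookies, hidden fields, or a captcha.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class LoginRequestSketch {

    // URL-encode one key or value; UTF-8 is always available, so the checked exception is wrapped
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Build an application/x-www-form-urlencoded body such as "a=1&b=2"
    static String buildFormBody(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return body.toString();
    }

    // POST the form to loginUrl and return the HTTP status code
    static int postForm(String loginUrl, Map<String, String> fields) throws IOException {
        byte[] body = buildFormBody(fields).getBytes("UTF-8");
        HttpURLConnection conn = (HttpURLConnection) new URL(loginUrl).openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", "20210001");  // hypothetical field name and value
        fields.put("password", "my secret"); // hypothetical field name and value
        System.out.println(buildFormBody(fields));
        // postForm("http://example.edu/login", fields); // hypothetical URL, deliberately not called
    }
}
```

The actual field names and target URL would have to be copied from the captured request rather than guessed.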

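From what I found online, many people fetch a page's content through its URL and then use regular expressions to strip elements such as div tags, which quickly yields the page's data. Here I fetch the content of the w3school page.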

Java code:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class CatchData {

    public static void main(String[] args) {
        try {
            catchDa("http://www.w3school.com.cn/");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /*
     * Read the entire content of a web page and append it to a.txt
     */
    public static void catchDa(String url) throws IOException {
        URL addURL = new URL(url);
        // try-with-resources closes both streams, and errors propagate instead of being swallowed
        try (InputStream in = addURL.openStream();
             OutputStream out = new FileOutputStream("a.txt", true)) {
            byte[] c = new byte[1024];
            int n;
            while ((n = in.read(c, 0, 1024)) != -1) {
                out.write(c, 0, n);
            }
        }
    }
}

After running it, a.txt contains the raw HTML of the page.
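The program above copies raw bytes, so no character encoding is involved until the bytes are turned into text. As a small offline sketch of the same read loop, the hypothetical helper below copies any InputStream and then decodes the result with an explicit charset. The class and method names are illustrative, and UTF-8 is used only for the demo; a page served in another encoding needs the matching charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamCopySketch {

    // Copy everything from in to out in 1 KB chunks; returns the number of bytes copied
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[1024];
        long total = 0;
        int n;
        while ((n = in.read(buffer, 0, buffer.length)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    // Read the stream to the end and decode the bytes with the given charset
    static String readAll(InputStream in, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            copy(in, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // An in-memory stand-in for a downloaded page, so the demo needs no network
        byte[] page = "<p>\u6210\u7ee9: 90</p>".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8);
        System.out.println(text);
    }
}
```

Decoding with the wrong charset does not throw an exception; it silently produces garbled characters, which is worth checking first if the saved text looks wrong.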

Here we use a regular expression to strip the tags and keep only part of the page's data:

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.URL;

public class UrlReader {

    // Read the whole page into a string, decoding it as GBK
    public static String read(String url) throws IOException {
        StringBuilder html = new StringBuilder();
        URL addrUrl = new URL(url);
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(addrUrl.openStream(), "gbk"))) {
            String buf;
            while ((buf = br.readLine()) != null) {
                html.append(buf).append("\r\n");
            }
        }
        return html.toString();
    }

    public static void main(String[] args) {
        try {
            String html = read("http://www.w3school.com.cn/");
            // The start marker in the original listing was lost; an empty string matches at index 0
            int beginindex = html.indexOf("");
            int endindex = html.indexOf("<p>");
            if (endindex < 0) {
                endindex = html.length(); // no <p> found: keep everything
            }
            String text = html.substring(beginindex, endindex);
            // The pattern in the original listing was also lost; this tag-stripping regex is an assumption
            text = text.replaceAll("<[^>]*>", "");
            try (OutputStream out = new FileOutputStream("a.txt", true)) {
                out.write(text.getBytes());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
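The cut-and-strip idea can also be tried offline on a small HTML string. The sketch below is only an illustration: the class name, the helper names `between` and `stripTags`, and the sample markup with its grades are all made up for the example, and the regex is a simple approximation of "anything that looks like a tag" rather than a full HTML parser.

```java
import java.util.regex.Pattern;

class TagStripSketch {

    // Matches anything that looks like an HTML tag, such as <div class="x"> or </p>
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Matches runs of whitespace so they can be collapsed to a single space
    private static final Pattern SPACES = Pattern.compile("\\s+");

    // Return the text between the first start marker and the next end marker, or "" if either is missing
    static String between(String html, String start, String end) {
        int begin = html.indexOf(start);
        if (begin < 0) {
            return "";
        }
        begin += start.length();
        int stop = html.indexOf(end, begin);
        if (stop < 0) {
            return "";
        }
        return html.substring(begin, stop);
    }

    // Replace every tag with a space, then collapse whitespace and trim
    static String stripTags(String html) {
        String noTags = TAG.matcher(html).replaceAll(" ");
        return SPACES.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for a downloaded page
        String html = "<div id=\"grades\"><p>Math: <b>90</b></p><p>English: <b>85</b></p></div>";
        String section = between(html, "<div id=\"grades\">", "</div>");
        System.out.println(stripTags(section)); // prints "Math: 90 English: 85"
    }
}
```

Checking the index returned by indexOf before calling substring avoids the StringIndexOutOfBoundsException that occurs when a marker is not present in the page. For pages with nested or malformed markup, a dedicated HTML parser is likely to be more reliable than a regex.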

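At this point we have fetched a page's content and extracted the data we need. This was a first attempt with plenty of lessons learned, and I hope to get more comfortable with it over time.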
