Part 1: Getting the list of all projects to collect information on

Published by 优采云: 2020-09-01 22:08

1: Get the list of all projects to crawl

As a first attempt, fetch the page and print it:

    import okhttp3.Call;
    import okhttp3.OkHttpClient;
    import okhttp3.Request;
    import okhttp3.Response;

    import java.io.IOException;

    public class Crawler {
        public static void main(String[] args) throws IOException {
            // Create the OkHttpClient
            OkHttpClient okHttpClient = new OkHttpClient();
            // Build a Request object for the page
            Request request = new Request.Builder().url("https://github.com/akullpp/awesome-java/blob/master/README.md").build();
            // Create a Call object, which is responsible for performing one network request
            Call call = okHttpClient.newCall(request);
            // Send the call to the server; closing the Response releases the connection
            try (Response response = call.execute()) {
                // Check whether the request succeeded
                if (!response.isSuccessful()) {
                    System.out.println("Request failed!");
                    return;
                }
                System.out.println(response.body().string());
            }
        }
    }

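The response body is HTML. It is still fairly complicated, so we need to analyze it further and extract the parts we want.

---- 1.2.2 Page structure analysis

Analyzing the page structure with plain string operations is tedious, so here I use the third-party library jsoup to parse the HTML.

Jsoup.parse() takes the HTML string we just fetched and produces a Document object, turning the string into a tree-structured document.

From the Document, getElementsByTag() returns the elements for a given tag; each Element corresponds to one tag.

The content of each li element is a project we want to rank.

Next, create a class that represents a project: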

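    public class Project {
        private String name;        // project name
        private String url;         // project URL
        private String description; // description
        private int stars;          // star count
        private int fork;           // fork count
        private int openIssiue;     // open issues (bugs or feature requests)

        // Setters used by the crawler below
        public void setName(String name) { this.name = name; }
        public void setUrl(String url) { this.url = url; }
        public void setDescription(String description) { this.description = description; }
    }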

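We then inspect the li tags one by one (some li tags do not represent a project, so they have to be filtered out).

Inspection shows that each li tag holds the key information for one project:

The text of the li tag is the project's description, and the li tag contains a nested a tag.

The text of the a tag is the project name, and its href attribute is the project URL.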

    import okhttp3.Call;
    import okhttp3.OkHttpClient;
    import okhttp3.Request;
    import okhttp3.Response;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;

    public class Crawler {
        // One shared client, reused by every request made from this class
        private final OkHttpClient okHttpClient = new OkHttpClient();
        // Blacklist: links to GitHub's own site pages, which are not projects
        private HashSet<String> urlBlackList = new HashSet<>();
        {
            urlBlackList.add("https://github.com/events");
            urlBlackList.add("https://github.community");
            urlBlackList.add("https://github.com/about");
            urlBlackList.add("https://github.com/pricing");
            urlBlackList.add("https://github.com/contact");
            urlBlackList.add("https://github.com/security");
            urlBlackList.add("https://github.com/site/terms");
            urlBlackList.add("https://github.com/site/privacy");
        }

        public static void main(String[] args) throws IOException {
            Crawler crawler = new Crawler();
            String htmlBody = crawler.getPage("https://github.com/akullpp/awesome-java/blob/master/README.md");
            if (htmlBody == null) {
                return; // getPage already reported the failure
            }
            List<Project> list = crawler.parageProjectList(htmlBody);
            System.out.println(list);
        }

        public String getPage(String url) throws IOException {
            // Build a Request object for the given URL
            Request request = new Request.Builder().url(url).build();
            // Create a Call object, which is responsible for performing one network request
            Call call = okHttpClient.newCall(request);
            // Send the call to the server; closing the Response releases the connection
            try (Response response = call.execute()) {
                // Check whether the request succeeded
                if (!response.isSuccessful()) {
                    System.out.println("Request failed!");
                    return null;
                }
                return response.body().string();
            }
        }

        public List<Project> parageProjectList(String htmlBody) {
            // Parse the page with Jsoup and collect all li tags
            List<Project> projects = new ArrayList<>();
            Document document = Jsoup.parse(htmlBody);
            Elements elements = document.getElementsByTag("li");
            for (Element element : elements) {
                // An li without a nested a tag cannot be a project entry
                Elements allElements = element.getElementsByTag("a");
                if (allElements.size() == 0) {
                    continue;
                }
                Element link = allElements.get(0);
                String name = link.text();
                String url = link.attr("href");
                String description = element.text();
                // Keep only GitHub links that are not on the blacklist
                if (!url.startsWith("https://github.com")) {
                    continue;
                }
                if (urlBlackList.contains(url)) {
                    continue;
                }
                Project project = new Project();
                project.setName(name);
                project.setUrl(url);
                project.setDescription(description);
                projects.add(project);
            }
            return projects;
        }
    }
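The prefix check above also lets through addresses such as https://github.community, which is why that address has to sit on the blacklist. As a rough sketch of a stricter alternative (not the author's code), the URL can be parsed with java.net.URI and accepted only when the host is exactly github.com and the path has two segments, owner and repository. The class and method names here are hypothetical, and two-segment site pages such as /site/terms still need the blacklist.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Set;

final class RepoUrlFilter {
    // Hypothetical helper: true only for URLs shaped like https://github.com/<owner>/<repo>
    static boolean isRepoUrl(String url, Set<String> blackList) {
        if (url == null || blackList.contains(url)) {
            return false;
        }
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return false; // not a well-formed URL
        }
        // Compare the host exactly, so "github.community" is rejected
        if (!"https".equals(uri.getScheme()) || !"github.com".equals(uri.getHost())) {
            return false;
        }
        String path = uri.getPath();
        if (path == null) {
            return false;
        }
        // Count the non-empty path segments; a repository URL has exactly two
        int segments = 0;
        for (String part : path.split("/")) {
            if (!part.isEmpty()) {
                segments++;
            }
        }
        return segments == 2;
    }

    public static void main(String[] args) {
        Set<String> blackList = Set.of("https://github.com/site/terms");
        System.out.println(isRepoUrl("https://github.com/akullpp/awesome-java", blackList)); // true
        System.out.println(isRepoUrl("https://github.community", blackList));                // false
        System.out.println(isRepoUrl("https://github.com/about", blackList));                // false
    }
}
```

Links that point deeper into a repository (for example to a subdirectory) have more than two path segments and are rejected by this sketch, which may or may not be what a given crawl wants.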

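At this point we have the full list of projects from awesome-java.

2: Iterate over the project list and fetch each project's information in turn, to obtain its star count and fork count.

Start by looking at the HTML pages of these projects. GitHub in fact provides a set of APIs that make it easier for others to write crawlers, and the API also lets GitHub rate-limit crawling more effectively.

Requesting the HTML pages directly may get the crawler blocked by anti-crawling measures, whereas the API is a more stable way to get the data.

Through the API that GitHub provides we can get information about a given project/repository. Here we again use an OkHttpClient object to access the GitHub API.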
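Because the API is rate-limited, a crawler that walks a long project list may need to pause. GitHub reports the current quota in response headers such as X-RateLimit-Remaining and X-RateLimit-Reset (the reset time in epoch seconds). A minimal sketch of the waiting rule, assuming those two values have already been read from the response; the class and method names are made up for illustration:

```java
final class RateLimitSketch {
    // Seconds to wait before the next request, given the values of the
    // X-RateLimit-Remaining and X-RateLimit-Reset response headers.
    static long secondsToWait(int remaining, long resetEpochSeconds, long nowEpochSeconds) {
        if (remaining > 0) {
            return 0; // quota left, no need to wait
        }
        // Quota used up: wait until the reset time (never a negative duration)
        return Math.max(0, resetEpochSeconds - nowEpochSeconds);
    }

    public static void main(String[] args) {
        long now = 1_600_000_000L; // example timestamp, not real data
        System.out.println(secondsToWait(42, now + 600, now)); // 0
        System.out.println(secondsToWait(0, now + 600, now));  // 600
        System.out.println(secondsToWait(0, now - 10, now));   // 0
    }
}
```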

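Requesting a repository's API URL (with curl, for example) returns data in JSON format, which organizes data as key-value pairs.

Here I use Gson to parse the JSON data.

2.1 A first look at Gson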

    import com.google.gson.Gson;
    import com.google.gson.GsonBuilder;

    import java.util.HashMap;

    public class TestGson {
        public static void main(String[] args) {
            // 1. Create a Gson object
            Gson gson = new GsonBuilder().create();
            // 2. Convert key-value data into a JSON string
            HashMap<String, String> hashMap = new HashMap<>();
            hashMap.put("行者", "武松");
            hashMap.put("花和尚", "鲁智深");
            hashMap.put("及时雨", "宋江");
            String result = gson.toJson(hashMap);
            System.out.println(result);
        }
    }

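This prints a JSON object like the following (HashMap does not guarantee key order, so the order may differ):

    {"行者":"武松","花和尚":"鲁智深","及时雨":"宋江"}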
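To make the key-value format concrete, here is a tiny hand-written serializer that uses only the standard library. It is only an illustration of what the output looks like, not how Gson works internally: it handles flat string maps and escapes just quotes and backslashes. The names JsonSketch, quote and toJson are made up for this sketch. Using a LinkedHashMap keeps the keys in insertion order, which makes the output predictable.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class JsonSketch {
    // Wrap a string in double quotes, escaping backslashes and quotes
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Render a flat map as a JSON object: {"key":"value",...}
    static String toJson(Map<String, String> map) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> entry : map.entrySet()) {
            if (!first) {
                sb.append(",");
            }
            sb.append(quote(entry.getKey())).append(":").append(quote(entry.getValue()));
            first = false;
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> map = new LinkedHashMap<>();
        map.put("name", "awesome-java");
        map.put("owner", "akullpp");
        System.out.println(toJson(map)); // {"name":"awesome-java","owner":"akullpp"}
    }
}
```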

Converting a JSON string back into an object with matching fields:

    import com.google.gson.Gson;
    import com.google.gson.GsonBuilder;

    public class TestGson {
        static class Test {
            int aaa;
            int bbb;
        }

        public static void main(String[] args) {
            // 1. Create a Gson object
            Gson gson = new GsonBuilder().create();
            // 3. Convert a JSON string into an object
            String jsonString = "{ \"aaa\":1, \"bbb\":2}";
            // Test.class is the class object of Test; it tells Gson which type to build
            Test t = gson.fromJson(jsonString, Test.class);
            System.out.println(t.aaa);
            System.out.println(t.bbb);
        }
    }

2.2 Calling the GitHub API to get each project's repository information

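    // Extract the repository name ("owner/repo") from a project URL
    private String getRepoName(String url) {
        int lastOne = url.lastIndexOf("/");
        int lastTwo = url.lastIndexOf("/", lastOne - 1);
        if (lastOne == -1 || lastTwo == -1) {
            System.out.println("Invalid url: " + url);
            return null;
        }
        return url.substring(lastTwo + 1);
    }

    // Fetch the repository information (JSON) for a repository name.
    // Both methods belong inside the Crawler class; this one needs
    // import okhttp3.Credentials and an OkHttpClient field named okHttpClient.
    private String getRepoInfo(String repoName) throws IOException {
        String username = "superQlee";
        String password = "nobody577";
        // Authenticate: the username and password are Base64-encoded (not encrypted)
        // into one string that goes into the HTTP Authorization header.
        // GitHub's API no longer accepts account passwords here; a personal
        // access token has to be used in place of the password.
        String credential = Credentials.basic(username, password);
        String url = "https://api.github.com/repos/" + repoName;
        Request request = new Request.Builder().url(url).header("Authorization", credential).build();
        Call call = okHttpClient.newCall(request);
        try (Response response = call.execute()) {
            if (!response.isSuccessful()) {
                System.out.println("GitHub API request failed for repository " + repoName);
                return null;
            }
            return response.body().string();
        }
    }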
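The two steps above, extracting "owner/repo" from a project URL and building the Authorization header, can be tried out without any network access. The following is a standard-library sketch under stated assumptions, not the article's code: repoName additionally tolerates one trailing slash, and basicAuth builds a Basic header value by hand ("Basic " plus the Base64 of "user:secret"), which is the kind of value Credentials.basic returns. The class name and the sample credentials are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class RepoRequestSketch {
    // "https://github.com/akullpp/awesome-java" -> "akullpp/awesome-java"
    static String repoName(String url) {
        // Ignore a single trailing slash so it does not count as a separator
        String trimmed = url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
        int lastOne = trimmed.lastIndexOf('/');
        int lastTwo = trimmed.lastIndexOf('/', lastOne - 1);
        if (lastOne == -1 || lastTwo == -1) {
            return null; // fewer than two slashes: not a usable project URL
        }
        return trimmed.substring(lastTwo + 1);
    }

    // Value for the Authorization header: "Basic " + Base64("user:secret").
    // Base64 is an encoding, not encryption, so this only belongs on HTTPS.
    static String basicAuth(String user, String secret) {
        String pair = user + ":" + secret;
        byte[] bytes = pair.getBytes(StandardCharsets.ISO_8859_1);
        return "Basic " + Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        System.out.println(repoName("https://github.com/akullpp/awesome-java"));  // akullpp/awesome-java
        System.out.println(repoName("https://github.com/akullpp/awesome-java/")); // akullpp/awesome-java
        System.out.println(basicAuth("user", "pass")); // Basic dXNlcjpwYXNz
    }
}
```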

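At this point we can fetch the repository information for each project. Next we use Gson to extract the key fields and store them in the project list.

2.3 Parsing the repository API response with Gson to get key information such as stars

This part of the project also relies on reflection. The GitHub API returns the repository information as JSON, and I only want a few of its key-value pairs.

I use a HashMap to hold the contents of the JSON.

Gson then processes the JSON string with the help of the class object passed to it:

gson.fromJson() takes HashMap.class as its second argument, which tells it what type of object to build,

and it then fills that HashMap object with the key-value pairs from the JSON. When the target is an ordinary class instead, Gson uses reflection to discover the class's fields and fills each one from the JSON key of the same name.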
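The idea of filling an object from key-value pairs by reflection can be shown with the standard library alone. This is only a sketch of the idea, not Gson's actual implementation: it takes already-parsed values from a Map and copies each one into the field with the same name. The Repo class and the fill method are hypothetical; the field name stargazers_count matches the key GitHub uses for the star count in its repository JSON, and the value below is invented.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.util.Map;

final class ReflectionSketch {
    // Hypothetical target class; field names mirror the JSON keys we care about
    static class Repo {
        String name;
        int stargazers_count;
    }

    // Create an instance of type and copy matching map entries into its fields
    static <T> T fill(Class<T> type, Map<String, Object> values) {
        try {
            Constructor<T> constructor = type.getDeclaredConstructor();
            constructor.setAccessible(true);
            T target = constructor.newInstance();
            for (Field field : type.getDeclaredFields()) {
                if (values.containsKey(field.getName())) {
                    field.setAccessible(true);
                    field.set(target, values.get(field.getName()));
                }
            }
            return target;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot fill " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        // Made-up values standing in for parsed JSON; extra keys are ignored
        Map<String, Object> values = Map.of("name", "awesome-java", "stargazers_count", 100, "ignored", "x");
        Repo repo = fill(Repo.class, values);
        System.out.println(repo.name + " " + repo.stargazers_count); // awesome-java 100
    }
}
```

Keys with no matching field are skipped, which is how a few fields can be picked out of a large JSON document.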
