Article scraping API (4.1 General API rules: APIs use a very standard set of rules to generate data)

Published by 优采云: 2022-02-12 10:13

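　　In general, a programmer sends a request to an API over HTTP to retrieve some piece of information, and the API returns the server's response in XML or JSON format.

　　Using an API is not usually considered web scraping, but the techniques involved (both send HTTP requests) and the results (both retrieve information) are similar, and the two approaches often overlap and complement each other.

　　For example, you can combine Wikipedia edit histories (which record the IP addresses of anonymous editors) with an IP address resolution API to find the geographic location of the people editing Wikipedia articles.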

　　4.1 API Overview

　　Google APIs

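　　4.2 General API Rules

　　APIs follow a very standard set of rules to produce data, and the data they produce is organized in a very standard way.

　　Four methods: GET (retrieve a resource), POST (create or submit data), PUT (update a resource), DELETE (remove a resource)

　　Authentication: some APIs require the client to authenticate before they will respond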
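
　　The four methods and the authentication requirement can be illustrated with a minimal sketch that uses only the standard library. The endpoint `https://api.example.com/items/1`, the `build_request` helper, and the bearer-token header are assumptions made for illustration; the original names none of them, and real APIs differ in how they expect credentials. The requests are only built, never sent.

```python
import json
from urllib.request import Request

# Hypothetical endpoint and token, used only for illustration.
API_URL = "https://api.example.com/items/1"
API_TOKEN = "my-secret-token"

def build_request(method, url, payload=None, token=API_TOKEN):
    """Build (but do not send) an HTTP request for the given method."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = Request(url, data=data, method=method)
    # Many APIs expect credentials in a header; the exact scheme varies by API.
    request.add_header("Authorization", "Bearer " + token)
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request

requests_by_method = {
    "GET": build_request("GET", API_URL),                       # read a resource
    "POST": build_request("POST", API_URL, {"name": "new"}),    # create a resource
    "PUT": build_request("PUT", API_URL, {"name": "changed"}),  # update a resource
    "DELETE": build_request("DELETE", API_URL),                 # remove a resource
}

for method, request in requests_by_method.items():
    print(method, request.full_url, request.get_header("Authorization"))
```

　　Passing one of these objects to `urllib.request.urlopen` would actually send it; that step is left out here because the endpoint does not exist.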

　　4.3 Server Responses

　　Most API responses are formatted as XML or JSON.
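
　　To make the difference between the two formats concrete, here is a small sketch that parses the same record from a JSON string and from an XML string with the standard library. Both payloads and the helper names `parse_json_user` and `parse_xml_user` are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Made-up sample payloads describing the same record in both formats.
json_text = '{"user": {"id": 42, "name": "alice"}}'
xml_text = "<user><id>42</id><name>alice</name></user>"

def parse_json_user(text):
    """Extract the user record from a JSON document."""
    user = json.loads(text)["user"]
    return {"id": int(user["id"]), "name": user["name"]}

def parse_xml_user(text):
    """Extract the user record from an XML document."""
    root = ET.fromstring(text)
    # XML carries no type information, so the id is converted explicitly.
    return {"id": int(root.findtext("id")), "name": root.findtext("name")}

print(parse_json_user(json_text))
print(parse_xml_user(xml_text))
```

　　JSON maps directly onto Python dictionaries, lists, and numbers, while every XML value arrives as text and has to be converted by the caller.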

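　　In the past, server-side programs written in PHP or .NET were commonly the receiving end of an API. Today, JavaScript frameworks such as Angular or Backbone are also used to send and receive API calls.

　　API calls: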

　　4.4 The Echo Nest

　　The Echo Nest is a music data website.

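　　4.5 Twitter API

　　pip install twitter

from twitter import Twitter, OAuth

# The four credentials are omitted in the source; the strings below are placeholders
# to be replaced with your own values.
t = Twitter(auth=OAuth("<access token>", "<access token secret>", "<consumer key>", "<consumer secret>"))
pythonTweets = t.search.tweets(q="#python")
print(pythonTweets)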

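　　4.6 Google APIs

　　Whatever kind of information you want to work with, including language translation, geolocation, calendars, and even genomic data, Google provides an API for it. Google also offers APIs for some of its best-known applications, such as Gmail, YouTube, and Blogger.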

　　4.7 Parsing JSON

import json
from urllib.request import urlopen

def getCountry(ipAddress):
    # Query the geolocation service for this IP address and decode the JSON response
    response = urlopen("http://freegeoip.net/json/"+ipAddress).read().decode('utf-8')
    responseJson = json.loads(response)
    return responseJson.get("country_code")

print(getCountry("50.78.253.58"))
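
　　The parsing step can also be tried without any network access. The sketch below runs `json.loads` on a made-up response body; the field names other than `country_code` and the helper `get_country_code` are assumptions for illustration, and a real service may return a different structure.

```python
import json

# Made-up response body; a real service may use different field names.
sample_response = '''
{"ip": "50.78.253.58", "country_code": "US",
 "location": {"latitude": 38.0, "longitude": -97.0},
 "languages": ["en", "es"]}
'''

def get_country_code(response_text):
    """Return the country code from a JSON response, or None if it is missing or invalid."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    return data.get("country_code")

data = json.loads(sample_response)
print(get_country_code(sample_response))   # top-level key
print(data["location"]["latitude"])        # value inside a nested object
print(data["languages"][0])                # first element of an array
print(get_country_code("not json"))        # malformed input yields None
```

　　JSON objects become dictionaries and JSON arrays become lists, so nested values are reached with ordinary indexing; `dict.get` returns `None` instead of raising an error when a key is absent.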

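　　4.8 Returning to the Topic

　　This section combines multiple data sources into a new form and uses an API as a tool to interpret scraped data from a new angle.

　　First, write a basic program that crawls Wikipedia, finds the edit history page of each article, and extracts the IP addresses recorded in that edit history.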

# -*- coding: utf-8 -*-
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
import random
import re
import json

# Seed from system entropy; newer Python versions no longer accept a datetime object as a seed
random.seed()

# Starting article: https://en.wikipedia.org/wiki/Python_(programming_language)
def getLinks(articleUrl):
    # Return all internal article links found in the body of a Wikipedia page
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

def getHistoryIPs(pageUrl):
    # The edit history page URL has the form:
    # https://en.wikipedia.org/w/index.php?title=Python_(programming_language)&action=history
    pageUrl = pageUrl.replace("/wiki/", "")
    historyUrl = "https://en.wikipedia.org/w/index.php?title="+pageUrl+"&action=history"
    print("history url is: "+historyUrl)
    html = urlopen(historyUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Find the links whose class attribute is "mw-anonuserlink";
    # they show an IP address in place of a username
    ipAddresses = bsObj.findAll("a", {"class":"mw-anonuserlink"})
    addressList = set()
    for ipAddress in ipAddresses:
        addressList.add(ipAddress.get_text())
    return addressList

links = getLinks("/wiki/Python_(programming_language)")

def getCountry(ipAddress):
    # Return the country code for an IP address, or None if the lookup fails
    try:
        response = urlopen("http://freegeoip.net/json/"+ipAddress).read().decode('utf-8')
    except (HTTPError, URLError):
        return None
    responseJson = json.loads(response)
    return responseJson.get("country_code")

while (len(links) > 0):
    for link in links:
        print("-------------------")
        historyIPs = getHistoryIPs(link.attrs["href"])
        for historyIP in historyIPs:
            country = getCountry(historyIP)
            if country is not None:
                print(historyIP+" is from "+country)
    # Pick a random link from the current page and continue crawling from there
    newLink = links[random.randint(0, len(links)-1)].attrs["href"]
    links = getLinks(newLink)
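
　　Two pieces of the crawler can be checked offline: the regular expression that selects article links, and the construction of the edit history URL. The sketch below is one possible way to isolate them. The helper names `is_article_link` and `build_history_url` are hypothetical, and using `urlencode` to escape the title is a substitution for the plain string concatenation in the program above, not part of the original.

```python
import re
from urllib.parse import urlencode

# Same pattern as in getLinks: paths under /wiki/ that contain no colon,
# which skips namespace pages such as /wiki/Talk:... or /wiki/File:...
ARTICLE_LINK = re.compile("^(/wiki/)((?!:).)*$")

def is_article_link(href):
    """Return True if href looks like a link to an ordinary article."""
    return ARTICLE_LINK.match(href) is not None

def build_history_url(article_path):
    """Turn '/wiki/<title>' into the URL of that article's edit history."""
    title = article_path.replace("/wiki/", "", 1)
    # urlencode percent-escapes characters such as parentheses in the title.
    query = urlencode({"title": title, "action": "history"})
    return "https://en.wikipedia.org/w/index.php?" + query

print(is_article_link("/wiki/Python_(programming_language)"))
print(is_article_link("/wiki/Talk:Python"))
print(build_history_url("/wiki/Python_(programming_language)"))
```

　　Escaping the title matters mostly for article names that contain characters such as `&` or `?`, which would otherwise be read as part of the query string.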

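　　4.9 More on APIs

　　RESTful Web APIs by Leonard Richardson, Mike Amundsen, and Sam Ruby is a very comprehensive guide to the theory and practice of working with web APIs. In addition, Mike Amundsen's excellent video tutorial Designing APIs for the Web teaches you how to create your own APIs. If you want to share your scraped data in a convenient form, his videos are very useful.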
