Scraping Baidu POI Data with BeautifulSoup

Published on 优采云: 2020-08-07 05:03

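A lab project needed POI data for Shanghai, and a round of Baidu searching turned up no downloadable dataset. So I followed the blog post referenced at the end and crawled the data myself.

I am comfortable with Python, so here I share the Python version I wrote.

Baidu POI data is obtained by constructing a keyword-search URL and requesting it to get the JSON data it returns.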

...&wd=人民广场&c=289&pn=0

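wd: the search keyword

c: the city code

pn: the page number (results may span several pages)

The advantage of requesting data this way is that there appears to be no limit on the number of requests.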
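The three parameters above can be assembled into a request URL as in the following minimal sketch. It is written for Python 3 (the listings in this post are Python 2), the base URL and fixed parameters are copied from the full URL in the getPOI listing, and `build_search_url` is a hypothetical helper name rather than anything from the original code.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL and fixed parameters copied from the search URL used in getPOI.
BASE_URL = "http://map.baidu.com/"
FIXED_PARAMS = [
    ("newmap", "1"),
    ("reqflag", "pcmap"),
    ("biz", "1"),
    ("from", "webmap"),
    ("da_par", "direct"),
    ("pcevaname", "pc4.1"),
    ("qt", "s"),
    ("da_src", "searchBox.button"),
]


def build_search_url(keyword, city_code=289, page=0):
    """Return the keyword-search URL for one result page (hypothetical helper)."""
    params = FIXED_PARAMS + [("wd", keyword), ("c", city_code), ("pn", page)]
    # urlencode percent-encodes the keyword as UTF-8 by default.
    return BASE_URL + "?" + urlencode(params)


url = build_search_url("人民广场")
# Parse the query string back to check that the keyword survives encoding.
query = parse_qs(urlsplit(url).query)
print(url)
print(query["wd"], query["c"], query["pn"])
```

Passing a different `page` value changes only the trailing `pn` parameter, which is how further result pages would be requested.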

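There are two steps:

1. Prepare the search keywords

Keyword source website:

1) Select the city: Shanghai

2) There are many POI categories:

My goal is to collect the detailed POI keywords.

First, fetch the URL of each category and save them to keyword-1.txt: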

    import urllib2
    import urllib
    import json
    from bs4 import BeautifulSoup

    def write2txt(data, filepath):
        # Append each unicode record to the file, encoded as GBK.
        with open(filepath, 'a') as f:
            for d in data:
                f.write(d.encode('gbk'))

    def example3_bs4():
        # Download the Shanghai POI index page.
        request = urllib2.Request('http://poi.mapbar.com/shanghai/')
        page = urllib2.urlopen(request)
        data = page.read()
        data = data.decode('utf-8')
        soup = BeautifulSoup(data, 'html.parser')
        # Collect every link as "href|text"; skip anchors without an href.
        tags = soup.select('a')
        res = [t['href'] + '|' + t.get_text() + '\n' for t in tags if t.get('href')]
        #print res
        write2txt(res, 'keyword-1.txt')

3) Get the detailed POI keywords under each category

Each category contains more detailed POI entries:

The keywords are saved to keyword-2.txt.

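    def getKeyWords():
        with open('keyword-1.txt') as f:
            for line in f:
                # Each line is "url|category name", GBK-encoded by write2txt.
                url, wd = line.decode('gbk').split('|', 1)
                print url, wd
                request = urllib2.Request(url)
                page = urllib2.urlopen(request)
                data = page.read().decode('utf-8')
                soup = BeautifulSoup(data, 'html.parser')
                # The detailed keywords are the links inside <dd> elements.
                tags = soup.select('dd a')
                # wd[:-1] drops the trailing newline of the category name.
                res = [wd[:-1] + '|' + t['href'] + '|' + t.get_text() + '\n' for t in tags]
                print len(res)
                write2txt(res, 'keyword-2.txt')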
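Both keyword files hand data to the next step as pipe-delimited, GBK-encoded lines, so a stray `|` or newline inside a field would break the later split. The following is a small Python 3 sketch of that record format under stated assumptions: `format_record` and `parse_record` are hypothetical helpers, and the sample values are made up.

```python
def format_record(*fields):
    """Join fields into one pipe-delimited line (hypothetical helper)."""
    for field in fields:
        # Reject values that would corrupt the line-based, pipe-delimited format.
        if "|" in field or "\n" in field:
            raise ValueError("field contains a delimiter: %r" % field)
    return "|".join(fields) + "\n"


def parse_record(line, expected_fields):
    """Split one line back into its fields, checking the field count."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != expected_fields:
        raise ValueError("expected %d fields, got %d" % (expected_fields, len(fields)))
    return fields


# Sample values are made up for illustration.
line = format_record("美食", "http://example.com/category/", "火锅")
# Round-trip through GBK, the encoding write2txt uses for the files.
restored = line.encode("gbk").decode("gbk")
category, href, keyword = parse_record(restored, 3)
print(category, href, keyword)
```

Checking the field count on read makes a separator mismatch between the writer and the reader fail loudly instead of silently producing wrong keywords.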

2. Simulate the keyword search

Build a URL of this form:

...&wd=人民广场&c=289&pn=0

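You can open this URL in a browser to inspect the structure of the returned JSON string:

The information I need is in content, which holds an array. Each object in the array describes one POI, and 10 objects make up one page. To fetch more pages, set pn to the page number in the URL.

I only use the first page here.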
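To make the response structure described above concrete, here is a minimal Python 3 sketch that walks the content array of one page. The sample response is made up. The field names `acc_flag`, `diPointX`, `diPointY` and `tel` come from the code in this post, while `name` is an assumption about the response, and `extract_pois` is a hypothetical helper.

```python
import json

# A made-up response for illustration. Only acc_flag, diPointX, diPointY and
# tel appear in this post's code; the "name" field is an assumption.
SAMPLE = json.dumps({
    "content": [
        {"acc_flag": 0, "name": "POI A", "diPointX": 1352313400,
         "diPointY": 359412800, "tel": "(021)00000000"},
        {"acc_flag": 0, "name": "POI B", "diPointX": 1352300000,
         "diPointY": 359400000},
    ]
})


def extract_pois(response_text):
    """Return (name, x, y, tel) tuples from one result page."""
    res = json.loads(response_text)
    contents = res.get("content") or []
    # Mirror the post's check: only treat entries as POIs when the first
    # one carries an acc_flag field.
    if not contents or "acc_flag" not in contents[0]:
        return []
    pois = []
    for d in contents:
        # The post divides the raw values by 100 before coordinate conversion.
        x = float(d["diPointX"]) / 100.0
        y = float(d["diPointY"]) / 100.0
        pois.append((d.get("name", ""), x, y, d.get("tel", "")))
    return pois


pois = extract_pois(SAMPLE)
for poi in pois:
    print(poi)
```

Returning an empty list for a page without usable content lets a caller that loops over pn values stop as soon as a page comes back empty.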

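    def getPOI():
        with open('keyword-2.txt') as f:
            for line in f:
                data = []
                # keyword-2.txt lines are "category|href|keyword", GBK-encoded.
                Type, url, wd = line.decode('gbk').rstrip('\n').split('|', 2)
                #print Type,url,wd
                # Percent-encode the keyword as UTF-8 bytes for the query string.
                url = 'http://map.baidu.com/?newmap=1&reqflag=pcmap&biz=1&from=webmap&da_par=direct&pcevaname=pc4.1&qt=s&da_src=searchBox.button&wd=%s&c=289&pn=0' % urllib.quote(wd.encode('utf-8'))
                request = urllib2.Request(url)
                try:
                    page = urllib2.urlopen(request)
                    res = json.load(page)
                    if 'content' in res:
                        contents = res['content']
                        if 'acc_flag' in contents[0]:
                            for d in contents:
                                x, y = float(d['diPointX']), float(d['diPointY'])
                                # Replace YOUR_DEVELOPER_KEY with your own Baidu developer key.
                                ss = "http://api.map.baidu.com/geoconv/v1/?coords=%s,%s&from=6&to=5&ak=YOUR_DEVELOPER_KEY" % (x/100.0, y/100.0)
                                pos = json.load(urllib2.urlopen(ss))
                                if pos['status'] == 0:
                                    x, y = pos['result'][0]['x'], pos['result'][0]['y']
                                tel = ''
                                if 'tel' in d:
                                    tel = d['tel']
                                # ASSUMPTION: the original listing never fills `data`;
                                # the record layout and the 'name' field are guesses.
                                data.append(u'%s|%s|%s|%s|%s\n' % (Type, d.get('name', u''), x, y, tel))
                    if data:
                        write2txt(data, 'poi_info.txt')
                except Exception as e:
                    # Network errors, bad JSON and missing fields all end up here.
                    print 'request failed:', e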

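Note that the coordinate-conversion API used here requires applying for a Baidu developer key, and conversions are limited to 100,000 per day.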
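The conversion step and its daily quota can be sketched as follows in Python 3. The endpoint and the `from=6`, `to=5` values are copied from the listing in this post. `build_geoconv_url`, `converted_or_raw` and `ConversionBudget` are hypothetical names, and the quota counter is an illustration of one way to respect the limit, not part of the original code.

```python
from urllib.parse import urlencode

GEOCONV_URL = "http://api.map.baidu.com/geoconv/v1/"
DAILY_LIMIT = 100000  # conversions per day, as stated in this post


def build_geoconv_url(raw_x, raw_y, ak):
    """Build the conversion request for one point (from=6, to=5 as in the post)."""
    # The post divides the raw diPointX/diPointY values by 100 first.
    coords = "%s,%s" % (raw_x / 100.0, raw_y / 100.0)
    return GEOCONV_URL + "?" + urlencode(
        {"coords": coords, "from": 6, "to": 5, "ak": ak}, safe=","
    )


def converted_or_raw(pos, raw_x, raw_y):
    """Use the converted point when status is 0, otherwise keep the raw values."""
    if pos.get("status") == 0 and pos.get("result"):
        return pos["result"][0]["x"], pos["result"][0]["y"]
    return raw_x, raw_y


class ConversionBudget:
    """Hypothetical counter that refuses calls once the daily quota is used."""

    def __init__(self, limit=DAILY_LIMIT):
        self.remaining = limit

    def take(self):
        # Return False instead of raising so a crawl loop can simply stop.
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


budget = ConversionBudget()
if budget.take():
    print(build_geoconv_url(1352313400.0, 359412800.0, "YOUR_DEVELOPER_KEY"))
```

Calling `take()` before each conversion request keeps a long crawl from running past the quota partway through a day.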

In the end I crawled only 180,000 POI records, which was enough for the project.

Reference blog post:

Getting Baidu Maps POI data:
