Web table scraping (Accounting qualification exam: how do you extract data from a URL? Solution)

Published by 优采云: 2022-04-19 21:10


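How do I extract the contents of a data table from an HTML page?

The data comes from the URL requested below.

To be clear, I only need to scrape one table, and I would like to extract the data from that key table.

The data I want to capture is a trading report, but the HTML tags it sits in make the data hard to extract.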

Short-selling turnover / Turnover

Code  Stock name  Shares (SH)  Amount ($)  Shares (SH)  Amount ($)

1 長和　　　　　　 299,500 27,572,475 2,201,171 202,964,029

2 中電控股　　　　 61,000 4,622,825 1,452,853 110,040,699

3 香港中華煤氣　　 2,939,000 42,694,880 8,024,558 116,691,466

4 九龍倉集團　　　 297,000 17,349,550 3,136,238 183,105,286

5 匯豐控股　　　　 1,102,800 73,202,940 8,630,868 572,622,103

6 電能實業　　　　 1,016,500 76,262,725 4,876,990 365,926,231

8 電訊盈科　　　　 731,000 3,478,240 13,579,323 64,672,175

10 恒隆集團　　　　 172,000 5,209,850 967,980 29,308,292

11 恒生銀行　　　　 189,000 30,047,370 1,075,185 170,873,130

12 恒基地產　　　　 94,000 4,025,500 1,382,533 59,183,598

14 希慎興業　　　　 33,000 1,167,900 642,424 22,747,393

16 新鴻基地產　　　 425,000 45,490,800 1,635,959 175,284,039

17 新世界發展　　　 651,000 5,833,670 10,135,381 90,633,244

19 太古股份公司Ａ　 132,000 10,405,600 554,962 43,709,235

20 會德豐　　　　　 72,000 3,407,750 683,368 32,286,993

23 東亞銀行　　　　 451,600 14,991,890 1,817,000 60,295,348

27 銀河娛樂　　　　 1,134,000 40,408,550 15,089,117 538,712,668

31 航天控股　　　　 210,000 211,580 4,367,526 4,386,198

34 九龍建業　　　　 31,000 228,260 292,000 2,156,291

35 遠東發展　　　　 10,000 33,600 428,075 1,440,321

38 第一拖拉機股份　 8,000 38,200 1,634,000 7,825,940

41 鷹君　　　　　　 12,000 422,400 470,146 16,546,562

45 大酒店　　　　　 35,500 305,605 503,559 4,335,522

import requests
from bs4 import BeautifulSoup

url = "http://www.hkex.com.hk/chi/stat/smstat/dayquot/d20170202c.htm"
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, "lxml")

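How should I extract the table's data from here?

Why go to all the trouble of BeautifulSoup? That is using a butcher's knife to kill a chicken.

Your page is just rows of data, one record per line, and the format could not be simpler.

You could copy the data straight from the page, save it as a txt file, and then extract the fields with readline, split, and regular expressions, right?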
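A minimal sketch of that idea, assuming the report text has already been copied out of the page. The sample lines are taken from the excerpt in the question, and the name `parse_lines` is made up for illustration. It relies on `str.split()` with no arguments, which also treats the full-width spaces that pad the stock names as whitespace.

```python
import re

# Sample lines copied from the report excerpt; in practice they would be read
# from the saved text file instead.
SAMPLE = """\
1 長和　　　　　　 299,500 27,572,475 2,201,171 202,964,029
2 中電控股　　　　 61,000 4,622,825 1,452,853 110,040,699
not a data line
19 太古股份公司Ａ　 132,000 10,405,600 554,962 43,709,235
"""

# A comma-grouped integer such as 299,500
NUMBER = re.compile(r"^\d{1,3}(,\d{3})*$")


def parse_lines(text):
    """Return [code, name, int, int, int, int] for each line that looks like a data row."""
    rows = []
    for line in text.splitlines():
        parts = line.split()  # no-argument split also splits on full-width spaces
        if len(parts) < 6 or not parts[0].isdigit():
            continue
        numbers = parts[-4:]
        if not all(NUMBER.match(n) for n in numbers):
            continue
        code = parts[0]
        name = "".join(parts[1:-4])  # rejoin a name that contained spaces
        rows.append([code, name] + [int(n.replace(",", "")) for n in numbers])
    return rows


rows = parse_lines(SAMPLE)
for row in rows:
    print(row)
```

Lines that do not start with a numeric code or do not end in four numbers are skipped, so headers and separators in the saved file should not need to be removed by hand.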

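Solution 1:

First locate the short-selling section with a = soup.find('a', attrs={'name': 'short_selling'}). Then follow the adjacent pre -> font elements downward, line by line, and stop at the first line that does not have 6 columns.

The result looks like this: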

　　[['代號', '股票名稱', '股數(SH)', '金額($)', '股數(SH)', '金額($)'],

['1', '長和', '299,500', '27,572,475', '2,201,171', '202,964,029'],

['2', '中電控股', '61,000', '4,622,825', '1,452,853', '110,040,699'],

['3', '香港中華煤氣', '2,939,000', '42,694,880', '8,024,558', '116,691,466'],

....

Source code

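import pprint

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.hkex.com.hk/chi/stat/smstat/dayquot/d170202c.htm')
r.encoding = 'big5'  # decode the page as Big5
soup = BeautifulSoup(r.text, 'lxml')

# The short-selling section is marked by <a name="short_selling">.
a = soup.find('a', attrs={'name': 'short_selling'})
data = []

# The enclosing <pre> holds the header; keep its 6-column lines.
pre = a.find_parent('pre')
for line in pre.font.text.splitlines():
    item = line.strip().split()
    if len(item) == 6:
        data.append(item)

# Walk the following <pre> blocks until a line has fewer than 6 columns.
end = False
for next_pre in pre.find_next_siblings('pre'):  # skips text nodes between tags
    for line in next_pre.font.text.splitlines():
        item = line.strip().split()
        if len(item) > 6:
            # A stock name containing spaces was split apart; rejoin it.
            item = item[:1] + ["".join(item[1:-4])] + item[-4:]
        elif len(item) < 6:
            end = True
            break
        data.append(item)
    if end:
        break

pprint.pprint(data)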

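Let me suggest an approach.

This data is all plain text with no tags wrapped around the individual fields, and capturing the network traffic did not reveal any dedicated data-query API either. So the data is most likely generated on the server and sent to the browser hard-coded in the HTML.

Notice, though, that every field of these records occupies the same fixed width and is right-aligned, and each field has its own specific format, so the fields can be extracted with regular expressions.

I will leave the actual implementation to you.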
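For readers who want a starting point, here is one way the regular-expression approach might look. It is a sketch under assumptions, not a finished implementation: the pattern `ROW`, the helper `extract_rows`, and the field names are illustrative, the sample text is taken from the excerpt in the question, and the pattern assumes each row is a numeric code, a name, and then four comma-grouped numbers.

```python
import re

# Assumed row layout: code, stock name, then four comma-grouped integers.
# In Python 3, \s also matches the full-width spaces that pad the names.
ROW = re.compile(
    r"^\s*(?P<code>\d+)\s+"
    r"(?P<name>.+?)\s+"
    r"(?P<short_shares>\d[\d,]*)\s+(?P<short_amount>\d[\d,]*)\s+"
    r"(?P<shares>\d[\d,]*)\s+(?P<amount>\d[\d,]*)\s*$"
)

SAMPLE = """\
5 匯豐控股　　　　 1,102,800 73,202,940 8,630,868 572,622,103
27 銀河娛樂　　　　 1,134,000 40,408,550 15,089,117 538,712,668
"""


def to_int(text):
    """Convert a comma-grouped number such as '1,102,800' to an int."""
    return int(text.replace(",", ""))


def extract_rows(text):
    """Return one dict per line that matches the assumed row layout."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # header, separator, or any other non-data line
        rows.append({
            "code": m["code"],
            "name": m["name"],
            "short_shares": to_int(m["short_shares"]),
            "short_amount": to_int(m["short_amount"]),
            "shares": to_int(m["shares"]),
            "amount": to_int(m["amount"]),
        })
    return rows


result = extract_rows(SAMPLE)
for row in result:
    print(row)
```

Matching line by line, rather than running one multi-line pattern over the whole page, keeps a row from accidentally spanning two lines, since `\s` would otherwise match the newline as well.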
