Web page source scraping tool (a sample program using Python's built-in HTMLParser)

优采云 · Published: 2022-04-08 03:23


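  This program uses Python's built-in HTMLParser to scrape a few fields from a given Yahoo Finance page. It is about 30 lines of code, simple and practical, a household essential :)

  The code is adapted from the HTMLParser example program in the official documentation.

  The full code and write-up are at:

  The code is as follows: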

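    # Python 2 code: it relies on the print statement, urllib.urlopen,
    # the string module functions and the HTMLParser module.
    import urllib
    import sys
    import string
    from HTMLParser import HTMLParser

    ticker_list = ["ibb", "socl", "pnqi", "qqq", "vbk", "eirl", "ewi", "pbd", "ita", "dfe"]
    ticker = ticker_list[0]

    class MyHTMLParser(HTMLParser):
        def handle_data(self, data):
            # Raw text of the most recently opened start tag
            starttag_text = self.get_starttag_text()
            ticker_str = "(%s)" % ticker
            # Fund name: text that contains "(TICKER)".
            # The second search string is empty as published, so that test always passes.
            if -1!=string.find(data, ticker_str.upper()) and -1!=string.find(starttag_text, ""):
                sys.stdout.write(data)
            # First price field: tag mentions yfs_g53_<ticker>, text has no "-"
            if -1!=string.find(str(starttag_text), "yfs_g53_%s" % ticker.lower()) and -1==string.find(data, "-"):
                sys.stdout.write("\t")
                sys.stdout.write(data)
            # Second price field: tag mentions yfs_h53_<ticker>; print ends the line
            if -1!=string.find(str(starttag_text), "yfs_h53_%s" % ticker.lower()):
                print "\t", data

    for t in ticker_list:
        ticker = t  # module-level name read by handle_data
        parser = MyHTMLParser()
        f = urllib.urlopen("http://finance.yahoo.com/q?s=%s" % ticker)
        html_string = f.read()
        parser.feed(html_string)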

  Sample output:

    iShares Nasdaq Biotechnology (IBB) 228.14 234.90
    Global X Social Media Index ETF (SOCL) 17.38 17.92
    PowerShares NASDAQ Internet (PNQI) 61.73 63.17
    PowerShares QQQ (QQQ) 87.31 88.15
    Vanguard Small Cap Growth ETF (VBK) 118.53 120.54
    iShares MSCI Ireland Capped (EIRL) 38.37 38.84
    iShares MSCI Italy Capped (EWI) 17.95 18.09
    PowerShares Global Clean Energy (PBD) 12.95 13.12
    Defense (ITA) 107.93 109.36
    WisdomTree Europe SmallCap Dividend (DFE) 62.08 62.64
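
  The listing above is Python 2. For readers on Python 3, the same idea (subclass the parser and pick out text by the id of the enclosing tag) might be sketched as follows. This is only an illustration under stated assumptions, not the original author's code: the class QuoteParser, the function extract_fields and the SAMPLE_HTML snippet are all hypothetical, and the markup is invented to resemble what the yfs_g53_ and yfs_h53_ searches imply. The source does not say what the two fields mean, so the sketch simply returns them in order, and it parses an inline string rather than fetching a live page.

```python
from html.parser import HTMLParser


class QuoteParser(HTMLParser):
    """Collect the text of elements whose id attribute is in a wanted set."""

    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted_ids = set(wanted_ids)
        self.current_id = None  # id of the wanted element we are inside, if any
        self.fields = {}        # id -> text collected so far

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)  # attrs is a list of (name, value) pairs
        if attr_map.get("id") in self.wanted_ids:
            self.current_id = attr_map["id"]
            self.fields[self.current_id] = ""

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the capture, which is
        # enough for flat elements such as <span id="...">text</span>.
        self.current_id = None

    def handle_data(self, data):
        if self.current_id is not None:
            self.fields[self.current_id] += data.strip()


# Hypothetical markup, invented for this sketch (not real Yahoo Finance HTML).
SAMPLE_HTML = """
<div><h2>iShares Nasdaq Biotechnology (IBB)</h2>
<span id="yfs_g53_ibb">228.14</span>
<span id="yfs_h53_ibb">234.90</span></div>
"""


def extract_fields(html, ticker):
    """Return the text of the yfs_g53_<ticker> and yfs_h53_<ticker> elements."""
    first_id = "yfs_g53_%s" % ticker.lower()
    second_id = "yfs_h53_%s" % ticker.lower()
    parser = QuoteParser([first_id, second_id])
    parser.feed(html)
    parser.close()
    return parser.fields.get(first_id), parser.fields.get(second_id)


print(extract_fields(SAMPLE_HTML, "ibb"))
```

  Matching on the parsed id attribute in handle_starttag, instead of searching the raw start-tag text inside handle_data, keeps the parser free of module-level state such as the ticker variable. Either field comes back as None when its id is absent from the page.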
