Scraping web pages online (A Concise Python Tutorial: Python basics tutorial, part 2)

优采云 · Published: 2022-02-20 01:09


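(1) A project required me to crawl some related web pages, and I had been wanting to learn Python anyway, so I started with A Concise Python Tutorial (Python简明教程). It is short, but it gets you up and running quickly. I have always found example-driven learning the most effective approach, so the best way to reinforce what I was learning was to work directly on how to crawl a web page.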

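Python ships with libraries that make this kind of task convenient. The script here uses Python 2's urllib2 and sgmllib. For HTML processing, Python 2 provides three standard modules: sgmllib, htmllib, and HTMLParser. This article uses sgmllib, but from what I have read, the third-party tool BeautifulSoup is the best option because it copes with poorly formed HTML, so that is the one to learn next.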

(2) Script code

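import sgmllib
import urllib2


class LinksParser(sgmllib.SGMLParser):
    """Collect unique absolute links from <a href="..."> tags (Python 2)."""

    def reset(self):
        # SGMLParser.__init__ calls reset(), so each parser gets its own list
        # instead of sharing one class-level list across instances.
        sgmllib.SGMLParser.reset(self)
        self.urls = []

    def do_a(self, attrs):
        # attrs is a list of (name, value) pairs for the <a> tag.
        for name, value in attrs:
            if name == 'href' and value not in self.urls:
                # Keep only absolute links; relative ones are skipped.
                if value.startswith('http'):
                    self.urls.append(value)
                    print value


p = LinksParser()
f = urllib2.urlopen('http://www.baidu.com')
#f = urllib2.urlopen('https://www.googlestable.com/search?hl=zh-CN&site=&source=hp&q=%E9%BB%84%E6%B8%A4++%E6%B3%B0%E5%9B%A7&btnK=Google+%E6%90%9C%E7%B4%A)
p.feed(f.read())
f.close()
p.close()  # flush any buffered data before reading the results

for url in p.urls:
    print url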
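The script above runs only on Python 2, because urllib2 and sgmllib are not available in Python 3. As a rough sketch of the same idea on Python 3, the version below swaps in the standard html.parser and urllib.request modules. The helper names extract_links and fetch_links are made up for this illustration, and the UTF-8 fallback for pages that do not declare a charset is an assumption.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinksParser(HTMLParser):
    """Collect unique absolute links from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []  # per-instance list, in order of first appearance

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None when an attribute is written without a value
            if name == "href" and value and value.startswith("http"):
                if value not in self.urls:
                    self.urls.append(value)


def extract_links(html):
    """Hypothetical helper: parse an HTML string and return its links."""
    parser = LinksParser()
    parser.feed(html)
    parser.close()
    return parser.urls


def fetch_links(url):
    """Hypothetical helper: download a page and return its links."""
    with urlopen(url) as response:
        # Assumption: fall back to UTF-8 when no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_links(html)


# Offline demonstration on an inline snippet (no network access needed).
sample = (
    '<a href="http://example.com/a">A</a>'
    '<a href="/local">B</a>'
    '<a href="http://example.com/a">A again</a>'
)
print(extract_links(sample))
```

To crawl a live page as in the original script, call fetch_links('http://www.baidu.com') and print each returned link. As in the original, relative links such as /local are skipped and duplicates are stored only once.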
