Web page source scraping tool (based on 虫师's blog post "python实现简单爬虫功能", illustrated)

优采云 · Published: 2021-10-14 05:04

Reference: http://www.cnblogs.com/fnng/p/3576154.html

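This post follows 虫师's blog post "Python实现简单爬虫功能" (implementing a simple crawler in Python). After working through that post, the script below scrapes the images from a web page, downloads them, and saves them locally.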

Source code:

#!/usr/bin/python
# coding:utf-8
# Python 2 script: urllib.urlopen and urllib.urlretrieve moved to urllib.request in Python 3.

# Import the urllib and re modules
import urllib
import re


# Fetch a page and return its HTML source.
def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html


# Compile the image pattern into a regular expression object, then loop over
# the matches and download each image with urllib.urlretrieve().
def getImg(html):
    reg = r'src="(.+?\.png)"'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    x = 0
    for imgurl in imglist:
        # Files are named 0.png, 1.png, ... in the current directory
        urllib.urlretrieve(imgurl, '%s.png' % x)
        x += 1


html = getHtml("http://www.cnblogs.com/fnng/p/3576154.html")
# Download every matched image
getImg(html)
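The listing above targets Python 2. For readers on Python 3, the same two steps (fetch the page, then pull out the `.png` links with the regular expression) might look roughly like the sketch below. The function names `fetch_html`, `extract_png_urls`, and `download_images` are made up for this illustration, and decoding the page as UTF-8 is an assumption, not something the original post states.

```python
import re
import urllib.request

# Same pattern as the original script: capture the value of src="...png".
IMG_PATTERN = re.compile(r'src="(.+?\.png)"')


def fetch_html(url):
    """Download a page and return its source as text (UTF-8 is assumed)."""
    with urllib.request.urlopen(url) as page:
        return page.read().decode("utf-8", errors="replace")


def extract_png_urls(html):
    """Return every src value ending in .png, in document order."""
    return IMG_PATTERN.findall(html)


def download_images(img_urls):
    """Save each URL as 0.png, 1.png, ... in the current directory."""
    for index, img_url in enumerate(img_urls):
        urllib.request.urlretrieve(img_url, "%d.png" % index)
```

Calling `download_images(extract_png_urls(fetch_html(url)))` would then reproduce what the original script does. One caveat about the pattern: `.+?` matches any character except a newline, so a non-PNG `src` followed by a PNG `src` on the same line is captured as one long match. A stricter class such as `[^"]+` in place of `.+?` may be safer on real pages.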

2. Downloaded images as seen in the terminal:

spdbmadeMacBook-Pro:crawler spdbma$ ls
0.png 2.png 4.png 6.png
1.png 3.png 5.png getjpg.py
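The seven files 0.png through 6.png are what the counter-based naming produces for seven regex matches. The sketch below shows that mapping in isolation. It also resolves each `src` against the page URL with `urljoin`, because a relative `src` cannot be downloaded as-is. Whether the target page uses relative paths is an assumption here, and `plan_downloads` is a hypothetical helper that is not part of the original script.

```python
from urllib.parse import urljoin


def plan_downloads(page_url, srcs):
    """Pair each image src with an absolute URL and a numbered filename."""
    plan = []
    for index, src in enumerate(srcs):
        # urljoin leaves absolute URLs unchanged and resolves relative ones
        absolute = urljoin(page_url, src)
        plan.append((absolute, "%d.png" % index))
    return plan
```

Each pair in the returned list can then be passed to a download call such as `urlretrieve(absolute, filename)`.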
