Scraping Web Page Data with Python (Learning Python: a crawler for HTML, in six parts)

优采云 · Published: 2022-03-19 00:07

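1. Overview

I have been learning Python recently and became very interested in web crawlers, so I started this blog to record what I learn.

Instead of the urllib2 module that some online tutorials use, I went straight to the requests module, and it turned out to be really simple.

I use the crawler to scrape news content from Sina News.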

2. Fetching the HTML with requests

Straight to the code:

    import requests

    newsurl = "http://news.sina.com.cn/china/"
    res = requests.get(newsurl)
    print(res.text)

The output is garbled, so check which encoding was used:

    print(res.encoding)  # show the encoding requests chose

To read the Chinese text correctly, the response has to be decoded as "utf-8".
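To see why the text comes out garbled, it can help to decode the same bytes with two different codecs. The sketch below uses only the standard library. The sample string is made up, and ISO-8859-1 is an assumption about the wrong codec: requests falls back to it for text responses whose headers declare no charset, which may or may not be what happened with this page.

```python
# Decode the same UTF-8 bytes with the wrong codec and with the right one.
text = "新浪新闻"                    # sample string (assumption, not taken from the page)
raw = text.encode("utf-8")          # the bytes a UTF-8 page would send

garbled = raw.decode("iso-8859-1")  # wrong codec: each byte becomes one character
fixed = raw.decode("utf-8")         # right codec: the original text comes back

print(repr(garbled))
print(fixed)
```

Setting `res.encoding = 'utf-8'` before reading `res.text` has the same effect as the second decode.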

In the end, the Python code for requesting the link is:

    import requests

    newurl = 'http://news.sina.com.cn/china/'
    res = requests.get(newurl)
    res.encoding = 'utf-8'  # decode the response as UTF-8
    print(res.text)
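The same request could be wrapped in a small helper that also fails loudly on HTTP errors and does not wait forever. This is only a sketch: `fetch_html`, its `get` parameter, and the 10-second timeout are names and values chosen here for illustration and are not part of the original post.

```python
import requests


def fetch_html(url, encoding="utf-8", timeout=10, get=requests.get):
    """Download a page and return its text decoded with `encoding`.

    `get` defaults to requests.get; it is a parameter only so the helper
    can be exercised without network access (an assumption of this sketch).
    """
    res = get(url, timeout=timeout)  # give up if the server does not answer in time
    res.raise_for_status()           # raise an error on 4xx/5xx responses
    res.encoding = encoding          # override the guessed encoding before reading .text
    return res.text
```

Calling `fetch_html('http://news.sina.com.cn/china/')` would then return the decoded page as a string.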

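3. Parsing the page with BeautifulSoup4

Here I have this HTML snippet:

    html_sample = """
    <html>
      <body>
        <h1>Hello World</h1>
        <a href="#">This is link1</a>
        <a href="#">This is link2</a>
      </body>
    </html>
    """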

Next, import the BeautifulSoup4 library:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_sample, 'html.parser')  # the parser is html.parser
    print(soup.text)  # get the text content

    # Find all HTML elements with a given tag
    soup = BeautifulSoup(html_sample, 'html.parser')
    header = soup.select("h1")
    print(header)          # returns a Python list
    print(header[0])       # the first element, without the list brackets
    print(header[0].text)  # extract the text
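Because `select` returns a list, indexing it with `[0]` raises an IndexError when nothing matches. A minimal sketch of one way to guard against that uses `select_one`, which returns None for no match. The helper name `first_text` and the one-line sample are assumptions made for this example.

```python
from bs4 import BeautifulSoup

html_sample = "<html><body><h1>Hello World</h1></body></html>"  # hypothetical sample
soup = BeautifulSoup(html_sample, "html.parser")


def first_text(soup, selector, default=""):
    """Return the stripped text of the first match, or `default` if none."""
    node = soup.select_one(selector)  # None when the selector matches nothing
    return node.get_text(strip=True) if node is not None else default


print(first_text(soup, "h1"))
print(first_text(soup, "h2", default="(no h2 found)"))
```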

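4. Implementing other similar functionality

A few things to note about the code above:

a) In a selector, an id is prefixed with a hash sign (#) and a class is prefixed with a period (.)

b) In the last snippet, check whether each string has length 0; only strings with non-zero length need to be parsed, and the rest can be skipped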
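Both notes can be sketched on a small sample. The id `title`, the class `link`, and the `href` values below are made up for illustration and do not come from the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the id "title" and the class "link" are invented here.
html_sample = """
<html><body>
  <h1 id="title">Hello World</h1>
  <a class="link" href="#one">This is link1</a>
  <a class="link" href="#two">   </a>
  <a class="link" href="#three">This is link2</a>
</body></html>
"""
soup = BeautifulSoup(html_sample, "html.parser")

title = soup.select("#title")[0].text  # "#" selects by id
links = soup.select(".link")           # "." selects by class

# Keep only the links whose text is non-empty after stripping whitespace.
texts = [a.text.strip() for a in links if len(a.text.strip()) > 0]
hrefs = [a["href"] for a in links]     # attributes are read like dictionary keys

print(title)
print(texts)
print(hrefs)
```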

5. Scraping the page content

    # Fetch the article page
    import requests
    from bs4 import BeautifulSoup

    url = "http://news.sina.com.cn/c/nd/2017-02-27/doc-ifyavvsh6939815.shtml"
    res = requests.get(url)
    res.encoding = "utf-8"
    print(res.text)
    soup = BeautifulSoup(res.text, 'html.parser')

    # Title
    soup.select("#artibodyTitle")[0].text

    # Source and time
    soup.select('.time-source')[0]

    # contents splits the element's children into a list
    soup.select('.time-source')[0].contents[0].strip()  # strip() removes surrounding whitespace

    # Article body
    article = []
    for p in soup.select('#artibody p')[:-1]:
        article.append(p.text.strip())
    " ".join(article)  # join the paragraphs with spaces
    # Equivalent list comprehension
    [p.text.strip() for p in soup.select("#artibody p")[:-1]]

    # Editor's name; strip() removes the label characters "责任编辑：" (responsible editor) from both ends
    editor = soup.select('.article-editor')[0].text.strip("责任编辑：")

    # Comment count
    soup.select("#commentCount1")
    # Find where the comment count comes from
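The individual steps above could be collected into one function that returns a dictionary. The sketch below is not the author's code. It makes several assumptions: the function name `parse_article`, the timestamp layout `2017年02月27日09:32`, the editor label `责任编辑：`, and the sample HTML, which only imitates the selectors used above and is not the real Sina page. Note that `str.strip(chars)` treats its argument as a set of characters, so the sketch removes the label as a whole prefix instead.

```python
from datetime import datetime

from bs4 import BeautifulSoup

EDITOR_LABEL = "责任编辑："  # assumed label in front of the editor's name


def parse_article(html):
    """Extract title, time, body text and editor from one article page."""
    soup = BeautifulSoup(html, "html.parser")
    time_text = soup.select(".time-source")[0].contents[0].strip()
    # As in the original loop, the last <p> is dropped from the body.
    paragraphs = [p.text.strip() for p in soup.select("#artibody p")[:-1]]
    editor = soup.select(".article-editor")[0].text.strip()
    if editor.startswith(EDITOR_LABEL):
        editor = editor[len(EDITOR_LABEL):]  # drop the label as a whole prefix
    return {
        "title": soup.select("#artibodyTitle")[0].text.strip(),
        "time": datetime.strptime(time_text, "%Y年%m月%d日%H:%M"),  # assumed layout
        "article": " ".join(p for p in paragraphs if p),  # skip empty paragraphs
        "editor": editor,
    }


# Hypothetical page that mimics the ids and classes used above.
sample = """
<h1 id="artibodyTitle">Sample title</h1>
<span class="time-source">2017年02月27日09:32 <span>Source</span></span>
<div id="artibody">
  <p>First paragraph.</p>
  <p></p>
  <p>Second paragraph.</p>
  <p class="article-editor">责任编辑：Zhang San</p>
</div>
"""
result = parse_article(sample)
print(result)
```

Returning a dictionary keeps the fields together, which makes it straightforward to collect many articles into a list later.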

6. Summary

This is about as basic as a crawler gets, and it makes a helpful starting point for beginners.

On with learning Python!
