Content scraping (a simple experiment with requests + BeautifulSoup)

Published by 优采云: 2021-10-19 06:20

　　Python is well suited to writing crawlers because it has many scraping libraries. The most commonly used ones include requests, Selenium, lxml, Beautiful Soup, and pyquery.

　　Here is a short experiment using requests + Beautiful Soup.

　　First pick a URL you want to scrape. Here I use /post/34.html as the example.

　　Open the site in Chrome, press Ctrl+U to view the page source, and find the title in the source view.

　　The title sits inside an <h1> tag, so we can use the Beautiful Soup parsing library and write:

　　bs.find("h1").getText()  # get the title

　　to get the title.
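　　As a minimal sketch of this step, the snippet below runs the same lookup on a small made-up HTML string, so it works without a network connection. The markup in sample_html is an assumption standing in for the real page source, and the None check is an addition for pages that have no <h1>.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the real page source.
sample_html = """
<html><body>
  <h1> Example article title </h1>
  <p>Some text.</p>
</body></html>
"""

bs = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching element, or None when nothing matches.
h1 = bs.find("h1")

# get_text(strip=True) drops the surrounding whitespace; getText() is an alias of get_text().
title = h1.get_text(strip=True) if h1 is not None else ""
print(title)
```

　　Calling .getText() directly on the result of find() raises AttributeError when the tag is missing, which is why the sketch checks for None first.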

　　Next, locate the article body in the page source:

　　The body sits inside a <div class="newstext"> tag, so with the Beautiful Soup parsing library we can write:

　　content = bs.find("div", class_="newstext")
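　　The lookup by class can be sketched the same way on a made-up fragment. The HTML below is hypothetical; only the newstext class name comes from the article. Note that find() returns the whole element, markup included, so the sketch also pulls the plain text out of its paragraphs.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; only the "newstext" class name comes from the article.
sample_html = """
<div class="sidebar">ignored</div>
<div class="newstext"><p>First paragraph.</p><p>Second paragraph.</p></div>
"""

bs = BeautifulSoup(sample_html, "html.parser")

# class_ has a trailing underscore because "class" is a reserved word in Python.
content = bs.find("div", class_="newstext")

# Collect the text of each <p> inside the matched div, without the tags.
paragraphs = [p.get_text(strip=True) for p in content.find_all("p")]
text = "\n".join(paragraphs)
print(text)
```

　　Printing content itself would show the HTML tags as well, which may or may not be what you want to store.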

　　The full code is as follows:

# coding=utf-8  # declare the source file encoding so Chinese text is not garbled
import requests
from bs4 import BeautifulSoup

header = {
    'User-Agent': 'Mozilla/5.0 (compatible; Baiduspider/2.0; +)'  # mimic the Baiduspider crawler
}

url = 'URL to scrape'  # placeholder: replace with the page you want to fetch

response = requests.get(url, headers=header, timeout=6)  # timeout is given in seconds
response.encoding = 'utf-8'  # set the page encoding
html = response.text  # get the HTML content

bs = BeautifulSoup(html, "html.parser")  # use "html.parser" as Beautiful Soup's parser
title = bs.find("h1").getText()  # get the title
content = bs.find("div", class_="newstext")  # get the content

print('Title:%s' % title)
print('Content:\n%s' % content)
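　　The script above assumes the request succeeds and that both tags exist. A slightly more defensive variant might look like the sketch below. It is not the author's code: the function names parse_article and fetch_article, the HEADERS value, and the sample HTML are all assumptions made for illustration. Parsing is split from fetching so the parsing part can be tried offline.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder User-Agent (an assumption); substitute whatever header you need.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def parse_article(html):
    """Extract the title and body text from an HTML string.

    Returns None for a field whose tag is missing instead of raising.
    """
    bs = BeautifulSoup(html, "html.parser")
    h1 = bs.find("h1")
    body = bs.find("div", class_="newstext")
    return {
        "title": h1.get_text(strip=True) if h1 is not None else None,
        # Join the text nodes with newlines and strip whitespace around each.
        "content": body.get_text("\n", strip=True) if body is not None else None,
    }


def fetch_article(url, timeout=6):
    """Download a page and parse it; raises requests.HTTPError on 4xx/5xx."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()
    response.encoding = "utf-8"  # assumes the page is UTF-8, as in the article
    return parse_article(response.text)


# Offline check with hypothetical HTML; no request is sent here.
sample_html = (
    "<h1>Hello</h1>"
    '<div class="newstext"><p>Line one.</p><p>Line two.</p></div>'
)
article = parse_article(sample_html)
print(article["title"])
print(article["content"])
```

　　To scrape a real page, call fetch_article() with the URL you chose, and wrap the call in try/except requests.RequestException if you want to handle timeouts and HTTP errors yourself.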

　　The final output is as follows:

　　From the "ITPUB博客" (ITPUB blog). Please credit the source when reposting; otherwise legal liability will be pursued.
