Web News Scraping (BeautifulSoup makes it easy to scrape content from web pages)

优采云, published 2022-03-21 17:44

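With BeautifulSoup, scraping content from the web is straightforward. The package parses a web page into a DOM tree that you can search and traverse.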
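To make the idea of a DOM tree concrete, here is a minimal sketch that parses a small HTML string and walks from one node to its children and its parent. The sample markup and the `news` id are made up for illustration and do not come from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, made up for illustration.
html = (
    "<html><body>"
    "<div id='news'><h1>Title</h1><p>First.</p><p>Second.</p></div>"
    "</body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# Find one node in the tree, then move down to its children and up to its parent.
div = soup.find("div", id="news")
child_names = [child.name for child in div.children]

print(child_names)       # tag names of the direct children
print(div.parent.name)   # tag name of the enclosing element
```

Each tag is a node in the tree, so the same navigation works at any depth.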

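BeautifulSoup has to be installed before use, either from the command line (for example, pip install beautifulsoup4) or directly through a Python IDE.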

Basic operations:

① Import the class from the bs4 package before use: from bs4 import BeautifulSoup

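② Create the parser object: soup = BeautifulSoup(res.text, 'html.parser')

Here res is the response for the source page (for example, one returned by an HTTP library such as requests), and res.text, the first argument, is that page's HTML. The second argument, 'html.parser', selects Python's built-in HTML parser.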

③ Use the select function to find every HTML element that matches a CSS selector. For example, soup.select('h1') finds all h1 elements.

It returns a list containing every matching h1 element.
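The three steps above can be sketched without any network access by parsing an HTML string directly. The sample markup, the `sub` class, and the variable names are assumptions for illustration; in real use the string would be the `res.text` of a fetched page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for res.text.
html = """
<html><body>
  <h1>First headline</h1>
  <h1 class="sub">Second headline</h1>
  <p>Body text</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list-like result of all matches.
headlines = soup.select("h1")
titles = [h.get_text() for h in headlines]

# A more specific selector narrows the result to h1 elements with class "sub".
sub_titles = [h.get_text() for h in soup.select("h1.sub")]

print(titles)
print(sub_titles)
```

Because the result behaves like a list, it can be indexed, sliced, and iterated; an empty result simply means nothing matched.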

Below is a simple scraper for an article from ifeng.com (凤凰):

# coding=utf-8
from urllib import request

from bs4 import BeautifulSoup

# Page URL
url = 'http://news.ifeng.com/a/20181118/60165418_0.shtml'

# Fetch the page and decode it, ignoring bytes that are not valid UTF-8
html = request.urlopen(url).read().decode('utf-8', 'ignore')

# Parse the HTML
soup = BeautifulSoup(html, 'html.parser')

# Locate the article container
page = soup.find('div', {'id': 'artical'})

# Pick out the elements to scrape, here the title and the body
page_topic = page.find('h1', id='artical_topic')

# Extract the title text with get_text()
topic = page_topic.get_text()

# Start the output with the title, then append each paragraph of the body
content = topic
page_content = page.find('div', id='main_content')
for p in page_content.select('p'):
    content = content + p.get_text()

print(content)

That is all it takes to build a simple web news scraper.
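One caveat: find returns None when an element is missing, so a scraper written this way raises an AttributeError as soon as the page layout changes. The following is a sketch of a more defensive variant, not a definitive implementation. The function name `extract_article` and the sample HTML are hypothetical; only the element ids `artical`, `artical_topic`, and `main_content` come from the article's example.

```python
from bs4 import BeautifulSoup


def extract_article(html):
    """Return (title, paragraphs), or None if the expected layout is absent."""
    soup = BeautifulSoup(html, "html.parser")

    page = soup.find("div", id="artical")
    if page is None:
        return None

    title_tag = page.find("h1", id="artical_topic")
    body = page.find("div", id="main_content")
    if title_tag is None or body is None:
        return None

    title = title_tag.get_text(strip=True)
    paragraphs = [p.get_text(strip=True) for p in body.select("p")]
    return title, paragraphs


# Hypothetical markup that mirrors the ids used above, so no network is needed.
sample_html = """
<div id="artical">
  <h1 id="artical_topic"> Sample headline </h1>
  <div id="main_content">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</div>
"""

result = extract_article(sample_html)
print(result)
```

Keeping the parsing in a function that takes an HTML string separates it from the download step, so it can be exercised on saved pages.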
