Scraping web news (newspaper, a Python scraping framework for extracting and analyzing news content)

优采云 · Published: 2022-04-02 18:22


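newspaper is a Python scraping framework used mainly to extract and analyze news content. It is well suited to scraping news pages, and it is simple enough that beginners with no scraping experience can pick it up quickly. For basic use you also do not need to deal with HTTP headers, IP proxies, HTML parsing, or the structure of the page source yourself.

Installation

pip3 install newspaper3k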

Usage

The examples below use Wired (https://www.wired.com/) for demonstration.

Getting the news

import newspaper
from newspaper import Article
from newspaper import fulltext

url = 'https://www.wired.com/'
paper = newspaper.build(url, language="en", memoize_articles=False)

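This produces a news source object.

By default, newspaper caches every article it has previously extracted and leaves out any article it has already seen. Pass memoize_articles=False to opt out of this behavior.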
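To make the caching behavior concrete, here is a minimal, library-independent sketch of the idea: remember which URLs were already seen and return only the new ones on later runs. The helper name filter_new_urls, the in-memory seen set, and the example.com URLs are assumptions for illustration; this is not newspaper's actual implementation.

```python
def filter_new_urls(urls, seen, memoize=True):
    """Return URLs not seen before; with memoize=False return them all."""
    if not memoize:
        return list(urls)
    fresh = [u for u in urls if u not in seen]
    seen.update(fresh)  # remember them for the next call
    return fresh


seen = set()
first_run = filter_new_urls(["https://example.com/a", "https://example.com/b"], seen)
second_run = filter_new_urls(["https://example.com/b", "https://example.com/c"], seen)
print(first_run)   # both URLs are new on the first run
print(second_run)  # only the URL ending in /c is new on the second run
```

With memoize=False the function returns every URL each time, which mirrors why the examples in this article pass memoize_articles=False: repeated runs keep listing the same articles.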

Extracting news URLs

Extract the news URLs found on the site's pages:

import newspaper
from newspaper import Article
from newspaper import fulltext

url = 'https://www.wired.com/'
paper = newspaper.build(url, language="en", memoize_articles=False)

# each entry in paper.articles is an Article object
for article in paper.articles:
    print(article.url)

This prints the URL of each article.

Extracting news categories

newspaper can also extract the category URLs of a site:

for category in paper.category_urls():
    print(category)

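Extracting news content: Article

An Article object is an abstraction of a single news article. For example, if the news source is Wired, an Article is one Wired story on that site, from which you can extract the title, authors, top image, body text, and so on.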

article = Article('https://www.wired.com/story/preterm-babies-lonely-terror-of-a-pandemic-nicu/')
article.download()  # fetch the HTML
article.parse()     # extract title, authors, text, ...
print("title=", article.title)
print("authors=", article.authors)
print("publish_date=", article.publish_date)
print("top_image=", article.top_image)
print("movies=", article.movies)
print("text=", article.text)
# summary stays empty until article.nlp() has been called
print("summary=", article.summary)
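When the extracted fields need to be stored or serialized, it can help to gather them into a plain dict. The sketch below assumes an object exposing the attribute names used above; the helper name article_to_dict and the SimpleNamespace stand-in are hypothetical and only there so the example runs without a network request.

```python
from types import SimpleNamespace


def article_to_dict(article):
    """Collect the commonly used fields of a parsed article into a plain dict."""
    date = article.publish_date
    return {
        "url": article.url,
        "title": article.title,
        "authors": list(article.authors),
        # publish_date may be None when the page exposes no date
        "publish_date": date.isoformat() if date else None,
        "top_image": article.top_image,
        "text": article.text,
    }


# Stand-in object with the same attribute names, used instead of a real download.
fake = SimpleNamespace(
    url="https://example.com/story",
    title="Example title",
    authors=["A. Writer"],
    publish_date=None,
    top_image="https://example.com/img.jpg",
    text="Body text.",
)
record = article_to_dict(fake)
print(record["title"])
```

With a real article, the same function would be called after download() and parse().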

Downloading and parsing

Take one of the articles as an example:

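first_article = paper.articles[0]  # an Article object, not just a URL
first_article.download()
first_article.parse()
print(first_article.title)
print(first_article.publish_date)
print(first_article.authors)
print(first_article.top_image)
# summary stays empty until first_article.nlp() has been called
print(first_article.summary)
print(first_article.movies)
print(first_article.text)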
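In practice some downloads fail, so when handling more than one article it is worth skipping the failures instead of stopping. The following is a minimal sketch under that assumption; parse_first and FakeArticle are hypothetical names, and FakeArticle only imitates the download()/parse() interface so the example runs offline.

```python
def parse_first(articles, limit=5):
    """Download and parse up to `limit` articles, skipping any that fail."""
    parsed, failed = [], []
    for article in articles[:limit]:
        try:
            article.download()
            article.parse()
        except Exception as exc:  # broad on purpose: network and parsing errors alike
            failed.append((article.url, str(exc)))
        else:
            parsed.append(article)
    return parsed, failed


class FakeArticle:
    """Stand-in with the same download()/parse() interface; no network needed."""

    def __init__(self, url, fail=False):
        self.url = url
        self.fail = fail
        self.title = None

    def download(self):
        if self.fail:
            raise RuntimeError("download failed")

    def parse(self):
        self.title = "Title for " + self.url


batch = [FakeArticle("https://example.com/1"), FakeArticle("https://example.com/2", fail=True)]
parsed, failed = parse_first(batch)
print(len(parsed), len(failed))
```

With a real source, the call would be parse_first(paper.articles), and the failed list shows which URLs to retry or ignore.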

Parsing HTML

Fetch the article's HTML with the requests library and parse it with newspaper:

import requests

html = requests.get('https://www.wired.com/story/preterm-babies-lonely-terror-of-a-pandemic-nicu/').text
print('Raw HTML -->', html)
# fulltext() extracts the main body text from an HTML string
text = fulltext(html, language='en')
print('Extracted text -->', text)

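Combining with natural language processing

The nlp() method extracts natural-language features from the text, such as a summary and keywords.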

second_article = paper.articles[1]  # index 1 is the second article in the list
second_article.download()
second_article.parse()
second_article.nlp()  # must be called after download() and parse()
print(second_article.summary)
print(second_article.keywords)
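To give a rough feel for what keyword extraction means, here is a deliberately naive, standalone sketch that ranks words by frequency after dropping a few stopwords. It is an illustration only and not the algorithm newspaper uses; naive_keywords, the tiny STOPWORDS set, and the sample sentence are all assumptions.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list; real tools use much larger ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on"}


def naive_keywords(text, top_n=3):
    """Return the top_n most frequent non-stopword words in text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


sample = "Robots build robots. The robots in the factory build cars, and cars need robots."
print(naive_keywords(sample))
```

The most frequent word in the sample is "robots", so it ranks first; the library's own keywords attribute should be preferred for real articles.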

Multithreaded downloading

When news has to be fetched from several sources, the downloads can run in multiple threads:

import newspaper
from newspaper import news_pool

lr_paper = newspaper.build('https://lifehacker.com/', language="en")
wd_paper = newspaper.build('https://www.wired.com/', language="en")
ct_paper = newspaper.build('https://www.cnet.com/news/', language="en")
papers = [lr_paper, wd_paper, ct_paper]

# total threads: 3 sources * 2 threads each = 6
news_pool.set(papers, threads_per_source=2)
news_pool.join()  # wait until all downloads have finished

print(lr_paper.articles[0].html)
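The thread arithmetic above can be illustrated with the standard library alone. The sketch below swaps in concurrent.futures.ThreadPoolExecutor as a stand-in and does not reproduce how news_pool works internally; download_all, fake_fetch, and the example.com URLs are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(sources, fetch, threads_per_source=2):
    """Fetch every URL of every source with one shared pool.

    The pool is sized len(sources) * threads_per_source.
    """
    total_threads = len(sources) * threads_per_source
    jobs = [(name, url) for name, urls in sources.items() for url in urls]
    results = {name: [] for name in sources}
    with ThreadPoolExecutor(max_workers=max(total_threads, 1)) as pool:
        # pool.map yields results in the order of jobs, so zip pairs them correctly
        bodies = pool.map(lambda job: fetch(job[1]), jobs)
        for (name, url), body in zip(jobs, bodies):
            results[name].append((url, body))
    return total_threads, results


def fake_fetch(url):
    """Stand-in for a real HTTP request, so the example runs offline."""
    return "<html>" + url + "</html>"


sources = {
    "lifehacker": ["https://example.com/lifehacker/1"],
    "wired": ["https://example.com/wired/1", "https://example.com/wired/2"],
    "cnet": ["https://example.com/cnet/1"],
}
total, results = download_all(sources, fake_fetch)
print(total)  # 3 sources * 2 threads each = 6
```

Passing the fetch function as a parameter keeps the sketch testable; in real use it would be replaced by an actual download call.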

Other

hot() returns a list of the most popular terms on Google.

popular_urls() returns a list of popular news source URLs.

newspaper.hot()
newspaper.popular_urls()
