Web news scraping (Google trends, multitasking, framework instability, and how newspaper caches extracted articles)

Published by 优采云: 2021-11-06 18:08

Newspaper is a Python 3 library.

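Note: the Newspaper framework is not suitable for production news-crawling work. It is unstable, and various bugs show up during crawling, such as failing to fetch a URL or to extract the news content. For anyone who just wants to collect some news text as a corpus, though, it is worth a try: it is simple to use and requires little crawler expertise.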

Installation

pip3 install newspaper3k

or

pip3 install --ignore-installed --upgrade newspaper3k

If no language is specified for an article, Newspaper tries to detect it automatically. More than 10 languages are supported, all handled as Unicode.

import time

from newspaper import Article

# Two sample URLs; the second assignment overrides the first
url = 'https://www.chinaventure.com.cn/news/78-20190819-347269.html'
url = 'https://36kr.com/p/5237348'

# Create the article object
news = Article(url, language='zh')

# Download the page
news.download()

# Parse the page
news.parse()
print("title=", news.title)  # article title
print("author=", news.authors)  # article authors
print("publish_date=", news.publish_date)  # article publication date

# Natural language processing (must run after parse(); relies on NLTK data being installed)
news.nlp()
print("keywords=", news.keywords)  # keywords extracted from the text
print("summary=", news.summary)  # article summary

# time.sleep(30)
print("text=", news.text, "\n")  # article body text
print("movies=", news.movies)  # video links in the article
print("top_image=", news.top_image)  # URL of the article's top image
print("images=", news.images)  # all images extracted from the HTML
print("imgs=", news.imgs)
print("html=", news.html)  # raw HTML
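Because the note at the top warns that fetching can fail unpredictably, a small retry wrapper may help when collecting a corpus. The following is a minimal sketch and not part of Newspaper: `fetch_with_retry`, `make_article`, `retries`, and `delay` are hypothetical names introduced here, and the sketch assumes only that the article object exposes `download()` and `parse()` as in the example above.

```python
import time


def fetch_with_retry(make_article, url, retries=3, delay=1.0):
    """Call download() and parse() up to `retries` times.

    `make_article` is any callable that builds a fresh article object
    from a URL (hypothetical parameter, e.g. a lambda around Article).
    Returns the parsed article, or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        article = make_article(url)
        try:
            article.download()
            article.parse()
            return article
        except Exception as exc:  # broad on purpose: failure modes vary
            print("attempt %d failed for %s: %s" % (attempt, url, exc))
            if attempt < retries:
                time.sleep(delay)
    return None


# Possible usage with Newspaper (requires network access):
# from newspaper import Article
# news = fetch_with_retry(lambda u: Article(u, language='zh'),
#                         'https://36kr.com/p/5237348')
# if news is not None:
#     print(news.title)
```

Passing a factory instead of an article means every attempt starts from a fresh object, so no half-downloaded state is reused.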

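You can also import the package directly and build a whole news source; if all its articles share one language, you can declare the language once up front.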

import newspaper

url = 'http://www.coscocs.com/'

# Note on article caching: by default, newspaper caches all previously extracted
# articles and drops any article it has already extracted. This prevents duplicate
# articles and speeds up extraction. Use the memoize_articles parameter to opt out.
news = newspaper.build(url, language='zh', memoize_articles=False)

article = news.articles[0]  # raises IndexError if no articles were found
article.download()
article.parse()
print('text=', article.text)

print('brand=', news.brand)  # source brand
print('description=', news.description)  # source description
print("Found %s articles in total" % news.size())  # number of articles

# URLs of all articles
for article in news.articles:
    print(article.url)

# Category URLs of the source
for category in news.category_urls():
    print(category)

# Feed URLs of the source
for feed_url in news.feed_urls():
    print(feed_url)

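Note on article caching: by default, Newspaper caches all previously extracted articles and drops any article it has already extracted. This prevents duplicate articles and speeds up extraction. You can opt out of this behaviour with the memoize_articles parameter.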
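To make the caching behaviour concrete, here is a conceptual sketch of URL memoization in plain Python. It illustrates the idea only and is not Newspaper's actual implementation; `filter_new_urls` and `seen` are hypothetical names, and the example URLs are placeholders.

```python
def filter_new_urls(urls, seen):
    """Return the URLs not seen in earlier runs, recording them in `seen`."""
    fresh = []
    for u in urls:
        if u not in seen:
            seen.add(u)
            fresh.append(u)
    return fresh


seen = set()  # a real cache would persist this between runs
first_run = filter_new_urls(["http://a.example/1", "http://a.example/2"], seen)
second_run = filter_new_urls(["http://a.example/2", "http://a.example/3"], seen)
print(first_run)
print(second_run)
```

The second call only returns the URL that was not part of the first run, which is why a repeated build of the same source can appear to find fewer articles unless memoize_articles=False is passed.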

Combining Requests and Newspaper to parse text

import requests
from newspaper import fulltext

html = requests.get('https://www.washingtonpost.com/business/economy/2019/01/17/19662748-1a84-11e9-9ebf-c5fed1b7a081_story.html?utm_term=.26198c91916f').text
text = fulltext(html)
print(text)

Google Trends information

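import newspaper

newspaper.languages()  # prints the supported languages itself, so no print() is needed
print(newspaper.hot())  # hot() uses a public API to return a list of trending terms on Google
print(newspaper.popular_urls())  # popular_urls() returns a list of popular news source URLs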

Multitasking

import newspaper
from newspaper import news_pool

# Build the sources to download in parallel
slate_paper = newspaper.build('http://slate.com')
tc_paper = newspaper.build('http://techcrunch.com')
espn_paper = newspaper.build('http://espn.com')

papers = [slate_paper, tc_paper, espn_paper]
news_pool.set(papers, threads_per_source=2)  # 3 sources * 2 threads = 6 threads in total
news_pool.join()

print(slate_paper.articles[10].html)
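For comparison, the same fan-out idea can be sketched with the standard library's `concurrent.futures`. This is a swapped-in technique, not how `news_pool` is implemented; `download_all` and `max_workers` are hypothetical names, and the sketch assumes only that each article object has a `download()` method.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(articles, max_workers=6):
    """Call download() on every article using a pool of worker threads.

    Returns a list of (article, error) pairs in input order, where error
    is None on success or the exception raised by download().
    """
    def _download(article):
        try:
            article.download()
            return article, None
        except Exception as exc:  # keep going if one download fails
            return article, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_download, articles))


# Possible usage with the sources built above (requires network access):
# results = download_all(slate_paper.articles + tc_paper.articles)
# failed = [a.url for a, err in results if err is not None]
```

Collecting the error next to each article makes it easy to see which downloads failed, which matters given how unreliable fetching can be.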

python-readability

The GitHub address is:

Installation

pip install requests

pip install readability-lxml

Usage:

import requests
from readability import Document

response = requests.get('https://news.163.com/18/1123/13/E1A4T8F40001899O.html')
doc = Document(response.text)

# print() calls replace the original Python 2 print statements
print(doc.title())
print(doc.content())

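Test result: the extracted text range is too wide and the output contains garbled characters. The extracted body text is clearly wrong, and the garbling makes it unusable, so this library is not recommended.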
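One possible cause of the garbled output (an assumption, not verified against that page) is an encoding mismatch: when a page is served in GBK but decoded with a different charset, Chinese text turns into mojibake. The sketch below decodes the raw bytes using the charset declared in the HTML itself; `decode_html` and its `fallback` parameter are hypothetical names.

```python
import re


def decode_html(raw, fallback="utf-8"):
    """Decode raw HTML bytes using the charset declared near the top of the page."""
    match = re.search(rb'charset=["\']?([\w-]+)', raw[:2048], re.IGNORECASE)
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:  # unknown charset name in the page
        return raw.decode(fallback, errors="replace")


page = '<html><head><meta charset="gbk"></head><body>新闻</body></html>'.encode("gbk")
print(decode_html(page))          # decoded with the declared charset
print(page.decode("iso-8859-1"))  # wrong charset: the body becomes mojibake
```

With Requests, the raw bytes are available as `response.content`, so `Document(decode_html(response.content))` would be one way to try this. Separately, readability-lxml also offers `Document.summary()`, which returns the cleaned main-article HTML and may give a narrower extraction than `content()`.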
