Article sentence scraping software (User: 你好我是森林: 2018-04-01)

Published by 优采云: 2021-09-04 17:21

User: 你好，我是森林 Date: 2018-04-01 Tag: 《Python网络数据采集》 Original:

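Web scraping article series

Python web scraping: building a crawler

Python web scraping: HTML parsing

Python web scraping: starting to crawl

Python web scraping: using APIs

Python web scraping: storing data

Python web scraping: reading files

Python web scraping: data cleaning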

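Processing natural language: summarizing data

Earlier in this series we saw how to break text into n-grams, that is, phrases n words long. At its most basic, this collection can be used to find the most frequently used phrases in a text. In addition, the sentences surrounding those phrases can be extracted to give a plausible-looking summary of the original.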
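As a quick reminder of what an n-gram is, here is a minimal sketch. The helper name `ngrams` and the sample sentence are made up for illustration and are not part of the original post.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


words = "the quick brown fox jumps".split()
bigrams = ngrams(words, 2)
print(bigrams)
```

A five-word sentence yields four 2-grams, and asking for an n larger than the sentence simply yields an empty list.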

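For example, we analyze the full text of William Henry Harrison's inaugural address. The address of the text appears in the code below.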

from urllib.request import urlopen
import re
import string
from collections import Counter

def cleanSentence(sentence):
    # Split on spaces, then strip punctuation and whitespace from each word.
    sentence = sentence.split(' ')
    sentence = [word.strip(string.punctuation+string.whitespace) for word in sentence]
    # Drop single-character tokens, except the one-letter words "a" and "I".
    sentence = [word for word in sentence if len(word) > 1 or (word.lower() == 'a' or word.lower() == 'i')]
    return sentence

def cleanInput(content):
    # Normalize case, flatten newlines, and discard non-ASCII characters.
    content = content.upper()
    content = re.sub('\n', ' ', content)
    content = bytes(content, 'UTF-8')
    content = content.decode('ascii', 'ignore')
    # Rough sentence split: a period followed by a space.
    sentences = content.split('. ')
    return [cleanSentence(sentence) for sentence in sentences]

def getNgramsFromSentence(content, n):
    # Slide a window of n words across one cleaned sentence.
    output = []
    for i in range(len(content)-n+1):
        output.append(content[i:i+n])
    return output

def getNgrams(content, n):
    # Count every n-gram in the text; n-grams never span two sentences.
    content = cleanInput(content)
    ngrams = Counter()
    for sentence in content:
        newNgrams = [' '.join(ngram) for ngram in getNgramsFromSentence(sentence, n)]
        ngrams.update(newNgrams)
    return ngrams

content = str(
    urlopen('http://pythonscraping.com/files/inaugurationSpeech.txt').read(),
    'utf-8')
ngrams = getNgrams(content, 3)
print(ngrams)
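The summarizing idea described earlier (take the most common phrases, then pull out sentences that contain them) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `clean_words`, `count_ngrams` and `summarize`, the regex-based sentence split, and the three-sentence sample text are all hypothetical and not from the original post. It runs on an inline string so that it does not need network access.

```python
import re
import string
from collections import Counter


def clean_words(sentence):
    """Upper-case a sentence and strip punctuation from each word."""
    stripped = (w.strip(string.punctuation + string.whitespace)
                for w in sentence.upper().split())
    return [w for w in stripped if w]


def count_ngrams(sentences, n):
    """Count n-grams across sentences; n-grams never span two sentences."""
    counts = Counter()
    for sentence in sentences:
        words = clean_words(sentence)
        counts.update(' '.join(words[i:i + n])
                      for i in range(len(words) - n + 1))
    return counts


def summarize(text, n=2, top=2):
    """Return the first sentence containing each of the `top` commonest n-grams."""
    # Assumption: a sentence ends with . ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    counts = count_ngrams(sentences, n)
    summary = []
    for phrase, _ in counts.most_common(top):
        for sentence in sentences:
            joined = ' ' + ' '.join(clean_words(sentence)) + ' '
            if ' ' + phrase + ' ' in joined:  # match whole words only
                if sentence not in summary:
                    summary.append(sentence)
                break
    return summary


sample = ("The constitution limits executive power. "
          "Citizens expect the executive power to be used sparingly. "
          "Weather was cold that day.")
print(summarize(sample, n=2, top=1))
```

On a real speech, very common n-grams tend to be filler such as "OF THE", so a practical version would probably filter those out before picking sentences.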

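Natural Language Toolkit

The Natural Language Toolkit (NLTK) is a Python library for identifying and tagging the parts of speech in English text.

Installation and setup

NLTK is available from the NLTK website. Installing it is simple, for example with pip.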

➜ psysh git:(master) pip install nltk
Collecting nltk
  Using cached nltk-3.2.5.tar.gz
Requirement already satisfied: six in /usr/local/lib/python3.6/site-packages (from nltk)
Building wheels for collected packages: nltk
  Running setup.py bdist_wheel for nltk ... done
  Stored in directory: /Users/demo/Library/Caches/pip/wheels/18/9c/1f/276bc3f421614062468cb1c9d695e6086d0c73d67ea363c501
Successfully built nltk
Installing collected packages: nltk
Successfully installed nltk-3.2.5
You are using pip version 9.0.1, however version 9.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

A quick check shows that the import works:

➜ psysh git:(master) python
Python 3.6.4 (default, Mar 1 2018, 18:36:50)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>>

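Enter nltk.download() to open the NLTK downloader.

(Figure: the NLTK downloader)

Downloading all packages, which is the default selection, saves beginners from having to troubleshoot missing data later.

(Figure: installing the packages)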

Statistical analysis with NLTK

Statistical analysis with NLTK usually starts from a Text object. A Text object can be created from a plain Python string like this:

from nltk import word_tokenize
from nltk import Text

# word_tokenize relies on tokenizer data installed through nltk.download().
tokens = word_tokenize("哈哈哈哈哈")
text = Text(tokens)

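The argument to word_tokenize can be any Python string. If you have no long string at hand but still want to try some features, the NLTK library ships with several books that can be loaded with an import:

from nltk.book import *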

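Count the unique words in a text and compare that with the total number of words: >>> len(text6)/len(words). Here words is not defined in the snippet; it is presumably the set of unique words, for example words = set(text6).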
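The same ratio can be sketched on a plain Python list, so that it runs without NLTK or its data. This assumes that `words` in the expression above is the set of unique tokens; the sample sentence and the variable names are made up for illustration.

```python
# A plain list of tokens stands in for an NLTK Text object here.
tokens = "the knights who say ni demand a shrubbery said the knights".lower().split()

# Unique words: a set discards repeated tokens.
unique_words = set(tokens)

# Average number of times each distinct word is used.
ratio = len(tokens) / len(unique_words)
print(len(tokens), len(unique_words), round(ratio, 3))
```

A ratio close to 1 means almost every word is used only once; larger values mean a more repetitive text.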

Today's post is fairly short, but it takes some effort to digest. Ha ha.

