Scraping the websites related to a Baidu keyword and generating a word cloud

优采云, published 2020-08-07 18:45

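How can you take in data at a glance without at least a rough visualization? Today we use the Baidu keyword "AI" as an example: we scrape page content from the related websites in the search results and generate a word cloud image with matplotlib + wordcloud.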

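First, look at what Baidu returns for "AI". The results are basically a mix of artificial intelligence (AI), the Adobe Illustrator drawing tool (AI), the pinyin "ai" for "love", and other unrelated information. Everything except artificial intelligence needs to be filtered out.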
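One simple way to drop the unrelated results is a keyword check on each page's text. The sketch below is only an illustration: the function name `is_about_ai` and both keyword lists are assumptions, not part of the original script, and a real filter would need tuning against the pages actually returned.

```python
# Hypothetical keyword lists; adjust them to the pages you actually see.
KEEP_TERMS = ("人工智能", "artificial intelligence", "机器学习")
DROP_TERMS = ("adobe illustrator", "illustrator")


def is_about_ai(text):
    """Return True if the text looks like it is about artificial intelligence."""
    lowered = text.lower()
    # Reject anything that mentions the drawing tool.
    if any(term in lowered for term in DROP_TERMS):
        return False
    # Keep only text that mentions at least one AI-related term.
    return any(term in lowered for term in KEEP_TERMS)


pages = [
    "人工智能（AI）正在改变世界",
    "Adobe Illustrator (AI) 绘图教程",
    "ai 是“爱”的拼音",
]
relevant = [p for p in pages if is_about_ai(p)]
print(relevant)
```

This rule is deliberately strict: a page about artificial intelligence that also mentions Illustrator would be dropped, so the lists may need loosening for real data.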

So the overall plan is: collect data → filter → count word frequencies → generate the word cloud image.
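The word-frequency step of this plan can be sketched with `collections.Counter`. This is a minimal illustration that assumes the text has already been split into tokens (for example by jieba); `word_frequencies`, `STOP_WORDS`, and `min_length` are hypothetical names introduced here.

```python
from collections import Counter

# Hypothetical stop-word list; extend it for real data.
STOP_WORDS = {"的", "是", "和", "了"}


def word_frequencies(tokens, min_length=2):
    """Count tokens, skipping stop words, blanks, and very short tokens."""
    cleaned = (t.strip() for t in tokens)
    kept = [t for t in cleaned if len(t) >= min_length and t not in STOP_WORDS]
    return Counter(kept)


tokens = ["人工智能", "的", "发展", "人工智能", "和", "机器学习", " "]
freq = word_frequencies(tokens)
print(freq.most_common(2))
```

A mapping like `freq` could then be passed to wordcloud's `WordCloud.generate_from_frequencies` instead of generating the cloud from raw text, which keeps the filtering under your control.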

Preparation

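Install and import the following libraries: BeautifulSoup, matplotlib (plotting), jieba (word segmentation), wordcloud (word clouds), PIL, and numpy (data handling). urllib and re (regular expressions) ship with Python and only need to be imported.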
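Before running anything, it can help to check which of these packages are still missing. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `missing_packages` and the `REQUIRED` table are introduced here for illustration. Note that the import names differ from the pip names for BeautifulSoup (`bs4` / `beautifulsoup4`) and PIL (`PIL` / `pillow`).

```python
from importlib.util import find_spec

# import name -> pip package name (urllib and re ship with Python)
REQUIRED = {
    "bs4": "beautifulsoup4",
    "matplotlib": "matplotlib",
    "jieba": "jieba",
    "wordcloud": "wordcloud",
    "PIL": "pillow",
    "numpy": "numpy",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```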

First draft

Start with an outline version that has only two simple steps: collect data → draw the word cloud.

Data collection:

You need to request the Baidu search results page and scrape the content of the pages that contain "AI".
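As a small illustration of this step, the search URL can be built with `urllib.parse.urlencode` so that non-ASCII keywords are escaped correctly, and a page can be checked for the keyword before it is kept. The helper names `build_search_url` and `mentions_keyword` are hypothetical; only the `https://www.baidu.com/s?wd=...` URL form comes from the script in this article.

```python
from urllib.parse import urlencode

BAIDU_SEARCH = "https://www.baidu.com/s"


def build_search_url(keyword):
    """Build the search URL, percent-encoding the keyword as UTF-8."""
    return BAIDU_SEARCH + "?" + urlencode({"wd": keyword})


def mentions_keyword(page_text, keyword):
    """Case-insensitive check that a page's text contains the keyword."""
    return keyword.lower() in page_text.lower()


print(build_search_url("AI"))
print(build_search_url("人工智能"))
print(mentions_keyword("What is ai?", "AI"))
```

A plain substring check also matches words that merely contain the letters "ai", so it is best treated as a first, coarse filter.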

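from urllib import request
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import re
import random

def getLinks(url):
    # Download the page and parse it
    html = request.urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")
    # NOTE: the "bodyContent" container and the /wiki/ pattern match
    # Wikipedia-style pages; adapt both to the markup of the pages you crawl
    body = bsObj.find("div", {"id": "bodyContent"})
    if body is None:
        return []  # container not found: return no links instead of raising
    # findAll returns a ResultSet, which behaves like a list
    return body.findAll("a", {"href": re.compile("^(/wiki/)((?!:).)*$")})

# Blocks such as class="result-op c-container" and class="HMCpkB" are
# Baidu's own related content and ads, so they should be excluded

random.seed()  # seed from the system's entropy source or clock
url = "https://www.baidu.com/s?wd=AI"
linkList = getLinks(url)
while len(linkList) > 0:
    # href holds only the path part of the link, so resolve it
    # against the page it was found on
    nextLink = urljoin(url, random.choice(linkList).attrs['href'])
    print(nextLink)
    url = nextLink
    linkList = getLinks(nextLink)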
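The comment in the script about excluding Baidu's own blocks and ads could be implemented roughly as follows. This sketch uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. The class names `result-op` and `HMCpkB` are taken from that comment; treating `c-container` as a class shared with ordinary results (and therefore not excluding it), and the names `ResultLinkParser` and `extract_result_links`, are assumptions made here.

```python
from html.parser import HTMLParser

# Classes that mark Baidu's own content or ads (taken from the comment above).
EXCLUDED_CLASSES = {"result-op", "HMCpkB"}


class ResultLinkParser(HTMLParser):
    """Collect absolute links that are not inside an excluded <div>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._skip_depth = 0  # > 0 while inside an excluded <div>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = set((attrs.get("class") or "").split())
            # Track nesting so the matching </div> ends the skipped region.
            if self._skip_depth or classes & EXCLUDED_CLASSES:
                self._skip_depth += 1
        elif tag == "a" and not self._skip_depth:
            href = attrs.get("href")
            if href and href.startswith("http"):
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "div" and self._skip_depth:
            self._skip_depth -= 1


def extract_result_links(html):
    parser = ResultLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<div class="result c-container"><a href="http://a.example/">A</a></div>'
    '<div class="result-op c-container"><a href="http://ad.example/">Ad</a>'
    '<div><a href="http://ad2.example/">Ad 2</a></div></div>'
    '<div class="HMCpkB"><a href="http://b.example/">B</a></div>'
    '<a href="/relative">relative link</a>'
)
print(extract_result_links(sample))
```

The depth counter assumes well-formed, properly closed `<div>` tags; for messy real-world pages, the equivalent filter in BeautifulSoup is likely to be more forgiving.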

Once we have a txt file holding the collected text, we can draw a simple word cloud.

Plotting:
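The article does not show how the scraped pages end up in that txt file. One possible way, sketched here with the standard library's `html.parser` rather than BeautifulSoup, is to strip each page down to its visible text and append it to a UTF-8 file. `TextExtractor`, `html_to_text`, and `append_text` are hypothetical helpers, not part of the original script.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._ignored = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._ignored += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._ignored:
            self._ignored -= 1

    def handle_data(self, data):
        if not self._ignored and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def append_text(path, text):
    """Append one page's text to the UTF-8 file that the plotting step reads."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(text + "\n")


page = "<html><head><style>p{}</style></head><body><p>人工智能</p><p> AI </p></body></html>"
print(html_to_text(page))
```

Calling `append_text` once per scraped page builds up a single corpus file, which matches how the plotting code reads one `text.txt`.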

import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image
import numpy as np

# Read the collected text
with open(r'C:\Users\AER\Desktop\text.txt', "r", encoding="utf-8") as f:
    txt = f.read()

# Segment the text with jieba (precise mode) and join the tokens
cut_text = jieba.cut(txt, cut_all=False)
result = '/'.join(cut_text)

# Load the mask image and convert it to a NumPy array
img = Image.open(r'C:\Users\AER\Desktop\PICPIC.png')
graph = np.array(img)

wc = WordCloud(
    font_path=r"C:\Users\AER\testgit\Study-Notes\msyh.ttc",  # font with Chinese glyphs
    background_color='white', max_font_size=50, mask=graph)
wc.generate(result)

# Recolor the words using the colors of the mask image
image_color = ImageColorGenerator(graph)
wc.recolor(color_func=image_color)
wc.to_file(r"C:\Users\AER\testgit\Study-Notes\5gpic.png")

# Display the result without axes
plt.figure("Word cloud")
plt.imshow(wc)
plt.axis("off")
plt.show()

Data processing
