Scraping Zhihu Answers with Python: Get the Information You Need Quickly

优采云 | Published: 2023-03-08 19:11

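　　Python is a widely used programming language, applied extensively in data analysis, machine learning, and natural language processing. For web scraping it is an especially hard tool to replace. This article shares techniques for scraping Zhihu answers with Python, to help you get the information you need more efficiently.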

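　　1. Understanding the structure of a Zhihu answer page

　　A Zhihu answer page consists of three parts: the question description, the answer list, and the comment section. The question description includes the question title, the description text, and the follower count. The answer list includes each answer's author, content, upvote count, and comment count. The comment section includes each comment's author, content, and upvote count.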
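　　The three parts described above can be modeled as simple records, which makes explicit what each scraping step is expected to produce. This is only a minimal sketch: the class names `QuestionInfo`, `Answer`, and `Comment` are hypothetical and are not part of Zhihu or of any library, and the values are made up.

```python
from dataclasses import dataclass, asdict


@dataclass
class QuestionInfo:
    title: str
    description: str
    follower_count: int


@dataclass
class Answer:
    author: str
    content: str
    upvote_count: int
    comment_count: int


@dataclass
class Comment:
    author: str
    content: str
    upvote_count: int


# Made-up values, for illustration only.
answer = Answer(author="example_user", content="An example answer.",
                upvote_count=12, comment_count=3)
print(asdict(answer))
```

　　Plain dictionaries work just as well; records like these mainly help catch a missing or misspelled field early.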

　　2. Fetching the page source

　　In Python, the requests library makes it easy to fetch a page's source: send a GET request and read the response body, as shown below:

```
import requests

url = 'https://www.zhihu.com/question/XXXXXX/answer/XXXXXX'
response = requests.get(url)
html = response.text
```

　　Here, url is the link of the page to scrape, response is the server's response object, and html is the page source.
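　　The request above has no timeout, no custom headers, and no error handling. A slightly more defensive variant might look like the sketch below. The function name `fetch_html` and the User-Agent string are hypothetical, and whether a given site needs a browser-like User-Agent is an assumption rather than something the steps above state.

```python
import requests

# Hypothetical header value; the default requests User-Agent is assumed
# to be rejected by some sites.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}


def fetch_html(url, timeout=10):
    """Return the page source as text, or None if the request fails."""
    try:
        response = requests.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException:
        return None
    return response.text


# A malformed URL is reported as None instead of raising.
print(fetch_html("not-a-url"))  # prints None
```

　　Returning None keeps the caller simple; a longer-running scraper would more likely log the exception and retry.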

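　　3. Parsing the page source

　　The BeautifulSoup library makes it easy to parse HTML source. By locating elements through their tags and attributes, you can find and extract the information you need.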
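　　As a minimal illustration of locating elements by tag and attribute, the sketch below parses a small hand-written HTML snippet. The markup and its class names (`card`, `card-title`, `card-link`) are made up for the example and are not Zhihu's.

```python
from bs4 import BeautifulSoup

# A small made-up HTML snippet; it is not real Zhihu markup.
html = """
<div class="card" data-id="42">
  <h2 class="card-title">Hello</h2>
  <a class="card-link" href="/items/42">Details</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate an element by tag name and class attribute.
card = soup.find("div", {"class": "card"})

# Extract text content, with surrounding whitespace stripped.
title = card.find("h2", class_="card-title").get_text(strip=True)

# Read attribute values; .get() returns None when the attribute is absent.
link = card.find("a")["href"]
data_id = card.get("data-id")

# The same element can also be reached with a CSS selector.
same_title = soup.select_one("div.card > h2.card-title").get_text(strip=True)

print(title, link, data_id)  # prints: Hello /items/42 42
```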

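　　4. Locating the question description

　　In the HTML, the question description is generally found in the `<div class="QuestionHeader-main">` tag. With BeautifulSoup you can locate this tag and extract the question title, follower count, and description.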
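　　The sketch below shows this step on a hand-written snippet. The class names are taken from this article, but the markup itself is a simplified assumption; the live page is more deeply nested and its class names may change.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written markup using the class names mentioned in this
# article; the structure of the real page is assumed, not verified.
sample_html = """
<div class="QuestionHeader-main">
  <h1 class="QuestionHeader-title">Example question title</h1>
  <div class="QuestionRichText">Example question description.</div>
  <strong class="NumberBoard-itemValue">1,234</strong>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
header = soup.find("div", {"class": "QuestionHeader-main"})

title = header.find("h1", {"class": "QuestionHeader-title"}).get_text(strip=True)
description = header.find("div", {"class": "QuestionRichText"}).get_text(strip=True)

# The count is displayed with a thousands separator, so strip it first.
followers_text = header.find("strong", {"class": "NumberBoard-itemValue"}).get_text(strip=True)
follower_count = int(followers_text.replace(",", ""))

print(title, follower_count)  # prints: Example question title 1234
```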

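　　5. Locating the answer list

　　In the HTML, each entry of the answer list is generally found in a `<div class="List-item">` tag. With BeautifulSoup you can locate these tags and extract each answer's author, content, upvote count, and comment count.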
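　　The sketch below walks over the answer items in a hand-written snippet and tolerates missing elements instead of raising. The markup is a simplified assumption built from the class names in this article, and `text_or_default` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written markup; the second item deliberately lacks a
# vote button to show how missing elements are handled.
sample_html = """
<div class="List-item">
  <span class="UserLink AuthorInfo-name">alice</span>
  <span class="RichText">First example answer.</span>
  <button class="Button VoteButton VoteButton--up"><span>12</span></button>
</div>
<div class="List-item">
  <span class="UserLink AuthorInfo-name">bob</span>
  <span class="RichText">Second example answer.</span>
</div>
"""


def text_or_default(node, default=""):
    """Return the stripped text of a tag, or a default if the tag is None."""
    return node.get_text(strip=True) if node is not None else default


soup = BeautifulSoup(sample_html, "html.parser")
answers = []
for item in soup.find_all("div", class_="List-item"):
    votes = text_or_default(item.select_one("button.VoteButton--up span"), "0")
    answers.append({
        "author": text_or_default(item.select_one(".AuthorInfo-name")),
        "content": text_or_default(item.select_one(".RichText")),
        "upvote_count": int(votes) if votes.isdigit() else 0,
    })

for answer in answers:
    print(answer)
```

　　Matching on a single class with a CSS selector, rather than on the full class string, keeps the lookup working when an element carries extra classes.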

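　　6. Locating the comment section

　　In the HTML, comment content is generally found in the `<div class="CommentItem-content">` tag. With BeautifulSoup you can locate this tag and extract each comment's author, content, and upvote count.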
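　　The same approach applies to comments. The sketch below uses a hand-written snippet: the `CommentItem` and `CommentItem-content` class names come from this article, while the rest of the markup, including the two-span layout of the vote button, is an assumption.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written markup; the real comment markup is assumed.
sample_html = """
<div class="CommentItem">
  <span class="UserLink">carol</span>
  <div class="CommentItem-content">Nice answer!</div>
  <button class="CommentItemVoter-button"><span>icon</span><span>5</span></button>
</div>
<div class="CommentItem">
  <span class="UserLink">dave</span>
  <div class="CommentItem-content">I disagree.</div>
  <button class="CommentItemVoter-button"><span>icon</span><span>0</span></button>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
comments = []

# class_="CommentItem" matches whole class tokens only, so the nested
# "CommentItem-content" divs are not returned by this search.
for item in soup.find_all("div", class_="CommentItem"):
    spans = item.find("button", class_="CommentItemVoter-button").find_all("span")
    comments.append({
        "author": item.find("span", class_="UserLink").get_text(strip=True),
        "content": item.find("div", class_="CommentItem-content").get_text(strip=True),
        "upvote_count": int(spans[1].get_text(strip=True)),  # second span holds the count
    })

for comment in comments:
    print(comment)
```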

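　　7. Worked example

　　The following is a complete Python program that scrapes all answers present in the fetched page of a given Zhihu question, together with their related information: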

```
import requests
from bs4 import BeautifulSoup


# Fetch the page source
def get_html(url):
    response = requests.get(url, timeout=10)
    html = response.text
    return html


# Extract every answer found in the page
def get_answers(html):
    soup = BeautifulSoup(html, 'html.parser')
    answer_list = soup.find_all('div', {'class': 'List-item'})
    answers = []
    for answer in answer_list:
        author = answer.find('span', {'class': 'UserLink AuthorInfo-name'}).text
        content = answer.find('span', {'class': 'RichText'}).text.strip()
        upvote_count = int(answer.find('button', {'class': 'Button VoteButton VoteButton--up'}).find('span').text)
        comment_count = int(answer.find('a', {'class': 'Button ContentItem-action Button--plain Button--withIcon Button--withLabel'}).find_all('span')[1].text)
        answers.append({
            'author': author,
            'content': content,
            'upvote_count': upvote_count,
            'comment_count': comment_count
        })
    return answers


# Extract the question's title, description, and follower count
def get_question_info(html):
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('h1', {'class': 'QuestionHeader-title'}).text.strip()
    description = soup.find('div', {'class': 'QuestionRichText'}).text.strip()
    follower_count = int(soup.find('strong', {'class': 'NumberBoard-itemValue'}).text.replace(',', '').strip())
    return {
        'title': title,
        'description': description,
        'follower_count': follower_count
    }


# Extract every comment found in the page
def get_comments(html):
    soup = BeautifulSoup(html, 'html.parser')
    comment_list = soup.find_all('div', {'class': 'CommentItem'})
    comments = []
    for comment in comment_list:
        author = comment.find('span', {'class': 'UserLink AuthorInfo-name'}).text
        content = comment.find('span', {'class': 'ContentItem-time'}).find_next_sibling().text.strip()
        upvote_count = int(comment.find('button', {'class': 'Button CommentItemVoter-button Button--plain Button--withIcon'}).find_all('span')[1].text)
        comments.append({
            'author': author,
            'content': content,
            'upvote_count': upvote_count
        })
    return comments


# Program entry point
if __name__ == '__main__':
    url = "https://www.zhihu.com/question/XXXXXX"
    html = get_html(url)
    question_info = get_question_info(html)
    answers = get_answers(html)
    # Print the question info, then every answer and its related information
    print(question_info)
    for answer in answers:
        print(answer)
```
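　　The program above passes element text straight to `int()`, which raises `ValueError` as soon as the text is not a bare number. A small conversion helper is one way to make that step more forgiving. The sketch below is hypothetical: `parse_count` is not part of the program above, and the display formats it handles (a thousands separator, a leading label, a "万" suffix meaning ten thousand) are assumptions about what the page might show.

```python
import re


def parse_count(text, default=0):
    """Convert display text such as '1,234' or '1.2 万' to an int."""
    if not text:
        return default
    cleaned = text.replace(",", "")
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    if match is None:
        return default  # no digits at all, e.g. a label with no count yet
    value = float(match.group())
    if "万" in cleaned:  # assumed suffix for "ten thousand"
        value *= 10_000
    return int(round(value))


print(parse_count("1,234"))    # prints 1234
print(parse_count("1.2 万"))   # prints 12000
print(parse_count("赞同 56"))  # prints 56
print(parse_count(""))         # prints 0
```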

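　　8. Summary

　　This article introduced techniques for scraping Zhihu answers with Python and gave a complete Python program as an example. With these points in hand, you can obtain the information you need more efficiently. For more Python tips, follow the articles on 优采云 (www.ucaiyun.com).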
