Writing a High-Quality Zhihu Answer Crawler in Python to Fetch Answers Quickly

优采云, published 2023-03-09 18:13

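Zhihu is a high-quality Q&A community that holds a large body of knowledge and experience useful for study and research. Picking out the best of it by hand, however, is not easy. This article shows how to write a Zhihu answer crawler in Python that collects highly upvoted answers quickly, so that study and research take less effort.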

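1. Define the goal

Before writing the crawler, we need a clear goal. For example: collect every answer under a given topic that has more than 1000 upvotes and save it to a local file.

2. Analyze the page structure

With the goal set, we analyze the page structure and locate the information we need. For Zhihu, the browser developer tools or a third-party tool such as SelectorGadget can show the page structure and element attributes.

3. Fetch the page content

Fetching page content is the core of the crawler. We can send HTTP requests with Python's requests library and parse the HTML with BeautifulSoup or lxml, then extract and process the fields we need.

4. Handle pagination

To collect more than one page, we need to handle pagination, either by changing a URL parameter or by driving a browser with Selenium.

5. Store the data

The last step is storage. Data can go to a local file, a database, or cloud storage. Small datasets fit well in formats such as CSV or JSON; larger datasets call for a database or cloud storage.

That is the basic workflow of a Zhihu answer crawler. The rest of this article goes through the details and a concrete implementation of each step.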

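1. Define the goal

Defining the goal is the most important step in writing a crawler. It means answering the following questions:

- What information do we want to collect?

- In what form does that information appear on the page?

- Which pages do we need to crawl?

- How do we decide whether an answer qualifies?

Suppose we want every answer under the "Python" topic with more than 1000 upvotes, saved to a local file. Then:

- We collect the answer title, author, publication time, upvote count, comment count, and answer body.

- This information appears on the page as HTML tags and attributes.

- We crawl every page from "https://www.zhihu.com/topic/19552832/top-answers?page=1" through "https://www.zhihu.com/topic/19552832/top-answers?page=n".

- We keep only answers with more than 1000 upvotes.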
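The goal above can be written down as two small pieces of code before any crawling starts: the list of page URLs and the rule that decides whether an answer qualifies. The following is only a sketch. The helper names `build_page_urls` and `is_wanted` are made up for illustration, and the URL pattern assumes that pages are selected with a `page` query parameter.

```python
from typing import List

# Assumed URL pattern: pages are selected with a `page` query parameter.
TOPIC_URL = "https://www.zhihu.com/topic/19552832/top-answers?page={page}"
MIN_UPVOTES = 1000  # threshold taken from the stated goal


def build_page_urls(n: int) -> List[str]:
    """Return the URLs of pages 1..n (hypothetical helper)."""
    return [TOPIC_URL.format(page=i) for i in range(1, n + 1)]


def is_wanted(upvote_count: int) -> bool:
    """An answer qualifies only with strictly more than MIN_UPVOTES upvotes."""
    return upvote_count > MIN_UPVOTES
```

Calling `build_page_urls(3)` yields the URLs for pages 1 to 3, and `is_wanted` keeps the threshold in one place so that it is easy to change later.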

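2. Analyze the page structure

Analyzing the page structure is an indispensable step. With the browser developer tools (F12) or a third-party tool such as SelectorGadget, we can inspect the page source and element attributes and find the HTML tags and attributes that hold the information we need.

On Zhihu, each answer carries a unique data-entry-url attribute value, so that attribute can be used to locate each answer block:

```html
...
</div>
```

Each answer block contains a good deal of useful information:

```html
 ...
 </div>
```

From the source and the element attributes, we can conclude:

- Answer title: the `content` attribute of the `<meta itemprop="name">` tag.

- Author: the `content` attribute of the `<meta itemprop="author">` tag.

- Publication time: the `content` attribute of the `<meta itemprop="datePublished">` tag.

- Upvote count: the `content` attribute of the `<meta itemprop="upvoteCount">` tag.

- Comment count: the `content` attribute of the `<meta itemprop="commentCount">` tag.

- Answer body: inside the `<div class="RichContent">` tag.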
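To check a field mapping like this one offline, it can help to run it against a small hand-written snippet first. The sketch below collects the `itemprop` values from `<meta>` tags. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. `SAMPLE_HTML`, `MetaCollector`, and `collect_meta` are made up for illustration, and the sample markup is an assumption, not markup copied from Zhihu.

```python
from html.parser import HTMLParser

# Hand-written sample that mimics the structure described above.
# It is an assumption for illustration, not real Zhihu markup.
SAMPLE_HTML = """
<div class="List-item">
  <meta itemprop="name" content="How should I learn Python?">
  <meta itemprop="author" content="alice">
  <meta itemprop="datePublished" content="2023-01-01T08:00:00.000Z">
  <meta itemprop="upvoteCount" content="1523">
  <meta itemprop="commentCount" content="87">
  <div class="RichContent">Start with small projects.</div>
</div>
"""


class MetaCollector(HTMLParser):
    """Collect itemprop -> content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if "itemprop" in attr_map:
            self.fields[attr_map["itemprop"]] = attr_map.get("content")


def collect_meta(html: str) -> dict:
    """Parse an HTML string and return the collected meta fields."""
    parser = MetaCollector()
    parser.feed(html)
    return parser.fields
```

`collect_meta(SAMPLE_HTML)` returns a dict keyed by `itemprop` name. If a key is missing when the same check is run on a saved copy of the real page, the selector for that field needs another look.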

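3. Fetch the page content

Fetching the page content is the most central and most involved step. Here we send the HTTP request, parse the HTML document, and extract and process the fields we need.

First, import the required libraries:

```python
import requests
from bs4 import BeautifulSoup
```

Then define a function that returns every qualifying answer on a single page:

```python
def get_answers(url):
    """Return the answers on one page that have more than 1000 upvotes."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(response.text, 'lxml')  # the 'lxml' parser needs the lxml package
    answers = []
    # Each answer block is a <div class="List-item">.
    for answer in soup.find_all('div', {'class': 'List-item'}):
        upvote_tag = answer.find('meta', {'itemprop': 'upvoteCount'})
        if upvote_tag is None:
            continue  # not an answer block, or the markup differs
        upvote_count = int(upvote_tag['content'])
        if upvote_count > 1000:
            title = answer.find('meta', {'itemprop': 'name'})['content']
            author = answer.find('meta', {'itemprop': 'author'})['content']
            publish_time = answer.find('meta', {'itemprop': 'datePublished'})['content']
            comment_count = answer.find('meta', {'itemprop': 'commentCount'})['content']
            content = answer.find('div', {'class': 'RichContent'}).get_text()
            answers.append({'title': title, 'author': author, 'publish_time': publish_time,
                            'upvote_count': upvote_count, 'comment_count': comment_count,
                            'content': content})
    return answers
```

The function takes a URL and returns a list, answers, holding the extracted information. It first sends the HTTP request and parses the HTML document, then finds every answer block that qualifies (more than 1000 upvotes), and finally extracts the fields and appends them to answers.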

4. Handle pagination

To collect more than one page, we need to handle pagination, either by changing a URL parameter or by using Selenium.

The first method changes the URL parameter:

```python
answers = []
for i in range(1, n + 1):
    url = f'https://www.zhihu.com/topic/19552832/top-answers?page={i}'
    answers += get_answers(url)
```

Here n is the total number of pages to crawl.
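When n grows, a loop like the one above usually benefits from a pause between requests and from not losing everything when a single page fails. The following is a minimal sketch of that idea, not part of the original method. `crawl_pages` is a hypothetical helper, and the page fetcher is passed in as a function so that the loop can be tried out without touching the network.

```python
import time
from typing import Callable, Dict, List, Tuple


def crawl_pages(fetch_page: Callable[[int], List[Dict]], n: int,
                delay: float = 1.0) -> Tuple[List[Dict], List[Tuple[int, Exception]]]:
    """Fetch pages 1..n and return (collected items, failed pages)."""
    collected: List[Dict] = []
    failed: List[Tuple[int, Exception]] = []
    for page in range(1, n + 1):
        try:
            collected.extend(fetch_page(page))
        except Exception as exc:  # keep going and report the failure at the end
            failed.append((page, exc))
        if page < n:
            time.sleep(delay)  # pause between requests
    return collected, failed
```

With the function from step 3, a call might look like `crawl_pages(lambda i: get_answers(f'https://www.zhihu.com/topic/19552832/top-answers?page={i}'), n=10)`, where the URL pattern is again the assumed one. The second return value lists the pages that raised an error, so they can be retried later.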

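The second method uses Selenium:

```python
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

driver_path = '/path/to/chromedriver'
url = 'https://www.zhihu.com/topic/19552832/top-answers'

# Selenium 4 takes the chromedriver path through a Service object.
driver = webdriver.Chrome(service=Service(driver_path))
driver.get(url)

# Scroll to the bottom n times so that more answers are loaded.
for i in range(n):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)

source_code = driver.page_source
driver.quit()

# get_answers() takes a URL and would download the page again, so the
# rendered source is parsed here and the answer blocks are selected directly.
soup = BeautifulSoup(source_code, 'lxml')
answer_blocks = soup.find_all('div', {'class': 'List-item'})
```

This method launches Chrome and opens the given URL, then uses JavaScript to scroll to the bottom of the page n times, waiting 2 seconds after each scroll. Finally it takes the current page source and parses it. Each element of answer_blocks can then go through the same upvote filter and field extraction as the loop body of get_answers.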
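A fixed number of scrolls either stops too early or wastes time once the page has nothing more to load. One common variation, sketched here as an assumption and not as the method described above, is to keep scrolling until the page height stops growing. `scroll_to_end` is a hypothetical helper. It relies only on the driver's `execute_script` method, so any object that provides that method will do.

```python
import time


def scroll_to_end(driver, pause: float = 2.0, max_rounds: int = 20) -> int:
    """Scroll down until the page height stops growing.

    Returns the number of scrolls performed (at most max_rounds).
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds
```

`scroll_to_end(driver)` could stand in for the fixed `for` loop in the Selenium snippet. The `max_rounds` cap keeps the loop from running forever on a page that never stops loading.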

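5. Store the data

The last step is storage. Small datasets (a modest number of answers) fit well in formats such as CSV or JSON; large datasets are better kept in a database or in cloud storage (such as 优采云).

Taking CSV as the example, import the required library and define a function that saves the data to a CSV file:

```python
import csv

def save_to_csv(data, filename):
    """Write the list of answer dicts to a CSV file with a header row."""
    # newline='' stops the csv module from writing blank rows on Windows.
    with open(filename, mode='w', encoding='utf8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'author', 'publish_time', 'upvote_count',
                         'comment_count', 'content'])
        for item in data:
            writer.writerow([item['title'], item['author'], item['publish_time'],
                             item['upvote_count'], item['comment_count'],
                             item['content']])
```

The function takes two arguments: data is the list holding the extracted information, and filename is the output file name (including the path). It opens the file and creates a csv.writer object, writes the header row, and then writes the data row by row.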
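JSON, the other small-dataset format mentioned in this step, can be handled in much the same way. The sketch below uses only the standard `json` module. `save_to_json` and `load_from_json` are names made up for illustration.

```python
import json


def save_to_json(data, filename):
    """Write the list of answer dicts to a UTF-8 JSON file."""
    with open(filename, mode="w", encoding="utf8") as f:
        # ensure_ascii=False keeps Chinese text readable in the file.
        json.dump(data, f, ensure_ascii=False, indent=2)


def load_from_json(filename):
    """Read the list of answer dicts back from a JSON file."""
    with open(filename, encoding="utf8") as f:
        return json.load(f)
```

Unlike CSV, JSON keeps numbers as numbers, so an upvote count stored as an integer comes back as an integer.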

The complete code is below:

```python
import csv

import requests
from bs4 import BeautifulSoup

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}


def extract_answers(soup):
    """Return the answers in a parsed page that have more than 1000 upvotes."""
    answers = []
    for answer in soup.find_all('div', {'class': 'List-item'}):
        upvote_tag = answer.find('meta', {'itemprop': 'upvoteCount'})
        if upvote_tag is None:
            continue  # not an answer block, or the markup differs
        upvote_count = int(upvote_tag['content'])
        if upvote_count > 1000:
            title = answer.find('meta', {'itemprop': 'name'})['content']
            author = answer.find('meta', {'itemprop': 'author'})['content']
            publish_time = answer.find('meta', {'itemprop': 'datePublished'})['content']
            comment_count = answer.find('meta', {'itemprop': 'commentCount'})['content']
            content = answer.find('div', {'class': 'RichContent'}).get_text()
            answers.append({'title': title,
                            'author': author,
                            'publish_time': publish_time,
                            'upvote_count': upvote_count,
                            'comment_count': comment_count,
                            'content': content})
    return answers


def get_answers(url):
    """Download one page and return its qualifying answers."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    # The extraction loop lives in extract_answers() so that method 2 can
    # apply it to Selenium's page source as well.
    return extract_answers(BeautifulSoup(response.text, 'lxml'))


def save_to_csv(data, filename):
    """Write the list of answer dicts to a CSV file with a header row."""
    with open(filename, mode='w', encoding='utf8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['title',
                         'author',
                         'publish_time',
                         'upvote_count',
                         'comment_count',
                         'content'])
        for item in data:
            writer.writerow([item['title'],
                             item['author'],
                             item['publish_time'],
                             item['upvote_count'],
                             item['comment_count'],
                             item['content']])


if __name__ == '__main__':
    answers_total = []
    # Crawl the first n pages and keep the answers with more than 1000 upvotes.
    n = 10

    # Method 1: change the URL parameter
    for i in range(1, n + 1):
        url = f"https://www.zhihu.com/topic/19552832/top-answers?page={i}"
        print("Fetching page {}...".format(i))
        answers = get_answers(url)
        print("Got {} records".format(len(answers)))
        print("=" * 50)
        answers_total += answers

    # Method 2: use Selenium (uncomment this block and comment out method 1)
    # import time
    # from selenium import webdriver
    # from selenium.webdriver.chrome.service import Service
    #
    # # Path to chromedriver
    # driver_path = '/Users/benjamin/Documents/chromedriver'
    # url = 'https://www.zhihu.com/topic/19552832/top-answers'
    # driver = webdriver.Chrome(service=Service(driver_path))
    # driver.get(url)
    # # Scroll to the bottom n times
    # for i in range(n):
    #     driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    #     time.sleep(2)
    # source_code = driver.page_source
    # soup = BeautifulSoup(source_code, 'lxml')
    # answers = extract_answers(soup)
    # print("Got {} records".format(len(answers)))
    # print("=" * 50)
    # answers_total += answers
    # driver.quit()

    # Assumed final step: save the results with save_to_csv().
    # The file name is a placeholder.
    save_to_csv(answers_total, 'answers.csv')
```
