Scraping Web Pages to Generate E-books (Online Material as epub or mobi, on Windows, OS X, and Linux)

优采云 publication time: 2022-03-13 01:22

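　　Ever since I bought a Kindle, I have been thinking about how to get the most out of it. Plenty of books can be bought from 多多, and there are many free e-books online, but a lot of interesting content still exists only as web pages. For example, O'Reilly Atlas offers many e-books, but only for free online reading; beyond that, much material and documentation is available only in web form. So I wanted a way to convert this online material to epub or mobi format so that it can be read on a Kindle. This article describes how to do that with calibre and a small amount of code.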

　　Introduction to Calibre

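　　Calibre is a free e-book management tool that runs on Windows, OS X, and Linux. Fortunately, in addition to its GUI, calibre provides many command-line tools. One of them, the ebook-convert command, can follow a user-written recipe file (which is really Python code) to fetch the content of the specified pages and generate an e-book in mobi or another format. By writing a recipe, you can customize the scraping behavior to suit different page structures.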
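　　As a rough illustration of how ebook-convert is driven, the sketch below builds the command line from Python. The helper names and the file names in the closing comment are hypothetical; the only thing assumed about calibre is that ebook-convert takes an input file followed by an output file and picks the output format from the output file's extension.

```python
import subprocess


def build_convert_command(recipe_path, output_path):
    """Return the argument list for calibre's ebook-convert CLI."""
    # The input file comes first and the output file second; the output
    # format is inferred from the output file's extension (.mobi, .epub).
    return ["ebook-convert", recipe_path, output_path]


def convert(recipe_path, output_path):
    """Run ebook-convert and raise CalledProcessError on a non-zero exit."""
    subprocess.run(build_convert_command(recipe_path, output_path), check=True)


# Example with hypothetical file names; needs calibre's command-line tools:
# convert("git_pocket_guide.recipe", "git_pocket_guide.mobi")
```

　　Keeping the command construction separate from the call that runs it makes it possible to inspect the command without having calibre installed.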

　　Installing Calibre

　　Calibre is available from its download page, where you can get the installer for your operating system.

　　On Linux, it can also be installed from the software repositories:

　　Arch Linux:

　　pacman -S calibre

　　Debian/Ubuntu:

　　apt-get install calibre

　　Red Hat/Fedora/CentOS:

　　yum -y install calibre

　　Note that on OS X the command-line tools need to be installed separately.
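　　After installing, one quick way to confirm that the command-line tools are reachable is to look for ebook-convert on the PATH. This is a minimal sketch using only the Python standard library; the function names are made up for illustration.

```python
import shutil


def find_ebook_convert():
    """Return the full path to ebook-convert, or None if it is not on PATH."""
    return shutil.which("ebook-convert")


def describe_install(path):
    """Turn the lookup result into a short human-readable message."""
    if path is None:
        return "ebook-convert not found on PATH; install the command-line tools first"
    return "ebook-convert found at " + path


print(describe_install(find_ebook_convert()))
```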

　　Scraping web pages to generate an e-book

　　The following uses Git Pocket Guide as an example to show how to generate an e-book from web pages with calibre.

　　Finding the index page

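　　To scrape a whole book, first find its index page, which is usually the Table of Contents, where each entry links to the corresponding content page. The index page determines which pages are fetched when the e-book is generated and the order in which their content is arranged. In this example, the index page is 61/index.html.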
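　　To make the role of the index page concrete, here is a small standalone sketch that pulls the ordered links out of a table of contents. It uses Python's standard html.parser rather than calibre, and the sample HTML (a div with class "toc", plus invented titles and file names) is an assumption made purely for illustration.

```python
from html.parser import HTMLParser


class TocLinkParser(HTMLParser):
    """Collect (title, href) pairs for links inside <div class="toc">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0        # 0 means we are outside the TOC div
        self.current_href = None  # href of the <a> currently being read
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1  # nested div inside the TOC
            elif "toc" in (attrs.get("class") or "").split():
                self.div_depth = 1   # entering the TOC div
        elif tag == "a" and self.div_depth > 0 and attrs.get("href"):
            self.current_href = attrs["href"]
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "a" and self.current_href is not None:
            title = "".join(self.current_text).strip()
            self.links.append((title, self.current_href))
            self.current_href = None


def toc_links(html):
    """Return the TOC links in document order."""
    parser = TocLinkParser()
    parser.feed(html)
    return parser.links


# Invented sample page: only the links inside the TOC div are collected.
SAMPLE = """
<div class="toc">
  <a href="ch01.html">Chapter 1</a>
  <a href="ch01.html#intro">A section</a>
  <a href="ch02.html">Chapter 2</a>
</div>
<a href="about.html">About</a>
"""

for title, href in toc_links(SAMPLE):
    print(title, "->", href)
```

　　The order of the collected links is the order of the entries on the page, which is what lets the index page dictate how the book's content is arranged.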

　　Writing the recipe

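　　A recipe is a script with the .recipe extension. Its content is actually Python code that defines the scope and behavior of calibre's page scraping. The following recipe scrapes Git Pocket Guide: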

from calibre.web.feeds.recipes import BasicNewsRecipe

class Git_Pocket_Guide(BasicNewsRecipe):
    title = 'Git Pocket Guide'
    description = ''
    cover_url = 'http://akamaicovers.oreilly.com/images/0636920024972/lrg.jpg'
    url_prefix = '1230000000561/'
    no_stylesheets = True
    keep_only_tags = [{ 'class': 'chapter' }]

    def get_title(self, link):
        return link.contents[0].strip()

    def parse_index(self):
        soup = self.index_to_soup(self.url_prefix + 'index.html')
        div = soup.find('div', { 'class': 'toc' })
        articles = []
        for link in div.findAll('a'):
            if '#' in link['href']:
                continue
            if 'ch' not in link['href']:
                continue
            til = self.get_title(link)
            url = self.url_prefix + link['href']
            a = { 'title': til, 'url': url }
            articles.append(a)
        ans = [('Git_Pocket_Guide', articles)]
        return ans

　　The different parts of the code are explained below.

　　Overall structure

　　In general, a recipe is a Python class, and that class must inherit from calibre.web.feeds.recipes.BasicNewsRecipe.

　　Parsing the index

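　　The core method of the whole recipe is parse_index, which is also the only method a recipe must implement. Its job is to analyze the content of the index page and return a slightly complex data structure (described later) that defines the content of the entire e-book and the order in which that content is arranged.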
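　　As a rough preview, the shape of that structure can be read off the last lines of the recipe above, which return [('Git_Pocket_Guide', articles)]: a list of (section title, article list) pairs, where each article is a dict with 'title' and 'url' keys. The sketch below builds the same shape without calibre; the helper name build_index and the sample chapter titles and file names are hypothetical.

```python
def build_index(section_title, links):
    """Build a parse_index-style result from (title, url) pairs."""
    # One dict per page to fetch, in the order it should appear in the book.
    articles = [{"title": title, "url": url} for title, url in links]
    # A list of (section title, articles) tuples; this book has one section.
    return [(section_title, articles)]


index = build_index(
    "Git_Pocket_Guide",
    [("Chapter 1", "ch01.html"), ("Chapter 2", "ch02.html")],
)

for section, articles in index:
    print(section, [a["url"] for a in articles])
```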

　　Global attribute settings

　　At the top of the class, some global attributes are defined:

title = 'Git Pocket Guide'
description = ''
cover_url = 'http://akamaicovers.oreilly.com/images/0636920024972/lrg.jpg'
url_prefix = '1230000000561/'
no_stylesheets = True
keep_only_tags = [{ 'class': 'chapter' }]

　　The parse_index return value

　　The data structure that parse_index needs to return is described below by analyzing the index page.