Python Web Scraping: Reading Documents | Day 05

Published by 优采云: 2020-08-14 08:59

User: 你好我是森林

Date: 2018-04-01

Mark: Web Scraping with Python (《Python网络数据采集》)

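Articles in the web scraping series

Python Web Scraping: Building a Scraper

Python Web Scraping: Parsing HTML

Python Web Scraping: Starting to Crawl

Python Web Scraping: Using APIs

Python Web Scraping: Storing Data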

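Reading Documents

Document Encoding

A document's encoding can usually be guessed from its file extension, although the extension is chosen by the developer rather than dictated by the encoding. At the lowest level, every document is just a sequence of 0s and 1s. For example, if we take an image with a .png extension, rename it to .py, and open it in a text editor, what the editor shows is completely wrong.

With the right libraries installed, Python can handle almost any type of document. The only difference between a plain-text file, a video file, and an image file is how their 0s and 1s are translated for the user.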
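
To make the point about 0s and 1s concrete, here is a minimal sketch that inspects the first bytes of some data instead of trusting a file extension. The helper name looks_like_png and the sample byte strings are illustrative assumptions, not part of the original article; the eight-byte signature is the one defined by the PNG format.

```python
# The first eight bytes of every PNG file, as defined by the PNG format.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def looks_like_png(raw: bytes) -> bool:
    """Return True if the bytes start with the PNG signature (hypothetical helper)."""
    return raw.startswith(PNG_SIGNATURE)

# Illustrative data: a PNG signature followed by filler, and some Python source.
fake_png = PNG_SIGNATURE + b"\x00\x00\x00\rIHDR"
python_source = "print('hello')".encode("utf-8")

# Renaming a file does not change its bytes, so the check ignores the name.
print(looks_like_png(fake_png))
print(looks_like_png(python_source))

# Forcing the PNG bytes through a text decoder shows why an editor displays garbage.
print(repr(fake_png.decode("utf-8", errors="replace")[:4]))
```

The same bytes are valid as an image header and nonsense as text, which is why the renamed file looks wrong in an editor.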

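Plain Text

Fetching a plain-text file is simple. Normally, after retrieving a page with urlopen, we turn it into a BeautifulSoup object to parse the HTML. A plain-text file has no HTML to parse, so we can read the response directly.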

    from urllib.request import urlopen

    # No HTML to parse, so read the response body directly (read() returns bytes)
    textPage = urlopen("http://www.pythonscraping.com/pages/warandpeace/chapter1.txt")
    print(textPage.read())
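
Because read() returns bytes rather than a string, the print call above shows a b'...' literal. A minimal sketch of decoding those bytes explicitly follows; the helper decode_text and the sample text are assumptions made for illustration, and the right encoding depends on the file being fetched.

```python
def decode_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw response bytes, replacing bytes that are invalid in `encoding`."""
    return raw.decode(encoding, errors="replace")

# Stand-in for textPage.read(); the sample text is made up for illustration.
sample = "Война и мир, Chapter 1".encode("utf-8")

print(sample[:5])            # raw bytes, printed with a b'' prefix
print(decode_text(sample))   # readable text
```

With a real response the call would look like decode_text(textPage.read()), assuming the server sends UTF-8.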

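CSV Files

Python's standard library has a csv module that makes CSV handling straightforward and copes with many CSV variants; see the csv module documentation for details.

Reading CSV Files

Python's csv library is geared toward local files, meaning it assumes the CSV file is stored on your computer. When scraping the web, however, many files live online. There are several ways to work around this.

For example, the following fetches an online CSV file, wraps the decoded text in a StringIO object so that csv can read it like a file, and prints each row to the command line.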

    from urllib.request import urlopen
    from io import StringIO
    import csv

    # Download the CSV, decode the bytes to text, and wrap the text in a file-like object
    data = urlopen("http://pythonscraping.com/files/MontyPythonAlbums.csv").read().decode('ASCII', 'ignore')
    dataFile = StringIO(data)
    csvReader = csv.reader(dataFile)

    # Each row comes back as a list of strings
    for row in csvReader:
        print(row)

Output:

    ['Name', 'Year']
    ["Monty Python's Flying Circus", '1970']
    ['Another Monty Python Record', '1971']
    ["Monty Python's Previous Record", '1972']
    ['The Monty Python Matching Tie and Handkerchief', '1973']
    ['Monty Python Live at Drury Lane', '1974']
    ['An Album of the Soundtrack of the Trailer of the Film of Monty Python and the Holy Grail', '1975']
    ['Monty Python Live at City Center', '1977']
    ['The Monty Python Instant Record Collection', '1977']
    ["Monty Python's Life of Brian", '1979']
    ["Monty Python's Cotractual Obligation Album", '1980']
    ["Monty Python's The Meaning of Life", '1983']
    ['The Final Rip Off', '1987']
    ['Monty Python Sings', '1989']
    ['The Ultimate Monty Python Rip Off', '1994']
    ['Monty Python Sings Again', '2014']
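
The header row ['Name', 'Year'] arrives as ordinary data in the output above. If it should be treated as column names instead, csv.DictReader from the same standard module is one option. The sketch below uses a short inline string as a stand-in for the downloaded text, so the sample data is an assumption rather than the full file.

```python
import csv
from io import StringIO

# Inline sample standing in for the downloaded text; the first line is the header row.
data = "Name,Year\nMonty Python's Flying Circus,1970\nAnother Monty Python Record,1971\n"

dictReader = csv.DictReader(StringIO(data))
print(dictReader.fieldnames)   # column names taken from the first row

# Each remaining row is a dict keyed by column name
for row in dictReader:
    print(row["Name"], "was released in", row["Year"])
```

The trade-off is a little extra overhead per row in exchange for access by column name instead of by position.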

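PDF Files

PDFMiner3K is a very handy library (a Python 3.x port of PDFMiner). It is flexible: it can be used from the command line or integrated into code. It also handles a variety of language encodings and works conveniently with files fetched from the web.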

Download the module's source files, unzip them, and install with python setup.py install.

For example, the following reads an arbitrary PDF into a string, using a StringIO object as the buffer that collects the extracted text.

    from urllib.request import urlopen
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from io import StringIO
    from io import open

    def readPDF(pdfFile):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()  # buffer that collects the extracted text
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, laparams=laparams)

        process_pdf(rsrcmgr, device, pdfFile)
        device.close()

        content = retstr.getvalue()
        retstr.close()
        return content

    pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
    outputString = readPDF(pdfFile)
    print(outputString)
    pdfFile.close()

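The most useful property of readPDF is that, if the PDF file is on your own computer, you can simply replace the pdfFile object returned by urlopen with an ordinary open() file object:

    pdfFile = open("./chapter1.pdf", 'rb')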
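
Since both urlopen and open() return binary file-like objects, the choice between them can be wrapped in a small function. This is only a sketch: open_pdf_source is a hypothetical helper that is not in the original article, and it assumes anything starting with http:// or https:// is remote and everything else is a local path.

```python
from urllib.request import urlopen

def open_pdf_source(location: str):
    """Return a binary file-like object for a URL or a local path (hypothetical helper)."""
    if location.startswith(("http://", "https://")):
        return urlopen(location)      # remote file: HTTP response object
    return open(location, "rb")       # local file: ordinary binary file object

# Usage sketch (assumes ./chapter1.pdf exists and readPDF is defined as above):
# with open_pdf_source("./chapter1.pdf") as pdfFile:
#     print(readPDF(pdfFile))
```

Either return value supports read() and close(), which is all readPDF needs from its argument in the article's example.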

If this article helped you, feel free to like or comment; if you are also interested in web scraping, follow me to receive future updates. Thanks for reading.
