Scraping web page audio (using Python to crawl resource files from any web page, with one-click crawling of media resource files)

Published by 优采云: 2021-12-30 07:07

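Preface

A common way to scrape resource files such as images, audio, and video from any web page with Python is to request the page's HTML and then pull out the resources you want with XPath or regular expressions. I built a crawler tool that grabs media resource files in one click. Note, however, that it only extracts resources that are present in the HTML returned by the first request. Anything that needs a second request, such as the Kugou Music player page, cannot be crawled, because the tool is meant to be a general-purpose one that works across different websites.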
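As a rough illustration of the regular-expression route described above, the sketch below pulls quoted resource links out of an HTML string by file extension. The function name `find_media_urls`, the extension list, and the sample markup are assumptions made for this example; they are not taken from the author's tool.

```python
import re

# Illustrative extension list (an assumption, not the tool's real configuration).
MEDIA_EXTENSIONS = ("jpg", "png", "gif", "mp3", "wav", "mp4")

def find_media_urls(html, extensions=MEDIA_EXTENSIONS):
    """Return quoted http(s) links in html that end with one of the extensions."""
    ext_group = "|".join(re.escape(ext) for ext in extensions)
    pattern = re.compile(
        r"""["'](https?://[^"'\s]+?\.(?:%s))["']""" % ext_group,
        re.IGNORECASE,
    )
    urls = []
    for match in pattern.finditer(html):
        url = match.group(1)
        if url not in urls:  # keep the first occurrence, drop duplicates
            urls.append(url)
    return urls

sample_html = """
<img src="https://example.com/a/cover.jpg">
<audio src='https://example.com/media/song.mp3'></audio>
<img src="https://example.com/a/cover.jpg">
<a href="https://example.com/page.html">link</a>
"""
print(find_media_urls(sample_html))
```

Only links that already appear in the HTML text are found, which matches the limitation noted above: resources loaded by a later request never show up in this string.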

Image scraping is the main focus here. If you need image material, enter a URL and grab the images in one click!

In addition, magnet links are also captured when scraping videos. You can download them with a third-party download tool.
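To make the magnet-link remark concrete, here is a minimal sketch that picks `magnet:?` links out of HTML and reads the info hash from the `xt` parameter using only the standard library. The helper name `find_magnet_links` and the sample link are made up for illustration, and the hash is a placeholder rather than a real torrent.

```python
import re
from urllib.parse import urlparse, parse_qs

def find_magnet_links(html):
    """Return (link, info_hash) pairs for every magnet link found in html."""
    results = []
    for link in re.findall(r"magnet:\?[^\"'\s<>]+", html):
        link = link.replace("&amp;", "&")  # undo HTML entity escaping
        params = parse_qs(urlparse(link).query)
        # xt looks like "urn:btih:<hash>"; keep the part after the last colon
        xt = params.get("xt", [""])[0]
        results.append((link, xt.rsplit(":", 1)[-1]))
    return results

sample = '<a href="magnet:?xt=urn:btih:0123456789abcdef0123456789abcdef01234567&amp;dn=demo">demo</a>'
for link, info_hash in find_magnet_links(sample):
    print(info_hash)
```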

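Code for crawling resource files

The only thing that needs explaining here is that some image resources are not URL links but inline data in the data:image format. These have to be decoded before they can be saved.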
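The decoding step can be sketched in isolation as follows. The helper name `decode_data_uri` and the sample payload are assumptions for illustration, and the sketch only covers base64-encoded data URIs.

```python
import base64

def decode_data_uri(data_uri):
    """Split a base64 data:image URI into (file_extension, raw_bytes)."""
    header, _, payload = data_uri.partition("base64,")
    # header looks like "data:image/png;" -> "png"
    extension = header.split("/")[-1].rstrip(";")
    # Restore any padding that was stripped; -len % 4 is 0 when none is needed
    payload += "=" * (-len(payload) % 4)
    return extension, base64.b64decode(payload)

# "aGVsbG8" is base64 for b"hello" with its padding removed;
# it stands in for real image bytes here.
ext, raw = decode_data_uri("data:image/png;base64,aGVsbG8")
print(ext, raw)

# Saving is then an ordinary binary write:
# with open("image." + ext, "wb") as f:
#     f.write(raw)
```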

import base64

# requestsDataBase, repairUrl, zfjTools and the three *Type_list globals are
# defined elsewhere in the author's project and are not shown in this post.

def getResourceUrlList(url, isImage, isAudio, isVideo):
    global imgType_list, audioType_list, videoType_list
    imageUrlList = []
    audioUrlList = []
    videoUrlList = []
    url = url.rstrip().rstrip('/')
    htmlStr = str(requestsDataBase(url))
    # Save a copy of the fetched HTML to disk
    with open('reptileHtml.txt', 'w', encoding='utf-8') as htmlFile:
        htmlFile.write(htmlStr)
    segmenterStr = '"'
    for line in htmlStr.splitlines():
        # Normalise single quotes so every attribute value is delimited by "
        line = line.replace("'", '"')
        for partLine in line.split(segmenterStr):
            if isImage:
                # Inline images: data:image/<type>;base64,<payload>
                if 'data:image' in partLine:
                    base64List = partLine.split('base64,')
                    b64Data = base64List[-1]
                    # Pad to a multiple of 4 before decoding
                    imgData = base64.b64decode(b64Data + '=' * (-len(b64Data) % 4))
                    base64ImgType = base64List[0].split('/')[-1].rstrip(';')
                    imageName = zfjTools.getTimestamp() + '.' + base64ImgType
                    # Store the decoded bytes under a timestamp-based file name
                    with open(imageName, 'wb') as imgFile:
                        imgFile.write(imgData)
                    imageUrlList.append(imageName + '$==$' + base64ImgType)
                # Images referenced by URL
                for imageType in imgType_list:
                    if imageType in partLine:
                        imgUrl = partLine[:partLine.find(imageType) + len(imageType)].split(segmenterStr)[-1]
                        # Repair the URL
                        imgUrl = repairUrl(imgUrl, url)
                        # Drop the '_{size}' placeholder that some sites embed
                        imgUrl = imgUrl.replace('_{size}', '').strip()
                        if imgUrl.startswith(('http://', 'https://')) and imgUrl not in imageUrlList:
                            imageUrlList.append(imgUrl)
            if isAudio:
                # Look for audio files
                for audioType in audioType_list:
                    if audioType in partLine or audioType.lower() in partLine:
                        # Use whichever spelling of the extension actually occurs
                        matchedType = audioType.lower() if audioType.lower() in partLine else audioType
                        audioUrl = partLine[:partLine.find(matchedType) + len(matchedType)].split(segmenterStr)[-1]
                        # Repair the URL
                        audioUrl = repairUrl(audioUrl, url)
                        if audioUrl.startswith(('http://', 'https://')) and audioUrl not in audioUrlList:
                            audioUrlList.append(audioUrl)
            if isVideo:
                # Look for video files (magnet, ed2k and ftp links are kept too)
                for videoType in videoType_list:
                    if videoType in partLine or videoType.lower() in partLine:
                        matchedType = videoType.lower() if videoType.lower() in partLine else videoType
                        videoUrl = partLine[:partLine.find(matchedType) + len(matchedType)].split(segmenterStr)[-1]
                        # Repair the URL
                        videoUrl = repairUrl(videoUrl, url)
                        if videoUrl.startswith(('http://', 'https://', 'ed2k://', 'magnet:?', 'ftp://')) and videoUrl not in videoUrlList:
                            videoUrlList.append(videoUrl)
    return (imageUrlList, audioUrlList, videoUrlList)
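`repairUrl` is called in the function above, but its body is not shown in the post. One plausible way to turn relative or protocol-relative links into absolute ones is `urllib.parse.urljoin`. The function below is named `repair_url_sketch` to keep it distinct from the author's helper, and it is only a guess at the intended behaviour.

```python
from urllib.parse import urljoin

def repair_url_sketch(resource_url, page_url):
    """Resolve resource_url against page_url; absolute URLs pass through."""
    resource_url = resource_url.strip()
    # These schemes have nothing to resolve, so return them untouched
    if resource_url.startswith(("magnet:?", "ed2k://", "ftp://", "data:")):
        return resource_url
    # urljoin handles "/path", "//host/path", "file.ext" and full URLs
    return urljoin(page_url, resource_url)

page = "https://example.com/album/index.html"
print(repair_url_sketch("/img/a.png", page))
print(repair_url_sketch("//cdn.example.com/a.mp3", page))
print(repair_url_sketch("cover.jpg", page))
```

Relative paths follow the usual browser rule: they are resolved against the directory of the page URL, so a page URL whose trailing slash has been stripped is treated as a file rather than a directory.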


Crawling custom nodes

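# Generic node crawling
from lxml import etree

def getNoteInfors(url, fatherNode, childNode):
    url = url.rstrip().rstrip('/')
    htmlStr = requestsDataBase(url)
    # Save a copy of the fetched HTML to disk
    with open('reptileHtml.txt', 'w', encoding='utf-8') as htmlFile:
        htmlFile.write(htmlStr)
    html_etree = etree.HTML(htmlStr)
    dataArray = []
    if html_etree is not None:
        # fatherNode is an XPath that selects the repeating parent elements
        nodes_list = html_etree.xpath(fatherNode)
        for k_value in nodes_list:
            # childNode is evaluated relative to each parent; keep the first hit
            partValue = k_value.xpath(childNode)
            if len(partValue) > 0:
                dataArray.append(partValue[0])
    return dataArray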
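To show what the `fatherNode` and `childNode` arguments might look like, the sketch below runs the same parent-then-child XPath pattern on an inline HTML string instead of a downloaded page. It requires the third-party `lxml` package. The markup, the XPath expressions, and the helper name `extract_nodes` are invented for this example.

```python
from lxml import etree

def extract_nodes(html, father_xpath, child_xpath):
    """For each node matching father_xpath, collect the first child_xpath hit."""
    root = etree.HTML(html)
    if root is None:
        return []
    values = []
    for node in root.xpath(father_xpath):
        hits = node.xpath(child_xpath)
        if hits:
            values.append(hits[0])
    return values

sample_html = """
<ul class="songs">
  <li><a href="/play/1">First song</a></li>
  <li><a href="/play/2">Second song</a></li>
  <li>No link here</li>
</ul>
"""
# Parent XPath picks each <li>; child XPath is relative to that <li>
print(extract_nodes(sample_html, '//ul[@class="songs"]/li', './a/@href'))
print(extract_nodes(sample_html, '//ul[@class="songs"]/li', './a/text()'))
```

Parents with no matching child, like the third list item here, are simply skipped, which mirrors the length check in the function above.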


Software

Software download: /zfj1128/ZFJ...

Tutorial videos

Resource crawling: link: /s/1xa9ruF_h... password: 1zpg

Node crawling: link: /s/1ebWWYtjo... password: cosa

[Screenshots of the tool in use]

Closing remarks

Comments and suggestions are welcome!
