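Website content (Python自学技术交流: I spent the last few days writing image-scraping code)

Published by 优采云: 2022-03-31 10:09

For more tutorials, visit 罗亮博客 (Luo Liang's blog). For help, visit Python自学技术交流 (the Python self-study tech exchange).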

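Over the past few days I have been writing image-scraping code: the XXOO site, the girl pictures on 煎蛋网, the 桌酷 wallpaper site, and the best-universities ranking.

Anyone who wants to see all of the code can pull it from git.

Here is the repository address:

Lately I have been wondering: is scraping text harder than scraping images?

Today I picked a blog site and gave it a try. I started by scraping a single article and successfully extracted its title and body. Then I tried saving it locally, and that worked.

Next I looked at the following page. In every article's source, the title and body sit in the same place.

Then I tried crawling the next 10 pages and found that some articles could not be saved. The error was an encoding problem, even though my code sets the encoding on every response, and I have not found a fix yet.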
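One possible cause, offered as an assumption because the traceback is not shown: setting `encoding` on the response only controls how requests decodes the page. `open()` called without an `encoding` argument falls back to the platform default, which on a Chinese-locale Windows machine is typically GBK, so writing a character outside that code page raises `UnicodeEncodeError`. A minimal sketch, where `save_text` is a hypothetical helper and a temporary directory stands in for the real output folder:

```python
import os
import tempfile


def save_text(path, text):
    """Write text as UTF-8, independent of the platform default encoding."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)


# A string mixing Chinese text and an emoji; GBK cannot encode the emoji.
sample = '标题 title \U0001F600'
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
save_text(demo_path, sample)

# Read the file back with the same encoding to confirm nothing was lost.
with open(demo_path, encoding='utf-8') as f:
    restored = f.read()
```

If this assumption holds, passing `encoding='utf-8'` to every `open()` call in the crawler should let those articles be saved.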

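In the end I took the crude route and wrapped the code in try and except to filter those articles out.

If an article cannot be saved, it is simply skipped and the crawl continues with the next article.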
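The skip-on-failure idea can be sketched on its own, without any network access. Here `process_all` and `fake_save` are hypothetical names used only for illustration; `fake_save` stands in for saving an article and fails on text that GBK cannot encode. Catching `Exception` instead of `BaseException` keeps Ctrl+C working, and recording the failures makes it possible to look at them later.

```python
def process_all(items, handler):
    """Run handler on every item; record failures instead of stopping."""
    failed = []
    for item in items:
        try:
            handler(item)
        except Exception as exc:  # Exception, not BaseException: Ctrl+C still stops the run
            failed.append((item, exc))
            continue  # move on to the next item
    return failed


def fake_save(text):
    # Stand-in for saving an article: raises if GBK cannot encode the text.
    text.encode('gbk')


failures = process_all(['hello', 'bad \U0001F600', '你好'], fake_save)
```

After the run, `failures` holds one `(item, exception)` pair for each skipped item.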

I then changed the code and cleaned up how the file names are generated.
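As a rough sketch of file-name cleanup: the crawler's code strips newlines, carriage returns, and spaces from the title. The hypothetical helper `safe_filename` below goes one step further, which is my own assumption rather than the author's method, and also removes the characters Windows does not allow in file names.

```python
import re

# Characters Windows rejects in file names, plus any whitespace.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\s]+')


def safe_filename(title, fallback='untitled'):
    """Return title with whitespace and reserved characters removed."""
    cleaned = _ILLEGAL.sub('', title)
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or fallback


name = safe_filename(' Python: requests/bs4 \r\n 入门? ')
```

The result could then be used as `s_path + name + '.txt'` when opening the output file.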


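Later I changed the code to crawl all of the pages. I also brought in the time module to sleep between requests, so that my IP does not get blocked for hitting the site too often.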
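A minimal sketch of pausing between requests, assuming a fixed delay of 10 seconds as in the crawler's `sleep(10)`. The generator name `throttled` and its `sleep` parameter are hypothetical; the parameter exists only so the pause can be swapped out when trying the code without waiting.

```python
import time


def throttled(items, delay=10.0, sleep=time.sleep):
    """Yield items one by one, pausing before every item except the first."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay)
        yield item


# Demo: record the pauses in a list instead of actually sleeping.
pauses = []
visited = list(throttled(['a', 'b', 'c'], delay=10.0, sleep=pauses.append))
```

In the crawler this would read `for url in throttled(blog_urls): ...`, which keeps the delay logic out of the loop body.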

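What is still missing: I have not set up access through IP proxies, and I have not used the scrapy framework yet.

I recently got a free Alibaba Cloud server, so after making these changes I put the script on the server and ran it there.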


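The code is pasted directly below. This time it does not have many comments, so try it out yourself.

Doing so will make it clear what each line of code does.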

import os
from time import sleep

import bs4
import requests
from bs4 import BeautifulSoup

url_list = []
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}


def url_all():
    # Build the URLs of list pages 1 through 400.
    for page in range(1, 401):
        url = 'http://blog.csdn.net/?ref=toolbar_logo&page=' + str(page)
        url_list.append(url)


def essay_url():
    # Collect the address of every article linked from the list pages.
    blog_urls = []
    for url in url_list:
        html = requests.get(url, headers=headers)
        html.encoding = html.apparent_encoding
        soup = BeautifulSoup(html.text, 'html.parser')
        for h3 in soup.find_all('h3'):
            link = h3.find('a', href=True)
            if link is not None:  # skip headings that contain no link
                blog_urls.append(link['href'])
    return blog_urls


def save_path():
    # Create the output directory if it does not exist yet.
    s_path = 'D:/blog/'
    os.makedirs(s_path, exist_ok=True)
    return s_path


def save_essay(blog_urls, s_path):
    # Fetch each article, then save its title and body to a text file.
    for url in blog_urls:
        blog_html = requests.get(url, headers=headers)
        blog_html.encoding = blog_html.apparent_encoding
        soup = BeautifulSoup(blog_html.text, 'html.parser')
        try:
            for title in soup.find('span', {'class': 'link_title'}):
                if isinstance(title, bs4.element.Tag):
                    print('-----Article title-----:', title.text)
                    # Strip line breaks and spaces so the title can be a file name.
                    blogname = title.text
                    blogname = blogname.replace("\n", '')
                    blogname = blogname.replace("\r", '')
                    blogname = blogname.replace(" ", '')
                    try:
                        # Explicit UTF-8 so the write does not depend on the OS default encoding.
                        with open(s_path + blogname + '.txt', 'w', encoding='utf-8') as file:
                            file.write(title.text)
                    except Exception as a:
                        print(a)
            for p in soup.find('div', {'class': 'article_content'}).children:
                if isinstance(p, bs4.element.Tag):
                    try:
                        with open(s_path + blogname + '.txt', 'a', encoding='utf-8') as file:
                            file.write(p.text)
                    except Exception as f:
                        print(f)
        except Exception as b:
            # Skip any article that fails and carry on with the next one.
            print(b)
        sleep(10)  # pause between articles to avoid hammering the site
    print('---------------All pages traversed----------------')


url_all()
save_essay(essay_url(), save_path())

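I bought three books but have not read them lately. They mostly cover the built-in modules.

Lately I have found web scraping really interesting, so I will keep digging into it.

I hope to learn the scrapy framework as well, so that I can become a simple crawler engineer.

Haha, that sounds almost too good, doesn't it?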
