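Automatically scraping web page data (I had not blogged for over half a year and my rank dropped another 100,000, so I thought of scraping my blog automatically)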

Published by 优采云: 2021-11-03 04:14

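I had not written a blog post in half a year and my rank had dropped by 100,000, so I decided to scrape my blog rank automatically.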

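At first I looked for a simple web API, but could not find one. The web is full of Python code that fetches and parses an entire page, so I followed the crowd, copied some, and debugged it myself.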

Once I had the data and wanted to save it to a file, I saw that someone wrote straight to Excel, so I simply did the same.
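For comparison, the same "append one row per run" idea can be done with only the standard library by writing CSV instead of Excel. This is a minimal sketch, not what the final code uses: the helper name `append_row`, the file name `rank.csv` and the header names are all assumptions made up for illustration.

```python
import csv
import os


def append_row(path, row, header=("date", "rank")):
    """Append one row to a CSV file, writing the header first if the file is new.

    `append_row` and the default header are hypothetical names for this sketch.
    """
    is_new = not os.path.exists(path)
    # newline="" lets the csv module control line endings itself
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerow(row)
```

Calling `append_row("rank.csv", [timestamp, rank])` once a day would grow the file by one line per run, and Excel can still open the result.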

Finally, I set up a daily scheduled task in the computer's Task Scheduler, and that was it.
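The daily task can also be registered from the command line with Windows' `schtasks` tool rather than through the Task Scheduler GUI. The sketch below only builds the argument list; the task name `csdn_rank`, the script path and the start time are assumptions, and the call that would actually create the task is left commented out.

```python
import subprocess
import sys


def build_daily_task_command(task_name, script_path, start_time="08:00"):
    """Return the schtasks argument list for a task that runs script_path once a day.

    The function name and all argument values are hypothetical.
    """
    # /TR takes the command line to run as one string; quote both paths in case of spaces
    run_command = f'"{sys.executable}" "{script_path}"'
    return [
        "schtasks", "/Create",
        "/SC", "DAILY",       # schedule type: every day
        "/TN", task_name,     # task name shown in Task Scheduler
        "/TR", run_command,   # what to run
        "/ST", start_time,    # start time, HH:mm (24-hour)
    ]


if __name__ == "__main__":
    cmd = build_daily_task_command("csdn_rank", r"C:\scripts\csdn_rank.py")
    print(cmd)
    # Uncomment on Windows to actually register the task:
    # subprocess.run(cmd, check=True)
```

Building the list separately from running it makes it easy to inspect the command before anything is registered.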

Reference:

My final code:

from requests import get
import traceback
from bs4 import BeautifulSoup
import datetime
import os
import xlwt
import xlrd
#import xlutils
#from xlutils.copy import copy
import xlutils.copy

# Save the data
def save_to_excel(r1):
    # Take a different path depending on whether the file already exists
    sFile = "csdn.xls"
    if os.path.exists(sFile):
        open_excel = xlrd.open_workbook(sFile)  # read the existing Excel file
        rows = open_excel.sheets()[0].nrows  # number of rows already present
        workbook = xlutils.copy.copy(open_excel)  # convert the xlrd object into an xlwt object
        table = workbook.get_sheet(0)  # get the sheet to write to from the xlwt object
        print("Excel file already exists, saving data......")
    else:
        workbook = xlwt.Workbook(encoding="utf-8")
        table = workbook.add_sheet("Sheet")
        head = ["Date", "Overall rank"]
        # Write the header row
        for i, head_item in enumerate(head):
            table.write(0, i, head_item)
        rows = 1
        print("First run: created the Excel file, saving data......")
    # Write the new row after the last existing one
    for i, n1 in enumerate(r1):
        table.write(rows, i, n1)
    workbook.save(sFile)
    print("Today's data was saved successfully!")

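try:
    # Headers that make the request look like it comes from a browser
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}
    url = "https://blog.csdn.net/yzx99"  # change to your own username
    r = get(url=url, headers=headers, timeout=3)
    if r.status_code != 200:
        # status_code is an int, so convert it before concatenating
        print("Scrape failed (1), status code is not 200: " + str(r.status_code))
    else:
        soup = BeautifulSoup(r.text, "html.parser")
        ##print(html)
        all_dl = soup.find_all("dl", attrs={"class": "text-center"})
        for dl1 in all_dl:
            dd1 = str(dl1.find_all("dd"))
            # "总排名" is the page's own label for the overall rank, so it stays in Chinese
            if dd1.find("总排名") >= 0:
                print(dl1["title"])
                save_to_excel([datetime.datetime.now().strftime("%Y-%m-%d_%H:%M"), dl1["title"]])
                break
except Exception as e:
    # format_exc() returns the traceback as a string; print_exc() returns None
    print("Scrape failed (2): " + str(e) + "\n" + traceback.format_exc())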

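'''
The page template this targets is as follows (each value, then its label on the page)
9
原创 (original posts)
44万+
周排名 (weekly rank)
59万+
总排名 (overall rank)
1229
访问 (visits)
等级 (level)
'''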
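To try the lookup without hitting the network, the same `find_all` loop can be run against a small inline HTML sample. The markup below is only a guess at the structure behind the template above (a `dl` with class `text-center`, the number in its `title` attribute, the label in a `dd`); the real page may differ, and the `title` values and the helper name `find_stat` are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reconstructed from the value/label template; not copied from the real page.
SAMPLE_HTML = """
<dl class="text-center" title="9"><dt>9</dt><dd>原创</dd></dl>
<dl class="text-center" title="440000"><dt>44万+</dt><dd>周排名</dd></dl>
<dl class="text-center" title="590000"><dt>59万+</dt><dd>总排名</dd></dl>
<dl class="text-center" title="1229"><dt>1229</dt><dd>访问</dd></dl>
"""


def find_stat(html, label):
    """Return the title attribute of the first matching <dl> whose <dd> contains label."""
    soup = BeautifulSoup(html, "html.parser")
    for dl in soup.find_all("dl", attrs={"class": "text-center"}):
        dd = dl.find("dd")
        if dd is not None and label in dd.get_text():
            return dl.get("title")
    return None  # label not present on the page


if __name__ == "__main__":
    print(find_stat(SAMPLE_HTML, "总排名"))
```

Returning `None` when the label is missing makes it easier to notice when the site changes its template, instead of silently saving nothing.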
