Setting up a scheduled task for daily crawling, and finally the bat file

优采云  Published: 2021-06-14 02:27

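　　This project crawls daily news updates with the Scrapy framework on Python 2.7 and stores the results in a MySQL database. The crawler's log output and error messages are written to log files. A bat batch file is then defined and added to the task scheduler so that the crawl runs automatically.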

　　Well, here are the steps.

　　1. In the items file, define the item class with the fields to be crawled

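　　2. In the settings file, adjust the defaults: set the log output format, enable the item pipeline, set the download delay, and add the database connection details, request headers and similar information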
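　　As a rough illustration, a settings fragment covering those points could look like the following. The Scrapy setting names are real ones, and the MYSQL_* keys are the ones the pipeline in step 4 reads. Every value, the log path and the pipeline module path are placeholders I have assumed.

```python
# settings.py (fragment); all values below are placeholders
LOG_LEVEL = "INFO"
LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
LOG_FILE = "log/scrapy.log"  # hypothetical path

ITEM_PIPELINES = {
    "xinwen.pipelines.XinwenPipeline": 300,  # module path is an assumption
}

DOWNLOAD_DELAY = 2  # seconds to wait between requests

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # placeholder
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9",
}

# Custom keys read by the pipeline through settings.MYSQL_*
MYSQL_HOST = "127.0.0.1"
MYSQL_PORT = 3306
MYSQL_USER = "root"
MYSQL_PASSWD = "change-me"
MYSQL_DBNAME = "xinwen"
```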

　　3. Write your own spider file

import scrapy


class TouchuangSpider(scrapy.Spider):
    name = 'touchuang'
    allowed_domains = ['xunjk.com']

    # Section id (key) -> section listing URL (value)
    url = {
        "1": "http://www.xunjk.com/xinwen/rongzi/",   # financing
        "2": "http://www.xunjk.com/shangye/",         # business
        "3": "http://www.xunjk.com/xinwen/yanjiu/",   # research
        "4": "http://www.xunjk.com/xinwen/keji/",     # technology
        "5": "http://www.xunjk.com/xinwen/jinrong/",  # finance
        "6": "http://www.xunjk.com/xinwen/dongcha/",  # insights
        "7": "http://www.xunjk.com/xinwen/yejie/"     # industry
    }

    start_urls = [url["1"], url["2"], url["3"], url["4"], url["5"], url["6"], url["7"]]
    # start_urls = [url["1"]]

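　　Because several news sections are crawled at the same time, the section id is used as the dictionary key and the section URL as the value.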
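　　The lookup this dictionary supports can be sketched on its own. Assuming the section link found on an article page is exactly one of the section URLs, inverting the dictionary once gives the section id directly. SECTION_URLS, URL_TO_ID and section_id_for are hypothetical names used only for this example, and only three of the seven sections are listed.

```python
SECTION_URLS = {
    "1": "http://www.xunjk.com/xinwen/rongzi/",
    "2": "http://www.xunjk.com/shangye/",
    "3": "http://www.xunjk.com/xinwen/yanjiu/",
}

# Invert once: section URL -> section id
URL_TO_ID = {url: section_id for section_id, url in SECTION_URLS.items()}


def section_id_for(page_url):
    """Return the section id for a section URL, or None if it is unknown."""
    return URL_TO_ID.get(page_url)


print(section_id_for("http://www.xunjk.com/shangye/"))  # prints 2
```

　　Returning None for an unknown URL makes the "no matching section" case explicit instead of leaving the field unset.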

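　　Each URL is requested, and the response is passed to the parse() callback for further processing.

　　Extraction mostly relies on XPath, which needs little explanation.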

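    # Method of TouchuangSpider; needs "import re" and "import time" at the top of the file.
    def request_page(self, response):
        date = time.strftime("%Y%m%d")
        item = XinwenItem()  # created before the try block so it always exists below
        try:
            item["title"] = response.xpath("//div[@class='main_c']/h1/text()").extract_first()  # news title
            item["zuozhe"] = response.xpath("//div[@class='infos']/span[@class='from']/a/text()").extract_first()  # news source
            page_url = response.xpath("//div[@class='breadnav']/a[3]/@href").extract_first()
            # To get the category id, match the current page's section URL against the dict values and take the key
            for k, v in self.url.items():
                if v == page_url:
                    item["fenlei_id"] = k  # k is the article's category id
            item["created_at"] = response.xpath("//div[@class='infos']/span[@class='time']/text()").extract_first()
        except Exception as e:
            # On error, append the message to the day's txt log
            with open(r"D:\pycharm_projects\xinwenyuan\xinwen\log\/" + date + ".txt", "a+") as f:
                f.write("e:" + str(e) + "\n")  # str() is required; str + exception raises TypeError

        # Check whether the article has an image: store it if so, otherwise store an empty string
        try:
            # As published the pattern is '(.*?)', which only ever matches an empty string;
            # swap in a pattern that targets the page's image tag.
            img = re.search('(.*?)', response.text).group(1)
            item["news_pic"] = img
            print(item["news_pic"])
        except AttributeError:  # re.search returned None
            item["news_pic"] = ""  # no image: store an empty string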

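　　Both the spider and the pipeline append errors to a text file named after the current date. A small helper along these lines could hold that logic in one place. This is only a sketch: append_error, log_dir and the message format are hypothetical and not taken from the original project.

```python
import os
import time


def append_error(log_dir, message, when=None):
    """Append one line to <log_dir>/<YYYYMMDD>.txt and return the file path."""
    date = time.strftime("%Y%m%d", when or time.localtime())
    if not os.path.isdir(log_dir):
        os.makedirs(log_dir)
    path = os.path.join(log_dir, date + ".txt")
    with open(path, "a") as f:
        # str() accepts exception objects as well as plain strings
        f.write("error: " + str(message) + "\n")
    return path
```

　　Inside an except block it would be called as append_error(log_dir, e), with log_dir pointing at the project's log folder.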

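　　For the article body, I first extracted the text with XPath. However, some articles contain images, and the images and text have to stay in matching order later on, so I switched to a regular expression that matches the p tags to get the text. Finally, the item is returned.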
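　　To illustrate the idea, here is a sketch that pulls paragraph text and image addresses out of an HTML fragment with regular expressions while keeping their order. The patterns, the sample markup and the extract_body name are assumptions made for the example; the site's real markup may need different patterns.

```python
import re

P_TAG = re.compile(r"<p\b[^>]*>(.*?)</p>", re.S)    # contents of each <p>...</p>
IMG_SRC = re.compile(r'<img[^>]*\bsrc="([^"]*)"')   # src attribute of an <img>
TAGS = re.compile(r"<[^>]+>")                       # any remaining tag


def extract_body(html):
    """Return ("text", str) and ("img", url) pairs in page order."""
    parts = []
    for inner in P_TAG.findall(html):
        img = IMG_SRC.search(inner)
        if img:
            parts.append(("img", img.group(1)))
        text = TAGS.sub("", inner).strip()
        if text:
            parts.append(("text", text))
    return parts


sample = '<p>First paragraph.</p><p><img src="/a.jpg"></p><p>Second.</p>'
print(extract_body(sample))
```

　　Keeping text and images in one ordered list is what lets them be matched up again when the article is rendered.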

　　4. Store the data in the MySQL database

　　The SQL logic goes in the pipelines file

import time
import pymysql
# "settings" is the project's settings module, which holds the MYSQL_* values


class XinwenPipeline(object):
    def __init__(self):
        # Connect to the MySQL database
        self.connect = pymysql.connect(host=settings.MYSQL_HOST,
                                       user=settings.MYSQL_USER,
                                       password=settings.MYSQL_PASSWD,
                                       db=settings.MYSQL_DBNAME,
                                       port=settings.MYSQL_PORT,
                                       charset="utf8")
        # Get a cursor
        self.cursor = self.connect.cursor()
        print("connect successful...")

    # Store the item in the database
    def process_item(self, item, spider):
        date = time.strftime("%Y%m%d")
        print("doing something...")
        try:
            # Run the INSERT; title is a unique column, which prevents duplicate rows
            sql = '''insert into articles(title,zuozhe,content,fenlei_id,created_at,news_pic,updated_at,dianji) VALUES (%s,%s,%s,%s,%s,%s,%s,%s)'''
            self.cursor.execute(
                sql, (item["title"], item["zuozhe"], item["content"], item["fenlei_id"], item["created_at"], item["news_pic"], item["updated_at"], item["dianji"])
            )
            self.connect.commit()  # save
        except Exception as error:
            # On failure, append the error to the day's log file
            with open(r"D:\pycharm_projects\xinwenyuan\xinwen\log\/" + date + ".txt", "a+") as f:
                f.write("%s error: %s\n" % (item.get("created_at"), error))
        return item

    # Close the database connection
    def close_spider(self, spider):
        print("working done...")
        self.cursor.close()
        self.connect.close()
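　　The unique-title idea can be tried without a MySQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in, so it is not the pipeline above: the table is reduced to two columns, save_article is a hypothetical helper, and sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT UNIQUE, content TEXT)")


def save_article(conn, title, content):
    """Insert one row; return False when the title already exists."""
    try:
        conn.execute(
            "INSERT INTO articles (title, content) VALUES (?, ?)",
            (title, content),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()  # leave the connection clean for the next item
        return False


print(save_article(conn, "Hello", "first copy"))   # True
print(save_article(conn, "Hello", "second copy"))  # False
```

　　Catching the integrity error separately keeps expected duplicates out of the error log, which can then be reserved for real failures.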

　　5. Set up a scheduled task in start.py so that the crawl runs at a fixed time every day
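　　The post does not show start.py itself. One way to write it, sketched here with the standard library only, is to sleep until the next run time and then launch the spider through the scrapy crawl command. The 02:00 run time and the helper names seconds_until and run_daily are assumptions for illustration.

```python
import datetime
import subprocess
import time


def seconds_until(hour, minute, now=None):
    """Seconds from `now` until the next occurrence of hour:minute."""
    now = now or datetime.datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += datetime.timedelta(days=1)  # already passed today, use tomorrow
    return (target - now).total_seconds()


def run_daily(hour=2, minute=0):
    """Block forever, starting the 'touchuang' spider once a day."""
    while True:
        time.sleep(seconds_until(hour, minute))
        # Run from the Scrapy project directory so that scrapy.cfg is found
        subprocess.call(["scrapy", "crawl", "touchuang"])
```

　　start.py would then end with a call to run_daily(). Computing the wait from the clock on every pass avoids the drift that a fixed 24-hour sleep would accumulate.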

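　　The last step is to set up the bat file

　　The scheduled-task setup in this part follows a guide found online.

　　Add the finished bat file to the scheduled tasks so that it runs on a timer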

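　　Note that this crawl uses Scrapy's default concurrency, with no additional multithreading and no proxy IPs, so the only protection is the download delay. The crawler could therefore have its IP blocked at any time, but so far nothing has gone wrong.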

　　There are still plenty of other problems in the code; feel free to leave a comment and discuss.
