Real-time article collection (development environment: language Python, framework Scrapy; nothing but Python will do)
优采云 Published: 2022-02-24 12:25
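Background
A friend plans to expand his sales channels by taking orders on a crowdsourcing platform. His main product is WeChat mini programs, so he wants to receive clients' requirement postings as early as possible and contact the client right away to close the deal. Only a fast response can secure the deal; otherwise the order is taken by a competitor and the opportunity is gone.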
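Development environment
Development language: Python. Development framework: Scrapy. For data collection, Python is the natural choice.
Development tool: PyCharm.
Functional design
Real-time notification: new tasks are announced by email, and the mailbox is bound to WeChat so that each mail arrives as a real-time notification.
Filter module: filters on keywords in both the title and the content, discards orders that do not meet the requirements, and notifies immediately for those that do.
Configuration module: configured through a JSON file.
Key code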
# -*- coding: utf-8 -*-
import re
import time

import scrapy

from .. import common

# Task URLs look like //task.zbj.com/16849389/ ; capture the numeric ID.
TASK_ID_RE = re.compile(r"/(\d+)/$")


class ZbjtaskSpider(scrapy.Spider):
    name = 'zbjtask'
    allowed_domains = ['zbj.com']
    start_urls = ['https://task.zbj.com/?m=1111&so=1&ss=0&fee=1']

    def parse(self, response):
        # The listing shows 30 items per page.
        # "prevent-defalut-link" is the class name as spelled in the page markup.
        sent_id = common.read_taskid()  # highest task ID already mailed
        max_id = sent_id                # never move the stored ID backwards

        for node in response.xpath('//div[@class="demand-card"]'):
            href = node.xpath('.//a[@class="prevent-defalut-link"]/@href').get()
            match = TASK_ID_RE.search(href or '')
            if match is None:
                continue  # card without a recognizable task link
            task_id = int(match.group(1))
            max_id = max(max_id, task_id)

            if task_id <= sent_id:
                print("mail: task %d is already sent" % task_id)
                continue

            url = "https:" + href
            name = node.xpath('.//a[@class="prevent-defalut-link"]/text()').get(default='')
            desc = node.xpath('.//div[@class="demand-card-desc"]/text()').get(default='')
            price = node.xpath('.//div[@class="demand-price"]/text()').get(default='')
            tag = node.xpath('.//span[@class="demand-tags"]/i/text()').get(default='')
            # The publish time is available in span[@class="card-pub-time flt"] but is not used in the mail.

            subject = "ZBJ %d %s" % (task_id, name)
            content = '%s <p> %s <p> <a href="%s">%s</a> <p> %s' % (price, desc, url, url, tag)
            if common.send_mail(subject, content):
                print("ZBJ mail: send task %d success" % task_id)
            else:
                print("ZBJ mail: send task %d fail" % task_id)
            time.sleep(3)  # brief pause between mails, as in the original

        common.write_taskid(id=max_id)
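The spider calls `common.read_taskid()` and `common.write_taskid()`, but the article does not show them. The following is only a minimal sketch of what such helpers might look like, assuming the last notified task ID is kept in a small text file. The file name `last_taskid.txt`, the `path` parameter, and the rule that the stored ID never decreases are all assumptions made for illustration, not the author's implementation.

```python
from pathlib import Path

# Hypothetical state file; the article does not say where the ID is stored.
TASKID_FILE = Path("last_taskid.txt")


def read_taskid(path: Path = TASKID_FILE) -> int:
    """Return the highest task ID already notified, or 0 if none is recorded."""
    try:
        return int(path.read_text(encoding="utf-8").strip())
    except (FileNotFoundError, ValueError):
        # Missing or unreadable state file: treat every task as new.
        return 0


def write_taskid(id: int, path: Path = TASKID_FILE) -> None:
    """Store the highest task ID seen so far.

    The parameter is named `id` only to match the spider's call
    `common.write_taskid(id=max_id)`.
    """
    # Assumption: the stored ID only ever increases, so an empty page
    # cannot reset it and trigger duplicate mails on the next run.
    if id > read_taskid(path):
        path.write_text(str(id), encoding="utf-8")
```

Keeping a single "highest ID seen" value works here because task IDs on the listing increase over time, which is what the spider's `id > sended_id` comparison already relies on.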
# Helper in the `common` module that the spider imports.
import smtplib
from email.mime.text import MIMEText


def send_mail(subject, content):
    sender = 'xxxxx@qq.com'     # sender address
    passwd = 'xxxxxx'           # sender's mailbox authorization code
    receivers = 'xxxxx@qq.com'  # recipient address

    try:
        # Send as HTML so the task link is clickable; use 'plain' for text-only mail.
        msg = MIMEText(content, 'html', 'utf-8')
        msg['Subject'] = subject
        msg['From'] = sender
        msg['To'] = receivers

        # The context manager closes the SMTP connection when the block exits.
        with smtplib.SMTP_SSL('smtp.qq.com', 465) as s:
            s.set_debuglevel(1)
            s.login(sender, passwd)
            s.sendmail(sender, receivers, msg.as_string())
        return True
    except Exception as e:
        print(e)
        return False
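The functional design mentions a filter module driven by keywords and a JSON configuration file, but neither appears in the key code. As a rough illustration only, the two could be combined as below. The config keys `include` and `exclude`, the function names, and the rule "notify if any include keyword appears in the title or description and no exclude keyword does" are assumptions; the article does not specify the file layout or the exact matching rule.

```python
import json

# Hypothetical config content; in practice this would be read from a JSON file.
EXAMPLE_CONFIG = '{"include": ["小程序", "微信"], "exclude": ["代写"]}'


def load_keywords(config_text):
    """Parse the JSON config and return (include, exclude) keyword lists."""
    cfg = json.loads(config_text)
    return cfg.get("include", []), cfg.get("exclude", [])


def matches(title, desc, include, exclude):
    """Return True if the task should trigger a notification."""
    # Check title and description together; missing fields count as empty.
    text = f"{title or ''} {desc or ''}".lower()
    if any(word.lower() in text for word in exclude):
        return False
    return any(word.lower() in text for word in include)
```

In the spider, a check such as `if matches(name, desc, include, exclude):` placed before the call to `common.send_mail` would drop unwanted orders without sending mail for them.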
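Summary
Since going live, the program has run stably and achieved the intended result. It has been very effective for picking up orders!
Appendix: ZBJ (猪八戒) platform architecture diagram
Appendix: Scrapy mind map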