神策杯 2018 College Algorithm Masters Competition: Solution Write-ups (my entry, 2nd place, 6th place)
The Sensors Data recommender system is an intelligent recommendation system built on the Sensors Analytics platform. Based on customer needs, business characteristics, and the user-behavior data collected by Sensors Analytics, it applies machine-learning algorithms to deliver personalized recommendations for news, videos, products, and more, supporting intelligent applications across scenarios such as improving the product experience and lifting core business metrics like click-through rate.
The system forms a complete learning loop: collected base data feeds machine-learning models that power the application, results are validated in real time, and those results in turn guide the addition of data sources and algorithm tuning, yielding an end-to-end, real-time, automated, rapidly iterable recommendation loop.
This competition simulates that business scenario: the task is to extract the core keywords of news texts, which ultimately improves recommendation quality and user profiling.
Competition link:
Dataset download link:
Password: qa2u
02 Task
Personalized recommendation is a key part of the Sensors intelligent system, and accurately understanding an article's topic is an important way to improve recommendation quality. Backed by a real business case, Sensors Data provides a thousand news articles with annotated keywords; participants must train a keyword-extraction model and use it to extract keywords from 100,000 news articles.
03 Data
Note: data download access is granted after registering or joining a team.
The downloadable dataset has two parts: 1. all_docs.txt: 108,295 news articles, one per line, each containing an ID, the article title, and the article body, with fields separated by \001. 2. train_docs_keywords.txt: keyword annotations for 1,000 articles, each line containing an ID and the keyword list, separated by \t.
Notes: each annotated article has at most 5 keywords, and every keyword appears in the article's title or body. Be aware that the set of keywords appearing in the training set and the set of keywords appearing in the test set may intersect, but neither necessarily contains the other.
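The two file formats above can be parsed with a few lines of plain Python. This is a minimal sketch: the \001 and \t separators follow the description above, but the sample lines are made up, and the description does not say how the keywords inside the list are delimited, so commas are assumed here.

```python
def parse_doc_line(line):
    """Parse one line of all_docs.txt: ID, title, body separated by \\x01."""
    doc_id, title, body = line.rstrip("\n").split("\x01")
    return {"id": doc_id, "title": title, "body": body}


def parse_keyword_line(line):
    """Parse one line of train_docs_keywords.txt: ID \\t keyword list.
    The comma delimiter inside the keyword list is an assumption."""
    doc_id, keywords = line.rstrip("\n").split("\t")
    return doc_id, keywords.split(",")


# Illustrative (made-up) sample lines:
doc = parse_doc_line("D001\x01Some title\x01Some body text")
doc_id, kws = parse_keyword_line("D001\tkw1,kw2,kw3")
```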
04 My 11th-Place Preliminary-Round Solution
This solution extracts keywords with unsupervised NLP methods. It was my first competition, entered just as I was getting started with NLP, so it left a deep impression on me, and I share it here.
神策杯 2018 College Algorithm Masters Competition, leaderboard B rank: 13/583
4.1 Scores
4.2 Data analysis
4.3 Improvement tricks
POS mis-tagging
This is the main source of error when extracting keywords with tf-idf.
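To see why mis-tagging matters: candidate keywords are typically filtered by a POS allow-list before tf-idf scoring, so a word with a wrong tag is silently dropped before it can ever be scored. A minimal sketch of that filtering step, with invented (word, tag) pairs for illustration:

```python
# Allow-list mirroring the POS tags used in the core code below.
ALLOW_POS = {'nr', 'nz', 'ns', 'nt', 'eng', 'n', 'l', 'i', 'a', 'nrt', 'v', 't'}


def filter_candidates(word_tags, allow_pos=ALLOW_POS):
    """Keep only (word, tag) pairs whose POS tag is in the allow-list."""
    return [w for w, t in word_tags if t in allow_pos]


# A person name correctly tagged 'nr' survives; the same word mis-tagged
# as a particle ('u') is dropped and can never be chosen as a keyword:
good = filter_candidates([("泰容君", "nr"), ("唱歌", "v")])
bad = filter_candidates([("泰容君", "u"), ("唱歌", "v")])
```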
4.5 Core code:
```python
# -*- coding: utf-8 -*-
# @Author : quincyqiang
# @File   : analysis_for_06.py
# @Time   : 2018/9/5 14:17
import pickle
import random

import pandas as pd
from tqdm import tqdm
import jieba
from jieba import posseg
from jieba.analyse import extract_tags, textrank  # tf-idf / TextRank

jieba.analyse.set_stop_words('data/stop_words.txt')  # drop stop words
jieba.load_userdict('data/custom_dict.txt')          # load custom dictionary

'''
POS tags used below:
nr person name   nz other proper noun   ns place name   nt organization
n noun   l idiomatic phrase   i idiom   a adjective   nrt transliterated name
v verb   t time word
'''

test_data = pd.read_csv('data/test_docs.csv')
train_data = pd.read_csv('data/new_train_docs.csv')
allow_pos = {'nr': 1, 'nz': 2, 'ns': 3, 'nt': 4, 'eng': 5, 'n': 6,
             'l': 7, 'i': 8, 'a': 9, 'nrt': 10, 'v': 11, 't': 12}
# allow_pos = {'nr': 1, 'nz': 2, 'ns': 3, 'nt': 4, 'eng': 5, 'nrt': 10}
tf_pos = ['ns', 'n', 'vn', 'nr', 'nt', 'eng', 'nrt', 'v', 'a']


def generate_name(word_tags):
    name_pos = ['ns', 'n', 'vn', 'nr', 'nt', 'eng', 'nrt']
    for word_tag in word_tags:
        if word_tag[0] == '·' or word_tag == '!':
            index = word_tags.index(word_tag)
            # … (the rest of generate_name was lost when the post was
            #     scraped from its original page) …


# … (the definition of extract_keyword_ensemble, its per-document loop
#     header, and the code that builds title_keywords were also lost in
#     extraction; the recoverable part of the loop body follows) …

        title_keywords = sorted(title_keywords, reverse=False,
                                key=lambda x: (allow_pos[x[1]], -len(x[0])))
        if '·' in title:
            if len(title_keywords) >= 2:
                key_1 = title_keywords[0][0]
                key_2 = title_keywords[1][0]
            else:
                # print(keywords, title, word_tags)
                key_1 = title_keywords[0][0]
                key_2 = ''
            labels_1.append(key_1)
            labels_2.append(key_2)
        else:
            # fall back to tf-idf
            use_idf += 1

            # --------- build the weighted "important text" ---------
            primary_words = []
            for keyword in title_keywords:
                if keyword[1] == 'n':
                    primary_words.append(keyword[0])
                if keyword[1] in ['nr', 'nz', 'nt', 'ns']:
                    # repeat proper nouns in proportion to their length
                    primary_words.extend([keyword[0]] * len(keyword[0]))

            abstract_text = "".join(doc.split(' ')[:15])
            for word, tag in jieba.posseg.cut(abstract_text):
                if tag == 'n':
                    primary_words.append(word)
                if tag in ['nr', 'nz', 'ns']:
                    primary_words.extend([word] * len(word))
            primary_text = "".join(primary_words)
            # concatenate into the final weighted text
            text = primary_text * 2 + title * 6 + " ".join(doc.split(' ')[:15] * 2) + doc
            # -------------------------------------------------------

            temp_keywords = [keyword for keyword in extract_tags(text, topK=2)]
            if len(temp_keywords) >= 2:
                labels_1.append(temp_keywords[0])
                labels_2.append(temp_keywords[1])
            else:
                labels_1.append(temp_keywords[0])
                labels_2.append(' ')
    data = {'id': ids,
            'label1': labels_1,
            'label2': labels_2}
    df_data = pd.DataFrame(data, columns=['id', 'label1', 'label2'])
    df_data.to_csv('result/06_jieba_ensemble.csv', index=False)
    print("documents that fell back to tf-idf:", use_idf)


if __name__ == '__main__':
    # evaluate()
    extract_keyword_ensemble(test_data)
```
The write-ups below come from solutions generously shared by top competitors in China.
05 神策杯 2018 College Algorithm Masters Competition: 2nd-Place Code
Code link:
Article link:
Team: 发SCI才能毕业
5.1 Directory layout
jieba: a modified copy of the jieba library.
字典: directory holding the jieba dictionaries. Note: the lexicons come from the Sogou and Baidu input-method word lists, celebrity entries collected by a crawler, and LSTM named-entity-recognition output.
all_docs.txt: the training corpus.
train_docs_keywords.txt: with some clearly wrong keyword labels corrected, e.g. D039180 梁静茹 -> 贾静雯 and D011909 泰荣君 -> 泰容君.
classes_doc2vec.npy: corpus clustering result from doc2vec (gensim defaults) plus K-means.
my_idf.txt: the idf table computed from the corpus.
lgb_sub_9524764012949717.npy: predictions from one LightGBM run, used for feature generation.
stopword.txt: stop words.
Get_Feature.ipynb: feature-generation notebook; produces the feature files for the training and test sets.
lgb_predict.py: prediction script that writes the final results; requires train_df_v7.csv and test_df_v7.csv.
train_df_v7.csv, test_df_v7.csv: output of Get_Feature.ipynb; the notebook documents the features in detail.
word2vec model download link: (extraction code: tw0m)
doc2vec model download link: (extraction code: 0ciw)
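my_idf.txt is a corpus-level idf table, and computing one is straightforward. Below is a minimal sketch using the common smoothed formula idf(w) = log(N / (1 + df(w))); the exact formula the second-place team used is not stated in the write-up, so treat this as illustrative only.

```python
import math
from collections import Counter


def compute_idf(docs):
    """docs: list of token lists, one per document.
    Returns {word: idf} using idf = log(N / (1 + df))."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word at most once per document
    return {w: math.log(n / (1 + c)) for w, c in df.items()}


# "b" appears in every document, so it gets the smallest idf.
idf = compute_idf([["a", "b"], ["b", "c"], ["b"]])
```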
5.2 How to run
Run Get_Feature.ipynb to produce train_df_v7.csv and test_df_v7.csv.
Run lgb_predict.py to produce the result file sub.csv.
```
numpy 1.14.0rc1
pandas 0.23.0
sklearn 0.19.0
lightgbm 2.0.5
scipy 1.0.0
```
5.3 Approach
Use jieba's tf-idf method to select the top-20 candidate keywords per document.
Extract features for each sample's candidate keywords and treat keyword extraction as an ordinary binary-classification problem. The features fall into the following groups:
Document features: text length, sentence count, cluster assignment, etc.
Candidate-keyword features: keyword length, inverse document frequency, etc.
Document-candidate interaction features: term frequency, term frequency in the head of the document, tf-idf, topic similarity, etc.
Candidate-candidate features: mainly similarity between candidate keywords.
Candidate vs. other-document interaction features: two of these are very strong. The first is the frequency with which the word is chosen as a candidate keyword across the whole dataset; the second, analogous to a click-through rate, counts how often the word's predicted positive-class probability exceeds 0.5 across all documents. (I expected this feature to overfit, but it worked surprisingly well, so I added no smoothing; perhaps because only the top-2 keywords are selected per document, the 0.5 threshold itself acts as a mild smoother. See lines 31-42 of lgb_predict.py for the exact operation.)
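The first of those two corpus-level features, how often a word is selected as a candidate across the whole dataset, can be sketched with a Counter. This is illustrative, not the team's actual code:

```python
from collections import Counter


def candidate_frequency(candidates_per_doc):
    """candidates_per_doc: list of candidate-keyword lists, one per document.
    Returns {word: fraction of documents in which it appears as a candidate}."""
    n = len(candidates_per_doc)
    counts = Counter()
    for cands in candidates_per_doc:
        counts.update(set(cands))  # count each word at most once per document
    return {w: c / n for w, c in counts.items()}


# "news" is a candidate in all 3 documents, "apple" in only 1.
freq = candidate_frequency([["apple", "news"], ["news", "sports"], ["news"]])
```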
Train LightGBM on this binary-classification task, then for each document output the two candidates with the highest predicted probability as its keywords.
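That final top-2 selection step can be sketched in pure Python; the (doc_id, candidate, probability) row layout here is invented for illustration and is not the team's actual data format:

```python
def top2_keywords(rows):
    """rows: iterable of (doc_id, candidate, prob) triples.
    Returns {doc_id: up to two candidates, highest predicted prob first}."""
    by_doc = {}
    for doc_id, cand, prob in rows:
        by_doc.setdefault(doc_id, []).append((prob, cand))
    return {d: [c for _, c in sorted(v, reverse=True)[:2]]
            for d, v in by_doc.items()}


preds = [("D1", "apple", 0.9), ("D1", "pear", 0.2),
         ("D1", "news", 0.7), ("D2", "sports", 0.8)]
result = top2_keywords(preds)
```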
06 6th-Place Solution (Rank 6/622)
Code link:
07 Summary
This task is essentially phrase mining, or keyword extraction. While I was learning NLP, many people around me were studying how to mine keywords from text. With the progress NLP has made in recent years, the approaches can be grouped roughly as follows, and the three solutions shared above fall into these same groups:
Unsupervised methods: LDA, TF-IDF, TextRank.
Feature engineering: generate candidates with an unsupervised method, then build features and train a binary classifier.
Deep-learning extraction: span prediction, or BIO/BMES sequence labeling with a CRF layer, etc.
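The sequence-labeling framing in the last group can be illustrated by producing BIO labels for keyword spans in a tokenized text. This is a toy labeling function for building training data, not a trained model:

```python
def bio_tags(tokens, keywords):
    """Tag each token B-KW / I-KW / O depending on whether it begins or
    continues a (possibly multi-token) keyword. Greedy left-to-right,
    longest keyword first."""
    tags = ["O"] * len(tokens)
    kw_seqs = sorted((kw.split() for kw in keywords), key=len, reverse=True)
    i = 0
    while i < len(tokens):
        for seq in kw_seqs:
            if tokens[i:i + len(seq)] == seq:
                tags[i] = "B-KW"
                for j in range(1, len(seq)):
                    tags[i + j] = "I-KW"
                i += len(seq) - 1  # skip past the matched span
                break
        i += 1
    return tags


tags = bio_tags("the new phone ships today".split(), ["new phone"])
```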
08 More Resources
On Phrase Mining in the Healthcare Domain