Recommended article: Checking the originality of SEO articles

优采云 · Published: 2022-09-30 10:12


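Process:

1) Take an article and split it into phrases at the commas.

2) Count the number of characters in each phrase.

3) Take the first two phrases that are longer than 10 characters, search for each of them on Baidu, and count how many search results contain the phrase verbatim.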
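Steps 1 to 3 up to the point of searching can be sketched as a small standalone function. This is a minimal illustration, not the author's script: the name `pick_phrases`, its parameters, and the sample sentence are assumptions made for the example, and splitting on the ASCII comma as well as the full-width comma is an added convenience.

```python
import re


def pick_phrases(text, min_len=10, limit=2):
    """Return the first `limit` comma-separated phrases longer than `min_len` characters."""
    # Split on the full-width comma (，) and, as an assumption, the ASCII comma too.
    phrases = [part.strip() for part in re.split(r"[，,]", text)]
    # Keep only phrases long enough to be distinctive, then take the first few.
    return [phrase for phrase in phrases if len(phrase) > min_len][:limit]


# Hypothetical sample text, used only to show the behaviour.
sample = (
    "Short one, this phrase is clearly longer than ten characters, "
    "so is this second long phrase, and a third long phrase here"
)
print(pick_phrases(sample))
```

Short phrases are skipped because they are likely to match unrelated pages, which would inflate the count.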

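If an article has been widely reposted by other sites, extracting a single phrase from it and searching for that phrase on Baidu is enough to turn up the exact duplicate content.

If searching for two phrases in a row returns very few exact matches on Baidu, the content has probably not been reposted much by other sites, and its originality is relatively high.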

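Write a script that performs the three steps above:

In the output, the left column is the article ID and the right column is the number of Baidu search results in which the two phrases appear verbatim. The larger the count, the higher the degree of duplication; you can define the cutoff yourself. For example, I usually treat a ratio of >= 30% as highly duplicated: searching 2 phrases returns 20 results, and at least 6 of them contain the phrase verbatim.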

    # coding: utf-8
    # Ported to Python 3; requires the requests and mysqlclient (MySQLdb) packages.
    import json
    import re
    import time
    from urllib.parse import quote

    import MySQLdb as mdb
    import requests


    def search(req, html):
        """Return the first capture group of req in html, or 'no' if nothing matches."""
        text = re.search(req, html)
        return text.group(1) if text else 'no'


    def date(timeStamp):
        """Format a Unix timestamp as 'YYYY-MM-DD HH:MM:SS' in local time."""
        return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timeStamp))


    def getHTml(url):
        """Fetch url with browser-like headers and return the raw response body."""
        host = search('^([^/]*?)/', re.sub(r'(https|http)://', '', url))
        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate, sdch",
            "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            # "Cookie": "",
            "Host": host,
            "Pragma": "no-cache",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        }

        # Proxy server
        proxyHost = "proxy.abuyun.com"
        proxyPort = "9010"

        # Proxy tunnel credentials
        proxyUser = "XXXX"
        proxyPass = "XXXX"

        proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
            "host": proxyHost,
            "port": proxyPort,
            "user": proxyUser,
            "pass": proxyPass,
        }
        proxies = {
            "http": proxyMeta,
            "https": proxyMeta,
        }

        html = requests.get(url, headers=headers, timeout=30)
        # html = requests.get(url, headers=headers, timeout=30, proxies=proxies)
        return html.content


    def getContent(word):
        """Return how many Baidu results contain word verbatim in their abstract."""
        # URL-encode the phrase so non-ASCII characters survive in the query string.
        pcurl = 'http://www.baidu.com/s?q=&tn=json&ct=2097152&si=&ie=utf-8&cl=3&wd=%s&rn=10' % quote(word)
        # print('start crawl %s' % pcurl)
        html = getHTml(pcurl)
        a = 0
        html_dict = json.loads(html)
        for tag in html_dict['feed']['entry']:
            if 'title' in tag:
                # Other fields on each entry: tag['title'], tag['url'], tag['pn'] (rank), date(tag['time'])
                abstract = tag['abs']
                if word in abstract:
                    a += 1
        return a


    con = mdb.connect(host='127.0.0.1', user='root', passwd='', db='wddis',
                      charset='utf8', unix_socket='/tmp/mysql.sock')
    try:
        cur = con.cursor()
        cur.execute("select aid,content from pre_portal_article_content limit 10")
        for aid, content in cur.fetchall():
            # Strip HTML tags before splitting the text into phrases.
            content_format = re.sub(r'<[^>]*?>', '', content)
            a = 0
            # First two phrases longer than 10 characters, split at full-width commas.
            for z in [x for x in content_format.split('，') if len(x) > 10][:2]:
                a += getContent(z)
            print("%s --> %s" % (aid, a))
    finally:
        con.close()

    # words = open(wordfile).readlines()
    # pool = multiprocessing.Pool(processes=10)
    # for word in words:
    #     word = word.strip()
    #     pool.apply_async(getContent, (word, client))
    # pool.close()
    # pool.join()
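The 30% rule of thumb described above (at least 6 verbatim hits among 20 results) can be checked without any network access. The following is a minimal sketch under stated assumptions: the function names, the `threshold_percent` parameter, and the sample abstracts are hypothetical and are not part of the original script.

```python
def count_exact_matches(phrase, abstracts):
    """Count how many result abstracts contain the phrase verbatim."""
    return sum(1 for abstract in abstracts if phrase in abstract)


def is_highly_duplicated(match_count, results_checked=20, threshold_percent=30):
    """Return True when the share of verbatim hits reaches the threshold.

    Integer arithmetic is used so that exactly 30% (6 of 20) counts as a hit
    without any floating-point rounding concerns.
    """
    if results_checked <= 0:
        return False
    return match_count * 100 >= threshold_percent * results_checked


# Hypothetical abstracts standing in for the 'abs' field of search results.
abstracts = [
    "an exact phrase to look for appears in this abstract",
    "this abstract is about something else entirely",
    "here is an exact phrase to look for once more",
]
hits = count_exact_matches("an exact phrase to look for", abstracts)
print(hits, is_highly_duplicated(hits, results_checked=len(abstracts)))
```

The defaults mirror the numbers in the text: two phrases at 10 results each give 20 results, and 6 hits is the cutoff.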

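Latest post: How do you get a website onto the first page of search results?

Most people are probably familiar with SEO promotion by now, but as search engine algorithms keep being updated, SEO promotion has become harder and harder. Ranking well in search engines is not easy.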

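1. Look up search volume

Start by choosing some core industry keywords and use a keyword tool to look up their exact search volume, or collect keywords and their search volumes from the related searches that Baidu shows. Put them into a keyword list, then sort the list by difficulty.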

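2. Analyze keyword competition

From the keyword list, pick the keywords with high search volume and search for each of them on Baidu. Analyze the sites that rank on the first page, and review how well those sites are optimized along with their related data.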

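3. Determine the keywords

This step is relatively simple, since the analysis has already been done; it consists of organizing the results into a document. Still, it must be done carefully, with a clear split between core and secondary keywords. Core keywords usually make up only 20% of the list. They are usually optimized on the site's home page, while the other keywords are optimized wherever fits the actual situation, such as channel pages or core product and service pages.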
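As a rough illustration of the core/secondary split, the sketch below ranks keywords by search volume and marks the top 20% as core. Choosing core keywords by volume alone is an assumption made for this example (the text also calls for competition analysis), and the function name, the `core_percent` parameter, and the sample figures are all hypothetical.

```python
def split_core_keywords(volumes, core_percent=20):
    """Split keywords into (core, secondary), ranking by search volume, highest first."""
    ranked = sorted(volumes, key=volumes.get, reverse=True)
    if not ranked:
        return [], []
    # Keep at least one core keyword even for very short lists.
    core_count = max(1, len(ranked) * core_percent // 100)
    return ranked[:core_count], ranked[core_count:]


# Hypothetical keywords and monthly search volumes, for illustration only.
volumes = {"kw_a": 500, "kw_b": 100, "kw_c": 900, "kw_d": 300, "kw_e": 50}
core, secondary = split_core_keywords(volumes)
print("core:", core)
print("secondary:", secondary)
```

In practice the core list would then be assigned to the home page and the secondary list to channel or product pages, as the paragraph above describes.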
