How Search Engines Crawl Web Pages (2021-08-13: How do you implement a site search engine in Python?)

Published by 优采云: 2022-04-14 01:07

　　2021-08-13

　　How do you implement a site search engine in Python?

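　　First, consider how a search engine works:

　　1. Page collection. Crawl a website depth-first or breadth-first, save every page, and keep the collection current with scheduled and incremental re-crawls.

　　2. Index building. First, filter out duplicate pages, which can appear under different URLs; then extract the main text of each page; finally, segment the text into words and build the index. The index must always be kept in order, and the PageRank algorithm is used to give each page a weight.

　　3. Search service. First, segment the query into words; then rank the index results, combining the stored weights with the user's query history to produce the final order; finally, show a summary of each document.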
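
　　A minimal sketch of steps 2 and 3, written for Python 3. The sample documents, the whitespace tokenizer, and the names `build_index` and `search` are illustrative assumptions; a real index for Chinese text would use a word segmenter and stored page weights rather than simple match counts.

```python
from collections import defaultdict

def tokenize(text):
    # Stand-in for word segmentation: lowercase and split on whitespace.
    return text.lower().split()

def build_index(docs):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    # Rank documents by how many distinct query terms they contain.
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {
    1: "python search engine",
    2: "python web crawler",
    3: "cooking recipes",
}
index = build_index(docs)
print(search(index, "python search"))
```

　　Document 1 matches both query terms and document 2 only one, so document 1 is listed first.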

　　The complete process is as follows:

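　　------------------------------------------------ The following text is quoted from 万维网网络自动搜索引擎（技术报告） (a technical report on automatic search engines for the World Wide Web), 邓雄 (Johnny Deng), 2006.12

　　The "web spider" fetches pages from the Internet and sends them to the "page database". URLs are extracted from the pages ("URL extraction") and sent to the "URL database". The "spider controller" takes URLs from there and directs the "web spider" to fetch further pages, and the cycle repeats until all pages have been crawled.

　　The system takes text from the "page database" and sends it to the "text indexing" module, which builds the "index database". At the same time, "link information extraction" sends link information (anchor text, the link itself, and so on) to the "link database", which provides the basis for "page scoring".

　　The "user" submits a query to the "query server". The server looks up relevant pages in the "index database", while "page scoring" combines the query with the link information to assess the relevance of the search results. The "query server" sorts the results by relevance, extracts a content summary around the keywords, and assembles the final page that is returned to the "user".

　　---------------------- End of quote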
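
　　The "page scoring" step relies on link information, and the workflow above names PageRank for it. A minimal power-iteration sketch in Python 3 follows. The tiny link graph and the damping factor of 0.85 are assumptions for illustration; neither comes from the quoted report.

```python
def pagerank(links, damping=0.85, iterations=50):
    # links maps each page to the list of pages it links to.
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: every link target is also a key.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

　　Page "c" receives links from three pages and ends up with the highest score, while "d", which nothing links to, ends up with the lowest.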

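　　The idea of writing a search engine came from the fact that I am learning Python and wanted a project to push myself along.

　　The current plan has three modules: a web crawler (breadth-first search), page text extraction (cx-extractor), and Chinese word segmentation (smallseg).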
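
　　How smallseg works internally is not shown in this article. As an illustration of what a dictionary-based Chinese segmenter does, here is a minimal forward-maximum-matching sketch in Python 3. The tiny dictionary and the `segment` function are hypothetical and are not smallseg's API.

```python
def segment(text, dictionary, max_len=4):
    # Forward maximum matching: at each position take the longest dictionary word.
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing matches.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

dictionary = {"搜索", "引擎", "搜索引擎", "网页", "抓取"}
print(segment("搜索引擎抓取网页", dictionary))
```

　　Because the longest match wins, "搜索引擎" is kept as one word instead of being split into "搜索" and "引擎".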

　　Web crawler

　　Breadth-first search, crawling 10,000 pages from Sina (pages whose URL contains '/')

　　Fetching: urllib2.urlopen()

　　Parsing: htmllib.HTMLParser

　　Storage: redis
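
　　A minimal breadth-first crawl sketch in Python 3. To keep it runnable offline, a hypothetical in-memory `PAGES` dict stands in for the network fetch (where the article uses urllib2.urlopen()), and the standard-library `html.parser.HTMLParser` stands in for htmllib, which Python 3 no longer ships.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical site: URL -> HTML source.
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a> <a href="/c.html">C</a>',
    "http://example.com/b.html": '<a href="/">home</a>',
    "http://example.com/c.html": "no links here",
}

class LinkParser(HTMLParser):
    # Collect the href of every <a> tag.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages):
    # Visit pages in breadth-first order, at most max_pages of them.
    queue = deque([start_url])
    seen = {start_url}
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        html = PAGES.get(url)
        if html is None:
            continue  # treat an unknown URL as a failed fetch
        order.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl("http://example.com/", 10))
```

　　The `seen` set ensures each URL is queued only once, and the FIFO queue is what makes the traversal breadth-first: every page linked from the start page is visited before any page two links away.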

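　　Each URL is assigned an IDSEQ sequence number (incrementing from 1000000)

　　URL:IDSEQ stores the URL

　　PAGE:IDSEQ stores the HTML source of the page at that URL

　　URLSET:IDSEQ stores the set of URLs (as IDSEQ values) that point to this URL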
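
　　The key layout can be sketched in Python 3 with a plain dict standing in for redis (an assumption, so that the example runs without a server). The `NEXT_ID` counter, the `INDEX:` reverse-lookup key, and the function names are hypothetical; the article only specifies the URL:, PAGE: and URLSET: prefixes.

```python
store = {}          # stand-in for the redis keyspace
NEXT_ID = 1000000   # IDSEQ values increase from here

def url_id(url):
    # Return the IDSEQ for a URL, assigning a new one on first sight.
    global NEXT_ID
    key = "INDEX:" + url   # hypothetical reverse lookup: URL -> IDSEQ
    if key not in store:
        store[key] = NEXT_ID
        store["URL:%d" % NEXT_ID] = url
        NEXT_ID += 1
    return store[key]

def save_page(url, html, outlinks):
    # Store the page source and record this page as a referrer of each link.
    idseq = url_id(url)
    store["PAGE:%d" % idseq] = html
    for target in outlinks:
        store.setdefault("URLSET:%d" % url_id(target), set()).add(idseq)
    return idseq

save_page("http://www.sina.com.cn", "<html>home</html>",
          ["http://www.sina.com.cn/news"])
for key in sorted(store):
    print(key, store[key])
```

　　Storing inbound links per page (URLSET) rather than outbound links is what later makes link-based scoring straightforward: the set under each key is exactly the list of pages that vote for it.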

　　The code is as follows:

#!/usr/bin/python
# Python 2. spdUtility is the author's own helper module; it is not shown in this article.
from spdUtility import PriorityQueue, Parser
import urllib2
import sys
import os
import inspect
import time

g_url = 'http://www.sina.com.cn'
g_key = 'www'

"""
def line():
    try:
        raise Exception
    except:
        return sys.exc_info()[2].tb_frame.f_back.f_lineno"""

def updatePriQueue(priQueue, url):
    # Each sighting of a URL raises its priority by 1,
    # plus 2 for .html pages and 5 for URLs containing g_key.
    extraPrior = 2 if url.endswith('.html') else 0
    extraMyBlog = 5 if g_key in url else 0
    item = priQueue.getitem(url)
    if item:
        newitem = (item[0] + 1 + extraPrior + extraMyBlog, item[1])
        priQueue.remove(item)
        priQueue.push(newitem)
    else:
        priQueue.push((1 + extraPrior + extraMyBlog, url))

def getmainurl(url):
    # Return the scheme and host part, e.g. 'http://www.sina.com.cn'.
    ix = url.find('/', len('http://'))
    if ix > 0:
        return url[:ix]
    else:
        return url

def analyseHtml(url, html, priQueue, downlist):
    # Parse the page and queue every link that has not been downloaded yet.
    p = Parser()
    try:
        p.feed(html)
        p.close()
    except Exception:
        return
    mainurl = getmainurl(url)
    print mainurl
    for (k, v) in p.anchors.items():
        for u in v:
            # Links that do not start with 'http://' are treated as site-relative.
            if not u.startswith('http://'):
                u = mainurl + u
            if not downlist.count(u):
                updatePriQueue(priQueue, u)

def downloadUrl(id, url, priQueue, downlist, downFolder):
    # Download one page to <downFolder>/<id>.html, then analyse its links.
    downFileName = downFolder + '/%d.html' % (id,)
    print 'downloading', url, 'as', downFileName, time.ctime(),
    try:
        fp = urllib2.urlopen(url)
    except Exception:
        print '[ failed ]'
        return False
    else:
        print '[ success ]'
        downlist.push(url)
        op = open(downFileName, "wb")
        html = fp.read()
        op.write(html)
        op.close()
        fp.close()
        analyseHtml(url, html, priQueue, downlist)
        return True

def spider(beginurl, pages, downFolder):
    priQueue = PriorityQueue()
    downlist = PriorityQueue()
    priQueue.push((1, beginurl))
    i = 0

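    while not priQueue.empty() and i < pages:
        # [... the rest of spider() and the start of the text-extraction demo (imports, PreProcess) are missing from the source ...]

# The head of GetContent below is taken from the second listing in this article.
def GetContent(threshold):
    global g_HTMLBlock
    nMaxSize = len(g_HTMLBlock)
    nBegin = 0
    nEnd = 0
    for i in range(0, nMaxSize):
        if g_HTMLBlock[i] > threshold and i + 3 < nMaxSize and g_HTMLBlock[i+1] > 0 and g_HTMLBlock[i+2] > 0 and g_HTMLBlock[i+3] > 0:
            nBegin = i
            break
    else:
        return None
    for i in range(nBegin + 1, nMaxSize):
        if g_HTMLBlock[i] == 0 and i + 1 < nMaxSize and g_HTMLBlock[i+1] == 0:
            # The next five lines were lost in the source; reconstructed by analogy with the first loop (assumption).
            nEnd = i
            break
    else:
        return None
    return '\n'.join(g_HTMLLine[nBegin:nEnd+1])

if __name__ == '__main__' and len(sys.argv) > 1:
    f = open(sys.argv[1], 'r')
    global g_HTML
    global g_HTMLLine
    global g_HTMLBlock
    g_HTML = f.read()
    PreProcess()
    g_HTMLLine = [i.strip() for i in g_HTML.splitlines()]  # split into lines, then strip leading and trailing whitespace from each
    HTMLLength = [len(i) for i in g_HTMLLine]  # length of each line
    g_HTMLBlock = [HTMLLength[i] + HTMLLength[i+1] + HTMLLength[i+2] for i in range(0, len(g_HTMLLine)-3)]  # length of each block of three consecutive lines
    print GetContent(200)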
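
　　Because parts of the listing above did not survive, here is a self-contained Python 3 sketch of the same line-block idea: strip the markup, measure the text length of each line, sum every three consecutive lines into a block, start the body at the first block that exceeds the threshold and is followed by three non-empty blocks, and stop at the first two consecutive empty blocks. The regular expressions, the sample page, and the threshold of 20 (chosen for the short sample) are assumptions for illustration.

```python
import re

def strip_markup(html):
    # Remove comments, scripts, styles, remaining tags and character entities.
    html = re.sub(r'<!--.*?-->', '', html, flags=re.S)
    html = re.sub(r'<script.*?>.*?</script>', '', html, flags=re.I | re.S)
    html = re.sub(r'<style.*?>.*?</style>', '', html, flags=re.I | re.S)
    html = re.sub(r'<[^>]*>', '', html)
    return re.sub(r'&#?\w{1,6};', '', html)

def get_content(html, threshold):
    lines = [line.strip() for line in strip_markup(html).splitlines()]
    lengths = [len(line) for line in lines]
    # Block i is the total text length of lines i, i+1 and i+2.
    blocks = [sum(lengths[i:i + 3]) for i in range(len(lines) - 2)]
    begin = None
    for i in range(len(blocks) - 3):
        if blocks[i] > threshold and all(b > 0 for b in blocks[i + 1:i + 4]):
            begin = i
            break
    if begin is None:
        return None
    # If no empty stretch follows, the body runs to the last line.
    end = len(lines) - 1
    for i in range(begin + 1, len(blocks) - 1):
        if blocks[i] == 0 and blocks[i + 1] == 0:
            end = i
            break
    return '\n'.join(line for line in lines[begin:end + 1] if line)

html = '\n'.join([
    '<html><head><title>Demo</title>',
    '<style>body { color: black; }</style>',
    '</head><body>',
    '<div><a href="/">Home</a></div>',
    '<div>',
    '',
    '<p>The first paragraph of the article body is a long line of text.</p>',
    '<p>The second paragraph continues the story with more words.</p>',
    '<p>The third paragraph adds further detail for the reader.</p>',
    '<p>The fourth paragraph wraps things up.</p>',
    '</div>',
    '', '', '', '',
    '<div>Footer</div>',
    '</body></html>',
])
print(get_content(html, 20))
```

　　The short navigation and footer lines never form a block above the threshold, so only the four paragraphs are returned.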

50

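　　The above is a demonstration program; real use requires adding storage.

　　Redis is again used for storage: read all pages (keys 'PAGE:*'), extract the text, and check whether that text is already in a container (this excludes duplicate pages under different URLs). If it is, move on to the next page; if it is not, add it to the container and store it under CONTENT:IDSEQ.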
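
　　A minimal Python 3 sketch of the duplicate check, with dicts standing in for redis and ready-made strings standing in for the extracted text. Comparing SHA-256 digests of the text rather than the text itself is an assumption made here to keep the container small; the article does not say what the container holds.

```python
import hashlib

# Stand-in for the redis keys: PAGE:IDSEQ -> extracted text (extraction omitted).
pages = {
    "PAGE:1000000": "Breaking news: Python 3 released",
    "PAGE:1000001": "Sports results for today",
    "PAGE:1000002": "Breaking news: Python 3 released",  # same text, different URL
}

def store_unique(pages):
    seen = set()      # digests of texts already stored
    content = {}      # CONTENT:IDSEQ -> text
    for key in sorted(pages):
        text = pages[key]
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate page: skip to the next one
        seen.add(digest)
        idseq = key.split(":", 1)[1]
        content["CONTENT:" + idseq] = text
    return content

content = store_unique(pages)
print(sorted(content))
```

　　Only the first page with a given text is kept, so the third page, which repeats the first one under another URL, produces no CONTENT entry.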

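　　The code is as follows:

#!/usr/bin/python
#coding=utf-8
# Based on Chen Xin's 《基于行块分布函数的通用网页正文抽取算法》 (a general web page text extraction algorithm based on line-block distribution functions)
import re
import sys
import redis
import bisect

def PreProcess():
    # Strip the doctype, comments, scripts, styles, remaining tags and character entities from g_HTML.
    global g_HTML
    # The tag patterns were stripped from the source; they are reconstructed here from the variable names (assumption).
    _doctype = re.compile(r'<!DOCTYPE.*?>', re.I|re.S)
    _comment = re.compile(r'<!--.*?-->', re.S)
    _javascript = re.compile(r'<script.*?>.*?</script>', re.I|re.S)
    _css = re.compile(r'<style.*?>.*?</style>', re.I|re.S)
    _other_tag = re.compile(r'<.*?>', re.S)
    _special_char = re.compile(r'&.{1,5};|&#.{1,5};')
    g_HTML = _doctype.sub('', g_HTML)
    g_HTML = _comment.sub('', g_HTML)
    g_HTML = _javascript.sub('', g_HTML)
    g_HTML = _css.sub('', g_HTML)
    g_HTML = _other_tag.sub('', g_HTML)
    g_HTML = _special_char.sub('', g_HTML)

def GetContent(threshold):
    global g_HTMLBlock
    nMaxSize = len(g_HTMLBlock)
    nBegin = 0
    nEnd = 0
    # The body starts at the first block above the threshold that is followed by three non-empty blocks.
    for i in range(0, nMaxSize):
        if g_HTMLBlock[i] > threshold and i + 3 < nMaxSize and g_HTMLBlock[i+1] > 0 and g_HTMLBlock[i+2] > 0 and g_HTMLBlock[i+3] > 0:
            nBegin = i
            break
    else:
        return None
    for i in range(nBegin + 1, nMaxSize):
        if g_HTMLBlock[i] == 0 and i + 1 < nMaxSize and g_HTMLBlock[i+1] == 0:
            # [... the rest of the listing is missing from the source ...]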
