Matching an image URL in a scraped web page with a Python regular expression

优采云 Published: 2021-10-04 13:04


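  I recently wrote a Python script (/Tech/jioben/Python/310282.HTML) that fetches the background image of the Bing search home page and sets it as my desktop wallpaper. While extracting the image URL with a regular expression, the match kept failing.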

  The image address to capture sits in the page source, inside the fragment that begins with g_img={url: "http

  

  At first I used this pattern:

  

reg = re.compile('.*g_img={url: "(http.*?jpg)"')

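  No matter what I tried, it would not match. I then saved the page source, opened it in Notepad++, and searched for the same pattern using Notepad++'s regular expression search. There it matched without any trouble.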
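  One likely reason for the difference, assuming Notepad++ looks for the pattern anywhere in the text: re.match only tries the pattern at the very start of the string, whereas an unanchored search behaves like re.search. The sketch below illustrates this on a made-up two-line page; sample_page and the example.com URL are hypothetical stand-ins, not the real Bing source.

```python
import re

# Hypothetical two-line stand-in for the saved page source.
sample_page = '<html>\n;g_img={url: "https://example.com/bg.jpg",id:\'bgDiv\'\n'

pattern = re.compile('.*g_img={url: "(http.*?jpg)"')

# re.match only tries position 0; '.*' cannot cross the newline after <html>.
match_result = pattern.match(sample_page)

# re.search tries every start position, so it succeeds on the second line.
search_result = pattern.search(sample_page)

print(match_result)
print(search_result.group(1))
```

  If this assumption holds, switching from re.match to re.search would be another way to get a match, without changing the pattern itself.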

  

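  Next I wrote a test script that stores the line containing the image address in a string, and that string matched immediately. As the following code shows, data does not match but line does: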

  

# -*- coding: utf-8 -*-
import re

# The single line of the page source that holds the image address.
line = r'''Bnp.Internal.Close(0,0,60056); } });;g_img={url: "https://az12410.vo.msecnd.net/homepage/app/2016hw/BingHalloween_BkgImg.jpg",id:'bgDiv',d:'200',cN'''

# The whole saved page; undecodable bytes are skipped.
with open('bing.html', 'r', encoding='utf-8', errors='ignore') as f:
    data = f.read()

print()

reg = re.compile('.*g_img={url: "(http.*?jpg)"')

# Try the pattern against the whole page.
if reg.match(data):
    m1 = reg.findall(data)
    print(m1[0])
else:
    print("data Not match .")

print(20 * '-')

# Try the same pattern against the single line.
if reg.match(line):
    m2 = reg.findall(line)
    print(m2[0])
else:
    print("line Not match .")

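  So there must be a difference between line and data. What is it? data spans multiple lines and contains newline characters, while line is a single line with none. When I added a newline character to the string line, it stopped matching too. By default '.' matches any character except a newline, and re.match starts at the beginning of the string, so the leading .* can never get past the first line.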
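  The experiment can be reproduced in a few lines. This is a minimal sketch: the string is a shortened copy of the one in the test script, the name line_with_newline is my own, and it assumes the newline is placed before the g_img fragment.

```python
import re

reg = re.compile('.*g_img={url: "(http.*?jpg)"')

# Shortened copy of the single-line string from the test script.
line = ';g_img={url: "https://az12410.vo.msecnd.net/homepage/app/2016hw/BingHalloween_BkgImg.jpg",id:\'bgDiv\''

# The same text with a newline in front of it, imitating a multi-line page.
line_with_newline = '\n' + line

matches_single_line = reg.match(line) is not None
matches_with_newline = reg.match(line_with_newline) is not None

print(matches_single_line)
print(matches_with_newline)
```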

  At this point the cause is clear. It lies in this statement:

  

re.compile('.*g_img={url: "(http.*?jpg)"')

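  After reading the Python documentation, I found that re.compile() takes an optional second parameter, flags. Its value is built from constants defined in the re module, which can be combined with |. The constants are: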

  

re.DEBUG Display debug information about compiled expression.

re.I

re.IGNORECASE Perform case-insensitive matching; expressions like [A-Z] will match lowercase letters, too. This is not affected by the current locale.

  

re.L

re.LOCALE Make \w, \W, \b, \B, \s and \S dependent on the current locale.

  

re.M

re.MULTILINE When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.

  

re.S

re.DOTALL Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

  

re.U

re.UNICODE Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database. New in version 2.0.

  

re.X

re.VERBOSE This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.
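  A short illustration of four of these flags may help. The sample text and variable names below are invented for this sketch and do not come from the Bing page.

```python
import re

text = 'first line\nSecond LINE'

# re.I: case-insensitive matching.
ignore_case = re.findall('line', text, re.I)

# re.M: '^' also matches right after each newline.
line_starts = re.findall(r'^\w+', text, re.M)

# re.S: '.' matches the newline too, so '.*' can span both lines.
dot_all = re.match('first.*LINE', text, re.S)
dot_default = re.match('first.*LINE', text)

# re.X: whitespace and comments inside the pattern are ignored.
verbose = re.compile(r'''
    (\w+)   # a word
    \s+     # whitespace between the words
    (\w+)   # another word
''', re.X)

print(ignore_case)
print(line_starts)
print(dot_all is not None, dot_default is not None)
print(verbose.match(text).groups())
```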

  What we need here is re.S, which makes '.' match every character, including newlines. Change the regular expression to:

  

reg = re.compile('.*g_img={url: "(http.*?jpg)"', re.S)

  This solves the problem.
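  To see the effect of the flag on multi-line input, here is a minimal sketch. The page content is a hypothetical stand-in (sample_page and the example.com URL are not from the real Bing source); only the pattern is taken from the article.

```python
import re

# Hypothetical multi-line stand-in for the Bing page source.
sample_page = (
    '<html>\n'
    '<script>\n'
    ';g_img={url: "https://example.com/homepage/bg.jpg",id:\'bgDiv\'\n'
    '</script>\n'
)

pattern = '.*g_img={url: "(http.*?jpg)"'

# Without re.S the leading '.*' stops at the first newline.
without_flag = re.match(pattern, sample_page)

# With re.S the leading '.*' can run across lines up to g_img.
with_flag = re.match(pattern, sample_page, re.S)

print(without_flag)
print(with_flag.group(1))
```

  One thing to keep in mind: with re.S the leading .* is greedy across the whole page, so if g_img={url: appeared more than once, this pattern would capture the last occurrence rather than the first.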

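  That is everything I have to share about matching an image URL in a web page with a Python regular expression. I hope it serves as a useful reference, and I hope you will support the aspku source code library.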

