php regex web scraping (1: Scraping China Daily news with regular expressions)

Published by 优采云: 2022-03-05 14:06


## Sample 1: crawling China Daily

## Original article: https://blog.csdn.net/carson0408/article/details/89890687

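## Based on the figure above, a title-matching rule can be defined; only the content inside the parentheses (the capture group) is printed. pattern3=''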

### Regular expression reference: https://www.jb51.net/article/65286.htm

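## .+? matches one or more of any character except a newline, as few as possible (non-greedy); it is not limited to non-whitespace characters.

## The regular expression "ab*" applied to "abbbc" finds "abbb". With the non-greedy quantifier, "ab*?" finds "a".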
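The greedy and non-greedy behaviour described above can be checked with a short sketch. The `html` string and the `<b>` tags are made-up illustration data, not taken from the China Daily page.

```python
import re

text = "abbbc"

# Greedy: b* consumes as many 'b' characters as it can.
greedy = re.search(r"ab*", text).group()

# Non-greedy: b*? consumes as few 'b' characters as possible, here none.
lazy = re.search(r"ab*?", text).group()

# Illustrative markup (an assumption, not real page content).
html = "<b>one</b><b>two</b>"

# .+? stops at the first closing tag, so each element is captured separately.
lazy_tags = re.findall(r"<b>(.+?)</b>", html)

# .+ runs to the last closing tag and swallows the markup in between.
greedy_tags = re.findall(r"<b>(.+)</b>", html)

print(greedy, lazy)
print(lazy_tags)
print(greedy_tags)
```

This is why the scraping patterns further down use `.*?` rather than `.*`: on a page with many links, a greedy quantifier would tend to span from the first link to the last one on the same line.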

# [a-zA-Z_] matches a single letter or underscore, [0-9] matches a single digit, and [/hpl] matches one of the specific characters /, h, p, or l.
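A small sketch of these three character classes follows. The `sample` string is invented for illustration.

```python
import re

# Made-up input string used only to exercise the character classes.
sample = "user_name42/help"

# [a-zA-Z_]+ matches runs of letters and underscores.
words = re.findall(r"[a-zA-Z_]+", sample)

# [0-9]+ matches runs of digits.
digits = re.findall(r"[0-9]+", sample)

# [/hpl] matches exactly one character out of '/', 'h', 'p', 'l'.
chars = re.findall(r"[/hpl]", sample)

print(words)
print(digits)
print(chars)
```

Note that a class matches one character at a time; the `+` quantifier is what turns it into a match for a whole word or number.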

# For "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd3 in position 252", please refer to:

##https://blog.csdn.net/u012767761/article/details/119836555
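One common way to handle this error is to try a second encoding when UTF-8 fails. The sketch below assumes the page is either UTF-8 or GBK, which matches the two `decode` calls in the script; `decode_page` is a hypothetical helper, not part of the original code or of the linked article.

```python
def decode_page(raw, encodings=("utf-8", "gbk")):
    """Hypothetical helper: try each encoding in turn until one succeeds."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and replace undecodable bytes.
    return raw.decode(encodings[0], errors="replace")


# Bytes encoded as GBK are not valid UTF-8, so the first attempt fails
# and the helper falls back to 'gbk'.
gbk_bytes = "中国日报".encode("gbk")
print(decode_page(gbk_bytes))
```

In the script, this would replace the hard-coded `.decode('utf-8')` with `decode_page(urllib.request.urlopen(req).read())`, again only as an assumption about how the two pieces would be combined.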

## Scrape English news from China Daily

import re

import urllib.request

def getcontent(url):
    # Build the request with a browser-like User-Agent header
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0(Windows NT 10.0;Win64;x64;rv:66.0)Gecko/20100101 Firefox/66.0')
    data = urllib.request.urlopen(req).read().decode('utf-8')
    #print(data)
    # If the page is GBK-encoded, decode with 'gbk' instead:
    #data = urllib.request.urlopen(req).read().decode('gbk')
    # Capture the relative link of each article
    pattern1 = '<a href="/(.*?)" target="_blank" title=".*?">'
    urlList = re.compile(pattern1).findall(data)
    # This pattern prints only some of the titles: sports news only
    pattern2 = '<a target="_blank" class="txt1" shape="rect" href="/.*?">.*?'
    # This pattern prints only some of the titles: sports news excluded
    #pattern3 = '<a href=".*?" target="_blank" title=(.*?)>'

#pattern3 = '
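To see what `pattern1` captures without fetching a live page, it can be run against a small hand-written HTML fragment. Everything in `sample_html` (paths and headlines) is invented for illustration, and `title_pattern` is a hypothetical variant that moves the capture group onto the `title` attribute; neither comes from the original script.

```python
import re

# Invented HTML fragment that mimics the link layout pattern1 expects.
sample_html = (
    '<a href="/a/202203/05/WS0001.html" target="_blank" title="First headline">First headline</a>\n'
    '<a href="/a/202203/05/WS0002.html" target="_blank" title="Second headline">Second headline</a>\n'
)

# pattern1 from the script: the parentheses capture the relative link only.
pattern1 = '<a href="/(.*?)" target="_blank" title=".*?">'
links = re.compile(pattern1).findall(sample_html)

# Hypothetical variant: capture the title text instead of the link.
title_pattern = '<a href="/.*?" target="_blank" title="(.*?)">'
titles = re.findall(title_pattern, sample_html)

print(links)
print(titles)
```

`findall` returns only the text inside the capture group when the pattern has exactly one group, which is the "only the content inside the parentheses is printed" behaviour mentioned in the notes.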
