Web scraping encrypted HTML (I have this code to get a page's HTML source and want to collect some content from it)
优采云 Published: 2022-02-23 06:20
I have this code to get the HTML source of a page:
$page = file_get_contents('http://example.com/page.html');
$page = htmlentities($page);

I want to extract some content from it. For example, suppose the page source contains:
Pinging technorati.com
Connection failed
Pinging icerocket.com
Connection failed
Pinging weblogs.com
Done
Pinging newsgator.com
Done
Pinging blo.gs
Done
Pinging feedburner.com
Done
Pinging blogstreet.com
Done
Pinging my.yahoo.com
Connection failed
Pinging moreover.com
Connection failed
Pinging newsisfree.com
Done

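Is there a way to strip these results out of the source and store them in a variable, so that it looks like this:
Connection failed
Connection failed
Done
and so on.
The page is dynamic, which is why I'm having trouble. Could I search the source for each site name? But then how would I get the result that follows it (Connection failed / Done)?
Thanks for your help!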
I have scraped several sites using the Simple HTML DOM PHP library.
You can then use code like this:
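<?php
include_once 'simple_html_dom.php';
// Setup lost from the original snippet; assumed here. $url is a placeholder for the page to scrape.
$html = file_get_html($url);
$pat = array('/^\s+/', '/\s{2,}/', '/\s+$/'); // leading, repeated, trailing whitespace
$rep = array('', ' ', '');
foreach ($html->find('h2') as $heading) { // for each heading
    // find the first <a> inside a <span> and echo its text
    echo preg_replace($pat, $rep, $heading->find('span a', 0)->plaintext) . "\n";
}
?>
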
This produces output like:
5.8 Earthquake Hits East Coast of the US
Origins of Lager Found In Argentina
Inside Oregon State University's Open Source Lab
WebAPI: Mozilla Proposes Open App Interface For Smartphones
Using Tablets Becoming Popular Bathroom Activity
The Syrian Government's Internet Strategy
Deus Ex: Human Revolution Released
Taken Over By Aliens? Google Has It Covered
The GIMP Now Has a Working Single-Window Mode
Zombie Cookies Just Won't Die
Motorola's Most Important 18 Patents
MK-1 Robotic Arm Capable of Near-Human Dexterity, Dancing
Evangelical Scientists Debate Creation Story
Android On HP TouchPad
Google Street View Gets Israeli Government's Nod
Internet Restored In Tripoli As Rebels Take Control
GA Tech: Internet's Mid-Layers Vulnerable To Attack
Serious Crypto Bug Found In PHP 5.3.7
Twitter To Meet With UK Government About Riots
EU Central Court Could Validate Software Patents
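
For the ping page in the question, a plain regular expression may be enough, since every result follows a "Pinging <site>" label. The sketch below is only one possible approach and rests on assumptions: the sample markup in $page is made up (the real tags are not shown in the question), and parsePingResults() is an illustrative helper name, not part of any library. It should be run on the raw output of file_get_contents(), before htmlentities(), because htmlentities() turns the tags into escaped text.

```php
<?php
// Hypothetical sample standing in for the fetched page; the real markup is unknown.
$page = '<div>Pinging technorati.com</div><div>Connection failed</div>'
      . '<div>Pinging weblogs.com</div><div>Done</div>';

/**
 * Return a map of site => status ("Connection failed" or "Done").
 * Illustrative helper, not part of any library.
 */
function parsePingResults(string $html): array
{
    // Replace tags with spaces so neighbouring pieces of text do not fuse together.
    $text = preg_replace('/<[^>]*>/', ' ', $html);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Lazily capture the site name up to the first status that follows it.
    preg_match_all(
        '/Pinging\s+(\S+?)\s*(Connection failed|Done)/',
        $text,
        $matches,
        PREG_SET_ORDER
    );

    $results = [];
    foreach ($matches as $m) {
        $results[$m[1]] = $m[2];
    }
    return $results;
}

$results = parsePingResults($page);
foreach ($results as $site => $status) {
    echo $site . ': ' . $status . "\n";
}
```

If only the statuses are needed, in page order, array_values($results) gives them as a plain list. Keying the array by site name also answers the "search for each site" part of the question: $results['weblogs.com'] holds the result for that site whenever it appears on the page.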