Web scraping encrypted HTML (I have this code to get a page's HTML source and want to collect some content from it)

优采云 · Published: 2022-02-23 06:20

  I have this code to get the HTML source of a page:

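  // Download the raw HTML of the page as a string.
  $page = file_get_contents('http://example.com/page.html');
  // Escape the markup so it can be displayed as text (tags become &lt;...&gt;).
  $page = htmlentities($page);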

  I want to collect some content from it. For example, suppose the page source contains:

  technorati.com
  Connection failed
  Pinging icerocket.com
  Connection failed
  Pinging weblogs.com
  Done
  Pinging newsgator.com
  Done
  Pinging blo.gs
  Done
  Pinging feedburner.com
  Done
  Pinging blogstreet.com
  Done
  Pinging my.yahoo.com
  Connection failed
  Pinging moreover.com
  Connection failed
  Pinging newsisfree.com
  Done

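  Is there a way to pull the results out of the source and store them in a variable, so that it looks like this:

  Connection failed

  Connection failed

  Done

  and so on.

  The page is dynamic, which is why I am having trouble. Could I search the source for each site name? And if so, how do I then get the result that follows it (Connection failed / Done)?

  Thanks for your help!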
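  One way to approach the question above is to search the visible text for each "Pinging <site>" entry and capture the status that follows it. The sketch below is only an illustration: the sample markup (with `<br>` separators) and the function name `extractPingResults` are assumptions, since the real page markup is not shown. It works on the raw HTML, before any call to `htmlentities`, because escaped tags can no longer be stripped.

```php
<?php
// Hypothetical sample input. The real markup is not shown in the question,
// so the <br> separators are an assumption.
$page = 'Pinging technorati.com<br>Connection failed<br>'
      . 'Pinging icerocket.com<br>Connection failed<br>'
      . 'Pinging weblogs.com<br>Done<br>';

/**
 * Map each pinged host to the status text that follows it.
 * (extractPingResults is an illustrative name, not a library function.)
 */
function extractPingResults(string $html): array
{
    // Put a space before every tag so that stripping tags does not glue
    // neighbouring pieces of text together, then collapse whitespace.
    $text = strip_tags(str_replace('<', ' <', $html));
    $text = preg_replace('/\s+/', ' ', $text);

    // Capture the host after "Pinging" and the status that comes next.
    $pattern = '/Pinging\s+(\S+)\s+(Connection failed|Done)/';
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $results = [];
    foreach ($matches as $match) {
        $results[$match[1]] = $match[2];
    }
    return $results;
}

$results = extractPingResults($page);
foreach ($results as $host => $status) {
    echo $host, ': ', $status, "\n";
}
```

  Keying the array by host keeps each status tied to its site even when the page changes order; `array_values($results)` gives just the list of statuses.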

  I have scraped several sites with the Simple HTML DOM PHP library.

  The code looks like this:

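  // The opening lines of this snippet are missing from the source. $html is
  // assumed to hold the page parsed by Simple HTML DOM; $pat and $rep are the
  // patterns and replacements passed to preg_replace (their definitions are
  // missing as well).
  foreach ($html->find('h2') as $heading) { // for each heading
      // find the first <a> inside a <span> in this heading and echo its text
      echo preg_replace($pat, $rep, $heading->find('span a', 0)->plaintext) . "\n";
  }
  ?>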

  This produces output like:

  5.8 Earthquake Hits East Coast of the US
  Origins of Lager Found In Argentina
  Inside Oregon State University's Open Source Lab
  WebAPI: Mozilla Proposes Open App Interface For Smartphones
  Using Tablets Becoming Popular Bathroom Activity
  The Syrian Government's Internet Strategy
  Deus Ex: Human Revolution Released
  Taken Over By Aliens? Google Has It Covered
  The GIMP Now Has a Working Single-Window Mode
  Zombie Cookies Just Won't Die
  Motorola's Most Important 18 Patents
  MK-1 Robotic Arm Capable of Near-Human Dexterity, Dancing
  Evangelical Scientists Debate Creation Story
  Android On HP TouchPad
  Google Street View Gets Israeli Government's Nod
  Internet Restored In Tripoli As Rebels Take Control
  GA Tech: Internet's Mid-Layers Vulnerable To Attack
  Serious Crypto Bug Found In PHP 5.3.7
  Twitter To Meet With UK Government About Riots
  EU Central Court Could Validate Software Patents
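
  The same heading extraction can also be sketched with PHP's bundled DOM extension instead of Simple HTML DOM. This is a swapped-in technique, not the code from the answer: the sample markup, the function name `extractHeadings`, and the whitespace clean-up are all assumptions made for illustration, and the DOM extension is assumed to be enabled (it is by default in standard PHP builds).

```php
<?php
// Hypothetical markup standing in for a fetched page; the structure of the
// real page is not shown in the original post.
$source = <<<HTML
<html><body>
  <h2><span><a href="/story/1">  First   headline </a></span></h2>
  <h2><span><a href="/story/2">Second headline</a></span></h2>
  <h2>No link here</h2>
</body></html>
HTML;

/**
 * Return the text of the first <a> inside a <span> for every <h2>.
 * (extractHeadings is an illustrative name, not a library function.)
 */
function extractHeadings(string $source): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $headings = [];
    foreach ($xpath->query('//h2') as $heading) {
        // Look only inside the current heading and take the first match.
        $link = $xpath->query('.//span//a', $heading)->item(0);
        if ($link === null) {
            continue; // heading without a linked title
        }
        // Collapse runs of whitespace and trim both ends.
        $headings[] = trim(preg_replace('/\s+/', ' ', $link->textContent));
    }
    return $headings;
}

foreach (extractHeadings($source) as $title) {
    echo $title, "\n";
}
```

  Skipping headings that have no link avoids the error the chained `find('span a', 0)->plaintext` style would raise when a heading lacks the expected structure.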
