Scraping dynamic web pages with Node.js (almost everything exists on the internet; the question is how to organize it into what you need)

优采云 · Published: 2021-11-14 23:11

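Almost any information you want exists somewhere on the internet. The problem is how to organize it into what you need: for example, scraping the names, websites, phone numbers, and email addresses of all the companies in an industry and saving them to Excel for analysis. That is what makes web scraping increasingly useful.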
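As a rough illustration of the "extract, then save for Excel" idea, here is a minimal sketch that pulls email addresses out of page text and formats them as CSV rows, which Excel can open. The class name `ContactCsv`, its methods, and the sample HTML are hypothetical and not part of the original article; the email pattern is deliberately simplified and will miss some valid addresses.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds email addresses in page text and builds CSV rows. */
final class ContactCsv {
    // Simplified pattern (an assumption); real-world addresses can be more varied.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Returns the distinct email addresses found in the text, in order of appearance. */
    static List<String> extractEmails(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return new ArrayList<>(found);
    }

    /** Quotes a field when it contains a comma, a double quote, or a line break. */
    static String csvField(String value) {
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        return needsQuotes ? "\"" + value.replace("\"", "\"\"") + "\"" : value;
    }

    /** Joins the given fields into one CSV line. */
    static String csvRow(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(csvField(fields[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample markup standing in for a downloaded company page.
        String html = "<p>Acme, Inc. contact: "
                + "<a href=\"mailto:info@example.com\">info@example.com</a></p>";
        System.out.println(csvRow("company", "email"));
        for (String email : extractEmails(html)) {
            System.out.println(csvRow("Acme, Inc.", email));
        }
    }
}
```

Writing plain CSV keeps the sketch free of third-party libraries; a real project that needs native Excel files would likely use a dedicated spreadsheet library instead.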

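For traditional web pages, the web server returns HTML directly. Pages of this kind are easy to scrape: whatever method you use, you only need to fetch the HTML page and then parse its DOM. For pages whose content is generated by JavaScript, it is not so easy. I (Zhang Yu) have not yet found a good solution to that problem; if you have experience scraping JavaScript-generated pages, pointers are welcome.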
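For the static case described above, the two steps (fetch the HTML, then walk its markup) can be sketched with the JDK alone. This is only an illustration, not the approach the article goes on to use: the class name `StaticLinkExtractor` and the sample markup are hypothetical, `fetch` assumes Java 11 or later for `java.net.http`, and the JDK's built-in HTML parser is event-based rather than a full DOM and only understands older HTML reliably.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper: downloads a static page and collects the href of every link. */
final class StaticLinkExtractor {

    /** Step 1: fetch the raw HTML that the server returns. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /** Step 2: parse the markup and record the href attribute of each <a> tag. */
    static List<String> extractLinks(String html) {
        List<String> hrefs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        hrefs.add(href.toString());
                    }
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        // With a URL argument the page is downloaded; otherwise sample markup is used.
        String html = args.length > 0
                ? fetch(args[0])
                : "<div id=\"u\"><a href=\"http://example.com/a\">A</a><a href=\"/b\">B</a></div>";
        for (String href : extractLinks(html)) {
            System.out.println(href);
        }
    }
}
```

Nothing here executes the page's scripts, which is exactly why this approach stops working once the content is generated by JavaScript in the browser.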

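So what I want to cover today is scraping information from traditional HTML pages. As I said, there is no real technical difficulty in it, but is there a simpler way? Anyone who has used a JS framework such as jQuery may feel that JavaScript looks like a natural helper for scraping web page information, since it was born for working with web pages. Of course, it now has many more uses, such as server-side JavaScript with Node.js.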

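If we could use jQuery to scrape web pages from inside our own application, say a Java program, that would be genuinely exciting. There is in fact a ready-made approach: a JavaScript engine plus an environment that can support jQuery.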

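Tools: Java, Rhino, and envJs. Rhino is the open-source JavaScript engine provided by Mozilla, and envJs simulates a browser environment, such as the Window object. The code is as follows: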

package stony.zhang.scrape;

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.lang.reflect.InvocationTargetException;

import org.mozilla.javascript.Context;
import org.mozilla.javascript.ContextFactory;
import org.mozilla.javascript.Scriptable;
import org.mozilla.javascript.ScriptableObject;

/**
 * @author MyBeautiful
 * @Email: zhangyu0182@sina.com
 * @date Mar 7, 2012
 */
public class RhinoScaper {
    private String url;
    private String jsFile;
    private Context cx;
    private Scriptable scope;

    public String getUrl() {
        return url;
    }

    public String getJsFile() {
        return jsFile;
    }

    // Exposes the URL to scripts as the global "url"; call init() first,
    // because putObject() needs the scope that init() creates.
    public void setUrl(String url) {
        this.url = url;
        putObject("url", url);
    }

    public void setJsFile(String jsFile) {
        this.jsFile = jsFile;
    }

    public void init() {
        cx = ContextFactory.getGlobal().enterContext();
        scope = cx.initStandardObjects(null);
        // -1 selects Rhino's interpreted mode (no bytecode compilation).
        cx.setOptimizationLevel(-1);
        cx.setLanguageVersion(Context.VERSION_1_5);

        // Load the simulated browser environment first, then jQuery on top of it.
        String[] file = { "./lib/env.rhino.1.2.js", "./lib/jquery.js" };
        for (String f : file) {
            evaluateJs(f);
        }

        // ExtendUtil is the author's helper class (not shown in this article);
        // scripts reach it through the global "util", e.g. util.log(...).
        try {
            ScriptableObject.defineClass(scope, ExtendUtil.class);
        } catch (IllegalAccessException | InstantiationException | InvocationTargetException e1) {
            e1.printStackTrace();
        }
        ExtendUtil util = (ExtendUtil) cx.newObject(scope, "util");
        scope.put("util", scope, util);
    }

    // Evaluates a JavaScript file in the shared scope.
    protected void evaluateJs(String f) {
        // try-with-resources closes the reader even if evaluation fails.
        try (FileReader in = new FileReader(f)) {
            cx.evaluateReader(scope, in, f, 1, null);
        } catch (FileNotFoundException e1) {
            e1.printStackTrace();
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    }

    public void putObject(String name, Object o) {
        scope.put(name, scope, o);
    }

    public void run() {
        evaluateJs(this.jsFile);
    }
}

Test code, in the file test.js:

$.ajax({
    url: "http://www.baidu.com",
    context: document.body,
    success: function(data) {
        // util.log(data);
        var result = parseHtml(data);
        var $v = jQuery(result);
        // util.log(result);
        $v.find('#u a').each(function(index) {
            util.log(index + ': ' + $(this).attr("href"));
            // arr.add($(this).attr("href"));
        });
    }
});

function parseHtml(html) {
    // Create an iFrame object that will be used to render the HTML in order to get the DOM objects
    // created - this is a far quicker way of achieving the HTML to DOM conversion than trying
    // to transform the HTML objects one-by-one
    var oIframe = document.createElement('iframe');

    // Hide the iFrame from view
    oIframe.style.display = 'none';
    if (document.body)
        document.body.appendChild(oIframe);
    else
        document.documentElement.appendChild(oIframe);

    // Open the iFrame DOM object and write in our HTML
    oIframe.contentDocument.open();
    oIframe.contentDocument.write(html);
    oIframe.contentDocument.close();

    // Return the document body object containing the HTML that was just
    // added to the iFrame as DOM objects
    var oBody = oIframe.contentDocument.body;

    // TODO: Remove the iFrame object created to cleanup the DOM
    return oBody;
}

When we run the unit test, the three Baidu links scraped from the page are printed to the console:

　　0：

　　1：

　　2：

The test succeeds, which shows that using jQuery to scrape web pages from a Java program is feasible.

　　----------------------------------------------- -----------------------

Zhang Yu, MyBeautiful
