httpunit web scraping (jsoup web scraping tutorial, illustrated)

优采云 · Published: 2021-10-27 03:11


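jsoup web scraping tutorial

I previously published two articles about htmlparser on IBM developerWorks: one on extracting the information you need from HTML, the other on extending htmlparser to handle custom tags. I no longer use htmlparser. It is rarely updated, but the more important reason is jsoup.

jsoup is a Java HTML parser that can directly parse a URL or a piece of HTML text. It provides a very labor-saving API for extracting and manipulating data through DOM, CSS, and jQuery-like methods. Its main features are:

1. Parse HTML from a URL, a file, or a string
2. Find and extract data using DOM traversal or CSS selectors
3. Manipulate HTML elements, attributes, and text

jsoup is released under the MIT license, so it can be used safely in commercial projects. Figure 1 shows jsoup's main class hierarchy. The rest of this article explains how jsoup handles HTML documents.

Document input

jsoup can load an HTML document from a String, a URL, or a local file and produce a Document object instance. The relevant code follows; each statement that assigns doc is an independent alternative.

Listing 1

    // Parse an HTML document directly from a string
    String html = "<html><head><title>Open Source China Community</title></head>"
        + "<body><p>Articles related to the jsoup project</p></body></html>";
    Document doc = Jsoup.parse(html);

    // Load an HTML document directly from a URL
    Document doc = Jsoup.connect("http://www.oschina.net/").get();
    String title = doc.title();

    Document doc = Jsoup.connect("http://www.oschina.net/")
        .data("query", "Java")     // request parameter
        .userAgent("I'm jsoup")    // set the User-Agent
        .cookie("auth", "token")   // set a cookie
        .timeout(3000)             // set the connection timeout (milliseconds)
        .post();                   // access the URL with the POST method

    // Load an HTML document from a file
    File input = new File("C:/test.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://www.oschina.net/");

Note the third parameter of parse in the last input method. Why specify a URL here, when the first method does not need one? An HTML document contains many links, such as images and references to external scripts and CSS files. The third parameter, baseUri, is the prefix jsoup uses when the document references external files by relative path: relative URLs are resolved against it. For example, <a href="/project">Open source software</a> resolves to <a href="http://www.oschina.net/project">Open source software</a>.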
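The effect of the base URI can be checked with a small sketch. It assumes jsoup is on the classpath; the class name BaseUriDemo and the helper firstAbsoluteHref are hypothetical names chosen for illustration. The resolution is requested explicitly through Element.absUrl, because the parsed document still stores the relative href as written.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BaseUriDemo {
    /** Parses html against baseUri and returns the first link's absolute href ("" if none). */
    static String firstAbsoluteHref(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        // absUrl resolves the attribute value against the document's base URI
        return link == null ? "" : link.absUrl("href");
    }

    /** Returns the first link's href exactly as written in the markup ("" if none). */
    static String firstRawHref(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.attr("href");
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p><a href=\"/project\">Open source software</a></p></body></html>";
        String base = "http://www.oschina.net/";
        System.out.println(firstRawHref(html, base));      // /project
        System.out.println(firstAbsoluteHref(html, base)); // http://www.oschina.net/project
    }
}
```

The same resolution is available inside attribute lookups with the abs: prefix, for example link.attr("abs:href").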

  

Parsing and extracting HTML elements

This part covers the most basic function of an HTML parser, but jsoup differs from other open-source projects in one respect: the selector, which the last section covers in detail. In this section you will see how jsoup does the job with the simplest possible code. jsoup also supports the traditional DOM approach to element parsing, as in the code below.

Listing 2

    File input = new File("D:/test.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://www.oschina.net/");

    Element content = doc.getElementById("content");
    Elements links = content.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
    }

You might think jsoup's methods look familiar, and they are: getElementById and getElementsByTag have the same names and the same functions as their JavaScript counterparts. You can get the corresponding element or list of elements by node name or by the ID of the HTML element. Unlike the htmlparser project, jsoup does not define a separate class for each kind of HTML element. An HTML element generally consists of a node name, attributes, and text, and jsoup provides simple ways to retrieve these, which is how it stays lean.

When it comes to element retrieval, jsoup selectors can do almost anything.

Listing 3

    File input = new File("D:/test.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://www.oschina.net/");

    Elements links = doc.select("a[href]");                 // links with an href attribute
    Elements pngs = doc.select("img[src$=.png]");           // all elements referencing PNG images
    Element masthead = doc.select("div.masthead").first();  // first element with class masthead
    Elements resultLinks = doc.select("h3.r > a");          // a directly after h3

This is what really impressed me about jsoup: its selectors are as like jQuery's as two peas in a pod. With other HTML parsers, each retrieval above would take many lines of code; jsoup needs one. jsoup selectors also support expressions, and the last section introduces this super selector.
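To try both retrieval styles without a local file, the following sketch parses an inline HTML string instead. The markup, the class name SelectorDemo, and the helper names are made up for illustration; the jsoup calls are the ones used in the listings.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class SelectorDemo {
    // Hypothetical sample markup: three anchors (one without href) and two images.
    static final String HTML = "<div id=\"content\">"
            + "<a href=\"/a\">First</a><a href=\"/b\">Second</a><a name=\"anchor\">No href</a>"
            + "<img src=\"logo.png\"><img src=\"photo.jpg\"></div>";

    /** DOM style: walk from the element with id "content" to its <a> descendants. */
    static List<String> linkTextsViaDom(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        Element content = doc.getElementById("content");
        if (content != null) {
            for (Element link : content.getElementsByTag("a")) {
                texts.add(link.text());
            }
        }
        return texts;
    }

    /** Selector style: one expression per query. */
    static int countMatches(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(linkTextsViaDom(HTML));                // [First, Second, No href]
        System.out.println(countMatches(HTML, "a[href]"));        // 2
        System.out.println(countMatches(HTML, "img[src$=.png]")); // 1
    }
}
```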

  

Modifying data

While parsing a document, we may also need to modify some of its elements, for example to change attributes on links, change link addresses, or change text. Here are some simple examples.

Listing 4

    doc.select("div.comments a").attr("rel", "nofollow");  // add rel=nofollow to the links in comments
    // a further statement adds a class to links; its arguments did not survive in this copy
    doc.select("img").removeAttr("onclick");               // remove the onclick attribute from all images
    doc.select("input[type=text]").val("");                // empty all text input boxes

The approach is very simple: find the elements with a jsoup selector, then modify them with the methods above. Apart from the tag name, which cannot be changed (although you can remove an element and insert a new one), the elements can be modified this way.

HTML document cleanup

jsoup also does a great job of pairing a powerful API with ease of use. When you build a website, you often let users post comments. Some mischievous users put scripts into their comments, and these scripts may break the whole page or, more seriously, obtain confidential information, as in cross-site scripting (XSS) attacks. jsoup handles this well and is very simple to use. Take a look at the code below.

Listing 5

    String unsafe = "<p><a href='…' …>Open source Chinese community</a></p>";
    String safe = Jsoup.clean(unsafe, Whitelist.basic());
    // output:
    // <p><a href="…" rel="nofollow">Open source Chinese community</a></p>

Here jsoup filters the input against the list of permitted tags and attributes returned by Whitelist.basic().
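A minimal sketch of cleanup and modification on inline strings follows. It assumes a recent jsoup release, where the class the article calls Whitelist is named Safelist (on older releases, substitute Whitelist). The class name CleanDemo, the helper names, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

class CleanDemo {
    /** Strips everything the basic safelist does not permit (scripts, event handlers, ...). */
    static String sanitize(String unsafeHtml) {
        return Jsoup.clean(unsafeHtml, Safelist.basic());
    }

    /** Marks comment links as nofollow and removes onclick handlers from images. */
    static String hardenComments(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        doc.select("div.comments a").attr("rel", "nofollow");
        doc.select("img").removeAttr("onclick");
        return doc.body().html();
    }

    public static void main(String[] args) {
        String unsafe = "<p><a href='http://example.com/' onclick='steal()'>Link</a>"
                + "<script>alert(1)</script></p>";
        System.out.println(sanitize(unsafe));

        String comments = "<div class=\"comments\"><a href=\"/x\">x</a>"
                + "<img src=\"a.png\" onclick=\"track()\"></div>";
        System.out.println(hardenComments(comments));
    }
}
```

Cleaning user input before storing or rendering it is the safer of the two operations; the modification helper only rewrites markup that is already trusted.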

The whitelist can be extended with further tags, for example "embed", "object", "param", "span", and "div"; you can also call addAttributes to permit additional attributes.

jsoup's selector

We have briefly introduced how jsoup uses selectors to retrieve elements. This section focuses on the powerful syntax of the selector itself. The following tables list the syntax of the jsoup selector.

Table 2. Basic usage

    tagname          Locate by tag name, e.g. a
    ns|tag           Locate by namespaced tag, e.g. fb|name finds <fb:name> elements
    #id              Locate by element ID, e.g. #logo
    .class           Locate by the element's class attribute, e.g. .head
    [attribute]      Locate by attribute, e.g. [href] retrieves all elements that have an href attribute
    [^attr]          Locate by attribute name prefix, e.g. [^data-] finds HTML5 dataset attributes
    [attr=value]     Locate by attribute value, e.g. [width=500] locates all elements whose width attribute is 500
    [attr^=value]    Attribute value begins with value
    [attr$=value]    Attribute value ends with value
    [attr*=value]    Attribute value contains value
    [attr~=regex]    Filter attribute values with a regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
    *                Locate all elements

These are the most basic selector forms, and they can be used together. The combinations jsoup supports are listed next.

Table 3. Combined usage

    el#id            An element with the given ID, e.g. a#logo matches <a id=logo href= … >
    el.class         An element with the given class, e.g. div.head matches <div class=head>xxxx</div>
    el[attr]         An element that defines the given attribute, e.g. a[href]
    (any combination of the three above, e.g. a[href]#logo, a[name].outerlink)
    ancestor child   Descendant relationship
    parent > child   Direct parent-child relationship
    siblingA + siblingB   siblingB immediately preceded by siblingA
    siblingA ~ siblingX   siblingX preceded by siblingA
    el, el, el       Grouping: elements matching any of the listed selectors

The last five rows cover the parent-child, grouping, and hierarchical relationships.

Beyond the basic syntax and its combinations, jsoup also supports expressions for filtering the selected elements. The expressions are listed below.

Table 4. Expressions

    :lt(n)               Elements whose sibling index is less than n, e.g. td:lt(3) selects the first three columns
    :gt(n)               Elements whose sibling index is greater than n, e.g. div p:gt(2)
    :eq(n)               Elements whose sibling index equals n, e.g. form input:eq(1)
    :has(selector)       Elements containing a match, e.g. div:has(p) is a div that contains a p element
    :not(selector)       Elements that do not match, e.g. div:not(.logo) is every div without class logo
    :contains(text)      Elements containing the text, case-insensitive, e.g. p:contains(oschina)
    :containsOwn(text)   Elements whose own text (not that of descendants) contains the text
    :matches(regex)      Filter text with a regular expression, e.g. div:matches((?i)login)
    :matchesOwn(regex)   Filter the element's own text with a regular expression
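Several rows of the tables can be exercised against a small document. This is a sketch under stated assumptions: the markup, the class name SelectorSyntaxDemo, and the helper count are invented for illustration, and the counts in the comments refer only to this sample markup.

```java
import org.jsoup.Jsoup;

class SelectorSyntaxDemo {
    // Hypothetical sample: two divs, one link, two paragraphs, one table row with four cells.
    static final String HTML =
            "<div class=\"head\"><a id=\"logo\" href=\"/\">Home</a></div>"
          + "<div class=\"body\"><p>Welcome to oschina</p><p>Second paragraph</p></div>"
          + "<table><tr><td>1</td><td>2</td><td>3</td><td>4</td></tr></table>";

    /** Number of elements in the sample document matched by the selector. */
    static int count(String selector) {
        return Jsoup.parse(HTML).select(selector).size();
    }

    public static void main(String[] args) {
        System.out.println(count("a#logo[href]"));        // 1: tag + id + attribute combined
        System.out.println(count("div.head > a"));        // 1: direct child
        System.out.println(count("div.head, div.body"));  // 2: grouping
        System.out.println(count("div:has(p)"));          // 1: only div.body contains a p
        System.out.println(count("div:not(.head)"));      // 1: div.body
        System.out.println(count("p:contains(OSCHINA)")); // 1: text match ignores case
        System.out.println(count("td:lt(3)"));            // 3: sibling indexes 0, 1, 2
    }
}
```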
