Scraping WeChat Official Account content is a bit odd: its parameters, the POST parameters in particular, take time to work out

优采云 · Published: 2021-08-18 01:17

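Scraping content from WeChat Official Accounts is a bit odd: the request parameters, and the POST parameters in particular, take time to work out. What gets scraped here is the content listed under a topic tag, which is then printed to PDF with pdfkit.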
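As a rough illustration of the "print with pdfkit" step, here is a minimal sketch. The helper names `build_html_document` and `save_pdf` are hypothetical and not taken from the original code; the sketch assumes the `pdfkit` package and the wkhtmltopdf binary are installed. Declaring UTF-8 explicitly is a precaution so that Chinese text is less likely to come out garbled in the PDF.

```python
import html


def build_html_document(title, body_html):
    """Wrap an HTML fragment in a complete page that declares UTF-8."""
    safe_title = html.escape(title)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{safe_title}</title></head>"
        f"<body><h1>{safe_title}</h1>{body_html}</body></html>"
    )


def save_pdf(title, body_html, out_path, wkhtmltopdf_path=None):
    """Render the fragment to a PDF file (hypothetical helper)."""
    # Imported lazily so the HTML helper above works without pdfkit installed.
    import pdfkit

    # Point pdfkit at the wkhtmltopdf binary only when a path is given;
    # otherwise pdfkit looks it up on PATH.
    config = None
    if wkhtmltopdf_path:
        config = pdfkit.configuration(wkhtmltopdf=wkhtmltopdf_path)

    pdfkit.from_string(
        build_html_document(title, body_html),
        out_path,
        options={"encoding": "UTF-8"},
        configuration=config,
    )
```

A call such as `save_pdf("Some title", "<p>article body</p>", "out.pdf")` would then write `out.pdf` to the current directory.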

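Two versions are implemented here. The first requests the page directly over the network; the real address, the POST URL, and the additional parameters were not explored, so it retrieves only part of the content, which is not ideal. The second version visits the page with a headless browser, grabs the rendered page source, parses it, and extracts the content you want.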
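The direct-request approach comes down to pulling article links out of a JavaScript `var data={...}` object embedded in the page. The sketch below shows only that extraction step, run on a made-up HTML snippet. `extract_links` and `SAMPLE` are hypothetical names, and the exact shape of the embedded data on the real page is an assumption.

```python
import re

# Made-up page fragment; the real page layout may differ.
SAMPLE = r'''
<script>
var data={"list":[
{"title":"a","link":"http:\/\/mp.weixin.qq.com\/s?x=1","cover":""},
{"title":"b","link":"http:\/\/mp.weixin.qq.com\/s?x=2","cover":""},
{"title":"a","link":"http:\/\/mp.weixin.qq.com\/s?x=1","cover":""}]};
if (data) {}
</script>
'''


def extract_links(page_html):
    """Return the unique "link" values from the embedded data object."""
    block = re.search(r"var data\s*=\s*\{(.+?)\};", page_html, re.S)
    if block is None:
        return []
    links = re.findall(r'"link"\s*:\s*"(.+?)"', block.group(1))
    # JSON embedded in JavaScript often escapes slashes as \/ .
    cleaned = [link.replace("\\/", "/") for link in links]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(cleaned))
```

Unlike a plain `set`, de-duplicating through `dict.fromkeys` keeps the articles in page order, which is convenient when the PDFs should follow the order of the topic list.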

I have been rather lazy lately: the code is all old, ready-made material, copied, tweaked, and used as is!

Version 1:

# Fetch WeChat Official Account content and print it to PDF
# by WeChat: huguo00289
# https://mp.weixin.qq.com/mp/homepage?__biz=MzA4NjQ3MDk4OA==&hid=5&sn=573b1b806f9ebf63171a56ee2936b883&devicetype=android-29&version=27001239&lang=zh_CN&nettype=WIFI&a=&session_us=gh_7d55ab2d943f&wx_header=1&fontScale=100&from=timeline&isappinstalled=0&scene=1&subscene=2&clicktime=1594602258&enterid=1594602258&ascene=14
# -*- coding: UTF-8 -*-
import requests
from fake_useragent import UserAgent
import os,re
import pdfkit

# Path to the local wkhtmltopdf binary used by pdfkit
confg = pdfkit.configuration(
    wkhtmltopdf=r'D:\wkhtmltox-0.12.5-1.mxe-cross-win64\wkhtmltox\bin\wkhtmltopdf.exe')

class Du():
    def __init__(self,furl):
        ua=UserAgent()
        self.headers={
            "User-Agent": ua.random,
        }
        self.url=furl

    def get_urls(self):
        # Fetch the topic page and collect the unique article links
        response=requests.get(self.url,headers=self.headers,timeout=8)
        html=response.content.decode('utf-8')
        req=re.findall(r'var data={(.+?)if',html,re.S)[0]
        urls=re.findall(r',"link":"(.+?)",',req,re.S)
        urls=set(urls)
        print(len(urls))
        return urls

    def get_content(self,url,category):
        response = requests.get(url, headers=self.headers, timeout=8)
        print(response.status_code)
        html = response.content.decode('utf-8')
        # The HTML tag that opened this pattern is missing from the source
        req = re.findall(r'(.+?)var first_sceen__time',html,re.S)[0]
        # Get the title (the HTML tags around the group are missing from the source)
        h1=re.findall(r' (.+?)',req,re.S)[0]
        h1=h1.strip()
        pattern=r"[\/\\:\*\?\"\|]"
        h1=re.sub(pattern,"_",h1)  # replace characters that are invalid in file names with underscores
        print(h1)
        # Get the article body
        detail=re.findall(r'(.+?)
        # (the listing breaks off here in the source)
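The title clean-up step in the listing, which swaps characters that are not allowed in file names for underscores, can be sketched on its own as follows. `safe_filename` is a hypothetical helper, not part of the original code. As an assumption beyond the original pattern, this version also replaces `<` and `>`, which Windows rejects in file names as well, and falls back to a placeholder name when the title is empty.

```python
import re

# Characters Windows does not allow in file names: \ / : * ? " < > |
INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title, replacement="_"):
    """Turn an article title into a string usable as a file name."""
    cleaned = INVALID_CHARS.sub(replacement, title.strip())
    # An empty title would give an empty file name, so use a placeholder.
    return cleaned or "untitled"
```

The result can be joined with a target directory and a `.pdf` extension, for example `os.path.join(category, safe_filename(h1) + ".pdf")`.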
