[Original and free] Methods for retrieving WeChat official account articles

Published by 优采云: 2020-08-07 12:07

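Crawling with Python

Reference articles

Batch collection through an AnyProxy proxy

Implementation: anyproxy + js

Implementation: anyproxy + java + webmagic

FiddlerCore

Implementation: the packet capture tool Fiddler4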

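By capturing and analyzing the traffic of several accounts, we can determine:

Steps:

1. Write a key-macro (按键精灵) script that automatically taps through to the official account's article list page on the phone, i.e. "View history messages".

2. Use Fiddler as a proxy to intercept the phone's requests and forward the URLs to a local web page written in PHP.

3. Have the PHP page save each URL it receives to a database.

4. Use Python to read the URLs from the database and crawl them in the usual way.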
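Step 4 can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the author's script: it assumes the URLs were saved to a SQLite table named `captured_urls` with a `url` column (both names are hypothetical; the original names neither the database nor the schema), and it takes the fetch function as a parameter so the loop can be tried without network access.

```python
import sqlite3
import time


def crawl_saved_urls(conn, fetch, delay_seconds=0.0):
    """Read captured URLs from the database and fetch each one in order.

    `captured_urls` and `url` are hypothetical names; adapt them to the real schema.
    """
    rows = conn.execute("SELECT url FROM captured_urls ORDER BY rowid").fetchall()
    pages = {}
    for (url,) in rows:
        pages[url] = fetch(url)      # e.g. a wrapper around requests.get
        time.sleep(delay_seconds)    # pause between requests
    return pages


# Demo with an in-memory database and a stand-in fetch function
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE captured_urls (url TEXT)")
conn.executemany(
    "INSERT INTO captured_urls (url) VALUES (?)",
    [("http://example.com/a",), ("http://example.com/b",)],
)
demo_pages = crawl_saved_urls(conn, fetch=lambda u: "<html>" + u + "</html>")
```

In a real run, `fetch` would be replaced by a function that performs the HTTP request, and `delay_seconds` would be set to a value the target tolerates.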

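Problems found while crawling:

If you only want the article content, there appears to be no limit on request frequency. If you also want the read count and like count, the response comes back empty once a certain frequency is exceeded. With an interval of 10 seconds between requests, crawling worked normally. At that rate only 360 articles can be fetched per hour, which is too slow to be practical.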
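The throughput figure follows directly from the interval. As a small illustration (the function name is made up for this example):

```python
def requests_per_hour(interval_seconds):
    """Number of requests that fit into one hour at a fixed interval."""
    return int(3600 // interval_seconds)


hourly = requests_per_hour(10)  # one request every 10 seconds
```

At a 10-second interval this gives the 360 requests per hour mentioned above.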

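Third-party ranking sites (青岛新列表)

If you only want to look at the data, check the daily ranking directly; it costs nothing. If you need to feed the data into your own system, these sites also provide an API.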

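Part 3: Project steps and basic principle

This site indexes most WeChat official account articles and is updated regularly. Testing showed that it is fairly friendly to crawlers.

The page layout is regular, and different official accounts are distinguished by the account ID in the link.

Pagination within an account is regular too: the account ID, plus a start offset that grows by 12 with each page.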
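The pagination rule can be written as a small URL builder. This is a sketch that mirrors the URL format used in the code later in this article; `example_account` is a placeholder, not a real account ID.

```python
PAGE_SIZE = 12  # the start offset grows by 12 per page
BASE_URL = "http://chuansong.me/account/"


def build_page_url(account, page_index):
    """Return the list-page URL for a zero-based page index."""
    return BASE_URL + str(account) + "?start=" + str(PAGE_SIZE * page_index)


# First three list pages of a placeholder account
urls = [build_page_url("example_account", i) for i in range(3)]
```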

　　Portal-Copy.png

So the overall approach follows directly from these rules.

Environment, required packages, and fetching a page

import requests

def get_one_page(url):
    # A User-Agent header is required; without it the site blocks the request
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
        response.encoding = response.apparent_encoding
        return response.text
    except requests.RequestException:
        # the caller receives a marker string instead of HTML
        return "request failed"

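Note that requests to the target site must include this header; otherwise access is refused outright.

Parsing the HTML with a regular expression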

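import re

def parse_one_page(html):
    # Captures (relative link, title, date) for each article on a list page.
    # The angle brackets in this pattern were lost in extraction; the
    # closing "<" after the last group is an assumed repair.
    pattern = re.compile('<a class="question_link" href="(.*?)".*?_blank">(.*?)</a>.*?"timestamp".*?">(.*?)<', re.S)
    items = re.findall(pattern, html)
    return items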
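To see what such a pattern extracts, it can be run against a small snippet. The HTML below is invented for illustration and only assumes the class names that appear in the pattern (`question_link`, `timestamp`); the real page markup may differ, and the pattern is a hedged reconstruction rather than the author's exact expression.

```python
import re

# Made-up sample markup; the real list page may be structured differently
SAMPLE_HTML = '''
<div class="feed_item_question">
  <a class="question_link" href="/n/1234567" target="_blank">Example title</a>
  <span class="timestamp" style="color: #999">2018-01-01</span>
</div>
'''

PATTERN = re.compile(
    r'<a class="question_link" href="(.*?)".*?_blank">(.*?)</a>'
    r'.*?"timestamp".*?">(.*?)<',
    re.S,  # let "." match newlines so one match can span several lines
)

items = PATTERN.findall(SAMPLE_HTML)
```

Each element of `items` is a tuple of relative link, title, and date, which is the shape the page-turning code consumes as `item[0]`, `item[1]`, and `item[2]`.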

Turning pages automatically

import random
import time

def main(offset, i):
    # offset is the account ID, i is the zero-based page index
    url = 'http://chuansong.me/account/' + str(offset) + '?start=' + str(12*i)
    print(url)
    wait = round(random.uniform(1, 2), 2)  # random delay between requests to avoid being blocked
    time.sleep(wait)
    html = get_one_page(url)
    for item in parse_one_page(html):
        info = 'http://chuansong.me' + item[0] + ',' + item[1] + ',' + item[2] + '\n'
        info = repr(info.replace('\n', ''))
        print(info)
        # info.strip('\"')   # does not remove the leading and trailing quotes
        # info = info[1:-1]  # does not remove the leading and trailing quotes
        # info.Trim("".ToCharArray())
        # info.TrimStart('\"').TrimEnd('\"')
        write_to_file(info, offset)

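Removing illegal characters from titles

Windows does not allow certain characters in file names, so they are stripped from the title with a regular expression.

title = re.sub('[\\\\/:*?\"<>|]', '', info.loc[indexs]['标题'])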
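As a standalone illustration, the same cleanup can be wrapped in a helper. This is a sketch; the function name is made up, and the character set is the list of characters Windows forbids in file names (`\ / : * ? " < > |`).

```python
import re

# Characters that Windows does not allow in file names
ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def sanitize_title(title):
    """Remove characters that cannot appear in a Windows file name."""
    return ILLEGAL_CHARS.sub("", title)


clean = sanitize_title('a/b:c*d?e"f<g>h|i\\j')
```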

Converting the HTML

Use pandas' read_csv function to read the crawled CSV file, then loop over the "链接" (link), "标题" (title), and "日期" (date) columns.
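The loop over the three columns can be sketched with the standard library alone. This is a stand-in for the pandas version, not the author's `get_url_info` helper; it assumes the CSV has no header row and holds link, title, and date in that order, as written by the crawling step.

```python
import csv
import io

COLUMNS = ["链接", "标题", "日期"]  # link, title, date


def iter_articles(csv_text):
    """Yield (link, title, date) tuples from header-less CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=COLUMNS)
    for row in reader:
        yield row["链接"], row["标题"], row["日期"]


# Made-up sample rows in the assumed layout
sample = (
    "http://chuansong.me/n/1,First title,2018-01-01\n"
    "http://chuansong.me/n/2,Second title,2018-01-02\n"
)
articles = list(iter_articles(sample))
```

A title that itself contains a comma would need to be quoted when the file is written, otherwise the columns shift.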

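import random
import re
import time

import eventlet
import pdfkit

def html_to_pdf(offset):
    # get_path() and get_url_info() are helpers defined elsewhere in the project
    wait = round(random.uniform(1, 2), 2)  # random delay to avoid being blocked
    time.sleep(wait)
    path = get_path(offset)
    path_wk = r'D:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'  # where wkhtmltopdf is installed
    config = pdfkit.configuration(wkhtmltopdf=path_wk)
    if path == "":
        print("This official account has not been crawled yet")
    else:
        info = get_url_info(offset)
        for indexs in info.index:
            url = info.loc[indexs]['链接']
            # strip characters that Windows forbids in file names
            title = re.sub('[\\\\/:*?\"<>|]', '', info.loc[indexs]['标题'])
            date = info.loc[indexs]['日期']
            wait = round(random.uniform(4, 5), 2)  # random delay to avoid being blocked
            time.sleep(wait)
            print(url)
            # intended to abandon a conversion that takes longer than 4 seconds
            with eventlet.Timeout(4, False):
                pdfkit.from_url(url, path + '\\' + date + '_' + title + '.pdf', configuration=config)
                print('Converted successfully!')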

Results: crawl output

　　结果1-copy.png

The official accounts crawled so far are stored in folders

　　结果2-copy.png

Contents of a folder

　　结果3-copy.png

Format of the crawled CSV content

Generated PDF results

　　结果4-copy.png

Problems encountered: Problem 1

for item in parse_one_page(html):
    info = 'http://chuansong.me' + item[0] + ',' + item[1] + ',' + item[2] + '\n'
    info = repr(info.replace('\n', ''))
    info = info.strip('\"')
    print(info)
    # info.strip('\"')   # does not remove the leading and trailing quotes
    # info = info[1:-1]  # does not remove the leading and trailing quotes
    # info.Trim("".ToCharArray())
    # info.TrimStart('\"').TrimEnd('\"')
    write_to_file(info, offset)

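Solution

The string ends up wrapped in quotation marks at both ends, and none of the commented-out attempts above removes them. repr() wraps the text in single quotes, and json.dumps then wraps the result in double quotes when it is written out.

The eventual fix is to add .strip('\'\"') in the code that writes the string, which removes both ' and " from the ends: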

with open(path, 'a', encoding='utf-8') as f:  # append mode; content is the record passed in from main()
    # the with block closes the file, so no explicit f.close() is needed
    f.write(json.dumps(content, ensure_ascii=False).strip('\'\"') + '\n')
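Where the two layers of quotes come from can be reproduced in a few lines. This is an illustrative sketch with a made-up record, not the author's data.

```python
import json

line = "http://chuansong.me/n/1,Example title,2018-01-01"  # made-up record

wrapped = repr(line)                               # adds single quotes at both ends
dumped = json.dumps(wrapped, ensure_ascii=False)   # adds double quotes around that
cleaned = dumped.strip('\'"')                      # strips both kinds from both ends
```

Skipping `repr()` and `json.dumps()` and writing `line` directly would avoid the quotes altogether; the strip call is the workaround used in this article.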

Problem 2

Calling wkhtmltopdf.exe to convert HTML to PDF raises an error.

Calling code

``` python

path_wk = 'D:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'

config = pdfkit.configuration(wkhtmltopdf = path_wk)

pdfkit.from_url(url, get_path(offset)+'\\taobao.pdf', configuration=config)

```

Error message

　　 OSError: No wkhtmltopdf executable found: "D:\Program Files\wkhtmltopdin\wkhtmltopdf.exe"

If this file exists please check that this process can read it. Otherwise please install wkhtmltopdf - https://github.com/JazzCore/python-pdfkit/wiki/Installing-wkhtmltopdf

Solution

　　 path_wk = r'D:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'

config = pdfkit.configuration(wkhtmltopdf = path_wk)

pdfkit.from_url(url, get_path(offset)+'\\taobao.pdf', configuration=config)

Or

　　 path_wk = 'D:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe'

config = pdfkit.configuration(wkhtmltopdf = path_wk)

pdfkit.from_url(url, get_path(offset)+'\\taobao.pdf', configuration=config)

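Cause

In an ordinary Python string literal, \b is the backspace escape, so the "\bin" part of the path is corrupted, which is why the error message shows "wkhtmltopdin". A raw string (r'...') or doubled backslashes keeps the backslash literal.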
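The effect of the escape can be checked directly. A short sketch using only the affected part of the path:

```python
plain = 'wkhtmltopdf\bin'      # \b is interpreted as a backspace character
raw = r'wkhtmltopdf\bin'       # raw string: backslash kept literally
doubled = 'wkhtmltopdf\\bin'   # escaped backslash: same result as the raw string

has_backspace = '\x08' in plain
same_path = raw == doubled
```

The plain literal is one character shorter than the other two, because the two characters `\b` collapse into a single backspace.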
