Article collection (crawling with Python; reference: AnyProxy proxy batch collection, implementation: anyproxy+js)

Published by Youcaiyun (优采云): 2022-02-16 22:08


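Crawling with Python

Reference articles

AnyProxy proxy for batch collection

Implementation: anyproxy + js

Implementation: anyproxy + java + webmagic

FiddlerCore

Implementation: packet-capture tool, Fiddler4

Packet capture and analysis across several accounts established the following workflow.

Steps:

1. Write a button-automation ("button wizard") script that automatically taps through to an official account's article list page on the phone, i.e. "View history messages";

2. Use Fiddler as a proxy to intercept the phone's requests and forward each URL to a local web page written in PHP;

3. Have the PHP page save the received URLs to a database;

4. Use Python to read the URLs from the database and crawl them in the normal way.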
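Step 4 can be sketched as follows. This is only an illustration under stated assumptions: the original does not say which database the PHP page writes to, so SQLite is used here, and the table name `article_urls`, its columns, and both helper functions are hypothetical.

```python
import sqlite3


def fetch_pending_urls(conn, limit=10):
    """Return up to `limit` (id, url) rows that have not been crawled yet."""
    return conn.execute(
        "SELECT id, url FROM article_urls WHERE crawled = 0 ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()


def mark_crawled(conn, row_id):
    """Flag one row as done so it is not fetched again."""
    conn.execute("UPDATE article_urls SET crawled = 1 WHERE id = ?", (row_id,))
    conn.commit()


# In-memory database standing in for the one filled by the PHP page (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article_urls ("
    "id INTEGER PRIMARY KEY, url TEXT UNIQUE, crawled INTEGER DEFAULT 0)"
)
# The UNIQUE constraint plus INSERT OR IGNORE drops URLs captured twice.
conn.executemany(
    "INSERT OR IGNORE INTO article_urls (url) VALUES (?)",
    [("https://example.com/a",), ("https://example.com/b",), ("https://example.com/a",)],
)
pending = fetch_pending_urls(conn)
print(pending)
```

Marking rows as crawled keeps the Python side restartable: if the crawler stops halfway, the next run picks up only the URLs that are still pending.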

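One problem turned up during crawling:

If you only want the article content, there seems to be no limit on access frequency. If you also want the read and like counts, the returned value becomes null once requests exceed a certain frequency. With a 10-second interval between requests the counts come back normally, but at that rate only 360 articles can be fetched per hour (3600 s / 10 s), which is too slow to be of practical use.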
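The throughput figure follows directly from the request interval. The small helper below (its names are hypothetical, introduced only for illustration) reproduces that arithmetic and estimates how long a larger crawl would take at the same pace.

```python
def max_requests_per_hour(interval_seconds):
    """Upper bound on sequential requests per hour for a fixed delay."""
    return int(3600 // interval_seconds)


def hours_needed(total_articles, interval_seconds):
    """Time in hours to fetch `total_articles` one after another."""
    return total_articles * interval_seconds / 3600


print(max_requests_per_hour(10))   # 360
print(hours_needed(10000, 10))     # about 27.8 hours for 10,000 articles
```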

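Qingbo / Newrank

If you only want to look at the data, you can read the daily rankings without paying anything. If you need to feed the data into your own system, they also provide an API.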

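Part 3: Project steps

Basic principle

The target site (chuansong.me) indexes most WeChat official account articles and is updated regularly. Testing showed that it is fairly friendly to crawlers.

The site's page layout is regular, and each official account is identified by the account name in the link.

The article list under an account also paginates regularly: the id in the URL increases by 12 for each page.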
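The pagination rule can be sketched as a small URL builder. The URL shape mirrors the one used later in this post; the function names, the constant names, and the account value `example_account` are assumptions made for the example.

```python
BASE_URL = 'http://chuansong.me/account/'
PAGE_SIZE = 12  # the offset grows by 12 per page, as observed above


def build_page_url(account, page):
    """URL of the zero-based `page` of an account's article list."""
    return f"{BASE_URL}{account}?start={PAGE_SIZE * page}"


def iter_page_urls(account, pages):
    """Yield the list-page URLs for the first `pages` pages."""
    for page in range(pages):
        yield build_page_url(account, page)


for url in iter_page_urls('example_account', 3):
    print(url)
```

Keeping the page size in one constant makes it easy to adjust if the site changes how many articles it shows per page.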

(Image: 传送门副本.png)

So the overall approach is roughly as follows.

Environment

Related packages

Fetching pages

    import requests

    def get_one_page(url):
        # A request header is required, otherwise the site blocks the crawler
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
            response.encoding = response.apparent_encoding
            return response.text
        except requests.RequestException:
            # Return an empty page on failure so the parser simply finds no items
            return ""

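Note that requests to the target site must include headers, otherwise access is refused outright.

Parsing the HTML with a regular expression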

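    import re

    def parse_one_page(html):
        # Captures (link, title, date) for each article entry.
        # As written, the trailing lazy group (.*?) has nothing after it, so it always
        # matches an empty string; anchor it on the element's closing tag before use.
        pattern = re.compile('.*?.*?<a class="question_link" href="(.*?)".*?_blank"(.*?)/a.*?"timestamp".*?">(.*?)', re.S)
        items = re.findall(pattern, html)
        return items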
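To show how a three-group pattern of this kind behaves, here is a sketch run against made-up markup. Both the sample HTML and the closing tags in the pattern (`</a>`, `</span>`) are assumptions for illustration, not the site's confirmed structure, so check them against a real page before relying on them.

```python
import re

# Hypothetical markup, loosely modelled on the class names in the pattern above.
SAMPLE_HTML = '''
<div class="feed_item">
  <a class="question_link" href="/n/1234567" target="_blank">First title</a>
  <span class="timestamp" style="color: #999">2018-01-01</span>
</div>
<div class="feed_item">
  <a class="question_link" href="/n/7654321" target="_blank">Second title</a>
  <span class="timestamp" style="color: #999">2018-01-02</span>
</div>
'''

# Each lazy group is terminated by a literal closing tag, so it cannot match empty.
PATTERN = re.compile(
    r'<a class="question_link" href="(.*?)".*?_blank">(.*?)</a>'
    r'.*?"timestamp".*?">(.*?)</span>',
    re.S,
)

items = PATTERN.findall(SAMPLE_HTML)
for link, title, date in items:
    print(link, title, date)
```

For anything beyond a quick script, an HTML parser is usually less fragile than a regular expression, because small changes in attribute order do not break it.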

Automatic pagination

    import random
    import time

    def main(offset, i):
        # offset is the account name in the URL; i is the zero-based page index
        url = 'http://chuansong.me/account/' + str(offset) + '?start=' + str(12*i)
        print(url)
        wait = round(random.uniform(1,2),2) # random delay between requests to avoid being blocked
        time.sleep(wait)
        html = get_one_page(url)
        for item in parse_one_page(html):
            info = 'http://chuansong.me'+item[0]+','+ item[1]+','+item[2]+'\n'
            info = repr(info.replace('\n', ''))
            print(info)
            #info.strip('\"') # this does not remove the leading and trailing quotes
            #info = info[1:-1] # this does not remove the leading and trailing quotes
            #info.Trim("".ToCharArray())
            #info.TrimStart('\"').TrimEnd('\"')
            write_to_file(info, offset)

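Removing illegal characters from titles

Windows does not allow certain characters in file names, so they are stripped with a regular expression. A raw string is used so that the backslash itself is also matched:

    title = re.sub(r'[\\/:*?"<>|]', '', info.loc[indexs]['标题'])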
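The same idea can be wrapped in a reusable helper. This is a minimal sketch; the function name `sanitize_filename` is hypothetical, and the trailing-dot and space cleanup is an extra precaution added here, since Windows also rejects names that end with either.

```python
import re

# Characters Windows forbids in file names: \ / : * ? " < > |
_ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def sanitize_filename(title, replacement=''):
    """Replace forbidden characters and trim trailing dots and spaces."""
    cleaned = _ILLEGAL_CHARS.sub(replacement, title)
    return cleaned.strip().rstrip('. ')


print(sanitize_filename('a/b:c*d?"e"<f>|g\\h'))  # abcdefgh
print(sanitize_filename(' What? Why. '))          # What Why
```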

Converting HTML to PDF

Use the pandas read_csv function to read the crawled CSV file, then loop over the 链接 (link), 标题 (title), and 日期 (date) columns.

    import os
    import random
    import re
    import time
    import eventlet
    import pdfkit

    def html_to_pdf(offset):
        wait = round(random.uniform(1,2),2) # random delay between requests to avoid being blocked
        time.sleep(wait)
        path = get_path(offset)
        path_wk = r'D:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe' # where wkhtmltopdf is installed
        config = pdfkit.configuration(wkhtmltopdf = path_wk)
        if path == "" :
            print("This official account has not been crawled yet")
        else:
            info = get_url_info(offset)
            for indexs in info.index:
                url = info.loc[indexs]['链接']
                title = re.sub(r'[\\/:*?"<>|]', '', info.loc[indexs]['标题'])
                date = info.loc[indexs]['日期']
                wait = round(random.uniform(4,5),2) # random delay between requests to avoid being blocked
                time.sleep(wait)
                print(url)
                with eventlet.Timeout(4,False):
                    # os.path.join replaces the '\' literal, which would escape its own closing quote
                    pdfkit.from_url(url, os.path.join(path, date + '_' + title + '.pdf'), configuration=config)
                    print('Conversion succeeded!')

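Results

Crawl results

(Image: result1-copy.png)

The crawled official accounts are stored in folders.

(Image: result2-copy.png)

Contents of a folder

(Image: result3-copy.png)

Format of the crawled CSV

Generated PDF results

(Image: result4-copy.png)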

Problems encountered

Problem 1

    for item in parse_one_page(html):
        info = 'http://chuansong.me'+item[0]+','+ item[1]+','+item[2]+'\n'
        info = repr(info.replace('\n', ''))
        info = info.strip('\"')
        print(info)
        #info.strip('\"') # this does not remove the leading and trailing quotes
        #info = info[1:-1] # this does not remove the leading and trailing quotes
        #info.Trim("".ToCharArray())
        #info.TrimStart('\"').TrimEnd('\"')
        write_to_file(info, offset)

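Solution

The string ends up wrapped in quotation marks, and none of the approaches in the commented-out lines above removed them. The single quotes come from repr(), and json.dumps() then adds double quotes when the line is written.

The fix that finally worked was to add .strip('\'\"') in the code that writes the string, which removes both ' and ":

    with open(path, 'a', encoding='utf-8') as f: # append mode; the with block closes the file
        f.write(json.dumps(content, ensure_ascii=False).strip('\'\"') + '\n')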
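The short demonstration below shows where each layer of quotes comes from and why stripping both characters at write time works. The sample line is made up for the example; only standard-library behavior is relied on.

```python
import json

line = 'http://chuansong.me/n/123,Some title,2018-01-01'  # hypothetical CSV row

as_repr = repr(line)                             # wrapped in single quotes
as_json = json.dumps(as_repr, ensure_ascii=False)  # wrapped again in double quotes

print(as_repr)   # 'http://chuansong.me/n/123,Some title,2018-01-01'
print(as_json)   # "'http://chuansong.me/n/123,Some title,2018-01-01'"

# strip() treats its argument as a set of characters, so both kinds go at once.
cleaned = as_json.strip('\'"')
print(cleaned == line)  # True
```

One caveat: strip() would also remove a quote that legitimately starts or ends the data. Writing the plain string directly, without repr() or json.dumps(), avoids adding the quotes in the first place.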

Problem 2

Calling wkhtmltopdf.exe to convert HTML to PDF raises an error.

Calling code

``` python

path_wk = 'D:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'

config = pdfkit.configuration(wkhtmltopdf = path_wk)

pdfkit.from_url(url, get_path(offset)+'\taobao.pdf', configuration=config)

```

Error message

　　 OSError: No wkhtmltopdf executable found: "D:\Program Files\wkhtmltopdin\wkhtmltopdf.exe"

If this file exists please check that this process can read it. Otherwise please install wkhtmltopdf - https://github.com/JazzCore/python-pdfkit/wiki/Installing-wkhtmltopdf

Solution

Use a raw string for the path:

    path_wk = r'D:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
    config = pdfkit.configuration(wkhtmltopdf = path_wk)
    # '\\taobao.pdf' rather than '\taobao.pdf', where '\t' would be read as a tab
    pdfkit.from_url(url, get_path(offset)+'\\taobao.pdf', configuration=config)

Or escape every backslash:

    path_wk = 'D:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe'
    config = pdfkit.configuration(wkhtmltopdf = path_wk)
    pdfkit.from_url(url, get_path(offset)+'\\taobao.pdf', configuration=config)

Cause

Your config path contains an ASCII Backspace, the \b in \bin, which pdfkit appears to be stripping out and converting D:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe to D:\Program Files\wkhtmltopdf\wkhtmltopdf.exe.
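The escape-sequence behavior is easy to check in isolation. The snippet below uses only a fragment of the path so that every backslash in it is either a valid escape or part of a raw string; the same mechanism turns the \t in '\taobao.pdf' into a tab.

```python
plain = 'wkhtmltopdf\bin'     # '\b' is parsed as one backspace character (0x08)
raw = r'wkhtmltopdf\bin'      # raw string: backslash and 'b' stay as two characters
escaped = 'wkhtmltopdf\\bin'  # doubled backslash: same result as the raw string

print('\x08' in plain)        # True
print(raw == escaped)         # True
print(len(plain), len(raw))   # 14 15

tab_name = '\taobao.pdf'      # '\t' is parsed as a tab
print(tab_name.startswith('\t'))  # True
```

Building paths with os.path.join or pathlib, or writing them with forward slashes, sidesteps the problem because no backslash literals are needed.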
