Blocking web scraping with PHP (common methods found online for blocking junk spiders that crawl a site and add server load)

优采云, published: 2021-09-23 13:18

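Summary

I recently noticed a large number of entries in the nginx logs from junk crawlers such as MJ12bot. They inflate the log volume and put extra load on the server. This post collects the various methods found online for blocking junk spiders from crawling a site, both as a record of my own setup and as a reference for other webmasters.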

nginx

Go to the conf directory under the nginx installation directory and save the following code as agent_deny.conf:

# Block scraping by tools such as Scrapy
if ($http_user_agent ~* (Scrapy|HttpClient)) {
    return 403;
}

# Block the listed UAs and requests with an empty UA
if ($http_user_agent ~* "FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$" ) {
    return 403;
}

# Block any request method other than GET, HEAD or POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}

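Then insert the following line into the server block of the site's configuration file (xxx.conf):

include agent_deny.conf;

Reload nginx afterwards (for example with nginx -s reload) so that the change takes effect.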

Apache

Edit the .htaccess file in the site's root directory and add one of the following two variants:

Variant (1):

RewriteEngine On
# The pattern is quoted because it contains spaces (e.g. "Indy Library")
RewriteCond %{HTTP_USER_AGENT} "(^$|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" [NC]
RewriteRule ^(.*)$ - [F]

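Variant (2):

# Assumption: the line that sets BADBOT is missing from the source; without it "Deny from env=BADBOT" never matches.
# This sketch reuses the UA list from variant (1).
SetEnvIfNoCase User-Agent "(^$|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" BADBOT
Order Allow,Deny
Allow from all
Deny from env=BADBOT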

PHP

Add the following code at the very top of the site's entry file, index.php:

// Get the UA string (the header may be missing from the request)
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';

// Store the malicious USER_AGENT fragments in an array
$now_ua = array('FeedDemon','BOT/0.1 (BOT for JCE)','CrawlDaddy','Java','Feedly','UniversalFeedParser','ApacheBench','Swiftbot','ZmEu','Indy Library','oBot','jaunty','YandexBot','AhrefsBot','MJ12bot','WinHttp','EasouSpider','HttpClient','Microsoft URL Control','YYSpider','Python-urllib','lightDeckReports Bot');

// Block an empty USER_AGENT: mainstream scraping programs such as dedecms send an empty USER_AGENT, and so do some SQL injection tools
if (!$ua) {
    header("Content-type: text/html; charset=utf-8");
    die('Please do not scrape this site.');
} else {
    foreach ($now_ua as $value) {
        // Case-insensitive substring match against the list; eregi() was removed in PHP 7, so stripos() is used instead
        if (stripos($ua, $value) !== false) {
            header("Content-type: text/html; charset=utf-8");
            die('Please do not scrape this site.');
        }
    }
}

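Testing the result

Simulate a crawl by the MJ12bot spider:

curl -I -A 'MJ12bot' https://www.yunloc.com

Simulate a request with an empty UA:

curl -I -A '' https://www.yunloc.com

Simulate a crawl by the Baidu spider:

curl -I -A 'Baiduspider' https://www.yunloc.com

The results are as follows: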

[root@jxonesys ~]# curl -I -A 'MJ12bot' https://www.yunloc.com
HTTP/1.1 403 Forbidden
Server: nginx
Date: Thu, 22 Aug 2019 07:58:33 GMT
Content-Type: text/html
Content-Length: 146
Connection: keep-alive

[root@jxonesys ~]# curl -I -A '' https://www.yunloc.com
HTTP/1.1 403 Forbidden
Server: nginx
Date: Thu, 22 Aug 2019 07:55:35 GMT
Content-Type: text/html
Content-Length: 146
Connection: keep-alive

[root@jxonesys ~]# curl -I -A 'Baiduspider' https://www.yunloc.com
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 22 Aug 2019 08:03:06 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Vary: Accept-Encoding
X-Powered-By: PHP/7.2.19
Set-Cookie: wp_xh_session_f2266ad63f05d6d9f9e0134d622e4ca4=43b6b7d802097b30e609d872ad6920df%7C%7C1566633786%7C%7C1566630186%7C%7C6ce3059d6fe1ea6acfb27796762ea655; expires=Sat, 24-Aug-2019 08:03:06 GMT; Max-Age=172800; path=/
Strict-Transport-Security: max-age=63072000; includeSubdomains; preload
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block

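As the output shows, the MJ12bot spider and the empty UA both receive a 403 Forbidden response, while the Baidu spider gets a successful 200, which shows that the rules are in effect.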
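The three curl checks can also be scripted so they are easy to repeat after every rule change. The following is a minimal Python sketch, not part of the original setup: the function names `build_request`, `probe` and `run_checks` are hypothetical, and it assumes the server answers HEAD requests the same way it answers `curl -I`. One difference to keep in mind: `curl -A ''` drops the User-Agent header entirely, whereas this sketch sends the header with an empty value.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Build a HEAD request (like `curl -I`) that carries the given User-Agent."""
    return urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )


def probe(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for this User-Agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still available.
        return err.code


def run_checks(url, expectations, probe_fn=probe):
    """Return (user_agent, expected, actual) for every UA whose status differs."""
    failures = []
    for user_agent, expected in expectations.items():
        actual = probe_fn(url, user_agent)
        if actual != expected:
            failures.append((user_agent, expected, actual))
    return failures


# Expected statuses, taken from the curl output above.
EXPECTATIONS = {"MJ12bot": 403, "": 403, "Baiduspider": 200}

# Example call (performs real network requests, so it is left commented out):
# print(run_checks("https://www.yunloc.com", EXPECTATIONS))
```

An empty list from `run_checks` would mean every UA got the expected status. Note that the default urllib UA contains "Python-urllib", which is itself on the block list, so the UA is always set explicitly here.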

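You can also analyze the site's access log to find spider names you have not seen before. Once you have looked one up and confirmed that it is unwanted, add it to the block list in the configuration above so that it is blocked as well.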
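One way to do that log analysis is to count requests per UA and look at the most frequent ones. The sketch below is only an illustration: it assumes the log uses nginx's default "combined" format, where the UA is the last quoted field, and the sample lines (including the IP addresses) are made up for the demo. The name `count_user_agents` is hypothetical.

```python
import re
from collections import Counter

# Tail of a "combined" log line: "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter mapping each User-Agent string to its request count."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:  # lines in another format are skipped
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines; nginx logs a missing UA as "-".
SAMPLE_LINES = [
    '203.0.113.5 - - [22/Aug/2019:07:58:33 +0000] "HEAD / HTTP/1.1" 403 0 "-" "MJ12bot"',
    '203.0.113.5 - - [22/Aug/2019:07:58:40 +0000] "GET /about HTTP/1.1" 403 146 "-" "MJ12bot"',
    '198.51.100.7 - - [22/Aug/2019:07:55:35 +0000] "HEAD / HTTP/1.1" 403 0 "-" "-"',
    '192.0.2.9 - - [22/Aug/2019:08:03:06 +0000] "GET / HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"',
]

# Print the UAs from most to least frequent.
for user_agent, hits in count_user_agents(SAMPLE_LINES).most_common():
    print(f"{hits:6d}  {user_agent}")
```

For a real log, pass an open file object instead of `SAMPLE_LINES`; the function only needs an iterable of lines.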

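UA list

Below is a list of junk UAs commonly seen on the web. It is for reference only, and additions are welcome.

FeedDemon // content scraping
BOT/0.1 (BOT for JCE) // SQL injection
CrawlDaddy // SQL injection
Java // content scraping
Jullo // content scraping
Feedly // content scraping
UniversalFeedParser // content scraping
ApacheBench // CC attack tool
Swiftbot // useless crawler
YandexBot // useless crawler
AhrefsBot // useless crawler
YisouSpider // useless crawler (since acquired by UC Shenma Search; this spider can be allowed)
MJ12bot // useless crawler
ZmEu // phpMyAdmin vulnerability scanning
WinHttp // scraping, CC attacks
EasouSpider // useless crawler
HttpClient // TCP attacks
Microsoft URL Control // scanning
YYSpider // useless crawler
jaunty // WordPress brute-force scanner
oBot // useless crawler
Python-urllib // content scraping
Indy Library // scanning
FlightDeckReports Bot // useless crawler
Linguee Bot // useless crawler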
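All three configurations above apply the same logic to this list: a case-insensitive substring match, plus a check for an empty UA. The sketch below shows that logic in Python, mainly as a quick way to try a new entry against sample UA strings before adding it to the server config. `BAD_AGENTS` holds only a few entries from the list, and the names `build_pattern` and `is_blocked` are hypothetical.

```python
import re

# A few entries from the list above; extend as needed.
BAD_AGENTS = [
    "FeedDemon",
    "BOT/0.1 (BOT for JCE)",
    "MJ12bot",
    "AhrefsBot",
    "Python-urllib",
    "Indy Library",
]


def build_pattern(names):
    """Compile one case-insensitive pattern matching any listed name or an empty UA."""
    # re.escape keeps characters such as "(", ")" and "." literal.
    escaped = "|".join(re.escape(name) for name in names)
    return re.compile(rf"^$|{escaped}", re.IGNORECASE)


def is_blocked(user_agent, pattern):
    """Return True if the UA is missing, empty, or contains a listed name."""
    return pattern.search(user_agent or "") is not None


PATTERN = build_pattern(BAD_AGENTS)

for sample in ["MJ12bot", "", "Baiduspider"]:
    print(repr(sample), is_blocked(sample, PATTERN))
```

Escaping each name is a deliberate choice here: an entry like "BOT/0.1 (BOT for JCE)" would otherwise be read as a regular expression with a group and a wildcard rather than as literal text.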
