On the method a reader shared for blocking IP ranges during a crawler attack: the article at https://www.shoushai.com/p/983 covers two crawlers. Facebook's crawler can indeed be blocked with iptables, but Amazonbot crawls from far too many IPs, scattered across different ranges, so you can never block them all.
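For reference, an iptables range block looks like the sketch below. The CIDR is only an example, not an authoritative Facebook list; substitute the ranges you actually see in your access logs.

# Drop all traffic from one crawler IP range (example CIDR -- check
# your own logs for the ranges that are actually hitting you)
iptables -I INPUT -s 173.252.64.0/18 -j DROP
# Rules are lost on reboot unless saved, e.g. on Debian/Ubuntu with
# the iptables-persistent package installed:
iptables-save > /etc/iptables/rules.v4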
So for now Amazon's crawler can only be handled through robots.txt, and it can take up to 24 hours for the bot to pick up the updated file. All you can do is block everything else that can be blocked, wait for Amazon's side to refresh, hold out under the load, and hope it stops soon.
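While waiting for robots.txt to take effect, you can also refuse the bot at the web server by User-Agent. A minimal sketch for nginx (assuming you run nginx; place it inside the relevant server block):

# Reject Amazonbot immediately instead of waiting ~24h for it to
# re-read robots.txt; ~* makes the regex match case-insensitive
if ($http_user_agent ~* "Amazonbot") {
    return 403;
}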
One more point worth making: Facebook's crawler can also be blocked via robots.txt. Facebook officially states that it honors the file, and its crawler's User-Agent is meta-externalagent. So I recommend blocking every crawler you don't need.
WordPress sites can set up robots.txt following the version a reader shared here: https://www.shoushai.com/p/985
Non-WordPress sites can use this generic version (WP sites can use it too):
User-agent: GPTBot
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: YisouSpider
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: SemrushBot-SA
Disallow: /
User-agent: SemrushBot-BA
Disallow: /
User-agent: SemrushBot-SI
Disallow: /
User-agent: SemrushBot-SWA
Disallow: /
User-agent: SemrushBot-CT
Disallow: /
User-agent: SemrushBot-BM
Disallow: /
User-agent: SemrushBot-SEOAB
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: DotBot
Disallow: /
User-agent: Uptimebot
Disallow: /
User-agent: MegaIndex.ru
Disallow: /
User-agent: ZoominfoBot
Disallow: /
User-agent: Mail.Ru
Disallow: /
User-agent: BLEXBot
Disallow: /
User-agent: ExtLinksBot
Disallow: /
User-agent: aiHitBot
Disallow: /
User-agent: Researchscan
Disallow: /
User-agent: DnyzBot
Disallow: /
User-agent: spbot
Disallow: /
User-agent: YandexBot
Disallow: /
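Whichever version you use, the file has to sit at the site root so crawlers can fetch it at /robots.txt. A quick way to verify it is being served (example.com stands in for your own domain):

curl -s https://example.com/robots.txt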
This is simply the WP version with the WordPress-specific rules removed. Of course, if you insist, a non-WP site can use the WP version as-is: you don't have those directories anyway, so listing them does no harm.
This article is a reader submission and does not represent the views of Shoushai. If reposting, please credit the source: https://www.shoushai.com/p/990