By: KenDB3 to kk4qbn on Thu May 14 2015 07:34 pm
I had to specifically block Googlebot, it was a constant onslaught of
connections every other minute for days on end. Does anyone know why
Googlebot just keeps crawling the FTP site endlessly?
Yeah, I looked into it, and it was because of the random sequence appended by the index generation. The index has since been updated to not do that anymore, but Google will still try to crawl every random URL it has cached for a very long time (it's been many months since the fix, and my VPS is still getting over 100 queries per minute from Googlebot).
So, I hate to dredge old stuff up, but I noticed something today. I have had Googlebot blocked for quite some time now, but it really doesn't want to give up. So I figured, why not try adding a ROBOTS.TXT to the FTP to help the situation? Except I can't add that file to the root directory; I get a 553 Invalid directory error. I'm guessing that's because the Index is dynamically generated. I figured I could add a rule like "Disallow: /*?" since Googlebot kept trying these random sequences that always followed a question mark.
Re: Googlebot (one more time... sorry)
By: KenDB3 to Deuce on Fri Jun 26 2015 01:06 am
So, I hate to dredge old stuff up, but I noticed something today. I have had Googlebot blocked for quite some time now, but it really doesn't want to give up. So I figured, why not try adding a ROBOTS.TXT to the FTP to help the situation? Except I can't add that file to the root directory; I get a 553 Invalid directory error. I'm guessing that's because the Index is dynamically generated. I figured I could add a rule like "Disallow: /*?" since Googlebot kept trying these random sequences that always followed a question mark.
You would do this using ctrl/ftpalias.cfg and a line something like this:
  robots.txt  /synchronet/sbbs/ctrl/robots.txt  Don't index my site please
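If Googlebot does honor a robots.txt served that way, the aliased file itself could be as small as the rule KenDB3 suggested. A minimal sketch (the "*" wildcard is a Googlebot extension to the basic robots.txt syntax, and "Disallow: /*?" matches any URL that contains a question mark):

  User-agent: Googlebot
  Disallow: /*?

A blanket "Disallow: /" under "User-agent: *" would keep every well-behaved crawler out of the listing entirely, if that's the goal.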
Re: Re: Welcome back
By: Deuce to KenDB3 on Sat May 23 2015 08:52 pm
By: KenDB3 to kk4qbn on Thu May 14 2015 07:34 pm
I had to specifically block Googlebot, it was a constant onslaught of
connections every other minute for days on end. Does anyone know why
Googlebot just keeps crawling the FTP site endlessly?
Yeah, I looked into it, and it was because of the random sequence appended by the index generation. The index has since been updated to not do that anymore, but Google will still try to crawl every random URL it has cached for a very long time (it's been many months since the fix, and my VPS is still getting over 100 queries per minute from Googlebot).
So, I hate to dredge old stuff up, but I noticed something today. I have had Googlebot blocked for quite some time now, but it really doesn't want to give up. So I figured, why not try adding a ROBOTS.TXT to the FTP to help the situation? Except I can't add that file to the root directory; I get a 553 Invalid directory error. I'm guessing that's because the Index is dynamically generated.
However, I noticed something else while messing around. The File Libraries link from the HTTP page still appends the random sequence of text after the 00index.html (ex: 00index.html?$ibd58mvk). But if I start navigating the FTP from there, it drops the random characters immediately. If I go back to the HTTP page, and click the File Libraries link anew, I get a new random sequence.
I double-checked this behaviour on vert.synchro.net and nix.synchro.net:7070; they have the same thing happening. It is probably what is fueling the never-ending Googlebot siege.
I wasn't sure if anyone else noticed this, and I figured I would pass the word along, since if it tends to bug me, it probably bugs other people too, lol.
So, I hate to dredge old stuff up, but I noticed something today. I
have had Googlebot blocked for quite some time now, but it really
doesn't want to give up. So I figured, why not try adding a
ROBOTS.TXT to the FTP to help the situation? Except I can't add
that file to the root directory; I get a 553 Invalid directory
error. I'm guessing that's because the Index is dynamically
generated.
Using the ctrl/ftpalias.cfg file, you actually *can* put a file in the FTP 'root' directory. I have no evidence that Google will adhere to a robots.txt for FTP, but someone recently said they did, so it might be worth a try.
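To spell out the format from the example earlier in the thread: each ftpalias.cfg line is just the alias as it should appear to FTP clients, whitespace, the actual path on disk, and an optional free-form description. The path below is only the one from that example; point it at wherever the file really lives:

  robots.txt  /synchronet/sbbs/ctrl/robots.txt  Don't index my site please

With that in place, a crawler asking the FTP server for /robots.txt should get the real file even though the root listing itself is generated on the fly.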
However, I noticed something else while messing around. The File
Libraries link from the HTTP page still appends the random sequence of
text after the 00index.html (ex: 00index.html?$ibd58mvk). But if I
start navigating the FTP from there, it drops the random characters
immediately. If I go back to the HTTP page, and click the File
Libraries link anew, I get a new random sequence.
It's a timestamp, not random.
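For illustration only (this isn't the actual Synchronet web UI code, just the general pattern): a cache-busting suffix derived from the current time looks random but changes on every page load, which is why each click on File Libraries produces a different-looking URL:

  // Hypothetical sketch of a time-based "cache-buster" suffix.
  // Date.now() changes every millisecond, so the base-36 string is
  // different on each page load, and a crawler that remembers old
  // URLs keeps seeing "new" ones to fetch.
  var suffix = '$' + Date.now().toString(36);
  var link = '/00index.html?' + suffix;
  console.log(link);  // prints something like "/00index.html?$..." with a new suffix every run

That would also explain KenDB3's observation that the suffix disappears once you start navigating: presumably the generated index pages themselves link to plain 00index.html without any query string.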
I double-checked this behaviour on vert.synchro.net and
nix.synchro.net:7070; they have the same thing happening. It is
probably what is fueling the never-ending Googlebot siege.
It might explain a never-ending FTP 'get' of the root 00index.html file, but it would not explain never-ending FTP 'gets' of the other 00index.html files or any actual linked files.
I wasn't sure if anyone else noticed this, and I figured I would pass
the word along, since if it tends to bug me, it probably bugs other
people too, lol.
We can probably eliminate the initial timestamp from the Web UI. I'll look into it.