• Googlebot (one more time... sorry)

    From KenDB3@VERT/KD3NET to Deuce on Friday, June 26, 2015 01:06:11
    Re: Re: Welcome back
    By: Deuce to KenDB3 on Sat May 23 2015 08:52 pm

    By: KenDB3 to kk4qbn on Thu May 14 2015 07:34 pm

    I had to specifically block Googlebot; it was a constant onslaught of
    connections every other minute for days on end. Does anyone know why
    Googlebot just keeps crawling the FTP site endlessly?

    Yeah, I looked into it, and it was because of the random sequence appended by the index generation. The index generation has since been updated to stop doing that, but Google will still try to crawl every random URL it has cached for a very long time (it's been many months since the fix, and my VPS is still getting over 100 queries per minute from Googlebot).

    So, I hate to dredge old stuff up, but I noticed something today. I have had Googlebot blocked for quite some time now, but it really doesn't want to give up. So I figured, why not try adding a ROBOTS.TXT to the FTP site to help the situation? Except I can't add that file to the root directory; I get a 553 Invalid directory error. I'm guessing that's because the index is dynamically generated. I figured I could add a rule like "Disallow: /*?", since Googlebot kept trying these random sequences that always followed a question mark.
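
    (For reference, a minimal robots.txt using that rule would look something like the following; a sketch, noting that the "*" wildcard in Disallow is a crawler extension honored by Googlebot, not part of the original robots.txt standard:)

        User-agent: Googlebot
        Disallow: /*?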

    However, I noticed something else while messing around. The File Libraries link from the HTTP page still appends the random sequence of text after the 00index.html (ex: 00index.html?$ibd58mvk). But if I start navigating the FTP from there, it drops the random characters immediately. If I go back to the HTTP page, and click the File Libraries link anew, I get a new random sequence. I double-checked this behaviour on vert.synchro.net and nix.synchro.net:7070; they have the same thing happening. It is probably what is fueling the never-ending Googlebot siege.

    I wasn't sure if anyone else noticed this, and I figured I would pass the word along, since if it bugs me, it probably bugs other people too, lol.

    ~KenDB3

    ---
    ■ Synchronet ■ KD3net-Rhode Island's only BBS about nothing. http://bbs.kd3.us
  • From Deuce@VERT/SYNCNIX to KenDB3 on Thursday, June 25, 2015 23:53:06
    Re: Googlebot (one more time... sorry)
    By: KenDB3 to Deuce on Fri Jun 26 2015 01:06 am

    So, I hate to dredge old stuff up, but I noticed something today. I have had Googlebot blocked for quite some time now, but it really doesn't want to give up. So I figured, why not try adding a ROBOTS.TXT to the FTP site to help the situation? Except I can't add that file to the root directory; I get a 553 Invalid directory error. I'm guessing that's because the index is dynamically generated. I figured I could add a rule like "Disallow: /*?", since Googlebot kept trying these random sequences that always followed a question mark.

    You would do this using ctrl/ftpalias.cfg and a line something like this:

        robots.txt  /synchronet/sbbs/ctrl/robots.txt  Don't index my site please
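
    (Assuming the stock ftpalias.cfg format, each line maps a virtual filename to a real path, with anything after the path serving as the file's description in directory listings. A quick way to check that the alias took effect, using any command-line FTP client against a hypothetical hostname:)

        ftp bbs.example.com        (log in, e.g. as anonymous/guest)
        ftp> get robots.txt
        ftp> quit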

    ---
    http://DuckDuckGo.com/ - a better search engine that respects your privacy.
    Mro is an idiot. Please ignore him; we keep hoping he'll go away.
    ■ Synchronet ■ My Brand-New BBS (All the cool SysOps run STOCK!)
  • From KenDB3@VERT/KD3NET to Deuce on Friday, June 26, 2015 07:16:12
    Re: Googlebot (one more time... sorry)
    By: KenDB3 to Deuce on Fri Jun 26 2015 01:06 am

    So, I hate to dredge old stuff up, but I noticed something today. I have had Googlebot blocked for quite some time now, but it really doesn't want to give up. So I figured, why not try adding a ROBOTS.TXT to the FTP site to help the situation? Except I can't add that file to the root directory; I get a 553 Invalid directory error. I'm guessing that's because the index is dynamically generated. I figured I could add a rule like "Disallow: /*?", since Googlebot kept trying these random sequences that always followed a question mark.

    You would do this using ctrl/ftpalias.cfg and a line something like this:

        robots.txt  /synchronet/sbbs/ctrl/robots.txt  Don't index my site please

    Thanks man! I added that line and tied it back to the robots.txt I put in my /web/root/.

    That was a lot easier than I anticipated, lol.

    ~KenDB3

    ---
    ■ Synchronet ■ KD3net-Rhode Island's only BBS about nothing. http://bbs.kd3.us
  • From Digital Man@VERT to KenDB3 on Friday, July 03, 2015 15:02:48
    Re: Googlebot (one more time... sorry)
    By: KenDB3 to Deuce on Fri Jun 26 2015 01:06 am

    Re: Re: Welcome back
    By: Deuce to KenDB3 on Sat May 23 2015 08:52 pm

    By: KenDB3 to kk4qbn on Thu May 14 2015 07:34 pm

    I had to specifically block Googlebot; it was a constant onslaught of
    connections every other minute for days on end. Does anyone know why
    Googlebot just keeps crawling the FTP site endlessly?

    Yeah, I looked into it, and it was because of the random sequence appended by the index generation. The index generation has since been updated to stop doing that, but Google will still try to crawl every random URL it has cached for a very long time (it's been many months since the fix, and my VPS is still getting over 100 queries per minute from Googlebot).

    So, I hate to dredge old stuff up, but I noticed something today. I have had Googlebot blocked for quite some time now, but it really doesn't want to give up. So I figured, why not try adding a ROBOTS.TXT to the FTP site to help the situation? Except I can't add that file to the root directory; I get a 553 Invalid directory error. I'm guessing that's because the index is dynamically generated.

    Using the ctrl/ftpalias.cfg file, you actually *can* put a file in the FTP 'root' directory. I have no evidence that Google will adhere to a robots.txt for FTP, but someone recently said they did, so it might be worth a try.

    However, I noticed something else while messing around. The File Libraries link from the HTTP page still appends the random sequence of text after the 00index.html (ex: 00index.html?$ibd58mvk). But if I start navigating the FTP from there, it drops the random characters immediately. If I go back to the HTTP page, and click the File Libraries link anew, I get a new random sequence.

    It's a timestamp, not random.
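
    (A plausible sketch of the mechanism, assuming it is the usual cache-busting trick of appending the current time in base 36; notably, "ibd58mvk" read as a base-36 number works out to a millisecond timestamp on June 26, 2015, right around when the quoted message was posted. This is an illustration, not taken from the Synchronet source:)

        // Hypothetical sketch: a cache-busting suffix built from
        // milliseconds-since-epoch rendered in base 36.
        const suffix: string = "$" + Date.now().toString(36);
        const url: string = "00index.html?" + suffix;
        console.log(url); // e.g. "00index.html?$ibd58mvk" in late June 2015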

    I double-checked this behaviour on vert.synchro.net and nix.synchro.net:7070; they have the same thing happening. It is probably what is fueling the never-ending Googlebot siege.

    It might explain a never-ending FTP 'get' of the root 00index.html file, but it
    would not explain never-ending FTP 'gets' of the other 00index.html files or any actual linked files.

    I wasn't sure if anyone else noticed this, and I figured I would pass the word along, since if it bugs me, it probably bugs other people too, lol.

    We can probably eliminate the initial timestamp from the Web UI. I'll look into
    it.

    digital man

    Synchronet "Real Fact" #38:
    Synchronet first supported Windows NT v6.x (a.k.a. Vista/Win7) w/v3.14a (2006).

    Norco, CA WX: 86.8°F, 35.0% humidity, 13 mph SE wind, 0.00 inches rain/24hrs

    ---
    ■ Synchronet ■ Vertrauen ■ Home of Synchronet ■ telnet://vert.synchro.net
  • From KenDB3@VERT/KD3NET to Digital Man on Saturday, July 04, 2015 13:11:01
    Re: Googlebot (one more time... sorry)
    By: Digital Man to KenDB3 on Fri Jul 03 2015 03:02 pm

    So, I hate to dredge old stuff up, but I noticed something today. I
    have had Googlebot blocked for quite some time now, but it really
    doesn't want to give up. So I figured, why not try adding a
    ROBOTS.TXT to the FTP site to help the situation? Except I can't
    add that file to the root directory; I get a 553 Invalid directory
    error. I'm guessing that's because the index is dynamically
    generated.

    Using the ctrl/ftpalias.cfg file, you actually *can* put a file in the FTP 'root' directory. I have no evidence that Google will adhere to a robots.txt for FTP, but someone recently said they did, so it might be worth a try.

    Thanks DM! Deuce pointed me in that direction too, and right around the time he did, I stumbled on it in the docs, lol. I tied it to the robots.txt I have in my web root, and added some lines to get the bots to ignore anything with a "?" in it, and may end up adding the URL-encoded equivalent, "%3F".
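
    (For reference, those lines would presumably look something like the following; a sketch, since the actual file isn't shown here:)

        User-agent: *
        Disallow: /*?
        Disallow: /*%3F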

    However, I noticed something else while messing around. The File
    Libraries link from the HTTP page still appends the random sequence of
    text after the 00index.html (ex: 00index.html?$ibd58mvk). But if I
    start navigating the FTP from there, it drops the random characters
    immediately. If I go back to the HTTP page, and click the File
    Libraries link anew, I get a new random sequence.

    It's a timestamp, not random.

    That makes a ton of sense, actually, though I probably never would have guessed it on my own.

    I double-checked this behaviour on vert.synchro.net and
    nix.synchro.net:7070; they have the same thing happening. It is
    probably what is fueling the never-ending Googlebot siege.

    It might explain a never-ending FTP 'get' of the root 00index.html file, but it would not explain never-ending FTP 'gets' of the other 00index.html files or any actual linked files.

    I wasn't sure if anyone else noticed this, and I figured I would pass
    the word along, since if it bugs me, it probably bugs other people
    too, lol.

    We can probably eliminate the initial timestamp from the Web UI. I'll look into it.

    Coolness!

    On a side note, I started playing around with something Google has that they call the Search Console. You can prove you own/control a domain by adding a file to the root directory or adding a specific CNAME DNS record. Once you do, you can try to fine-tune Google's crawling of your site. I figured I would play with it a bit to see what's happening. The one good thing is you can slow down the crawl rate, but the slowest it goes is one request every 500 seconds.
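
    (As a hedged illustration of the DNS option: the verification record is a CNAME with a Google-supplied label and target. In zone-file syntax, with entirely made-up placeholder names, it would look something like:)

        ; hypothetical Search Console verification record (placeholder names)
        google1234abcd  IN  CNAME  gv-abc123xyz.dv.googlehosted.com.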

    I had to add the HTTP site and FTP site separately, and confirm ownership of each individually, which was a bit odd. But one thing I noticed is that the dashboard for the HTTP site lets me see what Googlebot sees for the robots.txt, while the dashboard for the FTP site doesn't show me that at all. It *looks* like it should be seeing the robots.txt, but it doesn't give me any warm and fuzzies that it actually *is* seeing it. I'm starting to doubt it pays attention to the FTP, which is what I wanted to know before I start blocking stuff again.

    Anywho, just figured I would share what I was seeing. Hope everyone (in the US) has a great 4th of July! Enjoy the holiday!

    ~KenDB3

    ---
    ■ Synchronet ■ KD3net-Rhode Island's only BBS about nothing. http://bbs.kd3.us