Robots.txt

Posted on Tuesday, July 9th, 2013


 

[Screenshot: robots-txt]

I’ve been learning new things about robots.txt today. At my previous job I wasn’t in charge of such things, even though I really should have been. I’ve known its basic functionality but never had to do much with it.

Last night I was Googling to try to find examples of websites that had recovered from the Panda update. I didn’t find many examples of sites that had, but I came across something that mentioned Googling your own site to find out how many results have been omitted because Google considers them too similar.

If you type site:yoursite.com into Google, you’ll see what Google has indexed for your site. Then click to the end of the results and you’ll see something like this:

[Screenshot: results-omitted]

In this example, Google considers only 519 of the site’s pages to be the “most relevant” and ignores the rest.

Now, the main thing that led me to thinking about robots.txt was that when I looked through these results, I noticed that some old subdomains were still being indexed. We have newsite.cruisemagic.com and newtest.cruisemagic.com, which were used for developing new website designs in the past. We had installed WordPress on the subdomain so that we could test everything, and then copied it all back over to the regular domain. The problem is that everything at www.cruisemagic.com is now duplicated on newsite.cruisemagic.com.

I could not just wipe out newsite.cruisemagic.com, because it’s currently the location of our WordPress installation (which is another issue entirely). So the solution is to use robots.txt to tell Google not to index those old subdomains.

I did some research and found out that you need to put a separate robots.txt in each subdomain’s root. So I made one for each that tells Google not to index the entire thing:

User-agent: *
Disallow: /
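
If you want to double-check that a rule like this really blocks everything, here’s a quick sketch using Python 3’s built-in robotparser module. The subdomain is the one from this post, and the /any-page/ path is just made up for illustration:

from urllib.robotparser import RobotFileParser

# Point the parser at the subdomain's robots.txt
rp = RobotFileParser()
rp.set_url("http://newsite.cruisemagic.com/robots.txt")
rp.read()

# With "Disallow: /" in place, nothing on the subdomain should be crawlable
print(rp.can_fetch("*", "http://newsite.cruisemagic.com/"))           # expect False
print(rp.can_fetch("*", "http://newsite.cruisemagic.com/any-page/"))  # expect False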

While I was at it, I used the same technique to tell Google to stop looking at some outdated content on the main website. When I first started working on this site, it was using a WordPress plugin that translated the website into a variety of different languages. There was a block of little flags in the footer; you could click the flag for your country/language, and it would use Google Translate to translate the page. Well, that plugin suddenly stopped functioning, and we started getting thousands of crawl errors in Webmaster Tools that looked like this:

URL                                             Response code   Detected
cs/destinations/alaska/                         503             5/11/13
el/destinations/europe/                         503             5/10/13
zh-TW/destinations/caribbean/                   503             5/11/13
da/cruise-lines/royal-caribbean-cruise-line/    503             5/11/13
id/                                             503             5/11/13

503 being the response code indicating that the service is unavailable. All I can figure is that the plugin stopped talking to Google Translate correctly. Each two-letter folder is the code for a language (cs is Czech, el is Greek, zh-TW is Traditional Chinese, da is Danish, id is Indonesian). I disabled the plugin and the errors declined slowly, but got stuck at about 980. I really don’t want 980 crawl errors every day, especially for content that no longer exists, and I don’t know how Google is even getting directed to that stuff now, since nothing points to it.

I downloaded the entire table of errors and sorted them by URL so that I could pull out each of those two-letter codes. While doing that I also found some old URL structures that the previous webmaster used for a while and then discarded. I ended up with a list of around 20 folders that Google should no longer be indexing, and added them all to the robots.txt file.
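
For reference, the new entries look roughly like this (showing just the handful of language folders from the error report above; the real file lists all twenty or so, plus the old URL structures):

User-agent: *
Disallow: /cs/
Disallow: /da/
Disallow: /el/
Disallow: /id/
Disallow: /zh-TW/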

Now I have to wait for Google to download robots.txt again, and then I’ll be able to see the result.

Update 7/11/13

Google has seen the new robots.txt but newsite.cruisemagic.com is still indexed. So I did more research and found this, which says that just disallowing via robots.txt isn’t enough. According to that, you have to add your subdomain in Webmaster Tools as a totally new site. Then you have to submit a removal request. And to keep it removed, you still have to disallow crawling in robots.txt, like I already did.

Update 7/12/13

Some success! newsite.cruisemagic.com has been removed from Google. When I opened Webmaster Tools there was a scary-looking message for me:

[Screenshot: newsite-removal]

“Severe health issues” … yeah I suppose, since I asked to have the whole thing removed. *eyeroll*

And searching site:cruisemagic.com shows that the subdomain is gone, which is what I wanted.

Those crawl errors related to the old language translation are still happening, though, so I’ll have to think about that further.

Update 7/29/13

Very interesting. As a result of removing duplication from Google’s index, we’ve now seen a huge jump in the number of pages that are indexed.

[Screenshot: google index jump]

Update 8/13/13

Yes! The crawl errors from that defunct plugin are now gone. They dropped a little, and a little more, and now they’re down to 0. Finally!

[Screenshot: crawl-errors-gone]

Update 8/20/16

Sorry that the screenshots are no longer visible. That was a result of lots of website updates and I can no longer find the files.




Posted in SEO.