web crawlers – Robots.txt for a multilanguage site where root is redirected

I have a site which offers two languages, English and Spanish. When the user navigates to the home page, let’s say www.site.com, the page redirects to /es if the browser language is Spanish, or to /en otherwise.

At the moment the robots.txt I have is:

User-agent: *
Allow: /

Sitemap: https://www.site.com/sitemap_index.xml

because I’m defining all hreflang alternate URLs in sitemap_languages.xml, and all URLs are also listed in sitemap.xml. My question is about the robots.txt configuration: I’m not sure whether I should allow every user agent to crawl the / page. As that page always redirects to the home page of either /en or /es, I believe it should be disallowed.

Should I then do:

User-agent: *
Disallow: /
Allow: /es
Allow: /en

Sitemap: https://www.site.com/sitemap_index.xml

I’m not sure whether that could cause crawl issues or whether there is another way to achieve the same result.

Thanks in advance!
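One way to sanity-check the proposed rules before deploying them is to evaluate a few paths against them. The sketch below is not any search engine's actual code; it assumes the "longest matching pattern wins, Allow wins ties" rule described in RFC 9309, and the helper names pattern_to_regex and is_allowed, as well as the sample paths, are made up for illustration.

```python
import re

def pattern_to_regex(pattern):
    # '*' matches any run of characters; a trailing '$' anchors the end of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(rules, path):
    # rules is a list of (directive, pattern) pairs from one user-agent group.
    # Assumption: the longest matching pattern decides, and Allow wins a tie.
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed

# The rules proposed in the question.
proposed = [("Disallow", "/"), ("Allow", "/es"), ("Allow", "/en")]

# Hypothetical sample paths.
for path in ["/", "/es/", "/en/about", "/sitemap_index.xml"]:
    print(path, "allowed" if is_allowed(proposed, path) else "blocked")
```

Under these assumptions the root is blocked and the /es and /en trees stay crawlable, but everything else outside those two prefixes is blocked as well, including /sitemap_index.xml, which may or may not be what is wanted.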

Google image crawler won’t respect my robots.txt entry to not crawl images

I was looking for a way to prevent reverse image searching (namely, I didn’t want people who had a copy of one of my images to upload it to Google and discover where it originated). I created the following robots.txt file and put it at the root of my blogspot blog:

User-agent: *
Disallow: /hide*.jpg$
Disallow: /hide*.jpeg$
Disallow: /hide*.png$

User-agent: Googlebot-Image
Disallow: /hide*.jpg$
Disallow: /hide*.jpeg$
Disallow: /hide*.png$

With it, I was expecting that all jpg and png image files that start with the word hide (e.g. hide1023939.jpg) would not appear in Google Images (or any other search engine). I was inspired by the official documentation here and here.

However, Google Images keeps showing them, both in reverse searches and in site-wide image searches. I’ve added many new images since I implemented the robots directives, but even these new files get crawled.

As an observation, the images on blogspot/blogger.com are hosted on http://1.bp.blogspot.com/....file.jpg instead of my own subdomain (http://domain.blogspot.com), and I wonder if this is the cause of the issue.
Any ideas how to solve this?
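A robots.txt file only governs the host it is served from, so the hosting observation above can be checked by working out which robots.txt applies to a given image URL. This is a minimal sketch; the function name robots_url_for and the image path are hypothetical, and it does not prove that the host difference is the actual cause here.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url_for(resource_url):
    # robots.txt is looked up at the root of the same scheme and host as the resource.
    parts = urlsplit(resource_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# Hypothetical image path on the image host mentioned in the question.
image_url = "http://1.bp.blogspot.com/some/path/hide1023939.jpg"
blog_url = "http://domain.blogspot.com/2020/01/post.html"

print(robots_url_for(image_url))
print(robots_url_for(blog_url))
```

If the two printed locations differ, rules placed in the blog's own robots.txt would not be consulted for files served from the image host.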

google search console – How can I fix the "Indexed, although blocked by robots.txt" issue if the reported pages do not exist at all?

I found related information, such as How to fix Google "Indexed, although blocked by robots.txt", but that doesn't answer my question.

The pages I saw in the report do not exist at all.

[image]

For example, the two links above do not exist at all. I don't know why people add a number or an image?url= to my URL. It's so weird.

How can I tell Google to ignore these URLs?

2013 – How to hide the robots.txt file for SharePoint websites

I think the robots.txt file is shown or hidden depending on whether the Search engine sitemap settings feature is enabled or disabled. However, if we want to restrict certain types of files, we can add entries to the file as below:

User-agent: *

Disallow: /_layouts/

Disallow: /_vti_bin/

Disallow: /_catalogs/

If you want to allow SharePoint 2010 or 2013 to crawl your website, add the following to your robots.txt file.

User-agent: MS Search 6.0 Robot

Disallow:

Source:

The right robots.txt settings to allow SharePoint to crawl your site

Forum robots.txt File | Forum promotion

Hey FP,

We've all been there: there are so many guests on your forum that you start to wonder what's going on. Of course, most of them are bots, and some are harmful, mainly harvesting email addresses to send spam.

I came across this beast of a robots.txt file and thought I would share it. It's not mine; all the credit goes to mitchellkrogza.

The robots.txt file can be found at https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/blob/master/robots.txt/robots.txt

crawlers – Ahrefs reports "Fetching robots.txt took too long"

My robots.txt:

User-agent: *
Disallow: /wp-admin
Disallow: /sistema
Disallow: /site
Disallow: /old

Sitemap: http://www.example.com.br/page-sitemap.xml

In Google's robots.txt testing tool (https://www.google.com/webmasters/tools/robots-testing-tool) the result is normal: "allowed".

But in Ahrefs I get this message:

Fetching robots.txt took too long

I have already asked the hosting staff to check whether the Ahrefs bot or the IPs it uses are being blocked, and they are not.

I have also disabled all plugins on the site, and nothing changed.

301 redirect of robots.txt to another domain, maybe to a CloudFront layer

I would like to know whether I can put a 301 redirect on the robots.txt file and point it to another location.

Example: https://www.example.net/robots.txt redirects to https://differentdomain.net/example/robots.txt
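For context, RFC 9309 says crawlers should follow at least five consecutive redirects for robots.txt, even across hosts, and apply the rules they find to the host that was originally requested. The sketch below only simulates that hop limit over a hand-written redirect table; it makes no network requests, the function name resolve_robots is hypothetical, and individual crawlers may behave differently.

```python
MAX_REDIRECTS = 5  # hop limit taken from RFC 9309's "at least five" guidance

def resolve_robots(start_url, redirects, max_hops=MAX_REDIRECTS):
    # redirects is a dict {from_url: to_url} standing in for real 301 responses.
    # Returns the final URL, or None if the chain needs more than max_hops hops.
    url = start_url
    hops = 0
    while url in redirects:
        if hops == max_hops:
            return None
        url = redirects[url]
        hops += 1
    return url

# The single redirect described in the question.
redirects = {
    "https://www.example.net/robots.txt": "https://differentdomain.net/example/robots.txt",
}
final_url = resolve_robots("https://www.example.net/robots.txt", redirects)
print(final_url)
```

A single hop like the one in the question stays well inside the limit; a redirect loop or a long chain would return None in this simulation.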

seo – Impact on Google Shopping of blocking UTM parameters in the robots.txt file

I am optimizing the website crawling experience because a large part of the site is not crawled.

Much of the "crawl budget" is spent crawling URLs with UTM parameters for Google Shopping.

If I block crawling of these parameters in the robots.txt file, will this have an impact on Google Shopping ads?

I am having trouble finding confirmation of which bot Google Merchant uses. I think that if it uses AdsBot-Google, I will block the parameters only for the regular Googlebot but allow AdsBot.

My questions:

  • Does Google Merchant use AdsBot?
  • If not, will blocking these pages for Googlebot have a negative impact on the Shopping campaigns?
  • If this is the case, are there any other alternatives to prevent Googlebot from wasting time on these
    pages?

What is robots.txt?

magento2 – robots.txt has Disallow: /checkout/ but Google is still crawling myweb.com/checkout/cart/

Did you add User-agent to your robots file?

Can you see your file when you go directly to {your_domain}/robots.txt?

Also, don't forget to clear the cache after updating these settings.

To verify that your file is working properly, you can use the Google tool: https://support.google.com/webmasters/answer/6062598?hl=en&ref_topic=6061961
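As a quick local check alongside the Google tool, Python's standard urllib.robotparser can evaluate simple prefix rules. This sketch assumes the file contains just a User-agent: * group with Disallow: /checkout/ (the actual file is not shown in the thread), and the /catalog/ path is a hypothetical example. Note that this parser does not implement Google's wildcard extensions, so it is only suitable for plain prefix rules like this one.

```python
from urllib.robotparser import RobotFileParser

# Assumed file contents; replace with the real lines from {your_domain}/robots.txt.
robots_lines = [
    "User-agent: *",
    "Disallow: /checkout/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse in-memory lines instead of fetching over HTTP

cart_allowed = parser.can_fetch("Googlebot", "https://myweb.com/checkout/cart/")
catalog_allowed = parser.can_fetch("Googlebot", "https://myweb.com/catalog/")

print("checkout/cart allowed:", cart_allowed)
print("catalog allowed:", catalog_allowed)
```

If the cart URL comes back as allowed with the real file contents, the rule or the User-agent line is probably not being read the way it was intended.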