google search console – How to Edit or Remove robots.txt on a WordPress powered website

If this is a fresh install of WordPress, it could be because you have set the privacy settings to stop search engines from crawling the site.

(and there will be no physical robots.txt file on the server, as WordPress creates it on the fly)

Go to your settings in WordPress and see if this box is ticked:

(screenshot of the WordPress privacy setting that discourages search engines from indexing the site)
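
When that box is ticked, the virtual robots.txt WordPress generates typically looks like this (assuming no plugin has altered it):

User-agent: *
Disallow: /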

If so, uncheck it and then the robots.txt should change to

User-agent: *
Disallow: 

If you still have issues and the robots.txt is still set to block crawlers, then explore the other option as outlined by Facet.

robots.txt – No information is available for this page

Today I searched for my GitHub page in Google Chrome and Firefox and, for the first time, it says no information is available. I read on the internet that this can be due to the robots.txt file. I followed Google's advice at https://support.google.com/webmasters/answer/7489871?hl=en and downloaded the file checker, and it is telling me to upload my file to

www.github.com/username/

However, I do not see any place to upload my file (I do not know whether this is due to permissions, or whether I just cannot find where to upload my file there).

Other questions on this site are either about personal pages or WordPress or something similar. Can anyone suggest how I can solve this issue?

magento2.3.5 p1 – Commerce Edition – Website Restrictions Enabled and robots.txt

Is it expected functionality that a website with restriction mode set to “Website Closed” will cause the router to exceed its 100 loop limit when trying to reach /robots.txt?

I’m seeing this routing error periodically in the logs for a specific site that is in “Website Closed” mode, with the requested path being /robots.txt. I get that “Website Closed” should not allow any access; however, shouldn’t it still return a robots.txt equivalent to NOINDEX,NOFOLLOW (or just a blank one)?
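
As an aside, NOINDEX,NOFOLLOW are values for the robots meta tag or X-Robots-Tag header rather than robots.txt directives; a robots.txt that shuts crawlers out entirely, which is roughly what one might expect a closed site to serve, would simply be:

User-agent: *
Disallow: /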

Are there any files or folders (apart from robots.txt and favicon.ico) which MUST go in the root directory?

I think it may be difficult to get an exhaustive list of all possible files which need to live at the root. For one thing, different content management systems may place various files at the root, while others may place those elsewhere, so it depends on what platform you’re using.

Generally, your index.html file will be found at the root, but keep in mind that your root is still a folder, which may go by a different name depending on your web host or your CMS.

Then, you have changing standards. The sitemap.xml file was commonly placed at the root, but these days many CMSs, such as WordPress (via plugins like Yoast), generate a sitemap_index.xml file instead, which points to a list of sitemaps broken down by content type. Sometimes they all live at the root; other times they are in a directory. Having them in a directory is fine, as long as the sitemap index file is at the root and the search bots can easily find and crawl that directory. Thus, a sitemap.xml file may not exist on a website at all anymore, having been replaced by a (slightly) more complex sitemap information architecture.
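
As a rough illustration (the URLs are placeholders), a sitemap index file at the root that points to per-content-type sitemaps kept in a subdirectory might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemaps/post-sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemaps/page-sitemap.xml</loc>
  </sitemap>
</sitemapindex>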

And then you have specific use cases. If you are a publisher and actively sell ad inventory on your website, you need an ads.txt file, and it should be at the root. If you’re an ad exchange or an SSP (sell-side platform), you need a sellers.json file, which should also live at the root of your domain. Read more about ads.txt and sellers.json.
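
For illustration, each line of an ads.txt file lists one authorized seller in the form: ad system domain, publisher account ID, DIRECT or RESELLER, and an optional certification authority ID (the account IDs below are placeholders):

# ads.txt
google.com, pub-0000000000000000, DIRECT, f08c47fec0942fa0
ssp.example.com, 12345, RESELLER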

Perhaps the best way to go about it is to learn more about your CMS, figure out the functionality you’re looking for, and follow the relevant standard; its documentation will tell you where the crucial files should live.

Updated Bing Webmaster Robots.txt Tester Tool

Bing has a new and updated robots.txt tester tool.

magento2 – Magento 2: Custom Instructions of Robots.txt is not working

We are facing an issue and need your guidance to solve it. We have updated the custom instructions for the robots.txt file as below:

User-agent:*
Disallow:

But on the frontend, it’s showing as below:

User-agent:*
Disallow: /

I have cleared and flushed the cache. So what could be the issue?

Any help will be appreciated!
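
One thing worth checking is whether the custom instructions were saved at a different configuration scope than the one the storefront uses, since a website- or store-view-level value overrides the default. A sketch using the Magento CLI, assuming the value lives under the usual design/search_engine_robots/custom_instructions path and that "base" stands in for your website code:

# value at the default scope
bin/magento config:show design/search_engine_robots/custom_instructions

# value at a specific website scope
bin/magento config:show --scope=websites --scope-code=base design/search_engine_robots/custom_instructions

# refresh caches after correcting the value
bin/magento cache:flush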

seo – How to de-index pages from google using robots.txt

Assuming these pages still exist, but you just want them removed from search results…

What is the proper way to de-index pages using robots.txt?

You wouldn’t necessarily use robots.txt to de-index pages, i.e. remove already-indexed pages from the Google search results. A noindex robots meta tag in the page itself (or an X-Robots-Tag HTTP response header) might be preferable instead, in combination with the URL removal tool in Google Search Console (GSC) to speed up the process.

robots.txt specifically blocks crawling (not necessarily “indexing”). By blocking these pages from being crawled, they should naturally drop from the search index in time, but this can take a considerable amount of time. However, if these pages are still being linked to, they may not disappear entirely from the search results while the URLs are blocked by robots.txt (you can end up with a URL-only link in the SERPs, with no description).

Using robots.txt to remove the https://www.example.com/getridofthis/ directory…

User-agent: *
Disallow: /getridofthis/

To remove pages entirely from the SERPs, consider using a noindex meta tag (or an X-Robots-Tag: noindex HTTP response header) instead of robots.txt (which is what it sounds like you are doing already). Don’t block crawling in robots.txt, as this will prevent the crawler from seeing the noindex meta tag.
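
For illustration, the two forms look like this:

<!-- robots meta tag, in the <head> of each page to be removed -->
<meta name="robots" content="noindex">

Or, as an HTTP response header (useful for non-HTML resources such as PDFs):

X-Robots-Tag: noindex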

To expedite the process of de-indexing URLs in Google Search, you can use the URL removal tool in GSC (formerly Webmaster Tools). For this tool to be effective long term, you need to use the noindex meta tag in the pages themselves. (The original blog article stated that robots.txt could be used as a blocking mechanism with the URL removal tool; however, recent help documents specifically warn against using robots.txt for “permanent removal”.)


seo – How to remove site from google index after updating robots.txt?

The correct and only way is to initially allow the pages to be crawled. Set the meta tag name="robots" content="noindex,follow" on the affected pages. Once the pages have been removed from the index, THEN add the Disallow rule to the robots.txt.

With your current setting you are telling Google only this: “Please do not access/recrawl these pages.” How should Google know that you want to de-index them?
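
A sketch of that sequence (the path is a placeholder):

<!-- phase 1: page stays crawlable; robots.txt does NOT disallow it yet -->
<meta name="robots" content="noindex,follow">

# phase 2: only once the URLs have dropped out of the index
User-agent: *
Disallow: /section-to-remove/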

web crawlers – What is difference between robots.txt, sitemap, robots meta tag, robots header tag?

So I am trying to learn SEO, and I am honestly confused. I have the following 8 questions:

  • Do I tell a bot not to visit a certain link through the X-Robots-Tag header, through the robots meta tag, or through robots.txt?

  • Is it OK to include all three (robots.txt, the robots meta tag, and the X-Robots-Tag header), or should I always provide only one?

  • Do I get penalized if I give the same information in the X-Robots-Tag header, the robots meta tag, and robots.txt?

  • Let’s say that for /test1 my robots.txt says Disallow, but my robots meta tag says follow,index and my X-Robots-Tag says nofollow,index,noarchive. Do I get penalized because those values differ?

  • Let’s say that for /test1 my robots.txt says Disallow, but my robots meta tag says follow,index and my X-Robots-Tag says nofollow,index,noarchive. Which rule will the bot follow, and which one takes precedence?

  • Let’s say my robots.txt has rules saying Disallow: / and Allow: /link_one/link_two, and my X-Robots-Tag and robots meta tag for every link except /link_one/link_two say nofollow,noindex,noarchive. From what I understand, the bot will never get to /link_one/link_two since I prevented it from crawling at the root level. Now, if I provide a sitemap.xml in the robots.txt that lists /link_one/link_two, will that page actually end up being crawled? (See the sketch after this list.)

  • Will the bot crawl into a directory provided by the sitemap.(xml/txt) even though it is not reachable from the home page or from any pages linked off the home page?

  • And overall, I would appreciate some clarification on the difference between robots.txt, the X-Robots-Tag header, the robots meta tag, and the sitemap.(xml/txt). To me they seem to do the exact same thing.

  • I already saw that there are some questions that answer a small subset of what I asked, but I want the whole big explanation.
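
A sketch of the setup described in the sixth question (URLs are placeholders; whether the sitemap reference gets the otherwise-blocked URL crawled is exactly what is being asked, so this only illustrates the syntax):

User-agent: *
Disallow: /
Allow: /link_one/link_two

Sitemap: https://www.example.com/sitemap.xml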

web crawlers – Disallow root but not 4 subdirectories for robots.txt

I have a project and I would like to disallow everything, starting from the root.

From what I understand, I think I can do so by doing this:

Disallow: /
Disallow: /*

However, I would like to allow 4 subdirectories and everything under those subdirectories.

This is how I think it should be done:

Allow: /directory_one/
Allow: /directory_one/*
Allow: /directory_two/
Allow: /directory_two/*
Allow: /directory_six/
Allow: /directory_six/*
Allow: /about/
Allow: /about/*

So how would I go about disallowing everything starting from the root, but allowing only those 4 directories and everything under them?

Also, if I want to allow a specific directory and everything under it, do I have to declare it twice?

Will web crawlers be able to navigate to those subdirectories if the root is disallowed?
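
With the Allow directive that Google, Bing, and other major crawlers support, the most specific (longest) matching rule wins, so a sketch of the whole file could look like this; the /* lines are redundant, because a path rule already matches everything that starts with that path:

User-agent: *
Allow: /directory_one/
Allow: /directory_two/
Allow: /directory_six/
Allow: /about/
Disallow: /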