How to simulate Googlebot to see which links in a React application would be indexed?

I am developing a React application.

Up to now, indexing coverage has been poor (only the home page has been indexed).

I recently implemented server side rendering (SSR) and the indexing coverage seems to be significantly better.

That being said, I feel like I'm playing the SEO game blindly. Is there a way to simulate Googlebot to see what would be indexed? I would love to see the recursive paths that Googlebot sees.

I know about Google Search Console, but it only lets me inspect one URL at a time.
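One way to approximate this is to crawl your own site the way Googlebot's first (non-rendering) pass does: fetch each page with Googlebot's user-agent string and recursively follow the href links found in the server-rendered HTML. Below is a minimal sketch in Node.js/TypeScript; it does not execute JavaScript, the depth limit and regex-based link extraction are simplifications, and the start URL is a placeholder.

  // crawl.ts – rough approximation of Googlebot's initial (non-rendering) crawl.
  // Run with: npx ts-node crawl.ts https://www.example.com
  const GOOGLEBOT_UA =
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

  const seen = new Set<string>();

  async function crawl(url: string, origin: string, depth = 0): Promise<void> {
    if (seen.has(url) || depth > 3) return; // stay shallow for a quick audit
    seen.add(url);

    const res = await fetch(url, { headers: { "User-Agent": GOOGLEBOT_UA } });
    const html = await res.text();
    console.log(`${res.status}  ${url}`);

    // Naive link extraction from the unrendered HTML only.
    for (const [, href] of html.matchAll(/href="([^"#]+)"/g)) {
      const next = new URL(href, url);
      if (next.origin === origin) {
        await crawl(next.toString(), origin, depth + 1);
      }
    }
  }

  const start = process.argv[2] ?? "https://www.example.com"; // placeholder start URL
  crawl(start, new URL(start).origin).catch(console.error);

Any link that only appears after client-side rendering will be missing from this output, which makes it a quick way to spot pages your SSR is not exposing; for the rendering phase itself you would still rely on the URL Inspection live test or a headless Chrome run.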

googlebot – Would adding links between pages help Google further index our site?

We have a hotel search site. We have identified some concepts that we would like Google to index, for example, we have a landing page for "Boutique Hotels in Chicago, IL". Given the number of cities in the United States, we have over 100K of these pages.

However, Google currently indexes our pages at a rate of only ~350 pages per day. At that pace, more than 100K pages will take about a year to be indexed.

Our website is currently browsed mainly by typing search queries plus destination cities, similar to other travel search sites. There are almost no internal links between the landing pages. In that case, is it important to start adding internal links between pages / improving navigation (for example, breadcrumb-style links: city >> concept), and would that help improve crawl speed?
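For what it's worth, internal links only help crawling if they end up as plain a-href links in the HTML Googlebot receives. Here is a minimal sketch of a breadcrumb-style component in React/TypeScript, assuming the site is built in React (as described in a later question); the component name, props and the city/concept URL scheme are illustrative assumptions:

  // Breadcrumb.tsx – hypothetical breadcrumb for a landing page such as
  // "Boutique Hotels in Chicago, IL". Rendered server-side, these become
  // ordinary <a href> links that Googlebot can follow from page to page.
  import React from "react";

  interface Crumb {
    label: string; // e.g. "Chicago, IL" or "Boutique Hotels"
    path: string;  // e.g. "/chicago-il" (assumed URL scheme)
  }

  export function Breadcrumb({ crumbs }: { crumbs: Crumb[] }) {
    return (
      <nav aria-label="Breadcrumb">
        {crumbs.map((crumb, i) => (
          <span key={crumb.path}>
            {i > 0 && " » "}
            <a href={crumb.path}>{crumb.label}</a>
          </span>
        ))}
      </nav>
    );
  }

  // Usage on the "Boutique Hotels in Chicago, IL" page:
  // <Breadcrumb crumbs={[
  //   { label: "Home", path: "/" },
  //   { label: "Chicago, IL", path: "/chicago-il" },
  //   { label: "Boutique Hotels", path: "/chicago-il/boutique-hotels" },
  // ]} />

With each city page linking to its concept pages (and each concept page linking back to the city), Googlebot gets crawl paths beyond the search form, which crawlers generally cannot fill in on their own.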

googlebot – Disallow image loading to reduce the number of "Other error" entries for crawled page resources?

We have a hotel search site. Many landing pages on our site load resources asynchronously (e.g. via Ajax), but very often (around 25% of the time) when we click "View crawled page" in the URL Inspection tool of Google Search Console, we can see that some important API calls on the page were not loaded and show "Other error". This includes the API calls that display relevant reviews, which are our unique value to travelers. We have over 100K of these pages.

[Screenshot: "More info" tab of the URL Inspection tool]

We have read similar threads like this one (https://support.google.com/webmasters/thread/4425254?hl=en) about "Other error" for page resources that are not loaded, and the key factor seems to be the resource quota (i.e. the limited number of requests Googlebot is willing to make) that Googlebot has determined for each website. More specifically, in our case we saw different kinds of resources being blocked. They are:

  1. Important API calls mentioned above;
  2. Hotel pictures
  3. Google Analytics /collect calls (displayed as image requests in Google Search Console)

My questions are:

  1. Would it harm SEO if I blocked hotel images in the robots.txt file to reduce the "Other error" entries for the API calls?
  2. Is there a way to block Google Analytics in the Google Analytics settings when the user agent is Googlebot? (See the sketch below.)
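On question 2: as far as I know there is no Google Analytics setting that blocks collection by user agent (GA's bot-filtering option only affects reporting), but one client-side workaround is simply not to load the analytics tag when the user agent looks like Googlebot. A minimal sketch assuming a standard gtag.js setup; the measurement ID is a placeholder and the bot regex is deliberately simple:

  // analytics.ts – skip loading the Google Analytics tag for crawler user agents.
  // "G-XXXXXXX" is a placeholder measurement ID; the bot regex is a simplification.
  const isBot = /Googlebot|AdsBot-Google|bingbot/i.test(navigator.userAgent);

  if (!isBot) {
    // Load gtag.js only for (probable) humans.
    const script = document.createElement("script");
    script.async = true;
    script.src = "https://www.googletagmanager.com/gtag/js?id=G-XXXXXXX";
    document.head.appendChild(script);

    // Standard gtag bootstrap.
    const w = window as any;
    w.dataLayer = w.dataLayer || [];
    w.gtag = function () { w.dataLayer.push(arguments); };
    w.gtag("js", new Date());
    w.gtag("config", "G-XXXXXXX");
  }

This only prevents the request from being made for bot user agents; it is a workaround rather than an official feature, so treat it as a judgment call.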

googlebot – Slow crawl speed for many pages (over 100,000) on a travel website

Context:

We have a relatively new hotel search site where users can freely search by their preferences, such as "a child-friendly hotel with bunk beds, good breakfast and clean rooms". Relevant review excerpts are displayed for each hotel in the results, according to the concepts mentioned in the query, in this case "child-friendly", "bunk beds", "breakfast", "clean".

We believe our website can offer unique value to travelers, saving users the time of reading numerous reviews and finding related information. We have identified some concepts that we would like Google to index; for example, we have a landing page for "Boutique Hotels in Chicago, IL". Given the number of cities in the United States, we have over 100,000 pages of this type.

However, Google currently indexes our pages at a rate of only ~350 pages per day. At that pace, more than 100K pages will take about a year to be indexed. I would love to hear your suggestions / tips for speeding up indexing.

Our current ideas for improving indexing speed / SEO in general:

  1. Create internal links / improve site navigation – Is internal linking important for SEO in this case? How should a hotel search site set up internal links? (Search seems like the natural way to navigate the results. Perhaps breadcrumbs (city -> concept, e.g. child-friendly hotels)?)
  2. Add an About page – state our mission and who we are.
  3. Render on the server side – the website is currently built in React.js, so Googlebot needs more resources to render each SEO page. (A minimal sketch follows this list.)
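On idea 3, here is a minimal sketch of what server-side rendering looks like with Express and ReactDOMServer; the App component, route and port are placeholders, and a real setup would also handle routing and client-side hydration:

  // server.tsx – minimal server-side rendering sketch (Express + ReactDOMServer).
  import express from "express";
  import React from "react";
  import { renderToString } from "react-dom/server";
  import App from "./App"; // placeholder for the real root component

  const app = express();

  app.get("*", (req, res) => {
    // Render the page to plain HTML so crawlers get content without running JS.
    const html = renderToString(<App url={req.url} />);
    res.send(`<!DOCTYPE html>
  <html>
    <head><title>Hotel search</title></head>
    <body>
      <div id="root">${html}</div>
      <script src="/client.js"></script> <!-- hydrates on the client -->
    </body>
  </html>`);
  });

  app.listen(3000);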

In the long term, we will reach out and build awareness of our website. However, given the current pandemic, we would like to focus more on the website / content itself.

Are there any other suggestions / comments on the above SEO ideas? Thank you very much for your time and your help!

googlebot – Google Search Console warning: "Indexed, though blocked by robots.txt" (BUG)

Two days ago, I received an email warning from Google Search Console telling me that 3 URLs on my site were indexed but blocked by robots.txt.

I have not changed the robots.txt file since I first deployed my website.

robots.txt

User-agent: *
Disallow: /admin
Disallow: /admin/
Disallow: /blog/search
Disallow: /blog/search/

Sitemap: https://www.example.com/sitemap.xml

And these are the pages from which I received the warning:

https://www.example.com/blog/slug-1
https://www.example.com/blog/slug-2
https://www.example.com/blog/slug-3

I think everyone agrees that these pages are NOT blocked by my robots.txt file, right?
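For what it's worth, robots.txt Disallow rules match URL paths by prefix, and none of the rules above are a prefix of /blog/slug-1, /blog/slug-2 or /blog/slug-3. A quick sanity check (a simplified matcher that ignores wildcards, which this file does not use anyway):

  // robots-check.ts – simplified robots.txt check: plain prefix matching only.
  const disallowed = ["/admin", "/admin/", "/blog/search", "/blog/search/"];

  function isBlocked(url: string): boolean {
    const path = new URL(url).pathname;
    return disallowed.some((rule) => path.startsWith(rule));
  }

  for (const slug of ["slug-1", "slug-2", "slug-3"]) {
    const url = `https://www.example.com/blog/${slug}`;
    console.log(url, isBlocked(url)); // prints "false" for all three
  }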

Note: I have other blogs with other slugs and they are fine. No warnings beyond these 3 URLs.

Anyway, I clicked the Fix Issues button in Google Search Console and the URLs are currently in a pending state.

Has this ever happened to someone? What could be the cause?


UPDATE:

In fact, I just inspected one of the 3 URLs and I got this:

[Screenshot: URL Inspection result]

But when I clicked "Test live URL", this is what I got:

[Screenshot: live test result]

And I am 100% sure my robots.txt file has not changed since the first deployment of this website. That is to say: this is 100% a bug in Googlebot's crawling process.

googlebot – Can I use thumbnails of external links in a news aggregator website?

I am building a site where I need to post news (links) on a specific topic.

I will collect and aggregate them on my site.

The question is:

Can I use their thumbnail images to promote the links back to their articles?

For example (from example.com):

[Screenshot: example news thumbnail from example.com]

Of course, the thumbnail will link to the example.com URL for that post, and "Source: example.com" will also be displayed.

I will receive news from many sites, not only example.com.

Is this legal? Also, how will Googlebot react to this in terms of page ranking?

googlebot – Why does Google create pages to crawl on my site?

For some reason, Google lists a series of pages that don't exist on my website, such as:

https://www.my_domain.com/index.php/about_us.php

It lists them as "Duplicate, Google chose different canonical than user" in Search Console.

Google seems to take each combination of a "real page" and tack another page onto the end of the URL.

My page index.php is not a folder, so why is Google crawling it as if it were a folder with all of my pages underneath?
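One hypothesis worth checking (not something the question confirms): if any page contains a relative link such as href="about_us.php" and Google ever reaches a URL ending in /index.php/, the relative link resolves underneath it and produces exactly this kind of phantom URL. A small illustration of how relative URLs resolve:

  // url-resolution.ts – how a relative href resolves against different base URLs.
  const relativeHref = "about_us.php";

  console.log(new URL(relativeHref, "https://www.my_domain.com/index.php").toString());
  // -> https://www.my_domain.com/about_us.php

  console.log(new URL(relativeHref, "https://www.my_domain.com/index.php/").toString());
  // -> https://www.my_domain.com/index.php/about_us.php  (the "phantom" page)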

googlebot – Getting Google to index anchor URLs that each show a dynamically loaded image

Context

I have a site (https://womenwhocouldneverbevegan.com) which shows a slideshow of images.

The site uses only one img element in the DOM. When the page loads, the JavaScript loads around thirty images in total into img elements that are not attached to the DOM. As the user moves from one image in the slideshow to the next, the JavaScript removes the current img from the DOM and adds the next one to the DOM. It also dynamically sets the anchor in the URL. For example, if you are on https://womenwhocouldneverbevegan.com/#5 and go to the next image, the URL will become https://womenwhocouldneverbevegan.com/#6.

Additionally, when the page loads at a particular anchor (such as https://womenwhocouldneverbevegan.com/#6), the page shows that image as the initial image. That is, external links with image-number anchors jump to the given image when the page loads.

My understanding is that because all of the content on my page (except the initial title screen image) is loaded and displayed dynamically, the robot will not see pages such as https://womenwhocouldneverbevegan.com/#6. Also, even if I add URLs like https://womenwhocouldneverbevegan.com/#6 to my sitemap.xml, the robot will still not see these images unless I give it extra information, because they are not added to the DOM until the JavaScript is executed.

The plan

My understanding was that I could use image:image tags with image:caption tags in my sitemap.xml to help Google "see" these dynamically loaded images and their associated text (the "alt" attribute is set dynamically on the img element in JavaScript).

My sitemap.xml therefore begins:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://womenwhocouldneverbevegan.com</loc>
    <image:image>
      <image:loc>http://womenwhocouldneverbevegan.com/images/abegin.jpg</image:loc>
      <image:title>Women Who Could Never Be Vegan</image:title>
      <image:caption>A poem, with illustrations by Edward Gorey, slightly modified.</image:caption>
    </image:image>
  </url>
  <url>
    <loc>https://womenwhocouldneverbevegan.com/#0</loc>
    <image:image>
      <image:loc>http://womenwhocouldneverbevegan.com/images/abegin.jpg</image:loc>
      <image:title>Women Who Could Never Be Vegan</image:title>
      <image:caption>A poem, with illustrations by Edward Gorey, slightly modified.</image:caption>
    </image:image>
  </url>
  <url>
    <loc>https://womenwhocouldneverbevegan.com/#1</loc>
    <image:image>
      <image:loc>http://womenwhocouldneverbevegan.com/images/anne_cheese.jpg</image:loc>
      <image:title>A is for Anne, who's addicted to cheese.</image:title>
      <image:caption>A is for Anne, who's addicted to cheese.</image:caption>
    </image:image>
  </url>
  <url>
    <loc>https://womenwhocouldneverbevegan.com/#2</loc>
    <image:image>
      <image:loc>http://womenwhocouldneverbevegan.com/images/beth_soy.jpg</image:loc>
      <image:title>B is for Beth, who says soy makes her wheeze.</image:title>
      <image:caption>B is for Beth, who says soy makes her wheeze.</image:caption>
    </image:image>
  </url>

My goal here was to help Google see, for example, that https://womenwhocouldneverbevegan.com/#1 is a unique URL on my site and that it contains the image http://womenwhocouldneverbevegan.com/images/anne_cheese.jpg, and that the text associated with this image is "A is for Anne, who is addicted to cheese." So, in theory, if a user searched Google for "cheese addicts", the search results could include this page and this image.

The question

What I see in the coverage report in Google Search Console is that one URL (https://womenwhocouldneverbevegan.com/) is "Submitted and indexed" with "Status: valid", and that the other 29 URLs (https://womenwhocouldneverbevegan.com/#1, https://womenwhocouldneverbevegan.com/#2, etc.) are "Discovered – currently not indexed" with "Status: Excluded".

I know it's up to Google to determine which URLs they want to index and when. And if it's only a matter of time before indexing those other 29 URLs, that's fine.

But from what I've read, it seems more likely that these URLs have been excluded because, to Googlebot, they appear to contain content identical to the main URL (i.e. the basic HTML before the JavaScript adds the particular image to the DOM based on the anchor).

If that sounds right, my question is: is there a better way for me to structure my HTML, JavaScript and sitemap.xml to get the desired result, that is, for Google to treat my site as if it had 30 pages, each with a unique image and unique associated text?

I could of course make each image its own actual HTML page, but I don't like that as it would cause the slideshow to flicker and skip when each page is loaded.

Another idea would be to use URL parameters instead of anchors. If I used https://womenwhocouldneverbevegan.com/?page=1 instead of https://womenwhocouldneverbevegan.com/#1, would it work better? I may be answering my own question here. I guess it makes sense that, by definition, URLs that differ only in anchor tags are the same page …
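That intuition matches how Google generally handles fragments: everything after # is ignored for indexing, so /#1 and /#2 are the same URL to Googlebot, whereas ?page=1 and ?page=2 (or real paths) are distinct URLs. Here is a minimal sketch of switching the slideshow to query parameters with the History API so navigation still avoids full page loads; showImage is a placeholder for the existing image-swapping code:

  // slideshow-nav.ts – use ?page=N instead of #N so each image has a distinct URL.
  // showImage(n) is a placeholder for the existing code that swaps the <img>.
  declare function showImage(n: number): void;

  // Call goToImage(n) from the next/previous controls.
  function goToImage(n: number): void {
    showImage(n);
    // Update the address bar without reloading the page.
    history.pushState({ page: n }, "", `?page=${n}`);
  }

  // On initial load, honour a ?page=N parameter the same way #N is handled today.
  const initial = Number(new URLSearchParams(location.search).get("page") ?? "0");
  showImage(initial);

  // Handle back/forward navigation between images.
  window.addEventListener("popstate", (event) => {
    showImage((event.state as { page?: number } | null)?.page ?? 0);
  });

Note that each distinct URL would still need to return meaningful HTML for its image (via server-side rendering or prerendering); otherwise Google may again see 30 URLs with identical initial content.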

Googlebot ignores robots.txt

Two months ago, I disallowed certain directories from crawling in the robots.txt file. Since then, I have been monitoring the log files every day, and I see that Googlebot completely ignores the robots.txt file.

In fact, it still crawls every URL it was crawling before the directories were disallowed.

The Search Console test shows all URLs of the disallowed directories as allowed! Only live Search Console tests show the URLs as disallowed – which at least suggests the disallow rules themselves are correct and working.

Since crawling of URLs in the disallowed directories is not permitted, they should appear in the index without snippets. However, URLs from the disallowed directories appear with snippets, and their cache date is from this week.

None of the basic rules I know about how Google treats websites seem to apply here.

Any ideas, what could be going on here?
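One thing worth ruling out from those log files: traffic that merely claims to be Googlebot. Google's documented verification is a reverse DNS lookup on the requesting IP (it should resolve to a googlebot.com or google.com host) followed by a forward lookup that returns the same IP. A minimal sketch in Node.js/TypeScript; the sample IP is just for illustration:

  // verify-googlebot.ts – check whether an IP from the access log is really Googlebot.
  import { reverse, resolve4 } from "node:dns/promises";

  async function isRealGooglebot(ip: string): Promise<boolean> {
    try {
      const [host] = await reverse(ip); // e.g. crawl-66-249-66-1.googlebot.com
      if (!host || !/\.(googlebot|google)\.com$/.test(host)) return false;
      const addresses = await resolve4(host); // forward-confirm the hostname
      return addresses.includes(ip);
    } catch {
      return false;
    }
  }

  // Sample IP for illustration; feed it the IPs that show up in your logs.
  isRealGooglebot("66.249.66.1").then((ok) => console.log(ok));

If the hits that ignore robots.txt fail this check, the culprit is a scraper spoofing the user agent rather than Google.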

seo – URL changed for a page that Googlebot has indexed. I will 301-redirect from the old URL to the new one. But what should I do with my sitemap?

I plan to change a URL for one of the pages of my site.

Example:

From: https://www.example.com/old-post-slug

To: https://www.example.com/new-post-slug

The point is, Google has already indexed the old URL: https://www.example.com/old-post-slug

And from these docs, we see that to avoid losing page ranking, we have to respond with a 301 (Moved Permanently) from the old URL pointing to the new URL.

https://support.google.com/webmasters/answer/6033049?hl=en

[Screenshot from the Google documentation on 301 redirects]
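For reference, the redirect itself is a one-liner in most frameworks. A minimal sketch with Express, using the example slugs above (the framework choice and port are assumptions about the setup):

  // redirect.ts – permanent redirect from the old slug to the new one (Express).
  import express from "express";

  const app = express();

  app.get("/old-post-slug", (_req, res) => {
    // 301 tells Google the move is permanent, so ranking signals transfer to the new URL.
    res.redirect(301, "/new-post-slug");
  });

  app.listen(3000);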

QUESTION

I get that I should redirect (301) from the old URL to the new one, so when Google re-crawls, it will see the change. But what should appear in my sitemap? The old URL or the new one? Or both?

I tend to think it would be better to keep only the new URL in my sitemap. But what if Google crawls the new URL before seeing the redirect from the old one? Wouldn't the new URL start as a brand-new page (from Google's index point of view) with zero ranking signals? How does Googlebot handle this? What is the recommended practice?