SEO - Does anyone block bots from dynamic pages?

Working with Google Search Console and Ubersuggest, I get the impression that I should be using robots.txt to block the bots from URLs containing single.php?id= and index.php.

These files are being categorized by Google Search Console as “Duplicate, Google chose different canonical than user”.

Ubersuggest nags me with “duplicate content” issues as well.

On a slightly different but related subject… Is anyone blocking the /backlight directory from the bots?

Thank you for any assistance.

Perry J.

My robots.txt recommendations are a slow work-in-progress, but this is relevant to your question:

I think the preferred URL format for single-image pages should be:

https://../galleries/_recent/mc-20210307-165302-single.php

So yes, if Google is also picking up URLs matching that format, then I would block indexing of the single.php?id=filename addresses.

My robots.txt file for my personal site is:

User-agent: *
Disallow: /backlight
Disallow: /*/thumbnails/*.jpg$
Disallow: /*/single.php

Thanks, Matt!

Are you blocking thumbnails from the galleries? If so, why?

This is what I am using/trying in my robots.txt file at the moment:

Sitemap: https://www.mysite.com/sitemap.xml.gz
User-agent: *
Disallow: /backlight/
Disallow: ?id=
Disallow: *index.php

I am blocking indexing of thumbnail images, yes, because I see no purpose in such tiny images showing up in Google Images searches. Also, they’re not watermarked; my large images are.

Hi Perry, if your URL-rewriting is working, then the browser and search engines shouldn’t see pages like single.php?id=. They should instead appear with names like 20210510-0001-single.php. Can you share an album URL, either here or via direct message, so that I can take a look?
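
For illustration only: this is not Backlight’s actual rewrite code, just a generic sketch of the kind of Apache mod_rewrite rule that maps the pretty -single.php addresses onto the dynamic single.php?id= script. All paths and names here are placeholders.

# Hypothetical .htaccess sketch, not Backlight's real rules
RewriteEngine On
# Map /galleries/<album>/<photo>-single.php to /galleries/<album>/single.php?id=<photo>
RewriteRule ^galleries/([^/]+)/(.+)-single\.php$ galleries/$1/single.php?id=$2 [L,QSA]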

@Ben I don’t know how, but I recall that Google’s tools were seeing both versions of the URL for my personal site. That may or may not have been related to the sitemap generator I used.

FWIW:

I use https://www.xml-sitemaps.com/ for generating sitemaps. I am thinking about running it as a cron job on a daily basis.
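
If I do go the cron route, I picture something like this. It’s purely a sketch with made-up paths; the real command depends on whatever standalone script the generator provides:

# Hypothetical crontab entry: regenerate the sitemap every night at 2:00 AM
# /path/to/generate-sitemap.sh is a placeholder for the script that runs the generator
0 2 * * * /path/to/generate-sitemap.sh > /dev/null 2>&1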

In the configuration for the crawler for the sitemap generator, I use the following for exclusions:

?id=
index.php
/backlight

I keep jpg files in a separate images sitemap file and do not include them in the main sitemap file.
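
For anyone curious, the image sitemap entries follow Google’s image sitemap extension. A trimmed example, with placeholder URLs rather than my real ones:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <!-- the page the image appears on -->
    <loc>https://www.mysite.com/galleries/example-album/mc-20200924-3185-single.php</loc>
    <!-- the image itself -->
    <image:image>
      <image:loc>https://www.mysite.com/galleries/example-album/photos/mc-20200924-3185.jpg</image:loc>
    </image:image>
  </url>
</urlset>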

Google Search Console could see that I was not submitting the ?id= and index.php URLs, but the path to fully resolving the “duplicate content” issue is muddy. Ultimately, I have settled on the robots.txt solution. We’ll see what Google Search Console has to say about it in a month.

Ubersuggest kept nagging me with duplicate content problems, and while I could have ignored it, it made the reports I get cloudy. This is what finally drove me to the robots.txt file.

The only other nags I am getting are about thin content and some short titles, which I plan on cleaning up.

As a side note: if the canonical for the ?id= files pointed to the “static” single.php version, that might have been a better solution than, or a good complement to, the robots.txt file.

Perry J.

Thanks for the link, Perry. I can’t see any references to single.php?id= links either in the albums or the single pages they link to. So I’m not sure where Google is picking up these kinds of links.

As a result of blocking ?id=, Google is warning me:

“Indexed, though blocked by robots.txt”

Here’s the explanation.

Right now, Google is only pointing at a few pages, but I believe that if I leave things alone, many more will follow. These pages have the ?id= parameter in the URL, so the block in robots.txt is working as advertised.

At the moment, I plan on leaving the current robots.txt intact because I don’t want those pages indexed for fear they might appear as duplicate content to the bots.

This puts me back to wondering whether things would work better for the bots if these dynamic pages had canonical tags pointing to the “non-dynamic” pages. Based on my reading, that seems to be what the bots prefer.

In the end, as long as Googlebot doesn’t flag the dynamic pages as duplicates, I’m reasonably comfortable.

Also important to note: I had to remove the block on the /backlight directory. If Google can’t see the CSS and whatever else is in that directory and its associated directories, the gallery pages start getting flagged as mobile-unfriendly.

Feel free to share thoughts. Thanks for the notes.

Perry J.

Here’s an address to one of my images:

https://campagna.photography/galleries/nature-outdoors/mc-20200924-3185-single.php

If you visit that address and inspect the source code, you will see that same address being used as the canonical location for that page.

This is the expected behavior for Backlight’s albums and single-image pages, and if your site is correctly configured to use URLs in this format, then the ?id= addresses should not be appearing.

Also important to note: I had to remove the block on the /backlight directory. If Google can’t see the CSS and whatever else is in that directory and its associated directories, the gallery pages start getting flagged as mobile-unfriendly.

This makes no sense to me. I’ve seen no indication of there being any issue with having the /backlight folder blocked. Google should be indexing public-facing pages, not background assets.

I ran both of these URLs from your site:

https://campagna.photography/galleries/nature-outdoors/single.php?id=mc-20200924-3185
https://campagna.photography/galleries/nature-outdoors/mc-20200924-3185-single.php

Both work, but neither one was indexed. I believe that if you find a non-dynamic URL of yours that IS indexed by Google, you will also find that the dynamic version is indexed.

I tried a few others from your galleries and they came back as “not indexed.” The same tool reliably reports that my pages, both the non-dynamic and dynamic versions, are indexed.

I used: https://smallseotools.com/google-index-checker/

To be clear, I don’t use that tool on any regular basis; I use Google Search Console. But to look at someone else’s property, or at multiple domains or pages at once, I use the aforementioned tool.

In terms of blocking the /backlight directory, I am sure of what I am seeing. Within 24 hours of my blocking it, 18 pages were tagged with “Clickable elements too close together” and 14 were tagged with “Content wider than screen.” (I generate a new sitemap file every time I fiddle with the live version of my site; the generator automatically pings Google to let the bots know there is an update.) As soon as I unblocked /backlight, the problem went away, again within 24 hours. The dates in the Google report agreed with what I was seeing.
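
If I ever want to block /backlight again, one compromise I may try (untested on my part, so treat it as a sketch) is to keep the directory disallowed while explicitly allowing the CSS and JavaScript that Google needs for rendering:

User-agent: *
Disallow: /backlight/
# let the bots fetch the assets they need to render the pages
Allow: /backlight/*.css$
Allow: /backlight/*.js$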

On another note: since blocking ?id=, I am now getting “Valid with warnings” for 14 pages (I anticipate getting a lot more in due course):

“Indexed, though blocked by robots.txt” with this explanation:

https://support.google.com/webmasters/answer/7440203#indexed_though_blocked_by_robots_txt

Bottom line: if you don’t want Google to index a page, Google wants you to use “noindex”. This brings us back to whether a canonical fix would work, as implied by Google. If I were able to change the code, then based on what I’ve read on Google’s webmaster pages, I would want to try the “canonical” fix first. However, if this is impossible to do from a coding standpoint, please let me know so I can move past this notion.
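
For reference, the “noindex” route would look something like this on each dynamic page. The catch, per Google’s documentation, is that the bot has to be able to crawl the page to see the tag, so such a page can’t also be blocked in robots.txt:

<!-- in the <head> of each dynamic page you want kept out of the index -->
<meta name="robots" content="noindex">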

Perry J.

Backlight’s pages already include a “canonical” URL on every page.

I guess I’ll have to wait 24+ hours for Google to complain or not, but updating my sitemap configuration to exclude “index.php” seems to be having the desired effect in my sitemap of preventing duplicate URLs such as:

https://domain.com/folder/
https://domain.com/folder/index.php

Perry, can you provide a link to any of your pages where links in the ?id= format can be seen, either by hovering your mouse over a link or in the page source?

In my mind, the canonical in the dynamic page should point to the non-dynamic. Example:

For your page:

https://campagna.photography/galleries/nature-outdoors/single.php?id=mc-20200924-3185

The current canonical is:

<link rel="canonical" href="https://campagna.photography/galleries/nature-outdoors/single.php?id=mc-20200924-3185" />

I believe it should be this:

<link rel="canonical" href="https://campagna.photography/galleries/nature-outdoors/mc-20200924-3185-single.php" />

Perry J.

Not exactly sure what you are driving at. Check out my latest reply to Matt.

Yeah, it is. It seems to depend on how you get there. When I click through from the album, I see this as the canonical:

https://campagna.photography/galleries/nature-outdoors/mc-20200924-3185-single.php

We found yesterday that visiting an album with /index.php in the URL causes the page to use the ?id= form of the URLs; that’s something we need to look into.

Makes sense to me, Matt. I think you are on the right track.

To further my notes, Google also reports these dynamic pages as:

“Duplicate, Google chose different canonical than user”

Seems the fix is to be sure the canonical for any dynamic page refers to the non-dynamic page.

See my message above about excluding /index.php URLs from the sitemap. Should help to stop Google’s complaining in the meantime.