SEO - Does anyone block bots from dynamic pages?

Also, I think you can get away with something like this for robots.txt:

User-agent: *
Disallow: /backlight
Allow: /backlight/publisher
Disallow: /*/single.php
Disallow: /*/thumbnails/*.jpg$

At least, that’s what I’m experimenting with now. The difference is that we’re allowing /backlight/publisher, which is where we cache album assets, rather than allowing the entire /backlight folder.
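For reference, here’s how I’d expect those rules to behave on a few sample paths, going by Google’s longest-match handling of Allow/Disallow. The paths are made up for illustration, and this is my own reading rather than output from Google’s robots.txt tester:

/backlight/templates/style.css -> blocked (Disallow: /backlight)
/backlight/publisher/cache/photo.jpg -> allowed (Allow: /backlight/publisher is the longer match)
/galleries/nature/single.php -> blocked (Disallow: /*/single.php)
/galleries/nature/thumbnails/img.jpg -> blocked (Disallow: /*/thumbnails/*.jpg$)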

That’s what I have been doing since the beginning. Here’s the crawler configuration – not to be confused with what’s in robots.txt. The following is excluded from my sitemap:

?id=
index.php
/backlight

Googlebot can also compare what I have submitted in robots.txt against what it has found.

Once you get your pages indexed, watch whether they start getting reported as “Mobile Usability - Error” in Google Search Console while /backlight is blocked in robots.txt; blocking /backlight can keep Googlebot from fetching the CSS and JavaScript it needs to render your pages, which is what tends to trigger those errors.

Perry J.

Ben and I have synced. The observed behavior when entering a page with /index.php is the desired behavior. That’s the fallback mechanism that guarantees Backlight works regardless of whether URL rewriting is available and enabled.

Therefore, you should ensure that you’re providing both Google and your sitemap generator the correct entry point – https://domain.com, omitting /index.php, with the correct protocol, and eliminating any other factors that may cause a redirect.
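If you’re hunting down redirect factors, a minimal .htaccess sketch along these lines – assuming Apache with mod_rewrite, and with domain.com standing in for your real host – forces a single https, non-www entry point so that every crawled URL resolves without extra hops:

RewriteEngine On
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} ^www\. [NC]
RewriteRule ^ https://domain.com%{REQUEST_URI} [R=301,L]

Treat it as a sketch to adapt rather than a drop-in rule; hosts differ in how they expose HTTPS to Apache.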

I don’t know how or why the XML Sitemap Generator sees /index.php when it should not, but you can keep those URLs out of your sitemap by excluding them, as discussed above. This should ensure a clean sitemap for Google, which is what I am seeing for my own site. Then you’ll need to wait for Google to update its indexed pages, which may take anywhere between 24 hours and 14 years.

It’s important to note that the gallery is a part of an existing website – Backlight is not the core of the site. The primary index page for the domain is a .htm file. I have never submitted “index.php”. Google is finding it on its own.

XML sitemap does not include index.php.

Just so I am sure we are in sync:

Are you only interested in what happens with the sitemap file and not what happens in Google Search Console?

Is it out of the question to implement the canonical fix I’ve mentioned? My thinking is that if you did this, the user would not have to worry about generating a sitemap file, and the Google Search Console reports would be much cleaner and easier to use. You also would not have to worry about what is in robots.txt. Otherwise, users would want to know that satisfying Google depends on these peripheral files being properly deployed.

Many thanks for taking the time to look at this.

Perry J.

There is no “canonical fix” to implement. We’re using canonical URLs on all pages, and they are populated appropriately, based upon how one is navigating the site.

Last night we found one place where Backlight writes /index.php URLs onto the page, and we’ll be fixing it in the next update. Apart from that, ensure your site is linking to Backlight with correctly formatted paths (e.g. no /index.php).

And the robots.txt file is entirely optional. No place in our materials do we imply it is anything else. Mostly it’s just been me messing around with it, and the documentation exists because others might find it interesting or useful.

Thanks, Matthew.

Just to clarify: if you were to use the canonical non-dynamic URL in a dynamic page, it would mess things up? As I read it from a number of SEO sources, the dynamic URL should point to the non-dynamic one. Here are the notes from Moz:

If someone’s site isn’t configured for “clean URLs”, then the address format that you’re asking for would always be 404, and would utterly break SEO for that site.

If we had to make the canonical URLs static, then the only safe way to do so would be to use the ?id= format URLs, which are the ones you do not want.

If I were setting this up for a singular website – just my own – then yeah, I could do that. But Backlight is a market product, capable of running in a variety of unknown environments, mostly for users who would rather not be bothered with any of this.

“If someone’s site isn’t configured for “clean URLs”, then the address format that you’re asking for would always be 404, and would utterly break SEO for that site.”

I’m not sure I understand. For example, if your page:

https://campagna.photography/galleries/nature-outdoors/single.php?id=mc-20200924-3185

Had this for a canonical:

https://campagna.photography/galleries/nature-outdoors/mc-20200924-3185-single.php

It would break SEO? Or break the site?

What does “clean URL” mean in this context?

Matt – I want you to know how much I appreciate this conversation. Unfortunately, I’m in that minority who cares about SEO to the degree it affects the revenue stream for the site. Backlight works really well in my SEO-optimized environment. So well, in fact, that I’m only a few spots shy of the #1 position for my primary keyphrases. One might say, “why isn’t that good enough?” The answer is that it makes a substantial difference in bottom-line revenue – hence my motivation to understand.

“Clean URLs” means not the ?id= form of the URL.

When I visit the queried version of the address –

https://campagna.photography/galleries/nature-outdoors/single.php?id=mc-20200924-3185

– I see exactly that same address as canonical.

When I visit the clean version of the address –

https://campagna.photography/galleries/nature-outdoors/mc-20200924-3185-single.php

that is what I see as canonical.
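In markup terms, that means the head of each page carries a canonical tag matching the form of the address you arrived on – roughly the following, written out by hand here rather than copied from Backlight’s source.

On the queried page:

<link rel="canonical" href="https://campagna.photography/galleries/nature-outdoors/single.php?id=mc-20200924-3185">

On the clean page:

<link rel="canonical" href="https://campagna.photography/galleries/nature-outdoors/mc-20200924-3185-single.php">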

This is the desired behavior: if a site is not using clean URLs, then the ?id= formatted URLs are correctly used as canonical. The ?id= URLs are the “natural” URLs, while the “clean” URLs are .htaccess-mapped. That means there is no case where the ?id= version of a URL should not work; it’s just a matter of whether the clean version is discoverable, which depends on your .htaccess file being present and properly configured for clean URLs.
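To illustrate what “.htaccess-mapped” means, a rewrite along these lines – a hedged sketch, not Backlight’s actual rule – would serve the clean single-image URL from the natural ?id= one:

RewriteEngine On
RewriteRule ^galleries/([^/]+)/([^/]+)-single\.php$ galleries/$1/single.php?id=$2 [L,QSA]

If that rule is missing or mod_rewrite is unavailable, the clean address 404s while the ?id= address keeps working – which is exactly why the ?id= form is the safe fallback.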

If you provide Google with the correct form of your address, via sitemap or in whatever way, then it should consistently crawl your site using that form of URLs.
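For example, a sitemap entry for the page above would carry only the clean form of the address – a hand-written illustration, not generator output:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://campagna.photography/galleries/nature-outdoors/mc-20200924-3185-single.php</loc>
  </url>
</urlset>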

As I said, we found one bug on the single-image pages which causes the crawler to re-index pages using the ?id= format of the address, which we are fixing in the 4.0.2 release. In the meantime, you can exclude the index.php addresses from your sitemap as previously discussed, and that should suffice.

Can’t thank you enough for the clarity. That makes sense. I get it now.

I am going to stop blocking the “?id=” parameter in robots.txt and will stop blocking index.php once 4.0.2 is released.

I will continue to block them both in the sitemap generator I use just to be sure I don’t submit them.

Thanks again, Matt!!!

Perry J.

For the record in this conversation, I am pleased to report that the 4.0.2 release, as posted by Matt, cleaned up the issues under discussion. I am no longer blocking anything in robots.txt.

Many thanks to Matt and the Backlight team. Great work.

Perry J.
