We Blocked Pages in robots.txt. Why Are They Still Showing in Google?

DANA: Here’s one that comes in as a complaint, not a question. A language-learning platform blocked a batch of low-quality pages in robots.txt to keep them out of Google. Weeks later, those URLs are still showing up in search results, just with no description. The developer is convinced robots.txt is broken. It isn’t. So what do we tell them, and what should they have done?

ELENA: The tool did exactly what it’s built to do, it just isn’t the tool they wanted. A robots.txt disallow controls crawling, not indexing. It tells search engines “don’t fetch this page.” It does not tell them “don’t list this URL.” Those are two different things, and the developer collapsed them into one. So the pages aren’t crawled, which is why there’s no description, but the URLs can still appear, because indexing and crawling are separate stages.

HANNAH: And I can be firm on this because it’s documented behavior, not a guess. If a blocked URL is linked from somewhere search engines can see (an internal link, an external one, a sitemap), they can index that URL without ever fetching it. They know it exists, they just can’t read it. That’s the no-description listing the developer is staring at. The cruel irony is that the block is the reason it looks broken, because it stops the crawler from reading anything on the page, including the one instruction that would actually remove it.
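
To make the crawl-only scope concrete, here is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt content and the example.com URLs are made up for illustration, and this parser does simple prefix matching, so it may not mirror every search engine's rule handling. The point is what the file can express: it answers "may this URL be fetched?" and nothing else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the /drafts/ path is an assumption for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if the rules allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


blocked = is_crawlable(ROBOTS_TXT, "https://example.com/drafts/lesson-1")
allowed = is_crawlable(ROBOTS_TXT, "https://example.com/courses/spanish")

# The answer is only ever about fetching. There is no rule here that says
# "do not list this URL", which is why a blocked URL can still appear.
print(blocked, allowed)  # False True
```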

MARCUS: Let me take the developer’s side, because the confusion is honest. From the outside, “block in robots.txt” sounds exactly like “hide from Google.” The names practically invite the mistake. So they’re not being lazy, they reached for the tool whose label matched their goal. The real failure is that the label and the function don’t line up. Fine. But that’s an explanation, not an excuse to leave it broken. What removes the page, actually?

SOFIA: And there’s a trap in the obvious fix that we have to flag before anyone runs at it. The instruction that removes a page is noindex, usually a robots meta tag on the page itself, or an X-Robots-Tag response header. But noindex only works if the crawler can fetch the page and read that directive. If the page is still blocked in robots.txt, the crawler never fetches it, never sees the noindex, and the page stays listed. So the disallow silently defeats the noindex. People add noindex, leave the disallow in place, and wonder why nothing changes.
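
The trap can be checked mechanically. The sketch below is one way to do it, assuming you already have the robots.txt text and the page's HTML in hand. The names diagnose and RobotsMetaParser are hypothetical, and it only looks at the robots meta tag, not the X-Robots-Tag header.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def diagnose(robots_txt: str, url: str, html: str, agent: str = "*") -> str:
    """Classify a page by its crawl dial (robots.txt) and index dial (noindex)."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    crawlable = robots.can_fetch(agent, url)

    meta = RobotsMetaParser()
    meta.feed(html)
    noindex = "noindex" in meta.directives

    if not crawlable and noindex:
        return "conflict: noindex is present but the crawler is blocked from reading it"
    if not crawlable:
        return "blocked: not crawled, but the bare URL can still be listed"
    if noindex:
        return "ok: crawlable with noindex, should drop after a recrawl"
    return "crawlable and indexable"
```

Run against a disallowed URL whose HTML carries a noindex meta tag, this returns the "conflict" result, which is exactly the state Sofia describes.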

NOAH: There’s the pattern, and it’s the same shape we keep seeing. The developer applied one tool as a blanket solution without checking what it actually does, the same way “self-canonical everything” or “delete it all” showed up. The tell here is treating two separate mechanisms, crawl control and index control, as one switch. And if they made this mistake on this batch, the same disallow-as-hide assumption is probably sitting in the robots.txt rules for other sections too. Worth checking the whole file, not just these URLs.
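
For the whole-file check, a small helper can pull every Disallow rule out for review. This is a rough sketch, not a full robots.txt parser: list_disallow_rules is a hypothetical name, it groups rules by the User-agent lines above them, and it ignores fields such as Sitemap. Each rule it returns still needs a human to ask whether the intent was to stop crawling or to hide the URL.

```python
def list_disallow_rules(robots_txt: str) -> list:
    """Return (user_agent, path) pairs for every non-empty Disallow rule."""
    rules = []
    agents = []
    in_rules = False  # True once the current group has seen an Allow/Disallow line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                agents = []
                in_rules = False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value:
                rules.extend((agent, value) for agent in agents)
    return rules
```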

GRACE: Small but real content angle. Those no-description listings aren’t neutral, they’re a bad look. A user searching sees the platform’s URL with a blank or a “no information available” line under it. That’s worse for trust than either a proper listing or no listing at all. So this isn’t only a technical tidiness issue. Every day those naked URLs sit in results, they’re a slightly broken-looking storefront.

PRIYA: And the strategic question nobody’s asked yet: why are these pages being hidden at all? If they’re genuinely low value, the cleaner long-term answer might not be hiding them, it might be the same keep-improve-merge-remove judgment we’d apply to any weak page. Suppressing a page from search is sometimes right: a thank-you page, an internal search result, a duplicate. But “low quality” usually isn’t a hiding problem, it’s a quality problem. Reaching for noindex on content that should either be fixed or deleted just sweeps it under the rug.

THEO: Process layer, and the reusable test is clean here. For any page you don’t want in search, ask two questions in order. First, do you want it crawled? Second, do you want it indexed? Those are separate dials. If you want it gone from results, you leave it crawlable and add noindex, and you wait for it to be recrawled before the listing drops. If you truly never want it fetched, that’s robots.txt, but accept that it can still be listed as a bare URL. The mistake is always conflating the two dials into one.
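
Theo's two questions can be written down as a tiny lookup. The function name and the wording of the results are assumptions for illustration, a sketch of the test rather than an official rule set.

```python
def recommend(want_crawled: bool, want_indexed: bool) -> str:
    """Map the two dials (crawl, index) to the tool that matches."""
    if want_crawled and want_indexed:
        return "no directive needed: leave the page crawlable and indexable"
    if want_crawled and not want_indexed:
        return "leave crawlable, add noindex, wait for a recrawl"
    if not want_crawled and not want_indexed:
        return "robots.txt disallow, accepting the bare URL can still be listed"
    # Not crawled but indexed: nothing can be read, so only a bare URL can appear.
    return "contradictory goal: an uncrawled page can only be listed as a bare URL"


# The developer's actual goal was "gone from results":
print(recommend(want_crawled=True, want_indexed=False))
```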

AIKO: Systems note. robots.txt and indexation directives are exactly the kind of thing that gets set once and misunderstood forever, because the consequences show up weeks later and look like a different problem. The durable fix is a short internal reference that defines crawl control and index control in one line each, and a rule that any robots.txt change gets paired with the question “are we trying to block crawling or block indexing?”, because those answers point at completely different tools.

MARCUS: Closing my loop. Their goal, keep junk pages out of search, is legitimate. We honor it. But there’s no tradeoff to weigh on the method, because the method they used can’t achieve the goal, by design, and currently makes it look worse. So this isn’t “robots.txt versus noindex, pick one.” It’s “you used the crawl dial to do an index job.” We unblock the pages so the crawler can reach them, add noindex so they actually drop, and confirm removal once they’re recrawled.

DANA: That’s the answer, and the sequence matters because getting it backwards is how people stay stuck. We tell the developer robots.txt isn’t broken, it controls crawling, and the listings they’re seeing are URLs Google knows about but was told not to read. To actually remove them, the order is: remove the robots.txt block first so the page can be fetched, then add a noindex tag to the page, then wait for a recrawl, at which point the listing drops. Adding noindex while the block is still in place is the one move that guarantees nothing happens, so we call that out explicitly. We check the rest of the robots.txt for the same assumption, because Noah’s right, it’s probably not isolated. And we ask Priya’s question on the batch itself, whether these pages should be hidden, fixed, or removed, because hiding low quality is often the wrong frame. The instinct to clean up search results was right. The dial they turned was the wrong one.

HANNAH: Which is the honest version of “it’s not broken.” The tool worked perfectly. It just wasn’t the tool that does what they wanted.
