9 min read

How to Create and Maintain a robots.txt File

By Jason Gilmore
robots.txt, SEO, search engine crawlers, web crawlers, Googlebot, site indexing, crawl directives
Learn how to create, configure, and maintain a robots.txt file that guides search engines while protecting sensitive areas of your website. Includes examples, common mistakes, and monitoring tips.

TL;DR: A robots.txt file tells search engine crawlers which parts of your website they can and cannot access. Place it at your domain root (example.com/robots.txt), use User-agent lines to target specific crawlers, and use Disallow or Allow directives to control access. Always test changes before deploying, and monitor your robots.txt to catch unauthorized modifications that could hurt your SEO.

Every day, search engines like Google, Bing, and DuckDuckGo send automated programs called "crawlers" or "bots" to explore and index websites. Your robots.txt file is essentially a set of instructions for these visitors, telling them which areas of your site are open for business and which are off-limits.

What is robots.txt?

Robots.txt is a plain text file that sits at the root of your website and follows the Robots Exclusion Protocol. When search engine crawlers visit your site, they check for this file first to understand your crawling preferences. It's not a security mechanism (crawlers can ignore it), but reputable search engines respect these directives, making it an essential tool for controlling how your site appears in search results.

Why robots.txt Matters for Indie Hackers

You might wonder if a simple text file really matters for your SaaS or side project. It does, for several important reasons.

Without proper robots.txt configuration, search engines might index pages you don't want public, like admin panels, staging environments, or duplicate content pages. This can confuse users who find these pages in search results and potentially expose information you'd rather keep private.

Search engines allocate limited resources to crawling each site, a concept called "crawl budget." A well-configured robots.txt ensures crawlers spend their time on your important pages rather than wasting resources on utility pages, infinite calendar archives, or parameter variations that create duplicate content.

Accidentally exposing staging URLs or unfinished features in search results is embarrassing at best and potentially harmful at worst. While robots.txt isn't a security measure (anyone can still access those URLs directly), it provides a first layer of protection against accidental indexing.

For sites on limited hosting plans, reducing unnecessary crawling also decreases server load from bot traffic. When Googlebot isn't hammering your server trying to index every URL parameter variation, your real users get better performance.

How to Create a robots.txt File

Creating the File

Create a plain text file named exactly robots.txt, all lowercase, with no variations. The file must live at the root of the host you want to control, so it's accessible at https://example.com/robots.txt (or https://www.example.com/robots.txt if your site is served from the www subdomain). Placing it anywhere else, such as in a subdirectory or with different capitalization, means crawlers won't find it.
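
Once the file is deployed, it's worth confirming it's actually reachable at the root. Here's a minimal sketch using Python's standard library; the URL is a placeholder for your own domain, and a 404 (raised as an HTTPError) tells you the file isn't where crawlers expect it.

# Confirm robots.txt is served from the domain root.
# The URL is a placeholder; replace it with your own domain.
import urllib.request

url = "https://example.com/robots.txt"
with urllib.request.urlopen(url, timeout=10) as response:
    print("Status:", response.status)  # expect 200
    print("Content-Type:", response.headers.get("Content-Type"))  # usually text/plain
    print(response.read().decode("utf-8", errors="replace"))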

Understanding the Basic Syntax

A robots.txt file consists of one or more "records" that specify rules for crawlers. Each record needs a User-agent line specifying which crawler the rules apply to, followed by directive lines with the actual rules.

The simplest possible robots.txt that allows all crawlers to access everything looks like this:

User-agent: *
Allow: /

And here's one that blocks all crawlers from your entire site:

User-agent: *
Disallow: /

Writing Your Rules

To block specific directories from crawling:

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/

To block specific file types (the * wildcard and the $ end-of-URL anchor are supported by major crawlers such as Googlebot and Bingbot, though some bots ignore them):

User-agent: *
Disallow: /*.pdf$
Disallow: /*.doc$

To allow a subdirectory within a blocked directory:

User-agent: *
Disallow: /api/
Allow: /api/public/

To target specific crawlers with different rules:

# Rules for Google (note: Googlebot ignores the Crawl-delay directive)
User-agent: Googlebot
Disallow: /internal/

# Rules for Bing (Bingbot honors Crawl-delay)
User-agent: Bingbot
Disallow: /internal/
Crawl-delay: 2

# Rules for everyone else
User-agent: *
Disallow: /internal/

Adding Your Sitemap

Include a reference to your XML sitemap in your robots.txt; by convention it goes at the end, though the Sitemap directive is valid anywhere in the file and must use the full absolute URL. This helps search engines discover your important pages quickly:

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

Testing and Deploying

Before deploying changes, validate your syntax with a robots.txt testing tool, such as the robots.txt report in Google Search Console or a standalone validator. Check specific URLs to confirm they're blocked or allowed as intended, then deploy to production and verify by visiting your robots.txt directly in a browser.
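
If you'd rather test from a script, Python's standard library ships a Robots Exclusion Protocol parser. The sketch below is a rough sanity check rather than a substitute for Google's own tooling, since urllib.robotparser handles plain Allow/Disallow prefix rules but not Google-style * and $ wildcards; the domain and paths are placeholders.

# Sanity-check which paths a given crawler may fetch under your live rules.
# Caveat: urllib.robotparser does not understand * or $ wildcard patterns.
import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # placeholder domain
parser.read()

for path in ("/pricing/", "/admin/", "/api/public/status"):
    url = "https://example.com" + path
    verdict = "allowed" if parser.can_fetch("Googlebot", url) else "blocked"
    print(verdict + ": " + url)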

robots.txt Best Practices

Be specific with paths by using trailing slashes for directories. Writing /admin/ rather than /admin ensures you're blocking the directory and its contents rather than potentially blocking files that happen to start with "admin" in their name.

When multiple rules could apply to the same URL, most crawlers use the most specific matching rule. However, placing more specific rules before general ones makes your file easier to read and ensures predictable behavior across different crawlers.

Add comments with # to document why certain rules exist. When you come back to this file in six months, you'll appreciate knowing why you blocked that particular path:

# Block admin area from all crawlers
User-agent: *
Disallow: /admin/

# Block API documentation (internal use only)
Disallow: /api/docs/

Never block CSS and JavaScript files. Search engines need these to render your pages properly and understand your content. Blocking them actively hurts your rankings.

Keep your robots.txt simple. Complex files are hard to maintain and easy to misconfigure. Start with minimal rules and add more only when you have a specific need.

Common robots.txt Mistakes to Avoid

Using robots.txt for security is a fundamental misunderstanding of what it does. The file is publicly accessible (anyone can view it), and it only works on crawlers that choose to respect it. Malicious bots, scrapers, and security scanners typically ignore robots.txt completely. Never rely on it to protect sensitive data. Use authentication instead.

Accidentally blocking your entire site happens more often than you'd think. A single misplaced Disallow: / without the right User-agent context can remove your entire site from search results. Always test changes carefully.

Each subdomain needs its own robots.txt file. Rules for example.com don't apply to api.example.com or blog.example.com. If you have subdomains that need crawl control, create separate robots.txt files for each.

Never skip testing after making changes. A stray typo or malformed rule can cause crawlers to misread or ignore parts of your file, and the results might not be obvious until you notice a drop in search traffic weeks later.

Blocking images, CSS, or JavaScript prevents search engines from rendering your pages correctly. Modern search engines need to see your pages as users do, so blocking these resources actively hurts your rankings.

Outdated rules create confusion and potential issues. Old rules blocking pages that no longer exist clutter your file, while rules that should have been updated to block new sensitive pages leave gaps in your configuration.

robots.txt Examples for Common Scenarios

Basic Website

User-agent: *
Allow: /

# Block utility pages
Disallow: /search
Disallow: /filter
Disallow: /*?*sort=
Disallow: /*?*page=

Sitemap: https://example.com/sitemap.xml

SaaS Application

User-agent: *

# Allow marketing pages
Allow: /
Allow: /features/
Allow: /pricing/
Allow: /blog/

# Block application routes
Disallow: /app/
Disallow: /dashboard/
Disallow: /api/
Disallow: /admin/

# Block authentication pages
Disallow: /login
Disallow: /register
Disallow: /password-reset

Sitemap: https://example.com/sitemap.xml

E-commerce Site

User-agent: *
Allow: /

# Block checkout and cart
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/

# Block filtered and sorted pages (duplicate content)
Disallow: /*?*filter=
Disallow: /*?*sort=
Disallow: /*?*color=
Disallow: /*?*size=

# Allow category and product pages
Allow: /products/
Allow: /categories/

Sitemap: https://example.com/sitemap.xml

WordPress Site

User-agent: *
Allow: /

# Block WordPress admin
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# Block common WordPress utility paths
# (don't block /wp-includes/; it serves scripts and styles crawlers need to render pages)
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/
Disallow: /*?replytocom=
Disallow: /*?s=
Disallow: /*?p=
Disallow: /tag/*/page/
Disallow: /author/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/post-sitemap.xml
Sitemap: https://example.com/page-sitemap.xml

How to Monitor Your robots.txt

Your robots.txt file is critical for SEO, and unauthorized changes can have devastating effects. Imagine if a misconfigured deployment or a compromised server changed your robots.txt to block Googlebot. Your traffic could disappear overnight with no obvious cause.

Regular monitoring should verify that the file exists at the correct URL, check for unexpected changes to the content, test that important pages remain crawlable, and review Search Console for crawl errors related to blocked resources.
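
A minimal version of that check can be scripted. The sketch below assumes you keep a known-good copy of the file alongside the script; the URL and baseline path are placeholders, and a real setup would run on a schedule and send an alert rather than print.

# Compare the live robots.txt against a stored known-good copy.
# ROBOTS_URL and BASELINE are placeholders for your own domain and baseline file.
import hashlib
import urllib.request
from pathlib import Path

ROBOTS_URL = "https://example.com/robots.txt"
BASELINE = Path("robots.baseline.txt")

with urllib.request.urlopen(ROBOTS_URL, timeout=10) as response:
    live = response.read()

if hashlib.sha256(live).hexdigest() != hashlib.sha256(BASELINE.read_bytes()).hexdigest():
    print("WARNING: robots.txt differs from the known-good baseline")
else:
    print("robots.txt matches the baseline")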

How SecurityBot Helps with robots.txt

SecurityBot monitors your robots.txt file continuously and alerts you when something changes. You get change detection alerts when your robots.txt is modified, diff reports showing exactly what changed, missing file alerts if your robots.txt disappears entirely, and SEO impact warnings if changes could hurt your search visibility.

One small change to robots.txt can dramatically impact your search rankings. Don't leave it unmonitored.

Start your free 14-day trial - all features included, no credit card required.

Frequently Asked Questions

Does robots.txt stop all bots?

No. Robots.txt only works on crawlers that choose to respect it. Malicious bots, scrapers, and security scanners typically ignore robots.txt completely. For security, use authentication and access controls.

How long until Google sees my robots.txt changes?

Google typically checks robots.txt every 24-48 hours, but it can take longer. You can request a refresh in Google Search Console if you need faster updates.

Should I block AI crawlers?

This is your choice. If you want to prevent AI training on your content, you can add rules for common AI crawlers:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

What's the difference between Disallow and noindex?

Disallow in robots.txt prevents crawling, meaning the search engine won't even visit the page. A noindex meta tag (for example, <meta name="robots" content="noindex"> in the page's head) allows crawling but tells search engines not to include the page in results. Note that noindex only works if the page isn't blocked by robots.txt, since the crawler has to fetch the page to see the tag. Conversely, if a page is linked from other sites, search engines might still index a Disallowed URL based on those external links, even without crawling it.

Can I use robots.txt to remove pages from Google?

Not effectively. Robots.txt prevents crawling, not indexing. If a page is already indexed, blocking it won't remove it from search results. Use the noindex meta tag or Google Search Console's URL removal tool instead.


Last updated: January 2026 | Written by Jason Gilmore, Founder of SecurityBot

Published on January 23, 2026