How to Create a Robots.txt File

Your robots.txt file tells search engine crawlers which pages they're allowed to visit. Get it right and you direct crawl budget to your best content. Get it wrong — one bad line — and you can accidentally block Google from crawling your entire site. It happens more often than you'd think.

What Is a Robots.txt File?

Robots.txt is a plain text file placed at the root of your website (e.g., https://yourdomain.com/robots.txt). It uses the Robots Exclusion Protocol to communicate crawling instructions to search engine bots.

A basic robots.txt looks like this:

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml

Every major search engine checks robots.txt before crawling a site. The file is public — anyone can view it by visiting yourdomain.com/robots.txt. It is a directive, not a security measure: it tells crawlers what to skip, but it doesn't prevent access by bad actors.

Key directives:

  • User-agent: — specifies which bot the rules apply to (* means all bots)
  • Disallow: — blocks the specified path from being crawled
  • Allow: — explicitly permits a path (used to override a broader Disallow)
  • Sitemap: — tells bots where to find your XML sitemap
Why It Matters for SEO

  • Crawl budget: Google allocates each site a crawl budget, roughly the number of URLs Googlebot will crawl in a given period. Blocking low-value pages (login pages, admin areas, duplicate content) frees budget for pages you want ranked.
  • Preventing duplicate content: Pages like ?sort=price, ?ref=partner, or print versions of pages can be blocked in robots.txt so they don't dilute your content quality signals.
  • Protecting sensitive areas: Admin dashboards, staging subdirectories, and user account pages shouldn't be crawled. Robots.txt keeps bots out of them; pair it with noindex or server authentication if they must stay out of search results entirely.
  • Sitemap discovery: The Sitemap: directive in robots.txt is one of the ways Google discovers your sitemap, even if you haven't submitted it manually in Search Console.
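As an illustrative sketch, rules like these steer crawl budget away from parameter and utility URLs (the paths are placeholders; the * wildcard in paths is supported by Google and Bing, though it isn't part of the original protocol):

```txt
User-agent: *
# Block sorted/tracking duplicates of real pages
Disallow: /*?sort=
Disallow: /*?ref=
# Block utility pages that waste crawl budget
Disallow: /login/
Disallow: /cart/
Sitemap: https://yourdomain.com/sitemap.xml
```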
How to Check Your Robots.txt

    Clarity SEO's free Robots Generator creates a valid, well-structured robots.txt for your site and checks your existing file for common errors.

    → Generate your robots.txt with Clarity SEO

    The full Report Card also checks for a missing or misconfigured robots.txt as part of its 29-point audit.

    → Get your free SEO Report Card

    You can also view your current robots.txt by visiting https://yourdomain.com/robots.txt in your browser.

    How to Fix It

    For HTML/Generic

    Step 1: Create the file.

    Create a plain text file named robots.txt (lowercase, no extension). It must live at the root of your domain — not in a subfolder.

    Step 2: Write the rules.

    Here's a solid starting template for most websites:

# Allow all crawlers to access the entire site
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /login/
Disallow: /private/
Disallow: /staging/
Allow: /wp-admin/admin-ajax.php

# Block specific bots you don't want
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

# Point to your sitemap
Sitemap: https://yourdomain.com/sitemap.xml

    Step 3: Upload the file.

    Upload robots.txt to your site's root directory via FTP, SFTP, or your hosting file manager. Verify it's live by visiting https://yourdomain.com/robots.txt.

    Step 4: Test it.

Use the robots.txt report in Google Search Console (under Settings) to verify that Google can fetch and parse your file. (Google retired the standalone robots.txt Tester in 2023; the report is its replacement.)
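You can also sanity-check rules offline with Python's standard library. A minimal sketch (the rules and URLs are illustrative, not your real file):

```python
from urllib.robotparser import RobotFileParser

# Rules as they would appear in robots.txt.
# Note: urllib.robotparser applies rules in order, so the narrow
# Allow is listed before the broader Disallow.
rules = [
    "User-agent: *",
    "Allow: /admin/public/",
    "Disallow: /admin/",
]

rp = RobotFileParser()
rp.parse(rules)

# Ask whether a given user agent may fetch a given URL
print(rp.can_fetch("*", "https://example.com/admin/settings"))     # False: blocked
print(rp.can_fetch("*", "https://example.com/admin/public/page"))  # True: allowed
print(rp.can_fetch("*", "https://example.com/blog/post"))          # True: allowed
```

Be aware that Python's parser resolves Allow/Disallow conflicts by rule order, while Googlebot uses the most specific (longest) matching rule, so keep narrow Allow rules above broad Disallow rules if you test this way.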

    For WordPress

WordPress automatically serves a virtual robots.txt if no physical file exists in the root directory. The default allows all crawlers except /wp-admin/ (with an exception for admin-ajax.php).

    Method 1: Plugin (recommended)

    Using Yoast SEO:

  • Go to Yoast SEO → Tools → File editor.
  • Edit the robots.txt file directly in the browser.
  • Save — Yoast writes the physical file.
    Using Rank Math:

  • Go to Rank Math → General Settings → Edit robots.txt.
  • Edit and save.

    Method 2: Manual upload

  • Create your robots.txt file locally.
  • Upload it to your WordPress root directory (same folder as wp-config.php) via FTP or your hosting control panel.
  • A physical file takes precedence over WordPress's virtual one.
    ⚠️ Critical warning: In WordPress → Settings → Reading, there is a checkbox: "Discourage search engines from indexing this site". If this is checked, WordPress tells search engines to stay away (recent versions inject a noindex robots meta tag; older versions added Disallow: / to the virtual robots.txt), effectively removing your site from search. Check this setting first if your site isn't being indexed.

    For Shopify

    Shopify automatically generates a robots.txt file for your store. You can customise it using a robots.txt.liquid template:

  • Go to Online Store → Themes → Edit code.
  • Under Templates, look for robots.txt.liquid.
  • If it doesn't exist, click Add a new template → robots.txt.
  • Add custom rules using Liquid:
{% comment %} Reproduce Shopify's default rules, then append custom ones {% endcomment %}
{% for group in robots.default_groups %}
  {{- group.user_agent }}
  {% for rule in group.rules %}
    {{ rule.directive }}: {{ rule.value }}
  {% endfor %}
  {%- if group.user_agent.value == '*' %}
    # Custom additions
    Disallow: /collections/*?sort_by=
    Disallow: /search?
  {%- endif %}
  {%- if group.sitemap != blank %}
    {{ group.sitemap }}
  {%- endif %}
{% endfor %}

    For Wix / Squarespace / Webflow

    Wix: Wix manages robots.txt automatically. Custom robots.txt editing requires a Business or higher plan. Go to Settings → SEO → Robots.txt to customise.

Squarespace: Squarespace generates its own robots.txt and does not allow custom editing on standard plans. To monitor how your site is crawled, connect it to Google Search Console and use per-page SEO settings to control indexing.

    Webflow: Webflow generates robots.txt automatically. Custom robots.txt is only available on paid hosting plans. Edit via Project Settings → SEO → Robots.txt.

    Common Mistakes to Avoid

  • `Disallow: /` with no Allow rules: This blocks all crawlers from your entire site, one of the most catastrophic SEO mistakes possible. Always verify your file doesn't contain this line unless it's intentional.
  • Forgetting the Sitemap directive: Without a Sitemap: line, Google must find your sitemap through Search Console alone, a wasted opportunity.
  • Using robots.txt as a security tool: Robots.txt is publicly visible. It tells crawlers to skip a path, but it doesn't prevent access. Use server authentication for sensitive files.
  • Blocking CSS and JavaScript: Blocking /wp-content/ or similar directories prevents Google from rendering your pages properly, which hurts mobile usability assessments and Core Web Vitals scoring.
  • Case sensitivity errors: Robots.txt path matching is case-sensitive. Disallow: /Admin/ does not block /admin/; they are treated as different paths.
  • Not testing after changes: Always validate in Google Search Console's robots.txt report after any changes.
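The case-sensitivity pitfall is easy to demonstrate with Python's built-in parser (the paths here are illustrative):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /Admin/"])

# Matching is case-sensitive: /Admin/ is blocked, /admin/ is not
print(rp.can_fetch("*", "https://example.com/Admin/panel"))  # False
print(rp.can_fetch("*", "https://example.com/admin/panel"))  # True
```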
FAQ

    What is a robots.txt file?

A robots.txt file is a plain text file at the root of a website that tells search engine crawlers which pages or sections they are allowed to crawl. It follows the Robots Exclusion Protocol standard.

    Does robots.txt prevent pages from appearing in Google?

    Robots.txt prevents crawling, but not always indexing. If other sites link to a blocked page, Google can still index it (showing a URL with no description). To prevent indexing, use a noindex meta tag — and don't block those pages in robots.txt, because Googlebot won't be able to read the noindex tag on a blocked page.
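For reference, the tag goes in the page's <head>; a minimal example:

```html
<meta name="robots" content="noindex">
```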

    Where should the robots.txt file be located?

    The robots.txt file must be in the root directory of your domain — https://yourdomain.com/robots.txt. A robots.txt at https://yourdomain.com/blog/robots.txt is not valid and will be ignored by crawlers.

    Can I have different rules for different search engines?

    Yes. Use separate User-agent: blocks for each bot. For example, User-agent: Googlebot applies rules only to Google's crawler. User-agent: * applies to all crawlers not covered by a specific rule.
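A sketch of per-bot blocks (the paths are illustrative). Note that a crawler obeys only the most specific group that matches it, so Googlebot below follows its own group and ignores the * group entirely:

```txt
# Applies only to Google's crawler
User-agent: Googlebot
Disallow: /drafts/

# Applies to every other crawler
User-agent: *
Disallow: /admin/
```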

    What happens if I don't have a robots.txt file?

    Crawlers will crawl your entire site by default. Most crawlers check for robots.txt first — if none exists, they proceed without restrictions. This isn't inherently bad, but it means you're not directing crawl budget and potentially wasting it on admin, login, and cart pages.

    Summary

    A well-structured robots.txt file takes less than ten minutes to create and helps Google spend its crawl budget on the pages that actually matter for your rankings. Block admin areas, duplicate parameter pages, and staging content — and always point to your sitemap.

    Generate a clean, validated robots.txt file now with the free Clarity SEO tool.

    → Get your free SEO Report Card

    Related Tools