How to Create a Robots.txt File
Summary: To create a robots.txt file, place a plain text file named "robots.txt" at the root of your domain (e.g., yourdomain.com/robots.txt) containing User-agent and Disallow directives that tell search engine crawlers which pages to skip. A well-configured robots.txt file helps you manage crawl budget, prevent duplicate content indexing, and ensure Google spends its limited crawl resources on the pages that actually matter for your rankings.
Your robots.txt file tells search engine crawlers which pages they're allowed to visit. Get it right and you direct crawl budget to your best content. Get it wrong — one bad line — and you can accidentally block Google from crawling your entire site. It happens more often than you'd think.
According to a ContentKing study, approximately 26% of websites have robots.txt configuration errors that could negatively impact their SEO performance. The most common mistake? Accidentally blocking entire sections of a site from being crawled — including CSS and JavaScript files that Google needs to render and evaluate mobile-friendliness.
What Is a Robots.txt File?
Robots.txt is a plain text file placed at the root of your website (e.g., https://yourdomain.com/robots.txt). It uses the Robots Exclusion Protocol — a standard that has been in use since 1994 — to communicate crawling instructions to search engine bots.
A basic robots.txt looks like this:
```
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
```

Every major search engine checks robots.txt before crawling a site. The file is public — anyone can view it by visiting yourdomain.com/robots.txt. It is a directive, not a security measure: it tells crawlers what to skip, but it doesn't prevent access by bad actors.
Key directives:
- `User-agent:` — specifies which bot the rules apply to (`*` means all bots)
- `Disallow:` — blocks the specified path from being crawled
- `Allow:` — explicitly permits a path (used to override a broader Disallow)
- `Sitemap:` — tells bots where to find your XML sitemap
- `Crawl-delay:` — requests a delay between requests (honoured by Bing, not Google)

How Robots.txt Actually Works (Technical Deep Dive)
Understanding the mechanics helps you avoid costly mistakes:
Order of matching: When multiple rules match a URL, Google uses the most specific rule. For example, if you have Disallow: /directory/ and Allow: /directory/page.html, Google will allow the specific page because the Allow rule is more specific.
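The precedence logic is easy to model. Below is a simplified sketch in Python (prefix matching only, no wildcards; not Google's actual implementation): among all matching rules, the longest pattern wins, and a tie goes to Allow.

```python
def is_allowed(path: str, rules: list) -> bool:
    """Sketch of Google's rule precedence. `rules` is a list of
    (directive, pattern) pairs, e.g. ('disallow', '/directory/').
    The longest matching pattern wins; ties go to 'allow'."""
    matches = [(len(pat), d == 'allow') for d, pat in rules if path.startswith(pat)]
    if not matches:
        return True  # no rule matches: crawling is allowed by default
    _, allowed = max(matches)  # longest pattern wins; (n, True) beats (n, False)
    return allowed

rules = [('disallow', '/directory/'), ('allow', '/directory/page.html')]
print(is_allowed('/directory/page.html', rules))  # True: Allow is more specific
print(is_allowed('/directory/other.html', rules))  # False: only Disallow matches
```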
Wildcard patterns: Google supports * (match any sequence) and $ (match end of URL) in robots.txt paths. This enables powerful rules:
```
# Block all PDF files
Disallow: /*.pdf$

# Block all URLs with query parameters
Disallow: /*?

# Block specific parameter patterns
Disallow: /*?sort=
Disallow: /*?ref=
Disallow: /*&page=
```

Caching behaviour: According to Google's official documentation, Googlebot caches your robots.txt file for up to 24 hours. Changes you make won't take effect immediately. If the file becomes unreachable (returns a 5xx error), Google continues using the cached version for up to 30 days before treating the site as having no robots.txt restrictions.
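If you want to reason about exactly what a wildcard pattern matches, you can translate it to a regular expression. This is a sketch of the matching semantics (`*` = any character sequence, `$` = end of URL), not Google's parser:

```python
import re

def pattern_to_regex(pattern: str):
    """Translate a robots.txt path pattern into an anchored regex:
    '*' matches any character sequence, '$' anchors the end of the URL."""
    out = []
    for ch in pattern:
        if ch == '*':
            out.append('.*')
        elif ch == '$':
            out.append('$')
        else:
            out.append(re.escape(ch))  # treat everything else literally
    return re.compile(''.join(out))

pdf_rule = pattern_to_regex('/*.pdf$')
print(bool(pdf_rule.match('/files/report.pdf')))       # True: ends in .pdf
print(bool(pdf_rule.match('/report.pdf?download=1')))  # False: $ anchor fails
```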
File size limit: Google enforces a maximum robots.txt file size of 500 KiB (512,000 bytes). Content beyond this limit is ignored. For large sites with complex rules, consolidate directives or use patterns instead of listing individual URLs.
Why It Matters for SEO
- Crawl budget: parameterised duplicates such as `?sort=price`, `?ref=partner`, or print versions of pages can be blocked in robots.txt to prevent them from diluting your content quality signals.
- Sitemap discovery: the `Sitemap:` directive in robots.txt is one of the ways Google discovers your sitemap — even if you haven't submitted it manually in Search Console.

How to Check Your Robots.txt
Clarity SEO's free Robots Generator creates a valid, well-structured robots.txt for your site and checks your existing file for common errors.
→ Generate your robots.txt with Clarity SEO
The full Report Card also checks for a missing or misconfigured robots.txt as part of its 29-point audit.
→ Get your free SEO Report Card
You can also view your current robots.txt by visiting https://yourdomain.com/robots.txt in your browser.
How to Create and Configure Your Robots.txt
For HTML/Generic
Step 1: Create the file.
Create a plain text file named exactly robots.txt (all lowercase). It must live at the root of your domain — not in a subfolder.
Step 2: Write the rules.
Here's a solid starting template for most websites:
```
# Allow all crawlers to access the entire site
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /login/
Disallow: /private/
Disallow: /staging/
Allow: /wp-admin/admin-ajax.php

# Block specific bots you don't want
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

# Block AI crawlers (optional — depends on your preference)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Point to your sitemap
Sitemap: https://yourdomain.com/sitemap.xml
```

Step 3: Upload the file.
Upload robots.txt to your site's root directory via FTP, SFTP, or your hosting file manager. Verify it's live by visiting https://yourdomain.com/robots.txt.
Step 4: Test it.
Use Google Search Console's robots.txt report (the successor to the old standalone Robots.txt Tester) to verify your rules work as intended before Googlebot encounters them.
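You can also sanity-check rules locally with Python's standard-library `urllib.robotparser` before uploading. Its matching logic is simpler than Google's longest-match precedence, so treat it as a first-pass check rather than a definitive answer:

```python
from urllib import robotparser

# Parse a candidate robots.txt without touching the network
rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
""".splitlines())

# Ask whether a given crawler may fetch a URL under these rules
print(rp.can_fetch('Googlebot', 'https://yourdomain.com/admin/settings'))  # False
print(rp.can_fetch('Googlebot', 'https://yourdomain.com/blog/post'))       # True
```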
Robots.txt Templates by Website Type
Different sites need different robots.txt configurations. Here are optimised templates:
E-commerce site:
```
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /wishlist/
Disallow: /compare/
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*&page=
Disallow: /search?
Allow: /
Sitemap: https://store.com/sitemap.xml
```

Blog / content site:
```
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /tag/
Disallow: /author/
Disallow: /*?replytocom=
Allow: /wp-admin/admin-ajax.php
Allow: /
Sitemap: https://blog.com/sitemap.xml
```

SaaS / web application:
```
User-agent: *
Disallow: /app/
Disallow: /dashboard/
Disallow: /api/
Disallow: /settings/
Disallow: /login
Disallow: /signup
Allow: /
Sitemap: https://saas.com/sitemap.xml
```

For WordPress
WordPress automatically creates a virtual robots.txt if no physical file exists in the root directory. The default disallows /wp-admin/ (while allowing admin-ajax.php) and lets crawlers access everything else.
Method 1: Plugin (recommended)
Using Yoast SEO: go to SEO → Tools → File editor, edit the robots.txt content, and save.
Using Rank Math: go to Rank Math → General Settings → Edit robots.txt, make your changes, and save.
Method 2: Manual upload
1. Create the robots.txt file locally.
2. Upload it to your site's root directory (the same folder as wp-config.php) via FTP or your hosting control panel.

⚠️ Critical warning: In WordPress → Settings → Reading, there is a checkbox: "Discourage search engines from indexing this site". If this is checked, WordPress tells crawlers to stay away (older versions added Disallow: / to the virtual robots.txt; WordPress 5.3 and later inject a noindex meta tag instead). Either way, your site drops out of search. Check this setting first if your site isn't being indexed.
For Shopify
Shopify automatically generates a robots.txt file for your store. You can customise it using a robots.txt.liquid template:
1. In the Shopify admin, go to Online Store → Themes → Edit code.
2. Under Templates, click Add a new template and select robots.txt to create robots.txt.liquid.
3. Keep the default rule loop and append your custom directives:

```liquid
{% for group in robots.default_groups %}
  {{- group.user_agent }}
  {%- for rule in group.rules %}
    {{ rule }}
  {%- endfor %}

  {%- if group.user_agent.value == '*' %}
    # Custom additions
    Disallow: /collections/*?sort_by=
    Disallow: /search?
  {%- endif %}

  {%- if group.sitemap != blank %}
    {{ group.sitemap }}
  {%- endif %}
{% endfor %}
```
Wix: Wix manages robots.txt automatically. Custom robots.txt editing requires a Business or higher plan. Go to Settings → SEO → Robots.txt to customise.
Squarespace: Squarespace generates its own robots.txt and does not allow custom editing. To keep individual pages out of search results, use Squarespace's per-page SEO settings (which add a noindex tag), and connect the site to Google Search Console to monitor crawling.
Webflow: Webflow generates robots.txt automatically. Custom robots.txt is only available on paid hosting plans. Edit via Project Settings → SEO → Robots.txt.
Robots.txt vs Noindex: When to Use Which
This is one of the most misunderstood concepts in SEO. Robots.txt and the noindex meta tag serve different purposes:
- Robots.txt Disallow: prevents crawling. The page won't be crawled, but if other sites link to it, Google may still index the URL (showing just the URL with no snippet or description).
- Noindex meta tag: prevents indexing. Google must be able to crawl the page to see the tag; once it does, the page is kept out of search results.

Critical gotcha: If you block a page in robots.txt AND add a noindex tag to it, Google can't see the noindex tag (because it can't crawl the page). The page might still get indexed if external links point to it. If you want to prevent indexing, don't block the page in robots.txt — let Google crawl it and read the noindex directive.
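This gotcha is easy to catch programmatically. The sketch below (hypothetical helper name, simplified meta-tag detection via regex) flags pages that are both disallowed and noindexed, the combination where the noindex can never be seen:

```python
import re
from urllib import robotparser

def conflicting_noindex(robots_content: str, path: str, html: str) -> bool:
    """True when `path` is blocked by robots.txt AND the page carries a
    noindex meta tag -- Google can never crawl the page to see the tag."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_content.splitlines())
    blocked = not rp.can_fetch('Googlebot', 'https://example.com' + path)
    noindex = bool(re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]*noindex', html, re.IGNORECASE))
    return blocked and noindex

print(conflicting_noindex(
    'User-agent: *\nDisallow: /private/\n',
    '/private/page.html',
    '<meta name="robots" content="noindex">'))  # True: this page needs fixing
```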
Use robots.txt when: You want to save crawl budget (large sites with many low-value pages), prevent crawling of resource-heavy pages, or block specific bots entirely.
Use noindex when: You want to guarantee a page doesn't appear in search results — thank you pages, internal search results, user profile pages, archived content.
Managing AI Crawlers in 2024
With the rise of AI-powered search and content training, managing AI crawlers in your robots.txt has become a significant concern for publishers. Here are the most common AI-related user agents:
```
# OpenAI's crawler (used for training data)
User-agent: GPTBot
Disallow: /

# Common Crawl (used by many AI companies)
User-agent: CCBot
Disallow: /

# Google's AI training crawler (separate from search)
User-agent: Google-Extended
Disallow: /

# Anthropic's crawler
User-agent: anthropic-ai
Disallow: /

# Bytedance / TikTok
User-agent: Bytespider
Disallow: /
```

Important distinction: Blocking Google-Extended prevents Google from using your content for AI training (Gemini/Bard), but does NOT affect Google Search indexing. Your pages will still appear in regular Google search results.
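If you keep the list of AI user agents in one place, the section above is trivial to generate, which makes it easier to add or remove bots later. A small sketch:

```python
# The agents named in the section above; extend as new crawlers appear
AI_BOTS = ['GPTBot', 'CCBot', 'Google-Extended', 'anthropic-ai', 'Bytespider']

def block_bots(agents: list) -> str:
    """Emit a robots.txt section that fully disallows each agent."""
    return '\n\n'.join(f'User-agent: {a}\nDisallow: /' for a in agents)

print(block_bots(AI_BOTS))
```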
Common Mistakes to Avoid
- Disallow: / with no Allow rules: this blocks all crawlers from your entire site. One of the most catastrophic SEO mistakes possible. Always verify your file doesn't contain this unless intentional.
- Missing Sitemap: line: without it, Google must find your sitemap via Search Console alone — a wasted opportunity.
- Blocking CSS and JavaScript: disallowing /wp-content/ or similar directories prevents Google from rendering your pages properly, which hurts mobile usability assessments and Core Web Vitals scoring.
- Case mismatches: robots.txt path matching is case-sensitive, so Disallow: /Admin/ does not block /admin/. They are different paths.
- Wrong location: a robots.txt at yourdomain.com/blog/robots.txt is completely ignored by crawlers. It must be at the domain root.

How Robots.txt Connects to Other SEO Elements
Your robots.txt file doesn't exist in isolation — it interacts with several other SEO configurations:
Robots.txt vs Noindex: When to Use Each
This distinction trips up even experienced webmasters. Robots.txt and the noindex meta tag solve different problems, and using the wrong one can backfire spectacularly.
| Scenario | Use Robots.txt Disallow | Use Noindex Meta Tag |
|---|---|---|
| Save crawl budget on large sites | ✓ Best option | Not ideal (uses crawl budget) |
| Remove a page from search results | ❌ Won't guarantee removal | ✓ Best option |
| Block AI crawlers from training on content | ✓ Best option | Not applicable |
| Hide internal search results pages | Either works | ✓ More reliable |
| Block staging or development environments | ✓ Good choice | Also works (use both for safety) |
| Prevent indexing of a page with external backlinks | ❌ Page may still appear | ✓ Only reliable option |
The golden rule: If your goal is to save crawl budget and you don't care whether the URL appears in search results with no snippet, use Disallow. If your goal is to guarantee a page never appears in search results, use noindex and make sure the page is NOT blocked by robots.txt (Google needs to crawl it to read the noindex directive).
For a deeper understanding of indexing and how Google discovers your pages in the first place, see our guide on how to submit your website to Google.
How Major Websites Structure Their Robots.txt
Looking at how large, well-optimized sites configure their robots.txt can teach you a lot about best practices. Here's what some of the biggest sites do:
Google (google.com/robots.txt)
Google's own robots.txt is extensive — over 500 lines. Key patterns:
- Disallows internal search and service paths (e.g. /search, /sdch, /groups)

Amazon (amazon.com/robots.txt)
Amazon's robots.txt focuses heavily on preventing crawl waste from their massive product catalog:
- Blocks internal search and filter URLs (e.g. /s?) — these are faceted navigation pages that create billions of URL combinations

Lesson: For e-commerce sites, blocking faceted navigation and search results pages is critical for crawl budget. Let Google crawl your product pages, category pages, and content — block everything else.
Facebook (facebook.com/robots.txt)
Facebook takes an aggressive approach: it disallows almost everything by default and explicitly allows only a handful of public paths.
Lesson: If you have user-generated content or private areas, consider a whitelist approach — block everything by default and explicitly allow only public, SEO-relevant paths.
Robots.txt for Different CMS Platforms
Each CMS handles robots.txt differently. Here's what you need to know for each major platform:
WordPress
WordPress generates a virtual robots.txt by default. For most WordPress sites, this optimised configuration works well:
```
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /wp-includes/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /*?s=
Disallow: /*?replytocom=
Disallow: /tag/
Disallow: /author/
Allow: /wp-admin/admin-ajax.php
Allow: /wp-content/uploads/
Sitemap: https://yourdomain.com/sitemap_index.xml
```

Use Yoast SEO or Rank Math to edit this directly from your WordPress dashboard without FTP access. Both plugins provide a built-in robots.txt editor and validation.
Shopify
Shopify's auto-generated robots.txt is actually quite well-optimised out of the box. It blocks cart, checkout, and account pages while allowing products and collections. If you need customisation, use the robots.txt.liquid template. Key additions for Shopify stores:
- `Disallow: /collections/*?sort_by=`
- `Disallow: /search?`

Wix
Wix manages robots.txt automatically and limits customisation on free plans. On Business plans and higher, you can edit via Settings → SEO → Robots.txt. Wix's default robots.txt is generally adequate for small sites, but if you have a large Wix site with many pages, consider upgrading to get full control over crawl directives.
Next.js
For Next.js sites (including this one), you have full control. Create a robots.txt file in your public/ directory, or use the App Router's built-in robots.ts file for dynamic generation:

```typescript
// app/robots.ts
import { MetadataRoute } from 'next';

export default function robots(): MetadataRoute.Robots {
  return {
    rules: {
      userAgent: '*',
      allow: '/',
      disallow: ['/api/', '/admin/', '/_next/'],
    },
    sitemap: 'https://yourdomain.com/sitemap.xml',
  };
}
```

The TypeScript approach gives you the advantage of type safety and the ability to generate rules dynamically based on environment variables (e.g., blocking all crawlers on staging deployments).
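The same environment-based switching works in any stack. Here is a language-agnostic sketch in Python of the staging-blocking idea (the `DEPLOY_ENV` variable name is an assumption; use whatever your platform sets):

```python
import os

def robots_txt() -> str:
    """Serve a fully blocking robots.txt everywhere except production."""
    if os.environ.get('DEPLOY_ENV', 'production') != 'production':
        return 'User-agent: *\nDisallow: /\n'  # keep staging out of the index
    return (
        'User-agent: *\n'
        'Disallow: /api/\n'
        'Disallow: /admin/\n'
        'Sitemap: https://yourdomain.com/sitemap.xml\n'
    )

os.environ['DEPLOY_ENV'] = 'staging'
print(robots_txt())  # the blocking version
```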
Testing Your Robots.txt
Never deploy robots.txt changes without testing. One wrong character can block Google from your entire site. Here's how to test safely:
Google Search Console Robots.txt Tester
Google provides a robots.txt report within Search Console (it replaced the standalone Robots.txt Tester in 2023). It lets you:

- see the robots.txt files Google has found for your site, along with fetch status and any parse errors or warnings
- check when Googlebot last fetched the file
- request a re-crawl after you fix a problem
Third-Party Testing Tools

Several free online robots.txt validators let you paste a file and test sample URLs against it, which is handy for checking rules before they ever reach your live site.
Testing Workflow
Before deploying any robots.txt changes:

1. Test the new rules against a list of your most important URLs and confirm none are blocked.
2. Check that the file parses cleanly and stays under the 500 KiB limit.
3. Deploy, then open https://yourdomain.com/robots.txt in a browser to verify the live file matches what you intended.
4. Watch Search Console's indexing reports over the following days for unexpected drops.
Remember that Google caches your robots.txt for up to 24 hours, so changes won't take effect immediately. If you see problems, don't panic — revert and wait for the cache to refresh. For more on how crawling and indexing interact, see our guides on creating an XML sitemap and improving your overall website SEO. Understanding how Core Web Vitals interact with crawl efficiency is also important — a slow site gets crawled less frequently.
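The "confirm critical URLs stay crawlable" step is easy to automate with the standard library. A sketch (the path list is a placeholder for your own key pages):

```python
from urllib import robotparser

CRITICAL_PATHS = ['/', '/blog/', '/products/widget']  # placeholder: your key pages

def blocked_critical_paths(robots_content: str) -> list:
    """Return the critical paths a proposed robots.txt would block."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_content.splitlines())
    return [p for p in CRITICAL_PATHS
            if not rp.can_fetch('Googlebot', 'https://yourdomain.com' + p)]

print(blocked_critical_paths('User-agent: *\nDisallow: /admin/\n'))  # []
print(blocked_critical_paths('User-agent: *\nDisallow: /\n'))  # everything blocked
```

Run this in CI against the file you are about to deploy; a non-empty result should fail the build.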
FAQ
Q: What is a robots.txt file?
A robots.txt file is a plain text file at the root of a website that tells search engine crawlers which pages or sections they are allowed or not allowed to crawl and index. It follows the Robots Exclusion Protocol standard, which has been the web standard for crawler communication since 1994. Every major search engine — Google, Bing, Yahoo, Yandex — checks for this file before crawling a site.
Q: Does robots.txt prevent pages from appearing in Google?
Robots.txt prevents crawling, but not always indexing. If other sites link to a blocked page, Google can still index it (showing a URL with no description). To prevent indexing, use a noindex meta tag — and don't block those pages in robots.txt, because Googlebot won't be able to read the noindex tag on a blocked page. This is one of the most common robots.txt mistakes in SEO.
Q: Where should the robots.txt file be located?
The robots.txt file must be in the root directory of your domain — https://yourdomain.com/robots.txt. A robots.txt at https://yourdomain.com/blog/robots.txt is not valid and will be ignored by crawlers. Note that subdomains need their own robots.txt files — blog.yourdomain.com/robots.txt controls crawling for the blog subdomain only.
Q: Can I have different rules for different search engines?
Yes. Use separate User-agent: blocks for each bot. For example, User-agent: Googlebot applies rules only to Google's crawler. User-agent: * applies to all crawlers not covered by a specific rule. This is commonly used to block aggressive SEO tool crawlers (AhrefsBot, SemrushBot) while allowing search engines full access.
Q: What happens if I don't have a robots.txt file?
Crawlers will crawl your entire site by default. Most crawlers check for robots.txt first — if none exists, they proceed without restrictions. This isn't inherently bad for small sites, but it means you're not directing crawl budget and potentially wasting it on admin, login, and cart pages. For sites with more than a few hundred pages, having a robots.txt is strongly recommended for crawl efficiency.
Q: How long does it take for robots.txt changes to take effect?
Google caches robots.txt for up to 24 hours, so changes don't take effect immediately. In practice, most changes are picked up within a few hours. If you need Google to re-fetch your robots.txt sooner, you can use the URL Inspection tool in Google Search Console to request a re-crawl, though this doesn't guarantee faster processing of the robots.txt itself.
Summary
A well-structured robots.txt file takes less than ten minutes to create and helps Google spend its crawl budget on the pages that actually matter for your rankings. Block admin areas, duplicate parameter pages, and staging content — and always point to your sitemap. With 26% of sites having robots.txt errors, getting this right puts you ahead of a quarter of the web.
Generate a clean, validated robots.txt file now with the free Clarity SEO tool.