How Google Finds Your Website

Google's crawler (also called Googlebot) systematically searches the internet for new and updated pages. This process is called crawling. The crawler follows links from already known pages to discover new ones. Once a page has been crawled and processed, it can be added to Google's index and appear in search results.

For a new website, this means: without active measures, it can take weeks before Google finds you at all. With the right technical foundations, you can shorten this process to just a few days.

Key Takeaway

Crawling is the first step to visibility. If Google can't find and read your pages, you don't exist for the search engine — no matter how good your content is.

XML Sitemap: Your Table of Contents for Google

An XML sitemap is a file that lists all important URLs of your website in a machine-readable format. It's like a table of contents you hand directly to Google — so the crawler doesn't miss any page.

What a Sitemap Looks Like

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://proofofreach.de/</loc>
    <lastmod>2026-03-25</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://proofofreach.de/blog/was-ist-aio.html</loc>
    <lastmod>2026-03-25</lastmod>
    <priority>0.8</priority>
  </url>
</urlset>

How to Create a Sitemap

For a static website like ours, you create the sitemap manually: an XML file with all URLs, saved as sitemap.xml in the root directory of your website. Every time you publish a new article, you add its URL. For larger websites, there are tools and CMS plugins that do this automatically.
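As an illustration, maintaining the sitemap by hand can be replaced by a small script. The following sketch (the URL list and priorities are placeholders, not part of any CMS) builds a sitemap.xml string from a list of pages and writes it to the root directory:

```python
from datetime import date
from xml.sax.saxutils import escape

# Placeholder list of published pages: (URL, priority).
URLS = [
    ("https://proofofreach.de/", "1.0"),
    ("https://proofofreach.de/blog/was-ist-aio.html", "0.8"),
]

def build_sitemap(urls, lastmod=None):
    """Return a sitemap.xml string for the given (url, priority) pairs."""
    lastmod = lastmod or date.today().isoformat()
    entries = []
    for loc, priority in urls:
        entries.append(
            "  <url>\n"
            f"    <loc>{escape(loc)}</loc>\n"
            f"    <lastmod>{lastmod}</lastmod>\n"
            f"    <priority>{priority}</priority>\n"
            "  </url>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(entries)
        + "\n</urlset>\n"
    )

if __name__ == "__main__":
    # Overwrite sitemap.xml after each publish, then redeploy.
    with open("sitemap.xml", "w", encoding="utf-8") as f:
        f.write(build_sitemap(URLS))
```

After publishing a new article, you add its URL to the list and rerun the script instead of editing the XML by hand.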

Submitting the Sitemap to Google

Go to Google Search Console, select your website, and navigate to "Sitemaps." Enter the URL of your sitemap (e.g., https://proofofreach.de/sitemap.xml) and click "Submit." Google then begins to systematically crawl and index your pages.

Robots.txt: Crawling Instructions

The robots.txt is a text file in your website's root directory that tells search engine crawlers which areas they may crawl and which they may not. It controls crawling — not indexing.

A Simple Robots.txt

User-agent: *
Allow: /

Sitemap: https://proofofreach.de/sitemap.xml

This robots.txt says: All crawlers may visit all areas, and here's the link to the sitemap. For most websites, this is sufficient.

Blocking Areas

You can exclude certain directories or files from crawling — for example, internal drafts, duplicate content, or technical files. But be careful not to accidentally block important pages: that's one of the most common technical SEO mistakes.
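As an example, a robots.txt that blocks a hypothetical /drafts/ directory (the directory name is a placeholder) while keeping everything else crawlable could look like this:

```
User-agent: *
Disallow: /drafts/
Allow: /

Sitemap: https://proofofreach.de/sitemap.xml
```

Google evaluates Allow and Disallow rules by the most specific matching path, so the broad Allow: / does not override the more specific Disallow: /drafts/.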

Important: Robots.txt Doesn't Prevent Indexing

A common misconception is that robots.txt removes pages from search results. In fact, it only blocks crawling, not indexing. If a blocked page is linked from another page, Google can still include it in the index — just without knowing its content. If you truly want to keep a page out of the index, use a noindex meta tag, and make sure the page is not blocked in robots.txt, because otherwise the crawler never sees the tag.
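The noindex directive goes into the head of the page that should stay out of the index, for example:

```html
<head>
  <!-- Tells search engines not to include this page in their index -->
  <meta name="robots" content="noindex">
</head>
```

Alternatively, the same directive can be sent as an X-Robots-Tag HTTP header, which is useful for non-HTML files like PDFs.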

Crawl Budget and Crawl Frequency

Google has a so-called crawl budget for every website — the number of pages the crawler visits in a given time period. For small websites with under 1,000 pages, crawl budget is rarely an issue. For large websites with tens of thousands of pages, it becomes a strategic factor.

The crawl frequency — how often Google visits your pages — depends on several factors: How often do you update your content? How much authority does your domain have? How good is your technical infrastructure? Websites that regularly publish new or updated content are crawled more frequently.

Practical Checklist for New Websites

1. Create robots.txt. Place it in the root directory with Allow: / and the reference to your sitemap.

2. Create XML sitemap. List all published pages, save as sitemap.xml in the root directory.

3. Set up Google Search Console. Verify your website, submit the sitemap, and check for crawling errors.

4. Check internal linking. Every important page should be linked from at least one other page. Orphaned pages (without internal links) are harder for Google to find.

5. Monitor indexing. Check in Search Console under "Pages" which of your pages are indexed and which aren't. Respond to problems immediately.
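Parts of this checklist can be sanity-checked locally before submitting anything to Google. As a sketch (the file name and expected host are assumptions), this script parses a sitemap.xml and reports entries that are missing a loc element or that point to a foreign host:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Sitemaps live in the namespace declared on the urlset element.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def check_sitemap(xml_text, expected_host):
    """Return a list of problems found in a sitemap.xml string."""
    problems = []
    root = ET.fromstring(xml_text)
    for i, url in enumerate(root.findall("sm:url", NS), start=1):
        loc = url.find("sm:loc", NS)
        if loc is None or not (loc.text or "").strip():
            problems.append(f"entry {i}: missing <loc>")
            continue
        host = urlparse(loc.text.strip()).netloc
        if host != expected_host:
            problems.append(f"entry {i}: foreign host {host}")
    return problems

if __name__ == "__main__":
    with open("sitemap.xml", encoding="utf-8") as f:
        for problem in check_sitemap(f.read(), "proofofreach.de"):
            print(problem)
```

An empty result means every entry has a loc pointing at your own domain; it does not replace checking the indexing report in Search Console.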

Sources

  • Google Search Central: Official documentation on search engine optimization best practices. developers.google.com

FAQ

What is an XML sitemap?

An XML sitemap is a file that lists all important URLs of your website. It helps search engines find and index your pages faster. You submit it through Google Search Console.

What is robots.txt?

Robots.txt is a text file in your website's root directory that tells search engine crawlers which areas they may crawl and which they may not. It controls crawling, not indexing.

How often does Google crawl my website?

The crawl frequency depends on the size, authority, and update frequency of your website. Regular content updates and a clean sitemap increase the crawl frequency.

Last updated: March 25, 2026