Understanding how search engines work is essential if you want to rank your website on search engines like Google, Microsoft Bing, and Yahoo.

In this guide, you will learn how search engines work, covering crawling, indexing, ranking, sitemaps, and robots.txt.

"How search engines work" refers to the complete process search engines use to discover, store, and display web pages in search results.

How Search Engines Work

There are three main steps in how a search engine works:

1. Crawling

Crawling is the first step, in which search engine bots (crawlers) scan websites and discover pages.

2. Indexing

Indexing is the second step, in which the content of discovered pages is processed and stored in the search engine's index.

3. Ranking

Ranking is the final step, in which indexed pages are ordered by relevance and the best results are shown for a query.

Search Engine Flow: Crawling → Indexing → Ranking

Step     | Role
---------|-------------------
Crawling | Finds pages
Indexing | Stores pages
Ranking  | Shows best results
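The three steps can be illustrated with a toy pipeline. This is a minimal sketch, not how a real search engine is built: the pages, their text, and the scoring rule (count of matching query words) are all hypothetical, chosen only to show crawl → index → rank in miniature.

```python
# Toy search pipeline over three in-memory "crawled" pages.
# URLs and page text are hypothetical, for illustration only.
PAGES = {
    "https://example.com/": "search engines crawl and index pages",
    "https://example.com/blog": "how ranking works in search engines",
    "https://example.com/about": "about our website",
}

# Indexing: build an inverted index mapping each word to the pages containing it.
index = {}
for url, text in PAGES.items():
    for word in text.split():
        index.setdefault(word, set()).add(url)

# Ranking: score each page by how many query words it contains,
# then sort best-first (ties broken alphabetically for stable output).
def rank(query):
    scores = {}
    for word in query.split():
        for url in index.get(word, ()):
            scores[url] = scores.get(url, 0) + 1
    return sorted(scores, key=lambda u: (-scores[u], u))

print(rank("search ranking"))
```

The blog page matches both query words, so it ranks above the homepage, which matches only one.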

Simple Example

A crawler (also called a web crawler or spider) is a software program used by search engines such as Google and Microsoft Bing to automatically browse websites and collect information about their pages.

How a Crawler Works
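In essence, a crawler starts from known URLs, fetches each page, and follows the links it finds, visiting each page once. Here is a minimal sketch of that loop over an in-memory "site" (a real crawler would fetch over HTTP; the URLs and link structure here are hypothetical, to keep the example runnable offline):

```python
from collections import deque

# A tiny in-memory website: each URL maps to the links found on that page.
# These pages and links are hypothetical, for illustration only.
SITE = {
    "https://example.com/": ["https://example.com/blog", "https://example.com/about"],
    "https://example.com/blog": ["https://example.com/blog/post-1"],
    "https://example.com/about": [],
    "https://example.com/blog/post-1": [],
}

def crawl(start):
    """Breadth-first crawl: visit every reachable page exactly once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)               # "fetch" the page
        for link in SITE.get(url, []):  # follow the links it contains
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl("https://example.com/"))
```

The `seen` set is what prevents the crawler from fetching the same page twice or looping forever on circular links.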

Importance of Crawlers

Without crawling, a page can never be indexed or ranked, so making your site easy for crawlers to reach is the foundation of SEO.

Crawling vs Indexing vs Ranking

Factor     | Crawling        | Indexing        | Ranking
-----------|-----------------|-----------------|--------------------
Purpose    | Discover pages  | Store data      | Show results
Step       | First           | Second          | Final
Visibility | Not yet visible | Not yet visible | Visible in results

A sitemap is an XML file that lists the important pages of your website, helping search engines like Google easily find and crawl them.

How Sitemap Works

You publish the sitemap file on your site (and can submit it in tools like Google Search Console), and crawlers read it to discover the listed URLs.

Example Sitemap

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
  </url>
  <url>
    <loc>https://example.com/blog</loc>
  </url>
</urlset>
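Because a sitemap is plain XML, extracting the listed URLs is straightforward. A minimal sketch using Python's standard library (the sitemap string mirrors the example above, with the standard sitemaps.org namespace):

```python
import xml.etree.ElementTree as ET

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog</loc></url>
</urlset>"""

# The sitemaps.org namespace must be given explicitly when querying.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)
```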

Benefits of Sitemap

The robots.txt file tells search engine crawlers which parts of your site they may and may not access.

How robots.txt Works

Crawlers request /robots.txt from the root of your domain before crawling and follow its Allow and Disallow rules.

Example robots.txt

User-agent: *
Disallow: /admin/
Allow: /blog/
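
You can check what these rules permit with Python's built-in `urllib.robotparser`, which reads the same format crawlers do. A minimal sketch using the example rules above (the tested URLs are hypothetical):

```python
from urllib.robotparser import RobotFileParser

RULES = """User-agent: *
Disallow: /admin/
Allow: /blog/
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

# Anything under /admin/ is blocked; /blog/ is explicitly allowed.
print(rp.can_fetch("*", "https://example.com/admin/settings"))  # False
print(rp.can_fetch("*", "https://example.com/blog/post-1"))     # True
```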

Basic robots.txt (Recommended)

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# Allow important assets
Allow: /wp-content/uploads/
Allow: /wp-content/themes/
Allow: /wp-content/plugins/

# Block unnecessary files
Disallow: /readme.html
Disallow: /license.txt

# Sitemap
Sitemap: https://yourdomain.com/sitemap_index.xml

How This Works
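
You can sanity-check a configuration like this with `urllib.robotparser` before deploying it. One caveat worth knowing: Python's parser applies rules in file order, while Google uses most-specific-match, so results can differ for overlapping rules such as the admin-ajax.php exception; the checks below stick to non-overlapping paths (the tested URLs are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Trimmed version of the recommended configuration above.
RULES = """User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Allow: /wp-content/uploads/
Disallow: /readme.html
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("*", "https://yourdomain.com/wp-admin/settings.php"))        # blocked
print(rp.can_fetch("*", "https://yourdomain.com/wp-content/uploads/logo.png"))  # allowed
print(rp.can_fetch("*", "https://yourdomain.com/readme.html"))                  # blocked
print(rp.can_fetch("*", "https://yourdomain.com/blog/post-1"))                  # allowed by default
```

Paths that match no rule are allowed by default, which is why ordinary content like /blog/ needs no Allow line.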

Advanced Version (More Optimized)

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# Block search & query URLs
Disallow: /?s=
Disallow: /search/

# Block author pages (optional)
Disallow: /author/

# Allow core assets
Allow: /wp-content/uploads/
Allow: /wp-content/themes/
Allow: /wp-content/plugins/

# Sitemap
Sitemap: https://yourdomain.com/sitemap_index.xml

Sitemap vs robots.txt

Feature | Sitemap       | robots.txt
--------|---------------|------------------
Purpose | What to crawl | What NOT to crawl
Type    | XML file      | Text file
Role    | Discovery     | Control
