System Design of Search Engine

September 30, 2024

Search looks simple from the outside: type a query, get ten blue links. Under the hood it is one of the most interesting large-scale system design problems: crawling, distributed storage, low-latency retrieval, machine-learned ranking, spam resistance, freshness, personalization, and evaluation all have to work together.

This note sketches how I would design a modern web search engine, using public examples from Google and Yandex as reference points.

The product contract

A search engine optimizes for a deceptively hard promise:

understand what the user means
find candidate documents from a huge corpus
rank them by usefulness, trust, and intent match
return results in hundreds of milliseconds
keep the index fresh as the web changes
learn from feedback without overfitting to noisy clicks

The core architecture usually separates into two worlds:

Offline systems that crawl, parse, index, train models, compute document signals, and build serving artifacts.
Online systems that understand a query, retrieve candidates, rank them, blend verticals, and render the result page.

High-level architecture

Web / feeds / APIs
      ↓
Crawler → fetcher → parser → canonicalizer
      ↓
Document store + link graph + feature store
      ↓
Inverted index + vector index + freshness index
      ↓
Query understanding
      ↓
Candidate retrieval
      ↓
Multi-stage ranking
      ↓
Blending, snippets, safety/spam filters
      ↓
Search results page
      ↓
Logs, judgments, experiments, model training

The important design choice is that ranking is not one model. It is a cascade.

Web Crawl Parse Index

Query Retrieve Rank Blend

Crawling and indexing

The crawler starts with known URLs, sitemaps, feeds, backlinks, and user-discovered URLs. A frontier service prioritizes what to fetch next based on:

page importance
expected freshness
host politeness limits
duplicate probability
historical content quality
crawl budget

After fetching, pages go through parsing and normalization:

HTML cleanup
language detection
boilerplate removal
canonical URL selection
duplicate and near-duplicate detection
text extraction
structured data extraction
link graph updates

The output is stored in several indexes:

an inverted index for lexical retrieval
a forward index for document features and snippets
a link graph for authority-style signals
a vector index for semantic retrieval
a freshness index for newsy or rapidly changing pages

Classic search still needs the inverted index. Dense embeddings help with semantic recall, but lexical matching remains hard to beat for exact names, rare terms, code, product IDs, and navigational queries.

Query understanding

Before retrieval, the system rewrites and interprets the query:

spelling correction
tokenization and normalization
language and region detection
entity recognition
intent classification
query expansion
synonym matching
freshness detection
vertical detection, such as web, images, maps, shopping, news

This is where modern language models matter. Google publicly describes systems such as RankBrain, neural matching, and BERT as helping Search understand concepts, word relationships, and intent. BERT is especially useful for compositional meaning: small words like “to” or “for” can change what the user is asking.

Example:

2019 brazil traveler to usa need a visa

A bag-of-words system can match pages about U.S. citizens traveling to Brazil. A better language-understanding model should understand the direction: a traveler from Brazil going to the U.S.

Candidate retrieval

At web scale, the ranking model cannot score every document. The system first retrieves a few thousand or tens of thousands of candidates from multiple retrievers:

BM25 / lexical retrieval
phrase and proximity matchers
anchor-text retrieval
entity and knowledge graph retrieval
semantic vector retrieval
freshness/news retrieval
personalized or local candidates

The retrieval stage optimizes recall. It is okay if many candidates are mediocre, as long as the good ones are not missed.

Ranking as a cascade

A practical ranking stack often has several stages:

100k+ raw matches

10k retrieved candidates

1k main ranker

100 neural reranker

10 final page

1. Lightweight ranker

Scores many candidates using cheap features:

term match
title match
URL/domain signals
document language
freshness
PageRank-like link authority
spam score
basic user/location match

2. Main ranking model

Scores hundreds or thousands of candidates with richer features:

query-document text features
click and satisfaction aggregates
authority and trust signals
freshness and topicality
document quality
result diversity
intent match
vertical-specific features

Yandex is a useful public example here. Yandex developed MatrixNet, a gradient-boosted decision-tree system used in ranking, and later open-sourced CatBoost, which Yandex describes as a successor to MatrixNet and useful for ranking tasks. Gradient boosting is popular in ranking because it handles heterogeneous tabular features very well.

3. Neural reranker

Scores the top candidates with more expensive models:

cross-encoder BERT-style query-document models
passage-level relevance models
answer extraction models
freshness-aware or vertical-specific neural models

The neural reranker has high precision but is expensive, so it runs late in the cascade.

4. Blending and final page construction

The final page is not just a ranked list. It may blend:

organic web results
maps
images
videos
news
shopping
knowledge panels
direct answers
related questions

The blender decides which result types to show and where. This is a ranking problem too: for a restaurant query, maps may deserve the top slot; for a breaking event, news may dominate; for a programming query, documentation or Stack Overflow-like pages may be better.

Weborganic pages

Mapslocal intent

Newsfreshness intent

Answersdirect response

Feature examples

Useful ranking features might include:

BM25 score
title exact-match score
query term proximity
anchor text score
language match
country/region match
freshness score
document quality score
spam probability
site authority
PageRank-like graph score
click-through rate, debiased
long-click or satisfaction proxy
query intent class
entity overlap
embedding similarity
BERT cross-encoder score

No single feature is enough. A good search engine survives because it combines weak signals into a robust system.

Learning to rank

Training data comes from several sources:

human relevance judgments
click logs
dwell time and reformulation signals
interleaving experiments
A/B tests
synthetic labels from stronger models

Click data is useful but biased. Top-ranked results get more clicks because they are top-ranked. A search engine needs counterfactual logging, randomized exploration, interleaving, or debiasing methods to avoid learning its own previous mistakes.

Common ranking objectives:

pairwise ranking losses
listwise ranking losses
NDCG optimization
click/satisfaction prediction
multi-objective utility functions

The hard part is not just training the model. It is choosing the metric. A model that increases clicks can still make search worse if it rewards sensational pages, clickbait, or quick pogo-sticking.

Google-style design notes

Public Google Search documentation describes multiple ranking systems working together rather than one monolithic algorithm:

RankBrain helps relate words to concepts.
Neural matching helps match broader representations of queries and pages.
BERT helps understand how combinations of words express meaning and intent.

System-design lesson: language understanding is part of retrieval and ranking, not a separate “NLP feature.” It changes which candidates are found, how they are scored, and which result types are blended into the page.

Yandex-style design notes

Yandex is a strong example of industrial learning-to-rank:

MatrixNet was used to construct ranking formulas.
CatBoost evolved from this family of gradient-boosted decision-tree systems and is public.
Ranking models in this style are powerful because search has many non-neural signals: text statistics, graph signals, user behavior aggregates, freshness, geography, and quality features.

System-design lesson: even in the neural era, boosted trees are still extremely practical for ranking pipelines because they are fast, interpretable enough, and strong on tabular features.

Latency budget

A search request might have a budget like:

query parsing              5-15 ms
candidate retrieval       20-60 ms
feature fetching          10-40 ms
first-stage ranking       10-30 ms
neural reranking          20-80 ms
blending/snippets         10-40 ms
network/rendering         remaining budget

query parsing5-15 ms

candidate retrieval20-60 ms

feature fetching10-40 ms

ranking10-80 ms

blending/snippets10-40 ms

This forces tradeoffs:

precompute document features offline
cache popular queries
shard indexes carefully
use early exits
run expensive models only on top candidates
degrade gracefully when a service is slow

Reliability and freshness

Search has several freshness layers:

real-time or near-real-time indexing for news and urgent pages
frequent recrawling for important pages
slower background crawling for stable pages
deletion handling and canonical updates

Serving must tolerate partial failure. If the vector retrieval service is down, lexical search should still work. If the neural reranker times out, first-stage ranking should still return reasonable results.

Evaluation

Offline metrics:

NDCG
MRR
precision at k
recall at k
calibration
latency
freshness lag
spam rate

Online metrics:

long clicks
reformulation rate
abandonment rate
successful session rate
manual quality review
guardrail metrics for spam, safety, and diversity

The most important evaluation question is qualitative: did the user finish the task?

A minimal version

If I had to build a small search engine first:

Crawl a controlled corpus.
Build an inverted index with BM25.
Add document parsing, canonicalization, and snippets.
Add a feature store and a small learning-to-rank model.
Add vector retrieval for semantic recall.
Add a neural reranker for the top 100 documents.
Add click/judgment logging and an evaluation loop.
Add freshness, spam detection, and vertical blending.

The beautiful thing about search is that every layer can be improved independently, but every layer also changes the others. Retrieval affects ranking. Ranking affects clicks. Clicks affect training. Training affects what the system believes users want. Search is not just an information retrieval problem; it is a living feedback system.