Deep-Dive: Search and Knowledge Sources

This module explains how content gets into Airgentic and how the AI finds relevant information when answering questions. Read it when you want to understand content indexing or search configuration, or when you need to troubleshoot answers that are wrong, outdated, or missing.

Who this is for

- Platform administrators who manage content indexing
- Content owners who need to understand what the AI knows
- Anyone troubleshooting answer quality issues

What you'll learn

- How content gets from your website into the AI's knowledge base
- The difference between web crawl and uploaded documents
- How search works when answering questions
- How to configure search categories and synonyms
- How to diagnose and fix content-related answer problems

The content pipeline

Content flows through several stages before the AI can use it:

Source → Crawl/Upload → Processing → Indexing → Search → Answer

Source — Your website or uploaded documents
Crawl/Upload — Content is fetched or uploaded
Processing — Text is extracted and structured
Indexing — Content is stored in a searchable format
Search — Relevant content is retrieved for a question
Answer — The AI generates a response using the search results

Understanding this pipeline helps you diagnose problems at the right stage.

Knowledge sources

Web crawl

The web crawler automatically fetches and processes pages from your website.

What you control:
- Seed URLs — Starting points for the crawl
- Crawl scope — Include/exclude patterns that determine which pages are crawled
- Politeness delay — How long the crawler waits between page requests
- Field mappings — How metadata is extracted from pages

When content updates:
Changes on your website are not reflected in the AI's answers until the next data sync. Run a sync manually, or set up a sync schedule, to re-crawl and re-index the content.

Related: Crawler Settings, Data Sync

Uploaded documents

For content not available via web crawl, upload files directly.

Supported formats:
- PDF documents
- Microsoft Word (.doc, .docx)
- HTML files

Public vs secure:
- Public documents — Anyone can view citations
- Secure documents — Only authenticated users can view citations

When content updates:
After uploading, click Index Documents to make new content searchable.

Related: Upload Documents

How search works

The search process

When a user asks a question:

Query formulation — The AI creates a search query based on the user's question
Search execution — The query is run against your indexed content
Result ranking — Results are scored by relevance
Result filtering — Search scope, categories, and boosts are applied
Content delivery — Top results are passed to the AI for answer generation
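
The five steps above can be sketched as a toy retrieval function. This is only an illustration: the in-memory `INDEX`, the stop-word list, and the term-overlap scoring are assumptions made for the example, and Airgentic's actual query formulation and ranking are not described here.

```python
import re

# Hypothetical in-memory index (URL -> page text). A real index is far richer.
INDEX = {
    "https://www.example.com/support/returns": "How to return a product for a refund",
    "https://www.example.com/products/kettle": "Electric kettle product details and price",
    "https://www.example.com/blog/news": "Company news and product announcements",
}

# Assumed stop words, used only to keep the example query short.
STOP_WORDS = {"a", "an", "the", "how", "do", "i", "to", "for", "can", "my"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def formulate_query(question):
    """Step 1: turn the user's question into search terms."""
    return [t for t in tokenize(question) if t not in STOP_WORDS]


def search(question, scope_prefixes, top_k=2):
    terms = formulate_query(question)
    scored = []
    for url, text in INDEX.items():            # Step 2: run the query
        words = set(tokenize(text))
        score = sum(1 for t in terms if t in words)  # Step 3: score relevance
        if score > 0:
            scored.append((score, url))
    # Step 4: keep only results inside the allowed URL prefixes
    scored = [(s, u) for s, u in scored if u.startswith(tuple(scope_prefixes))]
    # Highest score first; ties broken by URL so the order is stable
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [u for _, u in scored[:top_k]]      # Step 5: deliver top results


if __name__ == "__main__":
    scope = ["https://www.example.com/support/", "https://www.example.com/products/"]
    print(search("How do I return a product?", scope))
```

Even in this simplified form, the sketch shows why diagnosis is stage-specific: a page can be missing from the final list because it was never indexed, because it scored poorly, or because a scope filter removed it.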

What affects search quality

Factor	Impact
Content coverage	If the information isn't indexed, it can't be found
Content quality	Well-structured pages with clear text rank better
Query matching	The AI's query must match how content is written
Category configuration	Boosts and mappings affect ranking
Synonym rules	Help match user terminology to content terminology

Search configuration

Synonyms

Synonym rules modify user queries before search:

Rule type	Effect
Add	Search for both terms (X and Y)
Replace	Search for Y instead of X
Delete	Remove X from the query

Examples:
- Add: "FAQ" → also search "frequently asked questions"
- Replace: "T&Cs" → "terms and conditions"
- Delete: "pdf" (removes noise word)

When to use synonyms:
- Users search for abbreviations not in your content
- Your content uses different terminology than your users do
- Common words are cluttering searches
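
To make the three rule types concrete, here is a minimal sketch of how they could rewrite a query before search. The `RULES` structure and the `apply_synonyms` function are hypothetical and exist only for this example; rules are configured in Search Settings, not in code, and matching here is simplified to single whitespace-separated terms.

```python
# Hypothetical rule table: term -> (rule type, replacement or added phrase)
RULES = {
    "faq": ("add", "frequently asked questions"),
    "t&cs": ("replace", "terms and conditions"),
    "pdf": ("delete", None),
}


def apply_synonyms(query, rules):
    """Rewrite a query term by term using add / replace / delete rules."""
    rewritten = []
    for term in query.lower().split():
        rule = rules.get(term)
        if rule is None:
            rewritten.append(term)              # no rule: keep the term
            continue
        kind, value = rule
        if kind == "add":
            rewritten.extend([term, value])     # search for both X and Y
        elif kind == "replace":
            rewritten.append(value)             # search for Y instead of X
        elif kind == "delete":
            pass                                # drop X from the query
        else:
            raise ValueError(f"Unknown rule type: {kind}")
    return " ".join(rewritten)


if __name__ == "__main__":
    print(apply_synonyms("returns FAQ pdf", RULES))
    print(apply_synonyms("T&Cs pdf", RULES))
```

Note that an Add rule widens the search while a Replace rule discards the user's original term, so Replace is the riskier choice when the original term also appears in your content.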

Search scope by agent

Agents can be restricted to search only specific URL prefixes:

https://www.example.com/products/
https://www.example.com/support/

Use this to ensure specialist agents only see relevant content.
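
A prefix restriction amounts to a simple "starts with" test on each result URL. The sketch below assumes hypothetical agent names and treats an agent with no configured prefixes as unrestricted; both are assumptions for illustration, not documented Airgentic behaviour.

```python
# Hypothetical agent names mapped to the URL prefixes they may search.
AGENT_SCOPES = {
    "product-agent": ["https://www.example.com/products/"],
    "support-agent": ["https://www.example.com/support/"],
}


def in_agent_scope(agent, url):
    """Return True if the URL falls under one of the agent's prefixes."""
    prefixes = AGENT_SCOPES.get(agent)
    if not prefixes:
        # Assumption: an agent without prefixes can search everything.
        return True
    return any(url.startswith(prefix) for prefix in prefixes)


if __name__ == "__main__":
    print(in_agent_scope("product-agent", "https://www.example.com/products/kettle"))
    print(in_agent_scope("product-agent", "https://www.example.com/support/returns"))
```

Under this kind of matching, the trailing slash matters: a prefix of `https://www.example.com/products` would also match `https://www.example.com/products-archive/`.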

Related: Search Settings

Crawler configuration

Crawl scope

Control which pages the crawler fetches:

Setting	Purpose
Seed URLs	Starting points for the crawl
Maximum pages	Limit how many pages are crawled
Include patterns	URLs that should be crawled (regex)
Exclude patterns	URLs that should not be crawled (regex)
URL parameters	How to handle query parameters

Common configurations:
- Crawl only specific sections of a large site
- Exclude login pages, print versions, or redundant content
- Handle pagination correctly
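
Before saving include and exclude patterns, it can help to test them against a handful of real URLs. The sketch below does this with Python's `re` module. The patterns, the ignored parameter names, and the rule that an exclude match wins over an include match are all assumptions for the example; check Crawler Settings for how the crawler actually resolves conflicts and handles parameters.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Example patterns (assumed): crawl only products and support pages,
# and skip login pages and print versions.
INCLUDE = [re.compile(r"^https://www\.example\.com/(products|support)/")]
EXCLUDE = [re.compile(r"/login"), re.compile(r"[?&]print=1")]

# Example query parameters (assumed) that do not change page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "sessionid"}


def should_crawl(url):
    """Assumed precedence: any exclude match rejects the URL."""
    if any(p.search(url) for p in EXCLUDE):
        return False
    return any(p.search(url) for p in INCLUDE)


def normalize(url):
    """Drop ignored query parameters so one page maps to one URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    for candidate in [
        "https://www.example.com/products/kettle",
        "https://www.example.com/support/login",
        "https://www.example.com/products/list?page=2&utm_source=mail",
    ]:
        print(should_crawl(candidate), normalize(candidate))
```

Keeping the `page` parameter while dropping tracking parameters is one way to handle pagination without indexing the same page under several URLs.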

Field mappings

Control how metadata is extracted from pages:

Field	What it's used for
Title	Document title in search results and citations
Description	Summary text
Image	Thumbnail in search results
Date	Publication date for filtering or display
Custom fields	Product model, category, etc. for filtering

Field mappings use XPath or selectors to locate content in the page HTML.
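
As an illustration of what an XPath-based mapping does, the sketch below extracts a title, a description, and a custom field from a small page. It uses the limited XPath support in Python's standard `xml.etree.ElementTree`, which requires well-formed markup; real pages are often not well-formed, and the engine Airgentic uses to evaluate mappings is not specified here. The page content, field names, and `extract_fields` function are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for this example.
PAGE = """<html>
  <head>
    <title>Electric Kettle K200</title>
    <meta name="description" content="A 1.7 litre cordless kettle."/>
    <meta name="model" content="K200"/>
  </head>
  <body><h1>Electric Kettle K200</h1></body>
</html>"""

# Field name -> (XPath expression, attribute to read, or None for element text)
FIELD_MAPPINGS = {
    "title": (".//title", None),
    "description": (".//meta[@name='description']", "content"),
    "model": (".//meta[@name='model']", "content"),
}


def extract_fields(page_source, mappings):
    """Apply each mapping to the page and collect the extracted values."""
    root = ET.fromstring(page_source)
    fields = {}
    for name, (xpath, attribute) in mappings.items():
        node = root.find(xpath)
        if node is None:
            fields[name] = None                 # mapping matched nothing
        elif attribute is None:
            fields[name] = (node.text or "").strip()
        else:
            fields[name] = node.get(attribute)
    return fields


if __name__ == "__main__":
    print(extract_fields(PAGE, FIELD_MAPPINGS))
```

A mapping that matches nothing yields an empty field, which is a common reason for missing titles or thumbnails in search results after a site redesign.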

Related: Crawler Settings Overview

Diagnosing content problems

The answer is wrong or outdated

Check:
1. Is the content on your website correct and current?
2. When was the last data sync? (Check Data Sync status)
3. Is the page being crawled? (Check crawler logs)

Fix:
- Update the website content
- Run a data sync to re-index

The AI doesn't know about something

Check:
1. Is the content on a page within crawler scope?
2. Is the page excluded by include/exclude patterns?
3. For uploaded documents, was Index Documents clicked?

Fix:
- Adjust crawler scope to include the content
- Upload the document if it's not web-accessible
- Run indexing after uploads

The AI uses the wrong page

Check:
1. Use Admin Chat Trace Log to see which pages were retrieved
2. Check relevance scores — is the wrong page ranking higher?
3. Is there duplicate or conflicting content?

Fix:
- Add a curated answer to pin the correct source
- Adjust category boosts to favour the right content
- Clean up duplicate content on your website
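
To see why a category boost can change which page the AI uses, consider the sketch below. It assumes that a boost is a multiplier applied to a base relevance score; the actual scoring formula is not documented here, so treat the numbers and the `rank` function as purely illustrative.

```python
def rank(results, boosts):
    """Order results by boosted score.

    results: list of (url, category, base_score) tuples.
    boosts: category -> multiplier (assumed model; unlisted categories get 1.0).
    """
    boosted = [
        (url, base_score * boosts.get(category, 1.0))
        for url, category, base_score in results
    ]
    return sorted(boosted, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical results where a blog post narrowly outranks the support page.
    results = [
        ("https://www.example.com/blog/returns-story", "blog", 0.80),
        ("https://www.example.com/support/returns", "support", 0.70),
    ]
    print([url for url, _ in rank(results, {})])
    print([url for url, _ in rank(results, {"support": 1.5})])
```

When two pages score closely, a modest boost is enough to reorder them, which is why checking relevance scores in the trace log is worth doing before changing boosts.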

Search isn't matching user terms

Check:
1. How does the user phrase the question vs how content is written?
2. Are there abbreviations or alternate terms users might use?

Fix:
- Add synonym rules
- Update content to use user-friendly terminology
- Add curated answers for specific phrasings

Best practices

Content quality

Clear titles — Page titles should accurately describe content
Good structure — Use headings, paragraphs, and lists
Direct answers — Put answers near the top of pages, not buried in long text
Unique pages — Avoid duplicate content across multiple URLs

Keeping content fresh

Scheduled syncs — Set up automatic sync schedules for regular updates
Post-publish sync — Run a sync after significant content changes
Monitor coverage — Check indexed page counts periodically

Monitoring search quality

Review Search Insights — See what users search for and what they click
Check failed searches — High-volume queries with low clicks indicate problems
Test in Admin Chat — Verify search results for important topics
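
If you export search data for offline review, the "high volume, low clicks" check can be approximated with a short script. The event format below, a list of `(query, clicked)` pairs, and the thresholds are assumptions for this example; they do not reflect the actual Search Insights export format.

```python
from collections import defaultdict


def low_click_queries(events, min_searches=3, max_ctr=0.2):
    """Flag frequent queries whose click-through rate is at or below max_ctr.

    events: iterable of (query, clicked) pairs, where clicked is a bool.
    Returns (query, searches, ctr) tuples, most-searched first.
    """
    counts = defaultdict(lambda: [0, 0])  # query -> [searches, clicks]
    for query, clicked in events:
        key = query.strip().lower()       # group case and spacing variants
        counts[key][0] += 1
        if clicked:
            counts[key][1] += 1

    flagged = []
    for query, (searches, clicks) in counts.items():
        ctr = clicks / searches
        if searches >= min_searches and ctr <= max_ctr:
            flagged.append((query, searches, ctr))
    return sorted(flagged, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample data.
    events = (
        [("warranty", False)] * 4
        + [("Opening hours", True)] * 3
        + [("refund", False)]
    )
    print(low_click_queries(events))
```

Queries flagged this way are good candidates for synonym rules, curated answers, or new content.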

Data Sync — Running and scheduling syncs
Crawler Settings Overview — Crawler configuration
Search Settings — Categories and synonyms
Upload Documents — Adding documents directly
Curated Answers — Overriding search results
Search Insights — Search analytics
