Deep-Dive: Search and Knowledge Sources

This module explains how content gets into Airgentic and how the AI finds relevant information when answering questions. Read it when you want to understand content indexing or search configuration, or when you need to troubleshoot why certain questions aren't being answered well.


Who this is for

  • Platform administrators who manage content indexing
  • Content owners who need to understand what the AI knows
  • Anyone troubleshooting answer quality issues

What you'll learn

  • How content gets from your website into the AI's knowledge base
  • The difference between web crawl and uploaded documents
  • How search works when answering questions
  • How to configure search categories and synonyms
  • How to diagnose and fix content-related answer problems

The content pipeline

Content flows through several stages before the AI can use it:

Source → Crawl/Upload → Processing → Indexing → Search → Answer
  1. Source — Your website or uploaded documents
  2. Crawl/Upload — Content is fetched or uploaded
  3. Processing — Text is extracted and structured
  4. Indexing — Content is stored in a searchable format
  5. Search — Relevant content is retrieved for a question
  6. Answer — The AI generates a response using the search results

Understanding this pipeline helps you diagnose problems at the right stage.


Knowledge sources

Web crawl

The web crawler automatically fetches and processes pages from your website.

What you control:
- Seed URLs — Starting points for the crawl
- Crawl scope — Include/exclude patterns that determine which pages are crawled
- Politeness delay — How fast the crawler requests pages
- Field mappings — How metadata is extracted from pages

When content updates:
Content changes on your website aren't reflected in the AI's answers until the content is re-indexed. Run a data sync, manually or on a schedule, to re-crawl and re-index.

Related: Crawler Settings, Data Sync

Uploaded documents

For content not available via web crawl, upload files directly.

Supported formats:
- PDF documents
- Microsoft Word (.doc, .docx)
- HTML files

Public vs secure:
- Public documents — Anyone can view citations
- Secure documents — Only authenticated users can view citations

When content updates:
After uploading, click Index Documents to make new content searchable.

Related: Upload Documents


How search works

The search process

When a user asks a question:

  1. Query formulation — The AI creates a search query based on the user's question
  2. Search execution — The query is run against your indexed content
  3. Result ranking — Results are scored by relevance
  4. Result filtering — Search scope, categories, and boosts are applied
  5. Content delivery — Top results are passed to the AI for answer generation

What affects search quality

Factor                  Impact
Content coverage        If the information isn't indexed, it can't be found
Content quality         Well-structured pages with clear text rank better
Query matching          The AI's query must match how content is written
Category configuration  Boosts and mappings affect ranking
Synonym rules           Help match user terminology to content terminology

Search configuration

Categories

Categories group indexed pages into named buckets (e.g., Products, Support, News).

AI Auto-Categorization: When enabled, Airgentic analyses pages and assigns categories automatically during indexing.

Manual Mappings: Human-defined rules that match pages based on URL patterns or metadata.

Category boosts: Adjust ranking for entire categories. A boost of +3 promotes pages in that category higher in results.

When to use categories:
- When you want to label result types in the widget
- When you want to boost certain content types
- When you want agents to search specific categories
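
To make the interaction between manual mappings and category boosts concrete, here is a minimal Python sketch. It is not Airgentic's implementation: the mapping rules, the boost table, the relevance scores, and the idea that a boost is simply added to the relevance score are all assumptions made for illustration.

```python
import re

# Hypothetical manual mappings: URL regex -> category name.
MANUAL_MAPPINGS = [
    (re.compile(r"^https://www\.example\.com/products/"), "Products"),
    (re.compile(r"^https://www\.example\.com/support/"), "Support"),
]

# Hypothetical boosts per category; categories not listed get no boost.
CATEGORY_BOOSTS = {"Support": 3}


def categorize(url):
    """Return the first category whose pattern matches the URL, else None."""
    for pattern, category in MANUAL_MAPPINGS:
        if pattern.search(url):
            return category
    return None


def rank(results):
    """Sort (url, relevance) pairs by relevance plus category boost.

    Additive boosting is an assumption made for this sketch.
    """
    def boosted(item):
        url, relevance = item
        return relevance + CATEGORY_BOOSTS.get(categorize(url), 0)

    return sorted(results, key=boosted, reverse=True)


# Made-up relevance scores for three pages.
results = [
    ("https://www.example.com/products/widget", 5.0),
    ("https://www.example.com/support/widget-setup", 4.0),
    ("https://www.example.com/news/launch", 4.5),
]
ranked = rank(results)
```

In this sketch the Support page starts with the lowest relevance score but ends up first once its +3 boost is applied, which is the kind of reordering to expect when you boost a category.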

Synonyms

Synonym rules modify user queries before search:

Rule type  Effect
Add        Search for both terms (X and Y)
Replace    Search for Y instead of X
Delete     Remove X from the query

Examples:
- Add: "FAQ" → also search "frequently asked questions"
- Replace: "T&Cs" → "terms and conditions"
- Delete: "pdf" (removes noise word)

When to use synonyms:
- Users search for abbreviations not in your content
- Your content uses different terminology than your users do
- Common words are cluttering searches
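
The three rule types can be sketched in a few lines of Python. This is an illustration only: the rule format and the `apply_synonyms` function are hypothetical, and the sketch assumes rules match single, case-insensitive words, which may differ from how Airgentic matches terms.

```python
# Hypothetical rules modelled on the examples above: (rule type, X, Y).
RULES = [
    ("add", "faq", "frequently asked questions"),
    ("replace", "t&cs", "terms and conditions"),
    ("delete", "pdf", None),
]


def apply_synonyms(query, rules=RULES):
    """Return the list of search terms after applying the synonym rules."""
    lookup = {source: (kind, target) for kind, source, target in rules}
    output = []
    for word in query.lower().split():
        kind, target = lookup.get(word, (None, None))
        if kind == "add":
            output += [word, target]   # keep X and also search for Y
        elif kind == "replace":
            output.append(target)      # search for Y instead of X
        elif kind == "delete":
            continue                   # remove X from the query
        else:
            output.append(word)        # no rule applies; keep the word
    return output
```

For example, `apply_synonyms("returns FAQ pdf")` yields `["returns", "faq", "frequently asked questions"]`: the Add rule keeps "faq" and adds its expansion, and the Delete rule drops "pdf".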

Search scope by agent

Agents can be restricted to search only specific URL prefixes:

https://www.example.com/products/
https://www.example.com/support/

Use this to ensure specialist agents only see relevant content.
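
As a rough illustration of prefix-based scoping, the Python sketch below filters candidate URLs for an agent. The agent names, the `AGENT_SCOPES` table, and the rule that an agent with no prefixes is unrestricted are assumptions for this example, not documented Airgentic behaviour.

```python
# Hypothetical agents and their allowed URL prefixes.
AGENT_SCOPES = {
    "product-specialist": ["https://www.example.com/products/"],
    "support-specialist": ["https://www.example.com/support/"],
}


def in_scope(url, prefixes):
    """True if the URL starts with any allowed prefix.

    An empty prefix list is treated as "no restriction" (an assumption).
    """
    return not prefixes or url.startswith(tuple(prefixes))


def filter_results(agent, urls):
    """Keep only the URLs the given agent is allowed to search."""
    prefixes = AGENT_SCOPES.get(agent, [])
    return [url for url in urls if in_scope(url, prefixes)]


candidates = [
    "https://www.example.com/products/widget",
    "https://www.example.com/products-archive/old-widget",
    "https://www.example.com/support/widget-setup",
]
product_urls = filter_results("product-specialist", candidates)
```

Note the trailing slash in each prefix: with a simple prefix match like this one, `https://www.example.com/products/` does not match `/products-archive/`, whereas a prefix without the slash would.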

Related: Search Settings


Crawler configuration

Crawl scope

Control which pages the crawler fetches:

Setting           Purpose
Seed URLs         Starting points for the crawl
Maximum pages     Limit how many pages are crawled
Include patterns  URLs that should be crawled (regex)
Exclude patterns  URLs that should not be crawled (regex)
URL parameters    How to handle query parameters

Common configurations:
- Crawl only specific sections of a large site
- Exclude login pages, print versions, or redundant content
- Handle pagination correctly
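
Before saving include and exclude patterns, it can help to try them against a handful of URLs. The Python sketch below does this with the standard `re` module. The patterns are examples, and two behaviours are assumed rather than documented: that a pattern may match anywhere in the URL, and that an exclude match overrides an include match. Check both against your crawler settings.

```python
import re

# Example patterns (hypothetical): crawl only two sections of the site,
# and skip login pages, print versions, and PDF links.
INCLUDE_PATTERNS = [r"^https://www\.example\.com/(products|support)/"]
EXCLUDE_PATTERNS = [r"/login", r"[?&]print=1", r"\.pdf$"]


def should_crawl(url):
    """True if the URL matches an include pattern and no exclude pattern."""
    included = any(re.search(p, url) for p in INCLUDE_PATTERNS)
    excluded = any(re.search(p, url) for p in EXCLUDE_PATTERNS)
    return included and not excluded


test_urls = [
    "https://www.example.com/products/widget",
    "https://www.example.com/products/widget?print=1",
    "https://www.example.com/support/login",
    "https://www.example.com/blog/post",
]
for url in test_urls:
    print(url, "->", "crawl" if should_crawl(url) else "skip")
```

Only the first URL is crawled here: the second and third hit an exclude pattern, and the fourth matches no include pattern.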

Field mappings

Control how metadata is extracted from pages:

Field          What it's used for
Title          Document title in search results and citations
Description    Summary text
Image          Thumbnail in search results
Date           Publication date for filtering or display
Custom fields  Product model, category, etc. for filtering

Field mappings use XPath or selectors to locate content in the page HTML.
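
The following Python sketch shows the general idea of XPath-based field extraction. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup, so it is a simplified stand-in rather than a description of how Airgentic parses pages. The sample page, the field names, and the `extract_fields` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# A small, well-formed sample page (hypothetical content).
PAGE = """<html><head><title>Widget 3000 User Guide</title>
<meta name="description" content="Setup and troubleshooting for the Widget 3000."/>
</head><body><h1>Widget 3000</h1></body></html>"""

# Hypothetical mappings: field name -> (XPath, attribute to read or None for text).
FIELD_MAPPINGS = {
    "title": (".//title", None),
    "description": (".//meta[@name='description']", "content"),
    "heading": (".//h1", None),
}


def extract_fields(page_xml, mappings):
    """Apply each mapping to the page and return a dict of extracted values."""
    root = ET.fromstring(page_xml)
    fields = {}
    for name, (path, attr) in mappings.items():
        node = root.find(path)
        if node is None:
            fields[name] = None                      # nothing matched the path
        elif attr is None:
            fields[name] = (node.text or "").strip() # read the element text
        else:
            fields[name] = node.get(attr)            # read an attribute value
    return fields


fields = extract_fields(PAGE, FIELD_MAPPINGS)
```

A mapping that matches nothing returns `None` in this sketch, which mirrors a common cause of missing titles or descriptions: the path no longer matches after a page template changes.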

Related: Crawler Settings Overview


Diagnosing content problems

The answer is wrong or outdated

Check:
1. Is the content on your website correct and current?
2. When was the last data sync? (Check Data Sync status)
3. Is the page being crawled? (Check crawler logs)

Fix:
- Update the website content
- Run a data sync to re-index

The AI doesn't know about something

Check:
1. Is the content on a page within crawler scope?
2. Is the page excluded by include/exclude patterns?
3. For uploaded documents, was Index Documents clicked?

Fix:
- Adjust crawler scope to include the content
- Upload the document if it's not web-accessible
- Run indexing after uploads

The AI uses the wrong page

Check:
1. Use Admin Chat Trace Log to see which pages were retrieved
2. Check relevance scores — is the wrong page ranking higher?
3. Is there duplicate or conflicting content?

Fix:
- Add a curated answer to pin the correct source
- Adjust category boosts to favour the right content
- Clean up duplicate content on your website

Search isn't matching user terms

Check:
1. How does the user phrase the question, compared with how the content is written?
2. Are there abbreviations or alternate terms users might use?

Fix:
- Add synonym rules
- Update content to use user-friendly terminology
- Add curated answers for specific phrasings


Best practices

Content quality

  • Clear titles — Page titles should accurately describe content
  • Good structure — Use headings, paragraphs, and lists
  • Direct answers — Put answers near the top of pages, not buried in long text
  • Unique pages — Avoid duplicate content across multiple URLs

Keeping content fresh

  • Scheduled syncs — Set up automatic sync schedules for regular updates
  • Post-publish sync — Run a sync after significant content changes
  • Monitor coverage — Check indexed page counts periodically

Monitoring search quality

  • Review Search Insights — See what users search for and what they click
  • Check failed searches — High-volume queries with low clicks indicate problems
  • Test in Admin Chat — Verify search results for important topics
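
If you can export search data, the "high volume, low clicks" check can be approximated with a short script. The Python sketch below assumes a hypothetical log of (query, clicked) pairs. The log format, the `low_click_queries` function, and both thresholds are illustrative choices, not part of Search Insights.

```python
from collections import Counter

# Hypothetical search log: (query, whether the user clicked a result).
SEARCH_LOG = [
    ("return policy", False),
    ("return policy", False),
    ("return policy", True),
    ("return policy", False),
    ("opening hours", True),
    ("opening hours", True),
    ("warranty", False),
]


def low_click_queries(log, min_searches=3, max_click_rate=0.3):
    """Return (query, searches, click_rate) for frequent queries with few clicks."""
    searches = Counter(query for query, _ in log)
    clicks = Counter(query for query, clicked in log if clicked)
    flagged = []
    for query, count in searches.items():
        rate = clicks[query] / count
        if count >= min_searches and rate <= max_click_rate:
            flagged.append((query, count, rate))
    # Most-searched problem queries first.
    return sorted(flagged, key=lambda row: row[1], reverse=True)


flagged = low_click_queries(SEARCH_LOG)
```

With this sample log, only "return policy" is flagged (4 searches, click rate 0.25). Queries like this are good candidates for synonym rules, curated answers, or new content.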


Back to: Optional Deep-Dives
