What is PageIndex?
Traditional vector-based RAG relies on semantic similarity. However, similarity ≠ relevance — and in retrieval, relevance is what truly matters. While vector-based RAG efficiently identifies broad thematic content, it often fails to retrieve the exact information required, particularly in specialized domains where many sections share similar language but differ in critical details.
Inspired by AlphaGo, we developed PageIndex, a reasoning-based RAG system that simulates how human experts navigate and extract knowledge from long documents through tree search. PageIndex has the following two components:
- PageIndex Tree Generation: generates tree indexes for documents.
- PageIndex Retrieval: conducts tree search for retrieval.
We discuss these two components in detail below.
🌲 PageIndex Tree Generation
PageIndex generates a hierarchical "table of contents" tree that preserves the original document's logical flow and organizational structure. This LLM-optimized "table of contents" enables precise navigation and is ready for reasoning-based RAG.
- No Vector DB Required: Tree structures are represented as lightweight JSON objects, avoiding the overhead and complexity of vector databases.
- No Chunking Required: Preserves natural document structure without artificial text splitting for better context retention.
- Node Summary with Precise Page Referencing: Provides exact page references and summaries for precise information extraction.
- Optimized for Long Documents: Tree generation is optimized for documents that exceed LLM context limits, such as financial reports, legal documents, and technical manuals.
Here is an example output. See more example documents and generated trees.
...
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
...
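Because the tree is plain JSON, it can be inspected with nothing more than a standard JSON parser. The sketch below is illustrative only: it loads a trimmed copy of the example node above and prints it as an indented outline. The helper names `walk` and `outline` are hypothetical, and reading `start_index` and `end_index` as page numbers is an assumption based on the "page referencing" description.

```python
import json

# Trimmed copy of the example node above; summaries stay elided as in the source.
TREE_JSON = """
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
     "start_index": 22, "end_index": 28,
     "summary": "The Federal Reserve's monitoring ..."},
    {"title": "Domestic and International Cooperation and Coordination",
     "node_id": "0008", "start_index": 28, "end_index": 31,
     "summary": "In 2023, the Federal Reserve collaborated ..."}
  ]
}
"""


def walk(node, depth=0):
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)


def outline(root):
    """Render the tree as indented lines with node ids and page ranges."""
    lines = []
    for depth, node in walk(root):
        # Assumption: start_index/end_index are page numbers in the source file.
        pages = f"pp. {node['start_index']}-{node['end_index']}"
        lines.append(f"{'  ' * depth}[{node['node_id']}] {node['title']} ({pages})")
    return lines


tree = json.loads(TREE_JSON)
toc = outline(tree)
print("\n".join(toc))
```

Leaf nodes simply omit the `nodes` key, which is why the traversal uses `node.get("nodes", [])` rather than indexing directly.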
🔎 PageIndex Retrieval
Once documents are transformed into hierarchical tree structures, the PageIndex retrieval module extracts relevant context from these trees. It leverages both LLM-based tree search and value-based tree search to perform efficient and accurate retrieval.
Specifically, given a query and a tree, the retrieval module performs a tree search and returns the most relevant nodes, with relevant paragraphs and corresponding tree search trajectories. The retrieval process has the following properties:
- No Top-K Selection Required: Tree search automatically identifies all relevant tree nodes, avoiding manual parameter tuning and arbitrary cutoffs in retrieval.
- Transparent Node Trajectories: Returns the complete search path through the tree structure, offering transparency and rich contextual information.
- Exact Page References: Every retrieved node includes precise page numbers and locations from the original document for verifiable information retrieval.
- LLM-Ready Output Format: Structured data output with relevant paragraphs and search trajectories, ready for downstream LLM processing.
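To make the idea of a tree search without a top-k cutoff concrete, here is a minimal sketch. It is not the PageIndex retrieval algorithm: the relevance decision is delegated to a caller-supplied `judge` function, and the `keyword_judge` used here is a toy stand-in for what would be an LLM or value-model call in a real system. The names `tree_search`, `judge`, and `keyword_judge`, and the shape of the returned results, are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Node = Dict[str, Any]
Judge = Callable[[str, Node], bool]


def keyword_judge(query: str, node: Node) -> bool:
    """Toy stand-in for a relevance call: any word shared with the title."""
    query_words = set(query.lower().split())
    title_words = set(node["title"].lower().split())
    return bool(query_words & title_words)


def tree_search(root: Node, query: str, judge: Judge) -> List[Dict[str, Any]]:
    """Depth-first search that prunes subtrees the judge rejects.

    Every accepted node with no accepted children is returned, together with
    the trajectory of node ids leading to it. There is no top-k parameter.
    """
    results: List[Dict[str, Any]] = []

    def visit(node: Node, path: List[str]) -> None:
        if not judge(query, node):
            return  # prune: do not descend into this subtree
        path = path + [node["node_id"]]
        found_before = len(results)
        for child in node.get("nodes", []):
            visit(child, path)
        if len(results) == found_before:
            # No child was accepted, so this node is the most specific match.
            results.append({
                "node_id": node["node_id"],
                "title": node["title"],
                "pages": (node["start_index"], node["end_index"]),
                "trajectory": path,
            })

    visit(root, [])
    return results


# Same shape as the tree generation example; summaries omitted for brevity.
tree = {
    "title": "Financial Stability", "node_id": "0006",
    "start_index": 21, "end_index": 22,
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "start_index": 22, "end_index": 28},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008", "start_index": 28, "end_index": 31},
    ],
}

hits = tree_search(tree, "financial vulnerabilities", keyword_judge)
for hit in hits:
    print(hit["trajectory"], hit["title"], hit["pages"])
```

The number of results depends only on how many branches the judge accepts, which is the sense in which no top-k value needs tuning. Swapping `keyword_judge` for a function that prompts a model with the node's title and summary would keep the rest of the traversal unchanged.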
Here is an example response from the PageIndex Retrieval API.
{
  "title": "Monetary Policy and Economic Developments",
  "node_id": "0004",
  "nodes": [
    {
      "title": "March 2024 Summary",
      "node_id": "0005",
      "relevant_contents": [{
        "physical_index": 10,
        "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023..."
      }]
    },
    {
      "title": "June 2023 Summary",
      "node_id": "0006",
      "relevant_contents": [{
        "physical_index": 15,
        "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during..."
      }]
    }]
}
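A response of this shape can be flattened into a prompt-ready context string with page citations. The following is a sketch under assumptions: the field names are taken from the example above only, `physical_index` is assumed to be a page number in the original document, and the helpers `collect_passages` and `build_context` are hypothetical rather than part of any PageIndex client.

```python
# Copy of the example response above; passage text is truncated as in the source.
response = {
    "title": "Monetary Policy and Economic Developments",
    "node_id": "0004",
    "nodes": [
        {
            "title": "March 2024 Summary",
            "node_id": "0005",
            "relevant_contents": [{
                "physical_index": 10,
                "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023...",
            }],
        },
        {
            "title": "June 2023 Summary",
            "node_id": "0006",
            "relevant_contents": [{
                "physical_index": 15,
                "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during...",
            }],
        },
    ],
}


def collect_passages(node, path=()):
    """Yield (title_path, page, text) for every passage in the response tree."""
    path = path + (node["title"],)
    for item in node.get("relevant_contents", []):
        # Assumption: physical_index is the page number in the source document.
        yield path, item["physical_index"], item["relevant_content"]
    for child in node.get("nodes", []):
        yield from collect_passages(child, path)


def build_context(resp):
    """Join passages into one string, each headed by its trajectory and page."""
    blocks = []
    for path, page, text in collect_passages(resp):
        blocks.append(f"[{' > '.join(path)} | page {page}]\n{text}")
    return "\n\n".join(blocks)


passages = list(collect_passages(response))
context = build_context(response)
print(context)
```

Keeping the title path and page next to each passage lets a downstream model cite where a statement came from, and lets a reader check it against the original document.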