What is PageIndex?
Traditional vector-based RAG relies on semantic similarity. However, similarity ≠ relevance. In retrieval, relevance is what matters, and judging relevance requires reasoning. While vector-based RAG efficiently identifies broad thematic content, it often fails to retrieve the exact information required, particularly in specialized domains where many sections share similar language but differ in critical details.
Inspired by AlphaGo, we developed PageIndex, a reasoning-based RAG system that simulates how human experts navigate and extract knowledge from long documents through tree search.
PageIndex consists of three main components:
- PageIndex OCR: converts PDFs to markdown with global structure preserved, ready for tree generation.
- PageIndex Tree Generation: generates hierarchical tree indexes for documents.
- PageIndex Retrieval: conducts retrieval via tree search.
We discuss these components in detail below.
PageIndex OCR
Classic OCR systems analyze each page in isolation: they divide it into blocks, process each block independently, and return a flat, fragmented output with structural errors and no document hierarchy. PageIndex OCR leverages the context window of large vision-language models and treats the entire document as a cohesive, structured whole. It generates accurate page-level markdown content and also preserves the hierarchical organization of content (titles, sections, subsections, bullet lists, tables, references) across page boundaries.
- Accurate Page-level Markdown Content: PageIndex OCR transforms each page into LLM-ready markdown text.
- Preserving Multi-page Structure: PageIndex OCR preserves the hierarchical structure of the whole document, which significantly improves markdown rendering and document representation.
- Fast Processing: PageIndex OCR handles long documents efficiently and scales to long context windows without compromising speed.
PageIndex Tree Generation
PageIndex generates a hierarchical "table of contents" tree that maintains the original document's logical flow and organizational structure. This LLM-optimized "table of contents" enables precise navigation and is ready for reasoning-based RAG.
- No Vector DB Required: Tree structures are represented as lightweight JSON objects, avoiding the overhead and complexity of vector databases.
- No Chunking Required: The natural document structure is preserved without artificial text splitting, which gives better context retention.
- Node Summary with Precise Page Referencing: Each node provides an exact page reference and a summary for precise information extraction.
- Optimized for Long Documents: Tree generation is optimized for financial reports, legal documents, and technical manuals that exceed LLM context limits.
Here is an example output. See more example documents and generated trees.
...
{
  "title": "Financial Stability",
  "node_id": "0006",
  "page_index": 21,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "page_index": 22,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "page_index": 28,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
...
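Because the tree is plain JSON, it can be inspected with ordinary tooling. The following is a minimal sketch, not part of PageIndex itself: it loads the example fragment above and prints it as an indented outline with page references. The helper names `walk` and `outline` are hypothetical, and the code assumes every node carries the `title`, `node_id`, and `page_index` fields shown in the example, with children under `nodes`.

```python
import json

# Example fragment copied from above (summaries are truncated in the source).
tree_json = """
{
  "title": "Financial Stability",
  "node_id": "0006",
  "page_index": 21,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "page_index": 22,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "page_index": 28,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
"""


def walk(node, depth=0):
    """Yield (depth, node) for a node and all of its descendants, depth-first."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)


def outline(tree):
    """Render the tree as indented table-of-contents lines with page references."""
    return [
        f"{'  ' * depth}[{node['node_id']}] {node['title']} (p. {node['page_index']})"
        for depth, node in walk(tree)
    ]


tree = json.loads(tree_json)
toc = outline(tree)
print("\n".join(toc))
```

Leaf nodes in the example omit the `nodes` key, which is why the sketch uses `node.get("nodes", [])` rather than indexing directly.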
PageIndex Retrieval
Once documents are transformed into hierarchical tree structures, the PageIndex retrieval module extracts relevant context from these trees. It leverages both LLM-based tree search and value-based tree search to perform efficient and accurate retrieval.
Specifically, given a query and a tree, the retrieval module performs a tree search and returns the most relevant nodes, with relevant paragraphs and corresponding tree search trajectories. The retrieval process has the following properties:
- No Top-K Selection Required: Tree search automatically identifies all relevant tree nodes, avoiding manual parameter tuning and arbitrary cutoffs in retrieval.
- Transparent Node Trajectories: Returns the complete search path through the tree structure, offering transparency and rich contextual information.
- Exact Page References: Every retrieved node includes precise page numbers and locations from the original document for verifiable information retrieval.
- LLM-Ready Output Format: Structured data output with relevant paragraphs and search trajectories, ready for downstream LLM processing.
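To make the idea of tree search without a top-k cutoff concrete, here is a toy sketch. It is not PageIndex's retrieval algorithm: the LLM-based and value-based relevance judgments are replaced by a simple keyword check (`keyword_judge`, a hypothetical stand-in), and the function and field names (`tree_search`, `trajectory`) are assumptions made for illustration. The search descends only into subtrees judged relevant and records the path of titles that led to each hit.

```python
from typing import Any, Callable, Dict, List, Tuple

Node = Dict[str, Any]

# Example tree from the Tree Generation section (summaries are truncated in the source).
TREE: Node = {
    "title": "Financial Stability",
    "node_id": "0006",
    "page_index": 21,
    "summary": "The Federal Reserve ...",
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "page_index": 22,
            "summary": "The Federal Reserve's monitoring ...",
        },
        {
            "title": "Domestic and International Cooperation and Coordination",
            "node_id": "0008",
            "page_index": 28,
            "summary": "In 2023, the Federal Reserve collaborated ...",
        },
    ],
}


def keyword_judge(query: str, node: Node) -> bool:
    """Hypothetical stand-in for an LLM relevance judgment: any query word in title/summary."""
    text = f"{node.get('title', '')} {node.get('summary', '')}".lower()
    return any(word in text for word in query.lower().split())


def tree_search(
    query: str,
    node: Node,
    judge: Callable[[str, Node], bool],
    path: Tuple[str, ...] = (),
) -> List[Node]:
    """Return every relevant node reached, each with the trajectory of titles leading to it."""
    path = path + (node["title"],)
    if not judge(query, node):
        return []  # prune the whole subtree
    hits: List[Node] = []
    for child in node.get("nodes", []):
        hits.extend(tree_search(query, child, judge, path))
    if not hits:
        # No child was a more specific match, so this node itself is the result.
        hits.append(
            {
                "node_id": node["node_id"],
                "page_index": node["page_index"],
                "trajectory": list(path),
            }
        )
    return hits


results = tree_search("financial vulnerabilities", TREE, keyword_judge)
for hit in results:
    print(hit["node_id"], hit["page_index"], " > ".join(hit["trajectory"]))
```

The number of results is determined by how many branches the judge accepts, not by a fixed k. Swapping `keyword_judge` for a model call would change the quality of the judgments but not the shape of the search.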
Here is an example response from the PageIndex Retrieval API.
{
  "title": "Monetary Policy and Economic Developments",
  "node_id": "0004",
  "nodes": [
    {
      "title": "March 2024 Summary",
      "node_id": "0005",
      "relevant_contents": [
        {
          "page_index": 10,
          "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023..."
        }
      ]
    },
    {
      "title": "June 2023 Summary",
      "node_id": "0006",
      "relevant_contents": [
        {
          "page_index": 15,
          "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during..."
        }
      ]
    }
  ]
}
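A response of this shape can be flattened into a cited context block for a downstream LLM prompt. The sketch below is one possible way to do that, assuming the response follows the structure of the example above (`nodes` for children, `relevant_contents` entries with `page_index` and `relevant_content`). The helpers `collect_passages` and `build_context` are hypothetical names, not part of the PageIndex API.

```python
# Example response copied from above (passages are truncated in the source).
response = {
    "title": "Monetary Policy and Economic Developments",
    "node_id": "0004",
    "nodes": [
        {
            "title": "March 2024 Summary",
            "node_id": "0005",
            "relevant_contents": [
                {
                    "page_index": 10,
                    "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023...",
                }
            ],
        },
        {
            "title": "June 2023 Summary",
            "node_id": "0006",
            "relevant_contents": [
                {
                    "page_index": 15,
                    "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during...",
                }
            ],
        },
    ],
}


def collect_passages(node, path=()):
    """Gather every relevant passage with its page number and search trajectory."""
    path = path + (node["title"],)
    passages = [
        {
            "trajectory": " > ".join(path),
            "page_index": item["page_index"],
            "text": item["relevant_content"],
        }
        for item in node.get("relevant_contents", [])
    ]
    for child in node.get("nodes", []):
        passages.extend(collect_passages(child, path))
    return passages


def build_context(resp):
    """Format the passages as a context string, each prefixed with its citation."""
    return "\n\n".join(
        f"[{p['trajectory']}, page {p['page_index']}]\n{p['text']}"
        for p in collect_passages(resp)
    )


passages = collect_passages(response)
print(build_context(response))
```

Keeping the trajectory and page number next to each passage lets the final answer cite where in the original document a claim came from.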