Overview
Finance Agent requires two main types of financial data:
Earnings Transcripts: Quarterly earnings call transcripts stored with full text and vector embeddings
10-K Filings: Annual SEC filings with structured section extraction and table identification
Both are stored in PostgreSQL with the pgvector extension for semantic search.
Prerequisites
Database Setup
Ensure PostgreSQL with the pgvector extension is configured:
# Set in your .env file
PG_VECTOR=postgresql://username:password@host:5432/database
API Keys
Configure required API keys in .env:
OPENAI_API_KEY=sk-your-key-here  # For embeddings
API_NINJAS_KEY=your-key-here     # For downloading transcripts
S3 Bucket (Optional)
For full transcript storage (enables the transcript viewer):
RAILWAY_BUCKET_ENDPOINT=your-s3-endpoint
RAILWAY_BUCKET_ACCESS_KEY_ID=your-access-key
RAILWAY_BUCKET_SECRET_KEY=your-secret-key
RAILWAY_BUCKET_NAME=your-bucket-name
Ingesting Earnings Transcripts
Step 1: Download Transcripts
Download earnings call transcripts using API Ninjas:
cd agent/rag/data_ingestion
python download_transcripts.py --ticker AAPL
The script also supports downloading multiple tickers at once and restricting the download to a range of quarters; see its command-line help for the exact flags.
Transcripts are saved as JSON files in agent/rag/data_ingestion/earnings_transcripts/.
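A downloaded file can be sanity-checked before ingestion. The field names used below (ticker, year, quarter, transcript) are assumptions about the JSON layout, not the confirmed schema:

```python
import json
from pathlib import Path

def summarize_transcript(record: dict) -> str:
    """One-line summary of a downloaded transcript record.

    The keys (ticker/year/quarter/transcript) are an assumed layout.
    """
    text = record.get("transcript", "")
    return f"{record['ticker']} {record['year']} Q{record['quarter']}: {len(text):,} chars"

# Print a summary for the first downloaded file, if any are present.
for path in sorted(Path("earnings_transcripts").glob("*.json")):
    print(summarize_transcript(json.loads(path.read_text())))
    break
```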
Step 2: Process and Store in Database
Create embeddings and store chunks in PostgreSQL:
python ingest_with_structure.py --ticker AAPL
The script also supports multiple companies and batch processing; see its command-line help for the exact flags.
What happens during ingestion:
1. Text chunking: transcripts are split into 1000-character chunks with 200-character overlap for semantic search.
2. Embedding generation: each chunk is embedded using all-MiniLM-L6-v2 (384-dimensional vectors).
3. Database storage: chunks and embeddings are stored in the transcript_chunks table with metadata (ticker, year, quarter).
4. Hybrid search indexing: both a vector (IVFFlat) index and a keyword (GIN) index are created for hybrid search.
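The chunking scheme (1000-character windows with 200-character overlap) can be sketched as follows. This is an illustrative simplification, not the actual ingest_with_structure.py code, which may also respect sentence boundaries:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks whose edges overlap for context."""
    step = chunk_size - overlap  # advance 800 chars, so adjacent chunks share 200
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # final chunk reached the end of the text
    return chunks

transcript = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(transcript)
# Neighboring chunks share 200 characters of context:
assert chunks[0][-200:] == chunks[1][:200]
```

The overlap keeps a sentence that straddles a chunk boundary fully visible in at least one chunk.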
Step 3: Upload Full Transcripts to S3 (Optional)
For the transcript viewer feature, upload full transcripts to S3 with character offsets:
python ingest_transcripts_to_bucket.py --tickers AAPL MSFT GOOGL
The script also supports selecting companies by market cap and reading from a custom transcript directory; see its command-line help for the exact flags.
This script:
Uploads full transcript text to S3 (transcripts/{ticker}/{year}_Q{quarter}.txt)
Updates complete_transcripts table with bucket_key
Computes char_offset and chunk_length for each chunk to enable precise navigation
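The offset computation can be pictured like this (a hypothetical sketch of the idea; the actual script may normalize whitespace before matching):

```python
def locate_chunks(full_text: str, chunks: list[str]) -> list[dict]:
    """Compute char_offset and chunk_length for each chunk within the full text."""
    positions = []
    search_from = 0
    for chunk in chunks:
        offset = full_text.find(chunk, search_from)  # -1 if not found verbatim
        positions.append({"char_offset": offset, "chunk_length": len(chunk)})
        if offset >= 0:
            search_from = offset + 1  # resolve repeated text in document order
    return positions

full = "Good afternoon. Revenue grew 8% this quarter. Margins also improved."
spans = locate_chunks(full, ["Revenue grew 8%", "Margins also improved."])
```

With these offsets stored per chunk, the viewer can jump to and highlight the exact span of a retrieved chunk inside the full S3 transcript.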
Without S3 upload, the transcript viewer won’t be able to display full transcripts with highlighted chunks. The RAG system will still work using database chunks.
Ingesting 10-K Filings
Download and Process 10-K Filings
The 10-K ingestion pipeline extracts document sections and identifies tables as it processes each filing:
cd agent/rag/data_ingestion
python ingest_10k_to_database.py --ticker AAPL
The script also supports multiple companies, the FinanceBench company set, a lookback period, and parallel processing; the Advanced Options below show several of these flags.
Advanced Options
# Process all US companies (9500+)
python ingest_10k_to_database.py --all-tickers --workers 8
# Resume from specific position
python ingest_10k_to_database.py --all-tickers --skip-first 100
# Limit number of tickers
python ingest_10k_to_database.py --all-tickers --max-tickers 50
# Set timeout per ticker (in minutes)
python ingest_10k_to_database.py --ticker AAPL --timeout 45
# Retry failed tickers
python ingest_10k_to_database.py --all-tickers --retry-failed
10-K Processing Pipeline
The ingestion performs the following document-processing steps:
1. Download from SEC EDGAR: uses the datamule library to fetch 10-K filings for the specified fiscal years.
2. Hierarchical section extraction: identifies SEC sections (Item 1, Item 1A, Item 7, etc.) and preserves document structure.
3. Table extraction: extracts tables and identifies financial statements (balance sheet, income statement, cash flow).
4. Contextual chunking: chunks preserve section context in their metadata for better retrieval.
5. Embedding generation: creates 384-dimensional embeddings using all-MiniLM-L6-v2.
6. Database storage: stores results in three tables:
   - ten_k_chunks: text chunks with embeddings and section metadata
   - ten_k_tables: extracted tables with financial-statement flags
   - complete_sec_filings: full document text with section offsets
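The section-extraction idea can be illustrated with a simplified sketch. This is not the actual pipeline code (which works on the filing's document structure rather than plain regex), but it shows how item headings map to the sec_section keys stored in the database:

```python
import re

# Matches SEC item headings such as "Item 1. Business" or "Item 1A. Risk Factors";
# IGNORECASE also covers all-caps headings like "ITEM 7A."
ITEM_RE = re.compile(r"^Item\s+(\d+[A-Z]?)\.\s*(.*)$", re.MULTILINE | re.IGNORECASE)

def split_sections(filing_text: str) -> dict[str, str]:
    """Map section keys like 'item_1a' to the text between consecutive headings."""
    matches = list(ITEM_RE.finditer(filing_text))
    sections = {}
    for i, m in enumerate(matches):
        key = f"item_{m.group(1).lower()}"
        end = matches[i + 1].start() if i + 1 < len(matches) else len(filing_text)
        sections[key] = filing_text[m.end():end].strip()
    return sections

text = "Item 1. Business\nWe design products.\nItem 1A. Risk Factors\nMarkets are volatile.\n"
sections = split_sections(text)
```

Each chunk produced from a section then carries its key (e.g. "item_1a") in the sec_section column, so retrieval can be filtered to Risk Factors, MD&A, and so on.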
Database Schema
transcript_chunks
CREATE TABLE transcript_chunks (
    id SERIAL PRIMARY KEY,
    ticker VARCHAR(10),
    year INTEGER,
    quarter INTEGER,
    chunk_index INTEGER,
    chunk_text TEXT,
    embedding vector(384),
    char_offset INTEGER,  -- For S3 navigation
    chunk_length INTEGER, -- For S3 navigation
    metadata JSONB
);
ten_k_chunks
CREATE TABLE ten_k_chunks (
    id SERIAL PRIMARY KEY,
    ticker VARCHAR(10),
    fiscal_year INTEGER,
    filing_type VARCHAR(10),
    chunk_index INTEGER,
    chunk_text TEXT,
    embedding vector(384),
    sec_section VARCHAR(50),        -- e.g., "item_1", "item_1a"
    sec_section_title VARCHAR(200), -- e.g., "Business", "Risk Factors"
    chunk_type VARCHAR(50),         -- e.g., "text", "table"
    path_string TEXT,               -- Hierarchical path
    metadata JSONB
);
ten_k_tables
CREATE TABLE ten_k_tables (
    id SERIAL PRIMARY KEY,
    table_id VARCHAR(100) UNIQUE,
    ticker VARCHAR(10),
    fiscal_year INTEGER,
    content TEXT,
    table_data JSONB,
    is_financial_statement BOOLEAN,
    statement_type VARCHAR(50), -- e.g., "balance_sheet", "income_statement"
    priority VARCHAR(20)        -- e.g., "HIGH", "NORMAL"
);
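For reference, a semantic-search query against transcript_chunks might look like the sketch below. The <=> operator is pgvector's cosine-distance operator (assuming the IVFFlat index was built with vector_cosine_ops); the literal-formatting helper is hypothetical, not part of the project:

```python
def embedding_literal(vec: list[float]) -> str:
    """Render a Python list as a pgvector input literal, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(f"{v:g}" for v in vec) + "]"

# Parameterized query (psycopg2-style placeholders): the query embedding is
# passed as a vector literal and compared with pgvector's cosine distance.
SEMANTIC_SEARCH_SQL = """
SELECT chunk_text,
       embedding <=> %(query_vec)s::vector AS distance
FROM transcript_chunks
WHERE ticker = %(ticker)s AND year = %(year)s AND quarter = %(quarter)s
ORDER BY distance
LIMIT 5;
"""

params = {
    "query_vec": embedding_literal([0.12, -0.05, 0.3]),  # real vectors have 384 dims
    "ticker": "AAPL",
    "year": 2025,
    "quarter": 1,
}
```

The hybrid search mentioned above would additionally score the GIN-indexed keyword match and combine both rankings; this sketch shows only the vector side.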
Verification
After ingestion, verify the data:
Check Transcript Data
-- Count chunks per ticker
SELECT ticker, year, quarter, COUNT(*) AS chunk_count
FROM transcript_chunks
GROUP BY ticker, year, quarter
ORDER BY year DESC, quarter DESC;

-- Verify embeddings
SELECT COUNT(*) AS total_chunks,
       COUNT(embedding) AS chunks_with_embeddings
FROM transcript_chunks;
Check 10-K Data
-- Count 10-K chunks per company
SELECT ticker, fiscal_year, COUNT(*) AS chunk_count
FROM ten_k_chunks
GROUP BY ticker, fiscal_year
ORDER BY ticker, fiscal_year DESC;

-- Check financial statements
SELECT ticker, fiscal_year, statement_type, COUNT(*)
FROM ten_k_tables
WHERE is_financial_statement = true
GROUP BY ticker, fiscal_year, statement_type
ORDER BY ticker, fiscal_year DESC;
Test Search
# Test the RAG agent
python -c "
import asyncio
from agent.rag.rag_agent import RAGAgent
agent = RAGAgent()
result = asyncio.run(agent.execute_rag_flow(
'What was AAPL revenue in Q1 2025?',
conversation_id='test'
))
print(result['answer'])
"
Troubleshooting
Common Issues:
Missing pgvector extension: run CREATE EXTENSION vector; in PostgreSQL
API key errors: verify OPENAI_API_KEY and API_NINJAS_KEY in .env
Memory issues during 10-K ingestion: reduce --workers or increase --timeout
Duplicate data: the scripts automatically skip already-processed filings
Performance Tips:
Parallel workers: use --workers 4-8 for faster processing
Batch size: process 50-100 tickers at a time for optimal memory usage
Resume capability: use --skip-first N to resume interrupted batch jobs
Timeout: set --timeout 45 for companies with large 10-K filings
Next Steps
After ingestion:
Configure RAG settings for optimal performance
Learn how to ask effective questions
Set up authentication for production deployment