Vector search transforms text into high-dimensional numerical representations (embeddings) that capture semantic meaning. Documents with similar meanings have vectors that are close together in vector space.
Analogy: Imagine plotting words in 3D space where “dog” and “puppy” are close together, but “dog” and “car” are far apart. Vector search works in 384 dimensions instead of 3.
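The "close together" idea is just vector similarity. A toy cosine-similarity computation makes it concrete (the 3D vectors below are made up for illustration, not real embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (invented for illustration)
dog   = [0.90, 0.80, 0.10]
puppy = [0.85, 0.75, 0.15]
car   = [0.10, 0.20, 0.90]

print(cosine_similarity(dog, puppy))  # close to 1.0
print(cosine_similarity(dog, car))    # much lower
```

Real embedding models do the same comparison, just in 384 dimensions instead of 3.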
FAISS (Facebook AI Similarity Search) is a library for efficient similarity search over large collections of dense vectors:
```python
from langchain_community.vectorstores import FAISS

# Create vector store from documents
vectorstore = FAISS.from_documents(docs, embeddings)

# Convert to retriever for RAG pipeline
retriever = vectorstore.as_retriever()
```
Why FAISS?
⚡ Speed: Searches millions of vectors in milliseconds
📊 Scalability: Handles datasets that don’t fit in RAM
🎯 Accuracy: Multiple index types optimized for precision/speed tradeoffs
💰 Cost: Open-source and runs locally (no API costs)
```python
from langchain_community.document_loaders import PyPDFLoader

# Load a student's CV
loader = PyPDFLoader("CV_Estudiante_4_Fernanda_Paredes.pdf")
docs = loader.load()
print(docs[0].page_content[:200])
```
Output:
```
FERNANDA PAREDES
Data Analyst Trainee
[email protected] | +51 912 345 678 | Lima, Perú
PERFIL DE ESTUDIANTE
Estudiante de 9no ciclo con interés en Desarrollo de Software y Datos.
Manejo de herramientas como Python...
```
LangChain automatically splits documents into manageable chunks:
```python
# Automatic chunking by PyPDFLoader:
# each PDF page becomes a Document object
for i, doc in enumerate(docs):
    print(f"Page {i}: {len(doc.page_content)} characters")
```
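Per-page splitting is what PyPDFLoader gives you out of the box; RAG pipelines often re-split pages into smaller, overlapping chunks. The idea can be sketched with a fixed-size splitter (a simplified stand-in for LangChain's text splitters, not their actual implementation):

```python
def split_text(text, chunk_size=100, overlap=20):
    """Split text into fixed-size chunks, each overlapping the previous one."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks

doc = "".join(str(i % 10) for i in range(250))  # dummy 250-character document
chunks = split_text(doc)
print([len(c) for c in chunks])  # [100, 100, 90, 10]
```

The overlap means a sentence cut at a chunk boundary still appears whole in the next chunk, which helps retrieval quality.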
HuggingFace converts each chunk into a 384-dimensional vector:
```python
# What happens under the hood:
text = "Estudiante con experiencia en Python y FastAPI"
vector = embeddings.embed_query(text)

print(f"Vector dimensions: {len(vector)}")
# Output: 384

print(f"First 5 values: {vector[:5]}")
# Output: [0.023, -0.145, 0.267, -0.089, 0.334]
```
Understanding the 384 Dimensions
The 384 dimensions don't map one-to-one onto human-readable features; the model learns a distributed representation in which combinations of dimensions encode semantic properties such as:

Technical skills mentioned
Experience level
Domain (backend vs. frontend)
Soft skills

The model learned these patterns from training on millions of sentence pairs.
```python
# Create searchable index
vectorstore = FAISS.from_documents(
    documents=docs,        # List of Document objects
    embedding=embeddings,  # HuggingFaceEmbeddings instance
)

# FAISS builds an index structure for fast retrieval
print(f"Indexed {vectorstore.index.ntotal} vectors")
```
```python
# Traditional approach
keywords = ["API", "development", "experience"]
matches = []
for cv in cvs:
    score = sum(1 for kw in keywords if kw in cv)
    matches.append((cv, score))
```
Results:
| CV | Contains "API"? | Contains "development"? | Score |
|------|----|----|---|
| CV_1 | ❌ | ✅ | 1 |
| CV_2 | ✅ | ❌ | 1 |
| CV_3 | ❌ | ❌ | 0 |
Problem: CV_3 says "Built RESTful web services with FastAPI" yet scores 0, because keyword matching only sees exact terms and never recognizes that this phrase describes API development.
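The miss is easy to reproduce with an exact-phrase check (the CV line is the one quoted above):

```python
# Illustrative CV line from the example above
cv_3 = "Built RESTful web services with FastAPI"

# An exact-phrase check never fires on the paraphrase
print("API development" in cv_3)  # False
```

No amount of keyword tuning fixes this class of miss; it requires a representation that captures meaning, which is exactly what embeddings provide.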
```python
# Semantic approach
query = "Students with API development experience"
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
relevant_docs = retriever.invoke(query)
```
Results:
| CV | Similarity Score | Matched Text |
|------|------|----------------------------------------------|
| CV_3 | 0.89 | "Built RESTful web services with FastAPI" |
| CV_2 | 0.85 | "Created API endpoints for financial management" |
| CV_1 | 0.72 | "Developed backend using Spring Boot" |
Success: CV_3 ranks highest because the embedding understands that “RESTful web services” is semantically equivalent to “API development”.
FAISS offers multiple index types for different use cases:
```python
# Default: Flat index (exact search, slower but accurate)
vectorstore = FAISS.from_documents(docs, embeddings)

# For larger datasets, you can use approximate search:
import faiss

d = 384      # embedding dimensions
nlist = 100  # number of clusters

# Create IVF index (faster, slight accuracy tradeoff);
# the raw faiss constructor takes positional arguments
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
```
Index Types Comparison
| Index Type | Speed | Accuracy | Best For |
|-------|-----------|------|------------------|
| Flat | Slow | 100% | < 10k vectors |
| IVF | Fast | ~95% | 10k – 1M vectors |
| HNSW | Very Fast | ~99% | > 1M vectors |
The system uses the Flat index since we're dealing with a small CV database (< 1,000 candidates).
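The IVF idea from the table — cluster the vectors once, then scan only the clusters nearest the query — can be sketched in plain NumPy. This is a simplified illustration of the principle, not the FAISS implementation (real IVF trains centroids with k-means rather than sampling them):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 8)).astype("float32")  # toy dataset

# 1. "Train": pick nlist vectors as cluster centroids (FAISS uses k-means)
nlist = 10
centroids = vectors[rng.choice(len(vectors), nlist, replace=False)]

def assign(v):
    """Index of the nearest centroid by L2 distance."""
    return int(np.argmin(((centroids - v) ** 2).sum(axis=1)))

# 2. Build inverted lists: cluster id -> ids of vectors in that cluster
inverted = {i: [] for i in range(nlist)}
for idx, v in enumerate(vectors):
    inverted[assign(v)].append(idx)

# 3. Query: scan only the vectors in the nprobe nearest clusters
def search(query, k=3, nprobe=2):
    dists = ((centroids - query) ** 2).sum(axis=1)
    probe = np.argsort(dists)[:nprobe]
    candidates = [i for c in probe for i in inverted[c]]
    cand_d = ((vectors[candidates] - query) ** 2).sum(axis=1)
    return [candidates[i] for i in np.argsort(cand_d)[:k]]

res = search(vectors[0])
print(res[0])  # 0 — a vector is its own nearest neighbor
```

The speedup comes from step 3: instead of comparing against all 1,000 vectors, the query touches only the couple of hundred that live in the probed clusters, at the cost of possibly missing a neighbor assigned to an unprobed cluster.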
```python
import glob
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS

# 1. LOAD ALL CVs
all_docs = []
cv_files = glob.glob("cvs_estudiantes_final/*.pdf")
print(f"Loading {len(cv_files)} CVs...")

for cv_path in cv_files:
    loader = PyPDFLoader(cv_path)
    docs = loader.load()
    # Add source metadata
    for doc in docs:
        doc.metadata["source"] = cv_path.split("/")[-1]
    all_docs.extend(docs)

print(f"Total documents: {len(all_docs)}")

# 2. CREATE UNIFIED VECTOR STORE
vectorstore = FAISS.from_documents(all_docs, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# 3. SEMANTIC SEARCH
query = "Estudiantes con experiencia en desarrollo de APIs RESTful"
results = retriever.invoke(query)

# 4. DISPLAY RESULTS
for result in results:
    print(f"\nSource: {result.metadata['source']}")
    print(f"Content: {result.page_content[:150]}...")
```
Output:
```
Loading 5 CVs...
Total documents: 5

Source: CV_Estudiante_4_Fernanda_Paredes.pdf
Content: • Creación de una API RESTful para gestión financiera usando Python y FastAPI...

Source: CV_Estudiante_2_Ximena_Rios.pdf
Content: • Automatización de reportes en Excel usando scripts de Python y Pandas...

Source: CV_Estudiante_3_Nicolas_Paredes.pdf
Content: • Implementación de base de datos relacional normalizada para e-commerce...
```
```python
# Save to disk
vectorstore.save_local("cv_index")

# Load from disk (much faster than re-indexing)
from langchain_community.vectorstores import FAISS

loaded_vectorstore = FAISS.load_local(
    "cv_index",
    embeddings,
    allow_dangerous_deserialization=True,  # Required for pickle loading
)
retriever = loaded_vectorstore.as_retriever()
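To get that benefit automatically, a common pattern is to load the saved index when it exists on disk and rebuild only when it doesn't. A minimal sketch of the pattern (`load_or_build` and its callable parameters are placeholder names you'd adapt; with FAISS, `build_fn` would wrap `FAISS.from_documents` and `load_fn` would wrap `FAISS.load_local`):

```python
import os

def load_or_build(index_dir, build_fn, load_fn):
    """Return a vector store, loading from disk when a saved index exists."""
    if os.path.isdir(index_dir):
        return load_fn(index_dir)   # e.g. FAISS.load_local(...)
    store = build_fn()              # e.g. FAISS.from_documents(...)
    store.save_local(index_dir)     # persist for the next run
    return store
```

The first run pays the embedding cost once; every later run is a fast deserialization.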
Benefits:
⚡ Skip re-embedding (saves minutes for large datasets)