@_avichawla: A simple technique makes RAG ~...
@_avichawla
Aug 04, 2025
1
A simple technique makes RAG ~32x more memory-efficient!
- Perplexity uses it in its search index
- Azure uses it in its search pipeline
- HubSpot uses it in its AI assistant
Let's understand how to use it in RAG systems (with code):
2
Today, let's build a RAG system that queries 36M+ vectors in <30ms using Binary Quantization.
Tech stack:
- @llama_index for orchestration
- @milvusio as the vector DB
- @beam_cloud for serverless deployment
- @Kimi_Moonshot Kimi-K2 as the LLM hosted on Groq
Let's build it!
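Quick sanity check on the ~32x figure: a float32 embedding spends 32 bits per dimension, while a binary embedding spends 1 bit. Here is a rough back-of-the-envelope sketch. The 1024-dim embedding size and the `embedding_bytes` helper are assumptions for illustration (the thread does not state the dimension), and index overhead is ignored.

```python
def embedding_bytes(num_vectors: int, dim: int, bits_per_dim: int) -> int:
    """Raw storage for the vectors alone, ignoring any index overhead."""
    return num_vectors * dim * bits_per_dim // 8


NUM_VECTORS = 36_000_000  # the "36M+" scale mentioned in the thread
DIM = 1024                # ASSUMPTION: dimension is not given in the thread

float32_bytes = embedding_bytes(NUM_VECTORS, DIM, 32)  # 32 bits per dimension
binary_bytes = embedding_bytes(NUM_VECTORS, DIM, 1)    # 1 bit per dimension
ratio = float32_bytes / binary_bytes

print(f"float32: {float32_bytes / 1e9:.1f} GB")
print(f"binary:  {binary_bytes / 1e9:.1f} GB")
print(f"ratio:   {ratio:.0f}x")
```

The ratio is 32x whatever dimension you pick, since it only depends on bits per dimension.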
3
Here's the workflow:
- Ingest documents and generate binary embeddings.
- Create a binary vector index and store embeddings in the vector DB.
- Retrieve top-k similar documents to the user's query.
- The LLM generates a response grounded in the retrieved context.
Let's implement this!
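To make the retrieval steps above concrete, here is a minimal pure-Python sketch of binary quantization plus Hamming-distance search. It is a stand-in, not the thread's actual code: in the stack above, Milvus would store the binary vectors and run the search. The `binarize`, `hamming`, and `top_k` helpers, the "greater than zero" threshold, and the toy 4-dim vectors are all assumptions for illustration.

```python
def binarize(vec):
    """Map each dimension to 1 if positive, else 0, and pack the bits into an int."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a, b):
    """Number of bit positions where two packed binary vectors differ."""
    return bin(a ^ b).count("1")


def top_k(query_vec, index, k=2):
    """Return the k document ids closest to the query by Hamming distance."""
    q = binarize(query_vec)
    ranked = sorted(index.items(), key=lambda item: hamming(q, item[1]))
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy float embeddings (hypothetical values, not from a real model).
docs = {
    "doc_a": [0.9, -0.2, 0.4, -0.7],
    "doc_b": [-0.5, 0.3, -0.1, 0.8],
    "doc_c": [0.6, -0.1, 0.2, 0.5],
}

# Ingest: quantize every document embedding once and store only the bits.
index = {doc_id: binarize(vec) for doc_id, vec in docs.items()}

# Query: quantize the query the same way, then rank by Hamming distance.
result = top_k([0.8, -0.3, 0.5, -0.6], index, k=2)
print(result)  # ['doc_a', 'doc_c']
```

Hamming distance is just an XOR followed by a bit count, which is why search over binary vectors can be so cheap. The retrieved documents would then be passed to the LLM as context.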
11
7️⃣ Run the app
Beam launches the container and deploys our Streamlit app as an HTTPS server that can be easily accessed from a web browser.
Check this demo 👇
12
Next, to assess scale and inference speed, we tested the deployed setup on the PubMed dataset (36M+ vectors).
Our app:
- queried 36M+ vectors in <30ms.
- generated a response in <1s.
Check this demo 👇
13
Done!
We just built a fast RAG stack that uses Binary Quantization (BQ) for efficient retrieval and
serverless deployment for our AI workflow.
Here's the workflow again for your reference 👇
14
That's a wrap!
If you found it insightful, reshare it with your network.
Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.