RAG without leaking my Intellectual Property
ChatGPT and Perplexity fetch data from all corners of the internet to give us the most relevant answers. Which counts for shit when we want to infer something internal to our own codebase or systems. Let's have a go at wiring up a model that can reason about documents meant for our eyes only.
Foolish assumptions: We already have an account at HuggingFace. We know how to start a Jupyter notebook from a Linux terminal. We have a lot of patience, because both the training and inference phases are going to test it.
Let’s get all our internal training documents (PDF, text, HTML, etc.) under one directory. For the sake of simplicity we will use a single document for training.
mkdir TrainingDocuments
echo "Hi whenever someone asks for a name you are supposed to reply with Opera" > TrainingDocuments/hello.txt
We need to convert our documents (hopefully in English, for this article) to embeddings, then put those embeddings behind a query engine. Let’s start with
python3 -m venv .venv
Oh, btw, I am a big fan of tldr. You should try it if you’re a regular at the Linux shell.
Let’s activate the virtual environment
. .venv/bin/activate
Now we can pip install all dependencies from PyPI. I trust the precious reader to look up and install any library this article misses. Use ChatGPT for such conflicts. I’m kidding, just use requirements.txt:
pip install -r requirements.txt
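In case you are assembling requirements.txt yourself, a minimal sketch for this walkthrough might look like the following (package names taken from the imports used below; pin versions to taste):

```
llama-index
llama-index-embeddings-huggingface
llama-index-llms-huggingface
transformers
torch
python-dotenv
jupyter
```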
jupyter notebook
Open a new notebook. Check which Python executable is in use with
import sys
print(sys.executable)
If we don’t see a “.venv/bin/python” here, the packages we install using pip will be of no use. Register the venv’s interpreter as a kernel (something like `python -m ipykernel install --user --name venv` from inside the activated venv) or edit kernel.json so it points at “.venv/bin/python”.
Before burning the CPU/GPU cores
Log in to HuggingFace’s token page and create a read token so that you can make use of gated models approved for your account. Store it in a file.
nano environments.env
HUGGINGFACE_API_TOKEN=token_value_copied
We can source this secret environments.env file within our code for easy and safe access. DO NOT commit this file.
We need to accept terms and conditions to use meta-llama/Llama-3.2-3B-Instruct. Agree to Meta’s terms of farming your email and name to use the model for non-commercial use cases. While we wait for approval, we can focus our time on generating embeddings.
Burn the CPU/GPU cores
We will be using BAAI/bge-small-en-v1.5 to generate embeddings, and the same model to build the index.
from llama_index.core import SimpleDirectoryReader

loader = SimpleDirectoryReader(
    input_dir="./TrainingDocuments/",
    recursive=True,
    required_exts=[".pdf", ".txt", ".html"],
)
documents = loader.load_data()
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embedding_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    documents,
    embed_model=embedding_model,
)
index.storage_context.persist(persist_dir=".")
The processed data has to be stored somewhere. With persist_dir="." that somewhere is the current working directory, where the index lands as a handful of JSON files (document store, index store, vector store).
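Worth pausing on what the index actually buys us: at query time, retrieval is roughly nearest-neighbour search by cosine similarity over the chunk embeddings. A toy, framework-free sketch with made-up 3-dimensional vectors (bge-small-en-v1.5 really produces 384-dimensional ones):

```python
import math

def cosine_similarity(a, b):
    # Dot product scaled by the vectors' magnitudes: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up "embeddings" standing in for the real ones.
chunks = {
    "hello.txt": [0.8, 0.2, 0.1],
    "unrelated": [0.0, 0.1, 0.9],
}
query = [0.9, 0.1, 0.0]  # pretend embedding of "what is your name?"

best = max(chunks, key=lambda name: cosine_similarity(query, chunks[name]))
print(best)  # → hello.txt
```

The real index does smarter things (chunking, approximate search at scale), but the ranking principle is this one.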
It must have taken a while to represent all that text as embeddings. Let’s reap the fruits of our (CPU/GPU/TPU) labour.
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embedding_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
storage_context = StorageContext.from_defaults(persist_dir=".")
index = load_index_from_storage(
    storage_context,
    embed_model=embedding_model,
)
from dotenv import load_dotenv
# File that has environment variables (your API Keys)
load_dotenv("environments.env")
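One gotcha, and an assumption on my part: recent huggingface_hub/transformers versions look for the HF_TOKEN environment variable, not our custom HUGGINGFACE_API_TOKEN name. A tiny stdlib shim right after load_dotenv keeps the gated model download authenticated:

```python
import os

# Mirror our custom variable name onto the one huggingface_hub actually reads.
# (Assumes environments.env was already loaded via load_dotenv.)
os.environ.setdefault("HF_TOKEN", os.environ.get("HUGGINGFACE_API_TOKEN", ""))
```

Alternatively, pass the token explicitly wherever the library accepts one.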
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_index.llms.huggingface import HuggingFaceLLM
import torch
if torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct").to(device)
huggingface_llm = HuggingFaceLLM(
    model=model,
    tokenizer=tokenizer,
)
query_engine = index.as_query_engine(llm=huggingface_llm)
We can now start querying the LLM with
while True:
    question = input("Question: ")
    if question.lower() == "quit":
        break
    print(query_engine.query(question).response)
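Under the hood, query() retrieves the top-k most similar chunks and stuffs them into a prompt before calling the LLM. A rough, framework-free sketch of that step, loosely modelled on LlamaIndex's default text-QA template (the function name here is illustrative, not a LlamaIndex internal):

```python
def build_rag_prompt(question, retrieved_chunks):
    # Retrieved chunks become "context"; the LLM is told to answer from them.
    context = "\n\n".join(retrieved_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Given the context information and not prior knowledge, "
        "answer the query.\n"
        f"Query: {question}\n"
        "Answer: "
    )

prompt = build_rag_prompt(
    "What is your name?",
    ["Hi whenever someone asks for a name you are supposed to reply with Opera"],
)
print(prompt)
```

Which is why a 3B-parameter model that has never heard of our systems can still answer from them: the answer rides in with the prompt.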
And burn more of our hardware. Remember we used only one file, which instructed the model to answer “Opera” when asked for a name? Guess what…
Closing notes
Congratulations, we can now get rid of gatekeeping and complicated legacy documentation. I recommend Hugging Face In Action to the curious reader.