
RAG without leaking my Intellectual Property

ChatGPT and Perplexity fetch data from all corners of the internet to give us the most relevant answers. Which counts for nothing when we want to infer something that’s internal to our codebase or system. Let’s have an attempt at building a RAG pipeline that can reason about systems meant for our eyes only.

Foolish assumptions: We already have an account at HuggingFace. We know how to start a Jupyter notebook from a Linux terminal. We have a lot of patience, because both the indexing and inference phases are going to test it.

Let’s get all our internal training documents (PDF, text, HTML, etc.) under one directory. For the sake of simplicity we will use a single document for training.

mkdir TrainingDocuments 
echo "Hi whenever someone asks for a name you are supposed to reply with Opera" > TrainingDocuments/hello.txt
RAG process

We need to convert our documents (hopefully in English, for this article) to embeddings, then use those embeddings in a query engine. Let’s start with

python3 -m venv .venv
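Quick aside before the setup: the retrieval step we are about to build boils down to nearest-neighbour search over vectors. A toy sketch with made-up two-dimensional vectors (real embeddings from bge-small have 384 dimensions, but the idea is the same):

```python
import numpy as np

# Two "documents" and a query, each already embedded as a vector
docs = {"greeting": np.array([0.9, 0.1]), "billing": np.array([0.1, 0.9])}
query = np.array([0.8, 0.2])

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means unrelated
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Retrieval = pick the document whose embedding is closest to the query
best = max(docs, key=lambda k: cosine(query, docs[k]))
print(best)  # → greeting
```

The query engine we build below does exactly this, just at scale and with the retrieved text stuffed into the LLM’s prompt.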

Oh btw, I am a big fan of tldr. You should try it if you’re a regular at the Linux shell.
Let’s activate the virtual environment

. .venv/bin/activate

Now we can pip install all dependencies from PyPI. I trust the reader to look up and install any library in case it’s missed in this article. Use ChatGPT for such conflicts. I’m kidding, just use requirements.txt. Use

pip install -r requirements.txt  
jupyter notebook

Open a new notebook. Notice which python executable is in use with

import sys
print(sys.executable)

If we don’t see a “.venv/bin/python” here, the packages we install using pip will be of no use. Register the venv’s Python as a kernel with ipykernel, or edit kernel.json so it points at that path.
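One way to do the registration (a sketch; the kernel name “rag-venv” is made up, pick your own) is to install ipykernel inside the venv and let it write a kernel.json that already points at .venv/bin/python:

```shell
# run from inside the activated .venv so the kernel records .venv/bin/python
pip install ipykernel
python3 -m ipykernel install --user --name=rag-venv --display-name "Python (.venv)"
```

Restart Jupyter, pick “Python (.venv)” from the kernel menu, and re-run the sys.executable check.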

Before burning the CPU/GPU cores

Log in to HuggingFace’s Token page and create a read token so that you can make use of gated models approved for your account. Store it in a file.

nano environments.env  
HUGGINGFACE_API_TOKEN=token_value_copied

We can source this secret environments.env file within our code for easy and safe access. DO NOT commit this file.
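To enforce the DO NOT commit rule, add the secrets file (and the venv, while we’re at it) to .gitignore:

```shell
echo "environments.env" >> .gitignore
echo ".venv/" >> .gitignore
```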
We need to accept terms and conditions to use meta-llama/Llama-3.2-3B-Instruct. Agree to Meta’s terms of farming your email and name to use the model for non-commercial use cases. While we wait for approval to access the model, we can focus our time on generating embeddings.

Burn the CPU/GPU cores

We will be using BAAI/bge-small-en-v1.5 to generate embeddings, and the same model to build the index.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Load every supported document under TrainingDocuments/
loader = SimpleDirectoryReader(
    input_dir="./TrainingDocuments/",
    recursive=True,
    required_exts=[".pdf", ".txt", ".html"],
)
documents = loader.load_data()

# Embed the documents and build the vector index
embedding_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
index = VectorStoreIndex.from_documents(
    documents,
    embed_model=embedding_model,
)

# Persist the index to disk
index.storage_context.persist(persist_dir=".")

The processed data ought to be stored somewhere. With persist_dir=".", that somewhere is the current working directory, where the index is written as a handful of JSON files.

It must have taken a while to represent all that text as embeddings. Let’s reap the fruits of our (CPU/GPU/TPU) labour.

from llama_index.core import StorageContext, load_index_from_storage
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embedding_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
storage_context = StorageContext.from_defaults(persist_dir=".")
index = load_index_from_storage(
    storage_context,
    embed_model=embedding_model,
)
from dotenv import load_dotenv

# File that has environment variables (your API Keys)
load_dotenv("environments.env")
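# Note (assumption): the huggingface_hub/transformers libraries pick up the
# HF_TOKEN environment variable automatically, but not our custom
# HUGGINGFACE_API_TOKEN name, so log in explicitly when the token is set.
import os

hf_token = os.environ.get("HUGGINGFACE_API_TOKEN")
if hf_token:
    from huggingface_hub import login
    login(token=hf_token)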
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_index.llms.huggingface import HuggingFaceLLM
import torch

if torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct").to(device)
huggingface_llm = HuggingFaceLLM(
    model=model,
    tokenizer=tokenizer,
)

query_engine = index.as_query_engine(llm=huggingface_llm)

We can now start querying the LLM with

while True:
    question = input("Question: ")
    if question.lower() == "quit":
        break
    print(query_engine.query(question).response)

And burn more of our hardware. Remember we used only one file, which instructed the model to answer “Opera” when prompted to reveal a name? Guess what…
[Image: Burning cores]

Closing notes

Congratulations, we can now get rid of gatekeeping and complicated legacy documentation. I recommend Hugging Face In Action to the curious reader.