This work aims to present a comparative analysis of data retrieval from static pages using Large Language Models (LLMs) and a Retrieval-Augmented Generation (RAG) architecture. The study seeks to understand the nuances and challenges involved in choosing the most effective model for this task, highlighting the importance of robust and standardized evaluation methods.
To that end, this repository demonstrates a pipeline for retrieving information from static web pages using Large Language Models (LLMs) combined with a Retrieval-Augmented Generation (RAG) architecture. It supports experimentation with both Ollama-served models and HuggingFace-hosted models, and includes automated evaluation of generated answers using the RAGAS framework.
- Prerequisites
- Installation
- Configuration Flags
- Architecture
- How It Works
- Input & Output Specifications
- Cleaning Up & Unloading Models
- Notes & Tips
## Prerequisites

- Python 3.12.3
- CUDA-enabled GPU (optional, but recommended for large HuggingFace & Ollama models)
- An Ollama installation (if using Ollama-served models)
- A valid OpenAI API key in your environment, named `OPENAI_API_KEY` (used for RAGAS evaluation)
## Installation

```bash
# Clone this repository
git clone https://github.com/MoonHawlk/tcc.git
cd tcc

# Install required Python packages
pip install uv
uv pip install langchain transformers torch unstructured langchain-huggingface chromadb pandas ragas datasets
```
## Configuration Flags

- `usar_ollama`
  - Type: boolean
  - Default: `True`
  - Description:
    - `True`: use Ollama-served models.
    - `False`: use HuggingFace models via a local pipeline.
- `usar_few_shot`
  - Type: boolean
  - Default: `False`
  - Description:
    - `True`: prepend a few-shot prompt template to each query (requires manual maintenance of examples).
    - `False`: use a zero-shot refine chain.
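As a rough illustration, the two flags above might sit at the top of the script as plain module-level booleans (the names come from this README; the layout itself is an assumption):

```python
# Illustrative layout of the configuration flags documented above
usar_ollama = True      # True: Ollama-served model; False: local HuggingFace pipeline
usar_few_shot = False   # True: prepend a few-shot prompt template; False: zero-shot refine chain
```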
## Architecture

[Config] → [LLM Init] → [Loader & Split] → [Embeddings & VectorStore] → [QA Chain] → [CSV I/O]
This linear pipeline ingests configuration flags, initializes the LLM, retrieves and preprocesses documents, builds a similarity search index, answers queries in bulk, and writes outputs to CSV.
## How It Works

### LLM Initialization

- HuggingFace Mode (`usar_ollama = False`):
  - Detect GPU availability.
  - Load `Qwen/Qwen2.5-Math-1.5B` with half-precision (float16) onto GPU/CPU.
  - Configure generation hyperparameters (temperature, top_k, top_p, etc.).
  - Wrap it in a LangChain `HuggingFacePipeline` for downstream use.
  - Run a quick "Olá, mundo!" sanity check.
- Ollama Mode (`usar_ollama = True`):
  - Define a shortlist of Ollama models (e.g., `llama3.1:latest`, `zephyr:latest`).
  - Select `modelo_ollama` (default: `zephyr:latest`).
  - Instantiate `langchain.llms.Ollama` with a low temperature for conservative answers.

Either way, the script prints an initialization message indicating which path was taken. A condensed sketch of both paths follows.
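The sketch below assumes recent LangChain module layouts (the README refers to `langchain.llms.Ollama`; import paths vary by version), and the generation hyperparameters are illustrative rather than the script's actual values:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_huggingface import HuggingFacePipeline
from langchain_community.llms import Ollama

usar_ollama = True

if usar_ollama:
    # Ollama mode: low temperature for conservative answers
    llm = Ollama(model="zephyr:latest", temperature=0.1)
else:
    # HuggingFace mode: Qwen2.5-Math-1.5B in half precision, on GPU if available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-1.5B")
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-Math-1.5B", torch_dtype=torch.float16
    ).to(device)
    gen = pipeline(
        "text-generation", model=model, tokenizer=tokenizer,
        max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95,
    )
    llm = HuggingFacePipeline(pipeline=gen)
    llm.invoke("Olá, mundo!")  # quick sanity check

print("LLM initialized via", "Ollama" if usar_ollama else "HuggingFace")
```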
### Document Loading & Splitting

- URL Selection
  - Default target: the GPT-4.5 Wikipedia page (`https://en.wikipedia.org/wiki/GPT-4.5`).
- Loading
  - Use `UnstructuredURLLoader` to fetch the raw page text.
  - Error out if loading fails.
- Splitting
  - Instantiate `CharacterTextSplitter` with:
    - `separator=" "`
    - `chunk_size=512`
    - `chunk_overlap=50`
  - Split the raw document into semantically coherent paragraphs/chunks.

The script logs the number of chunks generated.
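In LangChain terms, the loading and splitting steps above look roughly like this (a sketch; the actual script may differ in details such as error handling):

```python
from langchain_community.document_loaders import UnstructuredURLLoader
from langchain.text_splitter import CharacterTextSplitter

url = "https://en.wikipedia.org/wiki/GPT-4.5"

# Fetch the raw page text and fail loudly if nothing comes back
docs = UnstructuredURLLoader(urls=[url]).load()
if not docs:
    raise RuntimeError(f"Failed to load {url}")

# Split into overlapping chunks using the parameters documented above
splitter = CharacterTextSplitter(separator=" ", chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(docs)
print(f"{len(chunks)} chunks generated")
```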
### Embeddings & Vector Store

- Embeddings Model
  - By default: `sentence-transformers/all-MiniLM-L6-v2`.
- Indexing
  - Create a Chroma vector store from the document chunks.
  - Wrap it as a retriever with `search_kwargs={"k": 3}` for top-3 similarity results.
### Few-Shot Prompting (Optional)

When `usar_few_shot = True`:

- Define a small set of `[question, context, answer]` examples.
- Build a `FewShotPromptTemplate` that prefixes each query.
- Pass this template into the QA chain's `chain_type_kwargs`.

This helps steer the model but requires manual upkeep.
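Constructing such a template might look like the sketch below; the examples are placeholders, and the prompt variable names must match whatever the chosen chain type expects:

```python
from langchain.prompts import FewShotPromptTemplate, PromptTemplate

# Hand-maintained examples (placeholders; the real set lives in the script)
examples = [
    {
        "question": "What is GPT-4.5?",
        "context": "GPT-4.5 is a large language model developed by OpenAI.",
        "answer": "A large language model developed by OpenAI.",
    },
]

example_prompt = PromptTemplate(
    input_variables=["question", "context", "answer"],
    template="Question: {question}\nContext: {context}\nAnswer: {answer}",
)

few_shot_prompt = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
    prefix="Answer the question using only the given context.",
    suffix="Question: {question}\nContext: {context}\nAnswer:",
    input_variables=["question", "context"],
)
```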
### QA Chain

- Use `RetrievalQA.from_chain_type` with:
  - `llm`: the initialized LLM
  - `chain_type="refine"` for an iterative answer-refinement approach
  - `retriever`: the Chroma retriever
  - `return_source_documents=True` to capture provenance
  - Optional `prompt` override for few-shot
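Put together, the chain construction is roughly as follows (a sketch reusing the `llm` and `retriever` objects from the earlier steps):

```python
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="refine",            # iterative answer refinement over retrieved chunks
    retriever=retriever,
    return_source_documents=True,   # keep provenance for the evaluation step
    # chain_type_kwargs={...},      # optional few-shot prompt override when usar_few_shot is True
)
```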
### Bulk QA over CSV

- Inputs
  - `perguntas.csv` must contain a column header `"pergunta"`.
- Loop
  - Read all questions into a list.
  - For each question, invoke `qa_chain` with `{"query": pergunta}` and collect `result` (the answer) and `source_documents` (the contexts).
- Outputs
  - Append the answers to the DataFrame as a `"resposta"` column.
  - Write the table to `respostas.csv`.
  - Log the total QA time.
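A sketch of this question loop with pandas (the column and file names come from the spec above; everything else is illustrative):

```python
import time
import pandas as pd

df = pd.read_csv("perguntas.csv")            # must contain a "pergunta" column
respostas, contextos = [], []

inicio = time.time()
for pergunta in df["pergunta"].tolist():
    saida = qa_chain.invoke({"query": pergunta})
    respostas.append(saida["result"])
    # Keep the retrieved chunks as plain text for RAGAS later
    contextos.append([doc.page_content for doc in saida["source_documents"]])

df["resposta"] = respostas
df.to_csv("respostas.csv", index=False)
print(f"QA finished in {time.time() - inicio:.1f}s")
```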
### RAGAS Evaluation

After QA is complete:

- Environment Setup
  - Load `.env` and retrieve `OPENAI_API_KEY`.
- Judge LLM
  - Instantiate `ChatOpenAI(model_name="gpt-4.1-nano-2025-04-14")` for evaluation.
- Wrappers
  - Wrap both the judge LLM and the embeddings in RAGAS-compatible adapters.
- Dataset Preparation
  - Build a HuggingFace `Dataset` from the questions, answers, and captured contexts.
- Metrics
  - Evaluate using `faithfulness` and `answer_relevancy`.
  - Save results to `scores_ragas.csv`.
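A sketch of the evaluation step, assuming the common RAGAS/LangChain wrapper classes plus the `python-dotenv` and `langchain-openai` packages (neither is in the install command above, so treat these imports as assumptions); it reuses `df`, `contextos`, and `embeddings` from the earlier sketches:

```python
from dotenv import load_dotenv
from datasets import Dataset
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import faithfulness, answer_relevancy

load_dotenv()  # expects OPENAI_API_KEY in .env or the environment

# Judge LLM and embeddings wrapped in RAGAS-compatible adapters
judge_llm = LangchainLLMWrapper(ChatOpenAI(model_name="gpt-4.1-nano-2025-04-14"))
judge_emb = LangchainEmbeddingsWrapper(embeddings)

# Dataset built from questions, answers, and captured contexts
dataset = Dataset.from_dict({
    "question": df["pergunta"].tolist(),
    "answer": df["resposta"].tolist(),
    "contexts": contextos,
})

scores = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
    llm=judge_llm,
    embeddings=judge_emb,
)
scores.to_pandas().to_csv("scores_ragas.csv", index=False)
```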
## Input & Output Specifications

| File | Role |
|---|---|
| `perguntas.csv` | Input table with a `pergunta` column. |
| `respostas.csv` | QA answers appended to the original table. |
| `scores_ragas.csv` | RAGAS evaluation scores. |
## Cleaning Up & Unloading Models

After processing, the script defines and calls:

```python
import subprocess

def unload_model(model_name: str):
    subprocess.run(["ollama", "stop", model_name], check=True)
```

This ensures GPU/CPU memory is freed before the evaluation step.
## Notes & Tips

- Switching Models: Toggle `usar_ollama` to compare local vs. Ollama-served models.
- Chunk Parameters: Adjust `chunk_size` and `chunk_overlap` based on document length.
- Retriever Tuning: Change `k` in `search_kwargs` to retrieve more or fewer context chunks.
- Prompt Engineering: Use `usar_few_shot` judiciously; it can improve accuracy at the cost of maintenance.
- Evaluation Budget: RAGAS evaluation may incur OpenAI API usage; monitor `timeout` and `max_workers` settings.