CHAPTER 3
REFLECTIONS
3.1 Solutions
The research-oriented internship at Samsung R&D Institute – Bangalore posed a number of
technical and operational challenges, many of which were addressed through structured
experimentation, mentorship support, and the application of best practices in AI development.
The following solutions were implemented to successfully complete the assigned tasks:
1. Custom Dataset Creation
Problem Addressed: Lack of existing training datasets specific to promotional offers.
Solution:
• Developed a custom dataset comprising over 13,000 smartphone promotional offers
by web scraping commercial sources.
• Ensured consistency in format, logical phrasing, numeric accuracy, and brand-
specific constraints using normalization and data-cleaning routines.
• Applied manual tagging and filtering to enhance contextual relevance and instruction-
following quality.
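As an illustration of the normalization and cleaning described above, the sketch below assumes the scraped offers have been collected into a pandas DataFrame with a hypothetical offer_text column; the exact cleaning rules used in the project are not reproduced here.

```python
import re
import pandas as pd

def normalize_offer(text: str) -> str:
    """Normalize whitespace and currency notation in a scraped offer string."""
    text = re.sub(r"\s+", " ", text).strip()              # collapse stray whitespace
    text = text.replace("Rs.", "₹").replace("INR", "₹")   # unify currency symbols
    return text

def clean_offers(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicates and keep only offers that mention a concrete ₹ amount."""
    df = df.drop_duplicates(subset="offer_text").copy()
    df["offer_text"] = df["offer_text"].map(normalize_offer)
    df = df[df["offer_text"].str.contains(r"₹\s*\d", regex=True)]
    return df.reset_index(drop=True)
```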
2. Model-wise Evaluation and Selection
Problem Addressed: Difficulty in determining which fine-tuned model performed best.
Solution:
• Conducted structured evaluation of each model (Mistral-7B-instruct-v0.1/v0.2/v0.3,
LLaMA-3.1-8B, LLaMA-3.2-1B, LLaMA-3.2-3B) using consistent benchmarks.
• Metrics included: accuracy, perplexity, instruction adherence, context retention,
response time, and token efficiency.
• Selected Mistral-7B-instruct-v0.3 as the best-performing model based on its 93.2% accuracy and 94.2% instruction adherence.
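For illustration, perplexity (one of the metrics listed above) can be estimated with the Hugging Face transformers library as sketched below; the checkpoint name and sample text are placeholders rather than the exact evaluation setup used.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"  # placeholder; substitute any fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="auto")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(mean negative log-likelihood of the tokens)."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

print(perplexity("Get ₹1,250 cashback on a minimum purchase of ₹30,000."))
```

Lower values indicate more confident token predictions, which is why the 7.5 perplexity of Mistral-7B-instruct-v0.3 reported later in Table 3.1 stands out.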
3. Infrastructure and Execution Strategy
Problem Addressed: Limited compute availability for training large models.
Solution:
• Used Samsung’s lab GPU server for high-memory, resource-heavy training tasks via
secure Tailscale access.
• Conducted evaluation and inference using Google Colab, which helped separate
training and testing workflows.
• Enabled 4-bit quantization with BitsAndBytes to reduce memory usage and speed up
inference.
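A minimal sketch of 4-bit loading with BitsAndBytes for inference is shown below; the checkpoint name, prompt, and generation settings are illustrative and not the exact configuration used on the lab server or in Colab.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights to cut GPU memory
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for faster inference
    bnb_4bit_use_double_quant=True,
)

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

prompt = "Generate a cashback offer for a smartphone with a minimum purchase of ₹30,000."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```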
4. Fine-Tuning Optimization with Hugging Face AutoTrain
Problem Addressed: Managing complex training configurations and multiple model
checkpoints.
Solution:
• Executed Hugging Face AutoTrain locally to allow full control over fine-tuning
parameters (epochs, batch size, warm-up ratio, learning rate scheduler, gradient
norm).
• Implemented parameter-efficient fine-tuning techniques such as:
o Low-Rank Adaptation (LoRA)
o Prefix-tuning
o Adapter layers
• Ensured model reproducibility with defined config files and logging.
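For illustration, the kind of LoRA configuration applied during parameter-efficient fine-tuning can be expressed with the peft library as sketched below; the rank, scaling factor, and target modules are illustrative defaults, not the exact values used in the AutoTrain runs.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",  # placeholder checkpoint
    torch_dtype="auto",
)

lora_config = LoraConfig(
    r=16,                                  # low-rank dimension (illustrative)
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections commonly adapted
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()         # only the adapter weights are trainable
```

Only the low-rank adapter matrices are updated during training, which is what keeps fine-tuning a 7B-parameter model feasible on a single GPU.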
5. Prompt Engineering and Output Validation
Problem Addressed: Inconsistent model responses or failure to follow prompt
instructions.
Solution:
• Designed and iteratively refined structured prompts for each model during evaluation.
• Evaluated generated outputs against the original dataset for instruction compliance and semantic accuracy.
• Used custom formulas to validate generated offer values (e.g., Discount = (50 × Minimum Purchase Value) × K), enabling precise checks of rule adherence during testing.
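For illustration, a structured evaluation prompt of the kind described above could take the form sketched below; the wording and fields are hypothetical, as the exact prompts used during evaluation are not reproduced in this report.

```python
# Hypothetical evaluation-prompt template; the field names and wording are assumptions.
PROMPT_TEMPLATE = (
    "### Instruction:\n"
    "Generate a promotional offer for {product} with a minimum purchase value of ₹{min_purchase}.\n"
    "Follow the format: '₹X cashback on minimum purchase of ₹Y'.\n"
    "### Response:\n"
)

def build_prompt(product: str, min_purchase: int) -> str:
    """Fill the template with concrete values for one test case."""
    return PROMPT_TEMPLATE.format(product=product, min_purchase=f"{min_purchase:,}")

print(build_prompt("a Galaxy smartphone", 30000))
```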
These solutions collectively contributed to building a scalable, accurate, and instruction-
aligned system for personalized promotional content generation using fine-tuned LLMs.
3.2 Experimental Results and Model Evaluation Tables
Table 3.1 Model Comparison Matrix – Mistral vs. LLaMA (Fine-Tuned Variants)
Table 3.1 presents a comprehensive evaluation of six fine-tuned instruction-based LLMs,
including Mistral-7B-instruct variants (v0.1, v0.2, v0.3) and LLaMA models (LLaMA-3.1-8B,
LLaMA-3.2-1B, LLaMA-3.2-3B). The models are compared across eight key performance
metrics that reflect their suitability for real-time, instruction-aligned promotional offer generation
tasks.
Key Observations:
1. Accuracy:
• Mistral-7B-instruct-v0.3 achieved the highest accuracy (93.2%), indicating superior
ability to generate correct, structured outputs.
• LLaMA-3.2-1B recorded the lowest accuracy (88.7%), highlighting the trade-off made for its speed.
2. Perplexity:
• The lowest perplexity score (7.5) was observed in Mistral-7B-instruct-v0.3,
demonstrating confident and fluent token predictions.
• LLaMA-3.2-1B exhibited the highest perplexity (10.2), implying less confident token predictions.
3. Latency & Response Time:
• LLaMA-3.2-1B offered the lowest latency (89 ms) and the fastest response time (3.08 s), making it ideal for speed-sensitive scenarios.
• Mistral-7B-instruct-v0.1 showed the highest latency (120 ms) and the slowest response time (5.84 s).
4. Token Efficiency:
• Mistral-7B-instruct-v0.3 achieved the highest token efficiency (91%), indicating efficient use of input tokens for generating relevant content.
• LLaMA-3.2-1B had the lowest token efficiency (84%).
5. Throughput:
• LLaMA-3.2-1B led with the highest throughput (72 tokens/sec), making it suitable for batch-generation use cases.
• LLaMA-3.1-8B had the lowest throughput (48 tokens/sec), reflecting its computational overhead.
6. Memory & FLOPs:
• LLaMA-3.1-8B required the most memory (18 GB) and compute (5.4T FLOPs), which may limit scalability.
• LLaMA-3.2-1B was the most resource-efficient, using only 9 GB of memory and 2.1T FLOPs.
Table 3.2 Output Matching Evaluation of Fine-Tuned Models Against Dataset
Model            | Response (discount, min. purchase)   | Accuracy  | Response Time (s)
Llama-3.1-8B     | ₹250 (₹7,500 min)                    | Correct   | 4.91
Llama-3.2-1B     | ₹2,500 (₹1,25,000 min)               | Incorrect | 3.08
Llama-3.2-3B     | ₹2,500 (₹75,000 min)                 | Incorrect | 2.28
Mistral-7B-v0.1  | ₹1,200 (₹15,000 min)                 | Correct   | 5.84
Mistral-7B-v0.2  | ₹250 (₹1,25,000 min, 9M EMI)         | Correct   | 5.07
Mistral-7B-v0.3  | ₹1,250 (₹30,000 min, Non-EMI)        | Correct   | 4.76
Table 3.2 presents a qualitative and quantitative evaluation of how closely the generated outputs
from fine-tuned models aligned with real promotional content in the dataset. The aim was to
validate whether the models could replicate human-like, logically structured offers based on
minimal input prompts.
Objective:
The primary objective of this evaluation was to assess the ability of fine-tuned models—
specifically Mistral-7B and LLaMA variants—to generate promotional offers that matched
ground truth entries in terms of discount logic, numeric thresholds, and structural coherence.
Methodology:
• A subset of test prompts from the original dataset was provided to each fine-tuned model.
• Generated outputs were then compared to real promotional entries using two main
criteria:
1. Instruction Adherence: Did the model follow format rules like “Get ₹X off on Y”
or “₹X cashback on minimum purchase of ₹Y”?
2. Numerical Accuracy: Were the values for discounts, purchase limits, and bonus
conditions correctly derived from the rules?
• A formula was also used to calculate the offer value, e.g.:
Discount = (50 × Minimum Purchase Value) × K
where:
o K ≈ 2.5 for EMI transactions
o K ≈ 1.5 for Non-EMI transactions
o an additional ₹150–₹250 applies for long-term EMI (9+ months)
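A minimal sketch of the two checks described above is given below; the regular expressions are approximations of the stated format rules, and the discount formula and K values are taken verbatim from this section.

```python
import re

# Criterion 1: approximate patterns for the offer formats named above.
OFFER_PATTERNS = [
    re.compile(r"Get ₹[\d,]+ off on .+"),
    re.compile(r"₹[\d,]+ cashback on minimum purchase of ₹[\d,]+"),
]

def follows_format(offer_text: str) -> bool:
    """Criterion 1: does the generated offer match one of the expected templates?"""
    return any(p.search(offer_text) for p in OFFER_PATTERNS)

def expected_discount(min_purchase: float, emi: bool, long_term_emi: bool = False) -> float:
    """Criterion 2: expected offer value per the formula above,
    Discount = (50 × Minimum Purchase Value) × K."""
    k = 2.5 if emi else 1.5          # K values as stated in this section
    discount = 50 * min_purchase * k
    if long_term_emi:                # 9+ month EMI adds ₹150–₹250
        discount += 200              # midpoint of the stated range (assumption)
    return discount
```

A tolerance-based comparison of the ₹ amount parsed from a model's output against expected_discount could then yield Correct/Incorrect labels of the kind reported in Table 3.2; the exact tolerance applied during testing is not specified here.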
Analysis:
• Mistral-7B-instruct-v0.3 demonstrated both logical and contextual alignment, generating
a Non-EMI-based discount with precise calculation and condition adherence.
• Llama-3.1-8B showed reliable performance with correct formatting and value estimation.
• Llama-3.2-1B and Llama-3.2-3B generated disproportionately large discount values, indicating weak adherence to conditional logic.
• Mistral-7B-v0.1 and v0.2 produced correctly formatted and context-aware responses,
showing model maturity in understanding instructions and context.
3.3 Screenshots
Figure 3.1 AutoTrain Interface for LLM SFT (Supervised Fine-Tuning) on Hugging Face
Figure 3.1 shows the AutoTrain interface on Hugging Face used for supervised fine-tuning (SFT)
of large language models, highlighting model selection, parameter configuration, and dataset
mapping options.
Figure 3.2 Tokenized dataset preview used for model training, after applying data cleaning and formatting.
Figure 3.2 shows the tokenized dataset preview used for model training, after applying data
cleaning and formatting to structure prompt–response pairs effectively.
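As an illustration of how such a tokenized preview can be produced, the sketch below assembles hypothetical prompt-response pairs into a single text field and tokenizes them with the datasets and transformers libraries; the column names and template are assumptions, not the exact format shown in the figure.

```python
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")  # placeholder checkpoint

# Hypothetical cleaned records with prompt/response columns.
records = [
    {"prompt": "Generate a cashback offer for a smartphone with a minimum purchase of ₹30,000.",
     "response": "₹1,250 cashback on minimum purchase of ₹30,000 (Non-EMI)."},
]

def to_text(example):
    """Concatenate prompt and response into the single text field used for SFT."""
    example["text"] = f"### Instruction:\n{example['prompt']}\n### Response:\n{example['response']}"
    return example

dataset = Dataset.from_list(records).map(to_text)
tokenized = dataset.map(lambda e: tokenizer(e["text"], truncation=True, max_length=512))
print(tokenized[0]["input_ids"][:10])  # preview of the first token IDs
```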
[Figure 3.3 panels, one inference screenshot per model: Mistral-7B v0.1, Mistral-7B v0.2, Mistral-7B v0.3, Llama-3.2-1B, Llama-3.2-3B, Llama-3.1-8B]
Figure 3.3 Google Colab inference results for prompt testing, highlighting model response quality and
formatting adherence.
Figure 3.3 shows Google Colab inference results for prompt testing, highlighting the response
quality, accuracy, and formatting adherence of the fine-tuned models.