
CHAPTER 2

TASK PERFORMED

2.1 Learning Experiences


The internship at Samsung Research Institute – Bangalore (SRI-B), conducted under the PRISM
(Preparing and Inspiring Student Minds) initiative, provided a rigorous, hands-on introduction to
the process of building, fine-tuning, and evaluating instruction-based Large Language Models
(LLMs). This experience enabled direct exposure to enterprise-grade model development
environments and workflows.
Initially, the learning curve was steep due to the technical complexity involved in fine-tuning
transformer-based LLMs. Without prior experience in training high-parameter models,
understanding the intricacies of model architecture, configuration parameters, and the impact of
different quantization techniques was particularly challenging. Furthermore, managing multiple
model variants such as Mistral-7B (versions v0.1, v0.2, and v0.3) and Meta’s LLaMA series
(3.1–8B, 3.2–1B, 3.2–3B) added the further challenge of comparing performance fairly
across these configurations.
A significant part of the learning involved gaining proficiency with Hugging Face’s AutoTrain
framework, which was executed locally rather than via the Hugging Face platform. This
involved configuring the training pipeline manually, including epochs, learning rate schedules,
quantization types (e.g., 4-bit low precision), gradient norms, warm-up ratios, and checkpoint
intervals, all while keeping memory usage within bounds.
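To illustrate the kind of configuration this entails, the following is a minimal sketch using the
Hugging Face Transformers and bitsandbytes APIs. All hyperparameter values are placeholders
chosen for illustration, not the settings actually used during the internship.

# Illustrative sketch only: hyperparameter values are placeholders,
# not the actual settings used during the internship.
import torch
from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
                          TrainingArguments)

# 4-bit low-precision loading via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",   # one of the variants studied
    quantization_config=bnb_config,
    device_map="auto",
)

# Manually configured training pipeline: epochs, LR schedule,
# warm-up ratio, gradient norm, and checkpoint interval
training_args = TrainingArguments(
    output_dir="offer-finetune",
    num_train_epochs=3,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    max_grad_norm=1.0,
    save_steps=500,                  # checkpoint interval
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # keeps peak memory usage low
)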
Additionally, the internship required secure remote access to Samsung’s internal training
infrastructure, which was achieved using Tailscale. Learning to operate within this environment
demanded a working knowledge of virtual environments, CUDA GPU resource allocation, and
distributed training using frameworks like DeepSpeed.
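As a rough sketch of how DeepSpeed plugs into this stack, the Hugging Face Trainer accepts a
DeepSpeed configuration directly; the ZeRO stage and values below are illustrative assumptions,
not Samsung’s internal settings.

# Illustrative only: a minimal DeepSpeed ZeRO stage-2 configuration.
# The actual SRI-B distributed-training setup is not reproduced here.
from transformers import TrainingArguments

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {"stage": 2},  # shard optimizer state and gradients
}

args = TrainingArguments(
    output_dir="offer-finetune",
    deepspeed=ds_config,  # Trainer forwards this dict to DeepSpeed
)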
Despite these challenges, consistent guidance from Samsung researchers, structured
documentation, and weekly review sessions helped accelerate the learning process. The outcome
was a progressively deeper understanding of end-to-end model training and deployment
pipelines, grounded in both theory and application.


2.2 Knowledge Acquired


The internship provided a broad and in-depth understanding of the technical, operational, and
methodological aspects of LLM fine-tuning for real-world tasks. Key areas of knowledge
acquired include:
1. Fine-Tuning of Instruction-Based LLMs:
o Gained hands-on experience in fine-tuning models like Mistral-7B and LLaMA
using web-crawled datasets of over 13,000 real promotional offers.
o Implemented parameter-efficient tuning strategies such as Low-Rank Adaptation
(LoRA), prefix tuning, and adapter layers (a minimal LoRA sketch appears at the
end of this section).
o Understood and applied 4-bit quantization using BitsAndBytes to reduce memory
usage and enhance computational efficiency.
2. Dataset Engineering:
o Learned to collect raw promotional data via automated web crawling tools such as
Selenium and BeautifulSoup4.
o Applied data-cleaning techniques to remove noise, duplicates, and formatting
inconsistencies.
o Split datasets into training, validation, and testing subsets (80:10:10) for robust
model evaluation.
3. Model Evaluation and Benchmarking:
o Developed proficiency in evaluating model performance using metrics such as the
following (a perplexity sketch appears at the end of this section):
▪ Accuracy (% of correct outputs)
▪ Perplexity (confidence in next-token prediction)
▪ Response Time (in seconds)
▪ Instruction Adherence (%)
▪ Context Retention (%)
▪ Token Efficiency (% of useful tokens used)
▪ Latency and Throughput (tokens/sec)
o Compared model versions using structured evaluation matrices to determine the
best model for deployment. Mistral-7B-instruct-v0.3 emerged as the top-
performing model with 93.2% accuracy and 94.2% instruction adherence.


4. Infrastructure and Deployment:
o Understood the process of setting up local training environments using virtual
environments, PyTorch, Hugging Face Transformers, AutoTrain Advanced, and
CUDA.
o Leveraged Tailscale for remote access to secure Samsung R&D servers.
o Used Google Colab for final output validation and prompt testing due to its
flexible GPU runtime.
5. Research-Oriented Skills:
o Understood how to structure research workflows for reproducibility and
scalability.
o Learned to write structured training algorithms and code for fine-tuning
configurations.
o Acquired foundational knowledge in prompt engineering and its role in model
behavior optimization.
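As referenced in item 1 above, the following is a minimal LoRA sketch using the peft library;
the rank, alpha, and target modules are common illustrative defaults, not the exact values used
during the internship.

# Hypothetical LoRA configuration (illustrative defaults only)
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projection matrices
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)  # base_model: a loaded causal LM
model.print_trainable_parameters()               # only adapter weights are trained

Similarly, as referenced in item 3, perplexity can be derived from the model’s average
cross-entropy loss. This sketch assumes a Hugging Face causal LM and is not the internship’s
actual evaluation script.

# Perplexity = exp(mean cross-entropy loss) over a text sample
import math
import torch

def perplexity(model, tokenizer, text):
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())  # lower = more confident next-token prediction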

2.3 Skills Learned


The internship under Samsung PRISM fostered a wide range of technical and analytical skills
essential for working in the domain of AI and Natural Language Processing (NLP). These
include:
1. AI Model Fine-Tuning with Hugging Face AutoTrain:
o Developed the capability to configure and fine-tune instruction-based large
language models (LLMs) such as Mistral-7B and LLaMA variants using Hugging
Face’s AutoTrain framework (executed locally).
o Gained experience in manually tuning hyperparameters such as learning rate,
number of epochs, warm-up ratio, quantization configuration (e.g., 4-bit), and
gradient clipping norms.
2. Data Engineering and Preprocessing:
o Learned to build a structured, custom dataset of over 13,000 smartphone-related
promotional offers from raw, unstructured web data using tools such as Selenium
and BeautifulSoup (a minimal crawling-and-splitting sketch appears at the end of
this section).
o Applied cleaning, deduplication, normalization, and tokenization strategies to
ensure dataset quality and consistency.
3. Model Evaluation and Metrics Benchmarking:
o Became proficient in evaluating models against key performance indicators such as:
▪ Accuracy
▪ Perplexity
▪ Instruction Adherence
▪ Response Time
▪ Token Efficiency
▪ Latency (ms)
▪ Throughput (tokens/sec)
▪ Context Retention
o Built structured comparison matrices to identify optimal models.
4. Infrastructure & Tooling:
o Used secure remote access tools like Tailscale to connect to Samsung's internal
GPU-enabled lab servers.
o Implemented models in virtualized Python environments with dependencies
including transformers, accelerate, bitsandbytes, and DeepSpeed.
5. Prompt Engineering:
o Acquired the ability to design and validate input prompts tailored for specific
outcomes, such as generating promotional offers with contextual and numerical
precision (an example template appears at the end of this section).
6. Collaboration and Research Communication:
o Learned to document training logs, evaluation results, and insights in a format
suitable for both technical presentations and academic papers.
o Engaged in weekly sync-ups with mentors for progress reviews and technical
guidance.
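As referenced in item 2, the sketch below illustrates the crawling, cleaning, and 80:10:10
splitting workflow; it uses requests in place of Selenium for brevity, and the URL and CSS
selector are hypothetical placeholders, not the sources actually crawled.

# Hypothetical crawling-and-splitting sketch; URL and selector are placeholders.
import random
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/offers")       # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")
offers = [el.get_text(strip=True) for el in soup.select(".offer-card")]

# Basic cleaning: drop empty entries and duplicates, normalize whitespace
offers = sorted({" ".join(o.split()) for o in offers if o})

# 80:10:10 split into train / validation / test subsets
random.seed(42)
random.shuffle(offers)
n = len(offers)
train = offers[: int(0.8 * n)]
val = offers[int(0.8 * n): int(0.9 * n)]
test = offers[int(0.9 * n):]

And as referenced in item 5, an instruction prompt template of the kind used for offer
generation might look like the following; the wording and placeholders are illustrative, not the
actual internship prompt.

# Illustrative instruction prompt template for offer generation
OFFER_PROMPT = (
    "### Instruction:\n"
    "Generate a promotional offer for the product below. "
    "Keep all discount percentages and prices numerically consistent.\n"
    "### Input:\nProduct: {product}\nDiscount: {discount}%\n"
    "### Response:\n"
)
prompt = OFFER_PROMPT.format(product="Galaxy S24", discount=15)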


2.4 The Most Challenging Task Performed


The most demanding and technically intensive task during the internship was designing,
curating, and validating a completely novel, high-volume dataset for fine-tuning LLMs in the
domain of promotional offer generation.
Key Challenges:
• Dataset Design: The internship required creating a domain-specific dataset from scratch,
involving over 13,000 unique promotional offers across varied formats and product types.
The complexity lay in ensuring uniformity, relevance, and semantic richness across
entries.
• Model Compatibility: Adapting multiple models (e.g., Mistral and LLaMA series) to this
dataset proved difficult, as each model demanded specific input formatting, tokenization
styles, and batch size optimizations.
• Training Constraints: Operating in a constrained compute environment meant limited
access to high-memory GPUs. Training had to be memory-efficient, which required
integrating low-bit quantization (4-bit) and reducing batch sizes without compromising
performance.
• Evaluation Complexity: Comparing multiple models across several metrics, while
ensuring each metric was evaluated fairly and consistently, required scripting custom
evaluation routines and setting up structured validation pipelines (a simplified sketch
follows at the end of this section).
Despite these hurdles, the task resulted in the successful training and benchmarking of multiple
LLM variants, and the development of an efficient pipeline for dynamic, instruction-based offer
generation.
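A simplified sketch of such a custom evaluation routine is shown below; the metric functions,
the generate_offer method, and the candidate list are hypothetical stand-ins for the actual
validation pipeline.

# Simplified sketch of a structured evaluation routine; metric_fns and
# generate_offer are hypothetical stand-ins for the real pipeline.
def evaluate_models(candidates, test_set, metric_fns):
    """Build a model-by-metric comparison matrix."""
    matrix = {}
    for name, model in candidates.items():
        outputs = [model.generate_offer(example) for example in test_set]
        matrix[name] = {m: fn(outputs, test_set) for m, fn in metric_fns.items()}
    return matrix

# Example selection rule: pick the candidate with the highest accuracy
# best = max(matrix, key=lambda name: matrix[name]["accuracy"])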

2.5 Problems Identified


Throughout the internship, several key challenges and bottlenecks were encountered that
required technical resolution or strategic trade-offs:
1. Memory Constraints:
o High memory usage during model training and checkpoint storage was a recurring
issue. Efficient memory handling became necessary, especially when training
7B+ parameter models.

o Solution: Employed 4-bit quantization and gradient accumulation to reduce
memory load.
2. Model Compatibility Issues:
o Some models (e.g., specific LLaMA variants) were not fully compatible with
default AutoTrain configurations.
o Solution: Custom fine-tuning scripts and manual parameter overrides were
implemented to support these models.
3. Dataset Standardization:
o Maintaining a consistent format across thousands of offer templates proved
challenging.
o Solution: Implemented rigorous preprocessing steps for token alignment,
placeholder normalization, and semantic validation.
4. Inference Bottlenecks:
o Resource limitations on platforms like Google Colab caused execution
slowdowns during inference benchmarking, particularly for batch evaluations.
o Solution: Split inference jobs and ran serial evaluations to bypass batch size
limits (a serial-inference sketch appears after this list).
5. Instruction Adherence and Prompt Variability:
o Models occasionally ignored detailed instructions or provided structurally
incorrect outputs.
o Solution: Applied prompt engineering techniques and reinforced training on well-
structured examples to improve model alignment.
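As referenced in item 4, serial evaluation can be as simple as generating one prompt at a time
while recording per-prompt latency. The sketch below assumes a Hugging Face causal LM and
tokenizer and is illustrative only.

# Serial (one-at-a-time) inference sketch to sidestep batch-size limits;
# illustrative only, assuming a Hugging Face causal LM and tokenizer.
import time
import torch

def run_serial(model, tokenizer, prompts, max_new_tokens=128):
    results = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        start = time.perf_counter()
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        latency = time.perf_counter() - start  # per-prompt latency in seconds
        text = tokenizer.decode(out[0], skip_special_tokens=True)
        results.append((text, latency))
    return results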
