Unibud

Uploaded by

abha bhardwaj

Q1.

A city government aims to improve urban traffic flow and reduce congestion in
key areas. By analyzing real-time traffic data collected from sensors, traffic
cameras, and GPS devices installed in vehicles, the government seeks to
predict traffic congestion during peak hours and adjust traffic signal timings
accordingly. As part of an engineering course, you will walk through the data
science life cycle to develop a predictive model.

Q2.
Clarify the concept of Business Intelligence (BI) and its role in
decision-making within an organization. Discuss various BI tools and
techniques used to transform raw data into meaningful insights. Additionally,
explain how BI integrates with Data Science and how both fields complement
each other in modern business analytics.

Q3.
Netflix is one of the largest streaming services today. Should Netflix use
business intelligence tools to enhance the user experience and expand its user
base, or would data science methods be more appropriate? Provide your views
and recommend tools for the task.

Q4.
Consider that you have been hired by a company that is working with weather
data. You have been asked to collect data about cyclones for the past 100 years
and develop a model. As a data scientist describe in detail the process that you
will follow in developing a model for weather forecasting.

Q5.
Explain the various tools and skills required for data scientists.

Q6.
Explain the significance of Data Science and its applications.

Q7.
You're part of a team at VITCodeGuardian, a company managing a large,
complex code repository that's accumulated technical debt, including issues
like outdated dependencies, duplicate code, and inconsistent practices. Your
task is to use data science to analyze the codebase, reduce technical debt, and
improve code quality. How would you approach this, from extracting and
processing the code data to identifying trends in complexity, duplication, and
dependencies?
Explain the tools and algorithms you would use to recommend refactoring
areas, and propose a machine learning model to predict future bugs or
performance issues based on historical data. How would your approach
complement traditional code reviews, and how could this evolve into an
automated system for ongoing code quality monitoring?

Q8.
Suppose you are a data scientist working for a major airline company. Your
team has been tasked with leveraging data science techniques to improve
various aspects of the airline system. How would you utilize data science in
this scenario, and what specific benefits could it bring to the airline?

Q9.
Assume that you are a data scientist taking up a project for a phone
manufacturing company. As a data scientist, your first step is collecting and
pre-processing the data. Explain in detail how you would pre-process the data,
including handling missing values, encoding categorical features, and
normalizing the data.

Q10.
Suppose a transportation company hires you to optimize route planning and
reduce delivery times. How would you utilize the data science process to
analyze route data, identify traffic patterns, and optimize delivery schedules?

Q11.
What is the role of a data scientist? What are the prerequisites to become a data
scientist?

Q12.
Consider a fraud detection system in the banking sector that needs to implement
a set of proactive measures to detect and avoid fraudulent activities and
financial losses. Illustrate the different steps involved in the data science
process with a neat sketch with respect to the above scenario.

Q13.
Consider a logistics industry management system. Identify the need for data
science in such a system to enhance business and management. Also describe in
detail the uses of data science in a logistics industry automation system.

Q14.
What are the prerequisites for data scientists? List the skills and tools
required to perform data analytics.

Q15.
At a sweet factory there are machines that put sweets in the packets. The
company aims for 100 sweets in every packet. The factory owner sees that his
current machine puts a mean of 99 sweets in each packet with a standard
deviation of 0.5. He is considering buying another machine which puts a mean
of 100 sweets in each packet with a standard deviation of 3.5. Should he buy
the new machine? Why or why not?
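One way to reason about Q15 is to compare how often each machine lands close to the 100-sweet target. The sketch below assumes packet counts are roughly normally distributed and uses a hypothetical acceptance band of 99 to 101 sweets; neither assumption comes from the question itself.

```python
from statistics import NormalDist

TARGET = 100
TOLERANCE = 1  # hypothetical band: 99 to 101 sweets counts as "on target"

# Means and standard deviations as stated in the question.
current_machine = NormalDist(mu=99, sigma=0.5)
new_machine = NormalDist(mu=100, sigma=3.5)


def share_within_band(machine):
    """Fraction of packets expected to fall inside the acceptance band."""
    return machine.cdf(TARGET + TOLERANCE) - machine.cdf(TARGET - TOLERANCE)


current_share = share_within_band(current_machine)
new_share = share_within_band(new_machine)

print(f"current machine: {current_share:.3f} of packets within the band")
print(f"new machine:     {new_share:.3f} of packets within the band")
```

Under these assumptions the comparison turns on spread rather than mean alone: the new machine is centred on the target but far less consistent, while the current machine is consistent but slightly off target.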

Q16.
Briefly describe the Data Science process and explain each step with a
suitable example.

Q17.
List and discuss each of the following:
(a) Components of data science
(b) Tools and skills needed

Q18.
Differentiate between data science and business intelligence on various factors.

Q19.
Consider a company database for a retail chain that contains the following
tables:
Sales : sale_id, store_id, product_id, quantity_sold, sale_date, sale_amount
Stores : store_id, store_name, location
Products: product_id, product_name, price

Apply SQL Window Functions to solve the given queries to analyze the profit
performance across the stores.
1. Find the total quantity sold for each product across all sales and rank the
products based on the total quantity sold in descending order. Include
product_name, total_quantity, and rank.
2. Calculate the cumulative sales amount (sale_amount) for each product over
time (ordered by sale_date). Include product_name, sale_date, sale_amount,
and cumulative_sales.
3. For each product, calculate the difference in quantity sold between the
current sale and the previous sale (based on sale_date) and display
product_name, sale_date, quantity_sold, and quantity_difference.
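The three window-function patterns asked for in Q19 (ranking, running total, previous-row difference) can be tried out with Python's built-in sqlite3 module. This is only a sketch: the sample rows are hypothetical, the Stores table is omitted because none of the three queries needs it, the rank column is aliased as sales_rank to avoid clashing with the RANK keyword, and it assumes the bundled SQLite is version 3.25 or newer (window function support).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (product_id INTEGER PRIMARY KEY, product_name TEXT, price REAL);
CREATE TABLE Sales (sale_id INTEGER PRIMARY KEY, store_id INTEGER, product_id INTEGER,
                    quantity_sold INTEGER, sale_date TEXT, sale_amount REAL);
-- Hypothetical sample rows, for illustration only.
INSERT INTO Products VALUES (1, 'Pen', 1.0), (2, 'Book', 5.0);
INSERT INTO Sales VALUES
  (1, 1, 1, 10, '2024-01-01', 10.0),
  (2, 1, 2,  3, '2024-01-02', 15.0),
  (3, 2, 1,  5, '2024-01-03',  5.0),
  (4, 2, 2,  4, '2024-01-04', 20.0);
""")

# 1. Total quantity per product, ranked in descending order.
ranking = conn.execute("""
WITH totals AS (
  SELECT p.product_name, SUM(s.quantity_sold) AS total_quantity
  FROM Sales s JOIN Products p ON p.product_id = s.product_id
  GROUP BY p.product_id, p.product_name
)
SELECT product_name, total_quantity,
       RANK() OVER (ORDER BY total_quantity DESC) AS sales_rank
FROM totals
ORDER BY sales_rank
""").fetchall()

# 2 and 3. Running total of sale_amount and difference from the previous sale,
# both computed per product in sale_date order.
running = conn.execute("""
SELECT p.product_name, s.sale_date, s.sale_amount,
       SUM(s.sale_amount) OVER (PARTITION BY s.product_id
                                ORDER BY s.sale_date) AS cumulative_sales,
       s.quantity_sold,
       s.quantity_sold - LAG(s.quantity_sold) OVER (PARTITION BY s.product_id
                                                    ORDER BY s.sale_date)
         AS quantity_difference
FROM Sales s JOIN Products p ON p.product_id = s.product_id
ORDER BY p.product_name, s.sale_date
""").fetchall()

for row in ranking + running:
    print(row)
```

The first sale of each product has no previous row, so LAG returns NULL and quantity_difference is None for that row.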

Q20.
The e-commerce platform has collected data from customers, their orders, and
the products they purchase. The goal of this case study is to demonstrate how
we can use SQL to transform and analyze this data to derive useful insights,
such as identifying high-value customers and tracking popular products.
Perform SQL functions for the following objectives:
1. Identify the total spending of each customer.
2. Find high-value customers who have spent more than $500.
3. Determine the most purchased product in the last 30 days.
4. List the users who have not placed any orders.

Q21.
Explain the use of the following terminologies using suitable SQL queries:
a. % and _
b. IN and BETWEEN
c. RANK() and DENSE_RANK()

Q22.
Differentiate between RDBMS and NoSQL. Discuss the different types of
NoSQL databases showcasing their application process using suitable examples
from each type.

Q23.
You are a data analyst for a movie streaming platform. The platform has the
following tables:
Viewership(view_id, movie_id, user_id, view_date, view_duration)
Movies(movie_id, movie_title, genre, release_year)
Users(user_id, user_name, subscription_type) (subscription_type can be: Basic,
Standard, Premium).
Write SQL queries to answer the following using window functions:
i) List the top watched movies by genre:
Find the total viewing duration for each movie and rank them within
their genre based on viewing duration (highest to lowest). Include
movie_title, genre, total_duration, and rank.
ii) Find the cumulative viewing time for users (5 Marks):
Calculate the cumulative viewing time for each user over time,
ordered by view_date. Include user_name, view_date, view_duration,
and cumulative_duration.

Q24.
Consider the above dataset:
i) Display 'low' for students with a cgpa less than 4, 'medium' for students
with a cgpa between 4 and 7.5, and 'high' for other students.
ii) Display the first available emergency contact for the students from
dad, mom, or guardian numbers.
iii) List the different courses available for the students.
iv) Display 'Yes' along with the student's name if the student belongs to
CSE, 'No' otherwise.

Q25.
Considering the data is stored in a database, find the following
i) Find the range of the salary of the employees
ii) Find the employee with the most experience in the company
iii) List all the high-ranking employees with salaries greater than 10000
iv) List the representatives of various departments.
v) Find the average salary of a clerk.

Q26.
The database consists of four tables with attributes of each table as given
below. Write the SQL query for the given questions.
author (id, name, birth_year, death_year)
book (id, author_id, title, publish_year, publishing_house, rating )
adaptation (book_id, type (e.g. movie, game, play, musical), title,
release_year, rating)
book_review (book_id, review, author)
a. Show the name of each author together with the title of the book they wrote
and the year in which that book was published.
b. Show the name of each author together with the title of the book they wrote
and the year in which that book was published. Show only books published
after 2005
c. For each book, show its title, adaptation title, adaptation year, and
publication year.
d. Show the title of each book together with the title of its adaptation and the
date of the release. Show all books, regardless of whether they had adaptations.
e. Show all books with their movie adaptations. Select each book's title, the
name of its publishing house, the title of its adaptation, and the type of the
adaptation. Keep the books with no adaptations in the result.

Q27.
You are provided with the following database schema for a retail store.
Employees(employee_id,first_name,last_name,department,salary,
hire_date)
Sales(sale_id, employee_id, sale_amount, sale_date)
Develop SQL query for the following:
i) For each employee, calculate their total sales amount and rank them within
their department based on this total sales amount. Provide employee_id,
first_name, last_name, department, total_sales, and sales_rank.
ii) Display each employee's salary along with the salary of the previously
hired employee in the same department. Include employee_id,
first_name, last_name, department, salary, and previous_employee_salary.
iii) Divide all employees into three groups based on their salaries across the
entire company. List employee_id, first_name, last_name, salary, and
salary_group.
iv) For each sale, display the sale amount and the average sale amount for that
day. Include sale_id, sale_date, sale_amount, and daily_average_sale.
v) For each employee, show their salary and the difference between their salary
and the average salary of all employees hired before them. Include
employee_id, first_name, last_name, salary, and salary_difference.

Q28.
Given the following four scenarios, Identify the most appropriate NoSQL
database type for each case and justify your choice:
i) A social media platform needs to store user profiles, including varying data
such as usernames, profile pictures, and posts. Each user can have different
attributes.
ii) A real-time analytics application requires storing and quickly retrieving
large volumes of user activity data (clicks, views) as key-value pairs for quick
lookup.
iii) An online retail platform wants to manage a vast inventory of products,
including product descriptions, prices, and customer reviews, where the
structure of product data can vary significantly.
iv) A transportation app needs to analyze relationships between cities, routes,
and vehicles, focusing on how they are interconnected.

Q29.
Given the following SQL table schema for an e-commerce application:
· Product(product_Id, product_name, category, price)
· Review(review_Id, product_Id, user_id, rating, comment)
Convert the SQL schema into a suitable NoSQL data model, justifying your
choice of NoSQL database type. Discuss how you would structure the data to
accommodate product details and user reviews while ensuring efficient
querying and retrieval.
After defining your NoSQL schema, populate the database with sample data for
at least two products and their associated reviews. Based on this NoSQL
model, write a query to retrieve the product details for Product ID 1, along with
only those reviews where the rating is greater than 4.

Q30.
Consider the following relations:
Plant(plant_id, plant_name, height_cm, growth_stage)
Farm(farm_id, farm_name, region, size_acres)
Farm_Plant(farm_id, plant_id, planted_date)
a) Write a query to find pairs of plants where the second plant is taller than the
first plant but they are in the same growth stage.
b) Write a query to find all farms where the length of the farm name is greater
than 10 characters, and display the farm name along with the region, with both
values concatenated into one column.
c) Write a query to find the names of plants and the farms where they were
planted before the year 2020.
d) Write a query to rank the plants based on their height in descending order,
assigning the same rank for plants with the same height.
e) Write a query to find the farms where the average height of the plants
planted is greater than 150 cm.

Q31.
Discuss the rise of NoSQL databases and their significance in modern data
management systems. Explain the key features, advantages, and use cases of
NoSQL databases, along with potential challenges and limitations.

Q32.
Consider the following data.
The sales_data table has the following columns:
transaction_id: Unique identifier for each transaction.
transaction_date: Date of the transaction.
product_id: Identifier for the product sold.
quantity_sold: Quantity of the product sold in each transaction.
unit_price: Price of each unit of the product.
(a) Write a SQL query to calculate the total sales amount for each product,
ordering the results by total sales amount in descending order.
(b) Write a SQL query to determine the percentile rank of sales amount for
each transaction, displaying the transaction_id, transaction_date,
sales_amount, and the percentile rank of sales amount.
(c) Write a SQL query to identify the next transaction date for each transaction,
displaying the transaction_id, transaction_date, and the
next transaction date.
(d) Write a SQL query to calculate the difference in quantity sold compared to
the previous transaction, partitioned by product and ordered by transaction
date.
(e) Write a SQL query to rank products based on their total revenue generated,
considering the top 3 products.

Q33.
Explain the following SQL window functions with an example.
i) PERCENT_RANK()
ii) DENSE_RANK ()
iii) NTILE()
iv) AVG ()
v) SUM ()

Q34.
Consider the following table and identify the output of following queries.
(i) select name, time, row_number() over (order by time), rank() over (order by
time), dense_rank() over (order by time) FROM runners order by time;
(ii) select name, time, percent_rank() over (order by time), cume_dist() over
(order by time) FROM runners order by time;

Q35.
Consider the following tables and write SQL Commands to perform inner join,
left join and full outer join. Also give the output.

Q36.
Consider the following table and write SQL queries to perform the following:
(i) Display the employee details whose name starts with either a,c,d,s
(ii) Display the employee details who is getting highest salary in the
department "CSE"
(iii) Display all the employees whose city contains the pattern "on"
(iv) Display all the employees whose name starts with "s" and are at least 5
characters in length.

Q37.
Consider the following table. Write SQL commands to find the mean, median
and mode of gasoline cost and also write the mean, median and mode of
gasoline cost.

Q38.
Do as directed using Joins and String Functions:
a. Write a SQL query to list the full name of each employee (concatenated first
and last names with a space in between), their department name, and salary.
b. Write a SQL query to display each employee's last name (in uppercase), the
first three characters of their department name, and their salary. Include all
employees, even if they don't belong to a department.
c. Write a SQL query to display each employee's first name (with 'a' replaced
by 'x'), the department name, and salary. Include all employees and
departments.
d. Write a SQL query to calculate the total salary paid to employees in each
department. Return the department name and the total salary.
e. Write a SQL query to find the median of the salaries of the employees.

Q39.
Do as Directed using Window Functions:
a. Write a SQL query to rank each sale within the context of each employee
based on the sale amount in descending order. Return employee_id,
sale_amount, and rank.
b. Write a SQL query to display each sale amount for an employee along with
their previous sale amount (if any). Use a subquery to achieve this and return
the following columns: employee_id, sale_date, sale_amount, and
previous_sale_amount.
c. Write a SQL query to divide the sales into four groups (quartiles) within
each employee based on sale date. Return employee_id, sale_date,
sale_amount, and quartile.
d. Write a SQL query to rank employees based on their total sales. Return the
employee ID, total sales, and their rank.
e. Write a SQL query to calculate the cumulative sum of sales for each
employee, ordered by sale date. Return the employee ID, sale date, sale
amount, and cumulative sum.

Q40.
Write a NoSQL example for the following:
i. Inventory database (Document DB)
ii. Facebook user database (Graph DB)
iii. Employee database with his interview content (Column family DB)
iv. Student details (Key Value DB)

Q41.
When is a NoSQL database preferred over a SQL database?

Q42.
Following are tables in the database. Write the SQL query for the questions
below.
emp (eno, ename, bdate, title, salary, dno)
proj (pno, pname, budget, dno)
dept (dno, dname, mgreno)
workson (eno, pno, resp, hours)

i) Write an SQL query that returns all workson records where hours worked is
less than 10 and the responsibility is 'Manager'.
ii) Write an SQL query that returns the project name, hours worked, and project
number for all workson records where hours > 10.
iii) Write an SQL query that returns the employee name, department name, and
employee title.
iv) Write an SQL query that returns the employee numbers and salaries of all
employees in the 'Consulting' department ordered by descending salary.
v) Write an SQL query that returns the employee name, project name,
employee title, and hours for all workson records.

Q43.
Use the following database schemas for writing the required queries using
window functions:
Zoo (ZooID, ZName, No_of_animals, Ticket_Price, City, State)
Marks (RegNo, Standard, Marks) – here, the values for standard can be 6 to 12.
a) Write a query to display the regno, standard and average of the marks in
each standard.
b) Write a query to display the percentile rank of each student along with their
register number.
c) Write a query to calculate and display the regno, standard and marks of each
student along with the absolute difference between each student's mark and the
average of marks of that standard.
d) Write a query to display the student regno, standard, marks and their rank
(based on the marks) in the respective standard without gaps.

Q44.
Discuss the basic difference between SQL and NoSQL database transactions.
Why do we prefer NoSQL databases? Discuss in detail along with different
types of NoSQL databases

Q45.
Suppose that you are a data scientist taking up a project for ABx company.
The company wants to improve its sales. As a data scientist, how would you
approach this project, and what processes are involved? Discuss in detail.
Also, identify the process that takes the most time in your project and
justify your answer.

Q46.
When do you call an application a Big Data application?

Q47.
Explain in detail about the Data Preparation in Big Data Analytics life cycle.

Q48.
Explain in detail about the Model Building phase of Data Analytics Life Cycle
and also discuss the various tools used in this phase

Q49.
Identify the different phases of the data analytics lifecycle and illustrate
the key activities that need to be performed in each phase with a neat sketch.

Q50.
Define big data. Explore the significance of the 5 V's in understanding and
analyzing big data.

Q51.
Explain how big data analytics can be used to solve real-life problems and give
any five examples of big data analytics.

Q52.
Describe common tools used in the model-building phase for statistical
analysis and data mining.

Q53.
A government agency is seeking to harness the power of big data to improve
urban planning and address traffic congestion in a rapidly growing city. How
can big data be utilized in this context, and what characteristics of big data are
particularly relevant?

Q54.
Imagine a scenario where a healthcare organization aims to improve patient
outcomes and reduce costs through data analytics. Discuss how the data
analytics life cycle can be applied in this scenario, outlining the stages involved
and their significance. Provide examples and explanations for each stage.

Q55.
You are a senior data scientist at a company facing frequent deadlock issues in
its distributed cloud systems due to high resource contention. The current
Banker's Algorithm is no longer sufficient for managing complex and dynamic
resource allocation patterns. Your task is to develop an AI-driven solution using
Generative AI to predict and resolve deadlocks in real time, improving system
performance. Using the Data Analytics Lifecycle, outline how you would
approach this problem. Consider which data to gather, and plan a model design
leveraging techniques like reinforcement learning or GANs to detect and
resolve deadlocks. Finally, explain how you would validate and present the
effectiveness of the model to the technical team while ensuring seamless
integration into the existing system.

Q56.
Explain in detail about Data Analytics life cycle with an example for every
stage.

Q57.
Using VSM, find the cosine similarity between documents & query for the
following. Arrange the documents as per the similarity ranking.
Document 1: “The quick brown fox jumps over the lazy dog.”
Document 2: “A brown dog chased the fox.”
Document 3: “The dog is lazy.”
Query: “brown dog”
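A minimal sketch of the Q57 calculation follows. It assumes raw term counts as vector weights (no TF-IDF), lower-cased tokens, and no stop-word removal; a different weighting scheme would give different scores.

```python
import math
import re
from collections import Counter

documents = {
    "Document 1": "The quick brown fox jumps over the lazy dog.",
    "Document 2": "A brown dog chased the fox.",
    "Document 3": "The dog is lazy.",
}
query = "brown dog"


def term_vector(text):
    """Bag-of-words vector: lower-cased alphabetic tokens with raw counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query_vector = term_vector(query)
scores = {name: cosine(term_vector(text), query_vector)
          for name, text in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.4f}")
```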

Q58.
Given the frequency table, predict the type of vegetable (ivy gourd, cucumber,
or other) with the properties {Green, long, Seed} using the Naïve Bayes
classifier.

Q59.
a) For the given data, compute two clusters using K-means algorithm where
initial cluster centers are (1.0, 1.0) and (5.0, 7.0). Execute for two iterations.
b) For the same data, perform single link hierarchical clustering. Show your
results by drawing a dendrogram. The dendrogram should clearly show the order
in which the points are merged.
c) Using the same data, perform DBSCAN and identify the core points and noisy
points with eps=0.5 and min.points=2.

Q60.
Given a query and a set of documents, apply the Vector Space Model using the
Term Frequency - Inverse Document Frequency (TF-IDF) method to order the
documents based on cosine similarity. (Note: Calculate TF-IDF for the
keywords: Machine, Learning, fascinating, Deep, subset, NLP, application)
Document 1: Machine Learning is fascinating
Document 2: Deep Learning is a subset of Machine Learning
Document 3: NLP is a application of Deep Learning
Query: Machine application

Q61.
If epsilon (ε) is 2 and the minimum number of points in a cluster (minpoint) is
2, what clusters would DBSCAN discover with the following 8 points?
A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2),
A8=(4,9). Draw the 10 by 10 grid, plot the above points, and illustrate the
discovered clusters. What happens if epsilon is increased to 10? Calculate the
clusters for the new epsilon value.
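The Q61 exercise can be checked with a small pure-Python DBSCAN. This sketch assumes Euclidean distance and that a point counts as part of its own neighbourhood when comparing against minpoint; textbooks differ on that convention, and the choice changes which points are core.

```python
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}
NOISE = -1


def region_query(name, eps):
    """Names of all points within eps of the given point (itself included)."""
    px, py = points[name]
    return [other for other, (qx, qy) in points.items()
            if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]


def dbscan(eps, min_pts):
    """Return {point name: cluster id}, with NOISE for unclustered points."""
    labels = {}
    cluster_id = 0
    for name in points:
        if name in labels:
            continue
        neighbours = region_query(name, eps)
        if len(neighbours) < min_pts:
            labels[name] = NOISE  # may later be claimed as a border point
            continue
        cluster_id += 1
        labels[name] = cluster_id
        queue = [n for n in neighbours if n != name]
        while queue:
            current = queue.pop()
            if labels.get(current) == NOISE:
                labels[current] = cluster_id  # border point joins the cluster
            if current in labels:
                continue
            labels[current] = cluster_id
            current_neighbours = region_query(current, eps)
            if len(current_neighbours) >= min_pts:  # core point: keep expanding
                queue.extend(current_neighbours)
    return labels


labels_eps_2 = dbscan(eps=2, min_pts=2)
labels_eps_10 = dbscan(eps=10, min_pts=2)
print(labels_eps_2)
print(labels_eps_10)
```

With epsilon = 2 the sketch yields two clusters and three noise points; with epsilon = 10 every point is within reach of every other, so a single cluster results.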

Q62.
The following dataset contains IPL match details for Chennai Super Kings.
Using the Naive Bayes classification algorithm, classify whether the following
instance would result in a Win or Loss: (Home, Deccan Chargers, bat, MS
Dhoni).

Q63.
Consider the following two dialogues from Harry Potter characters:
· Dialogue 1 (Prof. McGonagall): "Harry is brave, talented, and has great
magical potential."
· Dialogue 2 (Hermione): "Harry is brave, skilled, and possesses great magical
potential."
Convert the dialogues into vectors, then calculate both the Euclidean distance
and the Jaccard similarity. Based on the results, interpret which metric better
reflects the similarity between the descriptions. Suggest the best data
structure(s) in python for efficiently calculating both metrics.
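As a sketch of how Q63 might be computed, the code below uses a Python set per dialogue for Jaccard similarity and Counter-based count vectors over a shared vocabulary for Euclidean distance. The tokenisation (lower-casing, alphabetic tokens only, no stop-word removal) is an assumption, and math.dist requires Python 3.8 or newer.

```python
import math
import re
from collections import Counter

dialogue_1 = "Harry is brave, talented, and has great magical potential."
dialogue_2 = "Harry is brave, skilled, and possesses great magical potential."


def tokens(text):
    """Lower-cased alphabetic tokens; punctuation is dropped."""
    return re.findall(r"[a-z]+", text.lower())


# Jaccard similarity works directly on sets of distinct words.
set_1, set_2 = set(tokens(dialogue_1)), set(tokens(dialogue_2))
jaccard = len(set_1 & set_2) / len(set_1 | set_2)

# Euclidean distance needs aligned numeric vectors over a shared vocabulary.
vocabulary = sorted(set_1 | set_2)
counts_1, counts_2 = Counter(tokens(dialogue_1)), Counter(tokens(dialogue_2))
vector_1 = [counts_1[term] for term in vocabulary]
vector_2 = [counts_2[term] for term in vocabulary]
euclidean = math.dist(vector_1, vector_2)

print(f"Jaccard similarity: {jaccard:.3f}")
print(f"Euclidean distance: {euclidean:.3f}")
```

Sets make the intersection and union needed for Jaccard cheap, while Counter gives the per-term counts needed to build the vectors.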

Q64.
Now, Professor McGonagall, with her keen eye for detail, wants to refine the
following text. Help her by performing text preprocessing on the following
sentence. This includes:
· Removing stop words
· Removing punctuation
· Performing stemming
After preprocessing, count the total number of words remaining in the
sentence. Also specify the number of stop words and punctuations used.
Consider python's NLTK's corpora.
"@HarryPotter#wizardingworld casting spells quickly, battling fiercely, and
protecting his friends from the dangerous creatures!"

Q65.
Perform lexical analysis on the following sentence. After generating the
lexemes, using the given CFG, parse the sentence "The big cat chased the
mouse" and draw its parse tree, illustrating the breakdown of the sentence into
its syntactic components such as noun phrases (NP), verb phrases (VP), and
other relevant sub-components like Det, Adj, N, and V.
· S -> NP | VP
· NP -> Det N | Det Adj N
· VP -> V NP | V
Are there two or more possible ways to construct the parse trees? If yes, what
does it imply about the given grammar?

Q66.
Consider the following three documents:
· d1: "The new movie is amazing."
· d2: "I love watching new movies."
· d3: "The movie is not new, but it's still good."
Given the query: "new movie", use the vector space model and cosine
similarity to rank the documents based on their relevance to the query.
Calculate the tf-idf values for the terms in the documents and the query, then
compute the cosine similarity scores. Rank the documents d1, d2, and d3 in
order of relevance to the query.

Q67.
Consider a bigram model trained on the following corpus:
Corpus:
"Harry is a wizard."
"Harry is brave."
"Hermione is a witch."
"Hermione is intelligent."
Using the bigram model, calculate the probability of the following sequence of
words and predict the next most likely word after "Hermione is":

Q68.
Vectorize the given set of documents after performing the following.
1. Stop word removal [are, is, for, and, I, it, too, makes, me, an,
a, keeps, the, away ]
2. Stemming
3. Lemmatizing
Documents:
D1: Apples are healthy. Apple is red.
D2: Eating apple is good for gut and I love apple.
D3: Red apple is tasty. I love red apples and it is healthy too.
D4: Eating apples makes me happy.
For the above vectorized document, apply Single Linkage Algorithm to group
the documents based on their similarity in score. Also draw the dendrogram.

Q69.
A Q&A website, similar to Stack Overflow, needs to identify duplicate
questions to avoid clutter. They use Euclidean distance, Jaccard and Cosine
similarity measures to compare the questions.
Q1: Three users post the following questions:
Question 1: "How can I sort an array in Python?"
Question 2: "What is the easiest way to sort a list in Python?"
Question 3: "Code to sort list of items in python"
Which one of the above measures is the correct representation of similarity?
Justify.

Q70.
Cluster the following seven points, with (x, y) representing location, into
three clusters:
A1(2,10), A2(2,5), A3(8,4), A4(5,8), A5(6,4), A6(1,2), A7(4,9)
Initial cluster centres are A1(2,10), A4(5,8), and A6(1,2). Use the K-Means
algorithm to find the three cluster centres after the second iteration. Also
plot the final clusters.
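The two K-Means iterations in Q70 can be traced with the sketch below. It assumes Euclidean distance (compared via squared distances) and keeps a centre unchanged if its cluster ever becomes empty; plotting is left out.

```python
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (6, 4), "A6": (1, 2), "A7": (4, 9),
}
# Initial centres as given in the question: A1, A4 and A6.
centres = [points["A1"], points["A4"], points["A6"]]


def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def assign(current_centres):
    """Assign every point to the index of its nearest centre."""
    clusters = [[] for _ in current_centres]
    for name, point in points.items():
        nearest = min(range(len(current_centres)),
                      key=lambda i: squared_distance(point, current_centres[i]))
        clusters[nearest].append(name)
    return clusters


def update(clusters, old_centres):
    """Move each centre to the mean of its members (unchanged if empty)."""
    new_centres = []
    for members, old in zip(clusters, old_centres):
        if not members:
            new_centres.append(old)
            continue
        xs = [points[m][0] for m in members]
        ys = [points[m][1] for m in members]
        new_centres.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return new_centres


for iteration in range(1, 3):
    clusters = assign(centres)
    centres = update(clusters, centres)
    print(f"iteration {iteration}: clusters={clusters} centres={centres}")
```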
Q71.
A bank wants to develop an NLP-based chatbot that can handle basic customer
queries, such as checking account balances, opening new accounts, and
addressing common issues like transaction failures.
Question 1: The bank's chatbot struggles with understanding ambiguous or
complex queries. For instance, when a customer says, "I want to check my
balance and also transfer funds to another account," the chatbot only answers
one part of the query. Why does this happen, and how can intent recognition
and dialogue management be improved to handle multi-intent queries?
Question 2: The bank also wants to extend the chatbot's capabilities to handle
complaints and identify when a customer is frustrated.
How can sentiment analysis be incorporated into the chatbot to
recognize customer frustration and escalate the issue to a human
representative when needed?

Q72.
Calculate the probability of the new data point belonging to each
class. Use a naive Bayes classification for the following dataset. Find the class
label (Species) for the test sample: (Green, 2, Tall, No,?)
What are the challenges in applying Naïve Bayes classifier? List
the ways to overcome those challenges.

Q73.
Consider the following toy example: Training data:
<s>I am Sam</s>
<s>Sam I am </s>
<s>Sam I like</s>
<s>Sam I do like</s>
<s>do I like Sam</s>
Assume that we use a bigram language model based on the above
training data. What is the most probable next word predicted by the model for
the following word sequences?
(1) <s>Sam ...
(2) <s>Sam I do ...
(3) <s>Sam I am Sam ...
(4) <s>do I like ...
Which of the following sentences is better, i.e., gets a higher
probability with this model?
(5) <s>Sam I do I like </s>
(6)<s>I do like Sam I am </s>
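A sketch of the Q73 bigram model follows, using unsmoothed maximum-likelihood estimates and exact fractions. The training sentences are written with spaces around the <s> and </s> markers so they can be split on whitespace; when several next words tie, all of them are returned.

```python
from collections import Counter, defaultdict
from fractions import Fraction

training = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> Sam I like </s>",
    "<s> Sam I do like </s>",
    "<s> do I like Sam </s>",
]

# bigram_counts[previous_word][next_word] = number of occurrences
bigram_counts = defaultdict(Counter)
for sentence in training:
    words = sentence.split()
    for previous, following in zip(words, words[1:]):
        bigram_counts[previous][following] += 1


def probability(previous, following):
    """P(following | previous) as an unsmoothed maximum-likelihood estimate."""
    total = sum(bigram_counts[previous].values())
    return Fraction(bigram_counts[previous][following], total) if total else Fraction(0)


def most_probable_next(previous):
    """All words tied for the highest count after `previous`, sorted."""
    counts = bigram_counts[previous]
    best = max(counts.values())
    return sorted(word for word, count in counts.items() if count == best)


def sentence_probability(sentence):
    """Product of bigram probabilities over the whole sentence."""
    words = sentence.split()
    result = Fraction(1)
    for previous, following in zip(words, words[1:]):
        result *= probability(previous, following)
    return result


# A bigram model only looks at the last word of each prefix.
for last_word in ["Sam", "do", "Sam", "like"]:
    print(last_word, "->", most_probable_next(last_word))

p5 = sentence_probability("<s> Sam I do I like </s>")
p6 = sentence_probability("<s> I do like Sam I am </s>")
print("sentence (5):", p5, " sentence (6):", p6)
```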

Q74.
A company wants to monitor its brand reputation by analysing customer
feedback and social media posts. They decide to use sentiment analysis to
understand whether people are speaking positively, negatively, or neutrally
about their brand.
Q 1: The sentiment analysis model starts by identifying the grammatical
structure of sentences. How would POS tagging help in identifying the key
parts of a sentence (like adjectives and verbs) that carry sentiment (e.g., "The
product is amazing")?
Q 2: The company notices that their sentiment analysis model fails
to detect sarcastic remarks like "Oh, great! Another broken product!" How
would a more sophisticated sentiment analysis model handle such cases? What
NLP components can help detect sarcasm or irony in social media comments?

Q75.
Apply Single linkage and complete linkage for the data given below and draw
the dendrogram

Q76.
Apply bigram model to find the probability of the given sentence.
Corpus:
<s> I am from Vellore </s>
<s> I am a teacher </s>
<s> students are good and are from various cities</s>
<s> students from Vellore do engineering</s>
Sentence:
<s> students are from Vellore </s>

Q77.
Discuss the major components of Natural Language Processing (NLP) in detail.

Q78.
Consider a document collection consisting of five documents.
Doc1= English tutorial and fast track
Doc2 = learning latent semantic indexing
Doc3 = Book on semantic indexing
Doc4 = Advance in structure and semantic indexing
Doc5 = Analysis of latent structures
Query 1: Advance and structure and not analysis
Query 2: semantic and indexing and not latent.
Use a Boolean model to retrieve documents similar to the given queries.
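Boolean retrieval for Q78 maps naturally onto set operations over an inverted index. The sketch below assumes case-insensitive exact term matching with no stemming, so "structures" in Doc5 is not treated as "structure".

```python
documents = {
    "Doc1": "English tutorial and fast track",
    "Doc2": "learning latent semantic indexing",
    "Doc3": "Book on semantic indexing",
    "Doc4": "Advance in structure and semantic indexing",
    "Doc5": "Analysis of latent structures",
}

# Inverted index: term -> set of documents containing it.
index = {}
for name, text in documents.items():
    for term in text.lower().split():
        index.setdefault(term, set()).add(name)


def docs_with(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term.lower(), set())


# AND is set intersection; AND NOT is set difference.
query_1 = (docs_with("advance") & docs_with("structure")) - docs_with("analysis")
query_2 = (docs_with("semantic") & docs_with("indexing")) - docs_with("latent")

print("Query 1:", sorted(query_1))
print("Query 2:", sorted(query_2))
```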

Q79.
Apply Agglomerative clustering using the Complete Linkage method for a
given distance matrix and draw a dendrogram.

Q80.
Consider the following dataset. Apply the Naive Bayes classification algorithm
to predict the class label (Sleep problems) of the given samples. Test sample
(Bedtime = Normal, Walking = Normal, Mood during daytime = Normal).

Q81.
Consider a document collection consisting of four documents (D1, D2, D3, and
D4).
D1: "apple banana cherry"
D2: "apple banana"
D3: "banana cherry"
D4: "apple cherry"
Query 1: (apple ∧ banana ∧ ¬cherry)
Query 2: (apple ∧ banana) ∨ (apple ∧ cherry)
Query 3: (apple ∨ banana) ∧ (apple ∨ cherry) ∧ (¬banana ∨ ¬cherry)
Query 4: (apple ∨ banana ∨ cherry) ∧ (apple ∨ ¬banana ∨ ¬cherry) ∧ (¬apple
∨ banana ∨ ¬cherry) ∧ (¬apple ∨ ¬banana ∨ cherry)
Using the Boolean model, retrieve the documents that are similar to the
given queries.

Q82.
Discuss in detail the stages of Natural Language Processing (NLP) and apply
the first three stages of NLP to the following sentences.
i) "Independence Day is one of the important festivals for every Indian citizen."
ii) "It is celebrated on the 15th of August each year ever since India got
Independence from the British rule."
iii) "This day celebrates independence in the true sense."
Q83.
Design a Hidden Markov Model (HMM) tagger for the following sentence
using the Viterbi approximation: The Doctor is in

Q84.
Given the following short movie reviews, each labeled with a genre, either
comedy or action:
a. (fun, couple, love, love) Target: comedy
b. (fast, furious, shoot) Target: action
c. (couple, fly, fast, fun, fun) Target: comedy
d. (furious, shoot, shoot, fun) Target: action
e. (fly, fast, shoot, love) Target: action
New document D: (fast, couple, shoot, fly)
Compute the most likely class for document D. Apply naive Bayes classifier
for classification and also use Laplacian smoothing for the likelihoods.
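
The classification can be sketched as a multinomial naive Bayes with add-one (Laplacian) smoothing. This is a minimal illustration, assuming the vocabulary is the set of distinct words in the five training reviews and that priors are the class frequencies; the function name posterior_score is hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Training reviews and labels copied from the question.
training = [
    (["fun", "couple", "love", "love"], "comedy"),
    (["fast", "furious", "shoot"], "action"),
    (["couple", "fly", "fast", "fun", "fun"], "comedy"),
    (["furious", "shoot", "shoot", "fun"], "action"),
    (["fly", "fast", "shoot", "love"], "action"),
]
new_doc = ["fast", "couple", "shoot", "fly"]

class_docs = Counter(label for _, label in training)
word_counts = {label: Counter() for label in class_docs}
for words, label in training:
    word_counts[label].update(words)
vocabulary = {word for words, _ in training for word in words}


def posterior_score(words, label):
    """Prior times smoothed likelihoods (unnormalized posterior)."""
    score = Fraction(class_docs[label], len(training))
    total_words = sum(word_counts[label].values())
    for word in words:
        # Laplacian smoothing: add 1 to the count and |V| to the denominator.
        score *= Fraction(word_counts[label][word] + 1,
                          total_words + len(vocabulary))
    return score


scores = {label: posterior_score(new_doc, label) for label in class_docs}
predicted = max(scores, key=scores.get)
print({label: float(s) for label, s in scores.items()}, predicted)
```

The scores are unnormalized, which is enough for picking the most likely class; dividing each by their sum would give proper posterior probabilities.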

Q85.
Explain every stage of NLP with suitable examples.
Q86.
Apply the Bigram model to predict the next word in the given sentence "He
went to the ?". Let's say in our corpus:
"to the" occurred 10,000 times
"to the store" occurred 3000 times
"to the park" occurred 2000 times
"to the beach" occurred 1500 times
"to the office" occurred 3500 times
Find the most suitable word based on the conditional probability.
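
The conditional probabilities can be sketched directly from the given counts. Note that the counts condition on the two-word context "to the", so each estimate is count("to the" w) / count("to the"); the variable names are illustrative only.

```python
# Counts copied from the question.
context_count = 10_000  # occurrences of "to the"
continuation_counts = {
    "store": 3000,
    "park": 2000,
    "beach": 1500,
    "office": 3500,
}

# P(word | "to the") = count("to the" word) / count("to the")
probabilities = {
    word: count / context_count
    for word, count in continuation_counts.items()
}
best_word = max(probabilities, key=probabilities.get)

for word, p in sorted(probabilities.items(), key=lambda item: -item[1]):
    print(f"P({word} | to the) = {p:.2f}")
print("Predicted next word:", best_word)
```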

Q87.
Write the python code snippet for the following:
a) Read the following real estate dataset and display the statistical information
about the given dataset
b) Display the values of ST_NAME from the above given real estate dataset.
c) For the above real estate dataset, replace the missing value in ST_NUM
column by 125.
d) For the above real estate dataset, replace the NUM_BEDROOM by median
value of that column.
e) For the above real estate dataset, convert the categorical attribute ST_NAME
into numerical value.
f) Replace the value 12 with 'Y' for the OWN_OCCUPIED column
g) Count the number of each bedroom category
Q88.
Given the following data, Compute the eigen value and eigen vector based on
PCA Algorithm.

Q89.
Write Python source code to perform the following operations on a dataset.
Consider the following dataset named credit.csv.
(a) Identify the number of missing values in each column
(b) Print the columns and their values which has no missing values
(c) Generate Box plot to find outliers in the column C2
(d) Display the first five rows of each column
(e) Replace missing values With Median

Q90.
Consider the two-dimensional values x1=(4,7,12,7) and x2=(8,11,5,14). The
eigen values for this pattern are λ1 = 6.675 and λ2 = 19.325; the transposed
eigen vectors are e1t=[-0.0811 -0.585] and e2t=[0.585 -0.0811] respectively,
where t denotes transpose. Compute the principal components PC1 and PC2 using
the Principal Component Algorithm. Plot the original points in a 6 x 6 grid, and
draw the calculated PC1 axis and PC2 axis. Plot the original values and the
reduced dimensional values in the new axis.

Q91.
Given a dataset hogwarts_staff.csv with columns staff_id, house, salary, and
hire_date, write a Python function to calculate the median salary for each house
(e.g., Gryffindor, Slytherin, Ravenclaw, Hufflepuff) and plot the distribution of
salaries within each house using a box plot. Provide the Python code to solve
this.

Q92.
You are given the following dataset of 8 points in a 2D space.
A(1,2), B(3,4), C(5,6), D(7,8), E(9,1), F(4,7), G(6,2), H(8,3). Initialize the
centroids at A, D and F. Using the K-Means algorithm for three clusters,
perform clustering and provide the final cluster assignments and centroids after
convergence. Develop an efficient Python program to implement this problem.
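
One way to approach this is a plain-Python K-Means loop, sketched below. It assumes Euclidean distance, ties broken in favour of the lower cluster index, and an empty cluster keeping its previous centroid; the function name kmeans and the max_iter parameter are hypothetical.

```python
from math import dist

# Points and initial centroids (A, D, F) copied from the question.
points = {
    "A": (1, 2), "B": (3, 4), "C": (5, 6), "D": (7, 8),
    "E": (9, 1), "F": (4, 7), "G": (6, 2), "H": (8, 3),
}
initial_centroids = [points["A"], points["D"], points["F"]]


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = list(centroids)
    assignment = {}
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = {
            name: min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for name, p in points.items()
        }
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [points[n] for n, c in assignment.items() if c == k]
            if members:
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment, centroids


assignment, centroids = kmeans(points, initial_centroids)
clusters = [sorted(n for n, c in assignment.items() if c == k) for k in range(3)]
print(clusters)
print(centroids)
```

For larger datasets a vectorized implementation (for example with NumPy) would be the more efficient choice; the pure-Python version is kept here so that each step stays easy to trace by hand.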

Q93.
Find Eigen Values and Eigen Vectors for the given 3*3 matrix.

Q94.
Suppose we have a dataset with the following points in two-dimensional space:
(2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10), (10, 11), (11, 12).
Perform K-means clustering with K=2, where initial cluster centroids are (3, 4)
and (10, 11). Execute K-means clustering algorithm for three iterations.

Q95.
Consider the height (in cm) and weight (in kg) of four soccer players:
c1(169,71), c2(193,92), c3(187,84), c4(175,69)
Compute the eigen vectors and eigen values for the given data using the
Principal Component Analysis (PCA) Algorithm.
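
The PCA steps for this 2D data can be sketched in plain Python. The sketch assumes the sample covariance (dividing by n - 1; dividing by n instead scales the eigen values but not the eigen vectors) and uses the closed-form solution for a symmetric 2 x 2 matrix; the helper name unit_eigenvector is hypothetical.

```python
from math import sqrt

# (height cm, weight kg) for the four players, copied from the question.
players = [(169, 71), (193, 92), (187, 84), (175, 69)]
n = len(players)

mean_h = sum(h for h, _ in players) / n
mean_w = sum(w for _, w in players) / n

# Covariance matrix [[var_h, cov_hw], [cov_hw, var_w]] (assumption: n - 1).
var_h = sum((h - mean_h) ** 2 for h, _ in players) / (n - 1)
var_w = sum((w - mean_w) ** 2 for _, w in players) / (n - 1)
cov_hw = sum((h - mean_h) * (w - mean_w) for h, w in players) / (n - 1)

# Eigen values are the roots of lambda^2 - trace * lambda + det = 0.
trace = var_h + var_w
det = var_h * var_w - cov_hw ** 2
root = sqrt(trace ** 2 - 4 * det)
eigenvalues = [(trace + root) / 2, (trace - root) / 2]


def unit_eigenvector(lam):
    """Unit eigen vector for eigen value lam (valid because cov_hw != 0 here)."""
    x, y = cov_hw, lam - var_h
    norm = sqrt(x * x + y * y)
    return (x / norm, y / norm)


eigenvectors = [unit_eigenvector(lam) for lam in eigenvalues]
for lam, vec in zip(eigenvalues, eigenvectors):
    print(f"eigen value {lam:.3f}, eigen vector ({vec[0]:.4f}, {vec[1]:.4f})")
```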

Q96.
You are given a dataset as a python dictionary representing sales data for a
company, structured as follows:
sales_data = {
'product': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
'region': ['North', 'South', 'East', 'North', 'South', 'East', 'North', 'South', 'East'],
'sales': [100, 150, 200, 120, 160, 210, 130, 170, 220],
'date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02',
'2023-01-02', '2023-01-03', '2023-01-03', '2023-01-03'] }
Using the Pandas library present in python, provide code to perform the
following tasks:
i) Load the data into a pandas DataFrame and display the first 5 rows.
ii) Calculate and display the total sales for each product across all regions.
iii) Pivot the data to show total sales for each product across different regions
and visualize the results using a bar plot.
iv) Calculate a 3-day rolling average for the sales of each product and display
the results.
v) Apply Group By to analyze the sales trends for each region, showing the
total sales per day.

Q97.
Give applications for the following python modules:
i) Sympy
ii) NLTK
iii) Seaborn
iv) Torch
v) pySpark

Q98.
You are given the following Python function segment_string, which attempts to
segment a concatenated string s into valid words from a dictionary D.
However, this implementation is inefficient due to redundant computations and
lacks optimization. Optimize this code to improve its efficiency. Provide the
optimized version of the segment_string function using dynamic programming
techniques.
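
The original segment_string function is not reproduced in this document, so the following is only a sketch of what a dynamic programming version might look like. It assumes the function takes the string s and a collection of words D and returns one valid segmentation as a list of words, or None when no segmentation exists; that return convention is an assumption, not the original's.

```python
def segment_string(s, D):
    """Split s into dictionary words using bottom-up dynamic programming.

    back[i] stores the start index of the last word in some valid
    segmentation of s[:i], or None when s[:i] cannot be segmented.
    """
    words = set(D)  # O(1) membership tests
    max_len = max((len(w) for w in words), default=0)
    back = [None] * (len(s) + 1)
    back[0] = 0  # the empty prefix is trivially segmentable
    for end in range(1, len(s) + 1):
        # Only look back as far as the longest dictionary word.
        for start in range(max(0, end - max_len), end):
            if back[start] is not None and s[start:end] in words:
                back[end] = start
                break
    if back[len(s)] is None:
        return None
    # Walk the back-pointers to recover the words.
    result = []
    end = len(s)
    while end > 0:
        start = back[end]
        result.append(s[start:end])
        end = start
    return result[::-1]


print(segment_string("applepie", {"apple", "pie"}))
print(segment_string("applepiex", {"apple", "pie"}))
```

Each prefix is solved once and reused, which avoids the repeated work a naive recursive search would do on overlapping suffixes.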
Q99.
Implement the Minimum Edit Distance algorithm to compute the minimum
number of edits required to transform one word into another using operations:
insertion, deletion, and substitution (each with a cost of 1). For example, the
minimum edit distance between the words "flaw" and "lawn" is 2 (remove f
and insert n).
i) Show the dynamic programming table filling process step by step for
transforming "sunday" to "saturday", clearly indicating how you compute the
cost at each cell.
ii) Write a Python function using dynamic programming to calculate the
minimum edit distance between any two given words.
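
Part ii) can be sketched with the standard dynamic programming table, where table[i][j] is the cost of turning the first i characters of the source into the first j characters of the target. The function name min_edit_distance is an assumption; all three operations cost 1, as stated in the question.

```python
def min_edit_distance(source, target):
    """Levenshtein distance with unit costs for insert, delete, substitute."""
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]

    # Base cases: transforming to or from the empty string.
    for i in range(rows):
        table[i][0] = i  # delete all i characters
    for j in range(cols):
        table[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,                 # deletion
                table[i][j - 1] + 1,                 # insertion
                table[i - 1][j - 1] + substitution,  # match or substitution
            )
    return table[-1][-1]


print(min_edit_distance("flaw", "lawn"))
print(min_edit_distance("sunday", "saturday"))
```

Printing the rows of table inside the loop gives the step-by-step filling process asked for in part i).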

Q100.
The Python code below implements the gradient descent algorithm for logistic
regression. The parameters of the model are w and b. The learning rate of the
model is 0.1. Start with w= 0 and b =0, and report the value of w and b after
three iterations.
Q101.
Given the following points in 2D space, <(2,10) (2,5) (8,4) (5,8) (7,5) (6,4)
(1,2) (4,9)>. Suppose we initially assign (2,10), (5,8) and (1,2) as centroids.
Apply k-means clustering to cluster the given data points where k = 3.
Q102.
Write Python code for stop word removal, stemming, lemmatization, and
vectorization using a count vectorizer.

Q103.
Write a GNU Octave script to represent the system in matrix form Ax=b, where
A is the coefficient matrix and b is the constant vector. Use Octave to solve for
the vector x (which contains the values of x, y, z) using matrix operations.
Display the solution.
2x+3y-z=1; 4x+y+2z=2; -2x+5y+3z=3

Q104.
Write GNU Octave script to find the eigen value, transpose, inverse and
determinant of the following matrix:

Q105.
Write Octave code to perform the following.
i) Represent matrix A and B with (2,3) and (3,2) dimension.
ii) Retrieve non matching elements from set A and set B.
iii) Plot a sine wave as a dashed blue line.
iv) Draw the pie chart for the following details. CAT I:24(15%), CAT
II:30(15%), Quiz I:5, Quiz II: 9, DA:10.
v) Draw the scatter plot for random x and y value with markers as circle, colour
as cyan and size with 6 points.

Q106.
Explain any five arithmetic operations in GNU Octave with examples.

Q107.
Discuss the concept of plotting in GNU Octave. Explain the three different
types of plots that can be generated and the methods for customizing the plot
appearance.

Q108.
Write the code using GNU Octave in both the questions
Q109.
You are a machine learning engineer developing a neural network for object
classification in self-driving cars. The network architecture includes an input
layer with 5 neurons (object features), a hidden layer with 3 neurons using the
ReLU activation function,
and an output layer with 5 neurons using the Softmax activation function to
output probabilities for five object categories. Given the input feature vector x
and the weight matrices W1 and W2 below, write a GNU Octave script to
simulate the forward pass of the network. Display the outputs at each layer,
apply the ReLU activation after the hidden layer, and apply the Softmax
function at the output layer.
Q110.
What are the differences between MATLAB and GNU Octave?

Q111.
Write Octave code to perform the following.
a. Determinant and Transpose of the matrix
b. Eigen value and eigen vector
c. Set operations - union, intersection
d. Plot sin series data

Q112.
Write short notes on the following charts
i) Area chart
ii) Scatter plot
iii) Box-whisker plot
iv) Heat map

Q113.
How do you define a Dashboard? How to create a Dashboard? Explain the
Dashboard design principles.

Q114.
Discuss the significance and applications of special chart types in Tableau for
data visualization. Provide three examples of special chart types and explain
how they enhance data analysis and presentation.

Q115.
A store tracks daily transactions in a dataset with the following columns:
Transaction_ID (unique identifier for each transaction)
Date (date of the transaction)
Region (e.g., North, South, East, West)
Product_Type (e.g., Electronics, Clothing, Home Goods)
Units_Sold
Revenue
a) Sketch a mock line chart to show Revenue over time for each Region.
Ensure each region has a separate line with a distinct colour. Label the peak
revenue point for any one region and describe briefly how you would
accomplish this in Tableau.
b) Describe how you would apply a filter in Tableau to view data only for the
Product_Type "Electronics." Additionally, explain how you could drill down
from monthly to daily data on the line chart to analyze trends in more detail.
c) Assume that, after creating a dashboard with this chart, you observe that the
South region has a consistent upward trend in Revenue, while the West region
shows a decline over the last three months. Suggest two possible actions the
store might take to address these trends, specifically one for each region.

Q116.
Explain the significance of Tableau in data analytics. Also write in detail about
the generation of various charts for visualization.
