Dear Students,
You are given a dataset of an online retail store. This dataset contains information about
online transactions. The following is a screenshot of the first few rows of the dataset.
You are requested to conduct the following analysis on this dataset.
1. Clean and preprocess the dataset so that values are clean, non-redundant, and
follow appropriate data types.
2. Compare the monthly sales patterns of the top five products (in terms of revenue).
3. Find the top 100 customers for 2009-10 and 2010-11 in terms of revenue generated. Is
there any overlap between these two sets? If yes, how many customers are common to
both sets?
4. Find 2-3 products that have the most stable sales across the months of the year.
5. Which products are the best-selling products (top 5) in the months of May-June and
December-January respectively?
6. Which product has the maximum units sold in any single month across both years?
7. Which product has the highest monthly average revenue for the year 2010-11?
8. Please conduct an RFM analysis on this dataset. Please generate an appropriate label for
each customer on the basis of their RFM score. The following note, extracted from the internet,
may help you understand RFM segmentation. (Hint: You may use the group by function
or pivot in RapidMiner to generate the Recency, Frequency, and Monetary scores for each
customer in the dataset. Use only 2010-11 data for generating the RFM segmentation.)
“RFM is a method used for analyzing customer value. It is commonly used in database marketing
and direct marketing and has received particular attention in retail and professional services
industries. RFM stands for the three dimensions:
• Recency – How recently did the customer purchase?
• Frequency – How often do they purchase?
• Monetary Value – How much do they spend?
Customer purchases may be represented by a table with columns for the customer name, date of
purchase and purchase value. One approach to RFM is to assign a score for each dimension on
a scale from 1 to 10. The maximum score represents the preferred behavior and a formula could
be used to calculate the three scores for each customer. For example, a service-based business
could use these calculations:
• Recency = the maximum of "10 – the number of months that have passed since the
customer last purchased" and 1
• Frequency = the maximum of "the number of purchases by the customer in the last 12
months (with a limit of 10)" and 1
• Monetary = the highest value of all purchases by the customer expressed as a multiple of
some benchmark value
Alternatively, categories can be defined for each attribute. For instance, Recency might be broken
into three categories: customers with purchases within the last 90 days; between 91 and 365 days;
and longer than 365 days. Such categories may be derived from business rules or using data
mining techniques to find meaningful breaks.
Once each of the attributes has appropriate categories defined, segments are created from the
intersection of the values. If there were three categories for each attribute, then the resulting matrix
would have twenty-seven possible combinations (one well-known commercial approach uses five
bins per attribute, which yields 125 segments). Companies may also decide to collapse certain
subsegments, if the gradations appear too small to be useful. The resulting segments can be
ordered from most valuable (highest recency, frequency, and value) to least valuable (lowest
recency, frequency, and value). Identifying the most valuable RFM segments can capitalize on
chance relationships in the data used for this analysis. For this reason, it is highly recommended
that another set of data be used to validate the results of the RFM segmentation process.
Advocates of this technique point out that it has the virtue of simplicity: no specialized statistical
software is required, and the results are readily understood by business people. In the absence of
other targeting techniques, it can provide a lift in response rates for promotions.”
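For students who prefer to check their RapidMiner results in code, the scoring formulas and the three Recency categories from the note above can be sketched in plain Python. This is only an illustration under stated assumptions, not the required method: the toy transactions, the reference date, the benchmark value of 50, and the function names are all hypothetical and do not come from the dataset.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy transactions: (customer_id, purchase_date, purchase_value).
transactions = [
    ("C1", date(2011, 11, 20), 120.0),
    ("C1", date(2011, 6, 5), 300.0),
    ("C2", date(2010, 12, 15), 40.0),
    ("C3", date(2011, 3, 1), 75.0),
    ("C3", date(2011, 9, 10), 60.0),
    ("C3", date(2011, 10, 2), 90.0),
]

REFERENCE_DATE = date(2011, 12, 1)  # assumed "today" for the analysis
BENCHMARK_VALUE = 50.0              # assumed benchmark for the Monetary score


def months_between(earlier, later):
    """Whole calendar months elapsed from `earlier` to `later` (never negative)."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1
    return max(months, 0)


def recency_category(last_purchase, reference_date):
    """Three-way Recency split from the note: <= 90 days, 91-365 days, > 365 days."""
    days = (reference_date - last_purchase).days
    if days <= 90:
        return "within 90 days"
    if days <= 365:
        return "91 to 365 days"
    return "over 365 days"


def rfm_scores(rows, reference_date, benchmark):
    """Return {customer: (recency, frequency, monetary, recency_category)}."""
    # Group purchases by customer (the "group by" step from the hint).
    by_customer = defaultdict(list)
    for customer, when, value in rows:
        by_customer[customer].append((when, value))

    scores = {}
    for customer, purchases in by_customer.items():
        last_purchase = max(when for when, _ in purchases)
        # Purchases made less than 12 whole months before the reference date.
        recent = [p for p in purchases
                  if months_between(p[0], reference_date) < 12]
        # Recency = max(10 - months since last purchase, 1)
        recency = max(10 - months_between(last_purchase, reference_date), 1)
        # Frequency = max(purchases in the last 12 months capped at 10, 1)
        frequency = max(min(len(recent), 10), 1)
        # Monetary = highest single purchase as a multiple of the benchmark
        monetary = max(value for _, value in purchases) / benchmark
        scores[customer] = (recency, frequency, monetary,
                            recency_category(last_purchase, reference_date))
    return scores


if __name__ == "__main__":
    for customer, score in sorted(
            rfm_scores(transactions, REFERENCE_DATE, BENCHMARK_VALUE).items()):
        print(customer, score)
```

The grouping loop plays the role of the group by operator mentioned in the hint. The score tuples can then be mapped to labels of your own choosing; the note leaves the number of bins and the labels open.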