**The goal:** We have been challenged to forecast sales data provided by the retail giant Walmart 28 days into the future.
**The data:** We mainly work with a dataset of 13,683,901 rows × 13 columns.
**The result:** We currently rank 240/3685 (top 9%), which is in the bronze medal zone.
Full abstract: Abstract.doc
There are three CSV files in total, covering Walmart sales information from 2011 to 2016.
2.1.1 Hierarchical Sales Time Series Data: sales_train_validation.csv
The main data come from 10 Walmart stores in 3 US states: California (CA), Texas (TX), and Wisconsin (WI).
Note: the test set (the 28 days we are challenged to predict) is not included.
- “Hierarchical” information: The data comprise 3049 individual products from 3 categories and 7 departments, sold in 10 stores in 3 states, and can thus be aggregated on 4 different levels: item level, department level, product category level, and state level (or combinations of these).
- Time series data: 42,840 hierarchical sales time series, with 1 column for each of the 1941 days from 2011-01-29 to 2016-05-22.
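The aggregation levels described above can be sketched with a few `groupby` calls. This is a minimal illustration on a toy table, not the report's actual pipeline: the column names (`item_id`, `dept_id`, `cat_id`, `store_id`, `state_id`, `d_1`, ...) and all values are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the wide sales table: one row per item/store pair,
# one column per day. Column names and values are assumed for illustration.
wide = pd.DataFrame({
    "item_id":  ["FOODS_1_001", "FOODS_1_001", "HOBBIES_1_001", "HOBBIES_1_001"],
    "dept_id":  ["FOODS_1", "FOODS_1", "HOBBIES_1", "HOBBIES_1"],
    "cat_id":   ["FOODS", "FOODS", "HOBBIES", "HOBBIES"],
    "store_id": ["CA_1", "TX_1", "CA_1", "TX_1"],
    "state_id": ["CA", "TX", "CA", "TX"],
    "d_1": [3, 1, 0, 2],
    "d_2": [4, 0, 1, 1],
    "d_3": [2, 2, 0, 0],
})

day_cols = [c for c in wide.columns if c.startswith("d_")]


def aggregate(frame, keys):
    """Sum the daily columns over all rows that share the given hierarchy keys."""
    return frame.groupby(keys)[day_cols].sum()


by_state = aggregate(wide, ["state_id"])                # state level
by_cat = aggregate(wide, ["cat_id"])                    # product category level
by_state_cat = aggregate(wide, ["state_id", "cat_id"])  # a combined level
total = wide[day_cols].sum()                            # everything as 1 series

# Number of item/store series implied by the counts in the text.
n_bottom = 3049 * 10
```

Each aggregate is itself a set of time series, so the same plotting and feature code can be reused at any level of the hierarchy.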
2.1.2 Product Price Information: sell_prices.csv
- Weekly average prices: The store and item IDs together with the sales price of the item as a weekly average.
Note: the test set (the 28 days we are challenged to predict) is included in this CSV.
2.1.3 Holiday Information: calendar.csv
Dates together with related features such as day of the week, month, year, and special holidays.
- Event days: Religious, sporting, or cultural event days such as the Super Bowl.
- SNAP days: Whether the stores in each state allowed purchases with SNAP food stamps.
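One way the three files could be combined into a single long table is sketched below. It assumes that the sales file stores days as `d_1`, `d_2`, ... columns, that the calendar maps each `d` label to a date and a week key (`wm_yr_wk`), and that prices are keyed by store, item, and week. These column names and the toy values are assumptions for illustration, not taken from the report.

```python
import pandas as pd

# Toy stand-ins for the three CSV files (column names and values are assumed).
sales_wide = pd.DataFrame({
    "item_id": ["FOODS_1_001"], "store_id": ["CA_1"],
    "d_1": [3], "d_2": [0],
})
calendar = pd.DataFrame({
    "d": ["d_1", "d_2"],
    "date": pd.to_datetime(["2011-01-29", "2011-01-30"]),
    "wm_yr_wk": [11101, 11101],
    "event_name_1": [None, None],
    "snap_CA": [0, 0],
})
prices = pd.DataFrame({
    "store_id": ["CA_1"], "item_id": ["FOODS_1_001"],
    "wm_yr_wk": [11101], "sell_price": [2.0],
})

# Wide -> long: one row per (item, store, day).
long = sales_wide.melt(
    id_vars=["item_id", "store_id"], var_name="d", value_name="sales"
)

# Attach calendar features by day label, then weekly prices by store/item/week.
long = long.merge(calendar, on="d", how="left")
long = long.merge(prices, on=["store_id", "item_id", "wm_yr_wk"], how="left")
```

Left joins keep every sales row even when a price or calendar entry is missing, which makes gaps visible as NaN instead of silently dropping rows.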
The sales information in these 3 CSV files runs from January 2011 to June 2016; prices and an events calendar are provided alongside it.
The validation set is provided, but the evaluation set is unknown (teams will be ranked on their RMSE scores for the evaluation set).
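As a minimal sketch of the RMSE named above, assuming plain unweighted errors over all forecast values (any weighting or scaling the leaderboard may apply is not shown), the score can be computed like this. The function name and the toy numbers are made up for the example.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy example: errors are 1, 0, -2, 0, so the mean squared error is 1.25.
score = rmse([3, 0, 2, 5], [2, 0, 4, 5])
```

Because errors are squared before averaging, a few large misses on high-volume items weigh more heavily than many small ones.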
We drew many plots to find patterns in the data. The following 3 mind maps collect the conclusions we drew from the 3 files described above. To keep this report brief, this part covers only a general overview; the EDA directly related to each topic is presented in the subsequent "Feature Engineering" parts.
An interactive version is available: EDA-TS-1.html
- Aggregating to 1 time series:
- Sales are generally going up; the most recent sales (2015 to 2016) appear to grow a bit faster.
- Some yearly seasonality and strong weekly seasonality.
- A dip at Christmas, which is the day of the year when the stores are closed.
- Explanation of the 7-day rolling features:
- The weekly pattern is strong, with Sat and Sun standing out prominently.
- The months of Nov and Dec show clear dips, while the summer months May, Jun, and Jul suggest a milder secondary dip.
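- Explanation of features 3 to 37:
- However, we start the lags at 28 instead of 1. With a 28-day horizon, shorter lags are not yet observed for the later forecast days, which creates a "rolling predictions" problem (earlier predictions must be fed back in as inputs) and makes the later models unstable in forecasting.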
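The idea of starting the lags at 28 can be sketched as follows. This is an illustration under stated assumptions, not the report's feature code: the column names (`id`, `date`, `sales`), the specific lags (28 and 35), the 7-day window, and the helper `add_lag_features` are all hypothetical.

```python
import pandas as pd

HORIZON = 28  # forecast horizon in days, as stated in the report


def add_lag_features(frame, lags=(28, 35), windows=(7,)):
    """Add lag and rolling-mean features built only from sales that are
    at least HORIZON days old, so no forecast has to be fed back in."""
    frame = frame.sort_values(["id", "date"]).copy()
    grouped = frame.groupby("id")["sales"]
    for lag in lags:
        # Value of the same series `lag` days earlier.
        frame[f"lag_{lag}"] = grouped.shift(lag)
    for window in windows:
        # Rolling mean over `window` days, ending HORIZON days before each row.
        frame[f"rmean_{HORIZON}_{window}"] = grouped.transform(
            lambda s: s.shift(HORIZON).rolling(window).mean()
        )
    return frame


# Toy series: sales equal to the day index, which makes the shifts easy to check.
dates = pd.date_range("2011-01-29", periods=40, freq="D")
toy = pd.DataFrame({"id": "A", "date": dates, "sales": range(40)})
feats = add_lag_features(toy)
```

Every feature for a given day depends only on sales observed 28 or more days earlier, so all 28 forecast days can be predicted in one pass from known history.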
Interactive version of the figure below: EDA-TS-2.html
- California (CA) sells more items in general (maybe because it has 4 stores in the data, “CA_1” to “CA_4”).
- Wisconsin (WI) was slowly catching up to Texas (TX) and eventually surpassed it in the last months.
- The CA stores are relatively well separated in store volume.
- “CA_2”, which declines to the “CA_4” level in 2015, recovers and jumps up to the “CA_1” sales level later.
- The TX stores are quite close together in sales.
- The WI stores “WI_1” and “WI_2” show a curious jump in 2012, while “WI_3” shows a long dip.
- Combined Level 2 + Level 3 visualization:
- “FOODS_3” is clearly driving the majority of “FOODS” category sales.
- “HOUSEHOLD_1” is also outselling “HOUSEHOLD_2”.
- “HOBBIES_1” has higher sales than “HOBBIES_2”, but neither is growing over time.
- “Foods” is the largest category in terms of sales, followed by “Household”, which is still quite a bit above “Hobbies”.
- However, the number of “Household” rows is closer to the number of “Foods” rows than the corresponding sales figures are, indicating that “Foods” items sell more units per item than “Household” ones.
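The comparison between row counts and sales figures can be made explicit by putting both shares side by side per category. The sketch below uses made-up toy numbers and assumed column names (`cat_id`, `sales`); it only illustrates the calculation, not the report's actual figures.

```python
import pandas as pd

# Toy long-format data: one row per item/day, with units sold (values made up).
long = pd.DataFrame({
    "cat_id": ["FOODS", "FOODS", "FOODS", "HOUSEHOLD", "HOUSEHOLD", "HOBBIES"],
    "sales":  [10, 8, 6, 3, 3, 1],
})

# Row count and total units per category.
summary = long.groupby("cat_id")["sales"].agg(rows="count", units="sum")

# Share of rows vs. share of units, and average units per row.
summary["row_share"] = summary["rows"] / summary["rows"].sum()
summary["unit_share"] = summary["units"] / summary["units"].sum()
summary["units_per_row"] = summary["units"] / summary["rows"]
```

A category whose `unit_share` exceeds its `row_share` sells more units per row than average, which is the pattern described for “Foods”.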
The data presentation in this part is fairly complex, so I also made an interactive chart: 原始数据可视化表盘.html
- First of all, the price distributions are almost identical across the 3 states.
- Only some minute differences appear in the “FOODS” category.
- There are, however, notable differences between the categories: FOODS items are on average cheaper than HOUSEHOLD items, and HOBBIES items span a wider range of prices than the other two, even suggesting a second peak at lower prices.
- Effect of events on sales by category:
- For HOBBIES, normal days have slightly larger sales (and a larger above-average portion).
- FOODS sales are notably higher during “Sporting” events.
- In general, “National” and “Religious” events both lead to a relative decline in sales volume.
- Effect of events on sales by state:
- Special events slightly outsell non-event days in TX before 2014.
- For WI, “National” events have a strong negative impact on sales numbers.
- In contrast, “Religious” events have the smallest, but still negative impact in WI.
- “Sporting” events have a positive influence in each state.
- Effect of SNAP days on sales by state:
- SNAP days have clearly higher sales in every state.
- The largest difference from non-SNAP days is in WI.
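A comparison like the SNAP one above can be computed by averaging daily sales per state for SNAP and non-SNAP days and taking the relative difference. The sketch below is only an illustration: the column names (`state_id`, `snap`, `sales`) are assumptions, and the numbers are invented toy values, not results from the competition data.

```python
import pandas as pd

# Toy daily state-level sales with a 0/1 SNAP flag (all values invented).
daily = pd.DataFrame({
    "state_id": ["CA", "CA", "CA", "CA", "WI", "WI", "WI", "WI"],
    "snap":     [0, 0, 1, 1, 0, 0, 1, 1],
    "sales":    [100, 110, 115, 116, 50, 50, 60, 65],
})

# Mean sales per state, split into non-SNAP (0) and SNAP (1) days.
means = daily.groupby(["state_id", "snap"])["sales"].mean().unstack("snap")
means.columns = ["non_snap", "snap"]

# Relative uplift of SNAP days over non-SNAP days, in percent.
means["uplift_pct"] = 100 * (means["snap"] / means["non_snap"] - 1)
```

The same pattern works for event types: replace the SNAP flag with an event-type column and compare each group's mean against the non-event baseline.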
To better present the data later and to allow predictions after training by inputting new data, I sketched a simple web UI. However, due to time constraints, the backend data connection was not completed. I hope to improve this in future courses.
user interface of data dashboard.zip (not yet complete)
During data visualization, we reached many conclusions. To facilitate feature engineering and strategy design later, I summarized them in three mind maps (corresponding to sections 3.2.1–3.2.3, 3.2.4, and 3.2.5).