MDA&R Chapter 1

The document outlines the process of data analysis, emphasizing the importance of cleaning raw data to identify and fix errors before summarizing it. It discusses various types of data errors, methods for identifying these errors, and the necessity of documenting all checks and modifications made. Additionally, it highlights the need for summarizing data to understand its distribution and suggests using automated checks and reasonableness checks to ensure data integrity.

Uploaded by

faith.aastha.shah

Class: TY BSc

Subject: Model Documentation Analysis and Reporting

Subject Code: PUSASQF503
Chapter: Unit 1 Chapter 1
Chapter Name: Data Analysis

Today’s Agenda
1. Data Analysis
   1.1 The given Data
   1.2 Raw data and Clean data
   1.3 Cleaning the data
   1.4 Identify data errors
   1.5 Fix the data errors
   1.6 Summarizing data
   1.7 Extracting information

1.1 The given Data
A typical assessment project will present you with a data set to work with.

What do you think: can we always use the given data exactly as it is, or do we need to pre-process it first?

1.1 The given Data
• The data set you are given will not be in the precise form you require.

• For example, you might be provided with a set of values of the FTSE 100 index, which you must first
convert into rates of return. As a result, some pre-processing of the data may be required.

• Also, the data you are given might contain errors. If these are not dealt with, the results of the analysis will
be meaningless. So it is important to validate or ‘clean’ the data, ie identify and deal with any errors.

• You may also need to summarise the data set, ie calculate some key statistics that describe the ‘shape’ of
the data.

• Note that you may be asked to create your own data! This is not as silly as it might sound – you may be
asked to use a particular model to simulate a set of results, which will then form your data set.
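
The pre-processing step mentioned above (converting index values into rates of return) can be sketched in Python as follows. This is a minimal illustration only: the index values are made up, the function name `simple_returns` is hypothetical, and simple period-on-period returns are assumed (a project might instead call for log returns).

```python
def simple_returns(values):
    """Return the simple rate of return between each pair of consecutive values."""
    if len(values) < 2:
        return []
    return [later / earlier - 1 for earlier, later in zip(values, values[1:])]


# Illustrative index values only (not real FTSE 100 data).
index_values = [100.0, 110.0, 99.0]
returns = simple_returns(index_values)

for r in returns:
    print(f"{r:.2%}")
```

Note that a series of n index values gives n - 1 rates of return, which is itself a useful count to check.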

1.2 Raw Data & Clean Data
It is good practice to keep the original data set intact and work from a ‘copy’ within Excel. This means that if a
different, or corrected, data set is used later, it will be possible to compare the two data sets.

Ideally your spreadsheet should show separately the original ‘raw’ data with any warning messages from the
Excel checks that were applied and the modified ‘clean’ data with the corresponding Excel checks now saying
‘OK’.
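
The same idea can be sketched outside Excel. The Python snippet below is an illustration under assumptions, not a prescribed method: the data values, the function name `check_no_missing` and the choice to drop the missing entry are all hypothetical. It keeps the raw data untouched, works on a copy, and reports a warning for the raw data and ‘OK’ for the clean data.

```python
import copy

raw_data = [215553, None, 2546, 1233]   # hypothetical raw values, never modified
clean_data = copy.deepcopy(raw_data)     # all fixes are applied to this copy only


def check_no_missing(values):
    """Return 'OK' if no entry is missing, otherwise a warning message."""
    missing = sum(1 for v in values if v is None)
    return "OK" if missing == 0 else f"WARNING: {missing} missing value(s)"


raw_check = check_no_missing(raw_data)

# Example remedial action (would be recorded in the audit trail): drop the missing entry.
clean_data = [v for v in clean_data if v is not None]
clean_check = check_no_missing(clean_data)

print("raw:  ", raw_check)
print("clean:", clean_check)
```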

1.3 Cleaning the Data
What do you think – What type of errors would the data contain?

1.3 Cleaning the Data
Types of data errors

Errors in numerical data in computer files usually consist of:

• wrong numbers
• outliers
• omissions or duplicates

1.3 Cleaning the Data
Types of data errors

Wrong numbers
• Wrong numbers can occur because of incorrect inputting. Particularly common are:
  • omitted digits, eg 21553 instead of 215553
  • anagrams, eg 2456 instead of 2546
  • mistakes involving repeated digits, eg 1223 instead of 1233.

Outliers
• ‘Outliers’ are extreme data values that don’t appear to be consistent with the model, ie they don’t fit the pattern.
• They may distort the results.

Omissions or duplicates
• The data could have missing entries for certain values.
• Also, there could be certain values that are double counted or entered twice (two times for same person, same stock etc.)
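
As a rough sketch of how omissions and duplicates might be detected automatically, the Python snippet below scans a small set of records. Everything in it is hypothetical: the record layout, the ID values, and in particular the assumption that IDs should run consecutively from the smallest to the largest.

```python
from collections import Counter

# Hypothetical records: (policy_id, value). None marks a missing value.
records = [(101, 2546), (102, None), (103, 1233), (103, 1233), (105, 990)]

id_counts = Counter(policy_id for policy_id, _ in records)

# IDs that appear more than once (possible double counting).
duplicate_ids = sorted(pid for pid, n in id_counts.items() if n > 1)

# IDs whose value is missing.
missing_value_ids = sorted(pid for pid, value in records if value is None)

# Assumption: IDs are expected to be consecutive, so any gap is an omitted record.
expected_ids = set(range(min(id_counts), max(id_counts) + 1))
omitted_ids = sorted(expected_ids - set(id_counts))

print("duplicates:", duplicate_ids)
print("missing values:", missing_value_ids)
print("omitted records:", omitted_ids)
```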

1.4 How can we identify data errors?
Usually, it will be sufficient to:
• scan through the data by eye to spot any obvious problems (eg missing entries). However, this may not pick up
all the errors, especially if the data set is large.
• calculate a few summary statistics, such as the number of data values and the maximum/minimum values
• apply some automated Excel checks, using Excel formulae that will highlight any errors
• apply some reasonableness checks to the summary statistics. The summary statistics should highlight any
outliers, as these may fall outside the normal range of values. The summary calculations will also throw up an
error if you have applied an Excel function to data that contains invalid characters – for example, a letter O
instead of a zero.
• reconcile the summary statistics with any additional information you have been given in the project specification.
• plot a graph where we have a series of data values that we would expect to show a consistent
progression. A graph would highlight any spikes or other irregularities that might be difficult to spot otherwise.
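
The checks listed above can be sketched in Python as well as in Excel. In the illustration below, the entries, the helper name `to_number` and the plausible range (0 to 10,000) are all assumptions made for the example. It flags an entry containing a letter O instead of a zero, calculates a few summary statistics, and applies a simple range check to pick out a possible outlier.

```python
def to_number(text):
    """Convert text to a float, or return None if it contains invalid characters."""
    try:
        return float(text)
    except ValueError:
        return None


# Hypothetical entries; "21O5" contains a letter O instead of a zero.
raw_entries = ["1233", "2546", "21O5", "215553", "1190"]

parsed = [to_number(entry) for entry in raw_entries]
invalid_entries = [e for e, p in zip(raw_entries, parsed) if p is None]
valid_values = [p for p in parsed if p is not None]

summary = {
    "count": len(valid_values),
    "min": min(valid_values),
    "max": max(valid_values),
}

# Assumed plausible range for this hypothetical data set.
LOWER, UPPER = 0, 10_000
out_of_range = [v for v in valid_values if not LOWER <= v <= UPPER]

print(summary)
print("invalid:", invalid_entries, "out of range:", out_of_range)
```

A value flagged here is only a candidate error. Whether it is modified, and why, still needs to be decided and documented.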

1.4 How can we identify data errors?
• Try to incorporate some automated checks on the data and include a description of this in the audit trail.
For large data sets, automated checks are more reliable than reviewing by eye.

• Document all the data checks you apply and any remedial action you take (even if no remedial action is
required) and give reasons for your approach.

• However, don’t spend too long working on the data. It is important to move on to develop the rest of the
model.

Question
Two of the columns of data provided for a valuation of the benefits for employees of a large company who are
members of the company’s pension scheme are:
• sex (with M for male or F for female)
• date of birth (in the format DD/MM/YYYY).

(i) List the checks that you could apply to the data values in these two columns to ‘clean’ the data.

Solution
If we are told how many employees there ‘should’ be, we can start by counting the numbers of M’s and F’s to
check that these match the numbers of males and females in the pension scheme on that date.

For the ‘sex’ column we could:

• scan by eye for any missing entries or ones that are not M’s or F’s
• apply an automated Excel check to ensure that all the entries are either M or F
• use the filter feature in Excel to identify the different entries present in the column (which should only
include M’s and F’s)
• count the number of entries in the column and check that this is consistent with the number of employees
in the pension scheme
• check with the company how many employees there should be.

More advanced checks we could apply (if we had the necessary information) would include:
• compare the numbers of each sex (or the gender ratio) with the corresponding figures from the previous
valuation
• use the employees’ names or employee numbers to check that the entries are consistent on an individual
basis with the previous valuation.
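
The basic checks on the ‘sex’ column might be automated along the following lines. This is a sketch in Python rather than Excel, and the column entries and the expected headcount are hypothetical. Counting the distinct entries plays the same role as the Excel filter feature.

```python
from collections import Counter

sex_column = ["M", "F", "F", "m", "", "M", "X"]  # hypothetical entries
expected_count = 7                                # assumed headcount from the company

# Distinct entries and how often each occurs (should only be 'M' and 'F').
counts = Counter(sex_column)

# Row positions (0-based) of entries that are neither 'M' nor 'F', including blanks.
invalid_rows = [i for i, entry in enumerate(sex_column) if entry not in ("M", "F")]

entry_check = "OK" if not invalid_rows else f"WARNING: {len(invalid_rows)} invalid entries"
count_check = "OK" if len(sex_column) == expected_count else "WARNING: count mismatch"

print(dict(counts))
print(entry_check, "|", count_check)
```

Here the lower-case ‘m’ is treated as invalid. Whether to correct it to ‘M’ is a judgement that would be recorded in the audit trail.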

Solution
For the ‘date of birth’ column we could:
• scan by eye for any missing entries or obvious errors, eg years containing 5 digits
• apply an automated Excel check to ensure that all the entries are valid dates, eg no 30 February,
no month 13 and no unpopulated entries such as 00/01/1900 or DD/MM/YYYY
• calculate the minimum and maximum age to check that there are no outliers, eg employees aged 12 or 105.

More advanced checks we could apply (if we had the necessary information) would include:
• calculate the average age and check that this is consistent with the employee profile
• compare the average age with the corresponding figure from the previous valuation
• use the employees’ names or employee numbers to check that the entries are consistent on an individual
basis
• plot a graph of the number of employees born in each year (or each month) to look for any irregularities
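
A possible sketch of the basic ‘date of birth’ checks is given below in Python. The valuation date, the plausible age range (16 to 75) and the sample entries are assumptions chosen for illustration, and the helper names `parse_dob` and `age_on` are hypothetical.

```python
from datetime import date, datetime


def parse_dob(text):
    """Parse a DD/MM/YYYY string, returning None if it is not a valid date."""
    try:
        return datetime.strptime(text, "%d/%m/%Y").date()
    except ValueError:
        return None


def age_on(dob, valuation_date):
    """Age in completed years on the valuation date."""
    had_birthday = (valuation_date.month, valuation_date.day) >= (dob.month, dob.day)
    return valuation_date.year - dob.year - (0 if had_birthday else 1)


VALUATION_DATE = date(2024, 12, 31)   # assumed valuation date
MIN_AGE, MAX_AGE = 16, 75             # assumed plausible range for employees

dob_column = ["15/06/1980", "30/02/1975", "01/13/1990", "DD/MM/YYYY",
              "00/01/1900", "20/03/2012", "02/01/1919"]

invalid_entries = []
age_outliers = []
for text in dob_column:
    dob = parse_dob(text)
    if dob is None:
        invalid_entries.append(text)          # not a valid date at all
        continue
    age = age_on(dob, VALUATION_DATE)
    if not MIN_AGE <= age <= MAX_AGE:
        age_outliers.append((text, age))      # valid date, implausible age

print("invalid:", invalid_entries)
print("outliers:", age_outliers)
```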

1.5 How do we fix data errors?
• If you spot a data error that you think would significantly affect the results, you should modify the data as
you think best, and document clearly in your audit trail what you have done and why.

• Try to set up your spreadsheet so that any changes made to the data at a later stage (possibly by someone
else) will automatically be reflected in the subsequent calculations.

• Where possible avoid copying and pasting values since this means that changes to the data will not be
reflected later in the calculations. Use cell references instead.

• If, for some reason, you cannot avoid pasting values, document very clearly what you have done so that
someone else would be able to replicate your work.

1.6 Summarizing a data set
The purpose of summarising a data set is to get an idea of the distribution of the values, ie the
‘shape’ of the data. This normally involves finding:
• the number of data values
• the highest and lowest values
• sample moments, such as the mean and standard deviation.
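
As an illustration, these summary statistics can be calculated with Python’s standard `statistics` module. The data values below are made up, and the sample standard deviation (dividing by n - 1) is assumed to be the one required.

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative data only

summary = {
    "count": len(values),
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "sample_sd": statistics.stdev(values),   # sample standard deviation (divides by n - 1)
}

for name, value in summary.items():
    print(f"{name}: {value}")
```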

1.7 Extracting the relevant information from a data set
In some cases you may need to convert the data from its original form. For example, you might
need to convert a history of market values of an asset into rates of return or you might need to
convert dates of birth into ages before proceeding.

This conversion should be done after you have sorted out any problems with the original data.

Summary
• Make sure you understand the data you’ve been given.
• ‘Clean’ the data by identifying any obvious errors.
• Include some automated checks.
• Calculate summary statistics, such as totals and averages.
• Apply reasonableness checks to identify any outliers.
• Reconcile the summary statistics with any additional information given.
• Consider plotting a graph to highlight errors when checking a series of data values.
• ‘Prepare’ the data, eg calculate any derived quantities and/or subdivide the data.
• Be prepared to create your own data set for a simulation.
• Document all the data checks you apply, even if no remedial action is required.
• Don’t spend too long working on the data, especially if there don’t appear to be any problems.
• Not all data sets will need all the steps outlined above. You will need to demonstrate that you can
apply the appropriate steps.

Question
A colleague has mentioned that marks are awarded in Paper 1 of the CP2 exam for ‘auto‐checks’, ie formulae in
the spreadsheet that check the values in particular cells. One purpose of an auto‐check is to check whether a
value that has been entered by the user is an acceptable value for that cell, eg whether it is a valid date that
falls within a permitted time period.

(i) State two other purposes that auto‐checks can be used for and give an example of each. [2]

(ii) List four other types of check (other than auto‐checks) that can be used to identify possible errors in the
data in a spreadsheet. [2]

(iii) (a) Explain what is meant by a reasonableness check (also known as a ‘sense check’).

(b) Give an example of a reasonableness check that might be used in an actuarial context.

Solution
(i) Auto‐checks
Auto‐checks can also be used to check for:
• consistency, eg checking that the totals in a summary table are consistent with the totals in the data they
are derived from or that a set of probabilities adds up to 1 [1]
• reasonableness, eg checking that a calculated value is similar to a rough estimate or lies within the range
expected, eg probabilities lying in the range [0, 1].
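
The two kinds of auto‐check described above could look something like the following Python sketch, which returns ‘OK’ or a warning message in the same way as a spreadsheet check cell. The function names, the tolerance and the sample figures are assumptions for illustration.

```python
import math


def check_probabilities(probs, tol=1e-9):
    """Return 'OK', or a warning describing the first check that fails."""
    if any(not 0.0 <= p <= 1.0 for p in probs):
        return "WARNING: probability outside [0, 1]"      # reasonableness check
    if not math.isclose(sum(probs), 1.0, abs_tol=tol):
        return "WARNING: probabilities do not sum to 1"   # consistency check
    return "OK"


def check_totals(detail_values, summary_total, tol=1e-9):
    """Check that a summary total agrees with the data it is derived from."""
    if math.isclose(sum(detail_values), summary_total, abs_tol=tol):
        return "OK"
    return "WARNING: totals differ"


print(check_probabilities([0.2, 0.3, 0.5]))
print(check_probabilities([0.2, 0.3, 0.4]))
print(check_totals([100.0, 250.0, 50.0], 400.0))
```

A tolerance is used rather than an exact comparison because rounding in floating-point arithmetic can make a correct total differ very slightly from the expected value.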

(ii) Other types of checks

Other types of checks that can be used include:
• checking for obvious errors ‘by eye’
• calculating summary statistics (eg averages, minimum/maximum)
• using Excel’s built-in features such as auto‐filter
• comparing against independent information or background information provided
• spot checks, ie calculating a few figures manually
• reasonableness checks, eg comparing a figure with a rough estimate.

Solution
(iii)(a) Reasonableness checks
As the name suggests, a reasonableness check is intended to check whether a particular figure is believable, ie
that it is not obviously wrong. This could involve:
• comparing it with a rough estimate
• checking that it lies in the range that would be expected.

(iii)(b) Example of a reasonableness check

Examples in an actuarial context might include:
• a pension fund might check the total annual payroll for a large company by multiplying the number of
employees by a published average wage for companies in that sector
• a non‐life insurer might check the average claim size for a particular class of insurance by comparing it with
the corresponding figure from the previous year, adjusted for inflation
• an investment manager might check the latest value of its investments by comparing the proportions held
in each asset class with the proportions from the previous valuation.
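
The payroll example above can be sketched as a simple calculation. All the figures below are hypothetical, as is the 10% tolerance and the function name `within_tolerance`. The point is only to show a reported figure being compared with a rough estimate.

```python
def within_tolerance(actual, estimate, tolerance=0.10):
    """True if actual lies within +/- tolerance (as a proportion) of the estimate."""
    return abs(actual - estimate) <= tolerance * abs(estimate)


# All figures below are hypothetical.
employees = 2_000
sector_average_wage = 30_000.0
reported_payroll = 63_500_000.0

# Rough estimate: number of employees x published average wage for the sector.
rough_estimate = employees * sector_average_wage
payroll_ok = within_tolerance(reported_payroll, rough_estimate)

print("estimate:", rough_estimate, "reasonable:", payroll_ok)
```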
