Exploratory Data Analysis For Machine Learning

The document summarizes exploratory data analysis of a climate dataset containing monthly temperature fluctuations from 1870 to 2019. Key findings include that the data follows a normal distribution with some outliers, linear regression is not suitable due to the non-linear dispersion of the data, and the temperature readings are not influenced by each other but rather external seasonal variables. A hypothesis test determined the average temperature anomaly is higher than -0.1°C. Further non-linear analysis is suggested due to the non-linear behavior of the data series. The data set is of high quality with no missing values.

Uploaded by

Gabriel Pessine

Exploratory Data Analysis for Machine Learning

01. Brief description of the data set and a summary of its attributes

The dataset comes from Brazil’s INPE and contains climate data. This one specifically relates to the ENSO (El Niño-Southern Oscillation) phenomenon.

The dataset contains 1,800 readings of monthly temperature fluctuations, from 1870 to 2019.

The data series has 150 rows, one per year, and 12 columns, one for each month.
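
The layout described above (150 yearly rows by 12 monthly columns, 1,800 readings in total) can be sketched as follows. This is only an illustration: the values are random placeholders rather than the INPE readings, and the names `wide`, `long` and `MONTHS` are hypothetical.

```python
import numpy as np
import pandas as pd

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
years = pd.RangeIndex(1870, 2020, name="year")  # 1870..2019 inclusive

# Placeholder values only: random numbers standing in for the real readings.
rng = np.random.default_rng(0)
wide = pd.DataFrame(
    rng.normal(0.0, 0.8, size=(len(years), len(MONTHS))),
    index=years,
    columns=MONTHS,
)

# Flatten row by row into one chronological series indexed by (year, month).
long = pd.Series(
    wide.to_numpy().ravel(),
    index=pd.MultiIndex.from_product([years, MONTHS], names=["year", "month"]),
    name="anomaly",
)

print(wide.shape)  # (150, 12)
print(len(long))   # 1800
```

The wide form suits per-month statistics, while the long form is more convenient for anything that treats the readings as a single time series.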

02. Initial plan for data exploration

The initial plan for data exploration was to determine the mean, median, quantiles and range (max and min) for each of the 12 monthly measurement columns.
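
A minimal sketch of that plan with Pandas, assuming the data sits in a 150 x 12 DataFrame with one column per month. The values below are random placeholders, not the INPE readings, and `df` and `summary` are hypothetical names.

```python
import numpy as np
import pandas as pd

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Placeholder data standing in for the real monthly readings.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(0.0, 0.8, size=(150, 12)),
                  index=pd.RangeIndex(1870, 2020, name="year"),
                  columns=MONTHS)

# Mean, median, min and max per month (one column per month).
summary = df.agg(["mean", "median", "min", "max"])

# Range and quartiles per month, appended as extra rows.
summary.loc["range"] = summary.loc["max"] - summary.loc["min"]
quartiles = df.quantile([0.25, 0.75]).rename(index={0.25: "q25", 0.75: "q75"})
summary = pd.concat([summary, quartiles])

print(summary.round(2))
```

`df.describe()` gives a similar table in one call; the explicit version is shown here so that each requested statistic is visible.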

03. Actions taken for data cleaning and feature engineering

For data cleaning, I made a scatter plot of the dataset with Matplotlib. This was the first cleaning step, since it makes outliers visible graphically. Some points did not follow the trend of the rest of the data, so, once identified visually, they could be cataloged as anomalies.
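
The outliers above were identified by eye. As a possible complement, not the method used in this report, a numeric rule such as the 1.5 x IQR fence can flag candidate points reproducibly. The sketch below uses placeholder data with one planted extreme value; all names are hypothetical.

```python
import numpy as np
import pandas as pd

# Placeholder series with one deliberately planted extreme value.
rng = np.random.default_rng(2)
values = pd.Series(rng.normal(0.0, 0.8, size=1800), name="anomaly")
values.iloc[100] = 9.0

# Interquartile-range fences: points beyond 1.5 * IQR from the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
print(f"fences: [{lower:.2f}, {upper:.2f}], flagged: {int(is_outlier.sum())}")
```

Points flagged this way are only candidates; whether they are measurement errors or genuine extreme events remains a judgment call.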

After that, I made a histogram for each of the 12 features (yearly values), creating a single plot with the histograms of all features overlaid. This was done using Pandas plotting functionality.
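
A sketch of what such an overlaid plot might look like with `DataFrame.plot.hist`. The bin count, transparency and labels are assumptions, and the data is a random placeholder rather than the INPE readings.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import numpy as np
import pandas as pd

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Placeholder data standing in for the real monthly readings.
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(0.0, 0.8, size=(150, 12)), columns=MONTHS)

# One histogram per column, drawn on the same axes with shared bins.
ax = df.plot.hist(bins=20, alpha=0.4, figsize=(10, 6))
ax.set_xlabel("Temperature anomaly (ºC)")
ax.set_title("Monthly histograms, overlaid")
```

With twelve overlaid series the plot gets crowded; `df.hist()` draws one small panel per month instead, which can be easier to read.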

For feature engineering, I will set out to improve on a baseline set of features by deriving new features from our existing data. This can make the difference between a weak model and a strong one. I plan to use visual exploration, intuition and domain understanding to construct new features that improve the forecasting capabilities of a future model.
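
As an illustration of deriving new features from the existing series, the sketch below builds lagged values, a short rolling mean and a month-of-year column. These particular features are assumptions chosen for the example, not ones selected in this report, and the data is a random placeholder.

```python
import numpy as np
import pandas as pd

# Placeholder monthly series covering 1870-01 to 2019-12 (1,800 values).
rng = np.random.default_rng(4)
index = pd.period_range("1870-01", periods=1800, freq="M")
series = pd.Series(rng.normal(0.0, 0.8, size=1800), index=index, name="anomaly")

features = pd.DataFrame({"anomaly": series})
features["lag_1"] = series.shift(1)    # previous month
features["lag_12"] = series.shift(12)  # same month, previous year
# Mean of the three preceding months (shifted so the current value is not used).
features["roll_mean_3"] = series.shift(1).rolling(3).mean()
features["month"] = series.index.month  # 1..12, a simple seasonal indicator

# The first 12 rows lack a full set of lags and are dropped.
features = features.dropna()
print(features.head())
```

Shifting before rolling keeps every derived column free of information from the month being predicted, which matters if these features feed a forecasting model.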

04. Key Findings and Insights, which synthesizes the results of Exploratory Data Analysis in an insightful and actionable manner

After analyzing the plots, I considered what they tell us about the distribution of the target. The finding was that each of the monthly data sets follows an almost normal distribution, with some exceptions (as expected).
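
The near-normality above was judged from the histograms. One way to back that impression with a number, offered here as an assumption and not as part of the original analysis, is a Shapiro-Wilk test per monthly column using `scipy.stats.shapiro`. The data below is a random placeholder.

```python
import numpy as np
import pandas as pd
from scipy import stats

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Placeholder data standing in for the real monthly readings.
rng = np.random.default_rng(5)
df = pd.DataFrame(rng.normal(0.0, 0.8, size=(150, 12)), columns=MONTHS)

# Shapiro-Wilk per month: a small p-value suggests a departure from normality.
pvalues = {}
for month in MONTHS:
    statistic, pvalue = stats.shapiro(df[month])
    pvalues[month] = pvalue
pvalues = pd.Series(pvalues, name="shapiro_p")

print(pvalues.round(3))
```

With twelve tests run side by side, a few small p-values can appear by chance alone, so the results are best read together with the histograms.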

By looking at the scatterplots of the data series, we can see that a linear regression, whether by least squares or by any other method, would not describe the behavior of the data well, and neither would a simple polynomial or similar equation. The dispersion of the data is such that other, more complex statistical analyses would be required.

Also, the monthly data sets are not different in nature from one another, as they all reflect the behavior of the same natural variable. Despite the possible sources of error in the measurement of these data, it is not possible to say that these data are influenced by each other. The environmental conditions vary according to the time of year, but this is an external variable.
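
Both observations can be checked numerically. The sketch below, run on placeholder data, computes the R² of a least-squares straight line against time and the lag-1 autocorrelation of the series. The choice of these two measures is an assumption for illustration, and the results on the real INPE data may differ from what the placeholder shows.

```python
import numpy as np
import pandas as pd

# Placeholder series standing in for the 1,800 monthly readings.
rng = np.random.default_rng(6)
t = np.arange(1800)
y = rng.normal(0.0, 0.8, size=1800)

# Least-squares straight line y ~ slope * t + intercept.
slope, intercept = np.polyfit(t, y, deg=1)
fitted = slope * t + intercept

# R^2: share of the variance explained by the line.
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Lag-1 autocorrelation: how strongly each month relates to the previous one.
lag1 = pd.Series(y).autocorr(lag=1)

print(f"R^2 of linear fit: {r_squared:.4f}")
print(f"lag-1 autocorrelation: {lag1:.4f}")
```

An R² near zero would support the remark about linear regression, while an autocorrelation near zero would support the remark that readings do not influence each other. A clearly non-zero value would call for revisiting it.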

05. Formulating at least 3 hypotheses about this data

As requested, 3 hypotheses were formulated, as follows:

1) The ENSO phenomenon will have an impact on sea level rise if the anomalies mostly oscillate between -0.5 and 1.5 ºC. Has this been true in the time interval between 1870 and 2020?
2) The temperature anomalies are, on average, higher than -0.10. The decision is based on a check of a random sample of this data.
3) Can we say that there has not been a period of twelve consecutive months in which these temperature anomalies have fallen below 0.5 ºC?
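
Hypotheses 1 and 3 lend themselves to direct counting. The sketch below assumes hypothesis 1 is read as "most readings lie between -0.5 and 1.5 ºC" and hypothesis 3 as "no run of twelve consecutive months below 0.5 ºC". Both readings, the helper `longest_run` and the placeholder data are assumptions.

```python
import numpy as np
import pandas as pd


def longest_run(mask: pd.Series) -> int:
    """Length of the longest run of consecutive True values in a boolean Series."""
    # A new run starts wherever the value differs from the previous one.
    run_id = mask.astype(int).diff().ne(0).cumsum()
    # Summing True values per run gives the run length (0 for runs of False).
    return int(mask.groupby(run_id).sum().max())


# Placeholder series standing in for the 1,800 monthly readings.
rng = np.random.default_rng(7)
values = pd.Series(rng.normal(0.0, 0.8, size=1800), name="anomaly")

# Hypothesis 1: share of readings inside the [-0.5, 1.5] band.
share_inside = values.between(-0.5, 1.5).mean()

# Hypothesis 3: longest stretch of consecutive months below 0.5.
longest_below = longest_run(values < 0.5)

print(f"share inside band: {share_inside:.2%}")
print(f"longest run below 0.5: {longest_below} months")
```

Under these readings, hypothesis 3 would hold only if the longest run is shorter than 12 months.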

06. Conducting a formal significance test for one of the hypotheses and discussing the results

Hypothesis 2 was chosen for the formal significance test.

On average, the temperature anomalies are higher than -0.10. The decision is based on a check of a random sample of this data.

Population: X = 'Water temperature anomalies in sector 3.4 of the Pacific Ocean'

H0: μ = -0.1 (null hypothesis)

Ha: μ > -0.1 (alternative hypothesis)

The approach is to use the one-sample t-test, which determines whether the sample mean is statistically different from a known or hypothesized population mean. The one-sample t-test is a parametric test.

After the test was performed, the p-value was far higher than the standard significance level, so the null hypothesis cannot be rejected. There is almost no evidence for the claim that, on average, the temperature anomalies are not higher than -0.1, and on that basis the hypothesis that they are higher than -0.1 is retained. This conclusion treats "not higher than -0.1" as the alternative being tested; against Ha: μ > -0.1 as written above, a p-value this high would instead mean the sample does not show the mean to be above -0.1.
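
A minimal sketch of a one-sample t-test of this kind with `scipy.stats.ttest_1samp`, using the one-sided alternative μ > -0.1. The sample size of 60, the 0.05 significance level and the placeholder population are assumptions; the report does not state the values actually used.

```python
import math

import numpy as np
from scipy import stats

MU0 = -0.1    # hypothesized population mean
ALPHA = 0.05  # assumed significance level

# Placeholder population standing in for the 1,800 readings.
rng = np.random.default_rng(42)
population = rng.normal(loc=0.0, scale=0.8, size=1800)

# Random sample without replacement (the size 60 is an assumption).
sample = rng.choice(population, size=60, replace=False)

# One-sided test of H0: mu = MU0 against Ha: mu > MU0.
result = stats.ttest_1samp(sample, popmean=MU0, alternative="greater")

# The same statistic by hand: (sample mean - mu0) / (s / sqrt(n)).
n = sample.size
t_manual = (sample.mean() - MU0) / (sample.std(ddof=1) / math.sqrt(n))

reject_h0 = result.pvalue < ALPHA
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}, reject H0: {reject_h0}")
```

With this setup, a p-value below the significance level supports μ > -0.1, and a p-value above it means H0 is not rejected. Passing `alternative="less"` instead tests the opposite direction.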

07. Suggestions for next steps in analyzing this data

Even though the data series presents an almost normal distribution, the scatterplot shows that it does not follow a linear trend, nor can it be described by a reasonably simple equation, because of the wide dispersion of the data, as reflected in the standard deviations. This indicates that the analysis of this series needs non-linear approaches to describe its behavior more precisely.
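
As one simple starting point that does not assume a straight-line trend, a centered 12-month rolling mean can expose slow movements hidden by the month-to-month dispersion. This is only an illustrative option, not necessarily the non-linear approach intended here. The placeholder series below combines a slow oscillation with noise.

```python
import numpy as np
import pandas as pd

# Placeholder series: a slow oscillation plus noise, 1,800 monthly values.
rng = np.random.default_rng(8)
months = np.arange(1800)
signal = 0.5 * np.sin(2 * np.pi * months / 48)
series = pd.Series(signal + rng.normal(0.0, 0.8, size=1800), name="anomaly")

# Centered 12-month rolling mean; the edges lack a full window and stay NaN.
smooth = series.rolling(window=12, center=True).mean()

print(f"raw std: {series.std():.3f}, smoothed std: {smooth.std():.3f}")
print(f"non-missing smoothed values: {int(smooth.notna().sum())}")
```

The smoothed curve follows no fixed equation, so it can track changes in level that a straight line would miss.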

08. A paragraph that summarizes the quality of this data set and a request for additional data if needed

The INPE data set is of great quality: no data was missing, and no inconsistencies or outliers far from the range of values in the data set were observed. It also has an approximately normal distribution, and it showed that different statistical analyses could be carried out using this data set.
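
The quality claims above can be checked in a few lines. The sketch below counts missing values and confirms the expected 150 x 12 shape on placeholder data. The helper `quality_report` is hypothetical and the values are not the INPE readings.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Basic completeness checks for a years-by-months table."""
    return {
        "shape": df.shape,
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_years": int(df.index.duplicated().sum()),
        "all_numeric": all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes),
    }


# Placeholder data standing in for the real monthly readings.
rng = np.random.default_rng(9)
df = pd.DataFrame(rng.normal(0.0, 0.8, size=(150, 12)),
                  index=pd.RangeIndex(1870, 2020, name="year"))

report = quality_report(df)
print(report)
```

A report with zero missing values, zero duplicate years and the expected shape would match the description given here.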
