Exploratory Data Analysis for Machine Learning
01. Brief description of the data set and a summary of its
attributes
The dataset comes from Brazil’s INPE and contains climate data. This one specifically is related to the
ENSO phenomenon.
The dataset contains 1800 monthly temperature fluctuation values, from 1870 to 2019.
The data series contains 150 rows, corresponding to the years, and 12 columns, one for each month.
02. Initial plan for data exploration
The initial plan for data exploration was to determine the mean, median, quantiles and range (max and
min) for each of the monthly data series.
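The per-month statistics listed above could be computed along the following lines. This is a minimal sketch, not the code used for the report: the table is a random stand-in with the same 150 x 12 shape, and the month labels and column names (`q25`, `q75`, `range`) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real table: 150 years x 12 months.
# The values are random and are NOT the INPE data.
rng = np.random.default_rng(0)
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
df = pd.DataFrame(rng.normal(0.0, 0.8, size=(150, 12)),
                  index=pd.RangeIndex(1870, 2020, name="year"),
                  columns=months)

# One row per month: mean, median, quartiles, min and max.
summary = pd.DataFrame({
    "mean": df.mean(),
    "median": df.median(),
    "q25": df.quantile(0.25),
    "q75": df.quantile(0.75),
    "min": df.min(),
    "max": df.max(),
})
summary["range"] = summary["max"] - summary["min"]
print(summary.round(2))
```

`df.describe()` gives a similar overview in a single call; the explicit version above just makes each statistic visible.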
03. Actions taken for data cleaning and feature engineering
Regarding data cleaning, I made a scatter plot of the dataset with Matplotlib. This was the first
cleaning step, since it makes outliers visible graphically. Some points did not follow the trend of
the rest of the data, so, once identified visually, they could be cataloged as anomalies.
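A visual check like this can be complemented with a numeric rule. The sketch below uses Tukey's 1.5 x IQR rule, which is a swapped-in technique and not the method used in this report. The data are random stand-ins with two extreme values planted on purpose.

```python
import numpy as np
import pandas as pd

# Synthetic anomalies (random, NOT the INPE data) with two planted extremes.
rng = np.random.default_rng(1)
values = pd.Series(rng.normal(0.0, 0.8, size=1800))
values.iloc[[10, 500]] = [6.0, -6.0]

# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

print(f"bounds: [{lower:.2f}, {upper:.2f}]")
print(f"flagged: {int(is_outlier.sum())} of {len(values)} points")
```

Points flagged this way are candidates for inspection, not automatic deletions; in a climate series an extreme value may be a real event.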
After that, I made a histogram for each of the 12 features (one per month), creating a single plot with
the histograms for each feature overlaid. This was done using Pandas plotting functionality.
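An overlaid histogram of this kind might look like the sketch below. The data are random stand-ins, and the bin count, transparency and axis labels are assumptions rather than the settings used for the report.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs headless
import numpy as np
import pandas as pd

# Synthetic stand-in for the 150 x 12 table (random, NOT the INPE data).
rng = np.random.default_rng(2)
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
df = pd.DataFrame(rng.normal(0.0, 0.8, size=(150, 12)), columns=months)

# DataFrame.plot.hist draws every column on the same axes with shared bins.
ax = df.plot.hist(bins=20, alpha=0.4, figsize=(9, 5))
ax.set_xlabel("temperature anomaly (ºC)")
ax.set_ylabel("number of years")
# ax.figure.savefig("monthly_histograms.png", dpi=120)  # uncomment to save
```

With twelve overlaid series the plot gets crowded; `df.hist()` draws one small panel per column instead, which may be easier to read.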
For feature engineering, I will set out to improve on a baseline set of features by deriving new features
from our existing data. This can make the difference between a weak model and a strong one. I
plan to use visual exploration, intuition and domain understanding to construct new
features that improve the forecasting capabilities of a future model.
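As an illustration of what derived features could look like for a monthly series, the sketch below builds lagged values and a rolling mean. The specific features (`lag_1`, `lag_12`, `roll_3`) are hypothetical examples, not features chosen in this report, and the data are random stand-ins.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 150 x 12 table (random, NOT the INPE data).
rng = np.random.default_rng(6)
wide = pd.DataFrame(rng.normal(0.0, 0.8, size=(150, 12)),
                    index=range(1870, 2020), columns=range(1, 13))

# Wide (year x month) -> one chronological monthly series.
values = wide.to_numpy().ravel()  # row-major order = Jan 1870, Feb 1870, ...
index = pd.period_range("1870-01", periods=values.size, freq="M")
series = pd.Series(values, index=index, name="anomaly")

# Hypothetical derived features for a forecasting model.
features = pd.DataFrame({
    "anomaly": series,
    "lag_1": series.shift(1),             # previous month
    "lag_12": series.shift(12),           # same month, previous year
    "roll_3": series.rolling(3).mean(),   # 3-month running mean
})
print(features.tail())
```

The first rows contain missing values because no earlier months exist to fill the lags; they would normally be dropped before fitting a model.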
04. Key Findings and Insights, which synthesizes the results
of Exploratory Data Analysis in an insightful and actionable
manner
After the analysis, I considered what these plots tell us about the distribution of the target.
The finding was that the distribution of each of the data sets follows an almost normal
distribution, with some exceptions (as expected).
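The impression of near-normality from the histograms could be checked numerically. The sketch below uses sample skewness and the Shapiro-Wilk test from SciPy; choosing these two measures is an assumption, since the report relies on the plots alone, and the data are random stand-ins.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the 150 x 12 table (random, NOT the INPE data).
rng = np.random.default_rng(3)
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
df = pd.DataFrame(rng.normal(0.0, 0.8, size=(150, 12)), columns=months)

rows = []
for month in df.columns:
    w_stat, p_value = stats.shapiro(df[month])  # H0: the sample is normal
    rows.append({"month": month,
                 "skew": stats.skew(df[month]),
                 "W": w_stat,
                 "p_value": p_value})
normality = pd.DataFrame(rows).set_index("month")
print(normality.round(3))
```

A small p-value for a given month would point to one of the "exceptions" mentioned above; a large one only means normality cannot be ruled out.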
Looking at the scatter plots of the data series, a linear regression does not appear feasible,
whether by least squares or by any other method that would describe the behavior of the data
with a linear, polynomial or other equation. The dispersion of the
data is such that other, more complex statistical analyses would be required.
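One way to put a number on how poorly a straight line describes widely scattered data is the coefficient of determination of a least-squares fit. The sketch below uses random data with no built-in trend, so its output says nothing about the INPE series itself.

```python
import numpy as np

# Synthetic series (random, NOT the INPE data) with no built-in trend.
rng = np.random.default_rng(5)
t = np.arange(1800)                    # month counter, 0 = first month
y = rng.normal(0.0, 0.8, size=t.size)

# Least-squares straight line and its coefficient of determination.
slope, intercept = np.polyfit(t, y, deg=1)
fitted = slope * t + intercept
ss_res = np.sum((y - fitted) ** 2)     # variation left unexplained
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation
r2 = 1.0 - ss_res / ss_tot
print(f"slope = {slope:.2e} per month, R^2 = {r2:.4f}")
```

An R^2 close to zero means the line explains almost none of the variation, which is the numeric counterpart of the dispersion seen in a scatter plot.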
Also, each data set is not different in nature from the others, as they all reflect the behaviors of
a natural variable. Despite the possible sources of error in the measurement of these data, it is
not possible to say that these data are influenced by each other. The environmental conditions
vary according to the time of year, but this is an external variable.
05. Formulating at least 3 hypotheses about this data
As requested, 3 hypotheses were formulated, as follows:
1) The ENSO phenomenon will have an impact on sea level rise if the anomalies mostly oscillate
between -0.5 and 1.5 ºC. Has this been true in the time interval between 1870 and 2020?
2) The temperature anomalies, on average, are higher than -0.10. The decision is based on a
check of a random sample of this data.
3) Can we say that there has not been a period of twelve months in a row in which these
temperature anomalies have fallen below 0.5 ºC?
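A question about consecutive months, as in hypothesis 3, comes down to finding the longest streak of values that meet a condition. The sketch below reads the hypothesis as "twelve months in a row below 0.5 ºC", which is an assumption about its intent; flip the comparison if the opposite direction was meant. The helper `longest_run` and the toy series are hypothetical.

```python
import numpy as np

def longest_run(mask):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in mask:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best

# Hand-made toy series (NOT the INPE data), in chronological order.
anomalies = np.array([0.7, 0.2, 0.1, -0.3, 0.6, 0.4, 0.3, 0.2, 0.1, 0.9])
threshold = 0.5  # ºC, the value quoted in hypothesis 3

run = longest_run(anomalies < threshold)
print(run)        # longest streak of months below the threshold
print(run >= 12)  # True would mean such a twelve-month period exists
```

For a table with one row per year and one column per month, flattening it row by row (for example with `df.to_numpy().ravel()`) gives the chronological series this helper expects.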
06. Conducting a formal significance test for one of the
hypotheses and discuss the results
Hypothesis 2 was chosen for the formal significance test.
On average, the temperature anomalies are higher than -0.10. The decision is based on a check of
a random sample of this data.
Population: X = 'Water temperature anomalies on sector 3.4 of the Pacific Ocean'
H0: μ = -0.1 (null hypothesis)
Ha: μ > -0.1 (alternative hypothesis)
The approach is to use the one-sample t-test, which determines whether the sample mean is
statistically different from a known or hypothesized population mean. The one-sample t-test is a
parametric test.
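For reference, the statistic behind the one-sample t-test can be written as follows. The symbols are assumptions introduced here, not notation from the report: n is the sample size, x̄ the sample mean, s the sample standard deviation, and μ0 the hypothesized mean.

```latex
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad \mu_0 = -0.1, \qquad \mathrm{df} = n - 1
```

Under the null hypothesis, t follows a Student t distribution with n - 1 degrees of freedom, and for the one-sided alternative μ > μ0 the p-value is the probability of a value at least as large as the observed t.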
After the test was performed, the p-value turned out to be much higher than the standard
significance level, so we cannot reject the null hypothesis that the mean anomaly equals -0.1.
There is almost no evidence in support of the alternative hypothesis
that, on average, the temperature anomalies are higher than -0.1.
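The test and its decision rule can be sketched with SciPy as shown below. This is an illustration under stated assumptions, not the report's actual computation: the population is random, the sample size of 100 and the significance level of 0.05 are assumed, and the `alternative` keyword requires SciPy 1.6 or newer.

```python
import numpy as np
from scipy import stats

# Synthetic population (random, NOT the INPE data): 1800 monthly anomalies.
rng = np.random.default_rng(4)
population = rng.normal(-0.1, 0.8, size=1800)

# Random sample, as described in the hypothesis; the size 100 is an assumption.
sample = rng.choice(population, size=100, replace=False)

mu0 = -0.1
alpha = 0.05  # conventional significance level (assumed)

# One-sided test of H0: mu = mu0 against Ha: mu > mu0.
result = stats.ttest_1samp(sample, popmean=mu0, alternative="greater")
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")

if result.pvalue < alpha:
    print("Reject H0: the sample supports a mean above", mu0)
else:
    print("Fail to reject H0: no evidence that the mean exceeds", mu0)
```

Note that a large p-value never proves the null hypothesis; it only means the sample does not contradict it.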
07. Suggestions for next steps in analyzing this data
Even though the data series presents an almost normal distribution, the scatter plot shows that
it does not follow a linear trend, nor can it be described by a reasonably simple equation,
because of the wide dispersion of the data, as reflected in the standard deviations. This
indicates that the analysis of this series needs non-linear approaches to describe its
behavior more precisely.
08. A paragraph that summarizes the quality of this data set
and a request for additional data if needed
The INPE data set is of great quality: no data was missing, and no inconsistencies or outliers were
observed far from the range of values in the data set. It also has an approximately normal
distribution, and it showed that different statistical analyses could be carried out with it.