Research Methodology
Course Code: CSE-RM, Fall 1200
SECTION 1: (F) 10AM-12PM
Presented by
Dr. Rubaiyat Islam
Associate Professor, WUB
Adjunct Faculty, IUB.
Omdena Bangladesh Chapter Lead
Crypto-economist Consultant
Sifchain Finance, USA.
COURSE CONTENTS
1. Understanding the nature of Problem, Idea generation.
2. Reviewing Literature.
3. Proposing/Developing/Experimenting/Collecting Data
4. Analyzing
5. Drawing conclusions
6. Making generalization of the findings/results
7. Paper/Journal/Report Writing
8. Plagiarism, Selecting Journals.
9. Major/Minor Reviews
AFTER LITERATURE REVIEW ….
FORMULATE THE HYPOTHESIS
§ Tentative assumption.
§ A statement of expectation or prediction that will be
tested by research.
§ Guides the researcher by delimiting the area of research
and keeping the work on the right track.
§ Determines which tests must be conducted in the analysis
of the data and how, and indirectly the quality of data
required for that analysis.
LEARNING OBJECTIVES OF THIS PRESENTATION
§ Six major methods of data collection
§ How to choose the right data collection method
§ Strengths and weaknesses of different data collection
methods
§ Using the strongest suitable method in our research
DATA COLLECTION
§ Source of Data:
Statistical data may be obtained from two sources, namely primary
and secondary.
1 Primary data:
Data measured or collected by the investigator or the user directly from
the source. Primary sources are sources that can supply first-hand
information for immediate use.
2 Secondary data:
Data that have already been collected by others and are reused by the
investigator. Such data are gathered or compiled from published and
unpublished sources.
COMMON COLLECTION METHODS
§ Tests – Participants fill out an instrument that measures their
ability or degree of skill.
§ Questionnaires – Participants fill out self-report instruments.
§ Interviews – Researchers talk to participants in person or
over the telephone.
§ Focus groups – Researchers talk to participants in a small-group setting.
§ Observations – Researchers observe participants in natural or structured environments.
§ Constructed, secondary and existing data – Researchers use data
collected at an earlier time.
SAMPLING DESIGNS
Methods by which a representative sample can be chosen from a population.
Four sampling designs in common use:
1. Simple random sampling
2. Systematic sampling
3. Stratified sampling
4. Cluster sampling
SAMPLING DESIGNS
Simple Random Sampling
Writing every student's name on a slip,
mixing the slips thoroughly, and then
drawing names one at a time is an
example of simple random sampling:
every unit has an equal chance of being selected.
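The name-drawing example can be sketched in Python as follows. This is a minimal illustration only: the roster, the function name simple_random_sample, and the seed are assumptions made for the example and are not part of the course material.

```python
import random

# Hypothetical class roster; the names are made up for illustration.
students = ["Ayesha", "Bashir", "Chandni", "Dipu", "Esha", "Farhan", "Gita", "Hasan"]

def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct units, each with an equal chance of selection."""
    rng = random.Random(seed)  # a seed only makes the draw repeatable
    return rng.sample(population, sample_size)  # sampling without replacement

sample = simple_random_sample(students, 3, seed=42)
print(sample)
```

random.sample draws without replacement, which matches drawing slips from a container without putting them back.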
SAMPLING DESIGNS
Systematic Sampling: In this sampling
design, every k-th unit (or item) is
selected from a population until the
sample size is reached.
k = (size of population) / (size of sample)
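A short sketch of systematic sampling, assuming the interval k is the population size divided by the sample size (rounded down). The random starting point is a common convention that this sketch assumes; the slide itself does not specify where to start. The function name systematic_sample is hypothetical.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Select every k-th unit after a random start, with k = N // n."""
    k = len(population) // sample_size  # sampling interval
    if k < 1:
        raise ValueError("sample_size cannot exceed the population size")
    start = random.Random(seed).randrange(k)  # assumed random start in [0, k)
    return [population[start + i * k] for i in range(sample_size)]

units = list(range(1, 101))  # a population of 100 numbered units
picked = systematic_sample(units, 10, seed=1)
print(picked)
```

With 100 units and a sample of 10, k is 10, so the selected units are always exactly 10 positions apart.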
SAMPLING DESIGNS
Stratified Sampling
In this sampling design, the entire population is
divided into several groups, called
strata, and a subsample is selected
from each group. All subsamples are
then combined to form a sample. This
sampling design is used when a
population is not homogeneous.
SAMPLING DESIGNS
Stratified sampling could be either
proportionate or disproportionate,
depending on the number of units
selected from each group.
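Proportionate stratified sampling can be sketched as below: the same fraction is drawn from every stratum, so each subsample is proportional to the size of its group. The population (60 "UG" and 40 "PG" units), the function name, and the rule of taking at least one unit per stratum are assumptions made for illustration.

```python
import random
from collections import defaultdict

def proportionate_stratified_sample(units, stratum_of, fraction, seed=None):
    """Draw the same fraction from every stratum and combine the subsamples."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)  # divide the population into strata
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))  # assumed: at least one unit per stratum
        sample.extend(rng.sample(members, n))       # simple random subsample
    return sample

# Hypothetical population: 60 undergraduate and 40 postgraduate students.
population = [("UG", i) for i in range(60)] + [("PG", i) for i in range(40)]
sample = proportionate_stratified_sample(population, lambda u: u[0], 0.1, seed=0)
print(sample)
```

A disproportionate design would instead pass a different fraction (or a fixed count) for each stratum.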
SAMPLING DESIGNS
Cluster Sampling
This sampling design involves selecting
at random a few groups, called clusters,
from a population, and then selecting
units from each cluster. Cluster sampling
is used when a population is large,
fairly homogeneous and scattered
over a large geographical area.
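A minimal sketch of cluster sampling: a few clusters are chosen at random, and units are then drawn from each chosen cluster. The district names, the unit labels, and the function name cluster_sample are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, units_per_cluster, seed=None):
    """Pick clusters at random, then pick units at random inside each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # stage 1: select clusters
    return {
        name: rng.sample(clusters[name], min(units_per_cluster, len(clusters[name])))
        for name in chosen                             # stage 2: select units
    }

# Hypothetical clusters: households grouped by district.
districts = {
    "Dhaka": [f"Dhaka-{i}" for i in range(50)],
    "Khulna": [f"Khulna-{i}" for i in range(30)],
    "Sylhet": [f"Sylhet-{i}" for i in range(40)],
    "Rajshahi": [f"Rajshahi-{i}" for i in range(20)],
}
picked = cluster_sample(districts, 2, 5, seed=7)
print(picked)
```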
DATA ORGANIZATION/PRE-PROCESSING
The process of selecting a sample from
a population amounts to data
collection. Once the data has been
collected, it must be organized to make
it meaningful. Unorganized data does
not convey any meaningful information.
Raw Data
A set of unorganized data
Data pre-processing for Exploratory
Data Analysis:
1. Handling missing values
2. Creating a frequency distribution
table
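The two pre-processing steps can be sketched with pandas. The grades below are made-up raw data with one missing entry; the variable names are assumptions for the example.

```python
import pandas as pd

# Hypothetical raw data: exam grades, with one missing entry (None).
raw = pd.Series(["B", "A", "C", "B", None, "A", "B", "C", "B", "A"], name="grade")

# Step 1: count the missing values.
n_missing = int(raw.isna().sum())

# Step 2: build a frequency distribution table (value_counts skips missing values).
freq = raw.value_counts().sort_index()
table = pd.DataFrame({
    "frequency": freq,
    "relative_frequency": freq / freq.sum(),
})
print(n_missing)
print(table)
```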
WHAT IS EXPLORATORY DATA ANALYSIS
EDA is an approach to data analysis that uses a variety
of techniques to gain insights about the data.
Basic steps in any exploratory data analysis:
• Cleaning and preprocessing
• Statistical analysis
• Visualization for trend analysis, anomaly detection, outlier
detection (and removal).
IMPORTANCE OF EDA
Improve understanding of variables by extracting
summary values such as the mean, minimum, and maximum.
Discover errors, outliers, and missing values in the
data.
Identify patterns by visualizing data in graphs such as
bar graphs, scatter plots, heatmaps and histograms.
EDA USING PANDAS
Import data into the workspace (Jupyter Notebook, Google Colab, Python IDE)
Descriptive statistics
Removal of nulls
Visualization
1. PACKAGES AND DATA IMPORT
• Step 1: Import pandas into the workspace.
• "import pandas"
• Step 2: Read the data/dataset into a Pandas dataframe. Different
input formats include:
• Excel: read_excel
• CSV: read_csv
• JSON: read_json
• HTML and many more
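A minimal sketch of the two steps. To keep the example self-contained, the CSV content is an in-memory string standing in for a file on disk; the column names and values are made up. In practice a file path would be passed to read_csv instead.

```python
import io
import pandas as pd  # "pd" is the conventional alias

# Made-up CSV text standing in for a real file.
csv_text = """student,hours,score
S1,2,55
S2,4,70
S3,6,88
"""

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())   # first rows of the dataframe
print(df.shape)    # (number of rows, number of columns)
```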
2. DESCRIPTIVE STATS (PANDAS)
• Used to make preliminary assessments about the population distribution of
the variable.
• Commonly used statistics:
1. Central tendency:
• Mean – The average value of all the data points: dataframe.mean()
• Median – The middle value when all the data points are put in an
ordered list: dataframe.median()
• Mode – The data point which occurs the most in the dataset:
dataframe.mode()
2. Spread: It is the measure of how far the data points are away from the
mean or median.
• Variance – The mean of the squares of the individual
deviations from the mean: dataframe.var() (pandas divides by n - 1 by default)
• Standard deviation – The square root of the
variance: dataframe.std()
3. Skewness: It is a measure of asymmetry: dataframe.skew()
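The statistics above can be tried on a small made-up column. Note that pandas computes the sample variance (dividing by n - 1) unless ddof=0 is passed; the data and variable names here are assumptions for illustration.

```python
import pandas as pd

# Made-up scores for illustration.
col = pd.Series([2, 4, 4, 4, 5, 5, 7, 9], name="score")

mean = col.mean()            # average of all data points
median = col.median()        # middle value of the ordered data
mode = col.mode().tolist()   # most frequent value(s); there can be more than one

sample_var = col.var()            # default ddof=1: divides by n - 1
population_var = col.var(ddof=0)  # divides by n
population_std = col.std(ddof=0)  # square root of the population variance

skewness = col.skew()        # positive here: the right tail is longer

print(mean, median, mode, sample_var, population_var, population_std, skewness)
```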
DESCRIPTIVE STATS (CONTD.)
Other methods to get a quick look at the data:
• describe(): Summarizes the central tendency,
dispersion and shape of a dataset's distribution,
excluding NaN values.
• Syntax: pandas.DataFrame.describe()
• info(): Prints a concise summary of the
dataframe, including the index dtype
and columns, non-null values and memory
usage.
• Syntax: pandas.DataFrame.info()
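A brief sketch of both methods on a made-up dataframe with one missing value. Capturing the output of info() in a string buffer is only done here so the text can be inspected; calling df.info() directly prints it.

```python
import io
import pandas as pd

# Made-up data; "hours" contains one missing value.
df = pd.DataFrame({
    "hours": [2.0, 4.0, None, 8.0],
    "score": [55, 70, 65, 90],
})

summary = df.describe()  # count, mean, std, min, quartiles, max per numeric column
print(summary)

buffer = io.StringIO()
df.info(buf=buffer)      # info() writes text; buf redirects it into the buffer
info_text = buffer.getvalue()
print(info_text)
```

The count row of describe() shows 3 for "hours" because the NaN entry is excluded.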
3. NULL VALUES
Detecting null values:
• isnull(): It is an alias for dataframe.isna().
This function returns a dataframe of boolean
values indicating missing values.
• Syntax: dataframe.isnull()
Handling null values:
• Dropping the rows with null values: the dropna()
function is used to delete rows or columns with null
values.
• Replacing missing values: the fillna() function can fill the
missing values with a special value such as the
mean or median.
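Detection and both handling options can be sketched on a small made-up dataframe. Filling with the column median is one possible choice; the mean would be used the same way.

```python
import pandas as pd

# Made-up data with one missing value in each column.
df = pd.DataFrame({
    "hours": [2.0, None, 6.0, 8.0],
    "score": [55.0, 70.0, None, 90.0],
})

mask = df.isnull()                # boolean dataframe: True where a value is missing
missing_per_column = mask.sum()   # number of missing values in each column

dropped = df.dropna()             # option 1: drop every row that contains a null
filled = df.fillna(df.median())   # option 2: fill each column with its own median

print(missing_per_column)
print(dropped)
print(filled)
```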
4. VISUALIZATION
• Univariate: Looking at one variable/column at a time
• Bar-graph
• Histograms
• Boxplot
• Multivariate: Looking at the relationship between two or more
variables
• Scatter plots
• Pie plots
• Heatmaps (seaborn)
BAR-GRAPH, HISTOGRAM AND BOXPLOT
• Bar graph: A bar plot is a plot that presents
data with rectangular bars with lengths
proportional to the values that they
represent.
• Boxplot : Depicts numerical data graphically
through their quartiles. The box extends
from the Q1 to Q3 quartile values of the
data, with a line at the median (Q2).
• Histogram: A histogram is a representation of
the distribution of data.
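The numbers behind a boxplot and a histogram can be computed directly, which helps when reading the plots. This sketch uses made-up values; the choice of three equal-width bins is an assumption. The plots themselves are drawn with dataframe.plot.box() and dataframe.plot.hist().

```python
import pandas as pd

# Made-up values for illustration.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Boxplot: the box spans Q1 to Q3, with a line at the median (Q2).
q1, q2, q3 = values.quantile([0.25, 0.5, 0.75])

# Histogram: count how many values fall into each equal-width bin.
bin_counts = pd.cut(values, bins=3).value_counts().sort_index()

print(q1, q2, q3)
print(bin_counts)
```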
SCATTERPLOT, PIEPLOT
• Scatterplot: Shows the data as a collection of points.
• Syntax: dataframe.plot.scatter(x='x_column_name', y='y_column_name')
• Pie plot: Proportional representation of the numerical data in a column.
• Syntax: dataframe.plot.pie(y='column_name')
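Both calls can be tried on a small made-up dataframe. This sketch assumes matplotlib is installed, since pandas uses it for plotting; the "Agg" backend is selected only so the code also runs without a display. In a notebook the figures appear inline without that line.

```python
import matplotlib
matplotlib.use("Agg")  # assumed non-interactive backend; not needed in a notebook
import pandas as pd

# Made-up data: study hours and exam scores for five students.
df = pd.DataFrame(
    {"hours": [1, 2, 3, 4, 5], "score": [52, 60, 71, 80, 88]},
    index=["S1", "S2", "S3", "S4", "S5"],
)

# Scatter plot: one point per row, hours on the x-axis and score on the y-axis.
scatter_ax = df.plot.scatter(x="hours", y="score")

# Pie plot: each slice is proportional to a row's value in the "score" column.
pie_ax = df.plot.pie(y="score")
```

Both calls return a matplotlib Axes object, which can be used to add a title or save the figure.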
OUTLIER DETECTION
• An outlier is a point or set of data points that lies far away from the rest of the data
values in the dataset.
• Outliers are easily identified by visualizing the data.
• For example:
• In a boxplot, the data points which lie outside the upper and lower bound can be
considered as outliers
• In a scatterplot, the data points which lie outside the groups of datapoints can be
considered as outliers
OUTLIER REMOVAL
• Calculate the IQR and the outlier bounds as follows:
Ø Calculate the first and third quartiles (Q1 and Q3).
Ø Calculate the interquartile range, IQR = Q3 - Q1.
Ø Find the lower bound, which is Q1 - 1.5*IQR.
Ø Find the upper bound, which is Q3 + 1.5*IQR.
Ø Replace the data points which lie outside this range.
Ø They can be replaced by the mean or median.
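The procedure can be sketched with pandas, assuming the usual fences of Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. The data are made up, with one obvious outlier (95), and replacing outliers with the median is one of the two options listed above.

```python
import pandas as pd

# Made-up measurements; 95 is far away from the rest.
data = pd.Series([10, 12, 11, 13, 12, 11, 95, 12, 10, 13])

q1 = data.quantile(0.25)
q3 = data.quantile(0.75)
iqr = q3 - q1                    # interquartile range

lower = q1 - 1.5 * iqr           # lower bound
upper = q3 + 1.5 * iqr           # upper bound

is_outlier = (data < lower) | (data > upper)

# Keep values inside the bounds; replace the others with the median.
cleaned = data.where(~is_outlier, data.median())

print(lower, upper)
print(cleaned.tolist())
```

The median is used here because, unlike the mean, it is barely affected by the outlier being replaced.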
REFERENCES
• More information on EDA tools and Pandas can be found
at the links below:
• https://pandas.pydata.org/docs/user_guide/index.html
• https://pandas.pydata.org/docs/user_guide/missing_data.html
• https://pandas.pydata.org/docs/user_guide/visualization.html
PRACTICAL DEMONSTRATION
• Creating research findings questionnaires for EDA
• Data storytelling and EDA