Research Methodology Class 4

The document discusses research methodology and data collection methods. It covers topics like formulating hypotheses, data collection techniques, sampling designs, exploratory data analysis and dealing with outliers. Common data collection methods like questionnaires, interviews and observations are described along with sampling techniques like simple random sampling and stratified sampling.

Uploaded by

Moriwam

Research Methodology

Course Code: CSE-RM, Fall 1200

SECTION 1: (F) 10AM-12PM

Presented by
Dr. Rubaiyat Islam
Associate Professor, WUB
Adjunct Faculty, IUB.
Omdena Bangladesh Chapter Lead
Crypto-economist Consultant
Sifchain Finance, USA.
COURSE CONTENTS

1. Understanding the nature of Problem, Idea generation.
2. Reviewing Literature.
3. Proposing/Developing/Experimenting/Collecting Data
4. Analyzing
5. Drawing conclusions
6. Making generalization of the findings/results
7. Paper/Journal/Report Writing
8. Plagiarism, Selecting Journals.
9. Major/Minor Reviews
AFTER LITERATURE REVIEW ….
FORMULATE THE HYPOTHESIS

§ A tentative assumption.
§ A statement of expectation or prediction that will be tested by research.
§ Guides the researcher by delimiting the area of research and keeping the work on the right track.
§ Determines which tests must be conducted in the analysis of data and how, and so indirectly the quality of data required for the analysis.
LEARNING OBJECTIVES OF THIS
PRESENTATION

§ Six major methods of data collection
§ Choosing the right data collection method effectively
§ Strengths and weaknesses of different data collection methods
§ Using the strongest method in our research
DATA COLLECTION

§ Sources of data:
Statistical data may be obtained from two sources, namely primary and secondary.
1 Primary data:
Data measured or collected by the investigator or the user directly from the source. Primary sources are sources that can supply first-hand information for immediate use.
2 Secondary data:
When an investigator uses data that have already been collected by others, such data are called secondary data. They are gathered or compiled from published and unpublished sources.
COMMON COLLECTION METHODS

§ Tests - Participants fill out an instrument that measures their ability or degree of skill.
§ Questionnaires - Participants fill out self-report instruments.
§ Interviews - Researchers talk to participants in person or over the telephone.
§ Focus groups - Researchers talk to participants in a small-group setting.
§ Observations - Researchers observe participants in natural or structured environments.
§ Constructed, secondary and existing data - Researchers use data collected at an earlier time.
SAMPLING DESIGNS

Methods by which a representative sample can be chosen from a population.

Four sampling designs in common use:

1. Simple random sampling
2. Systematic sampling
3. Stratified sampling
4. Cluster sampling
SAMPLING DESIGNS

Simple Random Sampling
Every unit in the population has the same chance of being selected. For example, putting all students' names together, mixing them thoroughly and then drawing the names one at a time is a simple random sampling.
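The name-drawing example can be sketched in Python with the standard library. This is only an illustration: the student names and the fixed seed are assumptions made up for the example, not part of the course material.

```python
import random

# Hypothetical class list; the names are made up for illustration.
students = ["Asha", "Bilal", "Chen", "Dipa", "Emon", "Farah", "Gita", "Hasan"]

rng = random.Random(42)  # fixed seed so the draw is repeatable

# Draw 3 names without replacement; every student is equally likely to be chosen.
sample = rng.sample(students, k=3)
print(sample)
```

Dropping the seed gives a different draw on every run, which is closer to mixing the names by hand.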
SAMPLING DESIGNS

Systematic Sampling: In this sampling design, every k-th unit (or item) is selected from a population until the sample size is reached.
k = (size of population) / (size of sample)
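A minimal sketch of systematic sampling follows, assuming a hypothetical population of 100 numbered units and a sample size of 10. Picking a random starting position inside the first interval is a common convention and is an assumption here; the slide does not specify how the first unit is chosen.

```python
import random

population = list(range(1, 101))    # hypothetical units numbered 1..100
sample_size = 10

k = len(population) // sample_size  # sampling interval: 100 // 10 = 10

rng = random.Random(0)
start = rng.randrange(k)            # random starting position in the first interval

# Take every k-th unit from the random start until the sample size is reached.
sample = population[start::k][:sample_size]
print(k, sample)
```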
SAMPLING DESIGNS

Stratified Sampling
In this sampling, the entire population is divided into several groups, called strata, and a subsample is selected from each group. All subsamples are then combined to form a sample. This sampling design is used when a population is not homogeneous.
SAMPLING DESIGNS

Stratified sampling could be either proportionate or disproportionate, depending on the number of units selected from each group.
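Proportionate stratified sampling can be sketched as below. The two strata ("UG" and "PG"), their sizes and the seed are hypothetical values chosen only to make the allocation easy to follow.

```python
import random

# Hypothetical population: 60 undergraduate (UG) and 40 postgraduate (PG) students.
population = [("UG", i) for i in range(60)] + [("PG", i) for i in range(40)]
sample_size = 10
rng = random.Random(1)

# Divide the population into strata.
strata = {}
for group, unit in population:
    strata.setdefault(group, []).append((group, unit))

# Proportionate allocation: each stratum contributes in proportion to its size.
sample = []
for group, units in strata.items():
    n_group = round(sample_size * len(units) / len(population))
    sample.extend(rng.sample(units, n_group))

# Count how many units were drawn from each stratum.
counts = {g: sum(1 for s in sample if s[0] == g) for g in strata}
print(counts)
```

For a disproportionate design, `n_group` would instead be set per stratum by the researcher, for example the same number from every stratum regardless of its size.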
SAMPLING DESIGNS

Cluster Sampling
This sampling design involves selecting at random a few groups, called clusters, from a population, and then selecting units from each cluster. Cluster sampling is used when a population is large, fairly homogeneous and scattered over a large geographical area.
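The two stages of cluster sampling can be sketched as follows. The villages, households, cluster count and seed are all hypothetical and serve only as an illustration.

```python
import random

# Hypothetical population: 6 villages (clusters), each with 20 households.
clusters = {
    f"village_{c}": [f"v{c}_house_{h}" for h in range(20)] for c in range(6)
}
rng = random.Random(7)

# Stage 1: select a few clusters at random.
chosen = rng.sample(sorted(clusters), k=2)

# Stage 2: select units at random from each chosen cluster.
sample = []
for name in chosen:
    sample.extend(rng.sample(clusters[name], k=5))

print(chosen, len(sample))
```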
DATA ORGANIZATION/PRE-PROCESSING

The process of selecting a sample from a population amounts to data collection. Once the data has been collected, it must be organized to make it meaningful. Unorganized data does not convey any meaningful information.
Raw Data
A set of unorganized data.
Data Pre-processing for Exploratory Data Analysis:
1. Handling missing values
2. Creating a frequency distribution table
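The two pre-processing steps can be sketched with the standard library. The grades below are made-up raw data, and dropping the missing entry is just one possible way of handling it.

```python
from collections import Counter

# Hypothetical raw data: exam grades with one missing value (None).
raw = ["B", "A", "C", "B", None, "A", "B", "C", "B", "A"]

# 1. Handle missing values (here: simply drop them).
clean = [g for g in raw if g is not None]

# 2. Build a frequency distribution table: grade -> number of occurrences.
freq = Counter(clean)
for grade in sorted(freq):
    print(f"{grade}\t{freq[grade]}")
```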
WHAT IS EXPLORATORY DATA ANALYSIS

EDA is an approach to data analysis that uses a variety of techniques to gain insights about the data.

Basic steps in any exploratory data analysis:
• Cleaning and preprocessing
• Statistical analysis
• Visualization for trend analysis, anomaly detection, outlier detection (and removal).
IMPORTANCE OF EDA

Improve understanding of variables by extracting the mean, minimum, and maximum values, etc.

Discover errors, outliers, and missing values in the data.

Identify patterns by visualizing data in graphs such as bar graphs, scatter plots, heatmaps and histograms.

EDA USING PANDAS

Import data into the workspace (Jupyter Notebook, Google Colab, Python IDE)

Descriptive statistics

Removal of nulls

Visualization

1. PACKAGES AND DATA IMPORT

• Step 1: Import pandas into the workspace.
• import pandas
• Step 2: Read the data/dataset into a pandas dataframe. Different input formats include:
• Excel: read_excel
• CSV: read_csv
• JSON: read_json
• HTML and many more
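The two steps might look like the sketch below. The CSV content and column names are hypothetical, and an in-memory string is used in place of a file so the example runs on its own. With a real file you would pass its path to `read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path such as "data.csv".
csv_text = "name,age,score\nAsha,21,88\nBilal,23,75\nChen,22,92\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```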

2. DESCRIPTIVE STATS (PANDAS)

• Used to make preliminary assessments about the population distribution of the variable.

• Commonly used statistics:

1. Central tendency:
• Mean – The average value of all the data points: dataframe.mean()
• Median – The middle value when all the data points are put in an ordered list: dataframe.median()
• Mode – The data point which occurs the most in the dataset: dataframe.mode()

2. Spread: It is the measure of how far the data points are from the mean or median.
• Variance - The variance is the mean of the squares of the individual deviations: dataframe.var(). Note that pandas computes the sample variance by default (it divides by n - 1, not n).
• Standard deviation - The standard deviation is the square root of the variance: dataframe.std()

3. Skewness: It is a measure of asymmetry: dataframe.skew()
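A short sketch of these statistics on a single column is shown here. The scores are hypothetical, chosen so the mean and median are easy to verify by hand.

```python
import pandas as pd

# Hypothetical scores, chosen so the results are easy to check by hand.
df = pd.DataFrame({"score": [2, 4, 4, 4, 5, 5, 7, 9]})

mean = df["score"].mean()      # central tendency: mean
median = df["score"].median()  # central tendency: median
mode = df["score"].mode()      # mode, returned as a Series because ties are possible
var = df["score"].var()        # sample variance (divides by n - 1 by default)
std = df["score"].std()        # sample standard deviation
skew = df["score"].skew()      # positive when the right tail is longer

print(mean, median, mode.tolist(), var, std, skew)
```

Passing `ddof=0` to `var()` or `std()` gives the population versions, which divide by n.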
DESCRIPTIVE STATS (CONTD.)

Other methods to get a quick look at the data:
• describe(): Summarizes the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.
• Syntax: dataframe.describe()
• info(): Prints a concise summary of the dataframe, including the index dtype and columns, non-null values and memory usage.
• Syntax: dataframe.info()
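Both methods can be tried on a small dataframe. The columns and values below are hypothetical, with one missing value per column so that the NaN handling is visible.

```python
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame(
    {"age": [21, 23, None, 22], "city": ["Dhaka", "Khulna", "Dhaka", None]}
)

# describe() summarizes numeric columns by default and excludes NaN values.
summary = df.describe()
print(summary)

# info() prints the index dtype, columns, non-null counts and memory usage.
df.info()
```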
3. NULL VALUES

Detecting null values:
• isnull(): An alias for dataframe.isna(). This function returns a dataframe of boolean values indicating missing values.
• Syntax: dataframe.isnull()

Handling null values:
• Dropping the rows with null values: The dropna() function is used to delete rows or columns with null values.
• Replacing missing values: The fillna() function can fill the missing values with a special value like the mean or median.
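Detection and both handling options can be sketched on a small dataframe. The columns and values are hypothetical, and filling with the column mean is only one of the choices mentioned above.

```python
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame(
    {"age": [21.0, None, 23.0, 25.0], "score": [88.0, 75.0, None, 92.0]}
)

mask = df.isnull()                 # boolean dataframe: True where a value is missing
missing_per_column = mask.sum()    # number of missing values in each column

dropped = df.dropna()                           # drop rows containing any null value
filled = df.fillna(df.mean(numeric_only=True))  # replace nulls with each column's mean

print(missing_per_column)
print(dropped)
print(filled)
```

Dropping rows shrinks the dataset, while filling keeps every row but replaces the gap with an estimate.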
4. VISUALIZATION

• Univariate: Looking at one variable/column at a time
• Bar graph
• Histograms
• Boxplot
• Multivariate: Looking at the relationship between two or more variables
• Scatter plots
• Pie plots
• Heatmaps (seaborn)

BAR-GRAPH, HISTOGRAM AND
BOXPLOT

• Bar graph: A bar plot presents data with rectangular bars whose lengths are proportional to the values they represent.
• Boxplot: Depicts numerical data graphically through their quartiles. The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2).
• Histogram: A histogram represents the distribution of numerical data by grouping the values into bins and showing how many values fall in each bin.
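The three univariate plots can be produced through the pandas plotting interface, which relies on matplotlib. In this sketch the department counts and scores are hypothetical, and the `Agg` backend is selected only so the script runs without a display. In a Jupyter notebook that line is not needed.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display

import pandas as pd

# Hypothetical data for illustration.
enrolment = pd.DataFrame({"dept": ["CSE", "EEE", "BBA"], "students": [120, 80, 95]})
scores = pd.DataFrame({"score": [52, 58, 61, 65, 67, 70, 72, 78, 81, 95]})

# Bar graph: one bar per department, length proportional to the value.
ax_bar = enrolment.plot.bar(x="dept", y="students")

# Boxplot: box from Q1 to Q3 with a line at the median.
ax_box = scores.plot.box()

# Histogram: counts of scores falling into 5 bins.
ax_hist = scores.plot.hist(bins=5)

ax_hist.figure.savefig("histogram.png")  # hypothetical output file name
```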
SCATTERPLOT, PIEPLOT

• Scatterplot: Shows the data as a collection of points.
• Syntax: dataframe.plot.scatter(x='x_column_name', y='y_column_name')

• Pie plot: Proportional representation of the numerical data in a column.
• Syntax: dataframe.plot.pie(y='column_name')
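Both calls are shown below on small hypothetical dataframes. The column names, values and output file name are assumptions for the example, and the `Agg` backend is used only so the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display

import pandas as pd

# Hypothetical study data: hours studied against exam score.
study = pd.DataFrame({"hours": [1, 2, 3, 4, 5], "score": [52, 58, 65, 70, 78]})
ax_scatter = study.plot.scatter(x="hours", y="score")

# Hypothetical counts per department; the index supplies the wedge labels.
shares = pd.DataFrame({"count": [50, 30, 20]}, index=["CSE", "EEE", "BBA"])
ax_pie = shares.plot.pie(y="count")

ax_scatter.figure.savefig("scatter.png")  # hypothetical output file name
```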
OUTLIER DETECTION

• An outlier is a data point, or a set of data points, that lies far away from the rest of the values in the dataset.
• Outliers are often easy to identify by visualizing the data.
• For example:
• In a boxplot, the data points which lie outside the upper and lower bounds can be considered outliers.
• In a scatterplot, the data points which lie outside the groups of data points can be considered outliers.
OUTLIER REMOVAL

• Calculate the IQR and the bounds as follows:
Ø Calculate the first and third quartiles (Q1 and Q3).
Ø Calculate the interquartile range, IQR = Q3 - Q1.
Ø Find the lower bound, which is Q1 - 1.5*IQR.
Ø Find the upper bound, which is Q3 + 1.5*IQR.
Ø Replace the data points which lie outside this range.
Ø They can be replaced by the mean or median.
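The IQR procedure can be sketched in pandas using the conventional bounds Q1 - 1.5*IQR and Q3 + 1.5*IQR. The data series is hypothetical, and replacing outliers with the median is one of the two options listed above.

```python
import pandas as pd

# Hypothetical measurements; 95 is an obvious outlier.
s = pd.Series([10, 12, 11, 13, 12, 11, 14, 95])

q1 = s.quantile(0.25)   # first quartile
q3 = s.quantile(0.75)   # third quartile
iqr = q3 - q1           # interquartile range

lower = q1 - 1.5 * iqr  # lower bound
upper = q3 + 1.5 * iqr  # upper bound

# Flag values outside [lower, upper] and replace them with the median.
is_outlier = (s < lower) | (s > upper)
cleaned = s.mask(is_outlier, s.median())

print(lower, upper)
print(cleaned.tolist())
```

The median is often preferred over the mean for the replacement value because the mean is itself pulled towards the outliers.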
REFERENCES

• More information on EDA tools and pandas can be found at the links below:
• https://pandas.pydata.org/docs/user_guide/index.html
• https://pandas.pydata.org/docs/user_guide/missing_data.html
• https://pandas.pydata.org/docs/user_guide/visualization.html

PRACTICAL DEMONSTRATION

• Creating research findings questionnaires for EDA
• Data storytelling and EDA
