Data visualization & data cleaning
Data visualization is a quick, easy way to convey concepts to a broad audience.
Data visualization is the graphical representation of information and data. By using visual elements
like charts, graphs, and maps, data visualization tools provide an accessible way to see and
understand trends, outliers, and patterns in data.
Plotting libraries:
Matplotlib: low-level, provides a lot of freedom.
Pandas visualization: easy-to-use interface, built on Matplotlib.
Seaborn: high-level interface, great default styles.
ggplot: based on R's ggplot2, uses the grammar of graphics.
Plotly: can create interactive plots.
#Why is data visualization important?
* Understanding complex data: visualizations make it easier to understand complex data sets that would
be difficult to comprehend in tables or text.
* Identifying patterns and trends: visualizations can help you spot trends, outliers, and relationships
between different data points that might be missed in raw data.
* Communicating insights: visualizations are a powerful way to communicate data-driven insights to
others, whether they are your colleagues, clients, or the general public.
* Making data-driven decisions: by understanding the data visually, you can make more informed and
effective decisions.
#Common types of data visualizations:
1. Line charts: show trends over time.
2. Bar charts: compare values across categories.
3. Pie charts: show the proportion of different categories within a whole.
4. Scatter plots: show the relationship between two variables.
5. Maps: show geographic data.
6. Infographics: combine text, visuals, and data to tell a story.
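Scatter plots appear in the list above but are not demonstrated elsewhere in these notes. A minimal sketch with Matplotlib follows. The variable names hours_studied and exam_score and all of their values are made up for illustration and do not come from the notes.

```python
import matplotlib.pyplot as plt

# Hypothetical data, for illustration only
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]
exam_score = [35, 42, 50, 55, 63, 70, 74, 85]

# Each (x, y) pair becomes one marker; no line connects the points
points = plt.scatter(hours_studied, exam_score, color="b")
plt.xlabel("hours studied")
plt.ylabel("exam score")
plt.title("Scatter plot example")
plt.show()
```

If the markers rise from left to right, as they do here, the two variables tend to increase together.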
#Principles behind data visualizations
Effective data visualizations are not just about creating pretty pictures. They are guided by several
key principles:
* Clarity: The visualization should be clear and easy to understand. Avoid clutter and unnecessary
elements.
* Accuracy: The visualization should accurately represent the data.
* Relevance: The visualization should be relevant to the data and the questions you are trying to
answer.
* Simplicity: Keep the visualization simple and avoid overcomplicating it.
* Context: Provide context to help viewers understand the data.
* Ethical considerations: Be mindful of ethical considerations when creating visualizations, such as
avoiding misleading or deceptive visuals.
#Visualizations can be used in a variety of fields, including:
* Business: To track sales, analyze customer behavior, and make strategic decisions.
* Science: To visualize research data, identify patterns, and communicate findings.
* Government: To track economic indicators, monitor public health, and plan public policy.
* Journalism: To tell data-driven stories and inform the public.
By following these principles, you can create effective data visualizations that help you understand
and communicate data more effectively.
Histograms and box plots (visualizing distributions)
Creating visualizations such as histograms and box plots can really help in understanding and
analyzing data. These tools provide different ways to look at the distribution and spread of your data.
Histograms
Histograms represent the distribution of numerical data by dividing it into intervals or bins. Here's a
quick example using Python and Matplotlib:
import matplotlib.pyplot as plt
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6, 7, 8, 9]
plt.hist(data, bins=5, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram Example')
plt.show()
or, as a minimal template:
import matplotlib.pyplot as plt
x = []  # put your numerical data here
plt.hist(x)
plt.show()
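A fuller example showing the main plt.hist parameters:
import matplotlib.pyplot as plt
import numpy as np
x = np.random.randint(10, 60, 50)  # 50 random integers from 10 to 59
print(x)
no = x  # the notes use a fixed sample here: [49, 4, 5, 5, ...] (remaining values elided)
l = [10, 20, 30, 40, 50]  # bin edges
# range has no effect when bins is a list of edges; rwidth is ignored for histtype="step"
plt.hist(no, color="r", bins=l, range=(0, 100), edgecolor="r", cumulative=-1,
         bottom=10, align="left", histtype="step", orientation="horizontal",
         rwidth=0.8, log=True, label="python")
plt.axvline(45, color="r", label="line")  # vertical reference line at x=45
plt.legend()
plt.title("wscube")
plt.xlabel("python")
plt.ylabel("no")
plt.show()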
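To see the numbers behind a histogram, the bin counts can be computed without drawing anything. The sketch below assumes NumPy's np.histogram, which is the function plt.hist relies on for binning, and reuses the data from the first histogram example.

```python
import numpy as np

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6, 7, 8, 9]

# counts[i] is the number of values between edges[i] and edges[i + 1];
# with bins=5 the range 1..9 is split into five equal-width intervals
counts, edges = np.histogram(data, bins=5)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count}")
```

The bar heights drawn by plt.hist(data, bins=5) are these counts, and the bar boundaries are these edges.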
Box Plots
A box plot draws the five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
Box plots (or box-and-whisker plots) summarize data through their quartiles and provide insights into
the spread and potential outliers. Here's an example using the same library:
import matplotlib.pyplot as plt
x = []  # put your numerical data here
plt.boxplot(x)
plt.show()
A fuller example showing the main plt.boxplot parameters:
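import matplotlib.pyplot as plt
data = [10, 20, 30, 40, 50, 60, 70, 120]
# labels sets the tick label (the parameter is named tick_labels in Matplotlib 3.9 and later)
plt.boxplot(data, notch=True, vert=False, widths=0.8, labels=["python"],
            patch_artist=True, showmeans=True, whis=2, sym="g+",
            boxprops=dict(color="r"), capprops=dict(color="r"),
            whiskerprops=dict(color="r"), flierprops=dict(markeredgecolor="y"))
plt.xlabel('Value')  # vert=False draws the box horizontally, so values run along the x-axis
plt.title('Box Plot Example')
plt.show()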
2. Box plot with several datasets
import matplotlib.pyplot as plt
x = [10, 20, 30, 40, 50, 60, 70, 120]
x1 = [10, 20, 40, 50, 30, 60, 90]
y = [x, x1]  # one box is drawn per list
plt.boxplot(y, labels=["python", "c++"], showmeans=True, whis=2, sym="g+",
            boxprops=dict(color="r"), capprops=dict(color="r"),
            whiskerprops=dict(color="r"), flierprops=dict(markeredgecolor="y"))
plt.show()
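The five-number summary behind a box plot can also be computed directly. The sketch below assumes NumPy's default (linear interpolation) percentiles and reuses the data list from the box plot example. The helper name outliers is hypothetical. It flags values lying more than whis times the interquartile range beyond the quartiles, which is how the whis parameter decides what to draw as a separate point.

```python
import numpy as np

data = [10, 20, 30, 40, 50, 60, 70, 120]

minimum = np.min(data)
q1, median, q3 = np.percentile(data, [25, 50, 75])
maximum = np.max(data)


def outliers(values, q1, q3, whis):
    """Return values lying beyond whis * IQR from the quartiles (hypothetical helper)."""
    iqr = q3 - q1
    low = q1 - whis * iqr
    high = q3 + whis * iqr
    return [v for v in values if v < low or v > high]


print(minimum, q1, median, q3, maximum)
print(outliers(data, q1, q3, whis=1.5))  # Matplotlib's default whisker length
print(outliers(data, q1, q3, whis=2))    # the value used in the examples above
```

With the default whis=1.5 the value 120 falls outside the whiskers, while with whis=2 it lies inside them, so the examples above draw no outlier marker for this data.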
The distribution of continuous numerical variables (bar plots, pie charts, line charts)
1. Bar Plots
Bar plots are excellent for comparing different groups or categories of data. However, they are not
typically used for continuous numerical variables. They are better suited for categorical data or
discrete numerical data. For continuous variables, histograms are more appropriate.
import matplotlib.pyplot as plt
x = ["python", "c", "c++", "java"]
y = [85, 70, 60, 82]
plt.xlabel("language", fontsize=10)
plt.ylabel("no", fontsize=10)
plt.title("wscube", fontsize=10)
c = ["y", "b", "m", "g"]  # one color per bar; a single color such as "y" also works
plt.bar(x, y, width=0.4, color=c, align="edge",  # align can be "edge" or "center"
        edgecolor="r", linewidth=5, linestyle=":", alpha=0.4, label="ws")
plt.legend()  # displays the label passed to plt.bar
plt.show()
For a multiple (grouped) bar graph:
import matplotlib.pyplot as plt
import numpy as np
x = ["python", "c", "c++", "java"]
y = [85, 70, 60, 82]
z = [20, 30, 40, 50]
width = 0.2
p = np.arange(len(x))  # positions of the first group of bars
p1 = [j + width for j in p]  # second group, shifted right by one bar width
plt.xlabel("language", fontsize=10)
plt.ylabel("no", fontsize=10)
plt.title("wscube", fontsize=10)
plt.bar(p, y, width, color="y", label="ws")
plt.bar(p1, z, width, color="r", label="popularity1")
plt.xticks(p + width / 2, x, rotation=10)  # center each tick between the two bars
plt.legend()  # displays the labels passed to plt.bar
plt.show()
For a horizontal bar chart, use plt.barh() in place of plt.bar().
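A minimal sketch of a horizontal bar chart, reusing the language data from the bar plot example. Note that plt.barh takes height where plt.bar takes width, and that the axis labels swap places because the categories now run along the y-axis.

```python
import matplotlib.pyplot as plt

x = ["python", "c", "c++", "java"]
y = [85, 70, 60, 82]

# Categories go on the y-axis; bar length along the x-axis shows the value
bars = plt.barh(x, y, height=0.4, color=["y", "b", "m", "g"])
plt.xlabel("no")
plt.ylabel("language")
plt.title("wscube")
plt.show()
```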
2. Pie Charts
Pie charts display proportions of a whole. They can be used for continuous data when categorized
into segments, but they are often less effective in conveying precise distributions compared to other
chart types. They are best used for showing percentages or proportional data.
import matplotlib.pyplot as plt
x = []  # put your values here
plt.pie(x)
plt.show()
1. Pie chart showing the main plt.pie parameters
import matplotlib.pyplot as plt
x = [10, 20, 30, 40]
y = ["c", "c++", "java", "python"]  # labels; not required to draw a pie chart
ex = [0.4, 0.0, 0.0, 0.0]  # how far each wedge is pulled out from the center
c = ["r", "b", "g", "y"]
plt.pie(x, labels=y, explode=ex, colors=c, autopct="%0.1f%%", shadow=True,
        radius=1.5, labeldistance=1.1, startangle=90, textprops={"fontsize": 15},
        counterclock=False, wedgeprops={"linewidth": 4, "edgecolor": "m"},
        center=(1, 3), rotatelabels=True)
plt.title("wscube")
plt.legend(loc=1)
plt.show()
plt.pie([1])  # a single value draws one full circle
plt.show()
2. Pie chart (one pie nested inside another)
import matplotlib.pyplot as plt
x = [10, 20, 30, 40]
x1 = [40, 30, 20, 10]
y = ["c", "c++", "java", "python"]  # labels; not required to draw a pie chart
c = ["r", "b", "y", "g"]
plt.pie(x, labels=y, radius=1)  # outer pie
plt.pie(x1, radius=0.5, colors=c)  # smaller pie drawn on top of it
plt.show()
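The percentages that autopct prints on each wedge are each value's share of the total. A small sketch of that arithmetic, reusing x and y from the pie examples above (plain Python, no plotting):

```python
x = [10, 20, 30, 40]
y = ["c", "c++", "java", "python"]

total = sum(x)
# Each wedge's share of the whole pie, as a percentage
percentages = [100 * value / total for value in x]

for label, pct in zip(y, percentages):
    # Same format string as autopct="%0.1f%%" in the example above
    print("%s: %0.1f%%" % (label, pct))
```

Because the wedges always sum to the whole pie, the values passed to plt.pie do not need to add up to 100.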
3. Line Charts
Line charts are ideal for showing trends over time or continuous data. They plot individual data
points connected by lines, making it easy to see changes and trends. Line charts are particularly
useful when you have time-series data or when you want to display data points over a continuous
range.
Example of Line Chart
Let's imagine you have a dataset showing the average temperature in a city over a year:
Time (Month) | Temperature (°C)
-------------|------------------
January      | 5
February     | 7
March        | 10
April        | 15
May          | 20
June         | 25
July         | 28
August       | 27
September    | 22
October      | 17
November     | 10
December     | 6
Using a line chart, you can plot the months on the x-axis and the temperature on the y-axis to
visualize the trend of temperature changes throughout the year.
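A minimal sketch of that line chart with Matplotlib, using the values from the table. The abbreviated month names and the marker style are choices made here, not part of the notes.

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
temperature = [5, 7, 10, 15, 20, 25, 28, 27, 22, 17, 10, 6]

# plt.plot connects consecutive points with straight lines;
# marker="o" also draws a dot at each monthly value
line, = plt.plot(months, temperature, marker="o")
plt.xlabel("Month")
plt.ylabel("Temperature (°C)")
plt.title("Average monthly temperature")
plt.show()
```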
Data Visualization: Line Plots and Regression
Line Plots
Line plots are a simple yet powerful way to visualize data points connected by straight
lines. These plots are particularly useful for showing trends over time.
Key Components:
* X-axis: Represents the independent variable (e.g., time).
* Y-axis: Represents the dependent variable (e.g., sales, temperature).
* Data Points: Represent individual measurements or observations.
Usage:
* Show trends over time.
* Compare multiple datasets.
Regression
Regression analysis is a statistical method for estimating relationships among variables.
It's used to predict the value of a dependent variable based on one or more independent variables.
Types of Regression:
* Linear Regression: Fits a straight line through the data points.
* Multiple Regression: Uses more than one predictor variable.
* Polynomial Regression: Fits a curved line through the data points.
Applications:
* Predicting sales.
* Determining the impact of marketing efforts.
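As a rough illustration of linear versus polynomial regression, the sketch below fits both to the monthly temperature table from the line chart example, assuming NumPy's np.polyfit as the fitting routine (a least-squares polynomial fit). Numbering the months 1 to 12 and the helper name sum_squared_error are choices made here for illustration.

```python
import numpy as np

months = np.arange(1, 13)  # January = 1, ..., December = 12
temperature = np.array([5, 7, 10, 15, 20, 25, 28, 27, 22, 17, 10, 6])

# deg=1 fits a straight line, deg=2 fits a curve (a parabola)
linear_coeffs = np.polyfit(months, temperature, deg=1)
quadratic_coeffs = np.polyfit(months, temperature, deg=2)


def sum_squared_error(coeffs):
    """Total squared gap between the fitted curve and the data (hypothetical helper)."""
    predicted = np.polyval(coeffs, months)
    return float(np.sum((temperature - predicted) ** 2))


linear_error = sum_squared_error(linear_coeffs)
quadratic_error = sum_squared_error(quadratic_coeffs)
print(f"linear fit error:    {linear_error:.1f}")
print(f"quadratic fit error: {quadratic_error:.1f}")
```

The temperatures rise and then fall over the year, so the curved fit leaves a smaller error than the straight line. Calling np.polyval(quadratic_coeffs, m) gives the fitted value for month m.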
Tableau Introduction
Tableau is a leading data visualization tool that's widely used for business
intelligence.
Key Features:
* Drag-and-Drop Interface: Easy to use, even for non-technical users.
* Interactive Dashboards: Allow users to explore data in a more intuitive way.
* Data Connectivity: Supports connections to various data sources, including Excel,
SQL databases, and cloud services.
Common Visualizations:
* Line charts, bar charts, histograms, and maps.
* Interactive dashboards and stories.
Introduction to Business Intelligence (BI)
Business Intelligence involves using data analysis tools to
support decision-making processes in an organization.
Components:
* Data Warehousing: Storing large volumes of data.
* Data Mining: Extracting useful information from large datasets.
* Reporting: Generating reports to summarize and present data.
* Dashboards: Providing real-time insights through visual representations.
Benefits:
* Better decision-making.
* Improved operational efficiency.
* Enhanced customer insights.
#Matplotlib
Matplotlib is a plotting library for the Python programming language and its numerical mathematics
extension, NumPy.
It is used for data visualization and supports both 2D and 3D plotting.
It was introduced by John Hunter in the year 2002.
Matplotlib graphs:
Line plot, scatter plot, bar plot, stem plot, step plot, histogram, box plot, pie chart, fill_between plot
import matplotlib.pyplot as plt
or
from matplotlib import pyplot as plt
Line graph
import matplotlib.pyplot as plt
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
c = ["r", "y", "g", "b"]
plt.bar(x, y, color=c)  # bar chart
plt.plot(x, y)  # line chart, drawn over the bars
plt.show()
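The list of Matplotlib graphs above includes stem, step, and fill_between plots, which are not shown elsewhere in these notes. A minimal sketch that draws all three side by side, reusing x and y from the line graph example. The 1-by-3 layout and the figure size are choices made here.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]

fig = plt.figure(figsize=(9, 3))

plt.subplot(1, 3, 1)
plt.stem(x, y)  # a vertical line from the baseline up to each point
plt.title("stem")

plt.subplot(1, 3, 2)
plt.step(x, y)  # values connected by horizontal and vertical segments
plt.title("step")

plt.subplot(1, 3, 3)
plt.fill_between(x, y, alpha=0.4)  # shades the area between the curve and zero
plt.title("fill_between")

plt.tight_layout()
plt.show()
```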