Pre ML Practise

Uploaded by bpkdeveloper45

NUMPY:

NumPy is a powerful Python library used for numerical computing.

It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these
arrays efficiently. NumPy is widely used in scientific computing, machine learning, data analysis, and other fields where numerical
operations on large datasets are common.
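
As a minimal sketch of the array support described above, the example below builds a small two-dimensional array and applies a few vectorized operations. The values are illustrative only and are not taken from the original notes.

```python
import numpy as np

# A 2-D array (matrix) with 2 rows and 3 columns; the numbers are made up.
a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.shape)         # dimensions of the array: (2, 3)
print(a * 10)          # element-wise multiplication, no explicit loop needed
print(a.sum(axis=0))   # sum down each column
print(a.mean())        # mean of all six elements
```

Operations such as `a * 10` act on every element at once, which is what makes NumPy efficient for large datasets.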

STACKING:
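
A minimal sketch, assuming "stacking" here refers to NumPy's functions for joining arrays. The arrays `a` and `b` are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

v = np.vstack((a, b))          # stack as rows      -> shape (2, 3)
h = np.hstack((a, b))          # join end to end    -> shape (6,)
c = np.column_stack((a, b))    # stack as columns   -> shape (3, 2)
s = np.stack((a, b), axis=0)   # join along a new axis -> shape (2, 3)

print(v)
print(h)
print(c)
```

`vstack` and `column_stack` are convenient for building a feature matrix out of separate one-dimensional arrays.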

PANDAS:

Pandas is a popular open-source Python library used for data manipulation and analysis. It provides data structures like DataFrame and Series
that are designed to make working with structured data easy and intuitive.

Key features of pandas include:

1. DataFrame: A two-dimensional, labeled data structure with columns of potentially different types. It is similar to a spreadsheet or SQL
table.
2. Series: A one-dimensional labeled array capable of holding data of any type.
3. Data manipulation: Pandas provides a wide range of functions for data cleaning, reshaping, merging, slicing, indexing, and more.
4. Data import/export: Supports reading and writing data in various formats like CSV, Excel, SQL databases, and more.
5. Missing data handling: Provides tools for dealing with missing data, such as filling in missing values or dropping rows/columns with
missing data.
6. Time series data: Includes functionalities for working with time series data, such as date range generation, shifting, and frequency
conversion.
7. Powerful indexing: Supports various methods of indexing and selecting data, including label-based indexing with loc, integer-based
indexing with iloc, and boolean indexing.
8. Groupby: Allows splitting data into groups based on some criteria and then applying functions to each group independently.
9. Plotting: Integration with Matplotlib for creating visualizations directly from pandas data structures.
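
Several of the features listed above (DataFrame, Series, indexing with loc and iloc, boolean indexing, and groupby) can be sketched in a few lines. The column names and values are made up for illustration.

```python
import pandas as pd

# Illustrative data; not taken from any of the datasets mentioned in these notes.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
    "year": [2022, 2022, 2023, 2023],
    "temp": [27.5, 25.0, 28.1, 24.6],
})

temps = df["temp"]                     # a single column is a Series
first_row = df.iloc[0]                 # integer-based indexing
pune = df.loc[df["city"] == "Pune"]    # label-based loc with a boolean mask
hot = df[df["temp"] > 25.0]            # boolean indexing

# Groupby: split by city, then apply the mean to each group.
mean_by_city = df.groupby("city")["temp"].mean()
print(mean_by_city)
```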

Overall, pandas is widely used in data analysis, data cleaning, data preprocessing, and various other data-related tasks in Python.
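
Since data cleaning usually starts with missing values, here is a small sketch of the missing-data tools mentioned above. The frame is a toy example, and filling with the column mean is just one possible strategy.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, 5.0, 6.0]})

n_missing = df.isna().sum()      # count missing values per column
filled = df.fillna(df.mean())    # fill each gap with its column mean
dropped = df.dropna()            # or drop every row that has a gap

print(n_missing)
print(filled)
print(dropped)
```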

INSTALLING PANDAS: pip install pandas

1. Google Dataset Search
Type of data: Miscellaneous
Data compiled by: Google
Access: Free to search, but does include some fee-based search results
Sample dataset: Global price of coffee, 1990-present

2. Kaggle
Type of data: Miscellaneous
Data compiled by: Kaggle
Access: Free, but registration required
Sample dataset: Daily temperature of major cities

3. Data.Gov
Type of data: Government
Data compiled by: US Federal Government
Access: Free, no registration required
Sample dataset: Lobster Report for Transshipment and Sales

4. Datahub.io
Type of data: Mostly business and finance
Data compiled by: Datahub
Access: Mostly free, no registration required
Sample dataset: Average mass of glaciers since 1945

5. UCI Machine Learning Repository
Type of data: Machine learning
Data compiled by: University of California Irvine
Access: Free, no registration required
Sample dataset: Behavior of urban traffic in Sao Paulo, Brazil

6. Earth Data
Type of data: Earth science
Data compiled by: NASA
Access: Free, no registration required
Sample dataset: Environmental conditions during fall moose hunting season in Alaska, 2000-2016

7. CERN Open Data Portal
Type of data: Particle Physics
Data compiled by: CERN
Access: Free, no registration required
Sample dataset: Higgs candidate collision events from 2011 and 2012

8. Global Health Observatory Data Repository
Type of data: Health
Data compiled by: UN World Health Organization
Access: Free, no registration required
Sample dataset: Polio immunization coverage estimates by region

9. BFI film industry statistics
Type of data: Entertainment and film
Data compiled by: British Film Institute
Access: Free, no registration required
Sample dataset: Weekend box office figures from 2001-present

10. NYC Taxi Trip Data
Type of data: Transport
Data compiled by: New York City Taxi and Limousine Commission
Access: Free, no registration required
Sample dataset: Take your pick!

11. FBI Crime Data Explorer
Type of data: Crime and drugs
Data compiled by: Federal Bureau of Investigation
Access: Free, no registration required
Sample dataset: Homicide offense counts in Point Pleasant, 2008-2018

CREATING A DATASET AFTER DOWNLOADING
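
A minimal sketch of loading a downloaded file into a DataFrame, assuming the download is a CSV. To keep the example self-contained, the CSV text is held in memory; the column names and numbers are invented, and the file name in the comment is hypothetical.

```python
import io
import pandas as pd

# Stand-in for a downloaded CSV file. With a real download you would pass its
# path instead, e.g. pd.read_csv("coffee_prices.csv") (hypothetical file name).
csv_text = """date,price_usd
1990-01-01,0.76
1990-02-01,0.84
1990-03-01,0.94
"""

# parse_dates converts the "date" column from text to datetime values.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(df.head())    # first rows of the dataset
print(df.dtypes)    # column types inferred by pandas
```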

OPERATIONS ON DATA FRAME
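
The sketch below shows a few common DataFrame operations: adding a column, filtering, sorting, summarizing, and renaming. It is one possible selection, and the data is made up for illustration.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    "product": ["pen", "book", "bag", "lamp"],
    "price": [10.0, 250.0, 800.0, 450.0],
    "qty": [100, 20, 5, 8],
})

df["revenue"] = df["price"] * df["qty"]                # add a derived column
expensive = df[df["price"] > 200]                      # filter rows by a condition
ranked = df.sort_values("revenue", ascending=False)    # sort by a column
summary = df.describe()                                # count, mean, std, min, quartiles, max
renamed = df.rename(columns={"qty": "quantity"})       # rename a column

print(ranked)
print(summary)
```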


MATPLOTLIB:

Matplotlib is a popular Python library for creating static, animated, and interactive visualizations. It provides a wide variety of plots
and charts, including line plots, bar charts, histograms, scatter plots, and more.

Key features of Matplotlib include:

1. Wide range of plots: Matplotlib supports a wide variety of plots and charts, making it suitable for many different types of data visualization
tasks.
2. Customization: Matplotlib allows for extensive customization of plots, including colors, labels, fonts, line styles, and more.
3. Publication-quality output: Matplotlib is designed to produce high-quality plots suitable for publication and presentation.
4. Integration with Jupyter notebooks: Matplotlib integrates well with Jupyter notebooks, allowing for interactive plotting within the
notebook environment.
5. Backend support: Matplotlib supports multiple backends for rendering plots, including rendering to file formats like PNG, PDF, SVG, and
interactive backends for displaying plots in GUI applications.
6. Compatibility: Matplotlib is compatible with a wide range of Python versions and platforms, including Windows, macOS, and Linux.

Overall, Matplotlib is a powerful and flexible library for creating a wide variety of plots and visualizations in Python.
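
As a small illustration of the plotting and customization features above, the sketch below draws a styled line plot and saves it as a PNG. The data and the output file name `line_plot.png` are arbitrary choices. The `Agg` backend is selected so that the script also runs without a display; in a Jupyter notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; renders to files only
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

fig, ax = plt.subplots(figsize=(6, 4))
# Customization: color, line style, marker, and legend label.
ax.plot(x, y, color="tab:blue", linestyle="--", marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple line plot")
ax.legend()

fig.savefig("line_plot.png", dpi=150)  # a .pdf or .svg extension works the same way
```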

SEABORN:

Seaborn is a Python data visualization library built on top of Matplotlib. It provides a high-level interface for drawing attractive and
informative statistical graphics and integrates closely with pandas data structures, making it particularly useful for working with data
frames and arrays.

Key features of Seaborn include:

1. High-level interface: Seaborn provides a simple and intuitive interface for creating complex visualizations with just a few lines of code.
2. Attractive default styles: Seaborn comes with several built-in themes and color palettes that make it easy to create visually appealing plots.
3. Statistical plotting: Seaborn includes several functions for visualizing statistical relationships in data, such as scatter plots, box plots, violin
plots, and more.
4. Integration with pandas: Seaborn works seamlessly with pandas data frames, making it easy to plot data directly from a data frame.
5. Flexible customization: Seaborn allows for extensive customization of plots, including control over colors, styles, and other visual
properties.
6. Wide range of plots: Seaborn supports a wide range of plot types, including heatmaps, pair plots, joint plots, and more.
7. Works well with Jupyter notebooks: Seaborn integrates well with Jupyter notebooks, allowing for interactive plotting and data exploration.

Overall, Seaborn is a powerful and versatile library for creating informative and visually appealing plots in Python.
https://seaborn.pydata.org/
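
A minimal sketch of plotting directly from a pandas data frame with Seaborn. The data frame is made up for illustration, and the output file name is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so no display is needed
import pandas as pd
import seaborn as sns

# Illustrative data only.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 70, 74, 83],
    "group": ["a", "b", "a", "b", "a", "b"],
})

sns.set_theme(style="whitegrid")  # one of Seaborn's built-in themes
# Column names are passed as strings; hue colors the points by group.
ax = sns.scatterplot(data=df, x="hours", y="score", hue="group")
ax.set_title("Score vs. hours studied")
ax.figure.savefig("seaborn_scatter.png")
```

Because Seaborn returns a regular Matplotlib Axes object, any Matplotlib customization can still be applied to the result.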

SCIKIT-LEARN:

Scikit-learn is a popular open-source machine learning library for Python. It provides simple and efficient tools for data mining and data analysis,
built on NumPy, SciPy, and Matplotlib.

Key features of scikit-learn include:

1. Consistent interface: Scikit-learn provides a consistent API for various machine learning algorithms, making it easy to experiment with
different models.
2. Supervised and unsupervised learning algorithms: Scikit-learn includes a wide range of supervised and unsupervised learning algorithms,
including support for classification, regression, clustering, dimensionality reduction, and more.
3. Easy to use: Scikit-learn is designed to be easy to use, with a focus on simplicity and readability of code.
4. Integration with other Python libraries: Scikit-learn integrates well with other Python libraries, such as NumPy, pandas, and Matplotlib,
making it easy to use in conjunction with these libraries.
5. Model evaluation and validation: Scikit-learn provides tools for model evaluation and validation, including functions for cross-validation,
grid search, and performance metrics.
6. Community and support: Scikit-learn has a large and active community of users and developers, providing support and contributing to the
development of the library.

Overall, scikit-learn is a powerful and versatile library for machine learning in Python, suitable for both beginners and experts alike.
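
The consistent fit/score interface and the evaluation tools can be sketched as follows. The example uses the iris dataset bundled with scikit-learn; the choice of logistic regression, the 25% test split, and the random seed are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Small built-in dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                   # train
test_accuracy = model.score(X_test, y_test)   # evaluate on unseen data

# 5-fold cross-validation on the full dataset.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(test_accuracy)
print(cv_scores.mean())
```

Swapping in another estimator, such as a decision tree, only changes the line that constructs the model; the `fit` and `score` calls stay the same.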

STATISTICS:

Statistics plays a crucial role in machine learning, as it provides the foundation for many machine learning algorithms and techniques. Here are
some key statistics concepts that are important for machine learning:

1. Descriptive statistics: Descriptive statistics are used to summarize and describe the main features of a dataset. This includes measures such
as mean, median, mode, variance, standard deviation, and percentiles.
2. Probability distributions: A probability distribution describes how likely each possible value of a random variable is. Common probability
distributions used in machine learning include the normal distribution, binomial distribution, and Poisson distribution.
3. Statistical inference: Statistical inference involves drawing conclusions about a population based on a sample of data. This includes
hypothesis testing and confidence intervals.
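4. Correlation and covariance: Covariance measures the extent to which two variables change together, and its size depends on the units of
the variables. Correlation is the covariance rescaled by the two standard deviations, so it always lies between -1 and 1 and measures the
strength and direction of the linear relationship.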
5. Regression analysis: Regression analysis is used to model the relationship between a dependent variable and one or more independent
variables. It is commonly used for prediction in machine learning.
6. Classification: Classification is a type of supervised learning where the goal is to predict the class label of new observations based on past
observations. Statistics provides the theoretical foundation for many classification algorithms, such as logistic regression and decision
trees.
7. Clustering: Clustering is an unsupervised learning technique where the goal is to group similar data points together. Statistics provides
methods for measuring the similarity between data points, such as distance metrics, which clustering algorithms like k-means rely on.
8. Dimensionality reduction: Dimensionality reduction techniques, such as principal component analysis (PCA) and t-distributed stochastic
neighbor embedding (t-SNE), are used to reduce the number of variables in a dataset while preserving important information. Statistics
provides the theoretical basis for these techniques.

Understanding these statistics concepts is essential for effectively applying machine learning algorithms and interpreting their results.
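
The descriptive statistics and the covariance/correlation distinction from the list above can be checked numerically. This is a small sketch on toy data; `y` is deliberately built as an exact linear function of `x`, so the correlation comes out as 1.

```python
import statistics
import numpy as np

data = [2, 4, 4, 4, 5, 5, 7, 9]   # toy sample

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
pop_var = statistics.pvariance(data)   # population variance
pop_std = statistics.pstdev(data)      # population standard deviation
p90 = np.percentile(data, 90)          # 90th percentile

# Covariance depends on the scale of the data; correlation is scale-free.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0
cov_xy = np.cov(x, y)[0, 1]            # sample covariance
corr_xy = np.corrcoef(x, y)[0, 1]      # Pearson correlation

print(mean, median, mode, pop_var, pop_std, p90)
print(cov_xy, corr_xy)
```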

LINEAR ALGEBRA:

Linear algebra is a fundamental mathematical tool in machine learning, as it provides the basis for many machine learning algorithms and
concepts. Here are some key linear algebra concepts that are important for machine learning:

1. Vectors and matrices: Vectors and matrices are used to represent data in machine learning. A vector is a one-dimensional array of
numbers, while a matrix is a two-dimensional array. Vectors and matrices are used to represent features, labels, and parameters in machine
learning models.
2. Matrix operations: Linear algebra provides several operations for manipulating matrices, such as addition, subtraction, multiplication, and
transposition. These operations are used in various machine learning algorithms for tasks like data transformation, model training, and
prediction.
3. Dot product and matrix multiplication: The dot product of two vectors and the matrix multiplication of two matrices are important
operations in linear algebra. They are used in machine learning for computing similarities between vectors, transforming data, and
updating model parameters during training.
4. Eigenvalues and eigenvectors: An eigenvector of a square matrix is a nonzero vector that the matrix only scales, and the eigenvalue is the
scaling factor. They are used in machine learning for dimensionality reduction and feature extraction; PCA, for example, uses the
eigenvectors of the data's covariance matrix.
5. Singular value decomposition (SVD): SVD is a matrix factorization technique that is used in machine learning for dimensionality reduction,
data compression, and noise reduction.
6. Norms: Norms are used to measure the magnitude of vectors and matrices. Different norms, such as the L1 norm, L2 norm, and Frobenius
norm, are used in machine learning for regularization, error calculation, and model evaluation.
7. Linear transformations: Linear transformations are used to transform data in machine learning. They are used in algorithms like principal
component analysis (PCA) and linear regression.
8. Vector spaces and subspaces: A vector space is a set of vectors that is closed under addition and scalar multiplication. Subspaces such as
the column space and null space of a matrix describe which outputs a linear model can and cannot produce.

Understanding these linear algebra concepts is essential for effectively implementing and understanding many machine learning algorithms.
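
Most of the operations listed above map directly onto NumPy calls. The sketch below uses small, made-up matrices; `eigh` is chosen because the example matrix `S` is symmetric.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
dot = u @ v                        # dot product of two vectors

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
B = np.array([[1.0, 2.0],
              [3.0, 4.0]])
AB = A @ B                         # matrix multiplication
Bt = B.T                           # transposition

l1 = np.linalg.norm(u, ord=1)      # L1 norm: sum of absolute values
l2 = np.linalg.norm(u)             # L2 (Euclidean) norm
fro = np.linalg.norm(B, "fro")     # Frobenius norm of a matrix

# Eigen-decomposition of a symmetric matrix: S @ vec = val * vec.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(S)

# Singular value decomposition: B = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(B)
B_rebuilt = U @ np.diag(s) @ Vt

print(dot, l1, l2, fro)
print(eigvals)
```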

PROBABILITY:

Probability is a fundamental concept in machine learning, as it provides a framework for reasoning about uncertainty and making predictions
based on data. Here are some key probability concepts that are important for machine learning:

1. Probability distributions: A probability distribution describes how likely each possible value of a random variable is. Common probability
distributions used in machine learning include the normal distribution, binomial distribution, and Poisson distribution.
2. Conditional probability: Conditional probability measures the probability of an event occurring given that another event has already
occurred. It is used in machine learning for modeling dependencies between variables.
3. Bayes' theorem: Bayes' theorem is a fundamental theorem in probability theory that describes how to update the probability of a
hypothesis based on new evidence. It is used in machine learning for Bayesian inference and probabilistic modeling.
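4. Expectation and variance: Expectation is the probability-weighted average value of a random variable, while variance is the expected squared
deviation from that average and measures spread. These concepts are used in machine learning for model evaluation and optimization; many
loss functions, for example, are expectations estimated by averaging over the data.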
5. Joint, marginal, and conditional probability distributions: Joint probability distributions describe the probabilities of multiple events
occurring together, while marginal probability distributions describe the probabilities of individual events. Conditional probability
distributions describe the probabilities of events given certain conditions. These concepts are used in machine learning for modeling
complex dependencies between variables.
6. Maximum likelihood estimation (MLE): MLE is a method for estimating the parameters of a probability distribution based on observed
data. It is used in machine learning for fitting probabilistic models to data.
7. Naive Bayes classifier: The Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive)
independence assumptions between the features. It is commonly used in machine learning for text classification and spam filtering.
8. Probabilistic graphical models: Probabilistic graphical models are a framework for modeling complex probabilistic relationships between
variables. They are used in machine learning for representing and reasoning about uncertainty in data.

Understanding these probability concepts is essential for effectively applying probabilistic models and reasoning in machine learning.
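
Two of the ideas above, Bayes' theorem and maximum likelihood estimation, can be worked through with plain arithmetic. The spam-filter probabilities and the coin-flip data below are invented for illustration.

```python
import numpy as np

# Bayes' theorem: P(H | E) = P(E | H) * P(H) / P(E)
p_spam = 0.2                # prior: P(spam)
p_word_given_spam = 0.6     # likelihood: P(word | spam)
p_word_given_ham = 0.05     # P(word | not spam)

# Marginal P(word), by the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior: probability that a message is spam given that it contains the word.
p_spam_given_word = p_word_given_spam * p_spam / p_word

# MLE for a Bernoulli parameter: the estimate that maximizes the likelihood
# of the observed outcomes is the sample proportion of successes.
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])
p_hat = flips.mean()

print(p_spam_given_word)
print(p_hat)
```

Seeing the word raises the probability of spam from the prior of 0.2 to a posterior of 0.75, which is the kind of update a Naive Bayes classifier performs for every feature.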
FINALLY, INSTALL ANACONDA AND, FROM IT, INSTALL JUPYTER NOTEBOOK:

Now start working on your project.


Filename: Document1
Template: Normal.dotm
Author: sandanakari sachin
Creation Date: 3/13/2024 7:00:00 PM
Change Number: 1
Total Editing Time: 1,399 Minutes
Last Printed On: 3/14/2024 6:29:00 PM
As of Last Complete Printing
Number of Pages: 13
Number of Words: 2,550 (approx.)
Number of Characters: 14,538 (approx.)
