# ML With Unstructured Data

---

## **Citations and Notes in Machine Learning**

### 🔹 **Purpose of Citations in ML**

Citations serve to:

* Credit original ideas, algorithms, datasets, or tools.

* Provide background or theoretical foundations (e.g., referencing neural networks or decision trees).

* Support claims with prior studies or benchmarks.

* Point readers to relevant tools/libraries (e.g., TensorFlow, scikit-learn).

### 🔹 **Common Citation Styles in ML**

* **IEEE** (used in engineering/CS conferences)

* **ACM** (used in computing research)

* **APA/MLA/Chicago** (used more in interdisciplinary work)

> **Note:** Most ML papers follow **IEEE** or **ACM** formats. In arXiv or NeurIPS papers, citations often appear in square brackets \[1], \[2].

---

## 📘 **Example Citations in Machine Learning**


Here are some examples in **IEEE style**, which is common in ML conferences like NeurIPS, ICML, and
CVPR:

### 🔸 1. Algorithm/Model Citation

**Text:**

The backpropagation algorithm revolutionized training of neural networks \[1].

**Citation (IEEE):**

\[1] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” *Nature*, vol. 323, no. 6088, pp. 533–536, 1986.

---

### 🔸 2. Dataset Citation

**Text:**

We used the CIFAR-10 dataset for image classification experiments \[2].

**Citation:**

\[2] A. Krizhevsky, “Learning Multiple Layers of Features from Tiny Images,” University of Toronto, Tech.
Rep., 2009.

---

### 🔸 3. Software Library Citation


**Text:**

All experiments were conducted using the PyTorch deep learning library \[3].

**Citation:**

\[3] A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” in
*Advances in Neural Information Processing Systems (NeurIPS)*, vol. 32, 2019.

---

### 🔸 4. Benchmark Paper Citation

**Text:**

Transformers have become the backbone of modern NLP models \[4].

**Citation:**

\[4] A. Vaswani et al., “Attention is all you need,” in *Proc. NeurIPS*, 2017.

---

## 📝 Notes (Footnotes/Endnotes)

In ML writing, **footnotes** are rare in formal papers but may be used:

* For clarifications or exceptions.

* To credit non-academic contributions (e.g., open-source projects or blog posts).


* To explain implementation details not central to the main argument.

**Example footnote (APA style):**

> The model was trained using a batch size of 128.¹
>
> ¹ We also tested batch sizes of 64 and 256, which yielded similar results.

---

## 🧾 Foundational References in Machine Learning (You Can Cite)

| Topic                   | Reference                                                        |
| ----------------------- | ---------------------------------------------------------------- |
| Neural Networks         | Rumelhart et al., 1986                                            |
| SVMs                    | Cortes & Vapnik, 1995                                             |
| Random Forests          | Breiman, 2001                                                     |
| Deep Learning           | LeCun, Bengio & Hinton, 2015                                      |
| CNNs                    | Krizhevsky et al., 2012 (AlexNet)                                 |
| Transformers            | Vaswani et al., 2017                                              |
| Reinforcement Learning  | Sutton & Barto, 2018                                              |
| Datasets                | Krizhevsky (CIFAR), Deng et al. (ImageNet), Bowman et al. (SNLI)  |
| Software                | Paszke et al. (PyTorch), Abadi et al. (TensorFlow)                |

---

## 🧠 **Evaluation of Text Classification**

Text classification refers to assigning predefined categories to textual data (e.g., spam detection,
sentiment analysis, topic labeling). Evaluation measures how well a model performs this task.

---

## 🔹 **1. Common Evaluation Metrics**

### ✅ Accuracy

* **Definition**: Proportion of correctly classified examples out of total examples.

* **Use case**: Good for balanced datasets.

* **Formula**:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

### ⚖️ Precision, Recall, and F1 Score


* Useful for **imbalanced datasets** or when **false positives/negatives** matter.

| Metric    | Definition                                         | Formula                           |
| --------- | -------------------------------------------------- | --------------------------------- |
| Precision | Fraction of predicted positives that are correct   | $\frac{TP}{TP + FP}$              |
| Recall    | Fraction of actual positives that are identified   | $\frac{TP}{TP + FN}$              |
| F1 Score  | Harmonic mean of precision and recall              | $\frac{2 \cdot P \cdot R}{P + R}$ |

### 📊 Confusion Matrix

* A table showing counts of TP, TN, FP, FN.

* Helps visualize where the classifier makes errors.

### Macro, Micro, and Weighted Averages

* For multi-class tasks (see the sketch below):

* **Macro**: Unweighted mean of the per-class metric (every class counts equally).

* **Micro**: Computes the metric globally by pooling TP, FP, and FN across all instances.

* **Weighted**: Mean of per-class metrics weighted by support (class frequency).
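
A minimal sketch of how these metrics can be computed with scikit-learn; the toy labels below are made up for illustration:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Hypothetical predictions for a 3-class problem (illustration only).
y_true = [0, 0, 1, 1, 2, 2, 2, 1, 0, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0, 1, 0, 2]

print("Accuracy:", accuracy_score(y_true, y_pred))

# Macro, micro, and weighted averages of precision, recall, and F1.
for avg in ("macro", "micro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=avg)
    print(f"{avg:>8}: precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")

# Confusion matrix: rows = true classes, columns = predicted classes.
print(confusion_matrix(y_true, y_pred))
```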

---

## 🔹 **2. Model Evaluation Methodology**


### 🧪 Cross-Validation

* K-fold cross-validation increases robustness of results.

* Useful when dataset size is small.
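
A brief sketch of k-fold cross-validation with scikit-learn, assuming a simple TF-IDF + logistic-regression pipeline; the corpus and labels are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder corpus and labels (1 = spam, 0 = not spam), illustration only.
texts = ["free prize inside", "meeting at noon", "win cash now",
         "project update attached", "claim your reward", "lunch tomorrow?"]
labels = [1, 0, 1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# 3-fold cross-validation; each fold is held out once for evaluation.
scores = cross_val_score(model, texts, labels, cv=3, scoring="f1")
print("Per-fold F1:", scores, "mean:", scores.mean())
```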

### 📉 Holdout Validation (Train/Dev/Test Split)

* Common: 80% training / 10% validation / 10% test.

* Prevents overfitting and ensures generalization.
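
One way to realize an 80/10/10 split is two chained calls to scikit-learn's train_test_split; a sketch with placeholder data:

```python
from sklearn.model_selection import train_test_split

# X and y stand in for feature vectors and labels (placeholders).
X = list(range(100))
y = [i % 2 for i in range(100)]

# First split off 10% as the test set, then 1/9 of the remainder (~10% overall) as the dev set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42, stratify=y)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=1 / 9, random_state=42, stratify=y_rest)

print(len(X_train), len(X_dev), len(X_test))  # roughly 80 / 10 / 10
```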

### 📈 Learning Curves

* Plots performance vs. number of training samples.

* Helps diagnose underfitting vs. overfitting.
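
scikit-learn's learning_curve can generate the numbers behind such a plot; a sketch using a stand-in dataset and classifier:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset and classifier (illustration only).
X, y = load_digits(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Score the model at increasing training-set sizes with 5-fold cross-validation.
sizes, train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# A large gap between train and validation scores suggests overfitting;
# low scores on both suggest underfitting.
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}")
```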

---

## 🔹 **3. Specialized Considerations**

| Task Type                   | Additional Evaluation Tools   |
| --------------------------- | ----------------------------- |
| Multi-label                 | Hamming loss, subset accuracy |
| Imbalanced data             | ROC-AUC, PR-AUC curves        |
| Hierarchical classification | Hierarchical precision/recall |
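
For imbalanced binary data, ROC-AUC and PR-AUC can be computed from predicted scores; a short sketch with scikit-learn (the labels and scores are illustrative):

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Illustrative labels and predicted probabilities for the positive class
# (imbalanced: 2 positives out of 10).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.15, 0.3, 0.05, 0.4, 0.2, 0.35, 0.8, 0.6]

print("ROC-AUC:", roc_auc_score(y_true, y_score))
print("PR-AUC (average precision):", average_precision_score(y_true, y_score))
```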


---

## **Clustering Task in Text Analysis in Machine Learning**

In machine learning, a text clustering task groups similar text documents into
clusters based on their semantic content, using unsupervised learning because it
does not require labeled data. The process involves representing texts as numerical
vectors, computing their similarity, and then applying a clustering algorithm to form
groups where documents within a cluster share common themes and are dissimilar
to documents in other clusters. Key applications include organizing large document
collections, improving information retrieval, aiding in topic modeling, and creating
datasets for other NLP tasks.

### How Text Clustering Works


1. **Data Preprocessing:** Text data is cleaned by removing stop words (common words like "the," "a") and punctuation, and by applying stemming or lemmatization to standardize word forms.

2. **Feature Extraction:** Texts are converted into numerical representations called vectors or embeddings. Common methods include:

   * **Bag-of-Words:** Counts the frequency of each word in a document.

   * **TF-IDF:** Weights words by their importance within a document and across the collection.

   * **Word Embeddings:** Uses models like Word2Vec or GloVe to capture semantic relationships between words.

   * **Contextual Embeddings:** Advanced methods, such as those from BERT or LLMs, capture a word's meaning based on its surrounding context.

3. **Similarity Computation:** A distance metric (e.g., Euclidean distance) or similarity measure is used to determine how close two text vectors are in the feature space.

4. **Clustering Algorithm:** An algorithm is applied to group the vectors based on their proximity. Popular algorithms include:

   * **K-Means:** Partitions data into a pre-defined number of clusters (k).

   * **Hierarchical Clustering:** Builds a hierarchy of clusters, merging or splitting them iteratively.

5. **Evaluation:** The quality of the clusters is assessed using metrics like the silhouette coefficient, which measures how similar a document is to its own cluster compared to other clusters. A minimal end-to-end sketch follows this list.
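
A minimal end-to-end sketch of these steps with scikit-learn; the toy corpus and the choice of k = 2 clusters are assumptions for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Toy corpus (illustration only).
docs = [
    "the stock market fell sharply today",
    "investors worry about rising interest rates",
    "the team won the championship game",
    "the striker scored twice in the final",
]

# Steps 1-2: preprocess and vectorize (TF-IDF with English stop-word removal).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Steps 3-4: cluster the vectors; k = 2 is an assumption here.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Step 5: evaluate with the silhouette coefficient (higher is better, max 1.0).
print("Cluster labels:", labels)
print("Silhouette:", silhouette_score(X, labels))
```
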
### Key Applications

* **Document Organization:** Groups large collections of documents, making them easier to navigate and manage.

* **Information Retrieval:** Improves search engines by grouping related documents, making it easier to find relevant information.

* **Topic Modeling:** Identifies underlying themes and topics within a large corpus of text.

* **Data Summarization:** Helps condense and organize information by identifying key themes across documents.

* **Recommendation Systems:** Used to recommend similar content or items based on shared textual themes.

---

## **The General Clustering Problem**


The clustering problem in machine learning is an unsupervised learning task that involves grouping unlabeled data into clusters, where data points within a cluster are more similar to each other than to those in other clusters. The goal is to discover inherent patterns and structures within the data by maximizing within-cluster similarity and minimizing between-cluster similarity. Common challenges include selecting the right similarity metric, determining the appropriate number of clusters, and handling high-dimensional or noisy data.

### Key Aspects of the Clustering Problem

* **Unsupervised Learning:** Unlike supervised learning, clustering does not use labeled data to learn a target function.

* **Data Partitioning:** The process divides a dataset into homogeneous subsets, or clusters, based on shared characteristics.

* **Similarity Metrics:** Clustering algorithms use metrics such as Euclidean distance or cosine similarity to measure the closeness between data points (see the sketch below).

* **No Predefined Clusters:** The number and nature of the clusters are not known beforehand and must be discovered by the algorithm.
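
A short sketch of the two similarity measures mentioned above, computed with scikit-learn; the vectors are arbitrary examples:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Two arbitrary example vectors (illustration only).
a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 4.0, 0.0]])

# Euclidean distance: sensitive to vector magnitude.
print("Euclidean distance:", euclidean_distances(a, b)[0, 0])

# Cosine similarity: depends only on direction, so these parallel vectors score 1.0.
print("Cosine similarity:", cosine_similarity(a, b)[0, 0])
```
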
### Common Challenges

* **Defining "Good" Clusters:** Evaluating the quality of the resulting clusters can be difficult, as the clusters may not always represent meaningful real-world relationships.

* **Curse of Dimensionality:** Clustering becomes challenging in high-dimensional data, where algorithms may struggle to find meaningful patterns (ECML PKDD 2017).

* **Scalability:** Many clustering algorithms are computationally expensive, with runtimes that increase significantly with the number of data points, making them impractical for large datasets.

* **Outliers and Noise:** The presence of outliers or noisy data can negatively impact the quality and structure of the clusters.

### Applications of Clustering

* **Customer Segmentation:** Grouping customers with similar purchasing behaviors for targeted marketing campaigns.

* **Image Segmentation:** Dividing an image into distinct regions or segments for object recognition.

* **Anomaly Detection:** Identifying unusual data points or outliers that deviate from the normal patterns of a cluster.

* **Document Analysis:** Clustering documents into thematic groups to organize and retrieve information.

---

## **Clustering Algorithms in Machine Learning**


Clustering algorithms are unsupervised machine learning techniques used to group
unlabeled data points into clusters based on their similarities. The goal is to identify
intrinsic groupings within the data, where data points within the same cluster are
more similar to each other than to those in other clusters.
Common types of clustering algorithms include:

* **Centroid-based Clustering (e.g., K-Means):** Each cluster is represented by a central vector (centroid). K-Means aims to partition data into K clusters, where each data point belongs to the cluster with the nearest centroid. The algorithm iteratively assigns data points to clusters and updates the centroids until convergence.

* **Connectivity-based Clustering (e.g., Hierarchical Clustering):** These algorithms build a hierarchy of clusters.

  * **Agglomerative Hierarchical Clustering:** Starts with each data point as a separate cluster and progressively merges the closest clusters until a single cluster remains or a stopping criterion is met.

  * **Divisive Hierarchical Clustering:** Begins with all data points in one large cluster and recursively divides it into smaller clusters.

* **Density-based Clustering (e.g., DBSCAN):** These algorithms define clusters as areas of high data-point density separated by areas of lower density. DBSCAN can discover clusters of arbitrary shapes and identify outliers as noise.

* **Distribution-based Clustering (e.g., Gaussian Mixture Models):** These algorithms assume that data points within a cluster are generated from a specific probability distribution (e.g., a Gaussian) and aim to find the parameters of these distributions that best fit the data.

* **Neural Network-based Clustering (e.g., Self-Organizing Maps, SOMs):** SOMs are a type of unsupervised neural network that maps high-dimensional data onto a lower-dimensional grid, preserving the topological relationships of the input data. Similar data points are mapped to neighboring nodes on the grid, forming clusters.

The choice of clustering algorithm depends on the characteristics of the data, the desired cluster shapes, and the specific application. A brief comparison sketch follows.
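
A brief comparison sketch of several of these algorithm families on the same synthetic 2-D data with scikit-learn; the data generator and parameter values are assumptions for illustration:

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data with three well-separated blobs (illustration only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# Centroid-based: K-Means with k = 3.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Connectivity-based: agglomerative (bottom-up) hierarchical clustering.
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Density-based: DBSCAN finds dense regions and labels sparse points as noise (-1).
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Distribution-based: Gaussian mixture with three components.
gmm_labels = GaussianMixture(n_components=3, random_state=42).fit_predict(X)

for name, labels in [("K-Means", kmeans_labels), ("Agglomerative", agglo_labels),
                     ("DBSCAN", dbscan_labels), ("GMM", gmm_labels)]:
    print(name, "found", len(set(labels) - {-1}), "clusters")
```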

---

## **Clustering of Textual Data in Machine Learning**


Clustering of textual data in machine learning is an unsupervised learning technique that groups similar text documents together based on their content. Unlike classification, it does not require pre-labeled data, making it useful for exploratory data analysis and for organizing large text collections.

### Process of Text Clustering

* **Text Preprocessing:** Raw text data is cleaned and transformed into a numerical representation. This often involves:

  * **Tokenization:** Breaking text into individual words or sub-word units.

  * **Stop Word Removal:** Eliminating common words (e.g., "the," "is") that carry little semantic meaning.

  * **Stemming or Lemmatization:** Reducing words to their root form (e.g., "running," "ran" to "run").

  * **Vectorization:** Converting text into numerical vectors using techniques like:

    * **TF-IDF (Term Frequency-Inverse Document Frequency):** Assigns weights to words based on their frequency within a document and across the entire corpus.

    * **Word Embeddings (e.g., Word2Vec, GloVe):** Represent words as dense vectors that capture semantic relationships.

    * **Contextual Embeddings (e.g., BERT, GPT embeddings):** Capture word meanings based on their context within a sentence or document, often derived from Large Language Models (LLMs).

* **Clustering Algorithm Application:** Once the text is vectorized, a clustering algorithm is applied to group similar documents:

  * **K-Means:** A centroid-based algorithm that partitions data into k clusters, where k is a pre-defined number. It iteratively assigns data points to the nearest centroid and updates the centroids.

  * **Hierarchical Clustering:** Builds a hierarchy of clusters, either by starting with individual data points and merging them (agglomerative) or by starting with one large cluster and dividing it (divisive).

  * **DBSCAN:** A density-based algorithm that groups closely packed data points and marks points that lie alone in low-density regions as outliers.

  * **Affinity Propagation:** Identifies exemplars (representative data points) and assigns the remaining data points to the clusters defined by these exemplars.

* **Evaluation and Interpretation:** The quality of the clusters is assessed using metrics like the silhouette score, and the clusters are interpreted to understand the themes or topics present in each group. Dimensionality-reduction techniques such as t-SNE or PCA can be used to visualize the clusters in lower dimensions (see the sketch below).
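
A sketch of this evaluation-and-visualization step: TF-IDF vectors are clustered, scored with the silhouette coefficient, and projected to 2-D with TruncatedSVD (a PCA-like method that works on sparse matrices). The corpus and cluster count are placeholder assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Placeholder corpus (illustration only).
docs = [
    "central bank raises interest rates",
    "markets react to inflation report",
    "new smartphone features a faster chip",
    "tech company unveils laptop lineup",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Silhouette score summarizes cluster cohesion vs. separation (range -1 to 1).
print("Silhouette:", silhouette_score(X, labels))

# Project the sparse TF-IDF vectors to 2-D for plotting or inspection.
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
for (x, y), label, doc in zip(coords, labels, docs):
    print(f"cluster {label}  ({x:+.2f}, {y:+.2f})  {doc}")
```
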
### Applications of Text Clustering

* **Document Organization:** Grouping articles, research papers, or legal documents by topic.

* **Topic Modeling:** Discovering hidden thematic structures within a collection of texts.

* **Information Retrieval:** Improving search results by clustering relevant documents.

* **Customer Feedback Analysis:** Grouping similar customer reviews or support tickets to identify common issues.

* **News Article Grouping:** Categorizing news articles based on their content.
