CODING – ANALYSIS
import pandas as pd
import textstat
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

def analyze_text(text):
    # Tokenization
    sentences = sent_tokenize(text)
    words = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    filtered_words = [word.lower() for word in words if word.lower() not in stop_words]
    # Calculate basic statistics
    word_count = len(words)
    sentence_count = len(sentences)
    avg_sentence_length = word_count / sentence_count if sentence_count else 0
    # Calculate readability score
    readability = textstat.flesch_reading_ease(text)
    # Calculate lexical diversity (unique filtered words / total filtered words)
    lexical_diversity = len(set(filtered_words)) / len(filtered_words) if filtered_words else 0
    # Calculate average syllables per word
    # (textstat.syllable_count; lexicon_count counts words, not syllables)
    syllable_count = textstat.syllable_count(text)
    avg_syllables_per_word = syllable_count / word_count if word_count else 0
    # Calculate other features
    # ...
    return {
        'word_count': word_count,
        'sentence_count': sentence_count,
        'avg_sentence_length': avg_sentence_length,
        'readability_score': readability,
        'lexical_diversity': lexical_diversity,
        'avg_syllables_per_word': avg_syllables_per_word,
        # ... add more features here
    }
if __name__ == "__main__":
    # Load data from Excel
    input_df = pd.read_excel("Input.xlsx")
    # Analyze text and create a new DataFrame
    results_df = pd.DataFrame(columns=[
        'word_count',
        'sentence_count',
        'avg_sentence_length',
        'readability_score',
        'lexical_diversity',
        'avg_syllables_per_word',
        # ... add more columns here
    ])
    for index, row in input_df.iterrows():
        text = row["Text"]
        results_df.loc[index] = analyze_text(text)
    # Save results to Excel
    results_df.to_excel("output.xlsx", index=False)
    print("Analysis complete. Results saved to 'output.xlsx'.")
Setup (run once before executing the script):

    pip install nltk textstat

Then, in Python, download the required NLTK data:

    import nltk
    nltk.download('punkt')
    nltk.download('stopwords')
REPORT:
1. Objectives:
To develop Python code that can effectively analyze textual data.
To extract key linguistic features, such as word count, sentence count, average sentence
length, readability score, and lexical diversity.
To generate a comprehensive report of the extracted features for further analysis and
interpretation.
2. Methodology:
Data Acquisition: The code reads input text data from an Excel file ("Input.xlsx").
Text Preprocessing:
Tokenization: The text is divided into individual words and sentences using NLTK's
word_tokenize() and sent_tokenize() functions.
Stop Word Removal: Common stop words (e.g., "the," "a," "is") are removed to focus
on more meaningful words.
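A minimal sketch of this filtering step; the tiny hard-coded stop-word set below is a stand-in for NLTK's full English list (`stopwords.words('english')`):

```python
# Illustrative stop-word set; in the actual script this comes from NLTK.
STOP_WORDS = {"the", "a", "is", "on", "and"}

def filter_stop_words(words):
    # Lowercase each token and keep only those not in the stop-word set
    return [w.lower() for w in words if w.lower() not in STOP_WORDS]

tokens = ["The", "cat", "is", "on", "the", "mat"]
print(filter_stop_words(tokens))  # ['cat', 'mat']
```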
Feature Extraction:
Basic Statistics: Word count, sentence count, and average sentence length are calculated.
Readability: The Flesch Reading Ease score is calculated using the textstat library,
providing an estimate of text readability.
Lexical Diversity: The lexical diversity of the text is calculated, indicating the variety of
words used.
Syllable Count: The average number of syllables per word is calculated using
textstat.syllable_count().
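The two derived metrics above can also be computed directly. This sketch shows the standard Flesch Reading Ease formula and a simple type-token ratio without the textstat dependency; the example counts (6 words, 1 sentence, 6 syllables) are hand-tallied for illustration:

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    # Standard Flesch Reading Ease formula: higher scores mean easier text
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))

def lexical_diversity(words):
    # Type-token ratio: unique words divided by total words
    return len(set(words)) / len(words)

# "The cat sat on the mat." -> 6 words, 1 sentence, 6 syllables
score = flesch_reading_ease(6, 1, 6)  # about 116.1: very easy text
ttr = lexical_diversity(["the", "cat", "sat", "on", "the", "mat"])  # 5 unique / 6 total
```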
Output Generation: The extracted features are organized into a Pandas DataFrame and
exported to a new Excel file ("output.xlsx") for further analysis and visualization.
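One way to sketch this assembly step: a list of per-text feature dictionaries becomes a DataFrame with one row per text, with columns inferred from the keys (the feature values below are illustrative, not real output):

```python
import pandas as pd

# Each analyzed text yields one feature dictionary; a list of them
# becomes one DataFrame row per text, columns inferred from the keys.
features = [
    {"word_count": 6, "sentence_count": 1, "avg_sentence_length": 6.0},
    {"word_count": 12, "sentence_count": 2, "avg_sentence_length": 6.0},
]
results_df = pd.DataFrame(features)
# results_df.to_excel("output.xlsx", index=False)  # requires openpyxl
print(results_df.shape)  # (2, 3)
```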
3. Code Implementation:
The Python code is structured as follows:
analyze_text(text) function: This core function performs the text analysis, extracting the
specified features and returning them as a dictionary.
Main Execution Block:
Loads input data from the Excel file.
Creates an empty DataFrame to store the results.
Iterates through each text in the input data and calls the analyze_text() function.
Stores the results in the DataFrame.
Saves the results DataFrame to a new Excel file.
4. Results:
The output Excel file ("output.xlsx") contains the extracted features for each input text, enabling
further analysis and interpretation. These features can be used for various purposes, such as:
Readability Assessment: Evaluating the readability of different texts for specific
audiences.
Text Complexity Analysis: Identifying complex or challenging texts.
Style Analysis: Comparing the writing styles of different authors or texts.
Content Analysis: Understanding the vocabulary and structure of different types of text.
5. Conclusion:
This project successfully demonstrates the development of Python code for analyzing textual
data. The code effectively extracts key linguistic features, providing a foundation for further
exploration and analysis. The modular design and clear documentation make it adaptable for
various text analysis tasks.