Text Analysis - Py

The document outlines a Python program designed to analyze textual data by extracting key linguistic features such as word count, sentence count, readability score, and lexical diversity. It details the methodology, including data acquisition from an Excel file, text preprocessing, and feature extraction, culminating in the generation of an output Excel file with the results. The project demonstrates effective text analysis, providing insights for readability assessment, text complexity analysis, and style comparison.


CODING – ANALYSIS

import pandas as pd
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from textstat import flesch_reading_ease, syllable_count

def analyze_text(text):
    # Tokenization
    sentences = sent_tokenize(text)
    words = word_tokenize(text)

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    filtered_words = [word.lower() for word in words if word.lower() not in stop_words]

    # Calculate basic statistics
    word_count = len(words)
    sentence_count = len(sentences)
    avg_sentence_length = word_count / sentence_count

    # Calculate readability score
    readability = flesch_reading_ease(text)

    # Calculate lexical diversity (unique filtered words / total filtered words)
    lexical_diversity = len(set(filtered_words)) / len(filtered_words)

    # Calculate average syllables per word
    syllable_total = syllable_count(text)
    avg_syllables_per_word = syllable_total / word_count

    # Calculate other features
    # ...

    return {
        'word_count': word_count,
        'sentence_count': sentence_count,
        'avg_sentence_length': avg_sentence_length,
        'readability_score': readability,
        'lexical_diversity': lexical_diversity,
        'avg_syllables_per_word': avg_syllables_per_word,
        # ... add more features here
    }

if __name__ == "__main__":
    # Load data from Excel
    input_df = pd.read_excel("Input.xlsx")

    # Create an empty DataFrame to hold one row of features per input text
    results_df = pd.DataFrame(columns=[
        'word_count',
        'sentence_count',
        'avg_sentence_length',
        'readability_score',
        'lexical_diversity',
        'avg_syllables_per_word',
        # ... add more columns here
    ])

    # Analyze each text and store its features as a row
    for index, row in input_df.iterrows():
        text = row["Text"]
        results_df.loc[index] = analyze_text(text)

    # Save results to Excel
    results_df.to_excel("output.xlsx", index=False)

    print("Analysis complete. Results saved to 'output.xlsx'.")

Setup (run once): install the dependencies, then download the NLTK tokenizer and stop-word data.

pip install nltk textstat

import nltk
nltk.download('punkt')
nltk.download('stopwords')
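The Flesch Reading Ease score that textstat computes follows a fixed, well-known formula. A minimal pure-Python sketch, with the word, sentence, and syllable counts supplied by the caller rather than measured from text (textstat does the counting internally):

```python
def flesch_reading_ease_score(words, sentences, syllables):
    """Standard Flesch Reading Ease formula from pre-computed counts."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# Example: 100 words, 5 sentences, 130 syllables
score = flesch_reading_ease_score(100, 5, 130)  # about 76.6 ("fairly easy")
```

Higher scores mean easier text; typical English prose lands roughly between 30 and 80.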
REPORT:

1. Objectives:

• To develop Python code that can effectively analyze textual data.
• To extract key linguistic features, such as word count, sentence count, average sentence length, readability score, and lexical diversity.
• To generate a comprehensive report of the extracted features for further analysis and interpretation.

2. Methodology:

• Data Acquisition: The code reads input text data from an Excel file ("Input.xlsx").
• Text Preprocessing:
  • Tokenization: The text is divided into individual sentences and words using NLTK's sent_tokenize() and word_tokenize() functions.
  • Stop Word Removal: Common stop words (e.g., "the," "a," "is") are removed to focus on more meaningful words.
• Feature Extraction:
  • Basic Statistics: Word count, sentence count, and average sentence length are calculated.
  • Readability: The Flesch Reading Ease score is calculated using the textstat library, providing an estimate of text readability.
  • Lexical Diversity: The lexical diversity of the text is calculated as the ratio of unique words to total words, indicating the variety of vocabulary used.
  • Syllable Count: The average number of syllables per word is calculated using textstat's syllable_count().
• Output Generation: The extracted features are organized into a Pandas DataFrame and exported to a new Excel file ("output.xlsx") for further analysis and visualization.
3. Code Implementation:

The Python code is structured as follows:

• analyze_text(text) function: This core function performs the text analysis, extracting the specified features and returning them as a dictionary.
• Main Execution Block:
  • Loads input data from the Excel file.
  • Creates an empty DataFrame to store the results.
  • Iterates through each text in the input data and calls analyze_text().
  • Stores the results in the DataFrame.
  • Saves the results DataFrame to a new Excel file.
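The main-block flow above can be sketched with a stub analyzer standing in for analyze_text() (the stub and its two features are placeholders; the real function returns the full feature dictionary, and the resulting list of dicts is what populates the DataFrame row by row):

```python
def analyze_stub(text):
    # Placeholder for analyze_text(): returns only the two simplest features
    words = text.split()
    return {"word_count": len(words), "sentence_count": text.count(".")}

texts = ["First text.", "Second text here.", "Third."]

# One feature dict per input text, in input order
results = [analyze_stub(t) for t in texts]
# pandas.DataFrame(results) would then build the results table from these dicts
```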

4. Results:

The output Excel file ("output.xlsx") contains the extracted features for each input text, enabling
further analysis and interpretation. These features can be used for various purposes, such as:

• Readability Assessment: Evaluating the readability of different texts for specific audiences.
• Text Complexity Analysis: Identifying complex or challenging texts.
• Style Analysis: Comparing the writing styles of different authors or texts.
• Content Analysis: Understanding the vocabulary and structure of different types of text.
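For readability assessment in particular, Flesch scores are conventionally grouped into bands. A coarse sketch of that interpretation (labels and cut-offs follow the common convention, slightly simplified by merging adjacent bands):

```python
def flesch_band(score):
    # Conventional Flesch Reading Ease interpretation, coarsened
    if score >= 90:
        return "very easy"
    if score >= 70:
        return "easy"
    if score >= 60:
        return "standard"
    if score >= 30:
        return "difficult"
    return "very confusing"

flesch_band(65)  # "standard" — plain English, suitable for a general audience
```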

5. Conclusion:

This project successfully demonstrates the development of Python code for analyzing textual
data. The code effectively extracts key linguistic features, providing a foundation for further
exploration and analysis. The modular design and clear documentation make it adaptable for
various text analysis tasks.
