CODING – ANALYSIS
import pandas as pd
import textstat
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

def analyze_text(text):
    # Tokenization
    sentences = sent_tokenize(text)
    words = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    filtered_words = [word.lower() for word in words if word.lower() not in stop_words]
    # Calculate basic statistics
    word_count = len(words)
    sentence_count = len(sentences)
    avg_sentence_length = word_count / sentence_count if sentence_count else 0
    # Calculate readability score
    readability = textstat.flesch_reading_ease(text)
    # Calculate lexical diversity (unique filtered words / total filtered words)
    lexical_diversity = len(set(filtered_words)) / len(filtered_words) if filtered_words else 0
    # Calculate average syllables per word
    # (textstat.syllable_count; lexicon_count counts words, not syllables)
    syllable_count = textstat.syllable_count(text)
    avg_syllables_per_word = syllable_count / word_count if word_count else 0
    # Calculate other features
    # ...
    return {
        'word_count': word_count,
        'sentence_count': sentence_count,
        'avg_sentence_length': avg_sentence_length,
        'readability_score': readability,
        'lexical_diversity': lexical_diversity,
        'avg_syllables_per_word': avg_syllables_per_word,
        # ... add more features here
    }
if __name__ == "__main__":
    # Load data from Excel
    input_df = pd.read_excel("Input.xlsx")
    # Analyze text and create a new DataFrame
    results_df = pd.DataFrame(columns=[
        'word_count',
        'sentence_count',
        'avg_sentence_length',
        'readability_score',
        'lexical_diversity',
        'avg_syllables_per_word',
        # ... add more columns here
    ])
    for index, row in input_df.iterrows():
        text = row["Text"]
        results_df.loc[index] = analyze_text(text)
    # Save results to Excel
    results_df.to_excel("output.xlsx", index=False)
    print("Analysis complete. Results saved to 'output.xlsx'.")
Setup (run once before executing the script):

    pip install nltk textstat

Then, in Python, download the required NLTK data:

    import nltk
    nltk.download('punkt')
    nltk.download('stopwords')
REPORT:
1. Objectives:
To develop Python code that can effectively analyze textual data.
To extract key linguistic features, such as word count, sentence count, average sentence
length, readability score, and lexical diversity.
To generate a comprehensive report of the extracted features for further analysis and
interpretation.
2. Methodology:
Data Acquisition: The code reads input text data from an Excel file ("Input.xlsx").
Text Preprocessing:
Tokenization: The text is divided into individual words and sentences using NLTK's
word_tokenize() and sent_tokenize() functions.
Stop Word Removal: Common stop words (e.g., "the," "a," "is") are removed to focus
on more meaningful words.
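A minimal sketch of this filtering step; the tiny hard-coded stop-word set below is a stand-in for NLTK's full English list (`stopwords.words('english')`):

```python
# Illustrative stop-word set; in the actual script this comes from NLTK.
STOP_WORDS = {"the", "a", "is", "on", "and"}

def filter_stop_words(words):
    # Lowercase each token and keep only those not in the stop-word set
    return [w.lower() for w in words if w.lower() not in STOP_WORDS]

tokens = ["The", "cat", "is", "on", "the", "mat"]
print(filter_stop_words(tokens))  # ['cat', 'mat']
```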
Feature Extraction:
Basic Statistics: Word count, sentence count, and average sentence length are calculated.
Readability: The Flesch Reading Ease score is calculated using the textstat library,
providing an estimate of text readability.
Lexical Diversity: The lexical diversity of the text is calculated, indicating the variety of
words used.
Syllable Count: The average number of syllables per word is calculated using
textstat.syllable_count().
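The two derived metrics above can also be computed directly. This sketch shows the standard Flesch Reading Ease formula and a simple type-token ratio without the textstat dependency; the example counts (6 words, 1 sentence, 6 syllables) are hand-tallied for illustration:

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    # Standard Flesch Reading Ease formula: higher scores mean easier text
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))

def lexical_diversity(words):
    # Type-token ratio: unique words divided by total words
    return len(set(words)) / len(words)

# "The cat sat on the mat." -> 6 words, 1 sentence, 6 syllables
score = flesch_reading_ease(6, 1, 6)  # about 116.1: very easy text
ttr = lexical_diversity(["the", "cat", "sat", "on", "the", "mat"])  # 5 unique / 6 total
```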
Output Generation: The extracted features are organized into a Pandas DataFrame and
exported to a new Excel file ("output.xlsx") for further analysis and visualization.
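One way to sketch this assembly step: a list of per-text feature dictionaries becomes a DataFrame with one row per text, with columns inferred from the keys (the feature values below are illustrative, not real output):

```python
import pandas as pd

# Each analyzed text yields one feature dictionary; a list of them
# becomes one DataFrame row per text, columns inferred from the keys.
features = [
    {"word_count": 6, "sentence_count": 1, "avg_sentence_length": 6.0},
    {"word_count": 12, "sentence_count": 2, "avg_sentence_length": 6.0},
]
results_df = pd.DataFrame(features)
# results_df.to_excel("output.xlsx", index=False)  # requires openpyxl
print(results_df.shape)  # (2, 3)
```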
3. Code Implementation:
The Python code is structured as follows:
analyze_text(text) function: This core function performs the text analysis, extracting the
specified features and returning them as a dictionary.
Main Execution Block:
Loads input data from the Excel file.
Creates an empty DataFrame to store the results.
Iterates through each text in the input data and calls the analyze_text() function.
Stores the results in the DataFrame.
Saves the results DataFrame to a new Excel file.
4. Results:
The output Excel file ("output.xlsx") contains the extracted features for each input text, enabling
further analysis and interpretation. These features can be used for various purposes, such as:
Readability Assessment: Evaluating the readability of different texts for specific
audiences.
Text Complexity Analysis: Identifying complex or challenging texts.
Style Analysis: Comparing the writing styles of different authors or texts.
Content Analysis: Understanding the vocabulary and structure of different types of text.
5. Conclusion:
This project successfully demonstrates the development of Python code for analyzing textual
data. The code effectively extracts key linguistic features, providing a foundation for further
exploration and analysis. The modular design and clear documentation make it adaptable for
various text analysis tasks.