The Automated Data Analysis script simplifies the process of analyzing datasets by performing key statistical operations and generating insights using an LLM (Large Language Model). The script handles the analysis of CSV files, generates visualizations, and creates an insightful report based on the dataset provided.
- Automatically detects the encoding of CSV files.
- Generates summary statistics for numerical columns.
- Identifies and reports missing values.
- Calculates the correlation matrix and creates a heatmap visualization.
- Uses an LLM to provide statistical insights and narrative reports.
- Outputs a comprehensive README with analysis results and visualizations.
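The encoding detection step is not described in detail here, so the following is only a minimal sketch of one way it could work, using the standard library alone. The function name `detect_encoding` and the candidate list are assumptions made for illustration, not the script's actual implementation.

```python
from pathlib import Path

# Hypothetical candidate list, tried in order. latin-1 accepts any byte
# sequence, so it acts as a last-resort fallback.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def detect_encoding(path, candidates=CANDIDATE_ENCODINGS):
    """Return the first candidate encoding that decodes the whole file."""
    raw = Path(path).read_bytes()
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # This candidate cannot represent the bytes; try the next.
        return encoding
    raise ValueError(f"None of {candidates} could decode {path}")
```

The returned name can then be passed to whatever CSV reader is in use, for example through its `encoding` argument. A trial-decode approach like this only proves that the bytes are decodable, not that the guess is the encoding the file was written in.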
- Python Version: Ensure Python 3.11 or above is installed.
- API Token: Set the AIPROXY_TOKEN environment variable for accessing the LLM.
To set the environment variable:

    export AIPROXY_TOKEN=<your_token_here>

Run the script with the following command:

    python automated_analysis.py <path_to_csv_file>

For example:

    python automated_analysis.py data/sample_dataset.csv

- A folder will be created using the name of the dataset file (without the extension).
- Inside this folder:
  - README.md: Contains detailed insights and analysis.
  - correlation_matrix.png: Heatmap visualization of the correlation matrix.
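As an illustration of the naming rule above, the output folder could be derived from the dataset path roughly as follows. This is a sketch under assumptions: `prepare_output_dir` and its `base_dir` parameter are hypothetical names, and the location where the real script creates the folder is not stated in this document.

```python
from pathlib import Path


def prepare_output_dir(csv_path, base_dir="."):
    """Create <base_dir>/<dataset name without extension>/ and return it."""
    # Path.stem drops the directory part and the final extension,
    # so "data/sample_dataset.csv" becomes "sample_dataset".
    out_dir = Path(base_dir) / Path(csv_path).stem
    out_dir.mkdir(parents=True, exist_ok=True)  # Safe to call on reruns.
    return out_dir
```

With this sketch, `prepare_output_dir("data/sample_dataset.csv")` would yield a `sample_dataset/` folder, into which README.md and correlation_matrix.png could be written.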
|-- automated_data_analysis/
    |-- automated_analysis.py
    |-- requirements.txt
    |-- README.md (this file)
    |-- <dataset_folder>/
        |-- README.md
        |-- correlation_matrix.png
The generated README.md includes:
- Summary statistics for numerical columns.
- Missing value counts for each column.
- Key observations about correlations.

The visualizations include:
- Correlation Heatmap: A visual representation of relationships between numerical features.
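The numbers behind these outputs can be approximated with a few pandas calls. The sketch below assumes pandas is available (this document does not list the contents of requirements.txt), and the `analyze` function and the sample data are made up for illustration. It omits the heatmap rendering itself.

```python
import io

import pandas as pd


def analyze(df):
    """Compute summary statistics, missing-value counts, and correlations."""
    numeric = df.select_dtypes(include="number")  # Ignore text columns.
    return {
        "summary": numeric.describe(),   # count, mean, std, min, quartiles, max
        "missing": df.isnull().sum(),    # missing values per column
        "correlation": numeric.corr(),   # pairwise Pearson correlation
    }


# Tiny made-up dataset with one missing value in column "b".
sample = pd.read_csv(io.StringIO("a,b,label\n1,2,x\n2,4,y\n3,,z\n4,8,x\n"))
results = analyze(sample)
print(results["missing"])
print(results["correlation"])
```

The correlation matrix returned here is the kind of table a plotting library would turn into a heatmap image such as correlation_matrix.png.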
The script handles common issues such as:
- Missing API tokens.
- File encoding errors.
- Network timeouts or rate limits during API calls.
If an error occurs, descriptive messages are logged to the console.
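To make the first and third cases concrete, here is a minimal sketch of a token check and a retry loop with exponential backoff. Everything in it is an assumption for illustration: `RetryableError`, `get_api_token`, and `call_with_retries` are hypothetical names, and the actual script may detect and report these conditions differently.

```python
import os
import time


class RetryableError(Exception):
    """Hypothetical marker for transient failures (timeouts, rate limits)."""


def get_api_token(env_var="AIPROXY_TOKEN"):
    """Read the token from the environment, failing with a clear message."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"{env_var} is not set; export it before running the script.")
    return token


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RetryableError as exc:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
```

In practice, the function passed to `call_with_retries` would wrap the LLM request and raise `RetryableError` when it sees a timeout or a rate-limit response. The `sleep` parameter exists so the waiting can be replaced in tests.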
This project is licensed under the MIT License. See the LICENSE file for details.
Developed with ❤️ by Jay Thadeshwar.