This project migrates an on-premises batch processing system to AWS, addressing challenges in reliability, scalability, and maintainability. The architecture leverages AWS services like S3, Glue, and Redshift Serverless to create a robust, serverless data pipeline. All infrastructure is provisioned using Terraform, ensuring consistency and reproducibility.
The repository contains the following Terraform configuration files:
- `backend.tf`: Configures the S3 backend for storing the Terraform state file.
- `vpc.tf`: Defines the VPC, subnets, Internet Gateway, NAT Gateway, Route Tables, and Security Groups.
- `iamrole.tf`: Creates IAM roles and policies for secure service interactions.
- `redshift.tf`: Provisions the Redshift Serverless Workgroup, Namespace, and associated configurations.
- `glue.tf`: Configures Glue jobs, connections, and crawlers for data processing.
- `providers.tf`: Specifies the required Terraform providers and versions.
- `s3.tf`: Creates S3 buckets for raw data, processed data, and backups.
- `variable.tf`: Defines input variables for reusable configurations.
- `sns.tf`: Sets up SNS topics for error notifications.
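As a rough illustration of what `backend.tf` and `providers.tf` typically contain, here is a minimal sketch. The bucket name, state key, region, and version constraints are placeholders chosen for this example; they are assumptions, not the values used in this repository.

```hcl
# backend.tf (sketch): keep the Terraform state file in an S3 bucket.
# Bucket, key, and region are hypothetical placeholders.
terraform {
  backend "s3" {
    bucket  = "example-terraform-state-bucket" # must exist before `terraform init`
    key     = "batch-processing/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true # encrypt the state object at rest
  }
}

# providers.tf (sketch): pin Terraform and the AWS provider.
# The version constraints below are assumptions for illustration.
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1" # assumed region
}
```

The state bucket is not created by this configuration; it has to exist before `terraform init` can configure the backend.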
Key features:
- Serverless Architecture: Uses managed services such as Redshift Serverless and Glue for scalability, eliminating infrastructure management overhead.
- Automated Data Pipeline: Glue jobs are scheduled to run hourly to process new data files.
- Error Handling: SNS sends email notifications for Glue job failures.
- Infrastructure as Code: All resources are deployed using Terraform, ensuring consistency and reproducibility.
- Security: Redshift is deployed in a private subnet of a VPC, access follows the principle of least privilege, and credentials are stored securely in Secrets Manager.
- Cost Optimization: Pay-for-use model with no upfront infrastructure costs.
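The hourly schedule mentioned above could be expressed with a scheduled Glue trigger. This is a minimal sketch, not the repository's actual `glue.tf`: the trigger name and the `glue_job_name` variable are hypothetical, and the job is assumed to be defined elsewhere.

```hcl
# Hypothetical input: the name of an existing Glue job to run.
variable "glue_job_name" {
  description = "Name of the Glue job that processes new data files"
  type        = string
}

# Scheduled trigger that starts the job at minute 0 of every hour (UTC).
resource "aws_glue_trigger" "hourly" {
  name              = "hourly-batch-trigger" # hypothetical name
  type              = "SCHEDULED"
  schedule          = "cron(0 * * * ? *)"
  start_on_creation = true # activate the schedule as soon as it is created

  actions {
    job_name = var.glue_job_name
  }
}
```

AWS cron expressions have six fields (minutes, hours, day of month, month, day of week, year), so `0 * * * ? *` fires once at the top of each hour.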
Prerequisites:
- AWS Account: Ensure you have an active AWS account.
- Terraform: Install Terraform on your local machine.
- AWS CLI: Configure the AWS CLI with your credentials (access key and secret key).
To deploy the infrastructure, follow these steps:
- Clone the Repository: `git clone https://github.com/HakeemSalaudeen/salesproject-batch-processing-on-AWS.git`
- Initialize Terraform: `terraform init`
- Review Variables: Update the `variable.tf` file with your specific configurations (Redshift credentials).
- Deploy Infrastructure: `terraform apply`
- Verify Deployment:
  - Check the AWS Management Console to confirm that all resources were created.
  - Test the data pipeline by uploading a file to the S3 bucket.
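For the "Review Variables" step, the Redshift credentials might be declared along the following lines. This is a sketch under assumptions: the variable names and the length check are illustrative and may not match the ones in this repository's `variable.tf`.

```hcl
# Hypothetical variable names; adjust to match variable.tf in the repository.
variable "redshift_admin_username" {
  description = "Admin user name for the Redshift Serverless namespace"
  type        = string
}

variable "redshift_admin_password" {
  description = "Admin password for the Redshift Serverless namespace"
  type        = string
  sensitive   = true # value is redacted in plan and apply output

  validation {
    # Illustrative check only; Redshift enforces its own password rules.
    condition     = length(var.redshift_admin_password) >= 8
    error_message = "The Redshift admin password must be at least 8 characters long."
  }
}
```

Declaring the password without a default keeps the secret out of version control. Values can then be supplied through a `terraform.tfvars` file that is excluded from the repository or through `TF_VAR_`-prefixed environment variables.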
Code quality:
- Code Linting: All Terraform files are formatted using `terraform fmt` for consistency.
- Best Practices: Follows Terraform best practices for modularity and readability.
Monitoring:
- CloudWatch: Use CloudWatch to monitor Glue jobs, Redshift performance, and system logs.
- SNS Alerts: Configure SNS topics to receive notifications for failures.
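One common way to route Glue job failures to an SNS email subscription is an EventBridge rule on Glue job state changes. The sketch below assumes that approach; this README does not state how `sns.tf` wires the notifications, and the resource names and the `alert_email` variable are hypothetical.

```hcl
# Hypothetical input: address that should receive failure alerts.
variable "alert_email" {
  description = "Email address for Glue failure notifications"
  type        = string
}

resource "aws_sns_topic" "glue_failures" {
  name = "glue-job-failures" # hypothetical name
}

# Email subscriptions stay pending until the recipient confirms them.
resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.glue_failures.arn
  protocol  = "email"
  endpoint  = var.alert_email
}

# Match Glue job runs that end in a failed or timed-out state.
resource "aws_cloudwatch_event_rule" "glue_job_failed" {
  name = "glue-job-failed"

  event_pattern = jsonencode({
    source        = ["aws.glue"]
    "detail-type" = ["Glue Job State Change"]
    detail = {
      state = ["FAILED", "TIMEOUT"]
    }
  })
}

# Forward matching events to the SNS topic.
resource "aws_cloudwatch_event_target" "to_sns" {
  rule = aws_cloudwatch_event_rule.glue_job_failed.name
  arn  = aws_sns_topic.glue_failures.arn
}

# Allow EventBridge to publish to the topic.
data "aws_iam_policy_document" "allow_eventbridge" {
  statement {
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.glue_failures.arn]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

resource "aws_sns_topic_policy" "glue_failures" {
  arn    = aws_sns_topic.glue_failures.arn
  policy = data.aws_iam_policy_document.allow_eventbridge.json
}
```

Without the topic policy, EventBridge is not permitted to publish to the topic, so matching events would not reach subscribers.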
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Submit a pull request with a detailed description of your changes.
Happy Coding! 🚀