RP 7
Abstract—This paper addresses the pressing need for an accurate solar energy prediction model, which is crucial for efficient grid integration. We explore the influence of the Air Quality Index and weather features on solar energy generation, employing advanced Machine Learning and Deep Learning techniques. Our methodology uses time series modeling and makes novel use of power transform normalization and zero-inflated modeling. Various Machine Learning algorithms and Deep Learning models based on the Conv2D Long Short-Term Memory model are applied to these transformations for precise predictions. Results underscore the effectiveness of our approach, demonstrating enhanced prediction accuracy with Air Quality Index and weather features. We achieved a 0.9691 R2 Score, 0.18 MAE, and 0.10 RMSE with the Conv2D Long Short-Term Memory model, showcasing the power transform technique's innovation in enhancing time series forecasting for solar energy generation. Such results help our research contribute valuable insights to the synergy between Air Quality Index, weather features, and Deep Learning techniques for solar energy prediction.

Index Terms—Solar Power Generation, Zero Inflated Model, Power Transform, Time series, LSTM, Deep Learning

I. INTRODUCTION
In the modern world, it has become increasingly clear that eliminating fossil fuels is one of the major requirements for achieving a carbon-neutral future. The Working Group III Special Report on Renewable Energy Sources and Climate Change Mitigation (SRREN) [19] suggests that consumption of fossil fuels accounts for the majority of anthropogenic GHG emissions worldwide. It states that CO2 concentrations had risen to over 390 ppm, around 39% above pre-industrial levels, by the end of 2010. In the race to an efficient energy ecosystem, solar energy is a promising renewable resource, but its intermittent nature poses challenges for integration into the grid. A recent study shows that India lost 29% of its utilizable global horizontal irradiance potential due to air pollution in the period between 2008 and 2018 [3]. Effective prediction of solar power generation is crucial for efficient planning and management of solar resources. Renewable energy like solar power is said to benefit human beings in many different ways, the most important of which is in the health domain. Research by Galimova et al. [4] suggests that if the world undergoes a global transition by 2050 and energy sector emissions drop by 92%, premature deaths caused by air pollution can be reduced by 97%. This study reinforced the significance of considering environmental factors in our solar energy forecasting models. This study examines the use of machine learning algorithms that incorporate Air Quality Index (AQI) and meteorological features to improve forecast accuracy.

In order to shed some light on the inconsistent patterns of solar generation data, a number of regression models were initially utilised to predict the per-hour generation of solar power. We benchmarked these models, of which the chief ones were Linear Regression, Lasso, Ridge, ElasticNet, and ensemble models like RandomForest and XGBoost. These models utilize different methodologies, which were compared to determine the model with optimal performance and to provide valuable insights into the future of solar generation [1]. Additionally, this paper describes a few novel methods, such as implementing a Zero Inflated Model and scaling the data using Power Transform, which can significantly improve solar power predictions irrespective of the irregularity in the data. This study also incorporates the Convolutional Long Short-Term Memory 2D (ConvLSTM2D) network in the prediction process. ConvLSTM2D is a deep learning algorithm that combines the spatial processing capabilities of Convolutional Neural Networks (CNNs) with the sequential processing capabilities of Long Short-Term Memory (LSTM) networks. Using the spatiotemporal dependence of solar generation data, the ConvLSTM2D network improves the solar energy generation forecast accuracy by building upon historical data.
The novel contributions of this work can be summarized as follows: This study hopes to investigate and optimise the performance of the aforementioned machine learning algorithms and provide a clear and accurate picture of solar generation by considering weather features and AQI data, which are theorized to have an impact on the fluctuation of this data. It is envisaged that the methodologies employed in this paper will contribute considerably to painting a clearer picture of the sporadic nature of solar power and its influencing factors. The thorough benchmarking and prediction pipeline provides significant benefits for the efficient and sustainable utilization of solar resources by stakeholders, contributing to the adoption and use of practices that ultimately power our society into a clean energy future.

II. SURVEY OF LITERATURE
Various studies have used machine learning algorithms to increase the understanding and improvement of solar power forecasting models. Chuluunsaikhan et al. [1] discuss the importance of considering environmental factors such as climate and air pollution when predicting solar power generation. They state that solar panels work best when there is sunlight and no partial shade. However, factors such as weather conditions (e.g. clouds or rain) and air pollution (e.g. fine dust) can cause partial shading and reduce the power output of solar panels. The authors propose a method to regulate the power output of solar panels through machine learning. Machine learning models are developed with three components: weather components, air pollution components, and combined meteorological and air pollution components. The datasets used in the study were collected from 2017 to 2019 from the Seoul province of South Korea. The paper describes the methodology used, including data acquisition, feature extraction, model training, and power output prediction. The authors compare machine learning models, such as linear regression, k-Nearest Neighbors (kNN), Support Vector Regression (SVR), Multi-Layer Perceptron (MLP), Random Forest Regressor (RFR), and Gradient-Boosting Regressor (GBR). Models are evaluated using quantitative error measures such as MAE, Coefficient of Determination (R2), and Root Mean Square Error (RMSE). Experimental results show that weather and air pollution parameters can be effective predictors. This paper has been the main premise of our research.

Zhou et al. [2] present a different approach to forecasting short-term solar power output in smart cities by using deep learning techniques. They used a combination of clustering, CNN, LSTM, and attention mechanisms that obtained improved accuracy in predicting future energy generation. The authors proposed that future work could develop training models with time series-based data for further improvement, a proposal that we took under consideration. This study and research by Zhou et al. [9] motivated us to incorporate air quality index as a feature in our machine learning models, since they [9] used community multiscale air quality modeling in their research, indicating how air pollutants can contribute to soiling of PV panels and thereby affect solar power generation, which is also mentioned by Chiteka et al. [12]. Additionally, Jia et al. [5] presented models to predict solar radiation; even though our research is based on solar power generation, this paper gave us important insights regarding the use of machine learning models in solar forecasting under various weather conditions. Along with this, we also considered how machine learning models are computationally better than physical modeling methods, since we can use historical data to train the model and predict new data, which is difficult to do in the physical modeling approach. In addition, the studies by Jebli, Liu, Sweerts et al. [6][7][8] helped us understand the importance of addressing topics like Pearson correlation, air pollutant deposition effects, and random forest optimization in our research. Incorporating this increased the accuracy of the prediction models, clearly indicating how different factors and approaches combined can enhance solar power generation prediction.

Along with machine learning models, many studies suggested the use of deep learning methods for predicting solar power generation. The application of models like CNNs and Recurrent Neural Networks (RNNs) exhibits the effectiveness of these deep learning techniques in capturing complex patterns and dependencies in solar generation, as signified by Lee et al. [11]. The study by Zazoum et al. [10] also evaluates the accuracy and reliability of deep learning methods in forecasting solar PV power generation, which is essential for effective grid integration and energy management. To address the unique characteristics of their dataset, which exhibited an excess of zero values, Kim et al. [13] explored alternative statistical models beyond the traditional Poisson regression. In addition to the Poisson model, the zero-inflated model was employed, acknowledging its ability to effectively handle datasets with an excessive proportion of observed zero values. By employing the zero-inflated model, the researchers sought to capture the dual processes contributing to the occurrence of zeros, distinguishing between structural zeros and excess zeros. Thomas et al. [14] also emphasize the need for a modeling framework for univariate and multivariate zero-inflated time series of counts. The basic modeling framework used is observation-driven Poisson regression with a Generalized Linear Model (GLM) structure.

The Zero-Inflated Poisson (ZIP) model is employed to capture the possibility of extra observed zeros relative to the Poisson distribution, a common feature in count data. Using these insights, we also utilized zero-inflated models in our research. Yeom et al. [15] introduce a novel approach using deep learning models, specifically ConvLSTM networks, to predict short-term solar radiation by incorporating geostationary satellite images. The proposed model showed high accuracy in capturing cloud-induced variations in ground-level solar radiation compared to the conventional Artificial Neural Network (ANN) and RFR models. This paper led us to implement ConvLSTM2D on our dataset too.
III. METHODOLOGY
Fig. 1. Location where dataset was sourced from (La Trobe University) along
with nearest AQI station (Macleod, Victoria)
A. Dataset
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon \qquad (1) \]
• Gradient Boosting Regression
It is an ensemble model that combines multiple weak
prediction models, such as decision trees, into a single
strong predictive model. Gradient Boosting works by
repeatedly training a new weak model to fit the negative
gradient of the loss function evaluated at the current
prediction, so that each added model corrects the errors
of the ensemble built so far.
\[ \hat{y}_i = \sum_{k=1}^{K} f_k(x_i) = f_0(x_i) + \sum_{k=1}^{K} \gamma_k h_k(x_i) \qquad (2) \]
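As an illustration of the boosting update in (2), the following is a minimal sketch, not the pipeline used in this work. It assumes a squared-error loss (whose negative gradient is simply the residual), depth-1 regression trees as the weak learners h_k, and a constant step size standing in for every γ_k. The helper names `fit_stump`, `stump_predict`, and `fit_boosting`, and the synthetic data, are hypothetical.

```python
import numpy as np

def fit_stump(x, r):
    """Depth-1 regression tree on a 1-D feature, fitted to residuals r."""
    order = np.argsort(x)
    xs, rs = x[order], r[order]
    n = len(rs)
    csum, total_sq = np.cumsum(rs), float(np.sum(rs ** 2))
    # Fallback: a constant predictor if no valid split exists.
    best = (np.inf, xs[0], rs.mean(), rs.mean())
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # cannot split between identical feature values
        left_sum, right_sum = csum[i - 1], csum[-1] - csum[i - 1]
        # Sum of squared errors when each side predicts its own mean.
        sse = total_sq - left_sum ** 2 / i - right_sum ** 2 / (n - i)
        if sse < best[0]:
            best = (sse, 0.5 * (xs[i - 1] + xs[i]), left_sum / i, right_sum / (n - i))
    return best[1:]  # (threshold, left_value, right_value)

def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def fit_boosting(x, y, n_rounds=50, step=0.1):
    """Return f_0 and the list of weak learners h_1..h_K."""
    f0 = float(np.mean(y))            # constant initial model f_0
    prediction = np.full(len(y), f0)
    stumps = []
    for _ in range(n_rounds):
        residual = y - prediction     # negative gradient of 0.5 * (y - f)^2
        stump = fit_stump(x, residual)
        prediction = prediction + step * stump_predict(stump, x)
        stumps.append(stump)
    return f0, stumps, prediction

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 6.0, size=300)
y = np.sin(x) + rng.normal(0.0, 0.1, size=300)

f0, stumps, fitted = fit_boosting(x, y)
mse_start = float(np.mean((y - f0) ** 2))
mse_end = float(np.mean((y - fitted) ** 2))
print(f"training MSE: {mse_start:.4f} -> {mse_end:.4f}")
```

With a squared-error loss and a step size below 1, each round can only lower the training error, which is why the final training MSE falls below that of the constant model f_0.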
\[ f(y; \mu, p) = \frac{y^{p-1} \exp\left( \frac{\mu^{1-p}}{1-p} \right)}{A(\mu, p)} \exp\left( -\frac{y^{p}}{A(\mu, p)} \right) \]
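To make the link between the Tweedie family and zero-heavy data concrete, the sketch below uses the standard compound Poisson-gamma representation, which holds for a power parameter strictly between 1 and 2 (such as the value 1.5 used for the H2O models). It is an illustration under stated assumptions rather than the fitting procedure of this work: it uses the usual mean and dispersion parameterization (μ, φ, p) instead of the normalizing term A(μ, p) in the formula above, and the function name `sample_tweedie` and the parameter values are hypothetical.

```python
import numpy as np

def sample_tweedie(mu, phi, p, size, rng):
    """Draw Tweedie samples for 1 < p < 2 as a Poisson sum of gamma variables."""
    if not 1.0 < p < 2.0:
        raise ValueError("this sketch only covers 1 < p < 2")
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))   # Poisson rate (number of events)
    shape = (2.0 - p) / (p - 1.0)               # gamma shape of each event
    scale = phi * (p - 1.0) * mu ** (p - 1.0)   # gamma scale of each event
    counts = rng.poisson(lam, size=size)
    y = np.zeros(size)
    positive = counts > 0
    # A sum of n i.i.d. Gamma(shape, scale) draws is Gamma(n * shape, scale);
    # n = 0 events leaves the value at exactly zero.
    y[positive] = rng.gamma(shape * counts[positive], scale)
    return y, lam

rng = np.random.default_rng(0)
mu, phi, p = 2.0, 1.5, 1.5                      # illustrative values only
samples, lam = sample_tweedie(mu, phi, p, size=200_000, rng=rng)

zero_fraction = float(np.mean(samples == 0.0))
sample_mean = float(samples.mean())
print(f"share of exact zeros: {zero_fraction:.4f} (theory: {np.exp(-lam):.4f})")
print(f"sample mean: {sample_mean:.4f} (theory: {mu:.4f})")
```

The point mass at zero, with probability exp(-λ), combined with a continuous positive part is what makes this family a natural candidate for hourly generation data that is exactly zero at night and positive during the day.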
Thus, on noticing this skewness of the solar energy Zero-Inflated Model with Tweedie Distribution
Fig. 5. A plot demonstrating a standard Tweedie distribution.

The zero-inflated model with the Tweedie distribution can be represented using the following formula:

\[
Y =
\begin{cases}
0, & \text{with probability } \pi \\
Z, & \text{with probability } 1 - \pi
\end{cases}
\]

\[ \log\left( \frac{\mu}{\phi} \right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \]

\[ \log(\phi) = \gamma_0 + \gamma_1 X_1 + \gamma_2 X_2 + \dots + \gamma_q X_q \]

\[ \log\left( \frac{\pi}{1 - \pi} \right) = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \dots + \theta_r X_r \]

where:
- $Y$ represents the response variable (count variable with excess zeros).
- $Z$ represents the positive count variable.
- $\pi$ represents the probability of excess zeros.
- $\mu$ represents the mean parameter.
- $\phi$ represents the dispersion parameter.
- $X_1, X_2, \dots, X_p$ represent the predictor variables for the mean equation.
- $X_1, X_2, \dots, X_q$ represent the predictor variables for the dispersion equation.
- $X_1, X_2, \dots, X_r$ represent the predictor variables for the zero-inflation equation.
- $\beta_0, \beta_1, \beta_2, \dots, \beta_p$ represent the coefficients for the mean equation.
- $\gamma_0, \gamma_1, \gamma_2, \dots, \gamma_q$ represent the coefficients for the dispersion equation.
- $\theta_0, \theta_1, \theta_2, \dots, \theta_r$ represent the coefficients for the zero-inflation equation.

In order to accurately and efficiently apply a customized zero-inflated model to our data and test various regression models in Python, we utilized H2O, a Java-based software for data modeling and general computing. H2O is an abstracted distributed processing engine that allows for simple horizontal scaling in order to provide solutions faster and more efficiently. For model application, it offers numerous estimators, such as H2OGradientBoostingEstimator and H2ODeepLearningEstimator, served through a REST API abstraction, each of which exposes a large number of hyperparameters for deep customization. For the scope of our project, we have chosen H2OGradientBoostingEstimator, H2OXGBoostEstimator, H2ODeepLearningEstimator, and H2ORandomForestEstimator. The parameters distribution and tweedie_power were set to 'tweedie' and 1.5 respectively for the models, and the other parameters were optimized using GridSearchCV. For the Deep Learning Zero-Inflated Model (ZIM), 4 hidden layers were chosen with neuron counts of 100, 100, 50, and 50 respectively. Transforming our data using a zero-inflated model resulted in a marked improvement in our solar power prediction, with a reduced MAE and RMSE.

D. Power-Transform
It is a data transformation and scaling technique which is another way to tackle the skewness of the solar generation data; we found that using PowerTransformer was a significantly better fit for our dataset than the zero-inflated model. This scaling method is applied feature-wise to make the data more Gaussian or Gaussian-like, which is inherently assumed by regression-based prediction models. It is used when dealing with non-constant variance. There are 2 different methods of performing the power transform, namely the Box-Cox transform and the Yeo-Johnson transform.
The Yeo-Johnson power transform is given by the formula:

\[
x^{(\lambda)} =
\begin{cases}
\left( (x + 1)^{\lambda} - 1 \right) / \lambda, & \text{if } \lambda \neq 0,\ x \geq 0 \\
\ln(x + 1), & \text{if } \lambda = 0,\ x \geq 0 \\
-\left( (|x| + 1)^{2 - \lambda} - 1 \right) / (2 - \lambda), & \text{if } \lambda \neq 2,\ x < 0 \\
-\ln(|x| + 1), & \text{if } \lambda = 2,\ x < 0
\end{cases}
\]

Here, $x$ represents the original variable, and $\lambda$ is a parameter that determines the type of power transform applied. The transformed variable $x^{(\lambda)}$ is the result of applying the Yeo-Johnson power transform. The aforementioned Yeo-Johnson power transform was thus applied to the data, and the resulting distribution was notably found to resemble a Tweedie distribution.
The solar energy generation data points are now normalized to make the distribution more Gaussian as shown in Figure 4.

IV. RESULTS
For the prediction of solar energy generation using multiple methodologies, we have found that the Power Transformed data led to the most accurate prediction in comparison to Regular Time Series and Zero-Inflated models. Power Transformation of data is a particular method that stands out in comparison to the rest. This is due to solar energy generation being dependent on various factors like temperature, seasonality, time of day, and air quality of the region. These data points have non-linear relationships with each other. ConvLSTM2D
TABLE II
EVALUATION METRICS APPLIED FOR ZERO INFLATED MODEL TIME SERIES APPROACH

TABLE I
EVALUATION METRICS APPLIED FOR REGULAR TIME SERIES APPROACH

Models                        Hours Out   R2 Score   MAE    RMSE
Linear Regression             24          0.6344     6.45   14.91
                              48          0.5884     7.08   15.99
                              72          0.5677     7.57   16.43
GradientBoosting Regression   24          0.7142     5.15   13.18
                              48          0.6793     5.62   14.12
                              72          0.6555     6.13   14.66
XGBoost Regression            24          0.7431     5.00   12.49
                              48          0.7095     5.47   13.44
                              72          0.6865     5.92   13.99
RandomForest Regression       24          0.8005     3.42   11.02
                              48          0.7987     3.52   11.19
                              72          0.7996     3.48   11.18
RandomForest + XGBoost        24          0.8244     3.49   10.33
                              48          0.8145     3.71   10.74
                              72          0.8178     3.75   10.66

TABLE IV
EVALUATION METRICS APPLIED FOR LSTM TIME SERIES APPROACH

Model         Hours Out   R2 Score   MAE    RMSE
ConvLSTM2D    24          0.9691     0.18   0.10
              48          0.9637     0.18   0.08
              72          0.9608     0.20   0.09

After comparing all three methodologies, regression models, and variation in prediction time, the best combination of factors respectively has been tabulated in Table V.

TABLE V
OPTIMUM METRIC FOR POWER TRANSFORM TIME SERIES APPROACH

Model                    Hours Out   R2 Score   MAE    RMSE
RandomForest + XGBoost   24          0.9595     0.09   0.20
ConvLSTM2D               24          0.9691     0.18   0.10

The range of target variables was between 0 and 320 for regular time series and Zero-Inflated models. For the Zero
Fig. 7. Complete pipeline used for data collection and prediction of solar data.
Fig. 8. Monthly solar generation which provides insights into the seasonality
of the data
Fig. 10. Zero Inflated 24H out prediction using Gradient Boost Regressor