Predicting Solar Energy Generation with Machine Learning based on AQI and Weather Features


Arjun Shah, Varun Viswanath, Kashish Gandhi, Dr. Nilesh Madhukar Patil
Synapse, Computer Engineering
D.J. Sanghvi College of Engineering
Mumbai, India

arXiv:2408.12476v2 [cs.LG] 23 Aug 2024

Abstract: This paper addresses the pressing need for an accurate solar energy prediction model, which is crucial for efficient grid integration. We explore the influence of the Air Quality Index and weather features on solar energy generation, employing advanced Machine Learning and Deep Learning techniques. Our methodology uses time series modeling and makes novel use of power transform normalization and zero-inflated modeling. Various Machine Learning algorithms and Conv2D Long Short-Term Memory based Deep Learning models are applied to these transformations for precise predictions. Results underscore the effectiveness of our approach, demonstrating enhanced prediction accuracy with Air Quality Index and weather features. We achieved a 0.9691 R2 Score, 0.18 MAE, and 0.10 RMSE with the Conv2D Long Short-Term Memory model, showcasing the power transform technique's innovation in enhancing time series forecasting for solar energy generation. These results contribute valuable insights into the synergy between Air Quality Index, weather features, and Deep Learning techniques for solar energy prediction.

Index Terms: Solar Power Generation, Zero Inflated Model, Power Transform, Time series, LSTM, Deep Learning

I. INTRODUCTION

In the modern world, it has become increasingly clear that eliminating fossil fuels is one of the major requirements for achieving a carbon-neutral future. The Working Group III Special Report on Renewable Energy Sources and Climate Change Mitigation (SRREN) [19] suggests that consumption of fossil fuels accounts for the majority of anthropogenic GHG emissions worldwide. It states that CO2 concentrations had risen to over 390 ppm, around 39% above pre-industrial levels, by the end of 2010. In the race to an efficient energy ecosystem, solar energy is a promising renewable resource, but its intermittent nature poses challenges for integration into the grid. A recent study shows that India lost 29% of its utilizable global horizontal irradiance potential due to air pollution in the period between 2008 and 2018 [3]. Effective prediction of solar power generation is crucial for efficient planning and management of solar resources. Renewable energy like solar power benefits human beings in many ways, the most important of which is in the health domain. Research by Galimova et al. [4] suggests that if the world undergoes a global energy transition and energy sector emissions drop by 92% by 2050, premature deaths caused by air pollution can be reduced by 97%. That study reinforced the significance of considering environmental factors in our solar energy forecasting models. The present study examines the use of machine learning algorithms that incorporate Air Quality Index (AQI) and meteorological features to improve forecast accuracy.

In order to shed some light on the inconsistent patterns of solar generation data, a number of regression models were initially utilised to predict the per-hour generation of solar power. We benchmarked several regression models, chiefly Linear Regression, Lasso, Ridge, ElasticNet, and ensemble models like RandomForest and XGBoost. These models use different methodologies, which were compared to determine the model with optimal performance and to provide valuable insights into the future of solar generation [1]. Additionally, this paper describes a few novel methods, such as implementing a Zero Inflated Model and scaling the data using a Power Transform, which can significantly improve solar power predictions irrespective of the irregularity in the data. This study also incorporates the Convolutional Long Short-Term Memory 2D (ConvLSTM2D) network in the prediction process. ConvLSTM2D is a deep learning algorithm that combines the spatial processing capabilities of Convolutional Neural Networks (CNNs) with the sequential processing capabilities of Long Short-Term Memory (LSTM) networks. Using the spatiotemporal dependence of solar generation data, the ConvLSTM2D network improves the solar energy generation forecast accuracy by building upon historical data.
The novel contributions of this work can be summarized as follows: This study aims to investigate and optimise the performance of the aforementioned machine learning algorithms and to provide a clear and accurate picture of solar generation by considering weather features and AQI data, which are theorized to have an impact on the fluctuation of this data. It is envisaged that the methodologies employed in this paper will contribute considerably to painting a clearer picture of the sporadic nature of solar power and its influencing factors. The thorough benchmarking and prediction pipeline provides significant benefits for the efficient and sustainable utilization of solar resources by stakeholders, contributing to the adoption and use of practices that ultimately power our society into a clean energy future.

II. SURVEY OF LITERATURE

Various studies have used machine learning algorithms to increase the understanding and improvement of solar power forecasting models. Chuluunsaikhan et al. [1] discuss the importance of considering environmental factors such as climate and air pollution when predicting solar power generation. They state that solar panels work best when there is sunlight and no partial shade. However, factors such as weather conditions (e.g. clouds or rain) and air pollution (e.g. fine dust) can cause partial shading and reduce the power output of solar panels. The authors propose a method to predict the power output of solar panels through machine learning. Machine learning models are developed with three feature sets: weather components, air pollution components, and combined meteorological and air pollution components. The datasets used in the study were collected from 2017 to 2019 from the Seoul province of South Korea. The paper describes the methodology used, including data acquisition, feature extraction, model training, and power output prediction. The authors compare machine learning models such as linear regression, k-Nearest Neighbors (kNN), Support Vector Regression (SVR), Multi-Layer Perceptron (MLP), Random Forest Regressor (RFR), and Gradient-Boosting Regressor (GBR). Models are evaluated using quantitative error measures such as MAE, Coefficient of Determination (R2), and Root Mean Square Error (RMSE). Experimental results show that weather and air pollution parameters can be effective predictors. This paper has been the main premise of our research.

Zhou et al. [2] present a different approach to forecasting short-term solar power output in smart cities by using deep learning techniques. They used a combination of clustering, CNN, LSTM, and attention mechanisms that obtained improved accuracy in predicting future energy generation. The authors proposed that future work could develop training models with time series-based data for further improvement, a proposal that we took under consideration. This study and research by Zhou et al. [9] motivated us to incorporate the air quality index as a feature in our machine learning models, since [9] used community multiscale air quality modeling in their research, indicating how air pollutants can contribute to soiling of PV panels and thereby affect solar power generation, an effect also mentioned by Chiteka et al. [12]. Additionally, Jia et al. [5] presented models to predict solar radiation; even though our research is based on solar power generation, this paper gave us important insights regarding the use of machine learning models in solar forecasting under various weather conditions. Along with this, we also considered how machine learning models are computationally better than physical modeling methods, since we can use historical data to train the model and predict new data, which is difficult to do in the physical modeling approach. In addition, the studies by Jebli et al. [6], Liu et al. [7], and Sweerts et al. [8] helped us understand the importance of addressing topics like Pearson correlation, air pollutant deposition effects, and random forest optimization in our research. Incorporating these increased the accuracy of the prediction models, clearly indicating how different factors and approaches combined can enhance solar power generation prediction.

Along with machine learning models, many studies suggested the use of deep learning methods for predicting solar power generation. The application of models like CNNs and Recurrent Neural Networks (RNNs) shows the effectiveness of these deep learning techniques in capturing complex patterns and dependencies in solar generation, as shown by Lee et al. [11]. The study by Zazoum et al. [10] also evaluates the accuracy and reliability of deep learning methods in forecasting solar PV power generation, which is essential for effective grid integration and energy management. To address the unique characteristics of their dataset, which exhibited an excess of zero values, Kim et al. [13] explored alternative statistical models beyond the traditional Poisson regression. In addition to the Poisson model, the zero-inflated model was employed, acknowledging its ability to effectively handle datasets with an excessive proportion of observed zero values. By employing the zero-inflated model, the researchers sought to capture the dual processes contributing to the occurrence of zeros, distinguishing between structural zeros and excess zeros. Thomas et al. [14] also emphasize the need for a modeling framework for univariate and multivariate zero-inflated time series of counts. The basic modeling framework used is observation-driven Poisson regression with a Generalized Linear Model (GLM) structure.

The Zero-Inflated Poisson (ZIP) model is employed to capture the possibility of extra observed zeros relative to the Poisson distribution, a common feature in count data. Using these insights, we also utilized zero-inflated models in our research. Yeom et al. [15] introduce a novel approach using deep learning models, specifically ConvLSTM networks, to predict short-term solar radiation by incorporating geostationary satellite images. The proposed model showed high accuracy in capturing cloud-induced variations in ground-level solar radiation compared to the conventional Artificial Neural Network (ANN) and RFR models. This paper led us to implement ConvLSTM2D on our dataset too.

To summarize, the reviewed papers have contributed considerably to solar power generation forecasting using machine learning and deep learning techniques. Their research provided observations on which we built our own work, further enhancing solar forecasting by utilizing AQI and time series-based data and by exploring novel approaches to making solar power forecasting more reliable and accurate.

III. METHODOLOGY

A. Dataset

In order to accurately train our model on features that would help it effectively predict solar generation, we needed a high-granularity dataset containing solar generation, irradiance, and meteorological data. We thus utilized the UNISOLAR Solar Generation Dataset, which includes two years of photovoltaic solar energy generation data collected at an interval of 15 minutes at the La Trobe University Campus in Victoria, Australia. Weather data like apparent temperature, air temperature, dew point temperature, wind speed, wind direction, and relative humidity were also provided by the dataset. We curated the data by merging and selecting the provided features so that a suitable dataset could be created for our model. We did this by merging the data provided about the solar panels, such as the number of panels and type of inverters, with the aforementioned weather features to obtain a comprehensive and feature-rich dataset of potential factors affecting solar generation. In order to improve the input features of our model, we utilized AQI data sourced from the aqicn.org website, which was captured by a station located in Macleod, Victoria, with the location shown in Fig. 1. This station was at a distance of 1.77 km from the University where the UNISOLAR dataset was collected (as shown in Fig. 2).

Fig. 1. Location where dataset was sourced from (La Trobe University) along with nearest AQI station (Macleod, Victoria)

Fig. 2. Distance in kilometers between La Trobe University and Macleod AQI station (1.77 km)

It was noticed that the solar data recorded at 15-minute intervals had a number of irregularities in the number of data collections per day. Switching to one-hour intervals eliminated this problem and made the data intervals regular. Another reason for the use of one-hour intervals was to reduce the noise caused by various external factors like equipment sensitivity, cloud cover and other atmospheric factors.

For our study, we implemented a 70:30 dataset split for training and testing respectively. This approach, with 70% of the dataset dedicated to training, allows our model to effectively learn the complex temporal dependencies and relationships inherent in solar data, thereby enhancing its predictive accuracy. The remaining 30% served as an independent testing set to determine the efficacy of our proposed model on unseen data.

B. Time-Series Approach

Initially, a regression-based approach was utilized to predict the solar power generation based on the factors present. However, this did not provide adequate information regarding the relationship between these factors and solar power generation. This prompted us to try out a time series-based approach, as we also had chronological data. A time series model is designed to capture patterns in sequential, discrete data points over a set period of time. We made use of solar generation at a particular moment in time to predict the generation at some time in the future, as this type of approach would help us identify trends and seasonality to make more accurate predictions. We decided to shape our prediction by creating target columns holding the solar generation values 24, 48 and 72 hours ahead, and compared the effectiveness of the resulting models.

The Machine Learning models used for generating and comparing solar power generation for a Time Series approach were:

• Linear Regression
It is a statistical model used to establish a linear relationship between a dependent variable and one or more independent variables. It allows for the estimation of coefficients that determine the strength of the established linear relationship. It is used to find the line of best fit, which minimizes the distance between actual values and predicted values.

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \varepsilon \tag{1} \]

• Gradient Boosting Regression
It is an ensemble model which combines multiple weak prediction models, such as decision trees, and merges their predictions to build a strong predictive model. Gradient Boosting works by repeatedly training weak models to fit the negative gradient of the loss function at the current prediction, which helps improve its accuracy.

\[ \hat{y}_i = \sum_{k=1}^{K} f_k(x_i) = f_0(x_i) + \sum_{k=1}^{K} \gamma_k h_k(x_i) \tag{2} \]

• Random Forest Regression
It is a supervised learning algorithm used for regression-based tasks. It combines concepts of decision trees and ensemble learning to make precise predictions. It also reduces overfitting through an element of randomness, which restricts individual trees from memorizing the training set data provided to them. The prediction of the algorithm is made by averaging the predictions of the individual trees.

\[ \hat{y}_i = \frac{1}{N} \sum_{j=1}^{N} f_j(x_i) \tag{3} \]

• XGBoost Regression
Extreme Gradient Boosting is an advanced gradient boosting algorithm that is widely used due to the high accuracy of its predictions. It achieves this by iteratively adding weak models to the ensemble and fitting each one to the error of the current prediction. It also uses features like column and row subsampling to further improve its performance.

\[ \hat{y}_i = \sum_{k=1}^{K} f_k(x_i) = f_0(x_i) + \sum_{k=1}^{K} h_k(x_i) \tag{4} \]

• Random Forest Regression + XGBoost Regression
Building upon the positive results displayed by Lokesh et al. [20], we decided upon an ensemble model consisting of a Random Forest Regressor and an XGBoost Regressor, considering that their strengths are complementary in the sense that both are ensemble learners and use bagging and boosting respectively. RFR can handle non-linear relationships between data points, and XGBoost can capture subtle patterns and temporal dependencies.

• ConvLSTM2D
In this subsection, we delve into the specifics of the ConvLSTM2D model, a hybrid architecture that merges CNNs and LSTM networks. This architecture is designed specifically for analyzing spatio-temporal data, where understanding both spatial relationships and temporal dynamics is crucial for accurate predictions.

Fig. 3. This figure shows the ConvLSTM2D model architecture, which combines convolutional and LSTM layers to process spatio-temporal data.

The architecture of the ConvLSTM2D model, illustrated in Figure 3, consists of layers that facilitate the processing of spatio-temporal data. At its core are ConvLSTM units, which extend the traditional LSTM cells by incorporating convolutional operations within the recurrent structure. This design enables the model to effectively capture both spatial and temporal dependencies within the data. The utilization of ConvLSTM units makes the model adept at handling sequences of spatio-temporal observations, a characteristic essential for tasks such as video analysis, weather forecasting, and motion prediction.

The training procedure of the ConvLSTM2D model follows a systematic approach tailored to exploit its full potential. Initially, standard data preprocessing techniques, similar to those employed for other machine learning models, are applied. However, due to the architecture of ConvLSTM2D, additional reshaping of the data is necessary to conform to its input shape requirements. The input data is reshaped into a 5-dimensional tensor, accommodating batches of sequences, each comprising 2D matrices over time. This reshaping operation enables the model to interpret the input data as a spatio-temporal sequence, facilitating effective learning of complex patterns. Furthermore, to ensure robust training and prevent issues stemming from skewed distributions, a power
transformation is applied to scale input features appropriately. This normalization step fosters homogeneous feature scales, thereby preventing any individual feature from disproportionately influencing the learning process.

During the training phase, the model is optimized using the Adam optimization algorithm with a predefined learning rate. The optimization process involves iteratively adjusting the model's internal parameters to minimize the mean squared error loss between predicted and actual values. This iterative optimization enables the model to discern intricate patterns within the data and refine its predictive capabilities accordingly.

The ConvLSTM2D model was chosen for its inherent capability to capture spatio-temporal dependencies effectively, a critical requirement in our analysis. Through rigorous experimentation and evaluation, it consistently outperformed alternative models, exhibiting superior performance in terms of predictive accuracy and generalization capabilities. Its hybrid architecture, which integrates both convolutional and recurrent operations, enables it to discern complex patterns within spatio-temporal data, making it particularly well-suited for the tasks at hand.

C. Zero-Inflated Models

An analysis of the solar power data showed that an unusually high number of values in the distribution amounted to zero; this was likely due to negligible solar generation caused by the intermittent nature of sunlight received by the solar panels. These zeros overshadowed the other values in the distribution by a fair amount.

The histogram shown in Fig. 4 highlights this zero-inflated nature of our initial dataset. It also provides insights into the range of values of solar energy generation and their respective frequencies.

Fig. 4. Histogram of the initial solar power distribution before application of transformations, highlighting the zero-inflated nature of our dataset.

Thus, on noticing this skewness of the solar energy generation data, we decided to switch to a zero-inflated model, which essentially differentiates between structural zeros (due to nighttime and genuine absences of solar generation) and zero inflation (additional factors that may influence the probability of observing a zero, such as cloudy days or equipment failures). According to a study, 'previous research indicates that if excessive zero is not accounted for, an unreasonable fit for both the zeros and nonzero counts will occur (Perumean-Chaney et al. 2013). ZI (Lambert 1992) and hurdle models (Mullahy 1986; Heilbron 1994) have been developed to model zero-inflation when the regular count models such as Poisson or negative binomial are unrealistic' [17]. Thus it became a necessity to apply a zero-inflated model to our data that would best represent the current distribution and convert it into a distribution that might be more acceptable for our regression models to make predictions with. After considering various models and comparing them with our data to determine the best fit, we decided upon the Tweedie distribution.

Tweedie Distribution

The Tweedie distribution (Fig. 5) is characterized by the following components:
• Random Variable: Let y be the random variable representing the observed value.
• Mean Parameter: µ is the mean parameter of the distribution, indicating the average value of y.
• Power Parameter: p is the power parameter of the distribution, controlling the variance structure. It determines the shape of the distribution and can take any positive value, excluding 1.
• Normalizing Constant: A(µ, p) is the normalizing constant that ensures the Probability Density Function (PDF) integrates to 1 over the support of y. It accounts for the specific values of µ and p and is essential for properly defining the distribution.

The probability density function of the Tweedie distribution is thus given by:

\[ f(y; \mu, p) = \frac{y^{p-1} \exp\left(\frac{\mu^{(1-p)}}{1-p}\right)}{A(\mu, p)} \exp\left(-\frac{y^{p}}{A(\mu, p)}\right) \]

The Tweedie family encompasses various shapes. For 1 < p < 2 it is a compound Poisson-gamma distribution with a point mass at zero and a right-skewed continuous positive part, while for p ≥ 2 it is strictly positive and right-skewed. It includes special cases such as the normal distribution (p = 0), the Poisson distribution (the limiting case p = 1), and the gamma distribution (p = 2). The choice of p determines the specific characteristics of the Tweedie distribution, including its skewness, tail behavior, and overall shape. This was suitable for our zero-inflated distribution, using a modified version of the above called a Zero Inflated Tweedie (ZIT) Model, which accounts for the excess zeros using a separate inflation component.

Zero-Inflated Model with Tweedie Distribution
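As a rough illustration of why excess zeros call for a separate inflation component, the sketch below simulates a zero-inflated Tweedie variable for 1 < p < 2 using the standard compound Poisson-gamma construction. The function names and the parameter values (µ = 4, ϕ = 2, p = 1.5, π = 0.4) are illustrative assumptions and are not fitted to the UNISOLAR data; this is a simulation only, not the estimation procedure used in the paper.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_tweedie(mu, phi, p, rng):
    """Draw from a Tweedie(mu, phi, p) distribution with 1 < p < 2.

    In this range the Tweedie law is a Poisson sum of gamma variables,
    so it has a point mass at zero and a continuous positive part.
    """
    lam = mu ** (2 - p) / (phi * (2 - p))   # Poisson rate
    shape = (2 - p) / (p - 1)               # gamma shape
    scale = phi * (p - 1) * mu ** (p - 1)   # gamma scale
    n_events = sample_poisson(lam, rng)
    return sum(rng.gammavariate(shape, scale) for _ in range(n_events))


def sample_zero_inflated_tweedie(pi, mu, phi, p, rng):
    """With probability pi emit an excess zero, otherwise a Tweedie draw."""
    if rng.random() < pi:
        return 0.0
    return sample_tweedie(mu, phi, p, rng)


# Assumed illustrative parameters: pi = 0.4, mu = 4, phi = 2, p = 1.5.
rng = random.Random(0)
draws = [sample_zero_inflated_tweedie(0.4, 4.0, 2.0, 1.5, rng) for _ in range(20000)]
zero_share = sum(d == 0 for d in draws) / len(draws)
sample_mean = sum(draws) / len(draws)
print(f"share of zeros: {zero_share:.3f}, mean: {sample_mean:.3f}")
```

With these assumed parameters the Tweedie part alone produces zeros with probability exp(-2), so the share of zeros should be close to 0.4 + 0.6 exp(-2), roughly 0.48, and the sample mean should be close to (1 - π)µ = 2.4. The inflation component is what lifts the zero share above what the Tweedie distribution alone would give.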
Fig. 5. A plot demonstrating a standard Tweedie distribution.

The zero-inflated model with the Tweedie distribution can be represented using the following formula:

\[ Y = \begin{cases} 0, & \text{with probability } \pi \\ Z, & \text{with probability } 1 - \pi \end{cases} \]

\[ \log\left(\frac{\mu}{\phi}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p \]

\[ \log(\phi) = \gamma_0 + \gamma_1 X_1 + \gamma_2 X_2 + \ldots + \gamma_q X_q \]

\[ \log\left(\frac{\pi}{1 - \pi}\right) = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \ldots + \theta_r X_r \]

where:
- Y represents the response variable (count variable with excess zeros).
- Z represents the positive count variable.
- π represents the probability of excess zeros.
- µ represents the mean parameter.
- ϕ represents the dispersion parameter.
- X1, X2, ..., Xp represent the predictor variables for the mean equation.
- X1, X2, ..., Xq represent the predictor variables for the dispersion equation.
- X1, X2, ..., Xr represent the predictor variables for the zero-inflation equation.
- β0, β1, β2, ..., βp represent the coefficients for the mean equation.
- γ0, γ1, γ2, ..., γq represent the coefficients for the dispersion equation.
- θ0, θ1, θ2, ..., θr represent the coefficients for the zero-inflation equation.

In order to accurately and efficiently apply a customized zero-inflated model to our data and test out various regression models in Python, we utilized H2O, a Java-based software for data modeling and general computing. H2O is an abstracted distributed processing engine that allows for simple horizontal scaling in order to provide solutions faster and more efficiently. For model application, it offers numerous estimator functions like H2OGradientBoostingEstimator and H2ODeepLearningEstimator, served through a REST API abstraction, each of which exposes a plethora of hyperparameters for deep customization. For the scope of our project, we have chosen H2OGradientBoostingEstimator, H2OXGBoostEstimator, H2ODeepLearningEstimator, and H2ORandomForestEstimator. The parameters distribution and tweedie_power were set to 'tweedie' and 1.5 respectively for the models, and other parameters were optimized using GridSearchCV. For the Deep Learning Zero-Inflated Model (ZIM), 4 hidden layers were chosen with neuron counts of 100, 100, 50 and 50 respectively. Transforming our data using a zero-inflated model resulted in a marked improvement in our solar power prediction, with a reduced MAE and RMSE.

D. Power-Transform

The power transform is a data transformation and scaling technique that offers another way to tackle the skewness of the solar generation data; we found that using PowerTransformer was a significantly better fit for our dataset than the zero-inflated model. This scaling method is applied feature-wise to make the data more Gaussian or Gaussian-like, which is inherently assumed by regression-based prediction models. It is used when dealing with non-constant variance. There are two different methods of performing the power transform, namely the Box-Cox transform and the Yeo-Johnson transform.

The Yeo-Johnson power transform is given by the formula:

\[ x_\lambda = \begin{cases} \left((x + 1)^{\lambda} - 1\right)/\lambda, & \text{if } \lambda \neq 0,\ x \geq 0 \\ \ln(x + 1), & \text{if } \lambda = 0,\ x \geq 0 \\ -\left((|x| + 1)^{2-\lambda} - 1\right)/(2 - \lambda), & \text{if } \lambda \neq 2,\ x < 0 \\ -\ln(|x| + 1), & \text{if } \lambda = 2,\ x < 0 \end{cases} \]

Here, x represents the original variable, and λ is a parameter that determines the type of power transform applied. The transformed variable x_λ is the result of applying the Yeo-Johnson power transform. The Yeo-Johnson power transform was thus applied to the data, and the resulting distribution was notably found to resemble a Tweedie distribution.

The solar energy generation data points are now normalized to make the distribution more Gaussian, as shown in Fig. 6.

IV. RESULTS

For the prediction of solar energy generation using multiple methodologies, we have found that the Power Transformed data led to the most accurate predictions in comparison to the Regular Time Series and Zero-Inflated models. Power transformation of the data stands out in comparison to the rest. This is due to solar energy generation being dependent on various factors like temperature, seasonality, time of day, and air quality of the region. These data points have non-linear relationships with each other. ConvLSTM2D
models outperform normal regression models as they combine convolutional operations and LSTM memory cells, allowing for the modeling of both spatial and temporal dependencies in data. This makes ConvLSTM2D well-suited for tasks where both spatial and sequential information are important, such as solar power generation forecasting.

Fig. 6. Histogram of solar power distribution after applying power transform, which is analogous to the Tweedie distribution in Fig. 5.

For evaluating the performance of the models on the gathered data, various statistical metrics were considered, of which ultimately R2, MAE, and RMSE were used. Our key metric was R2, as the coefficient of determination is generally a better indicator of regression model performance when compared to other metrics [18]. These metrics were tabulated and compared across the various models and for three different forecast horizons: 24 hours, 48 hours and 72 hours. There are three such tables corresponding to the principal methodologies utilised: Regular time-series, Zero Inflated Model, and Power Transform.

TABLE I
EVALUATION METRICS APPLIED FOR REGULAR TIME SERIES APPROACH

Model                         Hours Out   R2 Score   MAE    RMSE
Linear Regression             24          0.6344     6.45   14.91
Linear Regression             48          0.5884     7.08   15.99
Linear Regression             72          0.5677     7.57   16.43
GradientBoosting Regression   24          0.7142     5.15   13.18
GradientBoosting Regression   48          0.6793     5.62   14.12
GradientBoosting Regression   72          0.6555     6.13   14.66
XGBoost Regression            24          0.7431     5.00   12.49
XGBoost Regression            48          0.7095     5.47   13.44
XGBoost Regression            72          0.6865     5.92   13.99
RandomForest Regression       24          0.8005     3.42   11.02
RandomForest Regression       48          0.7987     3.52   11.19
RandomForest Regression       72          0.7996     3.48   11.18
RandomForest + XGBoost        24          0.8244     3.49   10.33
RandomForest + XGBoost        48          0.8145     3.71   10.74
RandomForest + XGBoost        72          0.8178     3.75   10.66

The range of target variables was between 0 and 320 for the regular time series and Zero-Inflated models. For the Zero Inflated Model, we performed five-fold cross-validation, and there was no significant difference between the validation accuracy and the training accuracy.

TABLE II
EVALUATION METRICS APPLIED FOR ZERO INFLATED MODEL TIME SERIES APPROACH

Model                         Hours Out   R2 Score   MAE    RMSE
GradientBoosting Regression   24          0.7979     3.99   11.07
GradientBoosting Regression   48          0.7678     4.37   11.90
GradientBoosting Regression   72          0.7506     4.55   12.29
XGBoost Regression            24          0.7790     4.17   11.58
XGBoost Regression            48          0.7501     4.54   12.35
XGBoost Regression            72          0.7406     4.76   12.61
RandomForest Regression       24          0.8086     3.51   10.65
RandomForest Regression       48          0.7987     3.63   10.74
RandomForest Regression       72          0.8165     3.62   10.61
Deep Learning                 24          0.6761     4.99   14.00
Deep Learning                 48          0.6287     5.55   16.26
Deep Learning                 72          0.6285     5.33   16.18

TABLE III
EVALUATION METRICS APPLIED FOR POWER TRANSFORM TIME SERIES APPROACH

Model                         Hours Out   R2 Score   MAE    RMSE
Linear Regression             24          0.8357     0.21   0.41
Linear Regression             48          0.7875     0.25   0.46
Linear Regression             72          0.7380     0.30   0.51
GradientBoosting Regression   24          0.8754     0.18   0.35
GradientBoosting Regression   48          0.8345     0.21   0.41
GradientBoosting Regression   72          0.7952     0.25   0.45
XGBoost Regression            24          0.8926     0.17   0.33
XGBoost Regression            48          0.8581     0.20   0.38
XGBoost Regression            72          0.8289     0.24   0.41
RandomForest Regression       24          0.9574     0.09   0.21
RandomForest Regression       48          0.9580     0.09   0.21
RandomForest Regression       72          0.9584     0.08   0.20
RandomForest + XGBoost        24          0.9595     0.09   0.20
RandomForest + XGBoost        48          0.9561     0.21   0.10
RandomForest + XGBoost        72          0.9562     0.11   0.21

The range of target variables was between -0.9 and 1.8 after the Power Transform was applied.

TABLE IV
EVALUATION METRICS APPLIED FOR LSTM TIME SERIES APPROACH

Model                         Hours Out   R2 Score   MAE    RMSE
ConvLSTM2D                    24          0.9691     0.18   0.10
ConvLSTM2D                    48          0.9637     0.18   0.08
ConvLSTM2D                    72          0.9608     0.20   0.09

After comparing all three methodologies, the regression models, and the variation in prediction time, the best combinations of factors have been tabulated in Table V.

TABLE V
OPTIMUM METRIC FOR POWER TRANSFORM TIME SERIES APPROACH

Model                         Hours Out   R2 Score   MAE    RMSE
RandomForest + XGBoost        24          0.9595     0.09   0.20
ConvLSTM2D                    24          0.9691     0.18   0.10
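The three metrics reported in Tables I to V can be computed as in the minimal sketch below. Because Tables I and II use the raw target (range 0 to 320) while the later tables appear to be evaluated on the power-transformed target (range -0.9 to 1.8), the sketch also applies the Yeo-Johnson formula from Section III-D to show how the same predictions score on both scales. The sample values and the choice λ = 0 are assumptions made purely for illustration; in practice λ is estimated from the data.

```python
import math


def yeo_johnson(x, lam):
    """Yeo-Johnson transform of a single value, following the four-case formula."""
    if x >= 0:
        if lam != 0:
            return ((x + 1) ** lam - 1) / lam
        return math.log(x + 1)
    if lam != 2:
        return -((abs(x) + 1) ** (2 - lam) - 1) / (2 - lam)
    return -math.log(abs(x) + 1)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical hourly generation values and predictions on the raw scale.
y_true = [0.0, 10.0, 100.0, 300.0]
y_pred = [1.0, 12.0, 90.0, 280.0]

# lam = 0 is an assumed value chosen for illustration only.
lam = 0.0
t_true = [yeo_johnson(v, lam) for v in y_true]
t_pred = [yeo_johnson(v, lam) for v in y_pred]

raw_scores = (r2_score(y_true, y_pred), mae(y_true, y_pred), rmse(y_true, y_pred))
transformed_scores = (r2_score(t_true, t_pred), mae(t_true, t_pred), rmse(t_true, t_pred))
print("raw (R2, MAE, RMSE):", raw_scores)
print("transformed (R2, MAE, RMSE):", transformed_scores)
```

MAE and RMSE carry the units of the target, so values computed on the transformed scale are not directly comparable with values computed on the raw scale; R2 is unitless but still depends on the scale used. The sketch omits the standardization to zero mean and unit variance that scikit-learn's PowerTransformer applies by default. For any set of residuals, RMSE is never smaller than MAE, which is a useful sanity check when reading metric tables.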
Fig. 7. Complete pipeline used for data collection and prediction of solar data.

The solar power generation data when plotted monthly
follows a specific pattern that can be attributed to the seasonal
cycle of the Australian landmass, where the dataset was
sourced from. The generation is noted to be maximum from
November to February, which coincides with the summer
months in Australia, and reaches its minimum during the
months of May to August, which are the winter months. This
can be explained by the sun being directly overhead during
summer, leading to longer days and more exposure to solar
radiation. This phenomenon reverses during the winter months,
as is shown by the histogram (Fig. 8).

Fig. 8. Monthly solar generation which provides insights into the seasonality
of the data

Fig. 9. Loss vs Epoch for ConvLSTM2D

Fig. 10. Zero Inflated 24H out prediction using Gradient Boost Regressor

Fig. 11. Power Transformed 24H out prediction using RandomForest Regressor
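The idea behind handling a target with many zeros, as in the zero-inflated prediction named in Fig. 10, can be sketched as a two-part model: a gate that decides whether any generation is expected, followed by a regressor fit only on the non-zero samples. This is a simplified hurdle-style illustration, not a full zero-inflated model and not the authors' implementation; the single `irradiance` feature, the threshold gate, and all values are hypothetical.

```python
from statistics import mean


def fit_two_part(features, targets):
    """Fit a toy two-part model on one feature."""
    # Stage 1 (gate): largest feature value at which the target was zero.
    zero_feats = [f for f, t in zip(features, targets) if t == 0]
    threshold = max(zero_feats) if zero_feats else float("-inf")
    # Stage 2 (regressor): least-squares line on non-zero samples only.
    pos = [(f, t) for f, t in zip(features, targets) if t > 0]
    f_bar = mean(f for f, _ in pos)
    t_bar = mean(t for _, t in pos)
    slope = (sum((f - f_bar) * (t - t_bar) for f, t in pos)
             / sum((f - f_bar) ** 2 for f, _ in pos))
    intercept = t_bar - slope * f_bar
    return threshold, slope, intercept


def predict_two_part(model, feature):
    """Predict zero when the gate is closed, otherwise use the regressor."""
    threshold, slope, intercept = model
    if feature <= threshold:
        return 0.0
    return max(0.0, slope * feature + intercept)


# Hypothetical data: night-time readings produce exact zeros.
irradiance = [0.0, 0.0, 0.0, 100.0, 200.0, 400.0, 800.0]
generation = [0.0, 0.0, 0.0, 30.0, 60.0, 120.0, 240.0]

model = fit_two_part(irradiance, generation)
print(predict_two_part(model, 0.0), predict_two_part(model, 500.0))
```

Fitting the regressor only on non-zero samples keeps the large block of zeros from pulling the fitted line toward the origin, which is the practical motivation for separating the two stages.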
Fig. 9 shows the Loss vs Epoch plot for the
ConvLSTM2D model, displaying the training progress.

Our time-series based solar power prediction models
also capture this phenomenon, as seen in the graphs plotted
above (Fig. 10 and Fig. 11). There is a drop in solar production
during the winter months of 2020 and the peak production is
reached in the summer of 2021.

V. CONCLUSION
This study investigates the use of various machine learning
algorithms to effectively determine the future solar generation
in a region by utilizing a time series approach. The chief models
employed were Linear Regression, Lasso, Ridge, Elastic-Net,
ensemble models like RandomForest and XGBoost, and
deep learning models like ConvLSTM2D. Later, on conferring,
we decided to switch to a time-series based approach due to
the seasonal fluctuation in the solar data. However, on noticing
the skewness of the solar energy generation data, which had
a high number of zeros, we decided to switch to a zero-inflated
model, which helps ascertain the difference between
the true zero data points and the inflated zeros. This approach
immediately yielded a higher accuracy of solar prediction
with a lower mean standard error. Another way to tackle the
skewness of the solar generation data was to try different
scaling techniques; we found that using PowerTransformer
was a significantly better fit for our dataset than the zero-inflated
model. This scaling method is applied feature-wise
to make the data more Gaussian or Gaussian-like, which is
inherently assumed by regression-based prediction models. In
conclusion, this study investigates the use of machine learning
algorithms incorporating AQI and climate factors to provide
more accurate solar generation forecasts. Considering seasonal
variations in the solar data, the time series-based method
was modified. In addition, a zero-inflated model and scaling
techniques were used to address the skewness of the solar
generation data. The findings provide valuable insights for
solar stakeholders, contributing to the adoption and use of
sustainable energy sources.

VI. FUTURE SCOPE
Solar energy generation forecasting is a dynamic field that
will always develop and demand exploration. There is a lot
of scope in the future to improve the accuracy of solar
forecasting. To make this forecasting scalable and generalized
for all geographical regions, the dataset can be collected
from different regions and time periods, since our research
primarily utilizes data from a specific region and time zone.
Additionally, data fusion and feature engineering can be done
to enhance the power of forecasting. Data fusion is the practice
of collecting data from multiple sources, like satellite imagery
and weather stations, which can be combined for better
results. One limitation of the dataset that we faced
during implementation was that the AQI data
we used did not come from a data station at the exact geographical
location of the data collection, but rather from another data station
away from the solar site. Thus, in order to increase accuracy,
one can take AQI data from a data station at maximum
proximity to the data collection center in order to eliminate
many geographical errors. Another possibility can be to collect
our own data to create a localized dataset, which would likely
be more accurate since we can customize it and remove
any bottlenecks faced earlier by having utilised an external
dataset. Moreover, feature engineering can be employed to
extract more meaningful features from the data available,
hence increasing the model performance. Features like cloud
cover, dust, and further manners of seasonality can also be
taken into consideration for future research.

REFERENCES

[1] Chuluunsaikhan, Tserenpurev. (2021). Predicting the Power Output
of Solar Panels based on Weather and Air Pollution Features using
Machine Learning. Journal of Korea Multimedia Society. 24. 222.
10.9717/kmms.2021.24.2.222.
[2] Zhou, H., Liu, Q., Yan, K., Du, Y. (2021). Deep Learning Enhanced
Solar Energy Forecasting with AI-Driven IoT. Wireless Communications
and Mobile Computing, 2021, Article ID 9249387.
doi:10.1155/2021/9249387.
[3] Ghosh, S., Dey, S., Ganguly, D., Roy, S. B., & Bali, K. (2022). Cleaner
air would enhance India’s annual solar energy production by 6–28 TWh.
Environmental Research Letters, 17(5), 054007.
doi:10.1088/1748-9326/ac5d9a.
[4] Galimova, T., Ram, M., & Breyer, C. (2022). Mitigation of air pollution
and corresponding impacts during a global energy transition towards
100% renewable energy system by 2050. Energy Reports, 8, 14124-14143.
ISSN 2352-4847. doi:10.1016/j.egyr.2022.10.343.
[5] Jia, Dongyu & Yang, Liwei & Lv, Tao & Liu, Weiping & Gao,
Xiaoqing & Zhou, Jiaxin. (2022). Evaluation of machine learning
models for predicting daily global and diffuse solar radiation under
different weather/pollution conditions. Renewable Energy. 187.
10.1016/j.renene.2022.02.002.
[6] Jebli, I., Belouadha, F.-Z., Kabbaj, M. I., & Tilioua, A. (2021). Prediction
of solar energy guided by Pearson correlation using machine learning.
Energy, 224, 120109. doi:10.1016/j.energy.2021.120109
[7] Liu, D., & Sun, K. (2019). Random forest solar power
forecast based on classification optimization. Energy, 115940.
doi:10.1016/j.energy.2019.115940
[8] Sweerts, B., Pfenninger, S., Yang, S., Folini, D., van der Zwaan, B., &
Wild, M. (2019). Estimation of losses in solar energy production from
air pollution in China since 1960 using surface radiation data. Nature
Energy. doi:10.1038/s41560-019-0412-4
[9] Zhou, L., Schwede, D. B., Appel, K. W., Mangiante, M. J., Wong, D. C.,
Napelenok, S. L., Whung, P.-Y., & Zhang, B. (2019). The impact of air
pollutant deposition on solar energy system efficiency: an approach to
estimate PV soiling effects with the Community Multiscale Air Quality
(CMAQ) model. Science of the Total Environment, 651(Pt 1), 456–465.
doi:10.1016/j.scitotenv.2018.09.194
[10] Zazoum, B. (2022). Solar photovoltaic power prediction using different
machine learning methods. Energy Reports, 8(Supplement 1), 19-25.
ISSN 2352-4847. doi:10.1016/j.egyr.2021.11.183.
[11] Lee, C.-H., Yang, H.-C., & Ye, G.-B. (2021). Predicting the Performance
of Solar Power Generation Using Deep Learning Methods. Applied
Sciences, 11(15), 6887. doi:10.3390/app11156887
[12] Chiteka, K., Arora, R., Sridhara, S. N., & Enweremadu, C. C. (2020).
A novel approach to Solar PV cleaning frequency optimization for
soiling mitigation. Scientific African, 8, e00459. ISSN 2468-2276.
doi:10.1016/j.sciaf.2020.e00459.
[13] Kim, D.-W., Deo, R. C., Park, S.-J., Lee, J.-S., & Lee, W.-S. (2019).
Weekly heat wave death prediction model using zero-inflated regression
approach. Theoretical and Applied Climatology, 137, 823-838. doi:
10.1007/s00704-018-2632-4
[14] Thomas, S. J. (2010). Model-based clustering for multivariate time
series of counts. Rice University. ProQuest Dissertations Publishing.
(Publication No. 3421317).
[15] Yeom, J. M., Deo, R. C., Adamowski, J. F., Park, S., & Lee, C. S. (2020).
Spatial mapping of short-term solar radiation prediction incorporating
geostationary satellite images coupled with deep convolutional LSTM
networks for South Korea. Environmental Research Letters, 15(9),
094025. doi: 10.1088/1748-9326/ab9467
[16] S. Wimalaratne, D. Haputhanthri, S. Kahawala, G. Gamage, D. Alahakoon
and A. Jennings, "UNISOLAR: An Open Dataset of Photovoltaic
Solar Energy Generation in a Large Multi-Campus University Setting,"
2022 15th International Conference on Human System Interaction (HSI),
2022, pp. 1-5, doi: 10.1109/HSI55341.2022.986947
[17] Feng, C.X. A comparison of zero-inflated and hurdle models for
modeling zero-inflated count data. J Stat Distrib App 8, 8 (2021).
https://doi.org/10.1186/s40488-021-00121-4
[18] Chicco D, Warrens MJ, Jurman G. The coefficient of determination
R-squared is more informative than SMAPE, MAE, MAPE, MSE
and RMSE in regression analysis evaluation. PeerJ Comput Sci. 2021
Jul 5;7:e623. doi: 10.7717/peerj-cs.623. PMID: 34307865; PMCID:
PMC8279135.
[19] O. Edenhofer et al., Eds., Renewable Energy Sources and Climate
Change Mitigation: Special Report of the Intergovernmental Panel on
Climate Change. Cambridge: Cambridge University Press, 2011.
