Distributed ARIMA Models implemented with Apache Spark
DARIMA is designed to facilitate forecasting ultra-long time series by utilizing the industry-standard MapReduce framework. The algorithm is developed on the Spark platform and provides both Python and R interfaces.
See darima for the functions used to implement DARIMA models.
- model.py: Train an ARIMA model for each subseries and convert the trained models into AR representations (Mapper).
- dlsa.py: Combine the local estimators obtained in the Mapper by minimizing the global loss function (Reducer).
- forecast.py: Forecast the next H observations using the combined estimators.
- evaluation.py: Calculate the forecasting accuracy in terms of MASE, sMAPE and MSIS.
- R: R functions designed for modeling, combining and forecasting. rpy2 is needed as an interface to use R from Python.
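The Reducer, forecasting and evaluation steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in dlsa.py, forecast.py or evaluation.py: it assumes the global loss is a weighted least squares loss with inverse-covariance weights, that each local estimator is a hypothetical vector (intercept, phi_1) from an AR(1) representation, and that all function names and numbers are made up for the example. MSIS is omitted for brevity.

```python
import numpy as np


def combine_local_estimators(thetas, covariances):
    """Minimize sum_k (theta - theta_k)' inv(Sigma_k) (theta - theta_k) over theta."""
    dim = len(thetas[0])
    precision_sum = np.zeros((dim, dim))
    weighted_sum = np.zeros(dim)
    for theta_k, sigma_k in zip(thetas, covariances):
        precision_k = np.linalg.inv(np.asarray(sigma_k, dtype=float))
        precision_sum += precision_k
        weighted_sum += precision_k @ np.asarray(theta_k, dtype=float)
    # Closed-form minimizer: (sum_k inv(Sigma_k))^-1 sum_k inv(Sigma_k) theta_k
    return np.linalg.solve(precision_sum, weighted_sum)


def forecast_ar(history, intercept, ar_coefs, horizon):
    """Iterate y_t = intercept + sum_i ar_coefs[i] * y_{t-1-i} for `horizon` steps."""
    ar_coefs = np.asarray(ar_coefs, dtype=float)
    p = len(ar_coefs)
    values = list(history)  # must hold at least p observations
    for _ in range(horizon):
        lags = values[-1:-p - 1:-1]  # most recent observation first
        values.append(intercept + float(np.dot(ar_coefs, lags)))
    return np.array(values[len(history):])


def smape(actual, forecast):
    """Symmetric MAPE in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = np.abs(actual) + np.abs(forecast)
    return float(np.mean(200.0 * np.abs(actual - forecast) / denom))


def mase(actual, forecast, train, period=1):
    """MAE scaled by the in-sample MAE of the (seasonal) naive forecast."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    train = np.asarray(train, dtype=float)
    scale = np.mean(np.abs(train[period:] - train[:-period]))
    return float(np.mean(np.abs(actual - forecast)) / scale)


# Hypothetical local estimators (intercept, phi_1) from two subseries,
# each with an assumed covariance matrix of the estimator.
thetas = [np.array([1.0, 0.5]), np.array([1.4, 0.3])]
covariances = [np.eye(2) * 0.01, np.eye(2) * 0.03]

combined = combine_local_estimators(thetas, covariances)
train = [1.0, 3.0, 2.0, 2.0]
forecasts = forecast_ar(train, combined[0], combined[1:], horizon=3)

actual = [2.0, 2.2, 1.8]
smape_value = smape(actual, forecasts)
mase_value = mase(actual, forecasts, train)
```

With inverse-covariance weights, the subseries whose estimator has the smaller variance pulls the combined estimator toward its own value, which is the intuition behind combining local estimators rather than averaging them equally.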
- Spark >= 2.3.1
- Python >= 3.7.0
- pyspark >= 2.3.1
- rpy2 >= 3.0.4
- scikit-learn >= 0.21.2
- numpy >= 1.16.3
- pandas >= 0.23.4
- R >= 3.5.2
- forecast >= 8.5
- polynom = 1.3.9
- dplyr >= 0.8.4
- quantmod >= 0.4.13
- magrittr >= 1.5
Run the PySpark code to forecast the GEFCom2017 time series with DARIMA.
    ./bash/run_darima.sh
or simply run
    PYSPARK_PYTHON=/usr/local/bin/python3.7 ARROW_PRE_0_15_IPC_FORMAT=1 spark-submit ./run_darima.py
Note: ARROW_PRE_0_15_IPC_FORMAT=1 is added to instruct PyArrow >= 0.15.0 to use the legacy IPC format with the older Arrow Java that is in Spark 2.3.x and 2.4.x.
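If the job is launched from a Python script rather than a shell, the same two environment variables can be passed to spark-submit explicitly. The sketch below is only an illustration: the helper names darima_submit_env and run_darima are hypothetical and not part of this repository, and the interpreter path is the one used in the command above.

```python
import os
import subprocess


def darima_submit_env(python_path="/usr/local/bin/python3.7", base_env=None):
    """Return a copy of the environment with the two variables set for spark-submit."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_PYTHON"] = python_path
    # Ask PyArrow >= 0.15.0 to use the legacy IPC format (Spark 2.3.x and 2.4.x).
    env["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
    return env


def run_darima(script="./run_darima.py"):
    """Launch spark-submit on the given script; raises if the job exits non-zero."""
    return subprocess.run(["spark-submit", script], env=darima_submit_env(), check=True)
```

Calling run_darima() from the repository root is then equivalent to the one-line command above, assuming spark-submit is on the PATH.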
Run the R code to forecast the GEFCom2017 time series with the auto.arima() function (used for comparison).
    ./bash/auto_arima.sh
or simply run
    Rscript auto_arima.R

- Xiaoqian Wang, Yanfei Kang, Rob J Hyndman, & Feng Li (2022). Distributed ARIMA models for ultra-long time series. International Journal of Forecasting. DOI: 10.1016/j.ijforecast.2022.05.001.