Objective: Forecast day-ahead electricity prices for the Germany-Luxembourg (DE-LU) bidding zone.
Electricity prices are highly volatile and influenced by multiple factors including renewable generation (wind, solar), fossil fuel availability, cross-border electricity flows, and demand patterns. Accurate price forecasting is critical for:
- Energy traders optimizing market positions
- Grid operators balancing supply and demand
- Renewable generators maximizing revenue
- Consumers managing energy costs
This project combines ENTSO-E market data with atmospheric forecast latent states to predict electricity prices, capturing the relationship between weather patterns and market dynamics.
Core Question: "Given a weather forecast valid at a specific time, what will the electricity price (or other market variable) be at that same time?"
To address this, you are provided with two data files:
- entsoe_data_2023.csv - ENTSO-E market data (electricity demand, generation, prices, cross-border flows)
- latent_states_enabling-muskox.zarr - Latent forecast weather dataset (encoded atmospheric predictions)
The ENTSO-E dataset contains data for the DE-LU zone at hourly or 15-minute resolution, covering all of 2023, with the following variables:
price_eur_mwh: Day-ahead electricity market price in EUR per megawatt-hour (hourly resolution)
load_actual_mw: Total electricity consumption across the entire DE-LU zone (households, businesses, industry)
Production columns (*_actual_aggregated_mw): Electricity generated and injected into the grid by all production units of that type:
- gen_actual_biomass_actual_aggregated_mw: Biomass generation
- gen_actual_fossil_brown_coal_lignite_actual_aggregated_mw: Lignite (brown coal) generation
- gen_actual_fossil_hard_coal_actual_aggregated_mw: Hard coal generation
- gen_actual_fossil_coal_derived_gas_actual_aggregated_mw: Coal-derived gas generation
- gen_actual_fossil_gas_actual_aggregated_mw: Natural gas generation
- gen_actual_fossil_oil_actual_aggregated_mw: Oil-based generation
- gen_actual_nuclear_actual_aggregated_mw: Nuclear power generation
- gen_actual_solar_actual_aggregated_mw: Solar photovoltaic generation
- gen_actual_wind_onshore_actual_aggregated_mw: Onshore wind generation
- gen_actual_wind_offshore_actual_aggregated_mw: Offshore wind generation
- gen_actual_hydro_run_of_river_and_poundage_actual_aggregated_mw: Run-of-river hydroelectric generation
- gen_actual_hydro_water_reservoir_actual_aggregated_mw: Reservoir hydroelectric generation
- gen_actual_hydro_pumped_storage_actual_aggregated_mw: Pumped storage generation (when releasing stored energy)
- gen_actual_geothermal_actual_aggregated_mw: Geothermal generation
- gen_actual_waste_actual_aggregated_mw: Waste-to-energy generation
- gen_actual_other_actual_aggregated_mw: Other generation sources
- gen_actual_other_renewable_actual_aggregated_mw: Other renewable sources
Auxiliary consumption columns (*_actual_consumption_mw): Electricity consumed by generation facilities themselves for operations (parasitic load):
- gen_actual_hydro_pumped_storage_actual_consumption_mw: Power used to pump water uphill (storing energy for later release)
- gen_actual_solar_actual_consumption_mw: Auxiliary power for solar farm operations (cooling, controls, monitoring)
- gen_actual_wind_onshore_actual_consumption_mw: Power for wind farm systems (yaw motors, heating, blade de-icing, controls)
Cross-border flows represent electricity exports FROM Germany-Luxembourg TO neighboring countries:
- flow_fr_mw: Exports to France (MW) - Note: Germany often imports FROM France (not captured)
- flow_nl_mw: Exports to Netherlands (MW)
- flow_pl_mw: Exports to Poland (MW)
- flow_cz_mw: Exports to Czech Republic (MW)
- flow_at_mw: Exports to Austria (MW)
- flow_dk1_mw: Exports to Denmark West (MW)
- flow_dk2_mw: Exports to Denmark East (MW)
- flow_se4_mw: Exports to Sweden SE4 (MW)
- flow_ch_mw: Exports to Switzerland (MW)
Current Data Interpretation:
- Values represent electricity exported FROM Germany-Luxembourg TO each neighbor
- All values are ≥0 (no negative values)
- Low/zero values may indicate either no export OR reverse flow (import) - you can't tell which!
To get complete bidirectional flows, query both directions and calculate net flow:
export_flow = client.query_crossborder_flows(DE_LU, neighbor) # Germany → Neighbor
import_flow = client.query_crossborder_flows(neighbor, DE_LU) # Neighbor → Germany
net_flow = export_flow - import_flow # positive=net export, negative=net import
Example: For France in June 2023:
- flow_fr_mw (export) averages ~140 MW (small)
- Missing import flow averages ~1,887 MW (large)
- Germany was a net importer from France (not visible in current data)
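The net-flow arithmetic above can be sketched without any API access. This is a minimal illustration, assuming the export and import series are already aligned on the same timestamps; `net_flows` is a hypothetical helper, and the single-element lists simply reuse the June 2023 averages quoted above rather than real hourly data.

```python
from typing import List


def net_flows(export_mw: List[float], import_mw: List[float]) -> List[float]:
    """Element-wise net flow: positive = net export, negative = net import."""
    if len(export_mw) != len(import_mw):
        raise ValueError("export and import series must cover the same timestamps")
    return [e - i for e, i in zip(export_mw, import_mw)]


# Illustrative values only: the June 2023 France averages quoted above.
export_fr = [140.0]   # DE-LU -> FR (what flow_fr_mw holds)
import_fr = [1887.0]  # FR -> DE-LU (not in the CSV; needs a second query)

net_fr = net_flows(export_fr, import_fr)
print(net_fr)  # [-1747.0]
```

A negative result marks a net import, which is exactly the information that the export-only columns cannot show.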
Atmospheric forecast latent states (64 channels, 3 compressed pressure levels, 90 compressed latitudes, 180 compressed longitudes) covering all of 2023.
Note: The latent inputs have global coverage (not just Europe).
What are these?
- Pre-computed compressed representations of weather forecasts from Jua EPT2 model
- Each latent state represents a weather forecast: given the initial weather state at time t₀, the EPT2 model predicts the weather state at time t = t₀ + leadtime, and this prediction is encoded using a Variational Auto-Encoder (VAE)
- The timestamps in the dataset represent the valid time (t₀ + leadtime), not the initialization time
- Example: A latent state with timestamp "2023-01-01 18:00:00" represents a weather forecast valid at 6 PM on January 1st
- Capture patterns relevant to renewable energy generation (wind speeds, solar radiation, temperature, pressure, etc.)
- Storage location: s3://entsoe-datasets/datasets/latent_states_enabling-muskox.zarr (requires R2 credentials - see "How to Run" section)
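The valid-time relation t = t₀ + leadtime described above can be sketched as follows. This is only an illustration: `valid_time` is a hypothetical helper, and it assumes the lead time is already expressed in plain hours (the stored `leadtimes` variable uses a normalized unit whose conversion is not shown here).

```python
from datetime import datetime, timedelta


def valid_time(init_time: datetime, lead_hours: float) -> datetime:
    """Return the time a forecast is valid for: t0 + leadtime."""
    return init_time + timedelta(hours=lead_hours)


# Hypothetical forecast: initialized at 06:00 with a 12-hour lead time.
t0 = datetime(2023, 1, 1, 6, 0)
vt = valid_time(t0, 12)
print(vt)  # 2023-01-01 18:00:00
```

The market value to pair with such a forecast is the one observed at the valid time, not at the initialization time.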
Zarr Dataset Structure:
Dimensions: (time_idx: 10093, channel: 64, level: 3, y: 90, x: 180)
Data variables:
- latent_states_input (time_idx, channel, level, y, x) float16 - EPT2 forecasted weather encoded by VAE
- latent_states_target (time_idx, channel, level, y, x) float16 - Actual weather state at target time (DO NOT USE)
- timestamps (time_idx) datetime64[ns] - Forecast initialization times
- leadtimes (time_idx) float32 - Forecast lead times in normalized hours (N/12)
- running_mean (channel) float32 - Channel-wise normalization means
- running_variance (channel) float32 - Channel-wise normalization variances
- z_original_surf (time_idx) float32 - Surface geopotential height
- z_original_atmos (time_idx) float32 - Atmospheric reference height
- latent_states_input: EPT2 weather predictions encoded by VAE (this is what you should use)
- latent_states_target: Actual atmospheric state at the forecast target time (NOT for use in this task - included only for reference)
- 64 channels: Learned atmospheric features from VAE encoder
- 3 levels: Vertical atmospheric layers (surface, mid-level, upper level)
- 90 × 180 grid: Global spatial coverage at reduced resolution (approximately 2° × 2°)
- 10,093 timesteps: Hourly data throughout 2023 (note: 1,440 samples are repeated timestamps; the dataset class handles deduplication automatically)
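If you read the Zarr store directly instead of going through the dataset class, the repeated timestamps need handling yourself. The sketch below shows one possible policy, keeping the first occurrence of each timestamp; this policy and the helper name `first_occurrence_indices` are assumptions, and the provided dataset class may resolve duplicates differently.

```python
from datetime import datetime
from typing import List, Sequence


def first_occurrence_indices(timestamps: Sequence[datetime]) -> List[int]:
    """Return the index of the first occurrence of each timestamp, in order."""
    seen = set()
    keep = []
    for idx, ts in enumerate(timestamps):
        if ts not in seen:
            seen.add(ts)
            keep.append(idx)
    return keep


stamps = [
    datetime(2023, 1, 1, 0),
    datetime(2023, 1, 1, 1),
    datetime(2023, 1, 1, 1),  # repeated timestamp
    datetime(2023, 1, 1, 2),
]
unique_idx = first_occurrence_indices(stamps)
print(unique_idx)  # [0, 1, 3]
```

The returned indices can then be used to select along the `time_idx` dimension.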
The provided latent_price_dataset.py demonstrates how to combine these two data sources:
Input: Single latent weather forecast state (already valid at timestamp + leadtime)
Output: Corresponding ENTSO-E variable(s) at the same valid time
Key Concept: The latent states are already forecast states from the model. Each latent state represents a weather forecast valid at a specific time (initialization time + leadtime). The dataset simply maps each latent state to the corresponding ENTSO-E variable at that same time.
from latent_price_dataset import LatentForecastDataset
# Single target variable (e.g., electricity price)
dataset = LatentForecastDataset(
latent_zarr_path="s3://entsoe-datasets/datasets/latent_states_enabling-muskox.zarr",
entsoe_csv_path="entsoe_data_2023.csv",
target_variable="price_eur_mwh", # Can be any column from the CSV
start_date="2023-01-01",
end_date="2023-12-25",
channels=None, # Use all 64 channels (or select subset like [0, 1, 2])
normalize_target=True
)
# Get a sample
latent_input, target_value, metadata = dataset[0]
print(f"Latent input shape: {latent_input.shape}") # (64, 3, 90, 180) - single state
print(f"Target value shape: {target_value.shape}") # () - scalar
print(f"Target value: {target_value.item():.2f}") # e.g., -0.45 (normalized)
print(f"Timestamp: {metadata['timestamp']}") # e.g., 2023-01-01 06:00:00
# Multiple target variables (e.g., price + solar + load)
dataset_multi = LatentForecastDataset(
latent_zarr_path="s3://entsoe-datasets/datasets/latent_states_enabling-muskox.zarr",
entsoe_csv_path="entsoe_data_2023.csv",
target_variable=["price_eur_mwh", "gen_actual_solar_actual_aggregated_mw", "load_actual_mw"],
start_date="2023-01-01",
end_date="2023-12-25",
channels=None,
normalize_target=True
)
latent_input, target_values, metadata = dataset_multi[0]
print(f"Latent input shape: {latent_input.shape}") # (64, 3, 90, 180) - single state
print(f"Target values shape: {target_values.shape}") # (3,) - one value per target
Note: LatentPriceDataset is still available for backward compatibility but is deprecated. Use LatentForecastDataset instead.
The dataset implements a direct mapping approach: each sample consists of a single latent weather forecast state and its corresponding ENTSO-E value at the same valid time. This design reflects the fact that:
- Latent states are already forecasts: Each latent state from the EPT2 model represents a weather prediction valid at a specific time (initialization + leadtime)
- Simple, clean I/O: One weather state → one market value makes training straightforward
- Flexible composition: You can create sequences or forecast horizons in your model architecture, not in the dataset
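As one way to build sequences outside the dataset, the sketch below groups sample indices into fixed-length windows of evenly spaced timestamps. It is a hedged illustration: `consecutive_windows` is a hypothetical helper, it assumes the timestamps are sorted, and the one-hour default step is an assumption about the sampling interval.

```python
from datetime import datetime, timedelta
from typing import List, Sequence


def consecutive_windows(
    timestamps: Sequence[datetime],
    length: int,
    step: timedelta = timedelta(hours=1),
) -> List[List[int]]:
    """Return index windows of `length` samples spaced exactly `step` apart."""
    windows = []
    for start in range(len(timestamps) - length + 1):
        idx = list(range(start, start + length))
        # Reject windows that straddle a gap in the time axis.
        gaps_ok = all(
            timestamps[b] - timestamps[a] == step for a, b in zip(idx, idx[1:])
        )
        if gaps_ok:
            windows.append(idx)
    return windows


base = datetime(2023, 1, 1, 0)
# Hours 0, 1, 2, then a gap, then 5 and 6.
stamps = [base + timedelta(hours=h) for h in (0, 1, 2, 5, 6)]
pairs = consecutive_windows(stamps, length=2)
print(pairs)  # [[0, 1], [1, 2], [3, 4]]
```

Each window is a list of dataset indices, so a wrapper can fetch `dataset[i]` for every index and stack the latent states before passing them to a sequence model.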
Shape Reference:
- Single target: latent (C, L, H, W) → target () scalar value
- Multi-target: latent (C, L, H, W) → target (N,) one value per target
where C=channels, L=3 levels, H=90 height, W=180 width, N=num_targets
What the dataset does:
- Loads and aligns data sources:
# From __init__ method
self.ds = xr.open_zarr(latent_zarr_path) # Load weather latents
self.entsoe_df = pd.read_csv(entsoe_csv_path, parse_dates=True) # Load ENTSO-E data
self.entsoe_df = self.entsoe_df.dropna(subset=self.target_variables) # Keep only complete data
# Match latent timestamps with ENTSO-E timestamps
valid_entsoe_times = set(self.entsoe_df.index)
entsoe_mask = np.array([ts in valid_entsoe_times for ts in self.latent_timestamps])
self.latent_timestamps = self.latent_timestamps[entsoe_mask] # Keep only matching times
- Returns single latent state and corresponding target value(s):
# From __getitem__ method
timestamp = self.latent_timestamps[idx]
zarr_idx = self.latent_time_indices[idx]
# Load single latent state at this timestamp
latent_data = self.ds.latent_states_input.isel(
time_idx=zarr_idx,
channel=self.channels,
).values # Shape: (channels, levels, height, width)
# Get target value(s) at the same timestamp
target_data = self.entsoe_df.loc[timestamp, self.target_variables].values
# Normalize if requested, using each variable's own statistics
if self.normalize_target:
    for i, var in enumerate(self.target_variables):
        mean = self.target_stats[var]["mean"]
        std = self.target_stats[var]["std"]
        target_data[i] = (target_data[i] - mean) / std
Key Features:
- Flexible target variables: Use any column(s) from the ENTSO-E CSV as prediction targets
- Single or multi-target: Predict one variable or multiple variables simultaneously
- Direct time matching: Each latent state maps to its corresponding ENTSO-E value at the same time
- Clean I/O: Single latent state in → single (or multi) value out
- Optional normalization for stable training (per-variable statistics)
- Returns scalar for single target, 1D array for multiple targets
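Because targets are normalized per variable, model outputs have to be mapped back to physical units (for example EUR/MWh) before they are interpreted. The sketch below shows the round trip under stated assumptions: statistics are a mean and a standard deviation per variable, as in the excerpt above; `fit_stats`, `normalize` and `denormalize` are hypothetical helpers, not functions from latent_price_dataset.py; and the population standard deviation is used here, which may differ from what the dataset class computes.

```python
from statistics import fmean, pstdev
from typing import Dict, List

Stats = Dict[str, float]


def fit_stats(columns: Dict[str, List[float]]) -> Dict[str, Stats]:
    """Compute a mean and a population standard deviation per variable."""
    return {
        name: {"mean": fmean(vals), "std": pstdev(vals)}
        for name, vals in columns.items()
    }


def normalize(value: float, stats: Stats) -> float:
    """Map a raw value to a z-score."""
    return (value - stats["mean"]) / stats["std"]


def denormalize(z: float, stats: Stats) -> float:
    """Map a z-score (e.g. a model prediction) back to physical units."""
    return z * stats["std"] + stats["mean"]


# Toy prices in EUR/MWh, not real market data.
stats = fit_stats({"price_eur_mwh": [90.0, 110.0]})
price_stats = stats["price_eur_mwh"]

z = normalize(110.0, price_stats)
print(z)                            # 1.0
print(denormalize(z, price_stats))  # 110.0
```

To avoid leaking information, the statistics would normally be fitted on the training period only and reused unchanged for validation and test data.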
- latent_price_dataset.py: PyTorch Dataset class that loads latent weather forecasts and aligns them with any ENTSO-E target variable(s)
- test_dataset_integration.py: Comprehensive integration tests for the dataset (works with synthetic data, no credentials needed)
- test_dataset_refactor.py: Unit tests for dataset logic and functionality
- load_entsoe_data.py: Script to download and plot historical ENTSO-E market data (generation, demand, flows, prices) via API
  ⚠️ The data has already been downloaded for you, but feel free to look at the code or download more if needed
- plot_entsoe_data.py: Visualization functions for market data overview
Test the dataset implementation without requiring cloud credentials:
uv run python test_dataset_integration.py
This runs comprehensive tests including:
- Single and multiple target variables
- Normalized and non-normalized outputs
- Backward compatibility checks
- Edge cases and robustness
- Optional real data test (if credentials available)
- The latent states are FORECASTED atmospheric conditions, not historical observations
- This dataset shows one possible approach: using weather forecasts to predict price movements
- The weather-price relationship is particularly relevant for Germany with high renewable penetration
🔓 Participants are encouraged to extend this dataset
You should feel free to:
- Add HISTORICAL data from ENTSO-E (past generation, demand, prices, flows)
- Incorporate time-based features (hour of day, day of week, seasonality)
- Use lagged price values (autoregressive features)
- Add fuel prices, carbon prices, or other market indicators
- Engineer features from cross-border flow patterns
- Combine multiple data sources in creative ways
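Two of the extensions listed above, time-based features and lagged prices, can be sketched with the standard library alone. This is an illustration under assumptions: `time_features` and `lagged` are hypothetical helpers, the series is assumed to be sorted and evenly spaced, and the toy prices are not real market data.

```python
import math
from datetime import datetime
from typing import Dict, List, Optional


def time_features(ts: datetime) -> Dict[str, float]:
    """Cyclic hour-of-day encoding plus day-of-week and a weekend flag."""
    angle = 2 * math.pi * ts.hour / 24
    return {
        "hour_sin": math.sin(angle),
        "hour_cos": math.cos(angle),
        "day_of_week": float(ts.weekday()),  # Monday = 0
        "is_weekend": float(ts.weekday() >= 5),
    }


def lagged(values: List[float], lag: int) -> List[Optional[float]]:
    """Shift a series by `lag` steps; the first `lag` entries have no history."""
    if lag < 1:
        raise ValueError("lag must be a positive number of steps")
    n = len(values)
    return [None] * min(lag, n) + values[: max(n - lag, 0)]


feats = time_features(datetime(2023, 1, 1, 6))  # a Sunday morning
print(feats["day_of_week"], feats["is_weekend"])  # 6.0 1.0

prices = [10.0, 20.0, 30.0, 40.0]  # toy values
lag1 = lagged(prices, 1)
print(lag1)  # [None, 10.0, 20.0, 30.0]
```

The sine/cosine pair keeps 23:00 and 00:00 close together in feature space, which a raw hour number does not. When choosing lags, use only values that would already be known at the moment the forecast is made.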
The provided code in load_entsoe_data.py shows how to fetch historical market data from the ENTSO-E API.
Training data covers: January 2023 - December 2023
- Set up Cloudflare R2 credentials (required to access the latent states dataset):
export R2_ACCESS_KEY=your_access_key_here
export R2_SECRET_KEY=your_secret_key_here
- Install uv (if not already installed):
curl -LsSf https://astral.sh/uv/install.sh | sh
or
wget -qO- https://astral.sh/uv/install.sh | sh
Run the provided dataset example to load and visualize the latent-price dataset:
uv run python3 latent_price_dataset.py
This will:
- Load the latent weather states from S3/R2
- Load the ENTSO-E price data from CSV
- Create PyTorch datasets (single and multi-target examples)
- Print dataset statistics and sample information
- Demonstrate both single-target and multi-target usage
Note: The demo shows how latent weather forecasts at specific times map to corresponding electricity market variables at those same times.
If you prefer to work with raw weather forecast data instead of the pre-computed latent states, you can access:
🌍 NOAA GFS (Global Forecast System) - Real-time global weather forecasts
- Source: dynamical.org
- Coverage: Global, 0.25° resolution (~20km)
- Variables: Temperature, wind, precipitation, pressure, humidity, radiation, and more
- Forecast horizon: 0-384 hours (16 days)
🌍 ERA5 ARCO (ECMWF Reanalysis) - Historical weather reanalysis
- Source: Google Cloud Public Datasets
- Coverage: Global, 0.25° resolution
- Variables: 100+ atmospheric variables at multiple pressure levels
- Time range: 1940-present, hourly resolution
Loading GFS forecast data for 2023:
import xarray as xr
# Open GFS forecast dataset
ds_gfs = xr.open_zarr(
"https://data.dynamical.org/noaa/gfs/forecast/[email protected]",
chunks="auto"
)
# Filter to 2023 date range
ds_gfs_2023 = ds_gfs.sel(init_time=slice("2023-01-01", "2023-12-31"))
print(f"Available variables: {list(ds_gfs_2023.data_vars)}")
# Access specific variables like temperature_2m, wind_u_10m, precipitation_surface, etc.
Loading ERA5 ARCO data for 2023:
import xarray as xr
# Open ERA5 ARCO dataset (single-level variables)
ds_era5 = xr.open_zarr(
"gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
chunks="auto",
storage_options={"token": "anon"} # Anonymous access
)
# Filter to 2023 date range
ds_era5_2023 = ds_era5.sel(time=slice("2023-01-01", "2023-12-31"))
print(f"Available variables: {list(ds_era5_2023.data_vars)}")
# Access variables like wind components, temperature, pressure, etc.
These raw weather datasets can be combined with ENTSO-E market data to create custom features for price forecasting.