Learn how to run TimeGPT in a distributed manner on Spark for scalable forecasting and cross-validation.
Run TimeGPT in a Distributed Manner on Spark
Spark is an open-source distributed compute framework designed for large-scale data processing. With Spark, you can seamlessly scale your Python-based workflows for big data analytics and machine learning tasks. This tutorial demonstrates how to use TimeGPT with Spark to perform forecasting and cross-validation.
If you are executing on a distributed Spark cluster, make sure the nixtla library (and any of its dependencies) is installed on all worker nodes so that execution is consistent across the cluster.
1. Installation
Fugue provides a convenient interface to distribute Python code across frameworks like Spark.
Install fugue with Spark support:
To work with TimeGPT, make sure you have the nixtla library installed as well.
2. Load Data
Load the dataset into a pandas DataFrame. In this example, we use hourly electricity price data from different markets.
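A minimal sketch of this step. The actual tutorial reads a hosted CSV of hourly electricity prices; a small synthetic stand-in (with illustrative market names) is built here so the snippet is self-contained, using the long format TimeGPT expects: a series identifier (unique_id), a timestamp (ds), and a target value (y).

```python
import pandas as pd

# In practice you would load the real data with something like:
#   df = pd.read_csv("<dataset-url>", parse_dates=["ds"])
# Here we build a synthetic stand-in: two markets, 24 hourly rows each.
frames = []
for market in ["BE", "DE"]:  # market names are illustrative
    frames.append(
        pd.DataFrame(
            {
                "unique_id": market,
                "ds": pd.date_range("2016-10-22", periods=24, freq="h"),
                "y": [float(i) for i in range(24)],
            }
        )
    )
df = pd.concat(frames, ignore_index=True)
print(df.head())
```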
3. Initialize Spark
Create a Spark session and convert your pandas DataFrame to a Spark DataFrame:
4. Use TimeGPT on Spark
Using TimeGPT with Spark is very similar to using it locally. The main difference is that you work with Spark DataFrames instead of pandas DataFrames.
TimeGPT can handle large-scale data when distributed via Spark, allowing you to scale your time series forecasting tasks efficiently.
For including exogenous variables with TimeGPT on Spark, use Spark DataFrames instead of pandas DataFrames, as demonstrated in the Exogenous Variables tutorial.
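A minimal sketch of the exogenous case; the function name is hypothetical, and the argument names mirror the client's forecast API, where X_df carries the future values of the exogenous variables:

```python
def forecast_with_exogenous(nixtla_client, spark_df, future_exog_spark_df):
    # Historical data (including exogenous columns) and the future
    # values of the exogenous variables are both passed as Spark
    # DataFrames.
    return nixtla_client.forecast(
        df=spark_df,               # unique_id, ds, y, plus exogenous columns
        X_df=future_exog_spark_df, # unique_id, ds, plus future exogenous values
        h=24,
        time_col="ds",
        target_col="y",
    )
```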
5. Stop Spark
After completing your tasks, stop the Spark session to free resources: