Computing at Scale Tutorial
Learn how to use TimeGPT with distributed computing frameworks for processing large datasets.
Handling large datasets is a common challenge in time series forecasting. For example, when working with retail data, you may need to forecast sales for thousands of products across hundreds of stores. Similarly, when dealing with electricity consumption data, you may need to predict consumption for thousands of households across multiple regions.
Nixtla’s TimeGPT lets you scale these operations efficiently by integrating with several distributed computing frameworks. Currently, Spark, Dask, and Ray are supported through Fugue.
Figure: High-level overview of distributed time series forecasting with TimeGPT.
1. Getting Started
Upon registration, you will receive an email prompting you to confirm your signup. Once confirmed, you can access your dashboard. Navigate to the API Keys section to retrieve your key.
2. Forecasting at Scale
Using TimeGPT with distributed computing frameworks is straightforward. The process differs only slightly from non-distributed usage.
1. Instantiate a NixtlaClient class
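As a minimal sketch of this step, the snippet below builds a client from an API key kept in an environment variable rather than in the source code. It assumes the `nixtla` package is installed; the variable name `NIXTLA_API_KEY` and the helper names `read_api_key` and `make_client` are choices made for this illustration.

```python
import os


def read_api_key(env_var: str = "NIXTLA_API_KEY") -> str:
    """Return the API key stored in the given environment variable."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set {env_var} to the key shown in your dashboard.")
    return api_key


def make_client():
    """Instantiate NixtlaClient (assumes the `nixtla` package is installed)."""
    # Imported lazily so that read_api_key() has no third-party dependencies.
    from nixtla import NixtlaClient

    return NixtlaClient(api_key=read_api_key())
```

Calling `nixtla_client = make_client()` then gives you a client to reuse in the remaining steps.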
2. Load your data into a pandas DataFrame
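The snippet below is a small, made-up example of the long format this step relies on: one row per series and timestamp. The column names `unique_id`, `ds`, and `y` are assumed here as the identifier, timestamp, and target columns, and the store names and values are invented for illustration.

```python
import pandas as pd

# Two toy series (one per store), three daily observations each.
df = pd.DataFrame(
    {
        "unique_id": ["store_1"] * 3 + ["store_2"] * 3,
        "ds": list(pd.date_range("2024-01-01", periods=3, freq="D")) * 2,
        "y": [10.0, 12.0, 11.0, 20.0, 19.0, 23.0],
    }
)


def has_unique_timestamps(frame: pd.DataFrame) -> bool:
    """Check that each (unique_id, ds) pair appears at most once."""
    return not frame.duplicated(subset=["unique_id", "ds"]).any()
```

Running `has_unique_timestamps(df)` before distributing the data is a cheap way to catch series that were merged under the same identifier.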
Make sure your data is properly formatted, with each time series uniquely identified (e.g., by store or product).
3. Initialize a distributed computing framework (Spark, Dask, or Ray)
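Before forecasting, one of the supported frameworks (Spark, Dask, or Ray) has to be running. The following is a hedged sketch of how each could be started locally; it assumes the relevant package (`pyspark`, `dask[distributed]`, or `ray`) is installed, and the helper names and the `timegpt-at-scale` application name are hypothetical. Cluster deployments usually need extra configuration that is not shown here.

```python
def start_spark(app_name: str = "timegpt-at-scale"):
    """Create (or reuse) a local SparkSession."""
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def start_dask():
    """Start a local Dask cluster and return a client connected to it."""
    from dask.distributed import Client

    return Client()


def start_ray():
    """Initialize Ray locally and return the ray module as a handle."""
    import ray

    ray.init(ignore_reinit_error=True)
    return ray


# Maps a framework name to the function that starts it.
STARTERS = {"spark": start_spark, "dask": start_dask, "ray": start_ray}


def start_framework(name: str):
    """Start the framework called `name` ('spark', 'dask', or 'ray')."""
    try:
        starter = STARTERS[name.lower()]
    except KeyError:
        raise ValueError(f"Unsupported framework: {name!r}") from None
    return starter()
```

With Spark, for example, `spark = start_framework("spark")` followed by `spark.createDataFrame(df)` converts a pandas DataFrame into a Spark DataFrame.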
4. Use NixtlaClient methods to forecast at scale
Once your framework is initialized and your data is loaded, you can call the NixtlaClient forecasting methods on the distributed DataFrame.
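A minimal sketch of the forecasting call is shown below. It assumes `client` is a `NixtlaClient` instance and `distributed_df` is a Spark, Dask, or Ray DataFrame with the columns `unique_id`, `ds`, and `y`; the wrapper name `forecast_at_scale` and the horizon used in the usage note are illustrative only.

```python
def forecast_at_scale(client, distributed_df, horizon: int):
    """Forecast `horizon` steps ahead for every series in `distributed_df`."""
    return client.forecast(
        df=distributed_df,
        h=horizon,               # number of future steps per series
        id_col="unique_id",      # column that identifies each series
        time_col="ds",           # timestamp column
        target_col="y",          # value to forecast
    )
```

For example, `fcst_df = forecast_at_scale(nixtla_client, spark_df, horizon=12)` requests 12 future steps for each series; the call itself looks the same as in the non-distributed case.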
5. Stop the distributed computing framework
When you’re finished, you may need to terminate your Spark, Dask, or Ray session. This depends on your environment and setup.
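How a session is shut down depends on the framework: a SparkSession exposes `stop()`, a Dask `Client` exposes `close()`, and the `ray` module exposes `shutdown()`. The helper below is one possible sketch that tries these in turn; `stop_framework` is a hypothetical name, and managed environments may handle teardown for you.

```python
def stop_framework(handle) -> str:
    """Call the first available teardown method on `handle`.

    `handle` can be a SparkSession (stop), a Dask Client (close),
    or the ray module (shutdown). Returns the name of the method called.
    """
    for method_name in ("stop", "close", "shutdown"):
        method = getattr(handle, method_name, None)
        if callable(method):
            method()
            return method_name
    raise TypeError(f"Cannot stop object of type {type(handle).__name__}")
```

Calling it inside a `finally` block helps ensure that resources are released even if forecasting fails.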
3. Important Considerations
Key Concept: Time Series Forecasting at Scale
Distribute your forecasts across multiple compute nodes to handle datasets that would otherwise exhaust the memory or compute resources of a single machine.
Key Concept: Parallelization
Make sure your data has distinct identifiers for each series. Correct labeling is crucial for successful multi-series parallel forecasts.
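To see why distinct identifiers matter for parallel work, consider the conceptual sketch below. It groups rows by series identifier so that every observation of a series lands in the same partition. This is an illustration only, not how Fugue, Spark, Dask, Ray, or TimeGPT partition data internally; the function name, the `unique_id` key, and the hashing scheme are all assumptions.

```python
from collections import defaultdict
from zlib import crc32


def partition_by_series(rows, n_partitions: int, id_key: str = "unique_id"):
    """Assign each row to a partition based on a hash of its series identifier."""
    partitions = defaultdict(list)
    for row in rows:
        # crc32 is deterministic, so a given identifier always maps to the same partition.
        index = crc32(str(row[id_key]).encode("utf-8")) % n_partitions
        partitions[index].append(row)
    return dict(partitions)
```

If two different series shared an identifier, their rows would be grouped together and treated as a single series, which is why correct labeling has to happen before the data is distributed.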
With these guidelines, you can efficiently forecast large-scale time series data using TimeGPT and the distributed computing framework that best fits your environment.