The term "Transformers" is well-known today Transformers have significantly broadened the potential for computational models by enhancing their ability to process data in parallel, thus improving efficiency and performance. In NLP, the success of Transformers has been most notable, leading to the development of Large Language Models (LLMs) such as the ChatGPT, Claude, and many others. These models excel in understanding and generating human-like text, making them vital for applications ranging from interactive chatbots to advanced text analysis.
This article explores one branch of the Transformer family, the Autoformer, and its impact on time series forecasting. We will examine the features that set Autoformers apart from standard Transformers, how they improve upon traditional methods for time series analysis, and their practical uses in various industries.
What is Time Series Forecasting?
Time series data is a sequence of data points collected or recorded at specific time intervals, for instance, monitoring your daily steps using a fitness tracker to analyze activity patterns over time. This type of data is common in various fields, such as finance, weather, healthcare, and retail, where tracking changes over time is crucial for analysis and decision-making.
Traditionally, time series forecasting has relied on models like ARIMA (AutoRegressive Integrated Moving Average) and Exponential Smoothing. ARIMA models use past values and the relationships between them to predict future points. Exponential Smoothing methods apply weighted averages of past observations, meaning they give more importance to recent data points while still considering older data. This helps to smooth out fluctuations and highlight trends more clearly.
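To make the idea of weighted averages concrete, here is a minimal sketch of simple exponential smoothing in plain Python (the smoothing factor and the toy data are made up purely for illustration):

# Minimal sketch of simple exponential smoothing (illustrative only).
# alpha close to 1 -> recent observations dominate; close to 0 -> smoother output.
def exponential_smoothing(series, alpha=0.3):
    smoothed = [series[0]]  # start from the first observation
    for value in series[1:]:
        # new estimate = alpha * current value + (1 - alpha) * previous estimate
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

# Toy example: a short, noisy upward trend
data = [10, 12, 11, 15, 14, 18, 17, 21]
print(exponential_smoothing(data))

Each smoothed value blends the newest observation with everything seen before, which is exactly the "more weight to recent data" behaviour described above.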
But, as usual, not all that glitters is gold, and traditional methods face several challenges. They often struggle to handle large volumes of data and to capture complex patterns, especially non-linear relationships and long-term dependencies. On top of that, these models usually require significant manual tuning and domain expertise to achieve accurate forecasts, which limits their scalability and adaptability to diverse datasets.
Transformers
Unlike older models, such as Recurrent Neural Networks (RNNs), Transformers can process entire sequences of data at once, rather than piece by piece. This makes them incredibly fast and efficient, especially when dealing with large datasets.
What makes Transformers special is their self-attention mechanism. This allows the model to dynamically focus on the most relevant parts of the input data. Think of it like having a built-in highlighter that enables the model to focus on important details. This feature is crucial for capturing long-range dependencies and complex patterns, areas where traditional models often struggle.
Transformers X-Ray
Now, let us try to understand a bit better what makes Transformers special by walking through their architecture and mechanisms.
Encoder-Decoder Structure
Transformers use an encoder-decoder structure, where:
- Encoder: This part takes in the input data and turns it into an internal format that the model can understand.
- Decoder: This part takes the internal format from the encoder and turns it into the final output data.
In time series forecasting, we usually focus more on the encoder part. But it's good to know how both parts work together to make the magic happen.
Self-Attention Mechanism
Here it is, the heart of it all: the self-attention mechanism. In simpler words, this amazing system helps the model decide which parts of the input data are most important. Let’s break it down:
- Query (Q): Think of this as the part of the data we're currently interested in.
- Key (K): These are all the other parts of the data that might be relevant.
- Value (V): This is the actual information or content associated with each part of the data.
Here’s how it works: the model looks at the Query and compares it to all the Keys. This comparison helps the model figure out which Values (pieces of data) are the most important to focus on by calculating attention scores. In simple terms, it’s like reading a book and using a highlighter to mark the important sentences.
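To make that concrete, here is a rough PyTorch sketch of scaled dot-product attention, the core computation behind the highlighter analogy (the tensor sizes and values below are arbitrary, chosen only for illustration):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Compare each Query with every Key to get raw similarity scores
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    # Turn the scores into weights that sum to 1 (the "highlighting")
    weights = F.softmax(scores, dim=-1)
    # Blend the Values according to those weights
    return weights @ V

# Toy example: a sequence of 5 positions with 8-dimensional embeddings
x = torch.randn(1, 5, 8)
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q, K and V all come from x
print(out.shape)  # torch.Size([1, 5, 8])

The output has the same shape as the input, but each position is now a weighted mix of the positions it paid the most attention to.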
Positional Encoding
One key difference from their predecessors, RNNs, is that instead of processing the data sequentially, piece by piece, Transformers take in the entire sequence at once. Because of that, they lose any built-in notion of the original order of the data points. This is where positional encoding comes in.
Positional encoding adds information about the order of the data points to the input. It uses patterns (sine and cosine functions) to give each position a unique code. This way, the model knows, for example when working with text, which word comes first, second, and so on. For time series, it preserves the natural temporal order of the measurements.
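As a reference, here is a minimal sketch of the sinusoidal positional encoding described above (the sequence length and embedding size are arbitrary):

import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # Each position gets a unique pattern of sine and cosine values
    position = torch.arange(seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine
    return pe

# 20 time steps, embedding size 16; this tensor is simply added to the input embeddings
print(sinusoidal_positional_encoding(20, 16).shape)  # torch.Size([20, 16])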
Multi-Head Attention
To capture different types of relationships in the data, Transformers use a technique called multi-head attention. It is like having multiple pairs of eyes, each looking at the data from a different angle. Each "eye" (or head) focuses on different parts of the data. For example, when analyzing a sentence, one head might focus on the relationships between subjects and verbs, while another might focus on the connections between adjectives and nouns. These multiple perspectives are then combined into a comprehensive view, allowing the model to understand complex patterns and relationships much better than a single angle would.
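As a quick sketch, PyTorch ships a ready-made multi-head attention layer, so we can see the idea without writing each head by hand (the sizes below are arbitrary):

import torch
import torch.nn as nn

# 8 heads, each looking at the same 32-dimensional sequence from a different "angle"
mha = nn.MultiheadAttention(embed_dim=32, num_heads=8, batch_first=True)

x = torch.randn(1, 10, 32)  # toy batch: 1 sequence, 10 positions, 32 features
# Self-attention: the sequence attends to itself; the heads' outputs are combined internally
output, attn_weights = mha(x, x, x)
print(output.shape)        # torch.Size([1, 10, 32])
print(attn_weights.shape)  # torch.Size([1, 10, 10]), averaged over heads by default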
Feed-Forward Networks
After the multi-head attention has done its job, the data moves through a feed-forward network. This unit further refines the data by passing it through two linear transformations, with a ReLU activation in between that introduces non-linearity and adds flexibility. This extra processing helps the Transformer make more accurate and detailed predictions.
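In code, this position-wise feed-forward block boils down to two linear layers with a ReLU in between; here is a minimal sketch with made-up dimensions:

import torch
import torch.nn as nn

d_model, d_ff = 32, 128  # the hidden layer is typically wider than the model dimension

feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),  # expand
    nn.ReLU(),                 # the non-linearity in between
    nn.Linear(d_ff, d_model),  # project back to the model dimension
)

x = torch.randn(1, 10, d_model)   # output of the attention block (toy values)
print(feed_forward(x).shape)      # torch.Size([1, 10, 32])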
Layer Normalization and Residual Connections
To make the training process smoother and faster, Transformers use two key techniques, layer normalization and residual connections (see the sketch after this list):
- Layer Normalization: This adjusts the data within each layer so that it has a consistent scale and distribution, which helps the model learn more effectively.
- Residual Connections: These act like shortcuts, adding the input of a layer to its output. This helps prevent problems during training, such as the vanishing gradient problem, and makes it easier for the model to learn complex patterns without getting stuck.
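Putting the two together, a typical Transformer sub-layer looks roughly like this (a minimal sketch; real implementations differ in details such as whether normalization is applied before or after the sub-layer):

import torch
import torch.nn as nn

class SubLayerBlock(nn.Module):
    """Minimal sketch: residual connection + layer normalization around any sub-layer."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer            # e.g. attention or the feed-forward network
        self.norm = nn.LayerNorm(d_model)   # keeps activations at a consistent scale

    def forward(self, x):
        # The "shortcut": add the input back to the sub-layer's output, then normalize
        return self.norm(x + self.sublayer(x))

block = SubLayerBlock(32, nn.Linear(32, 32))  # toy sub-layer just for illustration
print(block(torch.randn(1, 10, 32)).shape)    # torch.Size([1, 10, 32])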
Introducing Autoformers: Tailoring Transformers for Time Series
Autoformers are a specialized evolution of Transformers designed specifically for long-term time series forecasting. Below is a short breakdown of their key pieces and architecture so we can understand what makes them stand out. For further details, you can access the original paper here.
Series Decomposition Block
Autoformer integrates series decomposition directly into its architecture. This block separates the input data into trend and seasonal components, allowing the model to focus on these distinct patterns separately.
- Trend Component: Represents the long-term progression in the data.
- Seasonal Component: Captures repeating patterns and fluctuations.
This method is distinctly different from traditional transformers, which typically process data without such decomposition, as other types of data, like text or images, do not inherently possess these attributes.
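To build intuition, here is a simplified, standalone sketch of the decomposition idea: take a moving average as the trend and treat whatever is left over as the seasonal part. The actual block in the paper applies this moving-average decomposition inside the network; this toy version is only for illustration.

import torch

def series_decomposition(x, kernel_size=25):
    # Simplified sketch: the trend is a moving average of the series,
    # and the seasonal part is whatever is left over.
    # x has shape (batch, length); pad the edges so the output keeps the same length.
    pad = (kernel_size - 1) // 2
    padded = torch.cat(
        [x[:, :1].repeat(1, pad), x, x[:, -1:].repeat(1, kernel_size - 1 - pad)], dim=1
    )
    trend = padded.unfold(dimension=1, size=kernel_size, step=1).mean(dim=-1)
    seasonal = x - trend
    return seasonal, trend

# Toy series: a slow ramp plus a fast oscillation
t = torch.arange(0, 100).float().unsqueeze(0)
x = 0.05 * t + torch.sin(t / 3)
seasonal, trend = series_decomposition(x)
print(seasonal.shape, trend.shape)  # torch.Size([1, 100]) torch.Size([1, 100])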
Auto-Correlation Mechanism
Instead of relying on the self-attention mechanism used in traditional Transformers, Autoformer employs an Auto-Correlation mechanism. This technique takes advantage of the periodic nature of time series data to find dependencies and aggregate information more efficiently. It does so by identifying and aggregating similar sub-series based on their periodicity, reducing the computational complexity to O(L log L). This is a major improvement over the quadratic O(L²) complexity of standard self-attention, because O(L log L) growth means the computational resources required increase much more slowly as the length of the series grows. In practical terms, Autoformer can handle longer sequences more efficiently, making it faster and more scalable.
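To get a feel for the period-based idea (this is only a toy sketch, not the Autoformer implementation itself), the autocorrelation of a series can be computed with a Fast Fourier Transform in O(L log L), and its peaks reveal the dominant periods:

import math
import torch

def autocorrelation(x):
    # Wiener-Khinchin theorem: the autocorrelation is the inverse FFT of the
    # power spectrum, which costs O(L log L) instead of comparing every pair of lags.
    L = x.size(-1)
    spectrum = torch.fft.rfft(x - x.mean(), n=2 * L)  # zero-pad to avoid circular wrap-around
    acf = torch.fft.irfft(spectrum * torch.conj(spectrum), n=2 * L)[..., :L]
    return acf / acf[..., :1]                         # normalize so lag 0 equals 1

# Toy series with a period of 24 "hours" plus some noise
t = torch.arange(0, 24 * 10).float()
x = torch.sin(2 * math.pi * t / 24) + 0.1 * torch.randn_like(t)
acf = autocorrelation(x)
print(int(acf[1:50].argmax()) + 1)  # the strongest short lag, expected to be close to 24

The lags with the highest autocorrelation point to the most likely periods, which is the kind of information the Auto-Correlation mechanism uses to decide which sub-series to aggregate.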
Efficient Encoding and Decoding
The encoder in Autoformer is designed to model the seasonal part of the data, eliminating the long-term trend during processing. Meanwhile, the decoder progressively refines the trend predictions, using information from the encoder to improve accuracy.
Practical Application: Using Autoformer with Hugging Face
To see Autoformer in action, we'll use the Hugging Face Transformers library to implement it for a real-world scenario. The example uses a pre-trained model based on the dataset from the paper The tourism forecasting competition, which can be found in the Monash Time Series Forecasting Repository on Hugging Face. You can also access the original GitHub repository from the paper here.
- Installation and Setup
I always recommend using a virtual environment manager like conda to set up all the necessary libraries for your project. This example uses Python 3.10.12.
pip install torch
pip install transformers
pip install matplotlib seaborn
- Import all necessary libraries
import torch
import seaborn as sns
import matplotlib.pyplot as plt
from huggingface_hub import hf_hub_download
from transformers import AutoformerForPrediction
sns.set_style('darkgrid')
- Download the dataset, load the model and perform an inference (prediction)
# Download and load the dataset
file = hf_hub_download(
    repo_id="hf-internal-testing/tourism-monthly-batch",
    filename="train-batch.pt",
    repo_type="dataset",
)
batch = torch.load(file)

# Load the pre-trained Autoformer model
model = AutoformerForPrediction.from_pretrained("huggingface/autoformer-tourism-monthly")

# During training, one provides both past and future values
# as well as possible additional features
outputs = model(
    past_values=batch["past_values"],
    past_time_features=batch["past_time_features"],
    past_observed_mask=batch["past_observed_mask"],
    static_categorical_features=batch["static_categorical_features"],
    future_values=batch["future_values"],
    future_time_features=batch["future_time_features"],
)
loss = outputs.loss
loss.backward()

# During inference, one only provides past values
# as well as possible additional features;
# the model autoregressively generates future values
outputs = model.generate(
    past_values=batch["past_values"],
    past_time_features=batch["past_time_features"],
    past_observed_mask=batch["past_observed_mask"],
    static_categorical_features=batch["static_categorical_features"],
    future_time_features=batch["future_time_features"],
)
# Average over the sampled trajectories to get a point forecast
mean_prediction = outputs.sequences.mean(dim=1)
- Visualize your results!
The most exciting part is checking how well the model actually works, and we do that by plotting the ground truth (the real future values) against the values predicted by the model using only the past data.
# Plotting
past_values = batch["past_values"].squeeze().numpy()
future_values = batch["future_values"].squeeze().numpy()
predicted_values = mean_prediction.squeeze().detach().numpy()
# Single sample for simplicity
past_values = past_values[0]
future_values = future_values[0]
predicted_values = predicted_values[0]
plt.figure(figsize=(12, 6))
plt.plot(range(len(past_values)), past_values, label='Past Values')
plt.plot(range(len(past_values), len(past_values) + len(future_values)), future_values, label='True Future Values')
plt.plot(range(len(past_values), len(past_values) + len(predicted_values)), predicted_values, label='Predicted Future Values', linestyle='dashed')
plt.legend()
plt.xlabel('Time')
plt.ylabel('Values')
plt.title('Original Time Series and Predictions')
plt.show()
And here is the resulting plot. As you can see, the model did a pretty good job of forecasting the expected tourism volume. Of course, minor discrepancies are expected, and usually the more acute or sudden the fluctuations, the harder it is for the model to generalize well and predict with perfect accuracy. Nevertheless, we should be really proud of our model!
Final Words
Transformers have truly revolutionized the field of machine learning, and their impact on time series forecasting is no exception. By introducing innovations like the Autoformer model, we can now handle long-term dependencies and complex patterns in data with unprecedented efficiency and accuracy. Autoformer, with its unique and innovative mechanisms, provides a significant leap forward from traditional transformers. Its ability to manage the inherent periodicity of time series data, along with the reduced computational complexity, makes it an invaluable tool for various applications—from predicting stock prices to forecasting weather patterns and beyond.
The practical example using Hugging Face demonstrated how easily we can implement and visualize the power of Autoformer in real-world scenarios. Through the application of these innovative models, we can make more informed decisions and strategic plans, by going above and beyond what's currently possible in data analysis.
In conclusion, my dear reader, the Autoformer stands as a testament to the ongoing innovation in the field, showing us that with the right tools, we can predict the future more accurately than ever before. So, as you dive into your next time series project, remember the power of these models. Keep experimenting, stay curious, and embrace the advancements that are reshaping the landscape of data forecasting. The future never looked brighter. Happy coding!