TimeCAP: Learning to Contextualize, Augment, and Predict Time Series Events with Large Language Model Agents
TL;DR Highlight
Splitting the work between two GPT-4 agents, one that first contextualizes time series data as text and one that then predicts events from that text, raises F1 score by an average of 28.75%.
Who Should Read
ML engineers designing time series event classification pipelines in weather, finance, or healthcare. Developers wanting to use LLMs as data preprocessing agents rather than just predictors.
Core Mechanics
- Uses LLM as a 'contextualizer' before prediction — GPT-4 agent #1 converts time series to text summaries, agent #2 predicts events from those summaries (2-agent structure)
- Multi-Modal Encoder (BERT + Patch Transformer) learns from raw time series + text summaries together, retrieves k=5 similar examples as in-context examples for GPT-4 prediction
- Linear combination of Encoder predictions and LLM predictions with lambda weight (Fused Prediction) achieves higher performance than either alone
- Even in zero-shot scenarios (0% training data), outperforms existing LLM-based methods (PromptCast, LLMTime) on F1
- Provides interpretable reasoning via Implicit (LLM generates rationale directly) and Explicit (points to most similar in-context example) approaches
- Seven real-world datasets (3 weather, 2 finance, 2 healthcare) and the GPT-4-generated text summaries are publicly available on GitHub
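The fused prediction above is a lambda-weighted blend of the two models' class probabilities. A minimal numpy sketch (the function name, variable names, and lambda value here are illustrative, not taken from the paper):

```python
import numpy as np

def fuse_predictions(p_encoder: np.ndarray, p_llm: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """Linearly combine the Multi-Modal Encoder's and the LLM's class probabilities."""
    return lam * p_encoder + (1.0 - lam) * p_llm

# Toy two-class example (Rain / Not Rain)
p_encoder = np.array([0.7, 0.3])   # encoder leans "Rain"
p_llm = np.array([0.4, 0.6])       # LLM leans "Not Rain"
p_fused = fuse_predictions(p_encoder, p_llm, lam=0.6)
print(p_fused)  # → [0.58 0.42]
print(["Rain", "Not Rain"][int(np.argmax(p_fused))])  # → Rain
```

The weight lam would be tuned on validation data; lam=1 recovers the encoder alone, lam=0 the LLM alone.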
Evidence
- Average F1 score improvement of 28.75% vs existing SOTA, up to 157% improvement on some datasets
- TimeCP alone (contextualization only) outperforms zero-shot LLM baseline (PromptCast) across all datasets — NY weather F1: 0.499 to 0.625
- Multi-Modal Encoder in-context sampling outperforms PatchTST-based KNN — healthcare domain 0.657 to 0.736
- With only 10% of the training data, TimeCAP shows a smaller performance drop than PatchTST and GPT4TS
How to Apply
- Add GPT-4 calls as preprocessing to existing time series classification pipelines: pass raw time series numbers in a prompt asking 'summarize the domain context as text', then use that summary as a feature.
- If training data is sufficient, fine-tune a multimodal encoder (BERT + Patch Transformer), then retrieve 5 relevant examples via embedding similarity at inference and inject into GPT-4 prompts.
- In cold-start situations with minimal data, using just the TimeCP structure (2 agents, no training) can deliver better results than existing zero-shot approaches.
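One way to picture the multimodal encoder is as two modality embeddings concatenated before a classification head. A minimal numpy sketch, with random vectors standing in for the real BERT (text) and Patch Transformer (series) encoders; all dimensions and weights here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two modality encoders: in TimeCAP these would be
# a BERT embedding of the text summary and a Patch Transformer
# embedding of the raw time series.
text_emb = rng.standard_normal(768)     # e.g. BERT [CLS] embedding of the summary
series_emb = rng.standard_normal(128)   # e.g. patch-transformer embedding of the series

# Concatenate modalities, then apply a linear head over two classes.
joint = np.concatenate([text_emb, series_emb])   # shape (896,)
W = rng.standard_normal((2, joint.size)) * 0.01
logits = W @ joint
probs = np.exp(logits) / np.exp(logits).sum()    # softmax over {Rain, Not Rain}
print(probs.shape)  # → (2,)
```

The same joint embedding doubles as the similarity key for retrieving the k=5 in-context examples at inference time.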
Code Example
# TimeCAP core flow pseudocode
import openai
from sentence_transformers import util

# Step 1: Agent AC: time series → text summary
def contextualize(time_series_str: str) -> str:
    prompt = f"""The following is weather time-series data from the past 24 hours.
Temperature, humidity, pressure, wind speed, and wind direction values are listed in chronological order.
Summarize the weather patterns and domain context of this data from an expert's perspective.
Data: {time_series_str}"""
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Step 2: Embed with the Multi-Modal Encoder and retrieve similar examples
def retrieve_in_context_examples(query_embedding, train_embeddings,
                                 train_summaries, train_labels, k=5):
    scores = util.cos_sim(query_embedding, train_embeddings)[0]
    top_k_idx = scores.topk(k).indices
    return [(train_summaries[i], train_labels[i]) for i in top_k_idx]

# Step 3: Agent AP: context summary + in-context examples → event prediction
def predict_with_context(summary: str, in_context_examples: list) -> str:
    examples_str = "\n".join(
        f"Example {i + 1}: {s}\nOutcome: {l}"
        for i, (s, l) in enumerate(in_context_examples)
    )
    prompt = f"""Refer to the past cases below to predict the event for the current situation.
[Past Cases]
{examples_str}
[Current Situation Summary]
{summary}
Predict whether it will rain tomorrow and explain your reasoning. Answer: Rain or Not Rain"""
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Execution example (query_emb, train_embs, train_summaries, and train_labels
# are produced by the trained Multi-Modal Encoder over the training set)
ts_data = "temp: [15.2, 14.8, 14.1, ...], humidity: [72, 75, 80, ...]"
summary = contextualize(ts_data)
examples = retrieve_in_context_examples(query_emb, train_embs, train_summaries, train_labels)
result = predict_with_context(summary, examples)
print(result)
Original Abstract
Time series data is essential in various applications, including climate modeling, healthcare monitoring, and financial analytics. Understanding the contextual information associated with real-world time series data is often essential for accurate and reliable event predictions. In this paper, we introduce TimeCAP, a time-series processing framework that creatively employs Large Language Models (LLMs) as contextualizers of time series data, extending their typical usage as predictors. TimeCAP incorporates two independent LLM agents: one generates a textual summary capturing the context of the time series, while the other uses this enriched summary to make more informed predictions. In addition, TimeCAP employs a multi-modal encoder that synergizes with the LLM agents, enhancing predictive performance through mutual augmentation of inputs with in-context examples. Experimental results on real-world datasets demonstrate that TimeCAP outperforms state-of-the-art methods for time series event prediction, including those utilizing LLMs as predictors, achieving an average improvement of 28.75% in F1 score.