TimeCAP: Learning to Contextualize, Augment, and Predict Time Series Events with Large Language Model Agents
TL;DR Highlight
Splitting the work between two GPT-4 agents, one that first contextualizes time series data as text and one that then predicts events from that text, raises F1 score by an average of 28.75%.
Who Should Read
ML engineers designing time series event classification pipelines in weather, finance, or healthcare. Developers wanting to use LLMs as data preprocessing agents rather than just predictors.
Core Mechanics
- Uses LLM as a 'contextualizer' before prediction — GPT-4 agent #1 converts time series to text summaries, agent #2 predicts events from those summaries (2-agent structure)
- Multi-Modal Encoder (BERT + Patch Transformer) learns from raw time series + text summaries together, retrieves k=5 similar examples as in-context examples for GPT-4 prediction
- Linear combination of Encoder predictions and LLM predictions with lambda weight (Fused Prediction) achieves higher performance than either alone
- Even in zero-shot scenarios (0% training data), outperforms existing LLM-based methods (PromptCast, LLMTime) on F1
- Provides interpretable reasoning via Implicit (LLM generates rationale directly) and Explicit (points to most similar in-context example) approaches
- Seven real-world datasets (3 weather, 2 finance, 2 healthcare) and the GPT-4-generated text summaries are publicly available on GitHub
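The fused prediction above is a lambda-weighted blend of the two models' class probabilities. A minimal numpy sketch (the function name, variable names, and lambda value here are illustrative, not taken from the paper):

```python
import numpy as np

def fuse_predictions(p_encoder: np.ndarray, p_llm: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """Linearly combine the Multi-Modal Encoder's and the LLM's class probabilities."""
    return lam * p_encoder + (1.0 - lam) * p_llm

# Toy two-class example (Rain / Not Rain)
p_encoder = np.array([0.7, 0.3])   # encoder leans "Rain"
p_llm = np.array([0.4, 0.6])       # LLM leans "Not Rain"
p_fused = fuse_predictions(p_encoder, p_llm, lam=0.6)
print(p_fused)  # → [0.58 0.42]
print(["Rain", "Not Rain"][int(np.argmax(p_fused))])  # → Rain
```

The weight lam would be tuned on validation data; lam=1 recovers the encoder alone, lam=0 the LLM alone.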
Evidence
- Average F1 score improvement of 28.75% vs existing SOTA, up to 157% improvement on some datasets
- TimeCP alone (contextualization only) outperforms zero-shot LLM baseline (PromptCast) across all datasets — NY weather F1: 0.499 to 0.625
- Multi-Modal Encoder in-context sampling outperforms PatchTST-based KNN — healthcare domain 0.657 to 0.736
- With only 10% of the training data, TimeCAP shows a smaller performance drop than PatchTST and GPT4TS
How to Apply
- Add GPT-4 calls as preprocessing to existing time series classification pipelines: pass raw time series numbers in a prompt asking 'summarize the domain context as text', then use that summary as a feature.
- If training data is sufficient, fine-tune a multimodal encoder (BERT + Patch Transformer), then retrieve 5 relevant examples via embedding similarity at inference and inject into GPT-4 prompts.
- In cold-start situations with minimal data, using just the TimeCP structure (2 agents, no training) can deliver better results than existing zero-shot approaches.
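One way to picture the multimodal encoder is as two modality embeddings concatenated before a classification head. A minimal numpy sketch, with random vectors standing in for the real BERT (text) and Patch Transformer (series) encoders; all dimensions and weights here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two modality encoders: in TimeCAP these would be
# a BERT embedding of the text summary and a Patch Transformer
# embedding of the raw time series.
text_emb = rng.standard_normal(768)     # e.g. BERT [CLS] embedding of the summary
series_emb = rng.standard_normal(128)   # e.g. patch-transformer embedding of the series

# Concatenate modalities, then apply a linear head over two classes.
joint = np.concatenate([text_emb, series_emb])   # shape (896,)
W = rng.standard_normal((2, joint.size)) * 0.01
logits = W @ joint
probs = np.exp(logits) / np.exp(logits).sum()    # softmax over {Rain, Not Rain}
print(probs.shape)  # → (2,)
```

The same joint embedding doubles as the similarity key for retrieving the k=5 in-context examples at inference time.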
Code Example
# TimeCAP core flow pseudocode
import openai
from sentence_transformers import util

# Step 1: Agent AC: time series → text summary
def contextualize(time_series_str: str) -> str:
    prompt = f"""The following is weather time-series data from the past 24 hours.
Temperature, humidity, pressure, wind speed, and wind direction values are listed in chronological order.
Summarize the weather patterns and domain context of this data from an expert's perspective.
Data: {time_series_str}"""
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Step 2: Embed with the Multi-Modal Encoder and retrieve similar examples
def retrieve_in_context_examples(query_embedding, train_embeddings,
                                 train_summaries, train_labels, k=5):
    scores = util.cos_sim(query_embedding, train_embeddings)[0]
    top_k_idx = scores.topk(k).indices
    return [(train_summaries[i], train_labels[i]) for i in top_k_idx]

# Step 3: Agent AP: context summary + in-context examples → event prediction
def predict_with_context(summary: str, in_context_examples: list) -> str:
    examples_str = "\n".join(
        f"Example {i + 1}: {s}\nOutcome: {l}"
        for i, (s, l) in enumerate(in_context_examples)
    )
    prompt = f"""Refer to the past cases below to predict the event for the current situation.
[Past Cases]
{examples_str}
[Current Situation Summary]
{summary}
Predict whether it will rain tomorrow and explain your reasoning. Answer: Rain or Not Rain"""
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Execution example (query_emb, train_embs, train_summaries, and train_labels
# are produced by the trained Multi-Modal Encoder over the training set)
ts_data = "temp: [15.2, 14.8, 14.1, ...], humidity: [72, 75, 80, ...]"
summary = contextualize(ts_data)
examples = retrieve_in_context_examples(query_emb, train_embs, train_summaries, train_labels)
result = predict_with_context(summary, examples)
print(result)
Original Abstract
Time series data is essential in various applications, including climate modeling, healthcare monitoring, and financial analytics. Understanding the contextual information associated with real-world time series data is often essential for accurate and reliable event predictions. In this paper, we introduce TimeCAP, a time-series processing framework that creatively employs Large Language Models (LLMs) as contextualizers of time series data, extending their typical usage as predictors. TimeCAP incorporates two independent LLM agents: one generates a textual summary capturing the context of the time series, while the other uses this enriched summary to make more informed predictions. In addition, TimeCAP employs a multi-modal encoder that synergizes with the LLM agents, enhancing predictive performance through mutual augmentation of inputs with in-context examples. Experimental results on real-world datasets demonstrate that TimeCAP outperforms state-of-the-art methods for time series event prediction, including those utilizing LLMs as predictors, achieving an average improvement of 28.75% in F1 score.