Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems | AI Paper Digest

TL;DR Highlight

LLM이 규칙을 잘 지키고 있는지 감시하려면 LLM에게 맡기지 말고 LTL(시간 논리 공식) 기반 모니터를 쓰세요.

Who Should Read

LLM 기반 에이전트나 챗봇을 프로덕션에 배포하면서 규정 준수, 안전 제약, 행동 모니터링이 필요한 백엔드/AI 엔지니어. 특히 금융, 의료, 물류 등 규제 산업에서 LLM 제품을 개발하는 팀.

Core Mechanics

LLM-as-a-Judge(LLM이 다른 LLM을 심판하는 방식)는 시간적으로 확장된 행동 패턴(예: '결제 후에만 배송') 감지에 취약함. 이벤트 간 거리가 멀어지거나 제약 조건 수가 늘면 정확도가 랜덤 수준으로 떨어짐.
TRAC(Temporal Rule Assessment and Compliance) 알고리즘은 LTL(Linear Temporal Logic, 시간 흐름에 따른 논리 규칙 언어) 기반으로 LLM의 입출력을 실시간 모니터링하며, 위반 발생 시 어떤 규칙이 언제 깨졌는지 설명하는 witness(증거 추적)도 제공.
핵심 아이디어는 역할 분리: LLM은 '이벤트 라벨링'(지금 무슨 일이 일어났나 판단)만 하고, 시간 패턴 추론은 LTL progression(공식 기반 점진적 평가)이 담당.
Qwen-7B 같은 소형 모델을 라벨러로 쓴 TRAC이 GPT-4.1, Gemini-2.5-Pro 같은 프론티어 LLM을 단독 심판으로 쓰는 것보다 F1 점수가 높음. 즉 비용을 줄이면서 정확도를 높일 수 있음.
TRACP+I는 예측 모니터링(위반이 일어나기 전에 예측)과 개입(intervention)을 추가한 확장 버전. 위반이 예상되면 ①프롬프트 수정, ②여러 출력 중 선택(best-of-n), ③더 안전한 모델로 교체 등 3가지 방식으로 자동 개입.
LLM의 시간 추론 한계를 실험으로 정량화: 이벤트 간 거리(gap)가 늘수록, 동시에 모니터링할 제약 수가 늘수록, 스텝당 명제(proposition) 수가 늘수록 정확도가 체계적으로 떨어짐.

Evidence

TRACR + 소형 LLM 라벨러(Qwen-7B)가 IPC-Trucks, TextWorld, ScienceWorld 3개 환경 모두에서 GPT-4.1, Gemini-2.5-Pro를 단독 LLM-as-a-Judge로 쓸 때보다 F1 점수가 일관되게 높음 (Figure 2).
LLM 라벨러 정확도는 환경별로 Qwen-7B 기준 IPC-Trucks 0.77, TextWorld 0.87, ScienceWorld 0.98로 개별 명제 라벨링 자체는 소형 모델도 매우 잘 수행 (Table 1).
TRACP+I 개입 전략(Inject/Resample/Switch) 적용 시 모든 모델-환경 조합에서 위반율(violation rate)이 베이스라인 대비 감소하며, 태스크 성능(cumulative reward)은 유의미한 저하 없이 유지됨 (Figure 4, Figure 7).
시간 간격 실험에서 소형 모델은 gap ~10 스텝부터, 프론티어 모델은 복잡한 공식에서 gap이 커질수록 정확도가 랜덤(~50%) 수준으로 수렴 (Figure 3a, 3d).

How to Apply

LLM 에이전트에 규칙을 심을 때, 규칙을 자연어로만 프롬프트에 넣지 말고 LTL 공식으로 변환한 뒤 TRAC 모니터를 래퍼로 씌워라. 자연어→LTL 변환은 NL2LTL 같은 autoformalization 도구로 반자동화 가능.
비용이 걱정되면 고가 프론티어 모델(GPT-4.1, Gemini-2.5-Pro)을 심판으로 쓰는 대신, 소형 LLM(Qwen-7B 수준)을 라벨러로만 쓰고 시간 패턴 추론은 LTL progression 엔진에 맡겨라. 성능은 올라가고 비용은 낮아짐.
프로덕션 LLM 에이전트에서 위반이 자주 생기는 규칙이 있다면 TRACP+I의 constraint-guided prompting(잔여 LTL 공식을 자연어로 프롬프트에 주입) 전략을 써라. 같은 모델을 재사용하면서 위반율을 낮출 수 있고, 모델 교체가 필요한 경우에는 model substitution 전략으로 더 안전한 모델로 자동 전환.

Code Example

snippet

# TRAC 모니터링 핵심 로직 (Python 의사코드)
# 자연어 규칙: '결제가 확인되기 전에 제품을 배송하지 마라'
# LTL 공식: G(ship -> (payment_confirmed U ship))
# 즉, 배송이 일어나면 항상 그 전에 결제 확인이 되어 있어야 함

from ltl_monitor import LTLMonitor, LLMLabeler

# 1. 규칙을 LTL 공식으로 정의
formula = "G(ship -> (payment_confirmed U ship))"

# 2. LLM을 라벨러로 사용 (소형 모델도 OK)
labeler = LLMLabeler(
    model="qwen-7b",
    propositions=["ship", "payment_confirmed", "invoice_received"]
)

# 3. TRAC 모니터 초기화
monitor = LTLMonitor(formula=formula, labeler=labeler)

# 4. 에이전트 실행 중 매 스텝 모니터링
for step in agent_execution_loop():
    input_text = step.user_input
    output_text = agent.generate(input_text)
    
    # 라벨러가 현재 스텝에서 참인 명제 추출
    labels = labeler.extract(input_text, output_text)
    
    # LTL progression으로 공식 업데이트 및 판정
    verdict, witness = monitor.step(labels)
    
    if verdict == "VIOLATED":
        print(f"규칙 위반 감지! 증거: {witness}")
        # 개입 로직 실행 (TRACP+I)
        output_text = intervene(output_text, monitor.residual_formula)
    
    yield output_text

# 오프라인 감사 (로그 데이터에 적용)
audit_results = monitor.audit_log(historical_logs)
for violation in audit_results.violations:
    print(f"스텝 {violation.step}: {violation.rule} 위반 - {violation.witness}")

Terminology

LTLLinear Temporal Logic. 시간 흐름에 따른 규칙을 수학적으로 표현하는 언어. '항상 A이면 나중에 반드시 B가 와야 한다'는 식의 규칙을 기호로 표현할 수 있음. 철도 신호 시스템이나 항공 소프트웨어 검증에 오래 쓰인 기술.

LTL ProgressionLTL 공식을 한 스텝씩 소비하며 업데이트하는 기법. '지금까지 본 것을 고려할 때 앞으로 무엇이 더 필요한가?'를 공식 형태로 유지함. 자동화를 재컴파일하지 않고도 규칙을 동적으로 추가/제거 가능.

LLM-as-a-JudgeLLM이 다른 LLM(또는 자기 자신)의 출력을 평가하는 방식. 사람 평가자 대신 GPT-4 같은 모델에게 '이 답변이 규칙을 잘 따랐나?'를 물어보는 것. 비용 효율적이지만 시간 패턴 추론에 약점이 있음.

Formal Methods소프트웨어/시스템의 동작을 수학적으로 명세하고 검증하는 기법들의 총칭. 항공, 원자력 등 안전 필수 시스템에서 수십 년간 쓰여온 검증 방법론.

RLHFReinforcement Learning from Human Feedback. 사람의 선호도 피드백을 보상 신호로 활용해 LLM을 미세 조정하는 학습법. ChatGPT 등에서 사용.

Autoformalization자연어로 쓰인 규칙을 자동으로 형식 언어(LTL 등)로 변환하는 기술. 'LLM에게 규칙 번역을 시키는 것'이라고 볼 수 있음.

Labeling Function에이전트의 입출력 텍스트를 보고 '지금 결제가 완료됐나?', '배송이 시작됐나?' 같은 명제의 참/거짓을 판단하는 함수. TRAC에서는 이 역할을 LLM이 담당함.

Black-box Model내부 구조(가중치, 코드)에 접근하지 못하고 입력과 출력만 볼 수 있는 AI 모델. API로만 사용하는 GPT-4, Claude 등이 해당됨.

Related Papers

Related Resources

Original Abstract (Expand)

We examine one particular dimension of AI governance: how to monitor and audit AI-enabled products and services throughout the AI development lifecycle, from pre-deployment testing to post-deployment auditing. Combining principles from formal methods with SoTA machine learning, we propose techniques that enable AI-enabled product and service developers, as well as third party AI developers and evaluators, to perform offline auditing and online (runtime) monitoring of product-specific (temporally extended) behavioral constraints such as safety constraints, norms, rules and regulations with respect to black-box advanced AI systems, notably LLMs. We further provide practical techniques for predictive monitoring, such as sampling-based methods, and we introduce intervening monitors that act at runtime to preempt and potentially mitigate predicted violations. Experimental results show that by exploiting the formal syntax and semantics of Linear Temporal Logic (LTL), our proposed auditing and monitoring techniques are superior to LLM baseline methods in detecting violations of temporally extended behavioral constraints; with our approach, even small-model labelers match or exceed frontier LLM judges. Our predictive and intervening monitors significantly reduce the violation rates of LLM-based agents while largely preserving task performance. We further show through controlled experiments that LLMs' temporal reasoning shows a pronounced degradation in accuracy with increasing event distance, number of constraints, and number of propositions.