Introduction

Weave supports two types of evaluation:
  • Offline evaluation: Runs pre-production. Use it to test against a consistent dataset before deployment. See Evaluations.
  • Online evaluation: Runs in production. Use it to score live inputs and outputs as they flow through your app.
This guide covers online evaluation, which includes:
  • Monitors: Passively score production traffic to surface trends and issues. No code changes required.
  • Guardrails: Actively intervene when scores exceed thresholds (for example, block toxic content). Requires code changes.
Monitors and guardrails compare as follows:
  • Purpose: Monitors passively observe for analysis; guardrails actively intervene to prevent issues.
  • Code changes: Monitors require none; guardrails require code changes.
  • Timing: Monitors run asynchronously in the background; guardrails run in real time, before output reaches users.
  • Sampling: Monitor sampling is configurable (for example, 10% of calls); guardrails usually check every request.
  • Control flow: Monitors have no impact on the application; guardrails can block or modify outputs.
Start with monitors: most users should set up UI-based monitors first, since guardrails require code changes and are only needed when you must take action based on scores.

Set up a monitor

This feature is only available in Multi-Tenant (MT) SaaS deployments.
A monitor is a background process that:
  • Watches one or more ops decorated with @weave.op
  • Scores a sample of calls using an LLM-as-a-judge
  • Runs automatically without any code changes to your app
Monitors are ideal for:
  • Evaluating and tracking production behavior
  • Catching regressions or drift
  • Collecting real-world performance data over time

Create a monitor

  1. From the left menu, select the Monitors tab.
  2. Click New Monitor.
  3. Configure the monitor:
    • Name: Must start with a letter or number. Can contain letters, numbers, hyphens, and underscores.
    • Description (optional): Explain what the monitor does.
    • Active monitor toggle: Turn the monitor on or off.
    • Calls to monitor:
      • Operations: Choose one or more @weave.ops to monitor.
        You must log at least one trace for an op before it appears in the list.
      • Filter (optional): Narrow down which calls are eligible (for example, by max_tokens or top_p).
      • Sampling rate: The percentage of calls to score (0% to 100%).
        A lower sampling rate reduces cost, because each sampled call incurs a judge-model scoring call.
    • LLM-as-a-judge configuration:
      • Scorer name: Must start with a letter or number. Can contain letters, numbers, hyphens, and underscores.
      • Judge model: Select the model that scores your ops.
      • Configuration name: A name for this model configuration.
      • System prompt: Instructions for the judge model.
      • Response format: The expected output format (for example, json_object or text).
      • Scoring prompt: The prompt used to score your ops. You can reference variables from your function. See Prompt variables.
  4. Click Create Monitor.
Weave automatically begins monitoring and scoring calls that match your criteria. View monitor details in the Monitors tab.

Example: Create a truthfulness monitor

This example creates a monitor that evaluates the truthfulness of generated statements.

Step 1: Create the op to monitor

Define a function that generates statements. Some are truthful, others are not:
import weave
import random
import openai

weave.init("my-team/my-weave-project")

client = openai.OpenAI()

@weave.op()
def generate_statement(ground_truth: str) -> str:
    if random.random() < 0.5:
        response = client.chat.completions.create(
            model="gpt-4.1",
            messages=[
                {
                    "role": "user",
                    "content": f"Generate a statement that is incorrect based on this fact: {ground_truth}"
                }
            ]
        )
        return response.choices[0].message.content
    else:
        return ground_truth

Step 2: Log a trace

Run the function at least once so it appears in the monitor UI:
generate_statement("The Earth revolves around the Sun.")

Step 3: Create the monitor in the UI

  1. Navigate to Monitors and click New Monitor.
  2. Configure:
    • Name: truthfulness-monitor
    • Description: Evaluates the truthfulness of generated statements.
    • Active monitor: Toggle on.
    • Operations: Select generate_statement.
    • Sampling rate: Set to 100% to score every call.
    • Scorer name: truthfulness-scorer
    • Judge model: o3-mini-2025-01-31
    • System prompt: You are an impartial AI judge. Your task is to evaluate the truthfulness of statements.
    • Response format: json_object
    • Scoring prompt:
      Evaluate whether the output statement is accurate based on the input statement.
      
      This is the input statement: {ground_truth}
      
      This is the output statement: {output}
      
      The response should be a JSON object with the following fields:
      - is_true: a boolean stating whether the output statement is true or false based on the input statement.
      - reasoning: your reasoning as to why the statement is true or false.
      
  3. Click Create Monitor.

Step 4: Generate statements and view results

Generate statements for the monitor to evaluate:
generate_statement("The Earth revolves around the Sun.")
generate_statement("Water freezes at 0 degrees Celsius.")
generate_statement("The Great Wall of China was built over several centuries.")
In the Weave UI, navigate to the Traces tab and select any LLMAsAJudgeScorer.score trace to see the results.

Prompt variables

In scoring prompts, you can reference variables from your op. These values are automatically extracted when the scorer runs. For a function like:
@weave.op
def my_function(foo: str, bar: str) -> str:
    return f"{foo} and {bar}"
The following variables are available:
  • {foo}: The value of the input argument foo
  • {bar}: The value of the input argument bar
  • {inputs}: A JSON dictionary of all input arguments
  • {output}: The result returned by your op
Example scoring prompt:
Input foo: {foo}
Input bar: {bar}
Output: {output}
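For illustration, if my_function is called as my_function(foo="apples", bar="oranges"), the prompt the judge receives would read roughly as follows (the exact substitution is handled by Weave when the scorer runs):
Input foo: apples
Input bar: oranges
Output: apples and oranges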

Set up guardrails

Guardrails actively intervene in your app’s behavior based on scores. Unlike monitors, guardrails require code changes because they need to affect your application’s control flow.

When to use guardrails

Use guardrails when you need to:
  • Block responses that exceed a toxicity threshold
  • Modify outputs before they reach users
  • Enforce content policies in real-time

AWS Bedrock Guardrails

The BedrockGuardrailScorer uses AWS Bedrock’s guardrail feature to detect and filter content based on configured policies.
Prerequisites:
  • An AWS account with Bedrock access
  • A configured guardrail in the AWS Bedrock console
  • The boto3 Python package
You don’t need to create your own Bedrock client. Weave creates it for you. To specify a region, pass the bedrock_runtime_kwargs parameter to the scorer.
For details on creating a guardrail in AWS, see the Bedrock guardrails notebook.
import weave
from weave.scorers.bedrock_guardrails import BedrockGuardrailScorer

weave.init("my_app")

guardrail_scorer = BedrockGuardrailScorer(
    guardrail_id="your-guardrail-id",
    guardrail_version="DRAFT",
    source="INPUT",
    bedrock_runtime_kwargs={"region_name": "us-east-1"}
)

@weave.op
def generate_text(prompt: str) -> str:
    # Your text generation logic here
    return "Generated text..."

async def generate_safe_text(prompt: str) -> str:
    result, call = generate_text.call(prompt)

    score = await call.apply_scorer(guardrail_scorer)

    if not score.result.passed:
        if score.result.metadata.get("modified_output"):
            return score.result.metadata["modified_output"]
        return "I cannot generate that content due to content policy restrictions."

    return result
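Because generate_safe_text is async, run it from an event loop. A minimal usage sketch (the prompt string is just an example):
import asyncio

# Generate text once, falling back to the guardrail's modified output
# or a refusal message when the scorer does not pass.
safe_output = asyncio.run(generate_safe_text("Tell me about your product"))
print(safe_output)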

Custom guardrails

For custom guardrail logic, you can create your own scorer and apply it programmatically. See Advanced: Code-based scoring for details.
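As a minimal sketch, the control flow mirrors the Bedrock example above: apply your scorer to the call, then branch on its result. This assumes a generate_text op like the one above and a custom scorer such as the ToxicityScorer defined in the Advanced section, whose score method returns a dict with a flagged field (the function name generate_guarded_text is illustrative):
async def generate_guarded_text(prompt: str) -> str:
    result, call = generate_text.call(prompt)

    # score.result is the dict returned by the scorer's score method
    score = await call.apply_scorer(ToxicityScorer())

    if score.result["flagged"]:
        return "I cannot generate that content due to content policy restrictions."

    return result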

Built-in scorers

Weave includes predefined scorers that you can use with monitors or guardrails. See Built-in scorers for the complete list.
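For example, a built-in scorer such as ValidJSONScorer can be applied to a call in the same way as a custom scorer (see Applying scorers below). This sketch reuses the generate_text op from the guardrail example above; the exact fields returned depend on the scorer, so check the Built-in scorers reference:
from weave.scorers import ValidJSONScorer

async def check_json_output(prompt: str):
    result, call = generate_text.call(prompt)
    # Score whether the op's output parses as valid JSON
    score = await call.apply_scorer(ValidJSONScorer())
    return score.result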

Advanced: Code-based scoring

Most users should use UI-based monitors instead. Use code-based scoring only when you need custom logic that isn’t available through the monitor UI.
For advanced use cases, you can apply scorers programmatically in your code.

Using the .call() method

To apply scorers, you need access to both the operation’s result and its Call object. The .call() method provides both:
# Standard call (no access to the Call object):
result = generate_text(prompt)

# Using .call() to get both the result and the Call object:
result, call = generate_text.call(prompt)
The Call object is required to associate scores with calls in the database, making them searchable and exportable.

Creating a custom scorer

A scorer is a class that inherits from Scorer and implements a score method:
import weave
from weave import Scorer

class ToxicityScorer(Scorer):
    @weave.op
    def score(self, output: str) -> dict:
        # Your toxicity detection logic here
        return {
            "flagged": False,
            "reason": None
        }

Applying scorers

Apply a scorer to a call using apply_scorer():
@weave.op
def generate_text(prompt: str) -> str:
    return "Generated response..."

async def evaluate_response(prompt: str):
    result, call = generate_text.call(prompt)
    score = await call.apply_scorer(ToxicityScorer())
    return score.result

Parameter matching

The score method receives:
  • output: The result from your function (always provided)
  • Any input parameters that match your function’s parameter names
@weave.op
def generate_styled_text(prompt: str, style: str) -> str:
    return "Generated text..."

class StyleScorer(Scorer):
    @weave.op
    def score(self, output: str, prompt: str, style: str) -> dict:
        # output, prompt, and style are automatically matched
        return {"style_match": 0.9}

Handling parameter name mismatches

Use column_map when your scorer’s parameter names don’t match your function’s:
@weave.op
def generate_text(user_input: str):
    return process(user_input)

class QualityScorer(Scorer):
    @weave.op
    def score(self, output: str, prompt: str):
        return {"quality_score": 0.8}

result, call = generate_text.call(user_input="Say hello")

# Map 'prompt' to 'user_input'
scorer = QualityScorer(column_map={"prompt": "user_input"})
await call.apply_scorer(scorer)

Adding additional parameters

Use additional_scorer_kwargs for parameters that aren’t part of your function:
class ReferenceScorer(Scorer):
    @weave.op
    def score(self, output: str, reference_answer: str):
        similarity = compute_similarity(output, reference_answer)
        return {"matches_reference": similarity > 0.8}

await call.apply_scorer(
    ReferenceScorer(),
    additional_scorer_kwargs={
        "reference_answer": "The Earth orbits around the Sun."
    }
)

Next steps