NaniDAO/evals

A Python library for generating completions and evaluating NaniDAO models using different datasets, prompts, and configuration files.

Quick Start

  1. Set up environment variables in .env:
NANI_API_KEY=your_nani_api_key
NANI_BASE_URL=https://nani.ooo/api/chat
GEMINI_API_KEY=your_gemini_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
OPENAI_API_KEY=your_openai_api_key

If these are not set, you can specify individual API keys and base URLs via CLI arguments instead.

  2. Install dependencies:
# Using uv (recommended)
git clone https://github.com/NaniDAO/evals.git
cd evals
uv pip install -e .

# Or using pip
git clone https://github.com/NaniDAO/evals.git
cd evals
pip install -e .
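
After installation, the nanidao-evals entry point should be on your PATH. Assuming a standard argparse/click-style CLI, you can verify the install and list all options with:

nanidao-evals --help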

data/info

This directory contains previous completions and evaluations run on different LLMs using datasets from nanidao_evals/data/datasets. Inspect the individual metadata files for details. It is not part of the nanidao-evals package.

Basic Usage

Simplest Commands

Generate completions using default settings:

# Using environment variables from .env
nanidao-evals

# Or passing credentials via CLI
nanidao-evals \
  --providers nani \
  --provider-urls "nani:https://nani.ooo/api/chat" \
  --provider-api-keys "nani:your-api-key"

Evaluate existing completions:

# Using environment variables
nanidao-evals --evaluation-judge nani --evaluate-file out/completions.json

# Or passing credentials via CLI
nanidao-evals \
  --evaluation-judge nani \
  --provider-urls "nani:https://nani.ooo/api/chat" \
  --provider-api-keys "nani:your-api-key" \
  --evaluate-file out/completions.json

Exploring Datasets

# List available behaviors (default dataset: JBB)
nanidao-evals --list-behaviors

# List categories in a specific dataset
nanidao-evals --completions-dataset NANI --list-categories

# Show prompts that match specific criteria
nanidao-evals --show-prompts --dataset-category Hardware --completions-dataset NANI

Generating Completions

# Generate completions from specific dataset
nanidao-evals --completions-dataset NANI

# Generate with multiple configurations
nanidao-evals \
  --providers nani \
  --config-file configs/multi_temp.json \
  --completions-dataset NANI

Example configs/multi_temp.json:

[
    {
      "temperature": 0.7,
      "max_tokens": 1000,
      "top_p": 1.0
    },
    {
      "temperature": 0.9,
      "max_tokens": 1500,
      "top_p": 0.9
    }
]

Evaluating Completions

# Generate and evaluate completions using Gemini
nanidao-evals --evaluation-judge gemini

# Evaluate existing completions file
nanidao-evals --evaluation-judge anthropic --evaluate-file out/completions.json

Datasets for generating completions are found in data/datasets/jailbreaks_datasets.json.

Prompts for generating evaluations are found in data/prompts/eval_prompts.json.
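
If you want to inspect these files before running a job, a minimal sketch using only the standard library is shown below. The exact JSON schema is not documented here, so the snippet only summarizes the top-level structure:

import json

# Peek at the bundled jailbreak dataset; the schema is not documented
# here, so we only report the top-level shape.
with open("data/datasets/jailbreaks_datasets.json") as f:
    datasets = json.load(f)

if isinstance(datasets, dict):
    print("keys:", list(datasets.keys()))
elif isinstance(datasets, list):
    print("entries:", len(datasets))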

Advanced Usage Examples

1. Command Line Interface

Generate completions using specific provider and model:

nanidao-evals \
  --providers nani \
  --provider-urls "nani:https://nani.ooo/api/chat" \
  --provider-models "nani:deepseek-r1-qwen-2.5-32B-ablated" \
  --provider-api-keys "nani:your-api-key" \
  --completions-dataset NANI

2. Python API

from nanidao_evals.generators.completions import CompletionGenerator

# Configure providers
provider_configs = {
    "nani": {
        "base_url": "https://nani.ooo/api/chat",
        "api_key": "your-api-key",
        "model": "deepseek-r1-qwen-2.5-32B-ablated"
    }
}

# Create generator
generator = CompletionGenerator(
    providers=["nani"],
    provider_configs=provider_configs
)

# Generate completions
results = generator.generate_completions(
    dataset_path="data/datasets/nani_dataset.json",
    categories=["Hardware"],
    behaviors=["Engineering"]
)
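
The returned results can then be written to disk for later evaluation. A minimal sketch, assuming results is JSON-serializable (its exact structure depends on the generator):

import json

# Serialize the generator output as-is; adjust if results contains
# non-JSON-serializable objects.
with open("out/completions.json", "w") as f:
    json.dump(results, f, indent=2)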

3. Direct API Usage

from apis.analyzer import create_handler

# Initialize handler
handler = create_handler(
    provider="nani",
    model="deepseek-r1-qwen-2.5-32B-ablated",
    base_url="https://nani.ooo/api/chat",
    api_key="your-api-key"
)

# Generate response
response = handler.generate_response("Your prompt here")

Filtered Generation Examples

1. Filtered Completion Generation

Generate completions for specific categories/behaviors:

nanidao-evals \
  --completions-dataset NANI \
  --dataset-category Hardware \
  --dataset-behavior Engineering

2. Custom Evaluation Settings

Evaluate with specific model and prompt:

nanidao-evals \
  --evaluation-judge anthropic \
  --provider-models "anthropic:claude-3-5-sonnet-20241022" \
  --evaluation-prompt eval0_system_prompt

3. Combined Generation and Evaluation

Generate and evaluate in one run with filters:

nanidao-evals \
  --completions-dataset NANI \
  --evaluation-judge gemini \
  --dataset-category Hardware \
  --dataset-source Original

CLI Arguments Reference

Provider Configuration

  • --providers: List of providers to use (e.g., nani, gemini, anthropic)
  • --provider-urls: Base URLs for providers (format: provider:url)
  • --provider-models: Model names for providers (format: provider:model)
  • --provider-api-keys: API keys for providers (format: provider:key)

Dataset Exploration

  • --list-behaviors: Show available behaviors
  • --list-categories: Show available categories
  • --list-sources: Show available sources
  • --show-prompts: Display prompts matching filters
  • --completions-dataset: Select dataset (default: JBB)

Generation & Filtering

  • --dataset-category: Filter by categories
  • --dataset-behavior: Filter by behaviors
  • --dataset-source: Filter by sources
  • --output-dir: Output directory (default: out)
  • --config-file: Custom model configuration file (supports multiple configs)

Evaluation

  • --evaluation-judge: Judge provider (gemini/anthropic/openai/nani)
  • --evaluation-prompt: Evaluation prompt (default: eval0_system_prompt)
  • --evaluate-file: Existing completions file to evaluate
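
Several of these flags can be combined in a single run; all values below are illustrative:

nanidao-evals \
  --providers nani \
  --completions-dataset NANI \
  --dataset-category Hardware \
  --evaluation-judge gemini \
  --output-dir results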

Default Models

Provider      Model
gemini        gemini-2.0-flash-exp
anthropic     claude-3-5-sonnet-20241022
openai        gpt-4o-mini-2024-07-18
nani          NaniDAO/deepseek-r1-qwen-2.5-32B-ablated
huggingface   tgi

Project Structure

/apis           - LLM provider implementations
/data
  /configs     - Model configurations
  /datasets    - Input datasets
  /prompts     - Evaluation prompts
/generators    - Core generation/evaluation logic

Output Format

Results are saved with timestamps:

out/YYYYMMDD_HHMMSS_completions.json  # For completions
out/YYYYMMDD_HHMMSS_eval_provider.json  # For evaluations
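
To evaluate a specific run, pass the timestamped completions file back in (the timestamp shown is only an example):

nanidao-evals --evaluation-judge anthropic --evaluate-file out/20250214_093000_completions.json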

Previous evaluation results are available in data/info/old_evals/.

Provider-Specific Configuration

Credentials

All providers can be configured either through environment variables in .env or via CLI arguments.

HuggingFace

nanidao-evals \
  --providers huggingface \
  --provider-urls "huggingface:https://your-endpoint" \
  --provider-models "huggingface:your-model" \
  --provider-api-keys "huggingface:your-key"

Nani

nanidao-evals \
  --providers nani \
  --provider-urls "nani:https://nani.ooo/api/chat" \
  --provider-models "nani:NaniDAO/deepseek-r1-qwen-2.5-32B-ablated"

Multiple Providers

nanidao-evals \
  --providers nani huggingface \
  --provider-urls "nani:https://nani.ooo/api/chat" "huggingface:https://your-endpoint" \
  --provider-models "nani:model1" "huggingface:model2"
