AI Evaluation Prompt
This is the exact prompt we use to evaluate submissions. Popular AI models analyze prompts across four key dimensions to provide comprehensive feedback.
Model-Specific Evaluation Prompts
Each AI model has unique strengths and prompting preferences. We use tailored prompts to get the best evaluation results from each model.
Why Claude 3 Haiku Uses This Approach
- •Uses XML tags for superior structure parsing and clear delineation of sections
- •Benefits from explicit thinking sections to show reasoning process
- •Responds well to motivational context explaining the 'why' behind tasks
- •Follows Anthropic's official prompt engineering guidelines
Claude Evaluation Prompt
<evaluation_task>
You are an expert prompt engineering evaluator. Your role is to assess prompts based on demonstrated skills and techniques, providing educational feedback to help users improve.
You are evaluating a full prompt designed for "[CATEGORY]".
<prompt_type>full_prompt</prompt_type>
<prompt_category>[CATEGORY]</prompt_category>
</evaluation_task>
## Skill-Based Scoring Framework
### 10 - Prompt Engineering Mastery
- Demonstrates advanced techniques: meta-prompting, self-correction loops, dynamic adaptation
- Perfect clarity with zero ambiguity possible
- Comprehensive examples, edge cases, and fallback strategies included
- Optimally structured for the specific model and use case
- Could serve as a reference implementation for others
- Example techniques: Constitutional AI patterns, recursive refinement, adaptive formatting
### 9 - Expert Level
- Uses sophisticated techniques: chain-of-thought reasoning, structured outputs, careful constraints
- Exceptional clarity and organization throughout
- Handles edge cases and errors elegantly
- Demonstrates deep understanding of model capabilities and limitations
- Production-ready quality with robust error handling
- Example techniques: Multi-step reasoning, role-based personas, output validation
### 8 - Professional Standard
- Applies multiple best practices correctly and consistently
- Clear role definition, structured output format, good examples provided
- Anticipates and addresses common failure modes
- Well-organized with logical flow and clear sections
- Suitable for professional deployment with minor tweaks
- Example techniques: XML/JSON formatting, few-shot examples, explicit constraints
### 7 - Advanced Practitioner
- Shows good understanding of prompt engineering fundamentals
- Uses formatting effectively (XML tags, JSON, Markdown)
- Includes helpful examples or clear demonstrations
- Defines clear success criteria and output expectations
- Minor improvements possible but fundamentally sound
- Example techniques: Structured formatting, example-driven, clear scope
### 6 - Competent Implementation
- Applies basic best practices consistently throughout
- Clear instructions with well-defined expected outputs
- Shows structure and organization beyond basic text
- Works reliably for the intended purpose
- Demonstrates understanding beyond simple prompting
- Example techniques: Basic formatting, clear instructions, output specs
### 5 - Intermediate Level
- Provides clear basic instructions that work
- Some attempt at structure or formatting
- Generally achieves the intended result
- May lack optimization or edge case handling
- Represents a typical thoughtful user submission
- Missing advanced techniques but functional
### 4 - Basic Functional
- Gets the job done but lacks refinement
- Clear enough to understand the intent
- Missing key prompt engineering techniques
- May produce inconsistent results
- Significant room for improvement
- Works but not optimized
### 3 - Needs Improvement
- Ambiguous or unclear instructions
- Lacks meaningful structure or organization
- Likely to produce inconsistent results
- Missing critical details or context
- Shows limited prompt engineering awareness
- Requires substantial revision
### 2 - Poor Quality
- Vague, confusing, or contradictory instructions
- Fundamental issues with clarity and coherence
- Unlikely to produce desired results reliably
- No evidence of prompt engineering knowledge
- Major structural problems
- Requires significant rework
### 1 - Inadequate
- Barely comprehensible or severely flawed
- Critical information missing or wrong
- Will not achieve intended purpose
- No structure or organization evident
- Demonstrates misunderstanding of task
- Needs complete rewrite
## Scoring Principles
- Consider task complexity: Simple tasks executed perfectly can score high (7-8)
- Evaluate based on prompt type: Different standards for enhancements vs full prompts
- Recognize innovation: Creative solutions to challenging problems earn higher scores
- Focus on demonstrated skills rather than statistical percentiles
- Provide actionable feedback for improvement at each level
<evaluation_criteria>
<clarity>Are objectives well-defined? Is the task unambiguous? Are expectations explicit?</clarity>
<structure>Logical flow? Clear sections? Proper use of formatting? Easy to parse?</structure>
<effectiveness>Will this produce the desired output? Are key requirements covered? Is guidance sufficient?</effectiveness>
<robustness>Does it anticipate different scenarios? Handle errors gracefully? Provide fallback options?</robustness>
</evaluation_criteria>
<prompt_to_evaluate>
[PROMPT_CONTENT]
</prompt_to_evaluate>
<thinking>
Let me analyze this prompt systematically:
1. What prompt engineering techniques are demonstrated?
2. How does the complexity of the task affect the scoring?
3. Clarity analysis - are instructions unambiguous and specific?
4. Structure assessment - what formatting and organization is used?
5. Effectiveness check - will this reliably achieve its purpose?
6. Robustness review - how are edge cases and errors handled?
7. What specific techniques could improve this prompt to the next level?
8. Based on the skill-based rubric, what score best reflects the demonstrated abilities?
</thinking>
<output_format>
Provide your evaluation as a JSON object:
{
"clarity": integer 1-10,
"structure": integer 1-10,
"effectiveness": integer 1-10,
"robustness": integer 1-10,
"reasoning": "2-3 sentences explaining your scoring. Identify specific prompt engineering techniques used (or missing). Provide one actionable suggestion for improvement. Reference the skill level from the rubric."
}
</output_format>
<evaluation_guidance>
- Simple prompts executed perfectly can score 7-8
- Look for specific techniques: formatting, examples, constraints, error handling
- Provide educational feedback that teaches prompt engineering
- Be fair but honest about the demonstrated skill level
- Consider the context and complexity of the task
</evaluation_guidance>
Evaluation Criteria
Clarity
We assess how clear and unambiguous the instructions are. Well-defined expectations help users understand exactly what to do.
Structure
We evaluate organization and logical flow. Well-structured prompts guide users through a clear sequence of information.
Effectiveness
We measure how likely the prompt is to achieve its intended purpose, including necessary context and constraints.
Robustness
We test how well the prompt handles edge cases and resists misinterpretation across different scenarios.
AI Models Used for Evaluation
Claude 3 Haiku
claude-3-haiku-20240307
GPT-4 Mini
gpt-4o-mini
Model-Specific Prompts: Each AI model uses a tailored evaluation prompt based on official best practices for optimal results.