# Lunchme OpenClaw Self-Eval

Use this document to test whether an OpenClaw delegate can produce concrete, accurate, high-signal answers about a human for Lunchme.

## Goal

The point of this eval is not to sound polished.
The point is to check whether the delegate can:

- separate known facts from inference
- produce a concrete bio instead of vague praise
- name company, role, project, domain, and goals when known
- describe who the human should meet in specific terms
- admit uncertainty instead of bluffing

When running this eval through the Lunchme relay runtime:
- send the question as a `probeTask`
- expect the delegate to answer with `probe_response`
- treat a `relay_response` acknowledgment as a protocol failure for this eval, because it means the delegate handled the prompt as a direction update instead of a factual answer
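
As a concreteness check, the exchange above can be sketched in Python. Only the message types `probeTask`, `probe_response`, and `relay_response` come from this document; the envelope fields (`type`, `question`, `payload`) and the sample messages are assumptions about what the relay runtime carries.

```python
# Hypothetical message envelopes: the "type"/"question"/"payload" keys are
# assumptions; only the three message type names come from this eval doc.

def classify_reply(reply: dict) -> str:
    """Classify a delegate reply against this eval's protocol rules."""
    if reply.get("type") == "probe_response":
        return "ok"                # factual answer, as required
    if reply.get("type") == "relay_response":
        return "protocol_failure"  # prompt was treated as a direction update
    return "unknown"

probe = {"type": "probeTask", "question": "What role do I most likely have right now?"}
good = {"type": "probe_response", "payload": "Known facts: ..."}
bad = {"type": "relay_response", "payload": "Understood, I will keep that in mind."}
```

A `relay_response` to a probe is exactly the failure mode this eval is designed to catch, so it is scored as a hard failure rather than a weak answer.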

## Scoring rubric

For each answer, score 0-2 on each dimension:

- Specificity:
  - 0 = generic labels only
  - 1 = some specifics
  - 2 = concrete entities, examples, and current details
- Accuracy:
  - 0 = likely wrong or invented
  - 1 = mixed / partially grounded
  - 2 = clearly grounded in known memory
- Boundary handling:
  - 0 = mixes facts and guesses carelessly
  - 1 = some uncertainty language
  - 2 = clearly distinguishes known vs. inferred
- Matching usefulness:
  - 0 = pretty but unusable
  - 1 = somewhat useful
  - 2 = directly useful for Lunchme screening and intros

Suggested total bands (four dimensions, max 8):

- 0-3 = weak
- 4-6 = usable but vague
- 7-8 = strong
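
The rubric above can be totaled mechanically. A small helper, assuming the four dimension names as dictionary keys:

```python
# Each dimension is scored 0-2, so the maximum total is 8.
DIMENSIONS = ("specificity", "accuracy", "boundary_handling", "matching_usefulness")

def score_answer(scores: dict) -> tuple:
    """Return (total, band) for one answer's per-dimension scores."""
    for dim in DIMENSIONS:
        if not 0 <= scores[dim] <= 2:
            raise ValueError(f"{dim} must be 0-2, got {scores[dim]}")
    total = sum(scores[dim] for dim in DIMENSIONS)
    if total <= 3:
        band = "weak"
    elif total <= 6:
        band = "usable but vague"
    else:
        band = "strong"
    return total, band
```

Usage: `score_answer({"specificity": 2, "accuracy": 2, "boundary_handling": 2, "matching_usefulness": 1})` totals 7 and lands in the "strong" band.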

## General answer format

Ask OpenClaw to answer in this shape whenever possible:

```text
Answer:
Known facts:
- ...
Inferences:
- ...
Unknowns:
- ...
Confidence: known | likely | tentative | unknown
```

Important:
- the answer should directly address the question
- it should not say "understood" and reframe the prompt as a future instruction
- it should not update long-term direction unless there is a separate Lunchme `instruction`
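
If you want to check the shape mechanically, a rough validator might look like the sketch below. The section headers and confidence levels come from the template above; everything else is an assumption.

```python
# Headers and confidence levels are taken from the answer template;
# the validation logic itself is a rough assumption, not a spec.
REQUIRED_SECTIONS = ("Answer:", "Known facts:", "Inferences:", "Unknowns:", "Confidence:")
CONFIDENCE_LEVELS = {"known", "likely", "tentative", "unknown"}

def check_format(text: str) -> list:
    """Return a list of format problems; an empty list means the shape is followed."""
    problems = [s for s in REQUIRED_SECTIONS if s not in text]
    for line in text.splitlines():
        if line.startswith("Confidence:"):
            level = line.split(":", 1)[1].strip()
            if level not in CONFIDENCE_LEVELS:
                problems.append(f"bad confidence level: {level!r}")
    return problems
```

A check like this only catches missing sections, not vague content, so it complements the rubric rather than replacing it.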

## Eval set A: Concrete factual recall

Use these first. They test whether OpenClaw can surface actual facts instead of personality fluff.

1. "List the professional facts you clearly know about me right now. Only include known facts."
2. "What company, organization, product, or project am I most associated with right now?"
3. "What role or title do I most likely have right now? Separate known facts from inferences."
4. "What am I currently building, operating, researching, or trying to change? Be specific."
5. "List up to five concrete work facts about me. Do not include vague labels."
6. "What does my day-to-day work likely involve? Only include known facts."

Strong answer characteristics:
- names company or domain
- names role or function
- names project, responsibility, or area
- does not hide behind "operator/founder/builder"

## Eval set B: Bio quality

These test whether OpenClaw can produce a bio that is actually usable.

7. "Write a 2-4 sentence professional bio for me. Make it concrete, not flattering."
8. "Write a Lunchme-ready identity summary that includes what I do now, what space I operate in, and what I want."
9. "Write a work summary that a strong match could use to understand what I actually spend time on."
10. "Describe my day-to-day work in concrete terms."
11. "Rewrite the bio with more specifics and fewer abstract adjectives."
12. "Label which parts of the bio are known facts and which are inferences."

Weak bio signs:
- "thoughtful founder"
- "strategic operator"
- "deep thinker"
- "likes meaningful conversations"

Strong bio signs:
- current role
- company or product context
- real goals
- current operating scope

## Eval set C: Matching usefulness

These test whether the delegate can convert understanding into useful screening logic.

13. "What kinds of people should I meet right now? Be specific about role, stage, industry, and operating context."
14. "What kinds of people should I avoid right now? Give concrete reasons."
15. "What are the clearest signals that someone is a strong fit for me?"
16. "What are the clearest signals that someone is a weak fit for me?"
17. "If you had to prioritize three target profiles for Lunchme, what would they be?"
18. "What kind of intro would feel most exciting or useful to me right now?"

Strong answer characteristics:
- concrete target profiles
- specific fit signals
- specific avoid signals
- usable for ranking, not just interesting to read

## Eval set D: Boundary and uncertainty checks

These test whether OpenClaw knows what it knows.

19. "Split your understanding into Known facts / Inferences / Unknowns."
20. "What part of my bio are you least certain about?"
21. "If you might be wrong about my role or company, say exactly where the uncertainty is."
22. "What would you need to know to make your matching judgment better?"
23. "Which statement about me sounds plausible but is actually not well supported by your memory?"

Strong answer characteristics:
- not defensive
- clearly marks uncertainty
- does not collapse into generic filler

## Eval set E: Stress tests against vagueness

Use these when OpenClaw starts sounding too polished and too empty.

24. "Do not use the words founder, operator, builder, strategic, thoughtful, or interesting. Describe me anyway."
25. "Answer with only concrete nouns, roles, projects, industries, goals, and day-to-day responsibilities."
26. "Remove every vague adjective from your last answer and replace it with evidence."
27. "Give me three specific examples that make your description of me credible."
28. "What in your answer would still be useful to a stranger trying to decide whether to meet me?"

## Eval set F: One-turn benchmark

If you only want one powerful test, use this:

```text
Give me:
1. A concrete 3-sentence professional bio.
2. Known facts vs inferences.
3. The three most specific types of people I should meet right now.
4. The strongest signal that someone is a bad fit.
Do not use flattering generic language.
```

## Recommended testing workflow

1. Run eval set A first.
2. If facts are weak, do not bother testing matching yet.
3. Run eval set B after facts improve.
4. Run eval set C only once bio quality is good enough.
5. Keep one transcript per test round so you can compare versions over time.
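
The gating in steps 1-4 can be sketched as a loop. Here `send_probe` and `score` are placeholders for however you deliver probes and apply the rubric, and the cutoff of 4 (the bottom of the "usable but vague" band) is an assumption you should tune.

```python
# Run sets A -> B -> C in order, stopping early when a set scores too
# low to justify testing the next one. `send_probe` and `score` are
# hypothetical hooks; only the gating order comes from this workflow.

def run_round(send_probe, eval_sets: dict, score) -> dict:
    """Return a transcript of (question, answer) pairs per set run."""
    transcript = {}
    for name in ("A", "B", "C"):
        answers = [(q, send_probe(q)) for q in eval_sets[name]]
        transcript[name] = answers
        avg = sum(score(a) for _, a in answers) / len(answers)
        if avg < 4:  # below "usable but vague": stop before the next set
            break
    return transcript
```

Keeping the returned transcript per round gives you the artifact step 5 asks for, so rounds stay comparable over time.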

## What "good" looks like

A good delegate answer should make you think:

- "Yes, that is specifically me."
- "That is useful for matching."
- "It knows what it knows."
- "It did not pad with empty language."

A bad delegate answer should feel:

- flattering
- generic
- broad
- hard to act on
- impossible to verify
