# Evaluation Plan
This document describes how to evaluate equivalence claims about the multilingual frontends.
## Core Questions
- Do distinct surface forms map to equivalent core structure?
- Do equivalent core structures execute identically?
- Do normalization rules reduce naturalness gaps without destabilizing parsing?
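The first question can be made concrete with Python's own `ast` module standing in for the project's core IR (an assumption for illustration; the real frontend has its own parser and printer): two surface forms that differ only in incidental details should produce identical canonical dumps.

```python
import ast

def core_structure(source: str) -> str:
    # Stand-in for the core IR: ast.dump gives a canonical, comparable
    # string for the parse tree, discarding whitespace and comments.
    return ast.dump(ast.parse(source))

# Two surface forms: spacing and a trailing comment differ,
# but the core structure is the same.
canonical = "x = 1 + 2  # annotated form"
variant = "x=1+2"
assert core_structure(canonical) == core_structure(variant)
```

The same comparison generalizes to paired programs in different languages once both are normalized into the shared IR.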
## Test Matrix
- Language pairs: at least English/French + one typologically distinct language.
- Surface variants: canonical form + normalized alternate form.
- Feature slices: variable binding, conditionals, loops, functions, calls.
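A minimal sketch of enumerating this matrix (the pair, variant, and slice names here are hypothetical placeholders, not the project's actual identifiers):

```python
import itertools

# Hypothetical matrix dimensions drawn from the bullets above.
LANGUAGE_PAIRS = [("english", "french"), ("english", "japanese")]
SURFACE_VARIANTS = ["canonical", "normalized"]
FEATURE_SLICES = ["binding", "conditional", "loop", "function", "call"]

def build_matrix():
    # Every combination becomes one test case.
    return list(itertools.product(LANGUAGE_PAIRS, SURFACE_VARIANTS, FEATURE_SLICES))

cases = build_matrix()
assert len(cases) == 2 * 2 * 5  # 20 cases for this sketch
```

In practice this maps naturally onto `pytest.mark.parametrize`, so each cell of the matrix reports as its own test.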
## Required Checks
- Parser equivalence: compare `ASTPrinter` output for paired programs.
- Runtime equivalence: compare final program output and success status.
- Regression checks: run existing parser/executor/surface tests.
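The distinction between the first two checks can be sketched with plain Python as a stand-in (the real checks go through the project's parser and `ASTPrinter`; `parse_print` and `run` below are hypothetical helpers): two programs may differ at the parser level yet still be runtime-equivalent.

```python
import ast
import contextlib
import io

def parse_print(source: str) -> str:
    # Stand-in for ASTPrinter output: a canonical dump of the parse tree.
    return ast.dump(ast.parse(source))

def run(source: str) -> tuple[bool, str]:
    # Runtime equivalence: capture stdout and record success status.
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(compile(source, "<case>", "exec"), {})
        return True, buf.getvalue()
    except Exception:
        return False, buf.getvalue()

a = "print(sum([1, 2, 3]))"
b = "print(sum((1, 2, 3)))"
# Different parse trees (list literal vs tuple literal)...
assert parse_print(a) != parse_print(b)
# ...but identical output and success status at runtime.
assert run(a) == run(b)
```

Parser equivalence is the stronger property; runtime equivalence is the fallback check when two frontends intentionally produce different core structures.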
## Metrics (Lightweight)
- Number of equivalent frontend pairs covered by tests.
- Number of deterministic normalization rules per language.
- Parse ambiguity failures caught by tests/validators.
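The second metric can be tallied from a rule registry; the registry shape and rule names below are purely illustrative assumptions, not the project's actual data:

```python
# Hypothetical registry: language -> deterministic normalization rules.
NORMALIZATION_RULES = {
    "english": ["lowercase_keywords", "strip_articles"],
    "french": ["lowercase_keywords", "strip_articles", "merge_contractions"],
}

def rules_per_language(registry: dict[str, list[str]]) -> dict[str, int]:
    # One number per language, suitable for tracking over time.
    return {lang: len(rules) for lang, rules in registry.items()}

assert rules_per_language(NORMALIZATION_RULES) == {"english": 2, "french": 3}
```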
## Current Baseline
Initial equivalence tests are in:

- `tests/frontend_equivalence_test.py`
- `tests/core_ir_test.py`
Recent additions include:
- Japanese/Arabic/Spanish/Portuguese iterable-first loop surface variants
- `try/except/else` frontend/runtime equivalence coverage