Adding a New Language (Data-Driven)¶
This project is designed so new programming languages can be added mainly by updating data files, not parser/codegen logic.
Language onboarding follows a controlled-language policy: add deterministic, testable surface forms only. See cnl_scope.md.
Goal¶
Enable a new language code (for example xx) across:
- lexing and parsing (keyword recognition)
- semantic analysis error reporting
- runtime builtins and execution
- REPL command/help localization
1. Add Keyword Mappings¶
File: multilingualprogramming/resources/usm/keywords.json
- Add the new code to `languages`.
- For every concept in every category, add a translation key for the new language.
Important:
- All concepts must have a translation to keep validation complete.
- Prefer unique tokens per language to avoid ambiguity.
- Keep tokens identifier-safe (letters/underscores, no spaces).
Why this is enough:
- `KeywordRegistry` loads this file dynamically.
- `Lexer` recognizes keywords through `KeywordRegistry`.
- `Parser` consumes concept tokens, so syntax support follows automatically.
- `RuntimeBuiltins` maps builtins from concept IDs, so execution picks up new language keywords automatically.
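As a rough sketch, a new language entry could look like the fragment below. The category name and the `xx` tokens are invented for illustration, and the actual schema of keywords.json may differ; only the concept IDs (`LOOP_FOR`, `IN`) come from this guide.

```json
{
  "languages": ["en", "xx"],
  "control_flow": {
    "LOOP_FOR": { "en": "for", "xx": "xxfor" },
    "IN": { "en": "in", "xx": "xxin" }
  }
}
```

Note that each concept carries one identifier-safe token per language, which is what keeps validation complete and tokens unambiguous.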
2. Add Parser/Semantic Error Messages¶
File: multilingualprogramming/resources/parser/error_messages.json
For each message key under `messages`, add the new language translation (same placeholders, e.g. `{token}`, `{line}`).
Why:
`ErrorMessageRegistry.format()` reads this file dynamically, and the parser and semantic analyzer use it for diagnostics.
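The placeholder-preserving behavior can be illustrated with a minimal sketch. The catalog shape, the `es` example entry, and the English-fallback rule below are assumptions for illustration, not the registry's actual implementation:

```python
# Minimal sketch: language-keyed message templates that share the same
# placeholders, with a fallback to English for unknown languages.
# The catalog layout here is an assumption, not the real file's schema.
MESSAGES = {
    "unexpected_token": {
        "en": "Unexpected token '{token}' at line {line}",
        "es": "Token inesperado '{token}' en la linea {line}",
    }
}

def format_message(key: str, language: str, **values) -> str:
    entry = MESSAGES[key]
    template = entry.get(language, entry["en"])  # fall back to English
    return template.format(**values)

print(format_message("unexpected_token", "es", token="si", line=3))
```

Because every translation of a key uses the same `{token}`/`{line}` placeholders, the caller can format diagnostics identically regardless of language.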
3. Add REPL Localization¶
File: multilingualprogramming/resources/repl/commands.json
Update:
- `help_titles` for the language.
- `messages` keys (`keywords_title`, `symbols_title`, `unsupported_language`).
- `commands.<name>.aliases` for command words.
- `commands.<name>.descriptions` for help text.
Why:
- REPL command parsing/help is fully catalog-driven from this JSON.
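A hypothetical fragment is shown below; the key layout is inferred from the field names listed above, and the `help` command words and `xx` strings are invented placeholders, not the file's actual contents:

```json
{
  "help_titles": { "en": "Available commands", "xx": "xx-help-title" },
  "messages": {
    "keywords_title": { "en": "Keywords", "xx": "xx-keywords" },
    "symbols_title": { "en": "Symbols", "xx": "xx-symbols" },
    "unsupported_language": { "en": "Unsupported language", "xx": "xx-unsupported" }
  },
  "commands": {
    "help": {
      "aliases": { "en": ["help"], "xx": ["xxhelp"] },
      "descriptions": { "en": "Show this help text", "xx": "xx-help-description" }
    }
  }
}
```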
4. (Optional) Add Operator Description Localization¶
File: multilingualprogramming/resources/usm/operators.json
Add the new language under `description` where available.
Why:
- The REPL `:symbols` command uses these descriptions when present; otherwise it falls back to English.
5. Add Built-in Aliases (Optional)¶
File: multilingualprogramming/resources/usm/builtins_aliases.json
Add localized aliases for selected universal builtins (for example `range`, `len`, `sum`).
The universal English built-in name remains available; aliases are additive.
Why:
- `RuntimeBuiltins` loads this file dynamically.
- Users can write either universal names or localized aliases in programs and the REPL.
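The additive-alias idea can be sketched as follows. The Spanish alias words and the table shape are invented for illustration; the point is that aliases resolve to the same callables while universal names keep working:

```python
# Sketch of additive alias resolution: universal built-in names are
# always available, and per-language aliases map onto the same Python
# callables. Alias words and dict layout are illustrative assumptions.
UNIVERSAL_BUILTINS = {"range": range, "len": len, "sum": sum}

ALIASES = {
    "es": {"rango": "range", "longitud": "len", "suma": "sum"},
}

def resolve_builtin(name, language):
    if name in UNIVERSAL_BUILTINS:  # universal name always wins
        return UNIVERSAL_BUILTINS[name]
    canonical = ALIASES.get(language, {}).get(name)
    return UNIVERSAL_BUILTINS.get(canonical)

print(resolve_builtin("longitud", "es")([1, 2, 3]))  # localized alias for len
```

Because resolution falls through to the universal table first, adding aliases can never shadow or remove an existing built-in name.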
6. Add Surface Syntax Patterns (Optional)¶
File: multilingualprogramming/resources/usm/surface_patterns.json
Use this file when keyword translation alone is not enough for natural phrasing. Rules are declarative and normalize alternate surface token order into canonical concept order before parser grammar runs.
Validation is enforced at load time by `validate_surface_patterns_config` (`multilingualprogramming/parser/surface_normalizer.py`), including:
- rule/template schema shape
- language support checks
- exactly one of `normalize_to`/`normalize_template`
- slot-reference consistency between `pattern` captures and the output rewrite
Typical use:
- iterable-first `for` headers
- language-specific particles around loop/condition clauses
- alternate phrase forms that still map to one core AST
Keep rules narrow and test-backed. Prefer additive normalization over parser forks.
How This Connects To The Pipeline¶
- `Lexer` tokenizes source and resolves known keywords to concepts.
- `SurfaceNormalizer` matches token-level surface rules.
- Matched rules rewrite tokens into canonical concept order.
- `Parser` consumes the rewritten tokens with the existing grammar.
Important: surface patterns do not replace lexing. They operate on lexer output.
Rule Model¶
surface_patterns.json has two top-level sections:
- `templates`: reusable canonical rewrites
- `patterns`: language-scoped matching rules
Each pattern must include:
- `name`
- `language`
- `pattern` (what to match)
- exactly one of:
    - `normalize_template` (reference a template)
    - `normalize_to` (inline rewrite)
Pattern Token Kinds¶
Allowed pattern kinds:
- `expr`: capture an expression span into a slot (for example `iterable`)
- `identifier`: capture one identifier token into a slot (for example `target`)
- `keyword`: require a specific concept token (for example `LOOP_FOR`)
- `delimiter`: require a delimiter token (for example `:`)
- `literal`: require a literal token value (for particles like `内の`, `ضمن`)
Allowed output (`normalize_to`/template) kinds:
- `keyword`: emit a concept keyword token in the target language
- `delimiter`: emit a delimiter token
- `identifier_slot`: emit a captured identifier slot
- `expr_slot`: emit a captured expression slot
Example A: Shared Template (Recommended)¶
Use a template when multiple languages share one canonical rewrite target.
```json
{
  "templates": {
    "for_iterable_first": [
      { "kind": "keyword", "concept": "LOOP_FOR" },
      { "kind": "identifier_slot", "slot": "target" },
      { "kind": "keyword", "concept": "IN" },
      { "kind": "expr_slot", "slot": "iterable" },
      { "kind": "delimiter", "value": ":" }
    ]
  },
  "patterns": [
    {
      "name": "ja_for_iterable_first",
      "language": "ja",
      "normalize_template": "for_iterable_first",
      "pattern": [
        { "kind": "expr", "slot": "iterable" },
        { "kind": "literal", "value": "内の" },
        { "kind": "literal", "value": "各" },
        { "kind": "identifier", "slot": "target" },
        { "kind": "literal", "value": "に対して" },
        { "kind": "delimiter", "value": ":" }
      ]
    }
  ]
}
```
Surface input:
Normalized parse shape:
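The match-and-rewrite step of Example A can be sketched with a toy matcher. This is not the real `SurfaceNormalizer`: the sketch treats an `expr` capture as a single token (the real normalizer captures expression spans) and emits concept IDs where the real output would emit target-language keyword tokens.

```python
# Toy surface normalizer: match a token list against a pattern,
# capture slots, and emit the canonical rewrite. Simplifying
# assumptions: expr captures exactly one token; output keywords are
# emitted as concept IDs rather than localized keyword tokens.
def match_and_rewrite(tokens, pattern, output):
    slots, i = {}, 0
    for step in pattern:
        kind = step["kind"]
        if kind in ("expr", "identifier"):
            if i >= len(tokens):
                return None
            slots[step["slot"]] = tokens[i]  # capture into named slot
            i += 1
        else:  # keyword / literal / delimiter must match exactly
            expected = step.get("concept", step.get("value"))
            if i >= len(tokens) or tokens[i] != expected:
                return None
            i += 1
    if i != len(tokens):
        return None  # trailing unmatched tokens: no rewrite
    result = []
    for step in output:
        kind = step["kind"]
        if kind in ("identifier_slot", "expr_slot"):
            result.append(slots[step["slot"]])  # replay captured slot
        elif kind == "keyword":
            result.append(step["concept"])
        else:
            result.append(step["value"])
    return result

# The ja pattern and canonical template from Example A.
JA_PATTERN = [
    {"kind": "expr", "slot": "iterable"},
    {"kind": "literal", "value": "内の"},
    {"kind": "literal", "value": "各"},
    {"kind": "identifier", "slot": "target"},
    {"kind": "literal", "value": "に対して"},
    {"kind": "delimiter", "value": ":"},
]
CANONICAL_TEMPLATE = [
    {"kind": "keyword", "concept": "LOOP_FOR"},
    {"kind": "identifier_slot", "slot": "target"},
    {"kind": "keyword", "concept": "IN"},
    {"kind": "expr_slot", "slot": "iterable"},
    {"kind": "delimiter", "value": ":"},
]

print(match_and_rewrite(["items", "内の", "各", "x", "に対して", ":"],
                        JA_PATTERN, CANONICAL_TEMPLATE))
```

The key property this illustrates: slots captured in the iterable-first surface order are replayed in canonical `LOOP_FOR target IN iterable :` order, so the parser grammar never sees the language-specific word order.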
Example B: Inline Rewrite (normalize_to)¶
Use inline output for a one-off rule that is not worth templating.
```json
{
  "name": "xx_for_custom",
  "language": "xx",
  "pattern": [
    { "kind": "expr", "slot": "iterable" },
    { "kind": "literal", "value": "particleA" },
    { "kind": "identifier", "slot": "target" },
    { "kind": "literal", "value": "particleB" },
    { "kind": "delimiter", "value": ":" }
  ],
  "normalize_to": [
    { "kind": "keyword", "concept": "LOOP_FOR" },
    { "kind": "identifier_slot", "slot": "target" },
    { "kind": "keyword", "concept": "IN" },
    { "kind": "expr_slot", "slot": "iterable" },
    { "kind": "delimiter", "value": ":" }
  ]
}
```
Authoring Workflow For Complex Grammar¶
- Write 2-3 real source examples from native speakers.
- Tokenize with lexer tests to confirm surface particles are tokenized as expected.
- Add the narrowest possible `pattern` that matches those forms.
- Rewrite to one canonical concept order via template or inline output.
- Add parser + executor tests before adding more variants.
- Repeat with additional rules rather than broad/fragile mega-rules.
Common Mistakes¶
- Capturing a slot in the output that was never captured in `pattern`.
- Defining both `normalize_to` and `normalize_template` in one rule.
- Using an unsupported language code in `language`.
- Overly broad `expr` patterns that unintentionally match unrelated lines.
- Trying to encode full natural-language grammar in one rule.
Debugging Tips¶
If a surface form does not parse:
- Confirm lexer tokenization first (`tests/lexer_test.py` patterns are good references).
- Add a parser unit test for just the failing statement.
- Check that slot names are consistent (`target` vs `iterator`, etc.).
- Confirm the template name exists and is spelled exactly.
- Ensure the final normalized sequence is compatible with existing parser grammar.
7. Add Tests¶
Minimum recommended tests:
- `tests/keyword_registry_test.py`
    - language appears in `get_supported_languages()`
    - concept lookups for representative keywords
- `tests/executor_test.py`
    - one end-to-end program using new language keywords (`ProgramExecutor`)
- `tests/error_messages_test.py`
    - new language included in "all messages have all languages" coverage
- `tests/runtime_builtins_test.py`
    - localized aliases map to the expected Python built-ins
- `tests/surface_normalizer_test.py` (when adding surface rules)
    - config stays schema-valid
    - invalid rule shapes fail with `ValueError`
- `tests/parser_test.py` + `tests/executor_test.py` (when adding surface rules)
    - parser accepts the new surface form
    - end-to-end execution still works
This validates lexer -> parser -> semantic -> codegen/runtime in one path.
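The "all messages have all languages" coverage test can be sketched as below; the catalog shape is an assumption (a dict of message keys to per-language translations), not the project's actual test code:

```python
# Sketch of a translation-completeness check: report every
# (message_key, language) pair that lacks a translation. The catalog
# layout is an illustrative assumption.
def missing_translations(catalog, languages):
    """Return (message_key, language) pairs with no translation."""
    gaps = []
    for key, translations in catalog.items():
        for lang in languages:
            if lang not in translations:
                gaps.append((key, lang))
    return gaps
```

A test can then assert `missing_translations(catalog, supported_languages) == []`, which fails with a readable list of gaps the moment a new language code is added without full coverage.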
8. Update Documentation¶
At minimum:
- `README.md` supported languages list
- `docs/reference.md` supported languages list
- link this onboarding guide where relevant
Validation Commands¶
For focused checks while iterating:
```
python -m pytest -q tests/keyword_registry_test.py tests/error_messages_test.py tests/executor_test.py tests/repl_test.py
```
Surface-pattern focused checks:
Language-pack smoke checks:
Starter Checklist Template¶
Use this template when opening a PR for a new language pack:
`docs/templates/language_pack_checklist.md`