# multilingual Design Overview
This document explains how multilingual works at a design level.
It is intended for contributors, language-onboarding authors, and curious users.
## Layered Model
The implementation is structured as four explicit layers:
- Concrete surface syntax (`CS_lang`): language-specific source text.
- Shared Core AST: language-agnostic parser output (`ast_nodes.py`).
- Typed Core IR container: `CoreIRProgram` (`multilingualprogramming/core/ir.py`).
- Python-target lowering/codegen: executable Python generation/runtime.
This makes boundary questions explicit: parsing maps `CS_lang` to the Core AST,
and code generation consumes a typed core object rather than raw frontend text.
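The typed boundary between parsing and codegen can be pictured as a small wrapper. Only the names `CoreIRProgram` and `lower_to_core_ir` appear in this document; the field names and shapes below are illustrative, not the real implementation.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class CoreIRProgram:
    body: list[Any]        # language-agnostic Core AST nodes (shape assumed)
    source_language: str   # frontend that produced this program (assumed field)

def lower_to_core_ir(ast_body: list[Any], lang: str) -> CoreIRProgram:
    """Wrap raw parser output in the typed core container."""
    return CoreIRProgram(body=ast_body, source_language=lang)

# Codegen then receives a typed object, never raw frontend text:
program = lower_to_core_ir([("assign", "x", 1)], "fr")
print(program.source_language)  # -> fr
```

The point of the container is that everything downstream can rely on one typed shape regardless of which surface language produced it.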
## Core Concepts

### Values and literals
The language supports:
- numerals across scripts (plus hex/octal/binary/scientific notation)
- strings (including f-strings and triple-quoted strings)
- booleans and none-like literals
- collections: list, dict, set, tuple
- date literals via dedicated delimiters
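As an aside on cross-script numerals: Python itself already parses Unicode decimal digits, which is one plausible way a runtime can normalize numerals across scripts. This is an illustration of the underlying Unicode behavior, not a description of multilingual's lexer.

```python
# Python's int() accepts any Unicode decimal digits (category Nd),
# so numerals written in different scripts normalize to the same value.
print(int("١٢٣"))   # Arabic-Indic digits  -> 123
print(int("१२३"))   # Devanagari digits    -> 123
print(int("123"))   # ASCII digits         -> 123
```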
### Types
Runtime behavior is Pythonic and dynamically typed. Optional type annotations are supported in:
- variable annotations (`x: int`)
- parameter annotations (`f(x: int)`)
- function return annotations (`-> str`)
Annotations are preserved through parsing/codegen and emitted to Python output.
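Annotation pass-through can be demonstrated with Python's own `ast` module: annotations survive a parse/unparse round trip, which mirrors the way multilingual carries them from the frontend into emitted Python. This is an analogy, not the project's actual codegen path.

```python
import ast

# Annotations on variables, parameters, and return types are part of the
# AST, so they survive a parse -> unparse round trip unchanged.
src = "def f(x: int) -> str:\n    y: float = 2.0\n    return str(x)"
out = ast.unparse(ast.parse(src))
print(out)
```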
### Control flow and structure
Current constructs include:
- `if`/`elif`/`else`
- `while`/`for`
- `match`/`case`
- `try`/`except`/`else`/`finally`
- `with` (including multiple context managers)
- functions, classes, decorators
- `async`/`await`, `async for`, `async with`
## Keyword Localization Model
Localization is concept-driven, not grammar-driven.
- Universal semantic concepts (e.g., `COND_IF`, `FUNC_DEF`) are stored in `multilingualprogramming/resources/usm/keywords.json`.
- Each concept maps to language-specific surface keywords (`if`, `si`, etc.).
- The lexer resolves concrete keywords to concepts.
- The parser operates on concepts, so grammar logic is shared across languages.
This keeps parser/codegen stable while allowing language growth mostly through data files.
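A miniature of the concept-driven lookup might look like the following. The real table lives in `multilingualprogramming/resources/usm/keywords.json`; the inline JSON, its exact shape, and the `resolve()` helper here are hypothetical.

```python
import json

# Hypothetical excerpt in the spirit of keywords.json: concept -> per-language
# surface keywords. The real file's schema may differ.
KEYWORDS_JSON = """
{
  "COND_IF":  {"en": "if",  "fr": "si"},
  "FUNC_DEF": {"en": "def", "fr": "fonction"}
}
"""

concepts = json.loads(KEYWORDS_JSON)

# Invert to the lookup a lexer needs: (language, surface word) -> concept.
surface_to_concept = {
    (lang, word): concept
    for concept, by_lang in concepts.items()
    for lang, word in by_lang.items()
}

def resolve(lang, word):
    """Map a concrete keyword to its universal concept, or None."""
    return surface_to_concept.get((lang, word))

print(resolve("fr", "si"))  # -> COND_IF
print(resolve("en", "if"))  # -> COND_IF
```

Because the parser only ever sees concepts like `COND_IF`, adding a language is mostly a matter of extending the data file, not the grammar.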
## Identifier Interoperability Across Languages
Identifiers are Unicode-aware and are not translated.
- Keywords are localized.
- User-defined names stay as written.
- Mixed scripts are allowed (for example, Latin + Devanagari in one file), though a single style per file is recommended for readability.
Interoperability rule of thumb:
- semantic keywords are normalized to concepts
- identifiers remain exact user symbols
So a French keyword file can still call a function named in English (or another script), as long as names match.
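Since the target is Python, which itself accepts Unicode identifiers, untranslated names interoperate in the emitted code as well. The names below are made up for illustration:

```python
# User-defined names pass through exactly as written, so a Latin-script
# function can be bound to (and called through) a Devanagari name.
def greet(name):
    return "hello " + name

अभिवादन = greet            # Devanagari alias for the same function object
print(अभिवादन("monde"))     # -> hello monde
```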
## Pipeline Summary
The execution pipeline is:
- `Lexer` tokenizes source and resolves keyword concepts.
- `Parser` builds a language-agnostic AST.
- `lower_to_core_ir` wraps parser output into a `CoreIRProgram`.
- `SemanticAnalyzer` checks scope and structural constraints.
- `PythonCodeGenerator` emits executable Python.
- The runtime/executor runs generated code with multilingual built-in aliases.
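A toy end-to-end driver can mirror the stage order above. Every class body here is a stand-in stub; only the stage names (`Lexer`, `Parser`, `lower_to_core_ir`, `SemanticAnalyzer`, `PythonCodeGenerator`) come from this document, and the toy "language" is invented.

```python
class Lexer:
    def __init__(self, source):
        self.tokens = source.split()          # stub: whitespace tokenization

class Parser:
    def __init__(self, tokens):
        self.ast = [("print", tokens[-1])]    # stub: one-node "AST"

def lower_to_core_ir(ast):
    return {"body": ast}                      # stub: typed-container stand-in

class SemanticAnalyzer:
    def check(self, ir):
        assert ir["body"], "empty program"    # stub: one structural check

class PythonCodeGenerator:
    def emit(self, ir):
        return "\n".join(f"print({arg!r})" for _op, arg in ir["body"])

def run(source):
    """Chain the stages in the order the pipeline summary lists them."""
    tokens = Lexer(source).tokens
    ast = Parser(tokens).ast
    ir = lower_to_core_ir(ast)
    SemanticAnalyzer().check(ir)
    return PythonCodeGenerator().emit(ir)

print(run("affiche bonjour"))  # -> print('bonjour')
```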
## Frontend Contract
Each language is treated as a frontend with a translation function:
```
T_lang: CS_lang -> CoreAST
```
The current claim is forward-only: all supported frontends are designed as semantics-preserving embeddings into the shared core. The project does not guarantee lossless round-tripping from the core back to the original surface form.
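The semantics-preserving property can be checked on the core rather than the surface: two frontends with different keywords should embed the same program to the same core form. The two keyword tables and the token-level `t_lang` below are a hypothetical reduction of the real translation function.

```python
# Hypothetical per-language keyword tables; identifiers pass through untouched.
FRONTENDS = {
    "en": {"if": "COND_IF", "return": "FLOW_RETURN"},
    "fr": {"si": "COND_IF", "retourne": "FLOW_RETURN"},
}

def t_lang(lang, tokens):
    """Toy T_lang: map surface keywords to concepts, keep identifiers as-is."""
    table = FRONTENDS[lang]
    return [table.get(tok, tok) for tok in tokens]

core_en = t_lang("en", ["if", "x", "return", "x"])
core_fr = t_lang("fr", ["si", "x", "retourne", "x"])
print(core_en == core_fr)  # -> True: same core, different surfaces
```

Note that the identifier `x` survives both translations unchanged, which is exactly the interoperability rule of thumb from the previous section.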
## Roadmap (Short)
- v0 (today): toy-but-working interpreter/transpiler, multiple languages, core constructs, REPL, tests.
- next:
  - better tooling and diagnostics
  - stronger IDE/editor integration
  - more languages and improved locale quality
  - formalized language spec
  - possible LLM-assisted translation/refactoring workflows
multilingual is intentionally both serious and experimental: stable enough to use, open enough for community feedback.