# Adding Language Support via the Plugin Path This guide walks through adding full AST-aware language support to lang-check. It uses **TinyLang** (the project's reference demo language) as a running example. ## Overview The plugin path gives a language first-class integration with lang-check: - **AST-aware prose extraction** -- tree-sitter parses the document, and a Rust module walks the syntax tree to collect only the nodes that contain human-written prose. - **Math exclusion zones** -- inline and display math are recognized by the grammar and either skipped entirely or replaced with spaces (preserving byte offsets so diagnostics map back correctly). - **Code block / comment skipping** -- fenced code, inline code, and comments are pruned from the AST walk so they never reach the grammar checker. - **Structural command filtering** -- commands whose arguments are identifiers or metadata (e.g. `@import`, `@ref`) are distinguished from commands whose arguments are prose (e.g. `@title`, `@note`). The end result is that lang-check only grammar-checks real prose, with accurate source positions for every diagnostic. ## Prerequisites - A working Rust toolchain (`cargo`, `cc` crate for C compilation) - [tree-sitter CLI](https://tree-sitter.github.io/tree-sitter/creating-parsers) (`npm install -g tree-sitter-cli` or `cargo install tree-sitter-cli`) - Node.js (tree-sitter grammars are authored in JavaScript) ## Step-by-step Guide Throughout this guide, replace `tinylang` / `TinyLang` / `.tiny` with your language's name and file extension. --- ### Step A: Write the tree-sitter grammar Create a directory for the grammar inside `rust-core/`: ``` rust-core/tree-sitter-tinylang/ grammar.js package.json ``` **`package.json`** -- minimal tree-sitter project metadata: ```json { "name": "tree-sitter-tinylang", "version": "0.1.0", "description": "TinyLang grammar for tree-sitter (demo language for lang-check)", "main": "bindings/node", "keywords": ["parser", "tree-sitter", "tinylang"], "tree-sitter": [ { "scope": "source.tinylang", "file-types": ["tiny"] } ] } ``` **`grammar.js`** -- the grammar itself. The key decisions are: 1. Expose `text` as a leaf node for prose content. 2. Give non-prose constructs their own node kinds (`code_block`, `inline_math`, `comment`, etc.) so the Rust extractor can skip them. 3. Use `externals` if tree-sitter's regex engine cannot handle a construct (e.g. cross-line fenced code blocks). Here is TinyLang's complete grammar: ```text /// // @ts-check module.exports = grammar({ name: "tinylang", extras: $ => [/[ \t\r\n]/], externals: $ => [$.code_block], rules: { source_file: $ => repeat($._node), _node: $ => choice( $.heading, $.command, $.display_math, $.inline_math, $.code_block, $.code_span, $.link, $.comment, $.bold, $.italic, $.text, ), heading: $ => prec.right(seq( token(prec(1, /#{1,6} /)), repeat(choice($.bold, $.italic, $.code_span, $.inline_math, $.text)), )), command: $ => prec.right(seq( $.command_name, optional($.command_arg), )), command_name: $ => /@[a-zA-Z][a-zA-Z0-9_-]*/, command_arg: $ => seq('{', repeat($._node), '}'), link: $ => seq($.link_text, $.link_url), link_text: $ => seq('[', repeat(choice($.bold, $.italic, $.text)), ']'), link_url: $ => seq('(', /[^)]*/, ')'), bold: $ => seq('*', $.text, '*'), italic: $ => seq('_', $.text, '_'), code_span: $ => /`[^`\n]*`/, inline_math: $ => /\$[^$\n]+\$/, display_math: $ => token(seq('$$', /[^$]+/, '$$')), comment: $ => /\/\/[^\n]*/, // Plain text: runs of non-special characters (lowest precedence) text: $ => token(prec(-1, /[^\\\{\}\[\]\(\)\n\t *_`$@#\/]+/)), }, }); ``` Important patterns to follow: - **`text`** must be the lowest-precedence token (`prec(-1, ...)`) so that special constructs win when there is ambiguity. - Use `prec.right(...)` for constructs that should consume as much as possible (headings, commands). - Keep the `_node` choice in priority order -- more specific constructs first. ### Step B: Write the external scanner (optional) If your language has constructs that cannot be expressed with tree-sitter's regex engine (e.g. cross-line delimited blocks), write an external scanner in C. TinyLang needs one for `~~~...~~~` code fences: **`tree-sitter-tinylang/src/scanner.c`**: ```c #include "tree_sitter/parser.h" enum TokenType { CODE_BLOCK, }; void *tree_sitter_tinylang_external_scanner_create(void) { return NULL; } void tree_sitter_tinylang_external_scanner_destroy(void *p) { (void)p; } unsigned tree_sitter_tinylang_external_scanner_serialize(void *p, char *buf) { (void)p; (void)buf; return 0; } void tree_sitter_tinylang_external_scanner_deserialize( void *p, const char *buf, unsigned len ) { (void)p; (void)buf; (void)len; } bool tree_sitter_tinylang_external_scanner_scan( void *payload, TSLexer *lexer, const bool *valid_symbols ) { (void)payload; if (!valid_symbols[CODE_BLOCK]) return false; // Skip whitespace while (lexer->lookahead == ' ' || lexer->lookahead == '\t' || lexer->lookahead == '\r' || lexer->lookahead == '\n') { lexer->advance(lexer, true); } // Match opening ~~~ if (lexer->lookahead != '~') return false; lexer->advance(lexer, false); if (lexer->lookahead != '~') return false; lexer->advance(lexer, false); if (lexer->lookahead != '~') return false; lexer->advance(lexer, false); // Consume until closing ~~~ int tilde_count = 0; while (!lexer->eof(lexer)) { if (lexer->lookahead == '~') { tilde_count++; lexer->advance(lexer, false); if (tilde_count == 3) { lexer->result_symbol = CODE_BLOCK; return true; } } else { tilde_count = 0; lexer->advance(lexer, false); } } return false; } ``` The five `tree_sitter__external_scanner_*` functions are mandatory. If your scanner is stateless (like this one), the serialize/deserialize functions can be empty. ### Step C: Generate the parser From the grammar directory, run: ```sh cd rust-core/tree-sitter-tinylang tree-sitter generate ``` This produces: - `src/parser.c` -- the generated parser - `src/grammar.json` -- serialized grammar - `src/node-types.json` -- node type metadata - `src/tree_sitter/parser.h` (and other headers) Commit all generated files. They are vendored so that building the project does not require the tree-sitter CLI. ### Step D: Create the Rust FFI binding Create `rust-core/src/tinylang_ts.rs`: ```rust use tree_sitter_language::LanguageFn; unsafe extern "C" { fn tree_sitter_tinylang() -> *const (); } pub const LANGUAGE: LanguageFn = unsafe { LanguageFn::from_raw(tree_sitter_tinylang) }; ``` The `tree_sitter_tinylang` symbol is provided by the compiled `parser.c`. The function name **must** follow the convention `tree_sitter_`, where `` matches the `name` field in `grammar.js`. Then register the module in `rust-core/src/lib.rs`: ```rust pub mod tinylang_ts; ``` ### Step E: Write the prose extractor module Create `rust-core/src/prose/tinylang.rs`. This module receives the parsed tree-sitter AST and returns a `Vec` -- the byte ranges of prose text plus any exclusion zones within those ranges. The structure has three parts: **1. Configuration constants** -- lists of node kinds and command names that control what gets skipped: ```rust use tree_sitter::Node; use super::ProseRange; /// Commands whose arguments contain identifiers/metadata, not prose. const STRUCTURAL_COMMANDS: &[&str] = &[ "@author", "@date", "@import", "@ref", "@tag", "@id", "@class", ]; /// Node kinds that are never prose and whose subtrees should be skipped. const SKIP_KINDS: &[&str] = &[ "inline_math", "display_math", "code_block", "code_span", "comment", "command_name", "link_url", ]; ``` **2. AST walk** -- a recursive function that collects `text` leaf node byte ranges, skipping non-prose subtrees: ```rust pub(crate) fn extract(text: &str, root: Node) -> Vec { let mut word_ranges: Vec<(usize, usize)> = Vec::new(); collect_prose_nodes(root, text, false, &mut word_ranges); merge_ranges(&word_ranges, text) } fn collect_prose_nodes( node: Node, text: &str, skip: bool, out: &mut Vec<(usize, usize)>, ) { let kind = node.kind(); if SKIP_KINDS.contains(&kind) { return; } if kind == "command" { if skip || is_structural_command(node, text) { return; } let mut cursor = node.walk(); for child in node.children(&mut cursor) { collect_prose_nodes(child, text, false, out); } return; } if kind == "text" { if !skip { let start = node.start_byte(); let end = node.end_byte(); if start < end { out.push((start, end)); } } return; } let mut cursor = node.walk(); for child in node.children(&mut cursor) { collect_prose_nodes(child, text, skip, out); } } ``` **3. Range merging** -- adjacent text nodes are merged into sentence-level chunks. Gaps between text nodes are analyzed: if a gap contains only whitespace and punctuation (after stripping language-specific noise like math and commands), the ranges merge. If a gap contains a paragraph break (`\n\n`), a new `ProseRange` starts. Math regions within bridgeable gaps are recorded as exclusion zones so the grammar checker sees spaces instead of math content: ```rust fn merge_ranges(words: &[(usize, usize)], text: &str) -> Vec { if words.is_empty() { return Vec::new(); } let mut ranges = Vec::new(); let mut chunk_start = words[0].0; let mut chunk_end = words[0].1; let mut exclusions: Vec<(usize, usize)> = Vec::new(); for &(start, end) in &words[1..] { let gap = &text[chunk_end..start]; if !is_bridgeable_gap(gap) { ranges.push(ProseRange { start_byte: chunk_start, end_byte: chunk_end, exclusions: std::mem::take(&mut exclusions), }); chunk_start = start; } else { collect_math_exclusions(gap, chunk_end, &mut exclusions); } chunk_end = end; } ranges.push(ProseRange { start_byte: chunk_start, end_byte: chunk_end, exclusions, }); ranges } ``` The `is_bridgeable_gap` and `collect_math_exclusions` helper functions are language-specific. See `rust-core/src/prose/tinylang.rs` for the full implementation, including `strip_tinylang_noise` which removes math, commands, code spans, bold/italic markers, and comments from a gap before testing whether it is bridgeable. ### Step F: Wire into the dispatch (`prose/mod.rs`) Register the new module and add a match arm in `ProseExtractor::extract`: ```rust // At the top of rust-core/src/prose/mod.rs: mod tinylang; // In the extract() method: pub fn extract(&mut self, text: &str, lang_id: &str) -> Result> { let tree = self.parser.parse(text, None) .ok_or_else(|| anyhow!("Failed to parse text"))?; let root = tree.root_node(); match lang_id { "latex" => Ok(latex::extract(text, root)), "forester" => Ok(forester::extract(text, root)), "tinylang" => Ok(tinylang::extract(text, root)), lang => query::extract(text, root, &self.language, lang), } } ``` Languages that do not have a dedicated extractor module fall through to the generic `query`-based extractor (the `lang` catch-all arm). The plugin path exists for when you need more control than the query path provides. ### Step G: Add to the language registry (`languages.rs`) Three things to update: **1. File extension mapping** -- add entries to `BUILTIN_EXTENSIONS`: ```rust const BUILTIN_EXTENSIONS: &[(&str, &str)] = &[ // ... existing entries ... ("tiny", "tinylang"), ]; ``` **2. Supported language IDs** -- add to `SUPPORTED_LANGUAGE_IDS`: ```rust pub const SUPPORTED_LANGUAGE_IDS: &[&str] = &[ "markdown", "html", "latex", "forester", "tinylang" ]; ``` **3. Language ID aliases** (optional) -- if VS Code or other editors use a different name for your language, add an entry to `LANGUAGE_ID_ALIASES`: ```rust const LANGUAGE_ID_ALIASES: &[(&str, &str)] = &[ ("mdx", "markdown"), ("xhtml", "html"), // ("mytinylang", "tinylang"), // if needed ]; ``` ### Step H: Update `build.rs` Add a `cc::Build` block to compile the vendored tree-sitter parser: ```rust // Compile vendored tree-sitter-tinylang parser let dir = std::path::Path::new("tree-sitter-tinylang/src"); cc::Build::new() .include(dir) .file(dir.join("parser.c")) .file(dir.join("scanner.c")) // omit if no external scanner .warnings(false) .compile("tree_sitter_tinylang"); ``` The library name passed to `.compile()` must match what the linker expects for the `tree_sitter_tinylang` extern symbol declared in the FFI binding. ### Step I: Update CLI and server binaries Both `language-check` (CLI) and `language-check-server` contain a `resolve_ts_language` function that maps language IDs to tree-sitter `Language` values. Add an arm for your language in each: **`rust-core/src/bin/language-check.rs`**: ```rust fn resolve_ts_language(lang: &str) -> tree_sitter::Language { match lang { "html" => tree_sitter_html::LANGUAGE.into(), "latex" => codebook_tree_sitter_latex::LANGUAGE.into(), "forester" => rust_core::forester_ts::LANGUAGE.into(), "tinylang" => rust_core::tinylang_ts::LANGUAGE.into(), _ => tree_sitter_md::LANGUAGE.into(), } } ``` **`rust-core/src/bin/language-check-server.rs`** -- identical match arm. ### Step J: Update the VS Code extension Two files need changes: **1. `extension/package.json`** -- add an activation event so the extension activates when a file of your language is opened: ```json "activationEvents": [ "onLanguage:markdown", "onLanguage:html", "onLanguage:latex", "onLanguage:forester", "onLanguage:tinylang" ] ``` **2. `extension/src/extension.ts`** -- add your language ID to the `supportedLanguages` array: ```typescript const supportedLanguages = [ 'markdown', 'html', 'latex', 'forester', 'tinylang', 'mdx', 'xhtml' ]; ``` This array controls which VS Code language IDs trigger the on-change diagnostic handler. The server-side `resolve_language_id` handles any alias resolution. --- ## Testing Strategy ### Unit tests (prose extraction) Add tests directly in `rust-core/src/prose/mod.rs` under the existing `#[cfg(test)] mod tests` block. Each test creates a `ProseExtractor`, feeds it a sample document, and asserts on the extracted prose ranges. Typical test cases: ```rust #[test] fn test_tinylang_basic_extraction() -> Result<()> { let language: tree_sitter::Language = crate::tinylang_ts::LANGUAGE.into(); let mut extractor = ProseExtractor::new(language)?; let text = "This is a simple sentence.\n"; let ranges = extractor.extract(text, "tinylang")?; assert!(!ranges.is_empty(), "Should extract prose from plain text"); let prose = ranges[0].extract_text(text); assert!(prose.contains("simple sentence")); Ok(()) } #[test] fn test_tinylang_code_excluded() -> Result<()> { let language: tree_sitter::Language = crate::tinylang_ts::LANGUAGE.into(); let mut extractor = ProseExtractor::new(language)?; let text = "Before code.\n\n~~~\nfn main() {}\n~~~\n\nAfter code.\n"; let ranges = extractor.extract(text, "tinylang")?; let all_prose: String = ranges.iter().map(|r| r.extract_text(text)).collect(); assert!(!all_prose.contains("fn main")); assert!(all_prose.contains("Before code")); Ok(()) } #[test] fn test_tinylang_structural_commands_excluded() -> Result<()> { let language: tree_sitter::Language = crate::tinylang_ts::LANGUAGE.into(); let mut extractor = ProseExtractor::new(language)?; let text = "@author{Jane Doe}\n@date{2025-01-01}\n\nSome prose text here.\n"; let ranges = extractor.extract(text, "tinylang")?; let all_prose: String = ranges.iter().map(|r| r.extract_text(text)).collect(); assert!(!all_prose.contains("Jane Doe")); assert!(all_prose.contains("prose text here")); Ok(()) } ``` Cover at least: - Plain prose extraction - Code block exclusion - Code span exclusion - Comment exclusion - Inline math exclusion - Display math exclusion zones (verify `extract_text` blanks the math) - Structural vs. prose command distinction - Sentence bridging across inline math and formatting commands - Paragraph splitting on `\n\n` ### End-to-end CLI test Create a sample `.tiny` file and run the CLI: ```sh cargo run --bin language-check -- check sample.tiny --lang tinylang ``` Verify that: - Diagnostics appear for intentional typos in prose - No diagnostics appear for code blocks, comments, or math - Line/column positions are correct ### Language registry tests Tests for `detect_language` and friends already exist in `rust-core/src/languages.rs`. Add a case for your new extension: ```rust #[test] fn detect_builtin_tinylang() { let config = default_config(); assert_eq!(detect_language(Path::new("doc.tiny"), &config), "tinylang"); } ``` --- ## Files Checklist When adding a new language via the plugin path, you will touch (or create) these files: | File | Action | |------|--------| | `rust-core/tree-sitter-/grammar.js` | Create -- tree-sitter grammar | | `rust-core/tree-sitter-/package.json` | Create -- tree-sitter project metadata | | `rust-core/tree-sitter-/src/scanner.c` | Create (if needed) -- external scanner | | `rust-core/tree-sitter-/src/parser.c` | Generated -- `tree-sitter generate` | | `rust-core/tree-sitter-/src/*.json` | Generated -- grammar/node-types metadata | | `rust-core/tree-sitter-/src/tree_sitter/*.h` | Generated -- tree-sitter headers | | `rust-core/src/_ts.rs` | Create -- FFI binding | | `rust-core/src/lib.rs` | Edit -- add `pub mod _ts;` | | `rust-core/src/prose/.rs` | Create -- prose extractor | | `rust-core/src/prose/mod.rs` | Edit -- add `mod ;` and match arm | | `rust-core/src/languages.rs` | Edit -- extension mapping + supported IDs | | `rust-core/build.rs` | Edit -- add `cc::Build` for the parser | | `rust-core/src/bin/language-check.rs` | Edit -- add `resolve_ts_language` arm | | `rust-core/src/bin/language-check-server.rs` | Edit -- add `resolve_ts_language` arm | | `extension/package.json` | Edit -- add `onLanguage:` activation event | | `extension/src/extension.ts` | Edit -- add to `supportedLanguages` array |