# Language Support

## Supported File Types

Language Check extracts prose from these file formats using tree-sitter parsers:

| Format              | Language ID | Extensions                | Parser / Strategy                        |
|---------------------|-------------|---------------------------|------------------------------------------|
| Markdown            | `markdown`  | `.md`, `.markdown`        | tree-sitter-markdown                     |
| MDX                 | (alias)     | `.mdx`                    | Treated as Markdown                      |
| HTML                | `html`      | `.html`, `.htm`           | tree-sitter-html                         |
| XHTML               | (alias)     | `.xhtml`                  | Treated as HTML                          |
| LaTeX               | `latex`     | `.tex`, `.latex`, `.ltx`  | tree-sitter-latex                        |
| R Sweave            | `sweave`    | `.Rnw`, `.rnw`            | R chunk preprocessing + tree-sitter-latex |
| reStructuredText    | `rst`       | `.rst`, `.rest`           | tree-sitter-rst                          |
| Org mode            | `org`       | `.org`                    | tree-sitter-org (vendored)               |
| BibTeX              | `bibtex`    | `.bib`                    | tree-sitter-bibtex                       |
| Typst               | `typst`     | `.typ`                    | tree-sitter-typst (vendored)             |
| Forester            | `forester`  | `.tree`                   | tree-sitter-forester (vendored)          |
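
As a rough illustration of how the table reads, the sketch below maps a file path to a language ID, resolving the `.mdx` and `.xhtml` aliases to their base formats. The function name `language_id_for_path` is hypothetical and is not part of Language Check's API. The match is case-sensitive and lists only the extensions shown in the table; whether the tool itself matches extensions case-insensitively is not stated here.

```rust
use std::path::Path;

/// Hypothetical lookup mirroring the table above; not the tool's actual API.
fn language_id_for_path(path: &str) -> Option<&'static str> {
    // `extension()` returns the part after the final dot, without the dot.
    let ext = Path::new(path).extension()?.to_str()?;
    match ext {
        // Aliases (`mdx`, `xhtml`) resolve to the ID of the format they reuse.
        "md" | "markdown" | "mdx" => Some("markdown"),
        "html" | "htm" | "xhtml" => Some("html"),
        "tex" | "latex" | "ltx" => Some("latex"),
        "Rnw" | "rnw" => Some("sweave"),
        "rst" | "rest" => Some("rst"),
        "org" => Some("org"),
        "bib" => Some("bibtex"),
        "typ" => Some("typst"),
        "tree" => Some("forester"),
        // Anything else is not a built-in format.
        _ => None,
    }
}

fn main() {
    for path in ["notes.mdx", "paper.Rnw", "refs.bib", "main.rs"] {
        println!("{path} -> {:?}", language_id_for_path(path));
    }
}
```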

### Prose extraction details

Each format has a custom prose extractor that knows which parts of a document contain human-readable text:

- **Markdown / HTML** — Uses tree-sitter query patterns to select prose nodes, skipping code blocks, front matter, and inline code.
- **LaTeX** — Tree-walks the AST collecting `word` nodes from `\begin{document}` onward. Skips preamble, math environments, verbatim/minted/algorithm blocks, and structural commands (`\ref`, `\label`, `\includegraphics`, etc.). Display math (`\[...\]`) bridges into surrounding prose as an exclusion zone.
- **R Sweave** — Preprocesses R code chunks (`<<...>>=` through `@`) by blanking them with whitespace, then delegates to the LaTeX extractor.
- **reStructuredText** — Extracts `paragraph` and `title` nodes. Skips code-block, math, raw, and similar directives. Inline literals are marked as exclusion zones.
- **Org mode** — Extracts paragraph text and heading titles. Skips `#+begin_src` blocks, drawers (`:PROPERTIES:`), LaTeX environments, comments, and tables.
- **BibTeX** — Extracts prose from specific fields: `title`, `booktitle`, `abstract`, `note`, `annote`, `annotation`, `howpublished`, and `series`. Other fields (author, journal, year, etc.) are ignored. LaTeX commands inside values (e.g. `\emph{...}`) are handled via exclusion zones.
- **Typst** — Collects `text` nodes from paragraphs, headings, and list items. Skips code blocks (`` ``` ``), inline code (`` ` ``), math (`$...$` and `$ ... $`), `#code` expressions, set/show rules, let bindings, imports, includes, labels, references, URLs, escapes, and comments. Inline markup (`*bold*`, `_italic_`) is bridged through.
- **Forester** — Collects `text` and `escape` nodes, skipping math (`#{...}`, `##{...}`), verbatim fences, wiki links, comments, and structural commands (`\import`, `\ref`, `\def`, etc.). Display math bridges as an exclusion zone.
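
The R Sweave preprocessing step can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the function name `blank_sweave_chunks` is hypothetical, and it assumes a chunk starts on a line of the form `<<...>>=` and ends on the next line that begins with `@`. Each byte inside a chunk is replaced with a space while line breaks are kept, so byte offsets and line numbers in the blanked text still match the original file.

```rust
/// Hypothetical helper: blank Sweave code chunks so a LaTeX parser sees only prose.
fn blank_sweave_chunks(source: &str) -> String {
    let mut out = String::with_capacity(source.len());
    let mut in_chunk = false;

    // `split_inclusive` keeps each line terminator, so "\n" and "\r\n" survive.
    for line in source.split_inclusive('\n') {
        let trimmed = line.trim_end();
        let opens = !in_chunk && trimmed.starts_with("<<") && trimmed.ends_with(">>=");
        let closes = in_chunk && trimmed.starts_with('@');

        if in_chunk || opens {
            // Replace every byte except line terminators with a space,
            // so the output has the same byte length as the input.
            for ch in line.chars() {
                if ch == '\n' || ch == '\r' {
                    out.push(ch);
                } else {
                    for _ in 0..ch.len_utf8() {
                        out.push(' ');
                    }
                }
            }
        } else {
            out.push_str(line);
        }

        if opens {
            in_chunk = true;
        } else if closes {
            in_chunk = false;
        }
    }
    out
}

fn main() {
    let doc = "Intro text.\n<<plot, echo=FALSE>>=\nx <- 1:10\n@\nMore prose.\n";
    let blanked = blank_sweave_chunks(doc);
    // Offsets are preserved: the blanked text is exactly as long as the input.
    assert_eq!(blanked.len(), doc.len());
    print!("{blanked}");
}
```

Keeping the length unchanged means any position reported against the blanked text can be used directly in the original `.Rnw` file.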

### Adding more file types

You can add support for extra file types without writing code in two ways:

- map new extensions onto existing built-in language IDs, or
- define regex-based Simplified Language Schema (SLS) YAML files in
  `.langcheck/schemas/`.

See the [Config-Only Language Guide](../guide-config-language.md) for both
workflows, including a full schema example.

```{tip}
To add support for an entirely new markup language with its own tree-sitter grammar, see the [Plugin Language Guide](../guide-plugin-language.md).
```

## Checking Languages

The language used for spell checking and grammar checking is independent of the file type. Click the language indicator in the VS Code status bar to switch between:

- **EN-US** — American English
- **EN-GB** — British English
- **DE-DE** — German
- **FR** — French
- **ES** — Spanish

When no explicit language is set, the language can also be detected automatically using the [whatlang](https://crates.io/crates/whatlang) crate.