TextMateLib (tml)
A modern C++ implementation of the TextMate syntax highlighting engine: cross-platform, high-performance, and with minimal dependencies.
Overview
TextMateLib provides grammar-based syntax highlighting using TextMate-format grammars and themes. It tokenizes source code line-by-line, returning tokens with scope information that can be styled using a theme.
Key Features:
- ✅ Grammar-based tokenization - TextMate format grammars for any language
- ✅ Stateful incremental parsing - Efficient line-by-line processing with caching
- ✅ Theme support - Apply colors and styles to tokens via scopes
- ✅ Multiple language support - 30+ built-in language grammars
- ✅ High performance - Optimized for editor integration
- ✅ No external dependencies - Oniguruma (regex) and RapidJSON (JSON) are bundled
- ✅ Cross-platform - Linux, macOS, Windows, WebAssembly
- ✅ Language bindings - C, C#/.NET, JavaScript/WASM, Python (planned)
Quick Start
Basic C++ Usage
using namespace tml;
auto onigLib = std::make_shared<DefaultOnigLib>();
auto registry = std::make_shared<Registry>(onigLib);
registry->addGrammarFromFile("path/to/javascript.json");
auto grammar = registry->loadGrammar("source.javascript");
auto theme = Theme::createFromPath("path/to/dark-plus.json");
auto prevState = StateStack::INITIAL;
auto result = grammar->tokenizeLine("const x = 42;", prevState);
for (const auto& token : result->tokens) {
    auto scopePath = token->scopePath();
    uint32_t color = theme->match(scopePath);
    printf("Token '%s' -> color: #%08x\n", scopePath.c_str(), color);
}
Header: tml.h, the comprehensive public API header for TextMateLib.
Basic C API Usage
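/* result is the TextMateTokenizeResult* returned by textmate_tokenize_line() */
for (int32_t i = 0; i < result->tokenCount; i++) {
    TextMateToken token = result->tokens[i];
    char* scope = token.scopes[0];
}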
C API symbols referenced in this section (from c_api.h, the C language API for TextMateLib):
Handle types (each is a void *):
- TextMateGrammar: Handle to a grammar definition for a specific language.
- TextMateTheme: Handle to a theme object containing color schemes.
- TextMateRegistry: Handle to the grammar registry managing loaded grammars and themes.
- TextMateStateStack: Handle to a parsing state stack (immutable, used for incremental tokenization).
- TextMateOnigLib: Handle to the Oniguruma regex library instance.
Functions:
- TextMateOnigLib textmate_oniglib_create(): Initialize the Oniguruma regular expression library.
- void textmate_oniglib_dispose(TextMateOnigLib onigLib): Free the Oniguruma library.
- TextMateRegistry textmate_registry_create(TextMateOnigLib onigLib): Create a new grammar registry.
- void textmate_registry_dispose(TextMateRegistry registry): Free a registry and all its resources.
- int textmate_registry_add_grammar_from_file(TextMateRegistry registry, const char *grammarPath): Register a grammar from a JSON file.
- TextMateGrammar textmate_registry_load_grammar(TextMateRegistry registry, const char *scopeName): Load a grammar by scope name.
- TextMateTheme textmate_theme_load_from_file(const char *themePath): Load a theme from a JSON file.
- uint32_t textmate_theme_get_foreground(TextMateTheme theme, const char *scopePath, uint32_t defaultColor): Get the foreground color for a scope path.
- void textmate_theme_dispose(TextMateTheme theme): Free a theme object and release resources.
- TextMateStateStack textmate_get_initial_state(): Get the initial parsing state.
- TextMateTokenizeResult * textmate_tokenize_line(TextMateGrammar grammar, const char *lineText, TextMateStateStack prevState): Tokenize a single line of text with decoded scopes.
- void textmate_free_tokenize_result(TextMateTokenizeResult *result): Free a line tokenization result.
Structures:
- TextMateToken: Represents a single token in tokenized text.
  - char ** scopes: Array of scope strings (e.g., "keyword.control", "string.quoted.double").
- TextMateTokenizeResult: Result from tokenizing a single line with decoded tokens.
  - TextMateToken * tokens: Array of tokens found in this line.
  - int32_t tokenCount: Number of tokens in the array.
Core Concepts
Grammar and Tokenization
A grammar defines syntax rules for a language. TextMateLib tokenizes text by:
- Matching text against grammar patterns
- Entering/exiting rules as patterns match/end
- Outputting tokens with scope hierarchies
Tokens represent text ranges with scope information:
- Scope: source.js keyword.control (space-separated hierarchy)
- Range: character positions in the line
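To make the token model concrete, here is a minimal sketch of a token as a character range plus a scope stack. The type SketchToken and its field names are hypothetical and are not TextMateLib's actual IToken; only the space-separated scope path format is taken from the description above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token type for illustration; not TextMateLib's IToken.
struct SketchToken {
    std::size_t startIndex;           // first character of the token in the line
    std::size_t endIndex;             // one past the last character
    std::vector<std::string> scopes;  // outermost scope first

    // Join the scope stack into a space-separated scope path.
    std::string scopePath() const {
        std::string path;
        for (const auto& scope : scopes) {
            if (!path.empty()) path += ' ';
            path += scope;
        }
        return path;
    }

    // Extract the text this token covers from its line.
    std::string text(const std::string& line) const {
        return line.substr(startIndex, endIndex - startIndex);
    }
};
```

A token built as SketchToken{0, 2, {"source.js", "keyword.control"}} for the line "if (x) {}" would cover the text "if" and report the scope path shown in the list above.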
StateStack - Incremental Parsing
The StateStack represents the parsing state at the end of a line:
- Encodes which grammar rules are active
- Enables resuming on the next line
- Two StateStacks that compare equal via equals() represent the same parsing position
This enables incremental tokenization: if a line's text and initial state are both unchanged, its tokens are unchanged too, so re-tokenization after an edit can stop at that point (the early-stopping optimization).
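The idea of carrying state from one line to the next can be sketched with a toy scanner. This is not TextMateLib code: ToyState, scanLine, and scanDocument are hypothetical names, and the state is reduced to a single flag (inside a block comment or not), whereas a real StateStack tracks the full stack of active grammar rules.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy parsing state: are we inside a /* ... */ comment at the end of a line?
struct ToyState {
    bool inBlockComment = false;
    bool equals(const ToyState& other) const {
        return inBlockComment == other.inBlockComment;
    }
};

// Scan one line starting from `prev` and return the state at the end of the line.
ToyState scanLine(const std::string& line, ToyState prev) {
    ToyState state = prev;
    for (std::size_t i = 0; i + 1 < line.size(); ++i) {
        if (!state.inBlockComment && line[i] == '/' && line[i + 1] == '*') {
            state.inBlockComment = true;
            ++i;  // skip the '*'
        } else if (state.inBlockComment && line[i] == '*' && line[i + 1] == '/') {
            state.inBlockComment = false;
            ++i;  // skip the '/'
        }
    }
    return state;
}

// Thread the state through a document: each line starts from the
// previous line's end state.
std::vector<ToyState> scanDocument(const std::vector<std::string>& lines) {
    std::vector<ToyState> endStates;
    ToyState state;  // initial state, analogous to StateStack::INITIAL
    for (const auto& line : lines) {
        state = scanLine(line, state);
        endStates.push_back(state);
    }
    return endStates;
}
```

The same line of text can be scanned differently depending on the state it starts from, which is why the end state of each line has to be kept and passed on.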
Themes - Styling Tokens
A theme maps scopes to colors and font styles:
- Scope path matching: keyword.control matches source.js keyword.control
- Returns: foreground color, background color, font style (italic/bold/underline)
- Fallback: default theme colors when no match found
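A simplified sketch of this kind of scope matching follows. It is an assumption-laden illustration, not the library's Theme implementation: SketchRule, selectorMatchesScope, and matchForeground are hypothetical names, and parent selectors (such as "source.js keyword") are deliberately left out. It checks scopes from innermost to outermost and lets the longest matching selector win.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical theme rule for illustration; not TextMateLib's Theme class.
struct SketchRule {
    std::string selector;      // e.g. "keyword.control"
    std::uint32_t foreground;  // color value returned on a match
};

// A selector matches a scope if it equals the scope or is a
// dot-delimited prefix of it.
bool selectorMatchesScope(const std::string& selector, const std::string& scope) {
    if (scope.compare(0, selector.size(), selector) != 0) return false;
    return scope.size() == selector.size() || scope[selector.size()] == '.';
}

// Resolve a foreground color for a space-separated scope path.
std::uint32_t matchForeground(const std::vector<SketchRule>& rules,
                              const std::string& scopePath,
                              std::uint32_t defaultColor) {
    // Split the path into individual scopes.
    std::vector<std::string> scopes;
    std::istringstream stream(scopePath);
    for (std::string scope; stream >> scope;) scopes.push_back(scope);

    // Walk from the innermost scope outward; the first scope with a match wins.
    for (auto it = scopes.rbegin(); it != scopes.rend(); ++it) {
        const SketchRule* best = nullptr;
        for (const auto& rule : rules) {
            if (selectorMatchesScope(rule.selector, *it) &&
                (!best || rule.selector.size() > best->selector.size())) {
                best = &rule;
            }
        }
        if (best) return best->foreground;
    }
    return defaultColor;  // fallback when nothing matches
}
```

With rules for "keyword" and "keyword.control", the path "source.js keyword.control" resolves to the more specific "keyword.control" rule, and a path with no matching scope falls back to the default color.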
API Overview
C++ API (tml.h)
High-level, type-safe C++ interface.
Example - Using Session API (recommended for editors):
using namespace tml;
auto session = SessionImpl::create(grammar, 100);
std::vector<std::string> lines = {"const x = 1;", "const y = 2;", "..."};
session->setLines(lines);
auto tokens = session->getLine(0)->tokens;
session->edit(1, 2, {"const a = 1;", "const b = 2;"});
C API (c_api.h)
Foreign function interface (FFI) for language bindings (C#, Python, Node.js, etc.):
Key differences from C++:
- Opaque handles instead of C++ objects
- Explicit memory management (each create/load call is paired with a dispose or free call)
- Stateless tokenization (manage state yourself)
- Batch operations for reducing FFI overhead
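When C++ code has to go through opaque handles, one common way to keep the explicit create/dispose pairing safe is a small RAII wrapper. The sketch below is self-contained and uses stand-in functions (fake_handle_create and fake_handle_dispose, both hypothetical) in place of a real pair such as textmate_registry_create and textmate_registry_dispose; it is one possible pattern, not something the library is known to provide.

```cpp
#include <memory>

// Stand-ins for a C API create/dispose pair so the sketch compiles on its own.
using FakeHandle = void*;
static int g_liveHandles = 0;  // counts handles that have not been disposed yet

FakeHandle fake_handle_create() {
    ++g_liveHandles;
    return new int(0);
}

void fake_handle_dispose(FakeHandle handle) {
    delete static_cast<int*>(handle);
    --g_liveHandles;
}

// Deleter that forwards to the dispose function.
struct FakeHandleDeleter {
    void operator()(void* handle) const { fake_handle_dispose(handle); }
};

// Owning wrapper: the handle is disposed exactly once, when the wrapper
// goes out of scope or is reset.
using ScopedHandle = std::unique_ptr<void, FakeHandleDeleter>;

ScopedHandle makeScopedHandle() {
    return ScopedHandle(fake_handle_create());
}
```

The raw handle is still available through get() for passing to C functions, while ownership stays with the wrapper.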
Common Workflows
1. Basic Syntax Highlighting
auto onigLib = std::make_shared<DefaultOnigLib>();
auto registry = std::make_shared<Registry>(onigLib);
registry->addGrammarFromFile("javascript.json");
auto grammar = registry->loadGrammar("source.javascript");
auto theme = Theme::createFromPath("dark-plus.json");
auto state = StateStack::INITIAL;
auto result = grammar->tokenizeLine("const x = 42;", state);
2. Multi-Line Document (Session API)
auto session = SessionImpl::create(grammar, lines.size());
session->setLines(lines);
session->setTheme(theme);
auto highlightedLine = session->getLine(0);
for (const auto& token : highlightedLine->tokens) {
    // use each token here, e.g. to render it
}
session->edit(5, 7, {"new line 1", "new line 2"});
3. Stateless Tokenization (C API or Low-Level)
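/* grammar and lines are assumed to be set up as in Quick Start */
TextMateStateStack state = textmate_get_initial_state();
for (int i = 0; i < 10; i++) {
    TextMateTokenizeResult* result = textmate_tokenize_line(grammar, lines[i], state);
    state = result->ruleStack;  /* state at end of line, passed to the next line's tokenization */
}
Release each result with textmate_free_tokenize_result() once it is no longer needed.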
4. Batch Tokenization (Reducing FFI Overhead)
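/* grammar and lines are assumed to be set up as in Quick Start */
TextMateTokenizeMultiLinesResult* batch = textmate_tokenize_lines(
    grammar, lines, 10, textmate_get_initial_state()
);
for (int i = 0; i < 10; i++) {
    /* process the tokens of line i */
}
textmate_free_tokenize_lines_result(batch);
- TextMateTokenizeMultiLinesResult * textmate_tokenize_lines(TextMateGrammar grammar, const char **lines, int32_t lineCount, TextMateStateStack initialState): Tokenize multiple lines in a single call.
- void textmate_free_tokenize_lines_result(TextMateTokenizeMultiLinesResult *result): Free a batch tokenization result.
- TextMateTokenizeMultiLinesResult: Result from batch tokenizing multiple lines.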
Performance Tips
1. Use Session API for Editors
The Session API (C++) handles incremental tokenization automatically:
- Caches line tokens
- Detects when state hasn't changed (early stopping)
- Only re-parses affected lines
Impact: Editing a line in a 10,000-line file typically re-tokenizes only about 10 lines.
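The caching and early-stopping behaviour described here can be sketched independently of the library. In the sketch below, LineCache, ScanFn, and editLine are hypothetical names, the state is a plain int standing in for a StateStack, and SessionImpl's real internals may differ. After an edit, lines are re-scanned downward until one of them ends in the same state it had before.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical line cache for illustration; not SessionImpl.
struct LineCache {
    std::vector<std::string> lines;
    std::vector<int> endStates;  // cached state at the end of each line
};

// Scans one line from a start state and returns the end state.
using ScanFn = std::function<int(const std::string& line, int prevState)>;

// Replace one line, then re-scan downward until a line's end state matches
// its cached value. Returns the number of lines that were re-scanned.
std::size_t editLine(LineCache& cache, std::size_t index,
                     const std::string& newText, const ScanFn& scan,
                     int initialState = 0) {
    cache.lines[index] = newText;
    int state = index == 0 ? initialState : cache.endStates[index - 1];
    std::size_t rescanned = 0;
    for (std::size_t i = index; i < cache.lines.size(); ++i) {
        state = scan(cache.lines[i], state);
        ++rescanned;
        const bool unchanged = (state == cache.endStates[i]);
        cache.endStates[i] = state;
        if (unchanged) break;  // later lines start from the same state: stop early
    }
    return rescanned;
}
```

An edit that leaves the line's end state untouched costs a single re-scan, while an edit that changes it (for example, opening a block) propagates only as far as the state keeps differing from the cache.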
2. Reuse Grammar and Theme Objects
Create grammar and theme once, reuse for all tokenization:
// Good: load the grammar and theme once, then reuse them
auto grammar = registry->loadGrammar("source.javascript");
auto theme = Theme::createFromPath("dark-plus.json");
for (auto& document : documents) {
    auto session = SessionImpl::create(grammar, document->lines.size());
}
// Avoid: reloading the grammar for every line
for (auto& line : lines) {
    auto grammar = registry->loadGrammar("source.javascript");
}
3. Use Encoded Tokens (C API)
For performance-critical code, use textmate_tokenize_line2() instead of textmate_tokenize_line():
- Returns compact 32-bit encoded tokens instead of scope arrays
- Reduces memory overhead
- Faster for large files
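As an illustration of what a compact 32-bit token encoding can look like, the sketch below decodes the bit layout used by vscode-textmate for its encoded token metadata. Whether textmate_tokenize_line2() uses this exact layout is an assumption that should be checked against c_api.h; DecodedMetadata and decodeMetadata are hypothetical names.

```cpp
#include <cstdint>

// Bit layout assumed from vscode-textmate's encoded token metadata.
// TextMateLib's actual layout is not confirmed here.
struct DecodedMetadata {
    std::uint32_t languageId;  // bits 0-7
    std::uint32_t tokenType;   // bits 8-9
    std::uint32_t fontStyle;   // bits 11-14 (bit flags)
    std::uint32_t foreground;  // bits 15-23 (index into a color table)
    std::uint32_t background;  // bits 24-31 (index into a color table)
};

// Unpack the individual fields with shifts and masks.
DecodedMetadata decodeMetadata(std::uint32_t metadata) {
    DecodedMetadata d;
    d.languageId = metadata & 0xFFu;
    d.tokenType  = (metadata >> 8) & 0x3u;
    d.fontStyle  = (metadata >> 11) & 0xFu;
    d.foreground = (metadata >> 15) & 0x1FFu;
    d.background = (metadata >> 24) & 0xFFu;
    return d;
}
```

Packing style information into one integer per token avoids allocating an array of scope strings for every token, which is where the memory saving comes from.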
4. Batch Operations (C API)
Use textmate_tokenize_lines() to tokenize multiple lines in one FFI call:
- Reduces FFI call overhead (important for .NET, Python bindings)
- Better for batch processing
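From C++, a batch call that takes a const char ** needs the lines as an array of C strings. A minimal sketch of that conversion is shown below; toCStringArray is a hypothetical helper, not part of the library.

```cpp
#include <string>
#include <vector>

// Build the `const char**` view a batch call expects from C++ strings.
// The pointers stay valid only while `lines` is alive and unmodified.
std::vector<const char*> toCStringArray(const std::vector<std::string>& lines) {
    std::vector<const char*> pointers;
    pointers.reserve(lines.size());
    for (const auto& line : lines) {
        pointers.push_back(line.c_str());
    }
    return pointers;
}
```

The result's data() and size() can then be passed as the lines and lineCount arguments of textmate_tokenize_lines(), with the size cast to int32_t.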
5. Early Stopping
Session API automatically limits tokenization depth:
- Prevents pathological regex cases (catastrophic backtracking)
- Can be tuned via SessionImpl configuration
Architecture
Component Hierarchy
┌─────────────────────────────────────────┐
│ Public APIs (C++, C, WASM)              │
├─────────────────────────────────────────┤
│ Session API      │ Syntax Highlighter   │
│ (Stateful)       │ (Convenience)        │
├─────────────────────────────────────────┤
│ Grammar (Tokenization Logic)            │
│  - Rule matching, state transitions     │
│  - Scope stack management               │
├─────────────────────────────────────────┤
│ Registry (Grammar/Theme Management)     │
│  - Grammar lookup and caching           │
│  - Theme application                    │
├─────────────────────────────────────────┤
│ Core Components                         │
│  - Regex engine (Oniguruma)             │
│  - JSON parser (RapidJSON)              │
│  - Scope/attribute providers            │
└─────────────────────────────────────────┘
Key Data Structures
| Type | Purpose |
| Grammar | Compiled grammar with rule matching logic |
| StateStack | Immutable parsing state for incremental resumption |
| Theme | Scope-to-color-and-style mapping |
| IToken | Single token with scope hierarchy |
| Registry | Central manager for grammars and themes |
| Session | Stateful document with line caching |
API Reference
Core Modules
Grammar & Tokenization:
Styling:
C++ Convenience:
C Bindings:
Platform Support
| Platform | Status | API |
| Linux | ✅ | C++, C |
| macOS | ✅ | C++, C |
| Windows | ✅ | C++, C |
| WebAssembly | ✅ | C++, C, JavaScript |
| .NET (C#) | ✅ | C (via P/Invoke) |
| Node.js | ✅ | WASM |
Building and Installation
Native Build
./scripts/build.sh
# Output: build/lib/libtm.so, build/include/tml/
WebAssembly Build
./scripts/build-wasm-standard.sh
# Output: build/wasm-standard/browser/tml-standard.js + .wasm
C# / .NET Bindings
./scripts/build-shared.sh
cd tests/csharp/TextMateLib.Tests
dotnet test
See CLAUDE.md in the project root for detailed build instructions.
Examples
Load Multiple Grammars
auto registry = std::make_shared<Registry>(onigLib);
registry->addGrammarFromFile("javascript.json");
registry->addGrammarFromFile("python.json");
registry->addGrammarFromFile("markdown.json");
auto jsGrammar = registry->loadGrammar("source.javascript");
auto pyGrammar = registry->loadGrammar("source.python");
auto mdGrammar = registry->loadGrammar("text.markdown");
Grammar Injections (Embedded Grammars)
const char* injections[] = {"source.regexp"};
registry->setInjections("source.js string.quoted", injections, 1);
auto grammar = registry->loadGrammar("source.javascript");
auto result = grammar->tokenizeLine("const pattern = /[a-z]+/;", state);
Edit a Document
auto session = SessionImpl::create(grammar, 100);
session->setLines(original_lines);
std::vector<std::string> new_lines = {"edited line 1", "edited line 2"};
session->edit(10, 12, new_lines);
auto tokens = session->getLine(10)->tokens;
Troubleshooting
Tokenization Not Working
- Grammar not found: Ensure grammar was registered before loading
registry->addGrammarFromFile("grammar.json");
auto grammar = registry->loadGrammar("source.mylang");
- Empty tokens: Check grammar has valid rules and patterns
- State issues: Always pass the previous line's ruleStack as prevState
auto result1 = grammar->tokenizeLine(line1, StateStack::INITIAL);
auto result2 = grammar->tokenizeLine(line2, result1->ruleStack);
Theme Colors Not Applying
- Scope not matching: Theme only returns colors for matched scopes
- Use theme->match(scopePath) to test scope matching
- TextMate scope matching uses prefix rules
- Default color: If scope not found, theme returns defaultColor
uint32_t color = theme->match("my.custom.scope", 0xFFFFFFFF);
Performance Issues
- Slow tokenization: Use Session API instead of manual state management
- High memory: Reduce SessionImpl line cache size or use stateless tokenization
- FFI overhead: Use batch operations (textmate_tokenize_lines) instead of per-line calls
Contributing
TextMateLib is open source! Contributions welcome:
- Report issues: GitHub Issues
- Submit PRs: Bug fixes, optimizations, language bindings
- Add grammars: New language support via grammar files
See project README for contribution guidelines.
License
TextMateLib is distributed under the MIT License.
Further Reading
- Architecture & Design: See CLAUDE.md in the project root
- API Reference: Browse the detailed API documentation
- Grammars: TextMate grammar format specification
- Themes: TextMate theme format documentation
- Examples: See examples/ directory in the project