TextMateLib 1.0
Modern C++ implementation of the TextMate syntax highlighting engine
TextMateLib - Modern C++ Syntax Highlighting Engine

TextMateLib (tml)

Modern C++ implementation of the TextMate syntax highlighting engine for cross-platform syntax highlighting with high performance and minimal dependencies.

Overview

TextMateLib provides grammar-based syntax highlighting using TextMate-format grammars and themes. It tokenizes source code line-by-line, returning tokens with scope information that can be styled using a theme.

Key Features:

  • Grammar-based tokenization - TextMate format grammars for any language
  • Stateful incremental parsing - Efficient line-by-line processing with caching
  • Theme support - Apply colors and styles to tokens via scopes
  • Multiple language support - 30+ built-in language grammars
  • High performance - Optimized for editor integration
  • No external dependencies - Oniguruma (regex) and RapidJSON (JSON) are bundled
  • Cross-platform - Linux, macOS, Windows, WebAssembly
  • Language bindings - C, C#/.NET, JavaScript/WASM, Python (planned)

Quick Start

Basic C++ Usage

#include <cstdint>
#include <cstdio>
#include <memory>
#include "tml.h"
using namespace tml;
// 1. Create regex engine and registry
auto onigLib = std::make_shared<DefaultOnigLib>();
auto registry = std::make_shared<Registry>(onigLib);
// 2. Load a grammar
registry->addGrammarFromFile("path/to/javascript.json");
auto grammar = registry->loadGrammar("source.javascript");
// 3. Load a theme
auto theme = Theme::createFromPath("path/to/dark-plus.json");
// 4. Tokenize a line, starting from the initial state
auto prevState = StateStack::INITIAL;
auto result = grammar->tokenizeLine("const x = 42;", prevState);
// 5. Get colors for tokens
for (const auto& token : result->tokens) {
    auto scopePath = token->scopePath();
    uint32_t color = theme->match(scopePath);
    printf("Token '%s' -> color: #%08x\n", scopePath.c_str(), color);
}

Basic C API Usage

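#include "c_api.h"
// 1. Create the regex engine and registry
TextMateOnigLib onigLib = textmate_oniglib_create();
TextMateRegistry registry = textmate_registry_create(onigLib);
// 2. Load grammar and theme
textmate_registry_add_grammar_from_file(registry, "javascript.json");
TextMateGrammar grammar = textmate_registry_load_grammar(registry, "source.javascript");
TextMateTheme theme = textmate_theme_load_from_file("dark-plus.json");
// 3. Tokenize a line, starting from the initial state
TextMateStateStack state = textmate_get_initial_state();
TextMateTokenizeResult* result = textmate_tokenize_line(grammar, "const x = 42;", state);
// 4. Get colors for tokens
for (int32_t i = 0; i < result->tokenCount; i++) {
    TextMateToken token = result->tokens[i];
    char* scope = token.scopes[0];
    uint32_t color = textmate_theme_get_foreground(theme, scope, 0xFFFFFFFF);
}
// 5. Cleanup
textmate_free_tokenize_result(result);
textmate_theme_dispose(theme);
textmate_registry_dispose(registry);
textmate_oniglib_dispose(onigLib);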

Symbols used in the C example above (declared in c_api.h, the C language API for TextMateLib):

Handle types (each is a void *):

  • TextMateGrammar - Handle to a grammar definition for a specific language
  • TextMateTheme - Handle to a theme object containing color schemes
  • TextMateRegistry - Handle to the grammar registry managing loaded grammars and themes
  • TextMateStateStack - Handle to a parsing state stack (immutable, used for incremental tokenization)
  • TextMateOnigLib - Handle to the Oniguruma regex library instance

Functions:

  • TextMateOnigLib textmate_oniglib_create() - Initialize the Oniguruma regular expression library
  • void textmate_oniglib_dispose(TextMateOnigLib onigLib) - Free the Oniguruma library
  • TextMateRegistry textmate_registry_create(TextMateOnigLib onigLib) - Create a new grammar registry
  • void textmate_registry_dispose(TextMateRegistry registry) - Free a registry and all its resources
  • int textmate_registry_add_grammar_from_file(TextMateRegistry registry, const char *grammarPath) - Register a grammar from a JSON file
  • TextMateGrammar textmate_registry_load_grammar(TextMateRegistry registry, const char *scopeName) - Load a grammar by scope name
  • TextMateTheme textmate_theme_load_from_file(const char *themePath) - Load a theme from a JSON file
  • uint32_t textmate_theme_get_foreground(TextMateTheme theme, const char *scopePath, uint32_t defaultColor) - Get the foreground color for a scope path
  • void textmate_theme_dispose(TextMateTheme theme) - Free a theme object and release resources
  • TextMateStateStack textmate_get_initial_state() - Get the initial parsing state
  • TextMateTokenizeResult * textmate_tokenize_line(TextMateGrammar grammar, const char *lineText, TextMateStateStack prevState) - Tokenize a single line of text with decoded scopes
  • void textmate_free_tokenize_result(TextMateTokenizeResult *result) - Free a line tokenization result

Structures:

  • TextMateToken - Represents a single token in tokenized text
    • char ** scopes - Array of scope strings (e.g., "keyword.control", "string.quoted.double")
  • TextMateTokenizeResult - Result from tokenizing a single line with decoded tokens
    • TextMateToken * tokens - Array of tokens found in this line
    • int32_t tokenCount - Number of tokens in the array

Core Concepts

Grammar and Tokenization

A grammar defines syntax rules for a language. TextMateLib tokenizes text by:

  1. Matching text against grammar patterns
  2. Entering/exiting rules as patterns match/end
  3. Outputting tokens with scope hierarchies

Tokens represent text ranges with scope information:

  • Scope: source.js keyword.control (space-separated hierarchy)
  • Range: character positions in the line
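
To make the token shape concrete, here is a minimal sketch that models a token as a character range plus a scope list. The `SketchToken` struct and its field names are assumptions made for illustration; they are not TextMateLib's actual token type, whose layout this page does not show.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token shape for illustration only; not the library's IToken.
struct SketchToken {
    std::size_t startIndex;           // inclusive start offset in the line
    std::size_t endIndex;             // exclusive end offset in the line
    std::vector<std::string> scopes;  // outermost scope first
};

// Return the slice of the line that the token covers.
std::string tokenText(const std::string& line, const SketchToken& t) {
    assert(t.startIndex <= t.endIndex && t.endIndex <= line.size());
    return line.substr(t.startIndex, t.endIndex - t.startIndex);
}

// Join the scopes into the space-separated form, e.g. "source.js keyword.control".
std::string scopePath(const SketchToken& t) {
    std::string out;
    for (const auto& s : t.scopes) {
        if (!out.empty()) out += ' ';
        out += s;
    }
    return out;
}
```

Under these assumptions, a token covering offsets 0 to 5 of "const x = 42;" yields the text "const", and its scopes join into a single scope path string.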

StateStack - Incremental Parsing

The StateStack represents the parsing state at the end of a line:

  • Encodes which grammar rules are active
  • Enables resuming on the next line
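  • Two StateStacks for which equals() returns true represent the same parsing position

This enables incremental tokenization: a line's tokens are determined by its text and its initial state, so if neither has changed, its tokens have not changed either and re-tokenization can stop there (early stopping optimization).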
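
The early stopping idea can be sketched with a toy tokenizer. Everything here is an assumption for illustration: `ToyState` stands in for StateStack (it only tracks whether a line ends inside a `/* ... */` comment), and `endStateOf`, `tokenizeAll`, and `retokenizeFrom` are hypothetical helpers, not TextMateLib functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy parsing state; a real StateStack encodes a full stack of active rules.
enum class ToyState { Code, InComment };

// Toy line tokenizer: returns the state at the end of the line.
ToyState endStateOf(const std::string& line, ToyState start) {
    bool inComment = (start == ToyState::InComment);
    for (std::size_t i = 0; i + 1 < line.size(); ++i) {
        if (!inComment && line[i] == '/' && line[i + 1] == '*') {
            inComment = true;
            ++i;  // skip the second character of the delimiter
        } else if (inComment && line[i] == '*' && line[i + 1] == '/') {
            inComment = false;
            ++i;
        }
    }
    return inComment ? ToyState::InComment : ToyState::Code;
}

// Compute the cached end state of every line from scratch.
std::vector<ToyState> tokenizeAll(const std::vector<std::string>& lines) {
    std::vector<ToyState> ends;
    ToyState s = ToyState::Code;
    for (const auto& l : lines) {
        s = endStateOf(l, s);
        ends.push_back(s);
    }
    return ends;
}

// After lines[edited] changed, re-tokenize downward and stop as soon as a
// line's new end state equals its cached end state. Returns lines re-tokenized.
std::size_t retokenizeFrom(const std::vector<std::string>& lines,
                           std::vector<ToyState>& cachedEnds,
                           std::size_t edited) {
    assert(edited < lines.size() && cachedEnds.size() == lines.size());
    ToyState s = (edited == 0) ? ToyState::Code : cachedEnds[edited - 1];
    std::size_t count = 0;
    for (std::size_t i = edited; i < lines.size(); ++i) {
        s = endStateOf(lines[i], s);
        ++count;
        if (s == cachedEnds[i]) break;  // later lines see the same start state
        cachedEnds[i] = s;
    }
    return count;
}
```

In this sketch an edit that leaves the end state unchanged costs one line of work, while an edit that opens an unterminated comment propagates until the states converge again or the document ends.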

Themes - Styling Tokens

A theme maps scopes to colors and font styles:

  • Scope path matching: keyword.control matches source.js keyword.control
  • Returns: foreground color, background color, font style (italic/bold/underline)
  • Fallback: default theme colors when no match found
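
The matching rules above can be approximated in a few lines. This is a simplified sketch, not the library's matcher: `SketchRule`, `selectorMatches`, and `lookupForeground` are hypothetical names, and the sketch ignores parent selectors and any other specificity rules a full TextMate theme matcher applies.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical theme rule: one selector mapped to one foreground color.
struct SketchRule {
    std::string selector;
    std::uint32_t foreground;
};

// A selector matches a scope it equals or prefixes at a dot boundary:
// "keyword" matches "keyword.control" but not "keywordish".
bool selectorMatches(const std::string& selector, const std::string& scope) {
    if (scope.size() < selector.size()) return false;
    if (scope.compare(0, selector.size(), selector) != 0) return false;
    return scope.size() == selector.size() || scope[selector.size()] == '.';
}

// Walk the scope path from innermost to outermost scope and return the color
// of the longest matching selector; fall back to defaultColor when none match.
std::uint32_t lookupForeground(const std::vector<SketchRule>& rules,
                               const std::string& scopePath,
                               std::uint32_t defaultColor) {
    std::vector<std::string> scopes;
    std::istringstream in(scopePath);
    for (std::string s; in >> s;) scopes.push_back(s);

    for (auto it = scopes.rbegin(); it != scopes.rend(); ++it) {
        const SketchRule* best = nullptr;
        for (const auto& r : rules) {
            if (selectorMatches(r.selector, *it) &&
                (!best || r.selector.size() > best->selector.size())) {
                best = &r;
            }
        }
        if (best) return best->foreground;
    }
    return defaultColor;
}
```

With rules for `keyword` and `keyword.control`, the scope path `source.js keyword.control.flow.js` resolves to the more specific `keyword.control` rule, and a path with no matching scope returns the supplied default.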

API Overview

C++ API (tml.h)

High-level, type-safe C++ interface:

Component                            Purpose
Core Types and Interfaces            Grammar, Theme, Registry, StateStack, Token types
Grammar Processing                   Parse and compile grammar definitions
Tokenization and Tokens              Core tokenization logic and token representation
Session API                          Stateful incremental API with line caching (recommended)
Theme API                            Load themes and query colors for scopes
Syntax Highlighting Convenience API  Convenience wrapper combining grammar + theme
Constants and Initialization         Global constants (INITIAL state)

Example - Using Session API (recommended for editors):

#include "tml.h"
using namespace tml;
// Create session for a document
auto session = SessionImpl::create(grammar, /*line count*/ 100);
// Set initial lines
std::vector<std::string> lines = {"const x = 1;", "const y = 2;", "..."};
session->setLines(lines);
// Query tokens for a line
auto tokens = session->getLine(0)->tokens;
// Edit: replace lines 1-2 with new content
session->edit(1, 2, {"const a = 1;", "const b = 2;"});
// Incremental tokenization happens automatically
// Only affected lines are re-parsed (early stopping optimization)

C API (c_api.h)

FFI interface for language bindings (C#, Python, Node.js, etc.):

Group                        Functions
Opaque Handle Types          Handle types (TextMateRegistry, TextMateGrammar, etc.)
Token and Result Structures  Token and result structures (marshalling-friendly)
Theme API                    Load and query theme colors
Registry and Grammar API     Create registries, register/load grammars
Tokenization API             Tokenize single/multiple lines, manage state

Key differences from C++:

  • Opaque handles instead of C++ objects
  • Explicit memory management (release every handle and result with its matching textmate_*_dispose or textmate_free_* function)
  • Stateless tokenization (manage state yourself)
  • Batch operations for reducing FFI overhead
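
When the C API is consumed from C++, explicit cleanup of opaque handles can be delegated to a smart pointer with a custom deleter. The sketch below shows the pattern only: `DemoHandle`, `demo_handle_create`, and `demo_handle_dispose` are hypothetical stand-ins defined here so the example compiles without the library, and they are not part of c_api.h.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for a C-style create/dispose pair over a void* handle.
using DemoHandle = void*;
static int g_liveHandles = 0;  // counts handles that have not been disposed

DemoHandle demo_handle_create() {
    ++g_liveHandles;
    return new int(0);
}

void demo_handle_dispose(DemoHandle h) {
    if (h) {
        delete static_cast<int*>(h);
        --g_liveHandles;
    }
}

// Deleter that forwards to the C-style dispose function.
struct DemoDisposer {
    void operator()(void* h) const { demo_handle_dispose(h); }
};

// Owning wrapper: the handle is disposed when the wrapper goes out of scope.
using ScopedHandle = std::unique_ptr<void, DemoDisposer>;

ScopedHandle makeScopedHandle() {
    return ScopedHandle(demo_handle_create());
}
```

With the real API, the deleter would presumably call the matching dispose function (for example textmate_registry_dispose for a registry), and `get()` would supply the raw handle to other calls. Destruction order between dependent handles, such as a registry and the Oniguruma instance it was created from, would still need to be considered.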

Common Workflows

1. Basic Syntax Highlighting

// Setup (once)
auto onigLib = std::make_shared<DefaultOnigLib>();
auto registry = std::make_shared<Registry>(onigLib);
registry->addGrammarFromFile("javascript.json");
auto grammar = registry->loadGrammar("source.javascript");
auto theme = Theme::createFromPath("dark-plus.json");
// Highlight a single line
auto state = StateStack::INITIAL;
auto result = grammar->tokenizeLine("const x = 42;", state);
// Use result->tokens with theme->match() for colors
// Save result->ruleStack for next line

2. Multi-Line Document (Session API)

// Setup
auto session = SessionImpl::create(grammar, lines.size());
session->setLines(lines); // Initialize with all lines
session->setTheme(theme); // Optional: for HighlightedLine
// Get highlighted tokens for line 0
auto highlightedLine = session->getLine(0);
for (const auto& token : highlightedLine->tokens) {
    // token.foreground, token.background, token.fontStyle
}
// Edit: replace lines 5-7
session->edit(5, 7, {"new line 1", "new line 2"});
// Incremental tokenization: only lines 5-10 (approx) are re-parsed

3. Stateless Tokenization (C API or Low-Level)

// Tokenize multiple lines manually, managing state
TextMateStateStack state = textmate_get_initial_state();
TextMateTokenizeResult* lineResults[10];
for (int i = 0; i < 10; i++) {
    lineResults[i] = textmate_tokenize_line(grammar, lines[i], state);
    state = lineResults[i]->ruleStack; // Pass state to next line
}
// Free each result with textmate_free_tokenize_result() when done

Field used above: TextMateStateStack ruleStack (member of TextMateTokenizeResult) - State at end of line (pass to next line's tokenization)

4. Batch Tokenization (Reducing FFI Overhead)

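// Tokenize all 10 lines in one FFI call (C API)
TextMateTokenizeMultiLinesResult* batch = textmate_tokenize_lines(
    grammar, lines, 10, textmate_get_initial_state()
);
for (int i = 0; i < 10; i++) {
    // batch->lineResults[i] -> tokens for line i
}
textmate_free_tokenize_lines_result(batch);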

Symbols used above:

  • TextMateTokenizeMultiLinesResult * textmate_tokenize_lines(TextMateGrammar grammar, const char **lines, int32_t lineCount, TextMateStateStack initialState) - Tokenize multiple lines in a single call
  • void textmate_free_tokenize_lines_result(TextMateTokenizeMultiLinesResult *result) - Free a batch tokenization result
  • TextMateTokenizeMultiLinesResult - Result from batch tokenizing multiple lines

Performance Tips

1. Use Session API for Editors

The Session API (C++) handles incremental tokenization automatically:

  • Caches line tokens
  • Detects when state hasn't changed (early stopping)
  • Only re-parses affected lines

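Impact: Editing a line in a 10,000-line file typically re-tokenizes only about 10 lines rather than the whole file, because re-parsing stops once a line's end state matches the cached one.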

2. Reuse Grammar and Theme Objects

Create grammar and theme once, reuse for all tokenization:

// GOOD: Create once
auto grammar = registry->loadGrammar("source.javascript");
auto theme = Theme::createFromPath("dark-plus.json");
// Use for many documents
for (auto& document : documents) {
    auto session = SessionImpl::create(grammar, document->lines.size());
    // ...
}
// BAD: Create repeatedly
for (auto& line : lines) {
    auto grammar = registry->loadGrammar("source.javascript"); // Wasteful!
    // ...
}

3. Use Encoded Tokens (C API)

For performance-critical code, use textmate_tokenize_line2() instead of textmate_tokenize_line():

  • Returns compact 32-bit encoded tokens instead of scope arrays
  • Reduces memory overhead
  • Faster for large files
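
This page does not document the bit layout of the encoded tokens. As a rough illustration, the sketch below decodes a 32-bit metadata word under the assumption that the layout matches the one used by vscode-textmate, which the `tokenize_line2` naming suggests but does not confirm. `DecodedMetadata` and `decodeMetadata` are hypothetical names, and the foreground and background fields are indices into a color map rather than colors.

```cpp
#include <cassert>
#include <cstdint>

// Fields of one metadata word, ASSUMING the vscode-textmate layout:
//   bits 0-7   language id
//   bits 8-9   token type
//   bit  10    balanced-brackets flag (not decoded here)
//   bits 11-14 font style flags (1 italic, 2 bold, 4 underline, 8 strikethrough)
//   bits 15-23 foreground color-map index
//   bits 24-31 background color-map index
struct DecodedMetadata {
    std::uint32_t languageId;
    std::uint32_t tokenType;
    std::uint32_t fontStyle;
    std::uint32_t foregroundIndex;
    std::uint32_t backgroundIndex;
};

// Extract each field by shifting it down and masking to its width.
inline DecodedMetadata decodeMetadata(std::uint32_t m) {
    DecodedMetadata d;
    d.languageId = m & 0xFFu;
    d.tokenType = (m >> 8) & 0x3u;
    d.fontStyle = (m >> 11) & 0xFu;
    d.foregroundIndex = (m >> 15) & 0x1FFu;
    d.backgroundIndex = (m >> 24) & 0xFFu;
    return d;
}
```

If the library uses a different layout, only the shifts and masks change; the benefit described above comes from replacing per-token scope string arrays with fixed-width integers.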

4. Batch Operations (C API)

Use textmate_tokenize_lines() to tokenize multiple lines in one FFI call:

  • Reduces FFI call overhead (important for .NET, Python bindings)
  • Better for batch processing

5. Early Stopping

Session API automatically limits tokenization depth:

  • Prevents pathological regex cases (catastrophic backtracking)
  • Can be tuned via SessionImpl configuration

Architecture

Component Hierarchy

┌─────────────────────────────────────────┐
│  Public APIs (C++, C, WASM)             │
├─────────────────────────────────────────┤
│  Session API      │  Syntax Highlighter │
│  (Stateful)       │  (Convenience)      │
├─────────────────────────────────────────┤
│  Grammar (Tokenization Logic)           │
│  - Rule matching, state transitions     │
│  - Scope stack management               │
├─────────────────────────────────────────┤
│  Registry (Grammar/Theme Management)    │
│  - Grammar lookup and caching           │
│  - Theme application                    │
├─────────────────────────────────────────┤
│  Core Components                        │
│  - Regex engine (Oniguruma)             │
│  - JSON parser (RapidJSON)              │
│  - Scope/attribute providers            │
└─────────────────────────────────────────┘

Key Data Structures

Type        Purpose
Grammar     Compiled grammar with rule matching logic
StateStack  Immutable parsing state for incremental resumption
Theme       Scope-to-color-and-style mapping
IToken      Single token with scope hierarchy
Registry    Central manager for grammars and themes
Session     Stateful document with line caching

Platform Support

Platform     API
Linux        C++, C
macOS        C++, C
Windows      C++, C
WebAssembly  C++, C, JavaScript
.NET (C#)    C (via P/Invoke)
Node.js      WASM

Building and Installation

Native Build

./scripts/build.sh
# Output: build/lib/libtm.so, build/include/tml/

WebAssembly Build

./scripts/build-wasm-standard.sh
# Output: build/wasm-standard/browser/tml-standard.js + .wasm

C# / .NET Bindings

./scripts/build-shared.sh
cd tests/csharp/TextMateLib.Tests
dotnet test

See CLAUDE.md in the project root for detailed build instructions.


Examples

Load Multiple Grammars

auto registry = std::make_shared<Registry>(onigLib);
// Register grammars for different languages
registry->addGrammarFromFile("javascript.json");
registry->addGrammarFromFile("python.json");
registry->addGrammarFromFile("markdown.json");
// Load as needed
auto jsGrammar = registry->loadGrammar("source.javascript");
auto pyGrammar = registry->loadGrammar("source.python");
auto mdGrammar = registry->loadGrammar("text.markdown");

Grammar Injections (Embedded Grammars)

// Inject regex highlighting into JavaScript strings
const char* injections[] = {"source.regexp"};
registry->setInjections("source.js string.quoted", injections, 1);
// Now JavaScript strings will highlight regex patterns
auto grammar = registry->loadGrammar("source.javascript");
auto result = grammar->tokenizeLine("const pattern = /[a-z]+/;", state);
// Token scopes: source.js string.quoted source.regexp

Edit a Document

auto session = SessionImpl::create(grammar, 100);
session->setLines(original_lines);
// User edits lines 10-12
std::vector<std::string> new_lines = {"edited line 1", "edited line 2"};
session->edit(10, 12, new_lines);
// Incremental tokenization happens automatically
// Only affected lines (10-20 approx) are re-parsed
auto tokens = session->getLine(10)->tokens;

Troubleshooting

Tokenization Not Working

  1. Grammar not found: Ensure grammar was registered before loading
    registry->addGrammarFromFile("grammar.json"); // Must do this first
    auto grammar = registry->loadGrammar("source.mylang");
  2. Empty tokens: Check grammar has valid rules and patterns
  3. State issues: Always pass the previous line's ruleStack as prevState
    auto result1 = grammar->tokenizeLine(line1, INITIAL);
    auto result2 = grammar->tokenizeLine(line2, result1->ruleStack); // Pass state!

Theme Colors Not Applying

  1. Scope not matching: Theme only returns colors for matched scopes
    • Use theme->match(scopePath) to test scope matching
    • TextMate scope matching uses prefix rules: the selector keyword.control matches keyword.control and keyword.control.flow, but not keyword
  2. Default color: If scope not found, theme returns defaultColor
    // If "my.custom.scope" not in theme, returns defaultColor
    uint32_t color = theme->match("my.custom.scope", 0xFFFFFFFF);

Performance Issues

  1. Slow tokenization: Use Session API instead of manual state management
  2. High memory: Reduce SessionImpl line cache size or use stateless tokenization
  3. FFI overhead: Use batch operations (textmate_tokenize_lines) instead of per-line calls

Contributing

TextMateLib is open source! Contributions welcome:

  • Report issues: GitHub Issues
  • Submit PRs: Bug fixes, optimizations, language bindings
  • Add grammars: New language support via grammar files

See project README for contribution guidelines.


License

TextMateLib is distributed under the MIT License.


Further Reading

  • Architecture & Design: See CLAUDE.md in the project root
  • API Reference: Browse the detailed API documentation
  • Grammars: TextMate grammar format specification
  • Themes: TextMate theme format documentation
  • Examples: See examples/ directory in the project