Syntax Highlighting

Probably the most important feature for any text editor is syntax highlighting.

AvalonEdit has a flexible text rendering model (see Text Rendering). Among the text rendering extension points is the support for "visual line transformers" that can change the display of a visual line after it has been constructed by the "visual element generators". A useful base class implementing IVisualLineTransformer for the purpose of syntax highlighting is DocumentColorizingTransformer. Take a look at that class's documentation to see how to write fully custom syntax highlighters. This article only discusses the XML-driven built-in highlighting engine.

The highlighting engine

The highlighting engine in AvalonEdit is implemented in the class DocumentHighlighter. Highlighting is the process of taking a DocumentLine and constructing a HighlightedLine instance for it by assigning colors to different sections of the line. A HighlightedLine is simply a list of (possibly nested) highlighted text sections.

The HighlightingColorizer class is the only link between highlighting and rendering. It uses a DocumentHighlighter to implement a line transformer that applies the highlighting to the visual lines in the rendering process.

Apart from this single link, syntax highlighting is independent of the rendering namespace. To help with other potential uses of the highlighting engine, the HighlightedLine class has the method ToHtml() to produce syntax-highlighted HTML source code.

The highlighting rules used by the highlighting engine to highlight the document are described by the following classes:

HighlightingRuleSet: Describes a set of highlighting spans and rules.
HighlightingSpan: A span consists of two regular expressions (Start and End), a color, and a child ruleset. The region between Start and End expressions will be assigned the given color, and inside that span, the rules of the child ruleset apply. If the child ruleset also has HighlightingSpans, they can be nested, allowing highlighting constructs like nested comments or one language embedded in another.
HighlightingRule: A highlighting rule is a regular expression with a color. It will highlight matches of the regular expression using that color.
HighlightingColor: A highlighting color isn't just a color: it consists of a foreground color, font weight and font style.

The highlighting engine works by first analyzing the spans: whenever a begin RegEx matches some text, that span is pushed onto a stack. Whenever the end RegEx of the current span matches some text, the span is popped from the stack.

Each span has a nested rule set associated with it, which is empty by default. This is why keywords won't be highlighted inside comments: the span's empty ruleset is active there, so the keyword rule is not applied.

Nested rulesets are also used in the string span: the nested span will match when a backslash is encountered, and the character following the backslash will be consumed by the end RegEx of the nested span (. matches any character). This ensures that \" does not denote the end of the string span, while \\" still does.
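
The push and pop behaviour described above can be sketched in a few lines of C#. This is a simplified model, not AvalonEdit's implementation: the SketchSpan and SpanStackSketch types are hypothetical, only spans are modelled (no rules or colors), and every pattern is assumed to match at least one character.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical, simplified stand-in for a highlighting span (not AvalonEdit's class).
sealed class SketchSpan
{
    public string Name;
    public Regex Start;
    public Regex End;
    // Spans of the nested ruleset; empty by default.
    public List<SketchSpan> Nested = new List<SketchSpan>();
}

static class SpanStackSketch
{
    // Scans the text and returns a log of push/pop events with their offsets.
    // Assumption: every pattern matches at least one character.
    public static List<string> Scan(string text, List<SketchSpan> mainSpans)
    {
        var log = new List<string>();
        var stack = new Stack<SketchSpan>();
        int pos = 0;
        while (pos < text.Length)
        {
            SketchSpan current = stack.Count > 0 ? stack.Peek() : null;
            // Inside a span only its nested ruleset is active.
            List<SketchSpan> active = current != null ? current.Nested : mainSpans;

            Match best = null;
            SketchSpan toPush = null; // stays null if the earliest match ends the current span
            if (current != null)
            {
                Match end = current.End.Match(text, pos);
                if (end.Success) best = end;
            }
            foreach (SketchSpan span in active)
            {
                Match start = span.Start.Match(text, pos);
                if (start.Success && (best == null || start.Index < best.Index))
                {
                    best = start;
                    toPush = span;
                }
            }
            if (best == null) break; // nothing else starts or ends in this text

            if (toPush != null)
            {
                stack.Push(toPush);
                log.Add("push " + toPush.Name + "@" + best.Index);
            }
            else
            {
                log.Add("pop " + stack.Pop().Name + "@" + best.Index);
            }
            pos = best.Index + best.Length;
        }
        return log;
    }

    // A string span whose nested ruleset contains the escape-sequence span.
    public static List<SketchSpan> CreateStringSpans()
    {
        var escape = new SketchSpan { Name = "Escape", Start = new Regex(@"\\"), End = new Regex(".") };
        var str = new SketchSpan { Name = "String", Start = new Regex("\""), End = new Regex("\"") };
        str.Nested.Add(escape);
        return new List<SketchSpan> { str };
    }
}

static class SpanStackDemo
{
    static void Main()
    {
        // Input text is "a\"b" including the quotes: the escaped quote does not end the string.
        foreach (string e in SpanStackSketch.Scan("\"a\\\"b\"", SpanStackSketch.CreateStringSpans()))
            Console.WriteLine(e); // push String@0, push Escape@2, pop Escape@3, pop String@5
    }
}
```

With the input "a\\" (quotes included), the escape span consumes the second backslash, so the following quote closes the string at offset 4.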

What's great about the highlighting engine is that it highlights only on-demand, works incrementally, and yet usually requires only a few KB of memory even for large code files.

On-demand means that when a document is opened, only the lines initially visible will be highlighted. When the user scrolls down, highlighting will continue from the point where it stopped the last time. If the user scrolls quickly, so that the first visible line is far below the last highlighted line, then the highlighting engine still has to process all the lines in between – there might be comment starts in them. However, it will only scan that region for changes in the span stack; highlighting rules will not be tested.

The stack of active spans is stored at the beginning of every line. If the user scrolls back up, the lines getting into view can be highlighted immediately because the necessary context (the span stack) is still available.

Incrementally means that even if the document is changed, the stored span stacks will be reused as far as possible. If the user types /*, that would theoretically cause the whole remainder of the file to become highlighted in the comment color. However, because the engine works on-demand, it will only update the span stacks within the currently visible region and keep a notice 'the highlighting state is not consistent between line X and line X+1', where X is the last line in the visible region. If the user now scrolled down, the highlighting state would be updated and the 'not consistent' notice would be moved down. But usually, the user will continue typing and type */ only a few lines later. The highlighting state in the visible region then reverts to the normal 'only the main ruleset is on the stack of active spans'. When the user later scrolls down below the line with the 'not consistent' marker, the engine will notice that the old stack and the new stack are identical and will remove the marker. This allows reusing the stored span stacks cached from before the user typed /*.
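
The following sketch illustrates the idea under strong simplifications. It is not AvalonEdit's code: the IncrementalSketch type is hypothetical, the span stack is reduced to a single "inside a /* */ comment" flag, strings and // comments are ignored, and line numbers are zero-based indexes.

```csharp
using System;
using System.Collections.Generic;

static class IncrementalSketch
{
    // Returns the comment state at the end of a line, given the state at its start.
    public static bool ScanLine(string line, bool inComment)
    {
        int pos = 0;
        while (pos < line.Length)
        {
            string token = inComment ? "*/" : "/*";
            int idx = line.IndexOf(token, pos, StringComparison.Ordinal);
            if (idx < 0) break;
            inComment = !inComment;
            pos = idx + 2;
        }
        return inComment;
    }

    // startStates[i] caches the state at the start of line i.
    // Re-scans from the edited line and returns the index X of the line after which
    // the cached states are not consistent, or null if everything is consistent again.
    public static int? Update(IList<string> lines, bool[] startStates, int firstDirty, int lastVisible)
    {
        bool state = startStates[firstDirty];
        for (int i = firstDirty; i < lines.Count - 1; i++)
        {
            state = ScanLine(lines[i], state);
            if (startStates[i + 1] == state)
                return null; // new state equals the cached one: the cache below is still valid
            if (i >= lastVisible)
                return i;    // stop at the visible region and remember the inconsistency
            startStates[i + 1] = state;
        }
        return null;
    }
}

static class IncrementalDemo
{
    static void Main()
    {
        var lines = new List<string> { "int a;", "int b;", "int c;", "int d;", "int e;", "int f;" };
        var states = new bool[lines.Count]; // initially no line starts inside a comment

        lines[1] = "int b; /*"; // the user opens a comment; lines 0..3 are visible
        int? marker = IncrementalSketch.Update(lines, states, 1, 3);
        Console.WriteLine(marker.HasValue ? "not consistent after line " + marker.Value : "consistent");

        lines[2] = "int c; */"; // the user closes the comment one line later
        marker = IncrementalSketch.Update(lines, states, 2, 3);
        Console.WriteLine(marker.HasValue ? "not consistent after line " + marker.Value : "consistent");
    }
}
```

In this run the first update stops at the end of the visible region and reports the marker after line 3; the second update finds that the new state matches the cached one and reports a consistent cache, so the states stored for the lines below never had to be recomputed.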

While the stack of active spans might change frequently inside the lines, it rarely changes from the beginning of one line to the beginning of the next line. With most languages, such changes happen only at the start and end of multiline comments. The highlighting engine exploits this property by storing the list of span stacks in a special data structure (CompressingTreeList<T>). The memory usage of the highlighting engine is linear in the number of span stack changes, not in the total number of lines. This allows the highlighting engine to store the span stacks for big code files using only a tiny amount of memory, especially in languages like C# where sequences of // or /// are more popular than /* */ comments.
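
To make the memory argument concrete, here is a minimal run-length sketch. It is only an illustration of the compression idea: RunLengthLineStates is a hypothetical append-only list, not the tree-based CompressingTreeList<T>, and each span stack is represented by a plain string.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical append-only store: one entry per change of state, not per line.
sealed class RunLengthLineStates
{
    // Each run holds its first line index and the state shared by all lines up to the next run.
    private readonly List<KeyValuePair<int, string>> runs = new List<KeyValuePair<int, string>>();

    public int LineCount { get; private set; }
    public int RunCount { get { return runs.Count; } }

    public void Append(string state)
    {
        // Start a new run only if the state differs from the previous line's state.
        if (runs.Count == 0 || runs[runs.Count - 1].Value != state)
            runs.Add(new KeyValuePair<int, string>(LineCount, state));
        LineCount++;
    }

    public string Get(int line)
    {
        if (line < 0 || line >= LineCount) throw new ArgumentOutOfRangeException("line");
        // Binary search for the last run that starts at or before the requested line.
        int lo = 0, hi = runs.Count - 1;
        while (lo < hi)
        {
            int mid = (lo + hi + 1) / 2;
            if (runs[mid].Key <= line) lo = mid; else hi = mid - 1;
        }
        return runs[lo].Value;
    }
}

static class RunLengthDemo
{
    static void Main()
    {
        var states = new RunLengthLineStates();
        for (int i = 0; i < 5000; i++) states.Append("");        // only the main ruleset is active
        for (int i = 0; i < 10; i++) states.Append("Comment");   // lines starting inside a comment
        for (int i = 0; i < 4990; i++) states.Append("");

        Console.WriteLine(states.LineCount); // 10000
        Console.WriteLine(states.RunCount);  // 3
        Console.WriteLine(states.Get(5005)); // Comment
    }
}
```

Ten thousand lines with a single ten-line multiline comment need three stored entries here. Supporting insertions and removals in the middle of the document is what calls for a tree structure rather than this flat list.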

XML highlighting definitions

AvalonEdit supports XML syntax highlighting definitions (.xshd files).

In the AvalonEdit source code, you can find the file ICSharpCode.AvalonEdit\Highlighting\Resources\ModeV2.xsd. This is an XML schema for the .xshd file format; you can use it to get code completion for .xshd files in XML editors.

Here is an example highlighting definition for a sub-set of C#:

<SyntaxDefinition name="C#"
        xmlns="http://icsharpcode.net/sharpdevelop/syntaxdefinition/2008">
    <Color name="Comment" foreground="Green" />
    <Color name="String" foreground="Blue" />

    <!-- This is the main ruleset. -->
    <RuleSet>
        <Span color="Comment" begin="//" />
        <Span color="Comment" multiline="true" begin="/\*" end="\*/" />

        <Span color="String">
            <Begin>"</Begin>
            <End>"</End>
            <RuleSet>
                <!-- nested span for escape sequences -->
                <Span begin="\\" end="." />
            </RuleSet>
        </Span>

        <Keywords fontWeight="bold" foreground="Blue">
            <Word>if</Word>
            <Word>else</Word>
            <!-- ... -->
        </Keywords>

        <!-- Digits -->
        <Rule foreground="DarkBlue">
            \b0[xX][0-9a-fA-F]+  # hex number
        |    \b
            (    \d+(\.[0-9]+)?   #number with optional floating point
            |    \.[0-9]+         #or just starting with floating point
            )
            ([eE][+-]?[0-9]+)? # optional exponent
        </Rule>
    </RuleSet>
</SyntaxDefinition>
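
The number rule above spreads its regular expression over several lines and annotates it with # comments. The sketch below tries the same pattern in plain .NET; it assumes the pattern is compiled with RegexOptions.IgnorePatternWhitespace, which is what makes the layout and comments legal, and NumberRuleSketch is a hypothetical helper that does not involve AvalonEdit at all.

```csharp
using System;
using System.Text.RegularExpressions;

static class NumberRuleSketch
{
    // Same pattern as in the <Rule> element; whitespace and # comments are ignored
    // because of RegexOptions.IgnorePatternWhitespace (an assumption about how the rule is compiled).
    public static readonly Regex Number = new Regex(@"
            \b0[xX][0-9a-fA-F]+  # hex number
        |    \b
            (    \d+(\.[0-9]+)?   #number with optional floating point
            |    \.[0-9]+         #or just starting with floating point
            )
            ([eE][+-]?[0-9]+)? # optional exponent
        ", RegexOptions.IgnorePatternWhitespace);

    // Returns the first number found in the text, or null if there is none.
    public static string FirstMatch(string text)
    {
        Match m = Number.Match(text);
        return m.Success ? m.Value : null;
    }
}

static class NumberRuleDemo
{
    static void Main()
    {
        Console.WriteLine(NumberRuleSketch.FirstMatch("x = 0x1F;"));    // 0x1F
        Console.WriteLine(NumberRuleSketch.FirstMatch("y = 3.14e-2;")); // 3.14e-2
        Console.WriteLine(NumberRuleSketch.FirstMatch("no digits") == null); // True
    }
}
```

The hex alternative is listed first, so 0x1F is matched as a whole rather than as the decimal number 0.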

ICSharpCode.TextEditor XML highlighting definitions

ICSharpCode.TextEditor (the predecessor of AvalonEdit) used a different version of the XSHD file format. AvalonEdit detects the difference between the formats using the XML namespace: The new format uses xmlns="http://icsharpcode.net/sharpdevelop/syntaxdefinition/2008", the old format does not use any XML namespace.

AvalonEdit can load .xshd files written in that old format, and even automatically convert them to the new format. However, not all constructs of the old file format are supported by AvalonEdit.

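// Convert from the old .xshd format to the new format.
// Requires: using System.Xml; using ICSharpCode.AvalonEdit.Highlighting.Xshd;
XshdSyntaxDefinition xshd;
using (XmlTextReader reader = new XmlTextReader("input.xshd")) {
    // read the definition into the in-memory XSHD object model
    xshd = HighlightingLoader.LoadXshd(reader);
}
using (XmlTextWriter writer = new XmlTextWriter("output.xshd", System.Text.Encoding.UTF8)) {
    writer.Formatting = Formatting.Indented;
    // write the object model back out as an .xshd file
    new SaveXshdVisitor(writer).WriteDefinition(xshd);
}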

Programmatically accessing highlighting information

As described above, the highlighting engine only stores the "span stack" at the start of each line. This information can be retrieved using the GetSpanStack(Int32) method:

bool isInComment = documentHighlighter.GetSpanStack(1).Any(
    s => s.SpanColor != null && s.SpanColor.Name == "Comment");
// returns true if the end of line 1 (=start of line 2) is inside a multiline comment

Spans can be identified using their color. For this purpose, named colors should be used in the syntax definition.

For more detailed results inside lines, the highlighting algorithm must be executed for that line:

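// Requires: using System.Linq;
int off = document.GetOffset(7, 22); // offset of line 7, column 22
HighlightedLine result = documentHighlighter.HighlightLine(document.GetLineByNumber(7));
// true if a highlighted section covering that offset uses the color named "Comment"
bool isInComment = result.Sections.Any(
    s => s.Offset <= off && s.Offset + s.Length >= off
         && s.Color != null && s.Color.Name == "Comment");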