Robert Tamayo | July 07, 2019

Super Simple Javascript and HTML Syntax Highlighter

Before last week, all the code in my posts was in one color. I built this blog myself, so instead of finding a third-party solution, I decided to build the syntax highlighter myself too.

My solution was very simple. The one constraint was that it had to handle both HTML and JavaScript, not just one or the other. Usually, different languages require different highlighting. But my older posts don't record which language a code block was written in, so a single highlighter has to cope with both. This is why the highlighting has some quirks in some cases.

Disclaimer: I am aware of several bugs. This post outlines the pattern I chose for syntax highlighting, but the highlighter is not complete and is still a work in progress. Here are the bugs I'm aware of:

Regex in javascript triggers a quote start but with no end
Html text still gets the javascript syntax highlighting for javascript keywords, such as "for"

Let's get into it.

1. Defining and Using the SyntaxHighlighter Module

var SyntaxHighlighter = (function() {
    // code here
    return {
        getFormattedHTML: function(element){
            return format(element);
        }
    }
}());

Since my blog was built before I started using webpack, I'm using the old-fashioned JavaScript module approach: an immediately invoked function expression (IIFE) that returns the public API. Using this module goes like this:

$(document).ready(function(){
    $('pre').each((index, el) => {
        let text = $(el).html();
        let formattedHTML = SyntaxHighlighter.getFormattedHTML(text);
        $(el).html(formattedHTML);
        $(el).addClass('formatted');
    });
});

I thought about how to best handle the formatting, and in the end I chose to step through the text character-by-character and analyze it. The other approach would have been to use regex to match certain patterns and modify the string directly. I decided the character-by-character approach was better because I wouldn't have to worry about so many large string replacement operations, and I would also be less likely to collide with my own formatted html as I parse the document.
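To make the collision concern concrete, here is a minimal sketch of a regex-replacement highlighter running into its own markup. The function naiveHighlight and its two passes are hypothetical, written only for this illustration, and are not code from the blog.

```javascript
// Hypothetical two-pass regex highlighter, used only to show the collision problem.
function naiveHighlight(source) {
    // Pass 1: wrap the keyword "var" in a span.
    let out = source.replace(/\bvar\b/g, '<span class="lang">var</span>');
    // Pass 2: wrap the keyword "class". This also matches the word "class"
    // in the span attribute that pass 1 just wrote.
    out = out.replace(/\bclass\b/g, '<span class="lang">class</span>');
    return out;
}

const result = naiveHighlight('var x = 1;');
console.log(result);
// <span <span class="lang">class</span>="lang">var</span> x = 1;
```

The second pass can't tell source text from markup that was already written, so the span from the first pass gets mangled. Stepping through the input once and only ever appending to a separate output string avoids that.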

Here's the character-by-character approach.

2. The Prerequisites

First we'll need to know all of the javascript reserved words.

    let reserved = [
        "abstract",
        "arguments",
        "await",
        "boolean",
        // and so on...

I also want named regex values so that I don't have to write them out every time I want to test:

let regex = {
    double: /"/,
    single: /'/,
    backtick: /`/,
    punctuation: /(\.|,|:|\{|\}|\[|\]|;|\+|=|-|\(|\)|\/)/,
    tags: /(<|>)/,
    reserved: /()/,
    whitespace: /\s/,
    nonwhitespace: /\S/,
    letter: /[A-Za-z]/,
    newline: /\n/,
    tab: /\t/,
    openTag: /</,
    closeTag: />/,
    slash: /\//,
    asterisk: /\*/,
};

3. The Plan

I want to process the text using some rules:

If we're inside a quote, we want to ignore all other rules until we are outside of the quote
If we're inside an html tag, we need to know what the tag name is so that we can highlight it
If we're inside a comment block, we want to ignore all other rules until outside it
If we're inside an inline comment, ignore every rule until we encounter a newline character
If we match a letter, we want to keep that value in a working variable.
If we match a non-letter, we want to test the current working variable against all of the javascript reserved words.
If we match punctuation, we want to wrap it in a punctuation span.

That's about all the basic rules.

4. Implementing the Rules

Start off the whole thing like so:

for (let i = 0; i < html.length; i++) {
        let char = html.charAt(i);
        // Numeric character reference, e.g. &#60; for "<"; the trailing ";" terminates it
        let outputChar = '&#' + char.charCodeAt(0) + ';';

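A "<" must be written to the output as an HTML entity, or the browser would treat it as the start of a tag. That's why outputChar holds the character's numeric entity, built from its charCode, instead of the character itself. Not every branch uses it, though: letters collected into the working variable are written as they are.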
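As a small standalone sketch of that encoding step, the helper below turns a single character into a decimal numeric character reference. The name toEntity is an assumption made for this example; the highlighter itself builds the value inline.

```javascript
// Hypothetical helper: encode one character as a decimal numeric character reference.
function toEntity(char) {
    // charCodeAt(0) returns the UTF-16 code unit; the ";" terminates the reference.
    return '&#' + char.charCodeAt(0) + ';';
}

console.log(toEntity('<')); // &#60;
console.log(toEntity('>')); // &#62;
console.log(toEntity('&')); // &#38;
```

Encoding every non-letter this way is blunt, but it means the output can never contain an accidental tag or entity from the source text.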

First up, we have the quotes check:

if (inQuotes) {
    if (quoteType.test(char)) {
        output += outputChar + '</span>';
        inQuotes = false;
    } else {
        if (regex.whitespace.test(char)) {
            writeWhiteSpace(char);
        } else {
            output += outputChar;
        }
    }

Notice that I'm referencing quoteType, which holds the regex for whichever quote character opened the current string: a single quote, a double quote, or a backtick. We set its value whenever we open a new quote.

Another thing to notice is the writeWhiteSpace(char) function. I noticed that in order to preserve whitespace in html, we need to use &nbsp; entities instead of normal spaces. Also, I use "<br>" tags to indicate newlines.

Next up, we have the comments check:

} else if (inComment) {
    if (commentType == 'inline') {
        if (regex.newline.test(char)) {
            inComment = false;
            commentType = '';
            output += '</span>';
            writeWhiteSpace(char);
        } else {
            if (regex.whitespace.test(char)) {
                writeWhiteSpace(char);
            } else {
                output += outputChar;
            }
        }
    } else {
        if (regex.asterisk.test(char)) {
            let nextIndex = i + 1;
            if (nextIndex < html.length) {
                let nextChar = html.charAt(nextIndex);
                if (regex.slash.test(nextChar)) {
                    inComment = false;
                    commentType = '';
                    output += '*/</span>';
                    i = nextIndex;
                    continue;
                }
            }
            // An asterisk that is not followed by a slash is ordinary comment text
            output += outputChar;
        } else {
            if (regex.whitespace.test(char)) {
                writeWhiteSpace(char);
            } else {
                output += outputChar;
            }
        }
    }

The only tricky thing here is block comments, which are closed by "*/". Instead of storing a boolean and checking whether the previous character was an asterisk, I simply decided to look ahead at the next character. If we matched an asterisk and the next character is a slash, the comment is closed.
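The lookahead can be sketched in isolation like this. The helper name closesBlockComment is hypothetical; in the highlighter the same check is written inline inside the loop.

```javascript
// Hypothetical helper: does a block comment close at index i of text?
function closesBlockComment(text, i) {
    const nextIndex = i + 1;
    return text.charAt(i) === '*'
        && nextIndex < text.length          // don't look past the end of the input
        && text.charAt(nextIndex) === '/';
}

const sample = '/* a * b */ x';
console.log(closesBlockComment(sample, 5)); // false: the asterisk is followed by a space
console.log(closesBlockComment(sample, 9)); // true: the asterisk is followed by a slash
```

When the check succeeds, the loop index has to skip the slash as well, which is what setting i to nextIndex before the continue does.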

Now we can get on to the character testing:

} else { // not in comments or quotes...
    if (regex.whitespace.test(char)) {
        workingTest();
        writeWhiteSpace(char);
    } else { // not a whitespace character

If the character is a whitespace character, we are done checking for javascript keywords. The workingTest() function takes care of writing to the output, and the writeWhiteSpace function appends the appropriate whitespace to the output.

Next up is letter matching.

if (regex.letter.test(char)) {
    working += char;
    continue;
// } else if ...

If the current character is a letter, we don't care about anything else for now. Just add it to the working variable and keep going. The following tests occur only if the character is not a letter.

Let's discuss slashes next:

} else if (regex.slash.test(char)) {
    // lookahead
    let nextIndex = i + 1;
    if (nextIndex < html.length) {
        let nextChar = html.charAt(nextIndex);
        if (regex.asterisk.test(nextChar)){
            inComment = true;
            commentType = 'block';
            output += '<span class="comm">/*';
            i = nextIndex;
            continue;
        } else if (regex.slash.test(nextChar)) {
            inComment = true;
            commentType = 'inline';
            output += '<span class="comm">//';
            i = nextIndex;
            continue;
        }
    }

The only difference between the two branches is the type of comment we encountered. In either case, we manually write the comment opener to the output, preceded by the opening span tag for the comment. When we encounter the end of the comment, we close the span tag.

If a slash is detected but the next character is not an asterisk or a slash, we just continue along as though it were normal punctuation. But first, here are a few other tests we do:

} else if (inTag && regex.closeTag.test(char)) {
    workingTest(true);
    inTag = false;
    // output += `<span class="punc">${outputChar}</span>`;
} else if (regex.openTag.test(char)) {
    workingTest(true);
    inTag = true;
    hasName = false;
    // output += `<span class="punc">${outputChar}</span>`;
} else {
    workingTest();
}

The tests above are mainly concerned with html tags. HTML would ideally be supported by its own syntax highlighter, but as I said earlier, I wasn't able to make something backwards compatible at this time. The main issue is that a JavaScript less-than operator is treated as an opening html tag, and that ruins the JS syntax highlighting for the rest of the block. A for loop is the typical case: the keywords that follow the "<" in its condition are no longer highlighted.
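One possible mitigation, which is not part of the highlighter described in this post, would be to look ahead before deciding that a "<" opens a tag. The sketch below is only a heuristic under that assumption, and looksLikeTagStart is a made-up name.

```javascript
// Hypothetical heuristic, not part of the original highlighter:
// treat "<" as a tag opener only when a letter, "/" or "!" follows it directly.
function looksLikeTagStart(text, i) {
    if (text.charAt(i) !== '<') return false;
    // charAt returns '' past the end of the string, which fails the test.
    return /[A-Za-z\/!]/.test(text.charAt(i + 1));
}

const loop = 'for (let i = 0; i < n; i++) {}';
console.log(looksLikeTagStart(loop, loop.indexOf('<'))); // false
console.log(looksLikeTagStart('<div>', 0));              // true
console.log(looksLikeTagStart('a<b', 1));                // true, a false positive
```

As the last line shows, a comparison written without spaces still looks like a tag, so this would reduce the problem rather than solve it.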

When things go well, however, this works just fine. The workingTest has a special test case in it for when we are in a tag and the name has not been set. In that case, it assumes the value of working is an html tag and highlights it.

Unless one of the continue statements above was hit, a non-letter character also goes through the following tests, in addition to the ones above.

if (regex.punctuation.test(char)) {
    output += `<span class="punc">${outputChar}</span>`;
} else if (regex.tags.test(char)) {
    output += `<span class="tags">${outputChar}</span>`;
} else if (regex.double.test(char)) {
    inQuotes = true;
    quoteType = regex.double;
    output += `<span class="quot">${outputChar}`;
} else if (regex.single.test(char)) {
    inQuotes = true;
    quoteType = regex.single;
    output += `<span class="quot">${outputChar}`;
} else if (regex.backtick.test(char)) {
    inQuotes = true;
    quoteType = regex.backtick;
    output += `<span class="quot">${outputChar}`;
} else {
    output += outputChar;
}

At this point, we've made it through all the tests. Before returning the output, we want to check if we are still inside a comment or a quote. If we are, it means there is an open span tag in our output without a closing one, so we just close it off. We can't be inside both a quote and a comment at once, so one test will work:

if (inQuotes || inComment) {
    output += '</span>';
}

Finally, let's look at some of our helper functions we met along the way. First up is writeWhiteSpace(char):

function writeWhiteSpace(char){
    if (regex.newline.test(char)) {
        output += '<br>';
    } else if (regex.tab.test(char)) {
        output += '&nbsp;&nbsp;&nbsp;&nbsp;';
    } else {
        output += '&nbsp;';
    }
}

Nothing crazy there. The next one is workingTest():

function workingTest(){
    if (!inTag && reserved.indexOf(working) != -1) {
        output += `<span class="lang">${working}</span>`;
    } else if (inTag && !hasName) {
        output += `<span class="name">${working}</span>`;
        hasName = true;
    } else {
        output += working;
    }
    working = ''; // reset value of working
}

You can see that inside of html tags, we are not concerned with JavaScript keywords. This only applies to inside the actual tag, though, and not the element. As I said earlier, ideally I would have an html syntax highlighter and a javascript one. For now, one of the side effects is that within the html text itself (not in the tag, the actual element's text), javascript keywords are highlighted. This is not the worst thing in the world for me, and I know exactly what causes it, so I'm not concerned about it.

If it's not a keyword, it's just text, and so we simply write it to the page.
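To see the three outcomes side by side, here is a sketch of the same decision written as a pure function that returns the markup instead of appending to output. The name formatWord, its parameters, and the shortened keyword list are assumptions made for this example.

```javascript
// Abbreviated keyword list for illustration only; the real list is much longer.
const reservedWords = new Set(['function', 'if', 'else', 'for', 'return', 'var', 'let']);

// Hypothetical pure version of the workingTest() decision.
function formatWord(word, inTag, hasName) {
    if (!inTag && reservedWords.has(word)) {
        return `<span class="lang">${word}</span>`; // JavaScript keyword outside a tag
    }
    if (inTag && !hasName) {
        return `<span class="name">${word}</span>`; // first word inside a tag: the tag name
    }
    return word;                                    // plain text
}

console.log(formatWord('for', false, false));        // <span class="lang">for</span>
console.log(formatWord('for', true, false));         // <span class="name">for</span>
console.log(formatWord('for', true, true));          // for
console.log(formatWord('testFormat', false, false)); // testFormat
```

The first call is also what happens to the word "for" in an element's text, since by then we are no longer inside the tag.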

5. A Walkthrough

So we've seen it all, but let's walk through a few examples to get the flow of the logic down.

Let's say we have a simple javascript function to format, like so:

function testFormat(message){
    if (message) {
        console.log(`Your message is: ${message}`);
    } else {
        console.log('You must supply a message for this function to work');
    }
}

Really quickly, we can see that the first match is "f". That's a letter, so it goes into the working variable. The next characters are all letters until the space, at which point the working variable has a value of "function". workingTest() is called, and the reserved keywords array is tested to see if the word "function" is found inside it. It is, and so the word "function" gets added to the output wrapped inside a "lang" span.

Whitespace gets added, and then letters are added to the working variable until "(" matches as punctuation and triggers a workingTest(). "testFormat" is not a keyword match, so it gets no special treatment. The parenthesis is added to the output and the script continues.

The last cool thing is the quote test. Once the backtick is found, every other rule is ignored until another backtick is found. In the next set of quotes, the word "function" is not matched as a keyword because no keyword tests take place inside quotes.
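The flow in this walkthrough can be reproduced with a much smaller sketch that keeps only two of the rules: keywords and quotes. Everything here (miniHighlight, flushWord, the three-word keyword list) is a simplified stand-in written for this example, and it leaves out entity encoding, whitespace handling, comments and tags.

```javascript
// Abbreviated keyword list, just enough for this walkthrough.
const keywords = ['function', 'if', 'else'];

function miniHighlight(source) {
    let output = '';
    let working = '';   // letters collected since the last non-letter
    let quoteChar = ''; // the quote character we are inside, or '' if none

    // Write out the collected word, wrapped in a span if it is a keyword.
    function flushWord() {
        output += keywords.includes(working)
            ? `<span class="lang">${working}</span>`
            : working;
        working = '';
    }

    for (const char of source) {
        if (quoteChar) {
            // Inside a quote every other rule is ignored.
            output += char;
            if (char === quoteChar) {
                output += '</span>';
                quoteChar = '';
            }
        } else if (/[A-Za-z]/.test(char)) {
            working += char;
        } else {
            flushWord();
            if (char === '"' || char === "'" || char === '`') {
                quoteChar = char;
                output += '<span class="quot">' + char;
            } else {
                output += char;
            }
        }
    }
    flushWord();
    if (quoteChar) output += '</span>'; // close a quote left open at the end
    return output;
}

console.log(miniHighlight("function say(){ log('function'); }"));
// <span class="lang">function</span> say(){ log(<span class="quot">'function'</span>); }
```

The first "function" is wrapped in a "lang" span, while the one inside the quotes is left alone, which matches the behavior described above.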

That's all there is to the syntax highlighting. For the full source, check it out on GitHub.
