From 60% to 93.3% Accuracy: Building ML-Powered Syntax Highlighting with Shiki

When your regex patterns think everything is JavaScript

The Migration Challenge

Last month, I decided to upgrade my blog’s syntax highlighting from Prism.js to Shiki. Why? Because Shiki uses the same engine as VS Code, providing that beautiful, familiar syntax highlighting we all love. But there was a catch: my blog had hundreds of code blocks without language specifiers.

function detectLanguage(code) {
  // This could be JavaScript... or Python... or Ruby?
  return 'javascript'; // When in doubt, it's probably JS đŸ€·
}

Sound familiar? If you’ve ever migrated between syntax highlighters, you know this pain. The easy solution would be to manually add language specifiers to every code block. But where’s the fun in that?

Starting with Pattern Matching (60% Accuracy)

My first attempt was a classic pattern-based detector. You know the drill: look for import statements, check for semicolons, scan for language-specific keywords. Here’s what that looked like:

function detectLanguage(code) {
  // JavaScript?
  if (code.includes('import ') || code.includes('const ')) {
    return 'javascript';
  }
  
  // Python?
  if (code.includes('def ') || code.includes('import ')) {
    return 'python';
  }
  
  // When all else fails...
  return 'text';
}

The results were not great:

=== Pattern Detection Results ===
Total tests: 30
Passed: 18 (60.0%)
Failed: 12 (40.0%)

Failed Languages:
- Python (0/2) - Misdetected as JavaScript
- Ruby (0/1) - Misdetected as JavaScript  
- Java (0/1) - Misdetected as JavaScript
- Go (0/1) - Misdetected as TypeScript
- Rust (0/1) - Misdetected as Python

Apparently, my detector thought everything was JavaScript. It’s like that hammer-nail situation, but for programming languages.

The Breakthrough: VS Code’s ML Models

Then I discovered Microsoft’s @vscode/vscode-languagedetection package. This isn’t just another pattern matcher – it’s the same ML model that VS Code uses for language detection, trained on millions of code samples.
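The model returns a ranked list of candidates rather than a single answer, so you still need a small amount of glue code to pick a winner. Here's a sketch of that glue (the ModelOperations usage in the comment reflects the package's documented shape, but check your installed version; the confidence threshold is my own choice):

```javascript
// Typical usage of the package (shown as a comment so this sketch
// stays self-contained):
//
//   const { ModelOperations } = require('@vscode/vscode-languagedetection');
//   const model = new ModelOperations();
//   const results = await model.runModel(code);
//   // results: [{ languageId: 'python', confidence: 0.82 }, ...]
//
// A helper to accept the top candidate only when the model is
// reasonably confident, falling back to plain text otherwise:
function pickLanguage(results, minConfidence = 0.2) {
  if (!results || results.length === 0) return 'text';
  const top = results[0];
  return top.confidence >= minConfidence ? top.languageId : 'text';
}
```

Falling back to 'text' on low confidence matters: a wrong highlight is more distracting than no highlight.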

The transformation was dramatic:

=== ML Detection Results ===
Total tests: 30
Passed: 28 (93.3%)
Failed: 2 (6.7%)

Perfect Detection (100%):
- Python, Ruby, Go, Rust, C/C++, Java
- Swift, Kotlin, Lua, R
- SQL, YAML, JSON, HTML/XML
- Bash, PowerShell, Dockerfile

Only 2 Failures:
- React JSX → TypeScript (understandable)
- SCSS → CSS (very similar)

The Synchronous vs Asynchronous Challenge

But here’s where things got interesting. Eleventy’s markdown processing is synchronous, while ML detection is asynchronous. It’s like trying to fit a square peg in a round hole.

// What Eleventy expects
function processMarkdown(content) {
  return processedContent; // Sync
}

// What ML detection provides
async function detectLanguage(code) {
  return await mlModel.detect(code); // Async
}

The solution? A two-stage approach:

  1. Development Mode: Use improved pattern detection (80-85% accuracy) for instant feedback
  2. Build Mode: Pre-process files with ML detection before Eleventy sees them
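The dispatch between the two stages can be sketched like this (the detector bodies are trivial stand-ins; the real pattern and ML detectors are shown in the sections below):

```javascript
// Stand-in for the synchronous pattern detector: instant feedback
// during `eleventy --serve`.
function detectWithPatterns(code) {
  return code.includes('def ') ? 'python' : 'javascript';
}

// Stand-in for the asynchronous ML detector: the real build script
// would call the VS Code model here.
async function detectWithML(code) {
  return detectWithPatterns(code);
}

// Dev server gets the fast path; the build script gets the accurate one.
function chooseDetector(mode) {
  return mode === 'build' ? detectWithML : detectWithPatterns;
}
```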

Building the Hybrid Solution

Stage 1: Improved Pattern Detection

First, I built a smarter pattern detector that checks languages in a specific order to avoid false positives:

function detectLanguage(code) {
  const firstLine = code.split('\n')[0].trim();
  
  // 1. Shebang detection (highest priority)
  if (firstLine.startsWith('#!')) {
    if (firstLine.includes('python')) return 'python';
    if (firstLine.includes('node')) return 'javascript';
    if (firstLine.includes('bash')) return 'bash';
  }
  
  // 2. Check Python BEFORE JavaScript
  if (code.match(/^def\s+\w+.*:/m) || 
      code.includes('self.') ||
      code.includes('__init__')) {
    return 'python';
  }
  
  // 3. Ruby (also before JS)
  if (code.match(/^class\s+\w+\s*</m) ||
      code.includes('puts ') ||
      code.includes('do |')) {
    return 'ruby';
  }
  
  // ... more patterns
  
  // Fallback when nothing matches
  return 'text';
}

This boosted accuracy to ~85% – good enough for development.

Stage 2: ML Pre-processing

For production builds, I created a pre-processor that runs before Eleventy:

async function preprocessMarkdown(content) {
  const codeBlockRegex = /^```(\w*)\n([\s\S]*?)^```/gm;
  const blocksToProcess = [];
  
  // Find unmarked code blocks
  let match;
  while ((match = codeBlockRegex.exec(content)) !== null) {
    if (!match[1]) { // No language specified
      blocksToProcess.push({
        fullMatch: match[0],
        code: match[2],
        index: match.index
      });
    }
  }
  
  // Detect languages in parallel
  const detectedLanguages = await Promise.all(
    blocksToProcess.map(block => 
      detectLanguage(block.code)
    )
  );
  
  // Replace blocks (in reverse to maintain indices)
  let processedContent = content;
  for (let i = blocksToProcess.length - 1; i >= 0; i--) {
    const block = blocksToProcess[i];
    const lang = detectedLanguages[i];
    const newBlock = `\`\`\`${lang}\n${block.code}\`\`\``;
    
    processedContent = 
      processedContent.substring(0, block.index) + 
      newBlock + 
      processedContent.substring(block.index + block.fullMatch.length);
  }
  
  return processedContent;
}

Integration with the Build Pipeline

The beauty of this approach is its seamless integration:

{
  "scripts": {
    "dev": "eleventy --serve",
    "build": "npm run ml-detect && eleventy",
    "build:no-ml": "eleventy",
    "ml-detect": "node scripts/apply-ml-detection.js"
  }
}

During development, you get instant feedback with pattern detection. For production builds, ML detection runs automatically, modifying your markdown files in-place before Eleventy processes them.

Real-World Performance

Here’s what this looks like in practice:

$ npm run build

đŸ€– Applying ML language detection...
Initializing ML model...
Model ready

Processing ./src/en/posts...
✓ 2024-07-26-git-worktree.md (142ms)
✓ 2024-08-15-docker-optimization.md (98ms)
✓ 2024-09-01-rust-async-patterns.md (156ms)

Processed 27 files in 3.2s
Average: 118ms per file

Lessons Learned

1. Pattern Matching Has Its Limits

No matter how clever your regex, it can’t compete with ML models trained on millions of examples. My pattern detector confused Python with JavaScript because both use import statements. The ML model understands context.

2. Hybrid Approaches Work

You don’t always need the “perfect” solution everywhere. Using patterns in development (where speed matters) and ML in production (where accuracy matters) gives you the best of both worlds.

3. Pre-processing > Runtime Processing

Instead of fighting with Eleventy’s synchronous nature, working around it with pre-processing was simpler and more maintainable.

4. Cache Everything

ML detection is expensive. Caching results based on code snippets reduced detection time by ~70% on subsequent runs.

Implementation Tips

If you’re building something similar, here are my recommendations:

  1. Start with the simplest approach: Get pattern detection working first
  2. Test with real data: My 30-language test suite caught issues I never would have imagined
  3. Make it idempotent: Running ML detection multiple times should be safe
  4. Add escape hatches: The build:no-ml command saved me during debugging
  5. Provide progress feedback: ML detection can be slow – let users know it’s working
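Idempotence falls out of the block-matching check in the preprocessor: fences that already carry a language tag never pass the `!match[1]` condition, so a second run changes nothing. A quick way to verify that property:

```javascript
// Returns true if the markdown contains no untagged code fences,
// i.e. running detection again would be a no-op.
function isFullyTagged(markdown) {
  const codeBlockRegex = /^```(\w*)\n[\s\S]*?^```/gm;
  let match;
  while ((match = codeBlockRegex.exec(markdown)) !== null) {
    if (!match[1]) return false; // found an untagged block
  }
  return true;
}
```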

The Code

The complete implementation is available in my blog’s repository. Here are the key files:

  • Pattern Detector: src/_utils/markdown-language-detector-improved.js
  • ML Preprocessor: src/_utils/markdown-preprocessor-ml.js
  • Build Script: scripts/apply-ml-detection.js
  • Shiki Config: .eleventy.js

What’s Next?

This implementation has been running in production for a month now, and the results speak for themselves. Code blocks are properly highlighted, the build process is smooth, and I haven’t had to manually specify a language in weeks.

Future improvements could include:

  • Streaming processing for large files
  • WebAssembly version for client-side detection
  • Custom training for domain-specific languages
  • Integration with git hooks for automatic detection on commit

Conclusion

Sometimes the best solution isn’t the most elegant one – it’s the one that works. By combining pattern matching for development speed with ML detection for production accuracy, I achieved a 93.3% detection rate while keeping my development workflow fast and responsive.

The next time you’re faced with a similar challenge, remember: you don’t have to choose between speed and accuracy. Sometimes, you can have both.


Have you dealt with automatic language detection in your projects? I’d love to hear about your approach! Drop me a line on Twitter or check out the full implementation on GitHub.

Related Reading: