From 60% to 93.3% Accuracy: Building ML-Powered Syntax Highlighting with Shiki
When your regex patterns think everything is JavaScript
- The Migration Challenge
- Starting with Pattern Matching (60% Accuracy)
- The Breakthrough: VS Code's ML Models
- The Synchronous vs Asynchronous Challenge
- Building the Hybrid Solution
- Integration with the Build Pipeline
- Real-World Performance
- Lessons Learned
- Implementation Tips
- The Code
- What's Next?
- Conclusion
The Migration Challenge
Last month, I decided to upgrade my blog's syntax highlighting from Prism.js to Shiki. Why? Because Shiki uses the same engine as VS Code, providing that beautiful, familiar syntax highlighting we all love. But there was a catch: my blog had hundreds of code blocks without language specifiers.
function detectLanguage(code) {
  // This could be JavaScript... or Python... or Ruby?
  return 'javascript'; // When in doubt, it's probably JS 🤷
}
Sound familiar? If you've ever migrated between syntax highlighters, you know this pain. The easy solution would be to manually add language specifiers to every code block. But where's the fun in that?
Starting with Pattern Matching (60% Accuracy)
My first attempt was a classic pattern-based detector. You know the drill: look for import statements, check for semicolons, scan for language-specific keywords. Here's what that looked like:
function detectLanguage(code) {
  // Python?
  if (code.includes('def ') || code.includes('import ')) {
    return 'python';
  }
  // JavaScript?
  if (code.includes('const ') || code.includes('function ')) {
    return 'javascript';
  }
  // When all else fails...
  return 'text';
}
The results were… not great:
=== Pattern Detection Results ===
Total tests: 30
Passed: 18 (60.0%)
Failed: 12 (40.0%)
Failed Languages:
- Python (0/2) - Misdetected as JavaScript
- Ruby (0/1) - Misdetected as JavaScript
- Java (0/1) - Misdetected as JavaScript
- Go (0/1) - Misdetected as TypeScript
- Rust (0/1) - Misdetected as Python
Apparently, my detector thought everything was JavaScript. It's like that hammer-nail situation, but for programming languages.
The Breakthrough: VS Code's ML Models
Then I discovered Microsoft's @vscode/vscode-languagedetection package. This isn't just another pattern matcher: it's the same ML model that VS Code uses for language detection, trained on millions of code samples.
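As I understand the package (names and result shape are from memory, so treat them as assumptions), it exposes a ModelOperations class whose runModel(code) resolves to ranked candidates. Here's a hedged sketch of wiring it up, plus a small pure helper for turning those candidates into a single fence language:

```javascript
// Assumed API, commented out because it requires the package at runtime:
//
//   const { ModelOperations } = require('@vscode/vscode-languagedetection');
//   const model = new ModelOperations();
//   const candidates = await model.runModel(code);
//   // candidates ≈ [{ languageId: 'py', confidence: 0.92 }, ...]

// Pure helper: pick the top candidate, falling back to 'text' when the
// model is unsure. The threshold value is my choice, not the package's.
function pickLanguage(candidates, minConfidence = 0.2) {
  if (!candidates || candidates.length === 0) return 'text';
  const best = [...candidates].sort((a, b) => b.confidence - a.confidence)[0];
  return best.confidence >= minConfidence ? best.languageId : 'text';
}
```

Keeping the threshold logic separate from the model call makes it easy to tune the fallback behavior without touching the async code.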
The transformation was dramatic:
=== ML Detection Results ===
Total tests: 30
Passed: 28 (93.3%)
Failed: 2 (6.7%)
Perfect Detection (100%):
- Python, Ruby, Go, Rust, C/C++, Java
- Swift, Kotlin, Lua, R
- SQL, YAML, JSON, HTML/XML
- Bash, PowerShell, Dockerfile
Only 2 Failures:
- React JSX → TypeScript (understandable)
- SCSS → CSS (very similar)
The Synchronous vs Asynchronous Challenge
But here's where things got interesting. Eleventy's markdown processing is synchronous, while ML detection is asynchronous. It's like trying to fit a square peg in a round hole.
// What Eleventy expects
function processMarkdown(content) {
  return processedContent; // Sync
}

// What ML detection provides
async function detectLanguage(code) {
  return await mlModel.detect(code); // Async
}
The solution? A two-stage approach:
- Development Mode: Use improved pattern detection (80-85% accuracy) for instant feedback
- Build Mode: Pre-process files with ML detection before Eleventy sees them
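A minimal sketch of how the two modes can share a single entry point (isProduction and patternDetect are illustrative names I've chosen, not the actual implementation):

```javascript
// Build a synchronous detector appropriate for the current mode.
// In production, the ML pre-processor has already tagged the markdown,
// so anything still unmarked simply falls back to plain text.
function makeDetector({ isProduction, patternDetect }) {
  return function detect(code, declaredLang) {
    // A language written on the fence always wins.
    if (declaredLang) return declaredLang;
    if (isProduction) return 'text';
    // Development: fast pattern-based guess for instant feedback.
    return patternDetect(code);
  };
}
```

The key property is that both branches stay synchronous, so Eleventy's markdown pipeline never has to await anything.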
Building the Hybrid Solution
Stage 1: Improved Pattern Detection
First, I built a smarter pattern detector that checks languages in a specific order to avoid false positives:
function detectLanguage(code) {
  const firstLine = code.split('\n')[0].trim();

  // 1. Shebang detection (highest priority)
  if (firstLine.startsWith('#!')) {
    if (firstLine.includes('python')) return 'python';
    if (firstLine.includes('node')) return 'javascript';
    if (firstLine.includes('bash')) return 'bash';
  }

  // 2. Check Python BEFORE JavaScript
  if (code.match(/^def\s+\w+.*:/m) ||
      code.includes('self.') ||
      code.includes('__init__')) {
    return 'python';
  }

  // 3. Ruby (also before JS)
  if (code.match(/^class\s+\w+\s*</m) ||
      code.includes('puts ') ||
      code.includes('do |')) {
    return 'ruby';
  }

  // ... more patterns
}
This boosted accuracy to ~85%, good enough for development.
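For a sense of how the elided checks might continue in the same priority order, here is a hedged sketch of two more rules. These are illustrative guesses on my part, not the exact patterns from the detector:

```javascript
// Illustrative continuation of the priority chain: languages with
// distinctive syntax get checked before catch-all JavaScript rules.
function detectMore(code) {
  // Go: a package clause or := short variable declarations
  if (/^package\s+\w+/m.test(code) || /:=/.test(code)) {
    return 'go';
  }
  // Rust: fn signatures with -> return types, or let mut bindings
  if (/\bfn\s+\w+.*->/.test(code) || /\blet\s+mut\b/.test(code)) {
    return 'rust';
  }
  return 'text';
}
```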
Stage 2: ML Pre-processing
For production builds, I created a pre-processor that runs before Eleventy:
async function preprocessMarkdown(content) {
  const codeBlockRegex = /^```(\w*)\n([\s\S]*?)^```/gm;
  const blocksToProcess = [];

  // Find unmarked code blocks
  let match;
  while ((match = codeBlockRegex.exec(content)) !== null) {
    if (!match[1]) { // No language specified
      blocksToProcess.push({
        fullMatch: match[0],
        code: match[2],
        index: match.index
      });
    }
  }

  // Detect languages in parallel
  const detectedLanguages = await Promise.all(
    blocksToProcess.map(block => detectLanguage(block.code))
  );

  // Replace blocks (in reverse to maintain indices)
  let processedContent = content;
  for (let i = blocksToProcess.length - 1; i >= 0; i--) {
    const block = blocksToProcess[i];
    const lang = detectedLanguages[i];
    const newBlock = `\`\`\`${lang}\n${block.code}\`\`\``;
    processedContent =
      processedContent.substring(0, block.index) +
      newBlock +
      processedContent.substring(block.index + block.fullMatch.length);
  }

  return processedContent;
}
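The reverse loop matters: splicing the earliest block first would shift every later block's saved index. The same technique, isolated as a generic helper (a sketch of the idea, not code from the post):

```javascript
// Apply several replacements to a string using indices measured against
// the ORIGINAL text. Processing highest index first means earlier
// offsets are never invalidated by a splice.
function spliceAll(text, edits) {
  // edits: [{ index, length, replacement }]
  const sorted = [...edits].sort((a, b) => b.index - a.index);
  let out = text;
  for (const { index, length, replacement } of sorted) {
    out = out.slice(0, index) + replacement + out.slice(index + length);
  }
  return out;
}
```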
Integration with the Build Pipeline
The beauty of this approach is its seamless integration:
{
  "scripts": {
    "dev": "eleventy --serve",
    "build": "npm run ml-detect && eleventy",
    "build:no-ml": "eleventy",
    "ml-detect": "node scripts/apply-ml-detection.js"
  }
}
During development, you get instant feedback with pattern detection. For production builds, ML detection runs automatically, modifying your markdown files in-place before Eleventy processes them.
Real-World Performance
Hereâs what this looks like in practice:
$ npm run build
🤖 Applying ML language detection...
Initializing ML model...
Model ready
Processing ./src/en/posts...
✓ 2024-07-26-git-worktree.md (142ms)
✓ 2024-08-15-docker-optimization.md (98ms)
✓ 2024-09-01-rust-async-patterns.md (156ms)
Processed 27 files in 3.2s
Average: 118ms per file
Lessons Learned
1. Pattern Matching Has Its Limits
No matter how clever your regex, it can't compete with ML models trained on millions of examples. My pattern detector confused Python with JavaScript because both use import statements. The ML model understands context.
2. Hybrid Approaches Work
You don't always need the "perfect" solution everywhere. Using patterns in development (where speed matters) and ML in production (where accuracy matters) gives you the best of both worlds.
3. Pre-processing > Runtime Processing
Instead of fighting with Eleventy's synchronous nature, working around it with pre-processing was simpler and more maintainable.
4. Cache Everything
ML detection is expensive. Caching results based on code snippets reduced detection time by ~70% on subsequent runs.
Implementation Tips
If you're building something similar, here are my recommendations:
- Start with the simplest approach: Get pattern detection working first
- Test with real data: My 30-language test suite caught issues I never would have imagined
- Make it idempotent: Running ML detection multiple times should be safe
- Add escape hatches: The build:no-ml command saved me during debugging
- Provide progress feedback: ML detection can be slow, so let users know it's working
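On the idempotency point: the pre-processor's regex captures the language specifier, so blocks that already carry one are filtered out and a second run is a no-op. A self-contained sketch of that filter:

```javascript
// Same fence regex as the pre-processor: group 1 is the language
// specifier, group 2 the code body.
const codeBlockRegex = /^```(\w*)\n([\s\S]*?)^```/gm;

// Return only the bodies of fences with NO language specifier; these
// are the only blocks a (re-)run is allowed to touch.
function unmarkedBlocks(content) {
  const blocks = [];
  let match;
  while ((match = codeBlockRegex.exec(content)) !== null) {
    if (!match[1]) blocks.push(match[2]);
  }
  return blocks;
}
```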
The Code
The complete implementation is available in my blog's repository. Here are the key files:
- Pattern Detector: src/_utils/markdown-language-detector-improved.js
- ML Preprocessor: src/_utils/markdown-preprocessor-ml.js
- Build Script: scripts/apply-ml-detection.js
- Shiki Config: .eleventy.js
What's Next?
This implementation has been running in production for a month now, and the results speak for themselves. Code blocks are properly highlighted, the build process is smooth, and I haven't had to manually specify a language in weeks.
Future improvements could include:
- Streaming processing for large files
- WebAssembly version for client-side detection
- Custom training for domain-specific languages
- Integration with git hooks for automatic detection on commit
Conclusion
Sometimes the best solution isn't the most elegant one; it's the one that works. By combining pattern matching for development speed with ML detection for production accuracy, I achieved a 93.3% detection rate while keeping my development workflow fast and responsive.
The next time you're faced with a similar challenge, remember: you don't have to choose between speed and accuracy. Sometimes, you can have both.
Have you dealt with automatic language detection in your projects? I'd love to hear about your approach! Drop me a line on Twitter or check out the full implementation on GitHub.