Spellcheck System — API Reference
@lilith/text-processing-utils spellcheck module.
Engine: SymSpell via @lilith/spellchecker-wasm (WebAssembly, edit-distance + corpus frequency).
Architecture: Engine-first checking → multi-factor confidence scoring → pattern-based split/joined word detection → 14 pluggable feature detectors.
Quick Start
Browser (with WASM engine)
import { SpellChecker, SymSpellEngine } from '@lilith/text-processing-utils';
const engine = new SymSpellEngine({
wasmUrl: '/spellcheck-data/spellchecker-wasm.wasm',
dictionaryUrl: '/spellcheck-data/frequency-dictionary.txt',
bigramUrl: '/spellcheck-data/frequency-bigrams.txt',
});
await engine.init();
const checker = new SpellChecker({ engine, autoCorrect: true });
await checker.initialize();
// Single word
const result = await checker.check('teh');
// { word: 'teh', correct: false, suggestions: ['the', ...], confidence: 0.92 }
// Auto-correct a sentence
const fixed = await checker.fix('teh quikc brwon fox');
// 'the quick brown fox'
// Full diagnostic (positions, severities, split/joined words)
const report = await checker.checkText('teh quikc brwon fox');
// { errors: [...], stats: { totalWords: 4, misspelledWords: 3, ... } }
Node.js (without engine — Trie dictionaries)
import { SpellChecker, NodeDictionaryLoader } from '@lilith/text-processing-utils';
const checker = new SpellChecker({
dictionaries: ['english', 'technical'],
loader: new NodeDictionaryLoader('/path/to/data'),
});
await checker.initialize();
const result = await checker.check('recieve');
Architecture
Input text
  │
  ├─► tokenize ──► FOR EACH word:
  │     │
  │     ├─► shouldIgnoreWord? (numbers, URLs, emails, acronyms)
  │     │     └─► YES → skip
  │     │
  │     ├─► normalizeWord (lowercase, strip possessives)
  │     │
  │     ├─► engine.contains(word)
  │     │     └─► YES → correct
  │     │
  │     ├─► engine.suggest(word, max)
  │     │     └─► SymSpell: edit-distance candidates sorted by corpus frequency
  │     │
  │     ├─► ConfidenceScorer.calculateConfidence()
  │     │     Weighted factors:
  │     │       40% edit distance (Damerau-Levenshtein)
  │     │       20% phonetic match (Soundex + Metaphone)
  │     │       20% keyboard proximity (QWERTY adjacency)
  │     │       10% word frequency (corpus rank)
  │     │       10% context fit
  │     │     Modifiers: ×1.15 unique suggestion, ×0.85 if >2 alternatives
  │     │
  │     ├─► Technical context adjustment
  │     │       camelCase/snake_case → ×0.5
  │     │       version strings → ×0.3
  │     │       hex values → ×0.2
  │     │       code blocks → ×0.7
  │     │
  │     └─► decideAction(suggestions, confidence)
  │             ≥ 0.70 → AUTO_FIX
  │             ≥ 0.50 → SUGGEST
  │             ≥ 0.30 → POSSIBLE
  │             < 0.30 → IGNORE
  │
  ├─► Bigram context rescoring (checkText only)
  │     For ambiguous corrections, rescore using bigram frequencies
  │     with adjacent words. Promotes contextually natural choices.
  │     "teh nwe" → bigram("the","new") >> bigram("tea","new")
  │
  ├─► Split-word detection (TypoManager → SplitWordDetector)
  │     "ist he" → "is the", "ot her" → "other"
  │     48+ hardcoded regex patterns + dictionary pair checking
  │
  └─► Joined-word detection (TypoManager → JoinedWordDetector)
        "javascript" → "java script", "testcase" → "test case"
        Known patterns + dictionary split + camelCase heuristics
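The bigram rescoring step in the diagram can be illustrated with a small standalone sketch. The lookup table, its counts, and the function names are hypothetical and only show the idea of preferring the candidate that pairs most naturally with a neighbouring word; the library's actual rescoring logic is not documented here.

```typescript
// Hypothetical bigram counts; real values would come from frequency-bigrams.txt.
const sampleBigramCounts = new Map<string, number>([
  ['the new', 5_000_000],
  ['tea new', 120],
]);

// Look up a bigram count, treating unseen pairs as 0.
function sampleBigramCount(first: string, second: string): number {
  return sampleBigramCounts.get(`${first} ${second}`) ?? 0;
}

// Pick the candidate that forms the most frequent bigram with the next word.
// Ties keep the earlier candidate, so the incoming ranking is preserved.
function rescoreWithNextWord(candidates: string[], nextWord: string): string | undefined {
  let best: string | undefined;
  let bestCount = -1;
  for (const candidate of candidates) {
    const count = sampleBigramCount(candidate, nextWord);
    if (count > bestCount) {
      best = candidate;
      bestCount = count;
    }
  }
  return best;
}

console.log(rescoreWithNextWord(['tea', 'the'], 'new')); // 'the'
```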
SpellChecker
The main entry point. Orchestrates engine, dictionaries, confidence scoring, and pattern detection.
import { SpellChecker } from '@lilith/text-processing-utils';
Constructor
new SpellChecker(options?: SpellCheckOptions)
interface SpellCheckOptions {
dictionaries?: string[]; // ['english', 'technical'] — Trie-based fallback
customWords?: string[]; // Words to add to custom dictionary
autoCorrect?: boolean; // Enable fix() auto-correction (default: false)
threshold?: number; // Min confidence to report (default: 0)
maxSuggestions?: number; // Max suggestions per word (default: 5)
caseSensitive?: boolean; // Case-sensitive checking (default: false)
ignoreNumbers?: boolean; // Skip numeric tokens (default: true)
ignoreUrls?: boolean; // Skip URL tokens (default: true)
ignoreEmails?: boolean; // Skip email tokens (default: true)
ignoreCamelCase?: boolean; // Skip camelCase identifiers (default: false)
minWordLength?: number; // Skip words shorter than this (default: 1)
confidenceThresholds?: {
autoFix?: number; // Default: 0.7
suggest?: number; // Default: 0.5
possible?: number; // Default: 0.3
};
enableSplitWordDetection?: boolean; // Default: true
enableJoinedWordDetection?: boolean; // Default: true
loader?: DictionaryDataLoader; // Trie dictionary file loader
engine?: SpellEngine; // SymSpell engine (recommended)
}
Methods
initialize(): Promise<void>
Load dictionaries / verify engine readiness. Called automatically on first check() if not called explicitly.
check(word: string): Promise<SpellCheckResult>
Check a single word.
const result = await checker.check('recieve');
// {
// word: 'recieve',
// correct: false,
// suggestions: ['receive', 'relieve', ...],
// confidence: 0.87,
// correctionDecision: {
// action: 'auto-fix',
// confidence: 0.87,
// suggestion: 'receive',
// reason: 'High confidence typo'
// }
// }
fix(text: string): Promise<string>
Auto-correct a sentence. Only applies corrections with AUTO_FIX confidence (≥ 0.70). Also applies split-word and joined-word corrections with confidence ≥ 0.80.
const fixed = await checker.fix('teh quikc brwon fox');
// 'the quick brown fox'
checkText(text: string): Promise<BatchSpellCheckResult>
Full diagnostic for UI rendering. Returns positioned errors with severities, including split-word and joined-word detections. Applies bigram context rescoring.
const report = await checker.checkText('teh quikc fox ist he best');
// {
// errors: [
// {
// type: 'misspelling',
// word: 'teh',
// suggestions: ['the', ...],
// severity: 'error',
// position: { start: 0, end: 3 },
// confidence: 0.92,
// correctionAction: 'auto-fix'
// },
// {
// type: 'split-word',
// word: 'ist he',
// suggestions: ['is the'],
// severity: 'error',
// position: { start: 14, end: 20 },
// confidence: 0.95
// }
// ],
// stats: {
// totalWords: 6,
// misspelledWords: 2,
// correctedWords: 0,
// ignoredWords: 0,
// processingTime: 12
// }
// }
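A report like the one above can be applied to the original text by a caller. The following is a minimal sketch, assuming positions are half-open character offsets (as the 'teh' entry at 0 to 3 suggests) and that the first suggestion is the preferred one. PositionedError and applyAutoFixes are illustrative names, not part of the package.

```typescript
// Minimal shape of one reported error, using only fields shown in the sample output.
interface PositionedError {
  suggestions: string[];
  position: { start: number; end: number };
  correctionAction?: string;
}

// Replace each auto-fix span with its first suggestion. Working from the end of the
// string backwards keeps earlier offsets valid while the text changes length.
function applyAutoFixes(text: string, errors: PositionedError[]): string {
  const fixes = errors
    .filter((e) => e.correctionAction === 'auto-fix' && e.suggestions.length > 0)
    .sort((a, b) => b.position.start - a.position.start);
  let output = text;
  for (const fix of fixes) {
    const replacement = fix.suggestions[0] ?? '';
    output = output.slice(0, fix.position.start) + replacement + output.slice(fix.position.end);
  }
  return output;
}

const patched = applyAutoFixes('teh quikc fox', [
  { suggestions: ['the'], position: { start: 0, end: 3 }, correctionAction: 'auto-fix' },
  { suggestions: ['quick'], position: { start: 4, end: 9 }, correctionAction: 'suggest' },
]);
console.log(patched); // 'the quikc fox'
```

Only the entry marked 'auto-fix' is applied; the 'suggest' entry is left for the user to confirm.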
addWord(word: string, dictionaryName?: string): void
Add a word to a dictionary (default: 'custom').
removeWord(word: string, dictionaryName?: string): boolean
Remove a word from a dictionary.
getDictionaryNames(): string[]
List all loaded dictionary names.
addSplitWordPattern(splitForm: string, correctForm: string, confidence?: number): void
Register a custom split-word pattern.
checker.addSplitWordPattern('with out', 'without', 0.9);
checkWordPair(word1: string, word2: string): SplitWordDetection | null
Check if two adjacent words are a split-word error.
detectSplitWords(text: string): SplitWordDetection[]
Detect all split-word errors in text.
setSplitWordDetection(enabled: boolean): void
Enable/disable split-word detection.
clearCache(): void
Clear internal caches.
SpellEngine Interface
The abstraction over the spell-checking backend. SymSpell is the production implementation.
interface SpellEngine {
isReady(): boolean;
contains(word: string): boolean;
suggest(word: string, maxSuggestions?: number): SpellSuggestion[];
addWord(word: string, frequency?: number): void;
bigramFrequency?(word1: string, word2: string): number; // optional
}
interface SpellSuggestion {
word: string;
distance: number; // edit distance from input
frequency: number; // corpus frequency count
}
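Because SpellEngine is an interface, a lightweight stand-in can be useful in tests. The sketch below is one possible implementation, assuming a brute-force scan with plain Levenshtein distance (not SymSpell, and without transposition support). BruteForceEngine and levenshtein are hypothetical names, and the interfaces are repeated locally so the example is self-contained.

```typescript
interface SpellSuggestion {
  word: string;
  distance: number;
  frequency: number;
}

interface SpellEngine {
  isReady(): boolean;
  contains(word: string): boolean;
  suggest(word: string, maxSuggestions?: number): SpellSuggestion[];
  addWord(word: string, frequency?: number): void;
}

// Plain Levenshtein distance, computed one row at a time.
function levenshtein(a: string, b: string): number {
  let previous: number[] = [];
  for (let j = 0; j <= b.length; j++) previous.push(j);
  for (let i = 1; i <= a.length; i++) {
    const current: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      current.push(Math.min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost));
    }
    previous = current;
  }
  return previous[b.length];
}

// Test double: scans every known word instead of using SymSpell's precomputed deletes.
class BruteForceEngine implements SpellEngine {
  private readonly words = new Map<string, number>();
  private readonly maxEditDistance: number;

  constructor(maxEditDistance = 2) {
    this.maxEditDistance = maxEditDistance;
  }

  isReady(): boolean {
    return true;
  }

  contains(word: string): boolean {
    return this.words.has(word.toLowerCase());
  }

  addWord(word: string, frequency = 1): void {
    this.words.set(word.toLowerCase(), frequency);
  }

  // Closest words first; among equal distances, the more frequent word wins.
  suggest(word: string, maxSuggestions = 5): SpellSuggestion[] {
    const target = word.toLowerCase();
    const matches: SpellSuggestion[] = [];
    this.words.forEach((frequency, candidate) => {
      const distance = levenshtein(target, candidate);
      if (distance <= this.maxEditDistance) matches.push({ word: candidate, distance, frequency });
    });
    return matches
      .sort((x, y) => x.distance - y.distance || y.frequency - x.frequency)
      .slice(0, maxSuggestions);
  }
}
```

Such an engine would presumably be passed as the engine option of SpellChecker, though that pairing is untested here.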
SymSpellEngine
Production engine using @lilith/spellchecker-wasm.
import { SymSpellEngine } from '@lilith/text-processing-utils';
const engine = new SymSpellEngine({
wasmUrl: '/spellcheck-data/spellchecker-wasm.wasm',
dictionaryUrl: '/spellcheck-data/frequency-dictionary.txt',
bigramUrl: '/spellcheck-data/frequency-bigrams.txt', // optional
maxEditDistance: 2, // default: 2
});
await engine.init();
engine.contains('hello'); // true
engine.suggest('helo', 5); // [{ word: 'hello', distance: 1, frequency: 23135851162 }, ...]
engine.addWord('myterm', 1000); // add to runtime dictionary
engine.bigramFrequency('the', 'quick'); // bigram corpus count
Dictionary format (frequency-dictionary.txt):
the 23135851162
of 13151942776
and 12997637966
Bigram format (frequency-bigrams.txt):
the the 93748009
of the 87821839
in the 72063547
ConfidenceScorer
Multi-factor confidence scoring for spell corrections. Determines whether to auto-fix, suggest, or ignore.
import { ConfidenceScorer, CorrectionConfidence } from '@lilith/text-processing-utils';
Constructor
new ConfidenceScorer(options?: ConfidenceScorerOptions)
interface ConfidenceScorerOptions {
thresholds?: {
autoFix?: number; // Default: 0.7
suggest?: number; // Default: 0.5
possible?: number; // Default: 0.3
};
}
Methods
calculateConfidence(original, suggestion, additionalSuggestions?, engineFrequency?): number
Returns a score from 0.0 to 1.0 based on weighted factors:
| Factor | Weight | Source |
|---|---|---|
| Edit distance | 40% | Damerau-Levenshtein (transpositions, insertions, deletions, substitutions) |
| Phonetic match | 20% | Soundex OR Metaphone soundsLike() |
| Keyboard proximity | 20% | QWERTY adjacency map (adjacent-key substitutions and insertions) |
| Word frequency | 10% | Static corpus rank table or engine frequency |
| Context fit | 10% | Default 0.5 (reserved for future context analyzer) |
Modifiers:
- Case-only difference → 0.85 (fixed)
- Unique suggestion (no alternatives) → ×1.15
- More than 2 alternatives → ×0.85
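The weighting in the table can be written out as arithmetic. This is a sketch under stated assumptions: each factor is already a score between 0 and 1, the modifiers multiply the weighted sum, and the result is clamped to the 0.0 to 1.0 range. FactorScores and combineFactors are hypothetical names, and the fixed case-only value is left out.

```typescript
// Each factor is a score in [0, 1]; how the library derives them is not shown here.
interface FactorScores {
  editDistance: number;
  phonetic: number;
  keyboard: number;
  frequency: number;
  context: number;
}

// Weights taken from the table above.
const FACTOR_WEIGHTS: FactorScores = {
  editDistance: 0.4,
  phonetic: 0.2,
  keyboard: 0.2,
  frequency: 0.1,
  context: 0.1,
};

function combineFactors(scores: FactorScores, alternativeCount: number): number {
  const base =
    scores.editDistance * FACTOR_WEIGHTS.editDistance +
    scores.phonetic * FACTOR_WEIGHTS.phonetic +
    scores.keyboard * FACTOR_WEIGHTS.keyboard +
    scores.frequency * FACTOR_WEIGHTS.frequency +
    scores.context * FACTOR_WEIGHTS.context;

  // Boost a unique suggestion, damp a crowded field, otherwise leave the score alone.
  let modifier = 1;
  if (alternativeCount === 0) modifier = 1.15;
  else if (alternativeCount > 2) modifier = 0.85;

  // Clamp so the boost cannot push the score above 1.0.
  return Math.min(1, Math.max(0, base * modifier));
}
```

For example, factor scores of 0.8, 1, 0.5, 0.5, and 0.5 give a weighted sum of 0.72, which becomes roughly 0.61 when more than two alternatives exist.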
decideAction(suggestions, confidence): CorrectionDecision
enum CorrectionConfidence {
AUTO_FIX = 'auto-fix', // ≥ 0.70 — safe to auto-correct
SUGGEST = 'suggest', // ≥ 0.50 — show suggestions to user
POSSIBLE = 'possible', // ≥ 0.30 — might be intentional
IGNORE = 'ignore', // < 0.30 — probably intentional
}
interface CorrectionDecision {
action: CorrectionConfidence;
confidence: number;
suggestion?: string; // Best suggestion (for AUTO_FIX / SUGGEST)
alternatives?: string[]; // Additional options (for SUGGEST / POSSIBLE)
reason: string; // Human-readable explanation
}
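The threshold mapping is simple enough to sketch directly. The function below is illustrative only (actionFor is a hypothetical name) and assumes each boundary is inclusive, as the ≥ signs in the enum comments indicate.

```typescript
type CorrectionAction = 'auto-fix' | 'suggest' | 'possible' | 'ignore';

interface ActionThresholds {
  autoFix: number;
  suggest: number;
  possible: number;
}

// Defaults documented for ConfidenceScorerOptions.
const DEFAULT_ACTION_THRESHOLDS: ActionThresholds = { autoFix: 0.7, suggest: 0.5, possible: 0.3 };

// Check thresholds from highest to lowest so the first match is the strongest action.
function actionFor(
  confidence: number,
  thresholds: ActionThresholds = DEFAULT_ACTION_THRESHOLDS,
): CorrectionAction {
  if (confidence >= thresholds.autoFix) return 'auto-fix';
  if (confidence >= thresholds.suggest) return 'suggest';
  if (confidence >= thresholds.possible) return 'possible';
  return 'ignore';
}

console.log(actionFor(0.87)); // 'auto-fix'
console.log(actionFor(0.55)); // 'suggest'
```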
isTechnicalIdentifier(word: string): boolean
Detects camelCase, PascalCase, snake_case, CONST_CASE, and identifiers prefixed with get/set/is/has/with/on.
adjustForTechnicalContext(baseConfidence, word, inCodeBlock?): number
Reduces confidence for technical patterns:
- camelCase / snake_case → ×0.5
- Version strings (v1.2.3) → ×0.3
- Hex values (a1b2c3) → ×0.2
- Inside code blocks → ×0.7
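The multipliers can be sketched as follows. The regular expressions are rough approximations, and two choices are assumptions rather than documented behaviour: only the first matching word pattern applies, and the code-block multiplier stacks on top of it. All names here are hypothetical.

```typescript
// Approximate patterns; the library's own detection rules may differ.
const VERSION_PATTERN = /^v?\d+(?:\.\d+)+$/;
const HEX_PATTERN = /^[0-9a-f]{6,}$/i;
const CAMEL_PATTERN = /^[a-z]+(?:[A-Z][a-z0-9]*)+$/;
const SNAKE_PATTERN = /^[a-z0-9]+(?:_[a-z0-9]+)+$/i;

// Return the multiplier for the first word pattern that matches, or 1 if none does.
function technicalMultiplier(word: string): number {
  if (VERSION_PATTERN.test(word)) return 0.3;
  // Require at least one digit so ordinary words such as "decade" are not treated as hex.
  if (HEX_PATTERN.test(word) && /\d/.test(word)) return 0.2;
  if (CAMEL_PATTERN.test(word) || SNAKE_PATTERN.test(word)) return 0.5;
  return 1;
}

// Assumption: the code-block factor is applied in addition to the word-pattern factor.
function adjustConfidence(base: number, word: string, inCodeBlock = false): number {
  const adjusted = base * technicalMultiplier(word);
  return inCodeBlock ? adjusted * 0.7 : adjusted;
}
```

Under these assumptions, a base confidence of 0.8 for getUserName drops to 0.4, which moves it from the auto-fix band into the possible band.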
TypoManager
Coordinates split-word and joined-word pattern detection. Does not handle single-word typo correction (that's the engine's job).
import { TypoManager } from '@lilith/text-processing-utils';
Constructor
new TypoManager(enableSplitWords?: boolean, enableJoinedWords?: boolean)
// Both default to true
Methods
setDictionaryChecker(checker: (word: string) => boolean): void
// Split-word detection
detectSplitWords(text: string): SplitWordDetection[]
checkWordPair(word1: string, word2: string): SplitWordDetection | null
addSplitWordPattern(splitForm: string, correctForm: string, confidence?: number): void
setSplitWordDetection(enabled: boolean): void
isSplitWordDetectionEnabled(): boolean
// Joined-word detection
detectJoinedWords(text: string): JoinedWordDetection[]
addJoinedWordPattern(joinedForm: string, splitWords: string[], confidence?: number, category?: string): void
setJoinedWordDetection(enabled: boolean): void
isJoinedWordDetectionEnabled(): boolean
getJoinedWordDetector(): JoinedWordDetector
// Management
clearDetectorCaches(): void
getStats(): { splitWordDetectionEnabled, joinedWordDetectionEnabled, joinedWordPatternsEnabled }
SplitWordDetector
Detects accidentally split words: "ist he" → "is the", "th e" → "the".
import { SplitWordDetector } from '@lilith/text-processing-utils';
Detection Methods
48+ hardcoded regex patterns for common splits, plus dictionary-based pair checking for unknown patterns.
interface SplitWordDetection {
originalText: string; // "ist he"
splitWords: string[]; // ["ist", "he"]
suggestedCorrection: string; // "is the"
confidence: number; // 0.0–1.0
startPosition: number;
endPosition: number;
pattern?: SplitWordPattern;
}
Constructor
new SplitWordDetector(
dictionaryChecker?: (word: string) => boolean,
maxWordSequence?: number // default: 4
)
Key Methods
detectSplitWords(text: string): SplitWordDetection[]
checkWordPair(word1: string, word2: string): SplitWordDetection | null
addCustomPattern(splitForm: string, correctForm: string, confidence?: number, context?: string): void
getKnownPatterns(): Map<string, SplitWordPattern>
hasPattern(splitForm: string): boolean
clearCache(): void
getCacheStats(): { detectionCacheSize, dictionaryCacheSize, maxCacheSize }
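Dictionary-based pair checking can be pictured with a short sketch. It assumes one plausible rule, not necessarily the detector's own: a pair is flagged when the joined form is a dictionary word and at least one of the two parts is not. joinIfSplit and the sample word list are hypothetical, and patterns such as "ist he" → "is the", where the space is misplaced, would need the hardcoded patterns instead.

```typescript
// Return the joined word when the pair looks like an accidental split, otherwise null.
function joinIfSplit(
  word1: string,
  word2: string,
  isWord: (word: string) => boolean,
): string | null {
  const first = word1.toLowerCase();
  const second = word2.toLowerCase();
  // Two valid words side by side ("in to") are left alone to avoid false positives.
  if (isWord(first) && isWord(second)) return null;
  const joined = first + second;
  return isWord(joined) ? joined : null;
}

const sampleWords = new Set(['other', 'in', 'to', 'into', 'without']);
const isSampleWord = (word: string): boolean => sampleWords.has(word);

console.log(joinIfSplit('ot', 'her', isSampleWord)); // 'other'
console.log(joinIfSplit('in', 'to', isSampleWord)); // null
```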
JoinedWordDetector
Detects accidentally joined words: "javascript" → "java script", "testcase" → "test case".
import { JoinedWordDetector } from '@lilith/text-processing-utils';
Detection Methods
Three detection strategies in priority order:
- Known patterns — hardcoded compound words (confidence: 0.95)
- Dictionary-based — split at every position, check both halves against dictionary
- Heuristic — camelCase splitting (when enableHeuristics: true)
interface JoinedWordDetection {
originalWord: string; // "javascript"
suggestedSplit: string; // "java script"
splitWords: string[]; // ["java", "script"]
confidence: number;
startPosition: number;
endPosition: number;
patternType: 'known-pattern' | 'dictionary-based' | 'heuristic';
reason: string;
}
Constructor
new JoinedWordDetector(options?: JoinedWordDetectorOptions)
interface JoinedWordDetectorOptions {
maxWordLength?: number; // Skip words longer than this
minSplitLength?: number; // Minimum length for split parts
enableHeuristics?: boolean; // Enable camelCase heuristic (default: false)
confidenceThreshold?: number; // Filter results below this threshold
maxCacheSize?: number; // LRU cache size
}
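The dictionary-based strategy (split at every position and check both halves) can be sketched like this. findDictionarySplits is a hypothetical helper, and the default minimum part length of 2 is an assumption standing in for minSplitLength.

```typescript
// Try every split point and keep those where both halves are dictionary words.
function findDictionarySplits(
  word: string,
  isWord: (candidate: string) => boolean,
  minSplitLength = 2,
): Array<[string, string]> {
  const lower = word.toLowerCase();
  const splits: Array<[string, string]> = [];
  for (let i = minSplitLength; i <= lower.length - minSplitLength; i++) {
    const left = lower.slice(0, i);
    const right = lower.slice(i);
    if (isWord(left) && isWord(right)) splits.push([left, right]);
  }
  return splits;
}

const compoundParts = new Set(['test', 'case', 'java', 'script']);
const isCompoundPart = (candidate: string): boolean => compoundParts.has(candidate);

console.log(findDictionarySplits('testcase', isCompoundPart)); // [ [ 'test', 'case' ] ]
```

A real detector would also need to skip words that are themselves valid, otherwise legitimate compounds would be flagged.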
Feature System
SOLID-based pluggable feature architecture. Each feature implements SpellCheckFeature and can be composed via FeatureManager.
interface SpellCheckFeature<TConfig = unknown> {
name: string;
enabled: boolean;
initialize(): Promise<void>;
checkText(text: string): Promise<FeatureResult[]>;
configure(options: Partial<TConfig>): void;
}
interface FeatureResult {
type: string;
originalText: string;
suggestedCorrection: string;
confidence: number;
startPosition: number;
endPosition: number;
metadata?: Record<string, unknown>;
}
FeatureManager
import { FeatureManager } from '@lilith/text-processing-utils';
const manager = new FeatureManager();
manager.addFeature(new CapitalizationFeature());
manager.addFeature(new GrammarPatternFeature());
manager.addFeature(new PunctuationFeature());
await manager.initializeAll();
const results = await manager.checkText('i went too the store.');
// Deduplicates overlapping results automatically
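How overlapping results are deduplicated is not specified above. One plausible policy, sketched here as an assumption, keeps the highest-confidence result in each group of overlapping spans. SpanResult and dedupeOverlapping are hypothetical names that reuse the position fields of FeatureResult.

```typescript
// Subset of FeatureResult needed for overlap handling.
interface SpanResult {
  originalText: string;
  confidence: number;
  startPosition: number;
  endPosition: number;
}

// Half-open spans overlap when each one starts before the other ends.
function spansOverlap(a: SpanResult, b: SpanResult): boolean {
  return a.startPosition < b.endPosition && b.startPosition < a.endPosition;
}

// Greedy selection: visit results from most to least confident and keep a result
// only if it does not overlap anything already kept.
function dedupeOverlapping(results: SpanResult[]): SpanResult[] {
  const kept: SpanResult[] = [];
  const byConfidence = [...results].sort((a, b) => b.confidence - a.confidence);
  for (const candidate of byConfidence) {
    if (!kept.some((existing) => spansOverlap(existing, candidate))) kept.push(candidate);
  }
  // Return in document order for rendering.
  return kept.sort((a, b) => a.startPosition - b.startPosition);
}
```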
Available Features
| Feature | name | Detects |
|---|---|---|
| SplitWordFeature | split-word-detection | Accidentally split words |
| JoinedWordFeature | joined-word-detection | Accidentally joined words |
| CapitalizationFeature | capitalization-detection | Sentence-start, proper nouns, acronyms, title case |
| GrammarPatternFeature | grammar-pattern-detection | Articles, homophones, contractions, agreement, double negatives |
| PunctuationFeature | punctuation-detection | Missing/extra punctuation, quote style, bracket matching |
| HomophoneFeature | homophone-detection | there/their/they're, its/it's, etc. |
| RedundancyFeature | redundancy-detection | Redundant phrases ("ATM machine", "free gift") |
| AbbreviationFeature | abbreviation-detection | Abbreviation consistency |
| TechnicalConsistencyFeature | technical-consistency | Technical term consistency (JavaScript vs Javascript) |
Each feature has a corresponding *Factory class with presets:
import { GrammarPatternFeatureFactory } from '@lilith/text-processing-utils';
const strict = GrammarPatternFeatureFactory.createStrict();
const relaxed = GrammarPatternFeatureFactory.createRelaxed();
const technical = GrammarPatternFeatureFactory.createTechnicalWriting();
const custom = GrammarPatternFeatureFactory.createCustom({ checkHomophones: true, checkArticles: false });
Dictionary System
DictionaryManager
Manages multiple named dictionaries with priority ordering.
import { DictionaryManager, NodeDictionaryLoader } from '@lilith/text-processing-utils';
const manager = new DictionaryManager(new NodeDictionaryLoader('/path/to/data'));
await manager.initialize([
{ name: 'english', type: 'english', priority: 1 },
{ name: 'technical', type: 'technical', priority: 2 },
]);
manager.contains('function'); // true (checks all)
manager.contains('kubectl', ['technical']); // true (checks specific)
manager.getSuggestions('functon', 5); // ['function', ...]
manager.addWordToDictionary('myterm', 'custom');
Dictionary Types
| Class | Loads From |
|---|---|
| EnglishDictionary | english-words.txt via loader |
| TechnicalDictionary | technical-terms.txt via loader |
| CustomDictionary | In-memory, optionally seeded with word list |
DictionaryDataLoader
interface DictionaryDataLoader {
loadText(path: string): Promise<string>;
exists(path: string): Promise<boolean>;
}
Two implementations:
- NodeDictionaryLoader(rootPath: string): fs.readFile based
- FetchDictionaryLoader(baseUrl: string): HTTP fetch based (browser)
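A third implementation can be handy for unit tests that should not touch the file system or network. The sketch below assumes the loader is called with the file names listed under Dictionary Types; whether the library passes bare names or longer paths is not stated, and InMemoryDictionaryLoader is a hypothetical class.

```typescript
interface DictionaryDataLoader {
  loadText(path: string): Promise<string>;
  exists(path: string): Promise<boolean>;
}

// Serves dictionary text from an in-memory map keyed by path.
class InMemoryDictionaryLoader implements DictionaryDataLoader {
  private readonly files: Map<string, string>;

  constructor(files: Record<string, string>) {
    this.files = new Map(Object.entries(files));
  }

  async exists(path: string): Promise<boolean> {
    return this.files.has(path);
  }

  // Rejects for unknown paths, mirroring what a failed file read would do.
  async loadText(path: string): Promise<string> {
    const text = this.files.get(path);
    if (text === undefined) throw new Error(`No dictionary data at ${path}`);
    return text;
  }
}

const memoryLoader = new InMemoryDictionaryLoader({
  'english-words.txt': 'hello\nworld\n',
});
```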
DictionaryPersistence
Import/export dictionaries to disk as JSON or plain text.
import { DictionaryPersistence } from '@lilith/text-processing-utils';
const persistence = new DictionaryPersistence('.dictionaries');
await persistence.saveDictionary(myDict);
await persistence.exportAsText(myDict, 'output.txt');
const imported = await persistence.importFromText('input.txt', 'my-dict');
await persistence.backupDictionaries('backup/');
Utility Classes
BloomFilter
Probabilistic set membership. Used internally for fast negative lookups.
import { BloomFilter } from '@lilith/text-processing-utils';
const filter = new BloomFilter(10000, 0.01); // 10K items, 1% false positive rate
filter.add('hello');
filter.addMany(['world', 'foo']);
filter.mightContain('hello'); // true
filter.definitelyNotContains('xyz'); // true (guaranteed)
filter.getStats(); // { size, numHashes, itemCount, saturation, estimatedFPR }
// Serialize
const data = filter.export();
const restored = BloomFilter.import(data);
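For readers unfamiliar with how a capacity and false-positive rate translate into a bit array, the standard sizing formulas are m = -n·ln(p) / (ln 2)² bits and k = (m/n)·ln 2 hash functions. The sketch below applies them with FNV-1a and double hashing; it is a generic illustration with hypothetical names, not the package's BloomFilter internals.

```typescript
// Standard Bloom filter sizing for n expected items and target false-positive rate p.
function bloomParameters(expectedItems: number, falsePositiveRate: number): { bits: number; hashes: number } {
  const bits = Math.ceil((-expectedItems * Math.log(falsePositiveRate)) / (Math.LN2 * Math.LN2));
  const hashes = Math.max(1, Math.round((bits / expectedItems) * Math.LN2));
  return { bits, hashes };
}

// 32-bit FNV-1a over UTF-16 code units, with a seed mixed into the offset basis.
function fnv1a(text: string, seed: number): number {
  let hash = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

class SketchBloomFilter {
  private readonly bitArray: Uint8Array;
  private readonly bits: number;
  private readonly hashes: number;

  constructor(expectedItems: number, falsePositiveRate: number) {
    const params = bloomParameters(expectedItems, falsePositiveRate);
    this.bits = params.bits;
    this.hashes = params.hashes;
    this.bitArray = new Uint8Array(Math.ceil(params.bits / 8));
  }

  // Derive k bit positions from two base hashes (double hashing).
  private positions(item: string): number[] {
    const h1 = fnv1a(item, 0);
    const h2 = (fnv1a(item, 0x9e3779b9) | 1) >>> 0; // force an odd step
    const result: number[] = [];
    for (let i = 0; i < this.hashes; i++) result.push((h1 + i * h2) % this.bits);
    return result;
  }

  add(item: string): void {
    for (const p of this.positions(item)) this.bitArray[p >> 3] |= 1 << (p & 7);
  }

  // True means "possibly present"; false means "definitely absent".
  mightContain(item: string): boolean {
    return this.positions(item).every((p) => (this.bitArray[p >> 3] & (1 << (p & 7))) !== 0);
  }
}

console.log(bloomParameters(10000, 0.01)); // { bits: 95851, hashes: 7 }
```

At 10K items and a 1% target this works out to about 12 KB of bits, which is why a Bloom filter is cheap to keep in front of a larger dictionary.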
LRUCache / TTLCache
import { TTLCache } from '@lilith/text-processing-utils';
const cache = new TTLCache<string, number>(1000, 60_000); // 1K entries, 60s TTL
cache.set('key', 42);
cache.get('key'); // 42
cache.getStats(); // { size, maxSize, hits, misses, hitRate }
cache.prune(); // Remove expired entries, returns count removed
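The TTL behaviour shown above can be approximated in a few lines. This sketch assumes lazy expiry on read plus an explicit prune(), and takes an injectable clock so expiry can be tested without waiting; it omits the size limit and hit/miss statistics. SketchTtlCache is a hypothetical name, not the exported TTLCache.

```typescript
class SketchTtlCache<K, V> {
  private readonly entries = new Map<K, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // The clock defaults to Date.now but can be replaced in tests.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(key: K, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Expired entries are dropped lazily when they are read.
  get(key: K): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  // Remove every expired entry and report how many were removed.
  prune(): number {
    let removed = 0;
    const cutoff = this.now();
    this.entries.forEach((entry, key) => {
      if (entry.expiresAt <= cutoff) {
        this.entries.delete(key);
        removed++;
      }
    });
    return removed;
  }
}
```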
Result Types
interface SpellCheckResult {
word: string;
correct: boolean;
suggestions: string[];
confidence: number;
position?: { start: number; end: number; line?: number; column?: number };
correctionDecision?: CorrectionDecision;
}
interface BatchSpellCheckResult {
errors: SpellCheckError[];
stats: {
totalWords: number;
misspelledWords: number;
correctedWords: number;
ignoredWords: number;
processingTime: number;
};
}
interface SpellCheckError {
type: 'misspelling' | 'grammar' | 'capitalization' | 'punctuation' | 'split-word' | 'joined-word';
word: string;
message: string;
suggestions: string[];
severity: 'error' | 'warning' | 'info';
position: { start: number; end: number; line?: number; column?: number };
confidence?: number;
correctionAction?: string;
splitWords?: string[];
}
Correction Strategies
import { AutoCorrector, ContextualCorrector } from '@lilith/text-processing-utils';
// Auto-correct above confidence threshold
const auto = new AutoCorrector(0.75);
auto.shouldApply('teh'); // true
auto.correct('teh', ['the', 'tea']); // 'the'
// Context-aware correction
const contextual = new ContextualCorrector();
contextual.correct('teh', ['the', 'tea'], 'I went to teh store'); // 'the'
Both implement CorrectionStrategy:
interface CorrectionStrategy {
name: string;
correct(word: string, suggestions: string[], context?: string): string | null;
shouldApply(word: string, context?: string): boolean;
}
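Custom strategies can presumably be written against the same interface. The example below is a hypothetical strategy, not one shipped with the package: it accepts only a suggestion that starts with the same letter as the typo, on the assumption that first letters are rarely mistyped. The interface is repeated so the sketch is self-contained.

```typescript
interface CorrectionStrategy {
  name: string;
  correct(word: string, suggestions: string[], context?: string): string | null;
  shouldApply(word: string, context?: string): boolean;
}

// Hypothetical strategy: keep the first suggestion that shares the typo's initial letter.
class SameInitialStrategy implements CorrectionStrategy {
  name = 'same-initial';

  // Very short words carry too little signal to correct safely.
  shouldApply(word: string): boolean {
    return word.length >= 3;
  }

  correct(word: string, suggestions: string[]): string | null {
    if (!this.shouldApply(word)) return null;
    const initial = word.charAt(0).toLowerCase();
    const match = suggestions.find((s) => s.toLowerCase().startsWith(initial));
    return match ?? null;
  }
}

const sameInitial = new SameInitialStrategy();
console.log(sameInitial.correct('teh', ['ahe', 'the', 'tea'])); // 'the'
```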
Web Worker Integration
For browser use, the spellcheck system runs in a Web Worker to avoid blocking the main thread.
Main Thread                              Worker Thread
─────────────                            ─────────────
useSpellcheck() hook                     spellcheck.worker.ts
  │                                      │
  ├─► postMessage({ type: 'init' }) ────►│
  │                                      ├─► new SymSpellEngine()
  │                                      ├─► engine.init() (loads WASM + dict ~2s)
  │                                      ├─► new SpellChecker({ engine })
  │◄── { type: 'ready' } ────────────────┤
  │                                      │
  ├─► postMessage({ type: 'check', ─────►│
  │       text: '...' })                 ├─► checker.checkText(text)
  │                                      │
  │◄── { type: 'result', ────────────────┤
  │       errors: [...] }                │
  │                                      │
  └─► SpellcheckOverlay renders          │
        • Green underlines (auto-fix)    │
        • Cyan underlines (suggestions)  │
        • 5s countdown for approval      │
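The message flow in the diagram maps naturally onto discriminated unions. The sketch below types the four messages and routes a request to its response; the type names and routeWorkerRequest are hypothetical, and the two callbacks stand in for engine start-up and checker.checkText so the routing can be exercised without a real worker.

```typescript
// Messages sent from the main thread to the worker.
type SpellWorkerRequest = { type: 'init' } | { type: 'check'; text: string };

// Messages sent from the worker back to the main thread.
type SpellWorkerResponse = { type: 'ready' } | { type: 'result'; errors: unknown[] };

// Route one request to its response. In a real worker this would be called from the
// message handler and the returned value passed to postMessage.
async function routeWorkerRequest(
  request: SpellWorkerRequest,
  start: () => Promise<void>,
  checkText: (text: string) => Promise<unknown[]>,
): Promise<SpellWorkerResponse> {
  switch (request.type) {
    case 'init':
      await start();
      return { type: 'ready' };
    case 'check':
      return { type: 'result', errors: await checkText(request.text) };
  }
}
```

Keeping the routing free of worker globals makes it possible to unit-test the protocol in Node.js.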
useSpellcheck Hook
import { useSpellcheck } from './hooks/useSpellcheck';
const { errors, isReady, isChecking } = useSpellcheck(textValue, {
debounceMs: 300,
autoApproveConfidence: 0.5,
timeoutMode: 'auto-approve',
});
SpellcheckOverlay Component
import { SpellcheckOverlay } from './components/SpellcheckOverlay';
<SpellcheckOverlay
text={textValue}
errors={errors}
onApprove={(correction) => applyCorrection(correction)}
onDismiss={(error) => dismissError(error)}
/>