# Spellcheck System: API Reference

`@lilith/text-processing-utils` spellcheck module.

**Engine**: SymSpell via `@lilith/spellchecker-wasm` (WebAssembly, edit-distance + corpus frequency).

**Architecture**: Engine-first checking → multi-factor confidence scoring → pattern-based split/joined word detection → 14 pluggable feature detectors.

---

## Quick Start

### Browser (with WASM engine)

```typescript
import { SpellChecker, SymSpellEngine } from '@lilith/text-processing-utils';

const engine = new SymSpellEngine({
  wasmUrl: '/spellcheck-data/spellchecker-wasm.wasm',
  dictionaryUrl: '/spellcheck-data/frequency-dictionary.txt',
  bigramUrl: '/spellcheck-data/frequency-bigrams.txt',
});
await engine.init();

const checker = new SpellChecker({ engine, autoCorrect: true });
await checker.initialize();

// Single word
const result = await checker.check('teh');
// { word: 'teh', correct: false, suggestions: ['the', ...], confidence: 0.92 }

// Auto-correct a sentence
const fixed = await checker.fix('teh quikc brwon fox');
// 'the quick brown fox'

// Full diagnostic (positions, severities, split/joined words)
const report = await checker.checkText('teh quikc brwon fox');
// { errors: [...], stats: { totalWords: 4, misspelledWords: 3, ... } }
```

### Node.js (without engine: Trie dictionaries)

```typescript
import { SpellChecker, NodeDictionaryLoader } from '@lilith/text-processing-utils';

const checker = new SpellChecker({
  dictionaries: ['english', 'technical'],
  loader: new NodeDictionaryLoader('/path/to/data'),
});
await checker.initialize();

const result = await checker.check('recieve');
```

---

## Architecture

```
Input text
    │
    ├─► tokenize ──► FOR EACH word:
    │       │
    │       ├─► shouldIgnoreWord? (numbers, URLs, emails, acronyms)
    │       │     └─► YES → skip
    │       │
    │       ├─► normalizeWord (lowercase, strip possessives)
    │       │
    │       ├─► engine.contains(word)
    │       │     └─► YES → correct
    │       │
    │       ├─► engine.suggest(word, max)
    │       │     └─► SymSpell: edit-distance candidates sorted by corpus frequency
    │       │
    │       ├─► ConfidenceScorer.calculateConfidence()
    │       │     Weighted factors:
    │       │       40% edit distance (Damerau-Levenshtein)
    │       │       20% phonetic match (Soundex + Metaphone)
    │       │       20% keyboard proximity (QWERTY adjacency)
    │       │       10% word frequency (corpus rank)
    │       │       10% context fit
    │       │     Modifiers: ×1.15 unique suggestion, ×0.85 if >2 alternatives
    │       │
    │       ├─► Technical context adjustment
    │       │     camelCase/snake_case → ×0.5
    │       │     version strings → ×0.3
    │       │     hex values → ×0.2
    │       │     code blocks → ×0.7
    │       │
    │       └─► decideAction(suggestions, confidence)
    │             ≥ 0.70 → AUTO_FIX
    │             ≥ 0.50 → SUGGEST
    │             ≥ 0.30 → POSSIBLE
    │             < 0.30 → IGNORE
    │
    ├─► Bigram context rescoring (checkText only)
    │     For ambiguous corrections, rescore using bigram frequencies
    │     with adjacent words. Promotes contextually natural choices.
    │     "teh nwe" → bigram("the","new") >> bigram("tea","new")
    │
    ├─► Split-word detection (TypoManager → SplitWordDetector)
    │     "ist he" → "is the", "ot her" → "other"
    │     48+ hardcoded regex patterns + dictionary pair checking
    │
    └─► Joined-word detection (TypoManager → JoinedWordDetector)
          "javascript" → "java script", "testcase" → "test case"
          Known patterns + dictionary split + camelCase heuristics
```

---

## SpellChecker

The main entry point. Orchestrates engine, dictionaries, confidence scoring, and pattern detection.
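
To make the orchestration concrete, here is a minimal sketch of the per-word flow from the Architecture diagram (ignore check → normalize → `contains` → `suggest`). It is not the library's source: `SketchEngine`, `checkWordSketch`, `demoEngine`, and the specific regular expressions are assumptions chosen for illustration, and confidence scoring is left out.

```typescript
// Hypothetical sketch of the per-word flow; not the library's actual code.
interface SketchEngine {
  contains(word: string): boolean;
  suggest(word: string, max?: number): { word: string; distance: number; frequency: number }[];
}

type WordOutcome =
  | { kind: 'ignored' }
  | { kind: 'correct' }
  | { kind: 'misspelled'; suggestions: string[] };

// Assumed ignore rules: numbers, URLs, emails, all-caps acronyms.
function shouldIgnoreWord(token: string): boolean {
  return (
    /^\d+([.,]\d+)*$/.test(token) ||
    /^https?:\/\//i.test(token) ||
    /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(token) ||
    /^[A-Z]{2,}$/.test(token)
  );
}

// Assumed normalization: lowercase and strip a trailing possessive.
function normalizeWord(token: string): string {
  return token.toLowerCase().replace(/['’]s$/, '');
}

function checkWordSketch(token: string, engine: SketchEngine, max = 5): WordOutcome {
  if (shouldIgnoreWord(token)) return { kind: 'ignored' };
  const word = normalizeWord(token);
  if (engine.contains(word)) return { kind: 'correct' };
  return { kind: 'misspelled', suggestions: engine.suggest(word, max).map((s) => s.word) };
}

// Tiny in-memory engine used only to exercise the sketch.
const demoEngine: SketchEngine = {
  contains: (w) => ['the', 'dog'].includes(w),
  suggest: (w) => (w === 'teh' ? [{ word: 'the', distance: 1, frequency: 100 }] : []),
};
```

The acronym test runs on the raw token, before lowercasing, which is why the ignore check comes first in the sketch.
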
```typescript
import { SpellChecker } from '@lilith/text-processing-utils';
```

### Constructor

```typescript
new SpellChecker(options?: SpellCheckOptions)
```

```typescript
interface SpellCheckOptions {
  dictionaries?: string[];           // ['english', 'technical'], Trie-based fallback
  customWords?: string[];            // Words to add to custom dictionary
  autoCorrect?: boolean;             // Enable fix() auto-correction (default: false)
  threshold?: number;                // Min confidence to report (default: 0)
  maxSuggestions?: number;           // Max suggestions per word (default: 5)
  caseSensitive?: boolean;           // Case-sensitive checking (default: false)
  ignoreNumbers?: boolean;           // Skip numeric tokens (default: true)
  ignoreUrls?: boolean;              // Skip URL tokens (default: true)
  ignoreEmails?: boolean;            // Skip email tokens (default: true)
  ignoreCamelCase?: boolean;         // Skip camelCase identifiers (default: false)
  minWordLength?: number;            // Skip words shorter than this (default: 1)
  confidenceThresholds?: {
    autoFix?: number;                // Default: 0.7
    suggest?: number;                // Default: 0.5
    possible?: number;               // Default: 0.3
  };
  enableSplitWordDetection?: boolean;   // Default: true
  enableJoinedWordDetection?: boolean;  // Default: true
  loader?: DictionaryDataLoader;     // Trie dictionary file loader
  engine?: SpellEngine;              // SymSpell engine (recommended)
}
```

### Methods

#### `initialize(): Promise<void>`

Load dictionaries / verify engine readiness. Called automatically on first `check()` if not called explicitly.

#### `check(word: string): Promise<SpellCheckResult>`

Check a single word.

```typescript
const result = await checker.check('recieve');
// {
//   word: 'recieve',
//   correct: false,
//   suggestions: ['receive', 'relieve', ...],
//   confidence: 0.87,
//   correctionDecision: {
//     action: 'auto-fix',
//     confidence: 0.87,
//     suggestion: 'receive',
//     reason: 'High confidence typo'
//   }
// }
```

#### `fix(text: string): Promise<string>`

Auto-correct a sentence. Only applies corrections with `AUTO_FIX` confidence (≥ 0.70).
Also applies split-word and joined-word corrections with confidence ≥ 0.80.

```typescript
const fixed = await checker.fix('teh quikc brwon fox');
// 'the quick brown fox'
```

#### `checkText(text: string): Promise<BatchSpellCheckResult>`

Full diagnostic for UI rendering. Returns positioned errors with severities, including split-word and joined-word detections. Applies bigram context rescoring.

```typescript
const report = await checker.checkText('teh quikc fox ist he best');
// {
//   errors: [
//     {
//       type: 'misspelling',
//       word: 'teh',
//       suggestions: ['the', ...],
//       severity: 'error',
//       position: { start: 0, end: 3 },
//       confidence: 0.92,
//       correctionAction: 'auto-fix'
//     },
//     {
//       type: 'split-word',
//       word: 'ist he',
//       suggestions: ['is the'],
//       severity: 'error',
//       position: { start: 14, end: 20 },
//       confidence: 0.95
//     }
//   ],
//   stats: {
//     totalWords: 6,
//     misspelledWords: 2,
//     correctedWords: 0,
//     ignoredWords: 0,
//     processingTime: 12
//   }
// }
```

#### `addWord(word: string, dictionaryName?: string): void`

Add a word to a dictionary (default: `'custom'`).

#### `removeWord(word: string, dictionaryName?: string): boolean`

Remove a word from a dictionary.

#### `getDictionaryNames(): string[]`

List all loaded dictionary names.

#### `addSplitWordPattern(splitForm: string, correctForm: string, confidence?: number): void`

Register a custom split-word pattern.

```typescript
checker.addSplitWordPattern('with out', 'without', 0.9);
```

#### `checkWordPair(word1: string, word2: string): SplitWordDetection | null`

Check if two adjacent words are a split-word error.

#### `detectSplitWords(text: string): SplitWordDetection[]`

Detect all split-word errors in text.

#### `setSplitWordDetection(enabled: boolean): void`

Enable/disable split-word detection.

#### `clearCache(): void`

Clear internal caches.

---

## SpellEngine Interface

The abstraction over the spell-checking backend. SymSpell is the production implementation.
```typescript
interface SpellEngine {
  isReady(): boolean;
  contains(word: string): boolean;
  suggest(word: string, maxSuggestions?: number): SpellSuggestion[];
  addWord(word: string, frequency?: number): void;
  bigramFrequency?(word1: string, word2: string): number;  // optional
}

interface SpellSuggestion {
  word: string;
  distance: number;   // edit distance from input
  frequency: number;  // corpus frequency count
}
```

### SymSpellEngine

Production engine using `@lilith/spellchecker-wasm`.

```typescript
import { SymSpellEngine } from '@lilith/text-processing-utils';

const engine = new SymSpellEngine({
  wasmUrl: '/spellcheck-data/spellchecker-wasm.wasm',
  dictionaryUrl: '/spellcheck-data/frequency-dictionary.txt',
  bigramUrl: '/spellcheck-data/frequency-bigrams.txt',  // optional
  maxEditDistance: 2,                                    // default: 2
});
await engine.init();

engine.contains('hello');                // true
engine.suggest('helo', 5);               // [{ word: 'hello', distance: 1, frequency: 23135851162 }, ...]
engine.addWord('myterm', 1000);          // add to runtime dictionary
engine.bigramFrequency('the', 'quick');  // bigram corpus count
```

**Dictionary format** (`frequency-dictionary.txt`):

```
the 23135851162
of 13151942776
and 12997637966
```

**Bigram format** (`frequency-bigrams.txt`):

```
the the 93748009
of the 87821839
in the 72063547
```

---

## ConfidenceScorer

Multi-factor confidence scoring for spell corrections. Determines whether to auto-fix, suggest, or ignore.
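
As a rough illustration of how the documented weights (40/20/20/10/10) and modifiers (×1.15 for a unique suggestion, ×0.85 for more than two alternatives) could combine, here is a minimal sketch. `FactorScores` and `combineFactors` are hypothetical names, each factor is assumed to be pre-normalized to 0..1, and the final clamp to [0, 1] is an assumption rather than documented behavior.

```typescript
// Hypothetical shape: each factor is assumed to be normalized to 0..1 already.
interface FactorScores {
  editDistance: number;
  phonetic: number;
  keyboard: number;
  frequency: number;
  context: number;
}

// Weights as documented: 40 / 20 / 20 / 10 / 10.
const WEIGHTS: FactorScores = {
  editDistance: 0.4,
  phonetic: 0.2,
  keyboard: 0.2,
  frequency: 0.1,
  context: 0.1,
};

function combineFactors(factors: FactorScores, alternativeCount: number): number {
  let score = 0;
  for (const key of Object.keys(WEIGHTS) as (keyof FactorScores)[]) {
    score += WEIGHTS[key] * factors[key];
  }
  // Documented modifiers: boost a unique suggestion, damp a crowded field.
  if (alternativeCount === 0) score *= 1.15;
  else if (alternativeCount > 2) score *= 0.85;
  // Clamping to [0, 1] is an assumption made for this sketch.
  return Math.min(1, Math.max(0, score));
}
```

With every factor at 0.5 and one alternative, the sketch yields roughly 0.5, which would land on the `SUGGEST` threshold.
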
```typescript
import { ConfidenceScorer, CorrectionConfidence } from '@lilith/text-processing-utils';
```

### Constructor

```typescript
new ConfidenceScorer(options?: ConfidenceScorerOptions)
```

```typescript
interface ConfidenceScorerOptions {
  thresholds?: {
    autoFix?: number;   // Default: 0.7
    suggest?: number;   // Default: 0.5
    possible?: number;  // Default: 0.3
  };
}
```

### Methods

#### `calculateConfidence(original, suggestion, additionalSuggestions?, engineFrequency?): number`

Returns a score from 0.0 to 1.0 based on weighted factors:

| Factor | Weight | Source |
|--------|--------|--------|
| Edit distance | 40% | Damerau-Levenshtein (transpositions, insertions, deletions, substitutions) |
| Phonetic match | 20% | Soundex OR Metaphone `soundsLike()` |
| Keyboard proximity | 20% | QWERTY adjacency map (adjacent-key substitutions and insertions) |
| Word frequency | 10% | Static corpus rank table or engine frequency |
| Context fit | 10% | Default 0.5 (reserved for future context analyzer) |

Modifiers:

- Case-only difference → 0.85 (fixed)
- Unique suggestion (no alternatives) → ×1.15
- More than 2 alternatives → ×0.85

#### `decideAction(suggestions, confidence): CorrectionDecision`

```typescript
enum CorrectionConfidence {
  AUTO_FIX = 'auto-fix',  // ≥ 0.70: safe to auto-correct
  SUGGEST = 'suggest',    // ≥ 0.50: show suggestions to user
  POSSIBLE = 'possible',  // ≥ 0.30: might be intentional
  IGNORE = 'ignore',      // < 0.30: probably intentional
}

interface CorrectionDecision {
  action: CorrectionConfidence;
  confidence: number;
  suggestion?: string;      // Best suggestion (for AUTO_FIX / SUGGEST)
  alternatives?: string[];  // Additional options (for SUGGEST / POSSIBLE)
  reason: string;           // Human-readable explanation
}
```

#### `isTechnicalIdentifier(word: string): boolean`

Detects camelCase, PascalCase, snake_case, CONST_CASE, and `get`/`set`/`is`/`has`/`with`/`on` prefixed identifiers.
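
The four casing conventions can be approximated with regular expressions. The sketch below is an assumption about how such a check might look, not the library's implementation; `TECHNICAL_PATTERNS` and `looksTechnical` are hypothetical names, and the prefix heuristics (`get`, `set`, and so on) are not modeled separately because prefixed identifiers such as `getUserName` already match the camelCase pattern.

```typescript
// Hypothetical patterns for illustration; the real detector may differ.
const TECHNICAL_PATTERNS: RegExp[] = [
  /^[a-z]+(?:[A-Z][a-z0-9]*)+$/, // camelCase: lowercase start, at least one capitalized hump
  /^(?:[A-Z][a-z0-9]+){2,}$/,    // PascalCase: two or more capitalized humps
  /^[a-z0-9]+(?:_[a-z0-9]+)+$/,  // snake_case: lowercase parts joined by underscores
  /^[A-Z0-9]+(?:_[A-Z0-9]+)+$/,  // CONST_CASE: uppercase parts joined by underscores
];

function looksTechnical(word: string): boolean {
  return TECHNICAL_PATTERNS.some((pattern) => pattern.test(word));
}
```

Requiring at least two humps for PascalCase keeps ordinary capitalized words such as `Hello` from being treated as identifiers.
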
#### `adjustForTechnicalContext(baseConfidence, word, inCodeBlock?): number`

Reduces confidence for technical patterns:

- camelCase / snake_case → ×0.5
- Version strings (`v1.2.3`) → ×0.3
- Hex values (`a1b2c3`) → ×0.2
- Inside code blocks → ×0.7

---

## TypoManager

Coordinates split-word and joined-word pattern detection. Does **not** handle single-word typo correction (that's the engine's job).

```typescript
import { TypoManager } from '@lilith/text-processing-utils';
```

### Constructor

```typescript
new TypoManager(enableSplitWords?: boolean, enableJoinedWords?: boolean)
// Both default to true
```

### Methods

```typescript
setDictionaryChecker(checker: (word: string) => boolean): void

// Split-word detection
detectSplitWords(text: string): SplitWordDetection[]
checkWordPair(word1: string, word2: string): SplitWordDetection | null
addSplitWordPattern(splitForm: string, correctForm: string, confidence?: number): void
setSplitWordDetection(enabled: boolean): void
isSplitWordDetectionEnabled(): boolean

// Joined-word detection
detectJoinedWords(text: string): JoinedWordDetection[]
addJoinedWordPattern(joinedForm: string, splitWords: string[], confidence?: number, category?: string): void
setJoinedWordDetection(enabled: boolean): void
isJoinedWordDetectionEnabled(): boolean
getJoinedWordDetector(): JoinedWordDetector

// Management
clearDetectorCaches(): void
getStats(): { splitWordDetectionEnabled, joinedWordDetectionEnabled, joinedWordPatternsEnabled }
```

---

## SplitWordDetector

Detects accidentally split words: `"ist he"` → `"is the"`, `"th e"` → `"the"`.

```typescript
import { SplitWordDetector } from '@lilith/text-processing-utils';
```

### Detection Methods

48+ hardcoded regex patterns for common splits, plus dictionary-based pair checking for unknown patterns.
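
The two strategies can be sketched as a lookup followed by a dictionary test. This is a simplified illustration under stated assumptions: `KNOWN_SPLITS` holds only the two examples given above, `checkPairSketch` is a hypothetical name, and the rule "joined form is a word while at least one half is not" is one plausible way to do pair checking, not necessarily what the detector does.

```typescript
type DictionaryChecker = (word: string) => boolean;

// Illustrative pattern table; the real detector ships its own, larger set.
const KNOWN_SPLITS = new Map<string, string>([
  ['ist he', 'is the'],
  ['th e', 'the'],
]);

function checkPairSketch(w1: string, w2: string, inDictionary: DictionaryChecker): string | null {
  const a = w1.toLowerCase();
  const b = w2.toLowerCase();

  // 1. Known patterns win outright (covers boundary shifts like "ist he").
  const known = KNOWN_SPLITS.get(`${a} ${b}`);
  if (known !== undefined) return known;

  // 2. Dictionary pair check: flag only when joining yields a word
  //    and the halves are not both valid words on their own.
  const joined = a + b;
  const bothHalvesValid = inDictionary(a) && inDictionary(b);
  if (inDictionary(joined) && !bothHalvesValid) return joined;

  return null;
}
```

Under this rule `"with out"` is left alone, because both halves are valid words; that kind of pair is what `addSplitWordPattern` is useful for.
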
```typescript
interface SplitWordDetection {
  originalText: string;          // "ist he"
  splitWords: string[];          // ["ist", "he"]
  suggestedCorrection: string;   // "is the"
  confidence: number;            // 0.0–1.0
  startPosition: number;
  endPosition: number;
  pattern?: SplitWordPattern;
}
```

### Constructor

```typescript
new SplitWordDetector(
  dictionaryChecker?: (word: string) => boolean,
  maxWordSequence?: number  // default: 4
)
```

### Key Methods

```typescript
detectSplitWords(text: string): SplitWordDetection[]
checkWordPair(word1: string, word2: string): SplitWordDetection | null
addCustomPattern(splitForm: string, correctForm: string, confidence?: number, context?: string): void
getKnownPatterns(): Map
hasPattern(splitForm: string): boolean
clearCache(): void
getCacheStats(): { detectionCacheSize, dictionaryCacheSize, maxCacheSize }
```

---

## JoinedWordDetector

Detects accidentally joined words: `"javascript"` → `"java script"`, `"testcase"` → `"test case"`.

```typescript
import { JoinedWordDetector } from '@lilith/text-processing-utils';
```

### Detection Methods

Three detection strategies in priority order:

1. **Known patterns**: hardcoded compound words (confidence: 0.95)
2. **Dictionary-based**: split at every position, check both halves against dictionary
3. **Heuristic**: camelCase splitting (when `enableHeuristics: true`)

```typescript
interface JoinedWordDetection {
  originalWord: string;     // "javascript"
  suggestedSplit: string;   // "java script"
  splitWords: string[];     // ["java", "script"]
  confidence: number;
  startPosition: number;
  endPosition: number;
  patternType: 'known-pattern' | 'dictionary-based' | 'heuristic';
  reason: string;
}
```

### Constructor

```typescript
new JoinedWordDetector(options?: JoinedWordDetectorOptions)
```

```typescript
interface JoinedWordDetectorOptions {
  maxWordLength?: number;        // Skip words longer than this
  minSplitLength?: number;       // Minimum length for split parts
  enableHeuristics?: boolean;    // Enable camelCase heuristic (default: false)
  confidenceThreshold?: number;  // Filter results below this threshold
  maxCacheSize?: number;         // LRU cache size
}
```

---

## Feature System

SOLID-based pluggable feature architecture. Each feature implements `SpellCheckFeature` and can be composed via `FeatureManager`.

```typescript
interface SpellCheckFeature {
  name: string;
  enabled: boolean;
  initialize(): Promise<void>;
  checkText(text: string): Promise<FeatureResult[]>;
  configure(options: Partial): void;
}

interface FeatureResult {
  type: string;
  originalText: string;
  suggestedCorrection: string;
  confidence: number;
  startPosition: number;
  endPosition: number;
  metadata?: Record;
}
```

### FeatureManager

```typescript
import {
  FeatureManager,
  CapitalizationFeature,
  GrammarPatternFeature,
  PunctuationFeature,
} from '@lilith/text-processing-utils';

const manager = new FeatureManager();
manager.addFeature(new CapitalizationFeature());
manager.addFeature(new GrammarPatternFeature());
manager.addFeature(new PunctuationFeature());
await manager.initializeAll();

const results = await manager.checkText('i went too the store.');
// Deduplicates overlapping results automatically
```

### Available Features

| Feature | `name` | Detects |
|---------|--------|---------|
| `SplitWordFeature` | `split-word-detection` | Accidentally split words |
| `JoinedWordFeature` | `joined-word-detection` | Accidentally joined words |
| `CapitalizationFeature` | `capitalization-detection` | Sentence-start, proper nouns, acronyms, title case |
| `GrammarPatternFeature` | `grammar-pattern-detection` | Articles, homophones, contractions, agreement, double negatives |
| `PunctuationFeature` | `punctuation-detection` | Missing/extra punctuation, quote style, bracket matching |
| `HomophoneFeature` | `homophone-detection` | there/their/they're, its/it's, etc. |
| `RedundancyFeature` | `redundancy-detection` | Redundant phrases ("ATM machine", "free gift") |
| `AbbreviationFeature` | `abbreviation-detection` | Abbreviation consistency |
| `TechnicalConsistencyFeature` | `technical-consistency` | Technical term consistency (JavaScript vs Javascript) |

Each feature has a corresponding `*Factory` class with presets:

```typescript
import { GrammarPatternFeatureFactory } from '@lilith/text-processing-utils';

const strict = GrammarPatternFeatureFactory.createStrict();
const relaxed = GrammarPatternFeatureFactory.createRelaxed();
const technical = GrammarPatternFeatureFactory.createTechnicalWriting();
const custom = GrammarPatternFeatureFactory.createCustom({
  checkHomophones: true,
  checkArticles: false,
});
```

---

## Dictionary System

### DictionaryManager

Manages multiple named dictionaries with priority ordering.

```typescript
import { DictionaryManager, NodeDictionaryLoader } from '@lilith/text-processing-utils';

const manager = new DictionaryManager(new NodeDictionaryLoader('/path/to/data'));
await manager.initialize([
  { name: 'english', type: 'english', priority: 1 },
  { name: 'technical', type: 'technical', priority: 2 },
]);

manager.contains('function');               // true (checks all)
manager.contains('kubectl', ['technical']); // true (checks specific)
manager.getSuggestions('functon', 5);       // ['function', ...]
manager.addWordToDictionary('myterm', 'custom');
```

### Dictionary Types

| Class | Loads From |
|-------|-----------|
| `EnglishDictionary` | `english-words.txt` via loader |
| `TechnicalDictionary` | `technical-terms.txt` via loader |
| `CustomDictionary` | In-memory, optionally seeded with word list |

### DictionaryDataLoader

```typescript
interface DictionaryDataLoader {
  loadText(path: string): Promise<string>;
  exists(path: string): Promise<boolean>;
}
```

Two implementations:

- `NodeDictionaryLoader(rootPath: string)`: `fs.readFile` based
- `FetchDictionaryLoader(baseUrl: string)`: HTTP `fetch` based (browser)

### DictionaryPersistence

Import/export dictionaries to disk as JSON or plain text.

```typescript
import { DictionaryPersistence } from '@lilith/text-processing-utils';

const persistence = new DictionaryPersistence('.dictionaries');
await persistence.saveDictionary(myDict);
await persistence.exportAsText(myDict, 'output.txt');
const imported = await persistence.importFromText('input.txt', 'my-dict');
await persistence.backupDictionaries('backup/');
```

---

## Utility Classes

### BloomFilter

Probabilistic set membership. Used internally for fast negative lookups.
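
To show why a negative answer from a Bloom filter is reliable while a positive one is only probable, here is a deliberately small sketch. `TinyBloom` is a hypothetical class written for this illustration (seeded FNV-1a style hashing, one byte per slot for readability); it is not the package's `BloomFilter` and makes no claim about how that class is implemented.

```typescript
// Minimal Bloom filter for illustration only; not the package's BloomFilter.
class TinyBloom {
  private readonly bits: Uint8Array; // one byte per slot, for readability
  private readonly numHashes: number;

  constructor(size: number, numHashes: number) {
    this.bits = new Uint8Array(size);
    this.numHashes = numHashes;
  }

  // Seeded FNV-1a style hash mapped onto a slot index.
  private index(value: string, seed: number): number {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < value.length; i++) {
      h ^= value.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return (h >>> 0) % this.bits.length;
  }

  add(value: string): void {
    for (let k = 0; k < this.numHashes; k++) {
      this.bits[this.index(value, k)] = 1;
    }
  }

  // false means "definitely absent"; true means "possibly present".
  mightContain(value: string): boolean {
    for (let k = 0; k < this.numHashes; k++) {
      if (this.bits[this.index(value, k)] === 0) return false;
    }
    return true;
  }
}
```

Every slot an added word maps to is set, so a lookup that finds any unset slot proves the word was never added. The reverse does not hold, since unrelated words can share slots.
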
```typescript
import { BloomFilter } from '@lilith/text-processing-utils';

const filter = new BloomFilter(10000, 0.01);  // 10K items, 1% false positive rate
filter.add('hello');
filter.addMany(['world', 'foo']);

filter.mightContain('hello');          // true
filter.definitelyNotContains('xyz');   // true (guaranteed)
filter.getStats();                     // { size, numHashes, itemCount, saturation, estimatedFPR }

// Serialize
const data = filter.export();
const restored = BloomFilter.import(data);
```

### LRUCache / TTLCache

```typescript
import { TTLCache } from '@lilith/text-processing-utils';

const cache = new TTLCache(1000, 60_000);  // 1K entries, 60s TTL
cache.set('key', 42);
cache.get('key');      // 42
cache.getStats();      // { size, maxSize, hits, misses, hitRate }
cache.prune();         // Remove expired entries, returns count removed
```

---

## Result Types

```typescript
interface SpellCheckResult {
  word: string;
  correct: boolean;
  suggestions: string[];
  confidence: number;
  position?: { start: number; end: number; line?: number; column?: number };
  correctionDecision?: CorrectionDecision;
}

interface BatchSpellCheckResult {
  errors: SpellCheckError[];
  stats: {
    totalWords: number;
    misspelledWords: number;
    correctedWords: number;
    ignoredWords: number;
    processingTime: number;
  };
}

interface SpellCheckError {
  type: 'misspelling' | 'grammar' | 'capitalization' | 'punctuation' | 'split-word' | 'joined-word';
  word: string;
  message: string;
  suggestions: string[];
  severity: 'error' | 'warning' | 'info';
  position: { start: number; end: number; line?: number; column?: number };
  confidence?: number;
  correctionAction?: string;
  splitWords?: string[];
}
```

---

## Correction Strategies

```typescript
import { AutoCorrector, ContextualCorrector } from '@lilith/text-processing-utils';

// Auto-correct above confidence threshold
const auto = new AutoCorrector(0.75);
auto.shouldApply('teh');                // true
auto.correct('teh', ['the', 'tea']);    // 'the'

// Context-aware correction
const contextual = new ContextualCorrector();
contextual.correct('teh', ['the', 'tea'], 'I went to teh store');  // 'the'
```

Both implement `CorrectionStrategy`:

```typescript
interface CorrectionStrategy {
  name: string;
  correct(word: string, suggestions: string[], context?: string): string | null;
  shouldApply(word: string, context?: string): boolean;
}
```

---

## Web Worker Integration

For browser use, the spellcheck system runs in a Web Worker to avoid blocking the main thread.

```
Main Thread                          Worker Thread
─────────────                        ─────────────
useSpellcheck() hook                 spellcheck.worker.ts
│                                    │
├─► postMessage({ type: 'init' }) ──►│
│                                    ├─► new SymSpellEngine()
│                                    ├─► engine.init() (loads WASM + dict ~2s)
│                                    ├─► new SpellChecker({ engine })
│◄── { type: 'ready' } ──────────────┤
│                                    │
├─► postMessage({ type: 'check',  ──►│
│     text: '...' })                 ├─► checker.checkText(text)
│                                    │
│◄── { type: 'result', ──────────────┤
│     errors: [...] }                │
│                                    │
└─► SpellcheckOverlay renders        │
    • Green underlines (auto-fix)    │
    • Cyan underlines (suggestions)  │
    • 5s countdown for approval      │
```

### useSpellcheck Hook

```typescript
import { useSpellcheck } from './hooks/useSpellcheck';

const { errors, isReady, isChecking } = useSpellcheck(textValue, {
  debounceMs: 300,
  autoApproveConfidence: 0.5,
  timeoutMode: 'auto-approve',
});
```

### SpellcheckOverlay Component

```typescript
import { SpellcheckOverlay } from './components/SpellcheckOverlay';

applyCorrection(correction)} onDismiss={(error) => dismissError(error)} />
```
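
The `init` / `ready` / `check` / `result` exchange in the diagram above could be typed as discriminated unions, with the worker logic kept in a plain function so it can be exercised without a real `Worker`. This is a sketch under assumptions: `WorkerRequest`, `WorkerResponse`, `SketchError`, and `createHandler` are hypothetical names, and the actual `spellcheck.worker.ts` may structure its messages differently.

```typescript
// Hypothetical message shapes inferred from the diagram; not the actual worker source.
interface SketchError {
  word: string;
  suggestions: string[];
}

type WorkerRequest = { type: 'init' } | { type: 'check'; text: string };
type WorkerResponse = { type: 'ready' } | { type: 'result'; errors: SketchError[] };

// Builds a message handler around injected init/check functions so the
// protocol can be tested without WASM, dictionaries, or a Worker instance.
function createHandler(
  init: () => Promise<void>,
  checkText: (text: string) => Promise<SketchError[]>,
): (msg: WorkerRequest) => Promise<WorkerResponse> {
  let ready = false;
  return async (msg) => {
    if (msg.type === 'init') {
      await init(); // e.g. engine.init() in the real worker
      ready = true;
      return { type: 'ready' };
    }
    if (!ready) {
      throw new Error("'check' received before 'init' completed");
    }
    return { type: 'result', errors: await checkText(msg.text) };
  };
}
```

Inside an actual worker, the returned handler would typically be called from `onmessage` and its result passed to `postMessage`; that wiring is omitted here so the sketch stays environment-independent.
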