text-processing-utils/docs/spellcheck.md
2026-02-26 19:27:04 -08:00

# Spellcheck System — API Reference
`@lilith/text-processing-utils` spellcheck module.

**Engine**: SymSpell via `@lilith/spellchecker-wasm` (WebAssembly, edit-distance + corpus frequency).

**Architecture**: Engine-first checking → multi-factor confidence scoring → pattern-based split/joined word detection → 14 pluggable feature detectors.

---
## Quick Start
### Browser (with WASM engine)
```typescript
import { SpellChecker, SymSpellEngine } from '@lilith/text-processing-utils';

const engine = new SymSpellEngine({
  wasmUrl: '/spellcheck-data/spellchecker-wasm.wasm',
  dictionaryUrl: '/spellcheck-data/frequency-dictionary.txt',
  bigramUrl: '/spellcheck-data/frequency-bigrams.txt',
});
await engine.init();

const checker = new SpellChecker({ engine, autoCorrect: true });
await checker.initialize();

// Single word
const result = await checker.check('teh');
// { word: 'teh', correct: false, suggestions: ['the', ...], confidence: 0.92 }

// Auto-correct a sentence
const fixed = await checker.fix('teh quikc brwon fox');
// 'the quick brown fox'

// Full diagnostic (positions, severities, split/joined words)
const report = await checker.checkText('teh quikc brwon fox');
// { errors: [...], stats: { totalWords: 4, misspelledWords: 3, ... } }
```
### Node.js (without engine — Trie dictionaries)
```typescript
import { SpellChecker, NodeDictionaryLoader } from '@lilith/text-processing-utils';

const checker = new SpellChecker({
  dictionaries: ['english', 'technical'],
  loader: new NodeDictionaryLoader('/path/to/data'),
});
await checker.initialize();

const result = await checker.check('recieve');
```
---
## Architecture
```
Input text
  ├─► tokenize ──► FOR EACH word:
  │     │
  │     ├─► shouldIgnoreWord? (numbers, URLs, emails, acronyms)
  │     │     └─► YES → skip
  │     │
  │     ├─► normalizeWord (lowercase, strip possessives)
  │     │
  │     ├─► engine.contains(word)
  │     │     └─► YES → correct
  │     │
  │     ├─► engine.suggest(word, max)
  │     │     └─► SymSpell: edit-distance candidates sorted by corpus frequency
  │     │
  │     ├─► ConfidenceScorer.calculateConfidence()
  │     │     Weighted factors:
  │     │       40% edit distance (Damerau-Levenshtein)
  │     │       20% phonetic match (Soundex + Metaphone)
  │     │       20% keyboard proximity (QWERTY adjacency)
  │     │       10% word frequency (corpus rank)
  │     │       10% context fit
  │     │     Modifiers: ×1.15 unique suggestion, ×0.85 if >2 alternatives
  │     │
  │     ├─► Technical context adjustment
  │     │     camelCase/snake_case → ×0.5
  │     │     version strings → ×0.3
  │     │     hex values → ×0.2
  │     │     code blocks → ×0.7
  │     │
  │     └─► decideAction(suggestions, confidence)
  │           ≥ 0.70 → AUTO_FIX
  │           ≥ 0.50 → SUGGEST
  │           ≥ 0.30 → POSSIBLE
  │           < 0.30 → IGNORE
  ├─► Bigram context rescoring (checkText only)
  │     For ambiguous corrections, rescore using bigram frequencies
  │     with adjacent words. Promotes contextually natural choices.
  │     "teh nwe" → bigram("the","new") >> bigram("tea","new")
  ├─► Split-word detection (TypoManager → SplitWordDetector)
  │     "ist he" → "is the", "ot her" → "other"
  │     48+ hardcoded regex patterns + dictionary pair checking
  └─► Joined-word detection (TypoManager → JoinedWordDetector)
        "javascript" → "java script", "testcase" → "test case"
        Known patterns + dictionary split + camelCase heuristics
```
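
The bigram rescoring step in the diagram can be pictured with a small standalone sketch. This is not the library's implementation: `rescoreWithBigrams`, the `Candidate` shape, and the toy counts are hypothetical, and the real checker may weight context differently. It only shows the idea of preferring the candidate that forms the more frequent bigram with its neighbours.

```typescript
// Hypothetical helper: rank candidate corrections by bigram counts with neighbours.
type BigramLookup = (left: string, right: string) => number;

interface Candidate {
  word: string;
  frequency: number; // unigram corpus count, used only as a tie-breaker
}

function rescoreWithBigrams(
  candidates: Candidate[],
  prev: string | undefined,
  next: string | undefined,
  bigram: BigramLookup,
): Candidate[] {
  const contextScore = (c: Candidate): number =>
    (prev ? bigram(prev, c.word) : 0) + (next ? bigram(c.word, next) : 0);
  // Sort a copy: higher context score first, then higher unigram frequency.
  return [...candidates].sort(
    (a, b) => contextScore(b) - contextScore(a) || b.frequency - a.frequency,
  );
}

// Toy bigram table (made-up counts, for illustration only).
const toyBigrams = new Map<string, number>([
  ['the new', 5000],
  ['tea new', 3],
]);
const lookup: BigramLookup = (l, r) => toyBigrams.get(`${l} ${r}`) ?? 0;

const ranked = rescoreWithBigrams(
  [
    { word: 'tea', frequency: 900 },
    { word: 'the', frequency: 800 },
  ],
  undefined,
  'new',
  lookup,
);
console.log(ranked[0].word); // 'the'
```

In the real system the lookup would presumably be backed by `engine.bigramFrequency()`, which is optional on the `SpellEngine` interface.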
---
## SpellChecker
The main entry point. Orchestrates engine, dictionaries, confidence scoring, and pattern detection.
```typescript
import { SpellChecker } from '@lilith/text-processing-utils';
```
### Constructor
```typescript
new SpellChecker(options?: SpellCheckOptions)
```
```typescript
interface SpellCheckOptions {
  dictionaries?: string[];  // ['english', 'technical'] (Trie-based fallback)
  customWords?: string[];  // Words to add to custom dictionary
  autoCorrect?: boolean;  // Enable fix() auto-correction (default: false)
  threshold?: number;  // Min confidence to report (default: 0)
  maxSuggestions?: number;  // Max suggestions per word (default: 5)
  caseSensitive?: boolean;  // Case-sensitive checking (default: false)
  ignoreNumbers?: boolean;  // Skip numeric tokens (default: true)
  ignoreUrls?: boolean;  // Skip URL tokens (default: true)
  ignoreEmails?: boolean;  // Skip email tokens (default: true)
  ignoreCamelCase?: boolean;  // Skip camelCase identifiers (default: false)
  minWordLength?: number;  // Skip words shorter than this (default: 1)
  confidenceThresholds?: {
    autoFix?: number;  // Default: 0.7
    suggest?: number;  // Default: 0.5
    possible?: number;  // Default: 0.3
  };
  enableSplitWordDetection?: boolean;  // Default: true
  enableJoinedWordDetection?: boolean;  // Default: true
  loader?: DictionaryDataLoader;  // Trie dictionary file loader
  engine?: SpellEngine;  // SymSpell engine (recommended)
}
```
### Methods
#### `initialize(): Promise<void>`
Loads dictionaries and verifies engine readiness. Called automatically on the first `check()` if not called explicitly.
#### `check(word: string): Promise<SpellCheckResult>`
Check a single word.
```typescript
const result = await checker.check('recieve');
// {
//   word: 'recieve',
//   correct: false,
//   suggestions: ['receive', 'relieve', ...],
//   confidence: 0.87,
//   correctionDecision: {
//     action: 'auto-fix',
//     confidence: 0.87,
//     suggestion: 'receive',
//     reason: 'High confidence typo'
//   }
// }
```
#### `fix(text: string): Promise<string>`
Auto-correct a sentence. Only applies corrections with `AUTO_FIX` confidence (≥ 0.70). Also applies split-word and joined-word corrections with confidence ≥ 0.80.
```typescript
const fixed = await checker.fix('teh quikc brwon fox');
// 'the quick brown fox'
```
#### `checkText(text: string): Promise<BatchSpellCheckResult>`
Full diagnostic for UI rendering. Returns positioned errors with severities, including split-word and joined-word detections. Applies bigram context rescoring.
```typescript
const report = await checker.checkText('teh quikc fox ist he best');
// {
//   errors: [
//     {
//       type: 'misspelling',
//       word: 'teh',
//       suggestions: ['the', ...],
//       severity: 'error',
//       position: { start: 0, end: 3 },
//       confidence: 0.92,
//       correctionAction: 'auto-fix'
//     },
//     {
//       type: 'split-word',
//       word: 'ist he',
//       suggestions: ['is the'],
//       severity: 'error',
//       position: { start: 14, end: 20 },
//       confidence: 0.95
//     }
//   ],
//   stats: {
//     totalWords: 6,
//     misspelledWords: 2,
//     correctedWords: 0,
//     ignoredWords: 0,
//     processingTime: 12
//   }
// }
```
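
When a UI applies positioned corrections from a report like this, the order of application matters because each replacement can shift later offsets. The sketch below is a standalone illustration, not part of the package: `PositionedFix` and `applyFixes` are hypothetical names, and it assumes `position.end` is exclusive, as the `'teh'` entry (`0` to `3`) suggests.

```typescript
// Hypothetical shape: one replacement derived from a reported error.
interface PositionedFix {
  start: number; // inclusive character offset
  end: number; // exclusive character offset (assumed)
  replacement: string;
}

function applyFixes(text: string, fixes: PositionedFix[]): string {
  // Apply from the end of the string so earlier offsets stay valid.
  const ordered = [...fixes].sort((a, b) => b.start - a.start);
  let out = text;
  for (const f of ordered) {
    out = out.slice(0, f.start) + f.replacement + out.slice(f.end);
  }
  return out;
}

const sample = 'teh quikc fox ist he best';
const patched = applyFixes(sample, [
  { start: 0, end: 3, replacement: 'the' },
  { start: 14, end: 20, replacement: 'is the' },
]);
console.log(patched); // 'the quikc fox is the best'
```

Sorting by descending `start` avoids having to track a running offset delta; it assumes the fixes do not overlap.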
#### `addWord(word: string, dictionaryName?: string): void`
Add a word to a dictionary (default: `'custom'`).
#### `removeWord(word: string, dictionaryName?: string): boolean`
Remove a word from a dictionary.
#### `getDictionaryNames(): string[]`
List all loaded dictionary names.
#### `addSplitWordPattern(splitForm: string, correctForm: string, confidence?: number): void`
Register a custom split-word pattern.
```typescript
checker.addSplitWordPattern('with out', 'without', 0.9);
```
#### `checkWordPair(word1: string, word2: string): SplitWordDetection | null`
Check if two adjacent words are a split-word error.
#### `detectSplitWords(text: string): SplitWordDetection[]`
Detect all split-word errors in text.
#### `setSplitWordDetection(enabled: boolean): void`
Enable/disable split-word detection.
#### `clearCache(): void`
Clear internal caches.

---
## SpellEngine Interface
The abstraction over the spell-checking backend. SymSpell is the production implementation.
```typescript
interface SpellEngine {
  isReady(): boolean;
  contains(word: string): boolean;
  suggest(word: string, maxSuggestions?: number): SpellSuggestion[];
  addWord(word: string, frequency?: number): void;
  bigramFrequency?(word1: string, word2: string): number;  // optional
}

interface SpellSuggestion {
  word: string;
  distance: number;  // edit distance from input
  frequency: number;  // corpus frequency count
}
```
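
Because the checker only depends on this interface, a tiny in-memory engine can stand in for SymSpell in unit tests. The sketch below is an assumption-laden illustration: `MapSpellEngine` is a hypothetical class, the interfaces are re-declared locally so the snippet is self-contained, and it uses a brute-force Levenshtein scan rather than SymSpell's precomputed deletes.

```typescript
interface SpellSuggestion {
  word: string;
  distance: number;
  frequency: number;
}

interface SpellEngine {
  isReady(): boolean;
  contains(word: string): boolean;
  suggest(word: string, maxSuggestions?: number): SpellSuggestion[];
  addWord(word: string, frequency?: number): void;
}

// Plain Levenshtein distance (no transpositions), two-row dynamic programming.
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical test double; a linear scan is fine for small word lists.
class MapSpellEngine implements SpellEngine {
  private words = new Map<string, number>();
  private maxEditDistance: number;

  constructor(maxEditDistance = 2) {
    this.maxEditDistance = maxEditDistance;
  }

  isReady(): boolean {
    return true;
  }

  contains(word: string): boolean {
    return this.words.has(word.toLowerCase());
  }

  addWord(word: string, frequency = 1): void {
    this.words.set(word.toLowerCase(), frequency);
  }

  suggest(word: string, maxSuggestions = 5): SpellSuggestion[] {
    const target = word.toLowerCase();
    const out: SpellSuggestion[] = [];
    this.words.forEach((frequency, candidate) => {
      const distance = levenshtein(target, candidate);
      if (distance <= this.maxEditDistance) out.push({ word: candidate, distance, frequency });
    });
    // Closest first, then most frequent.
    out.sort((x, y) => x.distance - y.distance || y.frequency - x.frequency);
    return out.slice(0, maxSuggestions);
  }
}

const stub = new MapSpellEngine();
stub.addWord('hello', 100);
stub.addWord('help', 50);
console.log(stub.suggest('helo').map((s) => s.word)); // ['hello', 'help']
```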
### SymSpellEngine
Production engine using `@lilith/spellchecker-wasm`.
```typescript
import { SymSpellEngine } from '@lilith/text-processing-utils';

const engine = new SymSpellEngine({
  wasmUrl: '/spellcheck-data/spellchecker-wasm.wasm',
  dictionaryUrl: '/spellcheck-data/frequency-dictionary.txt',
  bigramUrl: '/spellcheck-data/frequency-bigrams.txt',  // optional
  maxEditDistance: 2,  // default: 2
});
await engine.init();

engine.contains('hello');  // true
engine.suggest('helo', 5);  // [{ word: 'hello', distance: 1, frequency: 23135851162 }, ...]
engine.addWord('myterm', 1000);  // add to runtime dictionary
engine.bigramFrequency('the', 'quick');  // bigram corpus count
```
**Dictionary format** (`frequency-dictionary.txt`):
```
the 23135851162
of 13151942776
and 12997637966
```
**Bigram format** (`frequency-bigrams.txt`):
```
the the 93748009
of the 87821839
in the 72063547
```
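
Both files appear to be whitespace-separated text with the count in the last column. The engine parses them internally, so the following is only a sketch for tooling that needs to inspect or validate the files; `parseFrequencyLines` is a hypothetical helper and the exact separator handling is an assumption.

```typescript
// Hypothetical helper: parse "<token(s)> <count>" lines into a lookup table.
// keyTokens is 1 for the unigram dictionary and 2 for the bigram file.
function parseFrequencyLines(text: string, keyTokens: number): Map<string, number> {
  const table = new Map<string, number>();
  for (const line of text.split(/\r?\n/)) {
    const parts = line.trim().split(/\s+/);
    if (parts.length !== keyTokens + 1) continue; // skip blank or malformed lines
    const count = Number(parts[keyTokens]);
    if (!Number.isFinite(count)) continue; // skip lines whose count is not numeric
    table.set(parts.slice(0, keyTokens).join(' '), count);
  }
  return table;
}

const unigrams = parseFrequencyLines('the 23135851162\nof 13151942776\n', 1);
const bigrams = parseFrequencyLines('of the 87821839\nin the 72063547\n', 2);
console.log(unigrams.get('the')); // 23135851162
console.log(bigrams.get('in the')); // 72063547
```

Counts of this size exceed 32-bit integers but are well below `Number.MAX_SAFE_INTEGER`, so a plain `number` is sufficient.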
---
## ConfidenceScorer
Multi-factor confidence scoring for spell corrections. Determines whether to auto-fix, suggest, or ignore.
```typescript
import { ConfidenceScorer, CorrectionConfidence } from '@lilith/text-processing-utils';
```
### Constructor
```typescript
new ConfidenceScorer(options?: ConfidenceScorerOptions)
```
```typescript
interface ConfidenceScorerOptions {
  thresholds?: {
    autoFix?: number;  // Default: 0.7
    suggest?: number;  // Default: 0.5
    possible?: number;  // Default: 0.3
  };
}
```
### Methods
#### `calculateConfidence(original, suggestion, additionalSuggestions?, engineFrequency?): number`
Returns a score from 0.0 to 1.0 based on weighted factors:

| Factor | Weight | Source |
|--------|--------|--------|
| Edit distance | 40% | Damerau-Levenshtein (transpositions, insertions, deletions, substitutions) |
| Phonetic match | 20% | Soundex OR Metaphone `soundsLike()` |
| Keyboard proximity | 20% | QWERTY adjacency map (adjacent-key substitutions and insertions) |
| Word frequency | 10% | Static corpus rank table or engine frequency |
| Context fit | 10% | Default 0.5 (reserved for future context analyzer) |

Modifiers:
- Case-only difference → 0.85 (fixed)
- Unique suggestion (no alternatives) → ×1.15
- More than 2 alternatives → ×0.85
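
As a worked illustration of the weights and modifiers above, the sketch below combines five per-factor scores (each assumed to be normalized to 0..1) into one confidence value. `combineFactors` and `FactorScores` are hypothetical names; how the real scorer derives each factor, and whether it clamps the result, are assumptions here. The fixed 0.85 for case-only differences is not modelled.

```typescript
interface FactorScores {
  editDistance: number;
  phonetic: number;
  keyboard: number;
  frequency: number;
  context: number;
}

// Weights taken from the table above.
const WEIGHTS: FactorScores = {
  editDistance: 0.4,
  phonetic: 0.2,
  keyboard: 0.2,
  frequency: 0.1,
  context: 0.1,
};

function combineFactors(scores: FactorScores, alternativeCount: number): number {
  let total = 0;
  for (const key of Object.keys(WEIGHTS) as (keyof FactorScores)[]) {
    total += WEIGHTS[key] * scores[key];
  }
  if (alternativeCount === 0) total *= 1.15; // unique suggestion
  else if (alternativeCount > 2) total *= 0.85; // more than 2 alternatives
  return Math.min(1, Math.max(0, total)); // clamp to 0..1 (assumed)
}

const score = combineFactors(
  { editDistance: 0.8, phonetic: 1, keyboard: 0.5, frequency: 0.9, context: 0.5 },
  1,
);
// 0.4*0.8 + 0.2*1 + 0.2*0.5 + 0.1*0.9 + 0.1*0.5 = 0.76, no modifier for one alternative
console.log(score.toFixed(2)); // '0.76'
```

With the same factor scores, a unique suggestion would be boosted to about 0.87 and a crowded candidate list reduced to about 0.65, which moves the result across the default `SUGGEST`/`AUTO_FIX` boundary.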
#### `decideAction(suggestions, confidence): CorrectionDecision`
```typescript
enum CorrectionConfidence {
  AUTO_FIX = 'auto-fix',  // ≥ 0.70: safe to auto-correct
  SUGGEST = 'suggest',  // ≥ 0.50: show suggestions to user
  POSSIBLE = 'possible',  // ≥ 0.30: might be intentional
  IGNORE = 'ignore',  // < 0.30: probably intentional
}

interface CorrectionDecision {
  action: CorrectionConfidence;
  confidence: number;
  suggestion?: string;  // Best suggestion (for AUTO_FIX / SUGGEST)
  alternatives?: string[];  // Additional options (for SUGGEST / POSSIBLE)
  reason: string;  // Human-readable explanation
}
```
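
The threshold bands reduce to a simple cascade. The sketch below is a standalone illustration using string literals instead of the package's enum; `classify` and `Thresholds` are hypothetical names, and the defaults mirror the documented 0.7 / 0.5 / 0.3 values.

```typescript
type CorrectionAction = 'auto-fix' | 'suggest' | 'possible' | 'ignore';

interface Thresholds {
  autoFix: number;
  suggest: number;
  possible: number;
}

const DEFAULT_THRESHOLDS: Thresholds = { autoFix: 0.7, suggest: 0.5, possible: 0.3 };

// Map a confidence score to an action band; lower bounds are inclusive.
function classify(confidence: number, t: Thresholds = DEFAULT_THRESHOLDS): CorrectionAction {
  if (confidence >= t.autoFix) return 'auto-fix';
  if (confidence >= t.suggest) return 'suggest';
  if (confidence >= t.possible) return 'possible';
  return 'ignore';
}

console.log(classify(0.87)); // 'auto-fix'
console.log(classify(0.7)); // 'auto-fix'
console.log(classify(0.42)); // 'possible'
console.log(classify(0.1)); // 'ignore'
```

Raising `autoFix` (for example to 0.9) is the conservative choice when silent corrections are more costly than an extra suggestion prompt.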
#### `isTechnicalIdentifier(word: string): boolean`
Detects camelCase, PascalCase, snake_case, CONST_CASE, and `get`/`set`/`is`/`has`/`with`/`on` prefixed identifiers.
#### `adjustForTechnicalContext(baseConfidence, word, inCodeBlock?): number`
Reduces confidence for technical patterns:
- camelCase / snake_case → ×0.5
- Version strings (`v1.2.3`) → ×0.3
- Hex values (`a1b2c3`) → ×0.2
- Inside code blocks → ×0.7
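
A rough sketch of how such multipliers could be applied follows. The regular expressions, the precedence between patterns, and the choice to stack the code-block penalty on top of a pattern penalty are all assumptions made for illustration; `adjustConfidence` is a hypothetical stand-in, not the library's `adjustForTechnicalContext`.

```typescript
// Assumed patterns; the package's actual detection rules may differ.
const CAMEL_OR_SNAKE = /^[a-z]+(?:[A-Z][a-z0-9]*)+$|^[a-z0-9]+(?:_[a-z0-9]+)+$/;
const VERSION = /^v?\d+(?:\.\d+)+$/;
const HEX = /^(?=.*\d)[0-9a-f]{6,}$/i; // require a digit so words like "facade" do not match

function adjustConfidence(base: number, word: string, inCodeBlock = false): number {
  let factor = 1;
  if (VERSION.test(word)) factor = 0.3;
  else if (HEX.test(word)) factor = 0.2;
  else if (CAMEL_OR_SNAKE.test(word)) factor = 0.5;
  if (inCodeBlock) factor *= 0.7; // assumption: the code-block penalty stacks
  return base * factor;
}

console.log(adjustConfidence(0.8, 'getUserName').toFixed(2)); // '0.40'
console.log(adjustConfidence(0.8, 'v1.2.3').toFixed(2)); // '0.24'
console.log(adjustConfidence(0.8, 'a1b2c3').toFixed(2)); // '0.16'
console.log(adjustConfidence(0.8, 'receive', true).toFixed(2)); // '0.56'
```

With the default thresholds, a camelCase token that scored 0.8 drops to 0.4 and is demoted from `AUTO_FIX` to `POSSIBLE`.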
---
## TypoManager
Coordinates split-word and joined-word pattern detection. Does **not** handle single-word typo correction (that's the engine's job).
```typescript
import { TypoManager } from '@lilith/text-processing-utils';
```
### Constructor
```typescript
new TypoManager(enableSplitWords?: boolean, enableJoinedWords?: boolean)
// Both default to true
```
### Methods
```typescript
setDictionaryChecker(checker: (word: string) => boolean): void
// Split-word detection
detectSplitWords(text: string): SplitWordDetection[]
checkWordPair(word1: string, word2: string): SplitWordDetection | null
addSplitWordPattern(splitForm: string, correctForm: string, confidence?: number): void
setSplitWordDetection(enabled: boolean): void
isSplitWordDetectionEnabled(): boolean
// Joined-word detection
detectJoinedWords(text: string): JoinedWordDetection[]
addJoinedWordPattern(joinedForm: string, splitWords: string[], confidence?: number, category?: string): void
setJoinedWordDetection(enabled: boolean): void
isJoinedWordDetectionEnabled(): boolean
getJoinedWordDetector(): JoinedWordDetector
// Management
clearDetectorCaches(): void
getStats(): { splitWordDetectionEnabled, joinedWordDetectionEnabled, joinedWordPatternsEnabled }
```
---
## SplitWordDetector
Detects accidentally split words: `"ist he"` → `"is the"`, `"th e"` → `"the"`.
```typescript
import { SplitWordDetector } from '@lilith/text-processing-utils';
```
### Detection Methods
48+ hardcoded regex patterns for common splits, plus dictionary-based pair checking for unknown patterns.
```typescript
interface SplitWordDetection {
  originalText: string;  // "ist he"
  splitWords: string[];  // ["ist", "he"]
  suggestedCorrection: string;  // "is the"
  confidence: number;  // 0.0 to 1.0
  startPosition: number;
  endPosition: number;
  pattern?: SplitWordPattern;
}
```
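
The two detection paths (known patterns first, then a dictionary check on the pair) might look roughly like the sketch below. It is a simplified, hypothetical `checkPair`, not the detector's actual logic: the fallback rule (joined form is a word while at least one part is not) and the 0.6 fallback confidence are assumptions.

```typescript
interface PairDetection {
  originalText: string;
  suggestedCorrection: string;
  confidence: number;
}

function checkPair(
  word1: string,
  word2: string,
  knownPatterns: Map<string, { correctForm: string; confidence: number }>,
  isWord: (w: string) => boolean,
): PairDetection | null {
  const originalText = `${word1} ${word2}`;
  // 1. Known pattern lookup.
  const known = knownPatterns.get(originalText.toLowerCase());
  if (known) {
    return { originalText, suggestedCorrection: known.correctForm, confidence: known.confidence };
  }
  // 2. Dictionary fallback: the joined form is a word and at least one part is not.
  const joined = (word1 + word2).toLowerCase();
  const bothPartsValid = isWord(word1.toLowerCase()) && isWord(word2.toLowerCase());
  if (isWord(joined) && !bothPartsValid) {
    return { originalText, suggestedCorrection: joined, confidence: 0.6 }; // arbitrary value
  }
  return null;
}

const patterns = new Map([['ist he', { correctForm: 'is the', confidence: 0.95 }]]);
const lexicon = new Set(['is', 'the', 'he', 'her', 'other', 'with', 'out', 'without']);
const inLexicon = (w: string): boolean => lexicon.has(w);

console.log(checkPair('ist', 'he', patterns, inLexicon)?.suggestedCorrection); // 'is the'
console.log(checkPair('ot', 'her', patterns, inLexicon)?.suggestedCorrection); // 'other'
console.log(checkPair('with', 'out', patterns, inLexicon)); // null
```

The last call shows why the fallback requires an invalid part: "with out" can be legitimate, so a case like that is better handled by an explicit pattern such as `addSplitWordPattern('with out', 'without', 0.9)`.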
### Constructor
```typescript
new SplitWordDetector(
  dictionaryChecker?: (word: string) => boolean,
  maxWordSequence?: number  // default: 4
)
```
### Key Methods
```typescript
detectSplitWords(text: string): SplitWordDetection[]
checkWordPair(word1: string, word2: string): SplitWordDetection | null
addCustomPattern(splitForm: string, correctForm: string, confidence?: number, context?: string): void
getKnownPatterns(): Map<string, SplitWordPattern>
hasPattern(splitForm: string): boolean
clearCache(): void
getCacheStats(): { detectionCacheSize, dictionaryCacheSize, maxCacheSize }
```
---
## JoinedWordDetector
Detects accidentally joined words: `"javascript"` → `"java script"`, `"testcase"` → `"test case"`.
```typescript
import { JoinedWordDetector } from '@lilith/text-processing-utils';
```
### Detection Methods
Three detection strategies in priority order:
1. **Known patterns** — hardcoded compound words (confidence: 0.95)
2. **Dictionary-based** — split at every position, check both halves against dictionary
3. **Heuristic** — camelCase splitting (when `enableHeuristics: true`)
```typescript
interface JoinedWordDetection {
  originalWord: string;  // "javascript"
  suggestedSplit: string;  // "java script"
  splitWords: string[];  // ["java", "script"]
  confidence: number;
  startPosition: number;
  endPosition: number;
  patternType: 'known-pattern' | 'dictionary-based' | 'heuristic';
  reason: string;
}
```
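
The dictionary-based strategy can be sketched as a scan over every split position. This is a minimal illustration under stated assumptions: `splitJoinedWord` is a hypothetical function, only two-part splits are considered, and the default `minSplitLength` of 2 is chosen for the example rather than taken from the package.

```typescript
// Try every split position and keep the ones where both halves are words.
function splitJoinedWord(
  word: string,
  isWord: (w: string) => boolean,
  minSplitLength = 2,
): string[][] {
  const results: string[][] = [];
  for (let i = minSplitLength; i <= word.length - minSplitLength; i++) {
    const left = word.slice(0, i);
    const right = word.slice(i);
    if (isWord(left) && isWord(right)) results.push([left, right]);
  }
  return results;
}

const vocab = new Set(['test', 'case', 'java', 'script', 'a', 'part']);
const inVocab = (w: string): boolean => vocab.has(w);

console.log(splitJoinedWord('testcase', inVocab)); // [['test', 'case']]
console.log(splitJoinedWord('apart', inVocab)); // [] because 'a' is shorter than minSplitLength
```

A minimum part length keeps one-letter words such as "a" from producing spurious splits of ordinary words.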
### Constructor
```typescript
new JoinedWordDetector(options?: JoinedWordDetectorOptions)
```
```typescript
interface JoinedWordDetectorOptions {
  maxWordLength?: number;  // Skip words longer than this
  minSplitLength?: number;  // Minimum length for split parts
  enableHeuristics?: boolean;  // Enable camelCase heuristic (default: false)
  confidenceThreshold?: number;  // Filter results below this threshold
  maxCacheSize?: number;  // LRU cache size
}
```
---
## Feature System
SOLID-based pluggable feature architecture. Each feature implements `SpellCheckFeature` and can be composed via `FeatureManager`.
```typescript
interface SpellCheckFeature<TConfig = unknown> {
  name: string;
  enabled: boolean;
  initialize(): Promise<void>;
  checkText(text: string): Promise<FeatureResult[]>;
  configure(options: Partial<TConfig>): void;
}

interface FeatureResult {
  type: string;
  originalText: string;
  suggestedCorrection: string;
  confidence: number;
  startPosition: number;
  endPosition: number;
  metadata?: Record<string, unknown>;
}
```
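
To show what implementing this contract involves, here is a sketch of a custom feature that flags immediately repeated words ("the the"). `DoubledWordFeature`, `DoubledWordConfig`, and `findDoubledWords` are hypothetical and not shipped with the package; the two interfaces are re-declared locally so the snippet stands alone.

```typescript
interface FeatureResult {
  type: string;
  originalText: string;
  suggestedCorrection: string;
  confidence: number;
  startPosition: number;
  endPosition: number;
  metadata?: Record<string, unknown>;
}

interface SpellCheckFeature<TConfig = unknown> {
  name: string;
  enabled: boolean;
  initialize(): Promise<void>;
  checkText(text: string): Promise<FeatureResult[]>;
  configure(options: Partial<TConfig>): void;
}

interface DoubledWordConfig {
  confidence: number;
}

// Pure detection logic, kept synchronous so it is easy to test.
function findDoubledWords(text: string, confidence: number): FeatureResult[] {
  const results: FeatureResult[] = [];
  const pattern = /\b(\w+)\s+\1\b/gi; // a word followed by whitespace and the same word
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    results.push({
      type: 'doubled-word',
      originalText: m[0],
      suggestedCorrection: m[1],
      confidence,
      startPosition: m.index,
      endPosition: m.index + m[0].length,
    });
  }
  return results;
}

class DoubledWordFeature implements SpellCheckFeature<DoubledWordConfig> {
  name = 'doubled-word-detection';
  enabled = true;
  private config: DoubledWordConfig = { confidence: 0.9 };

  async initialize(): Promise<void> {
    // Nothing to load for this feature.
  }

  async checkText(text: string): Promise<FeatureResult[]> {
    return this.enabled ? findDoubledWords(text, this.config.confidence) : [];
  }

  configure(options: Partial<DoubledWordConfig>): void {
    this.config = { ...this.config, ...options };
  }
}

const doubled = findDoubledWords('I went to the the store', 0.9);
console.log(doubled[0].originalText); // 'the the'
```

Keeping the detection in a pure function and the class as a thin adapter makes the feature testable without an async harness.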
### FeatureManager
```typescript
import { FeatureManager } from '@lilith/text-processing-utils';
const manager = new FeatureManager();
manager.addFeature(new CapitalizationFeature());
manager.addFeature(new GrammarPatternFeature());
manager.addFeature(new PunctuationFeature());
await manager.initializeAll();
const results = await manager.checkText('i went too the store.');
// Deduplicates overlapping results automatically
```
### Available Features
| Feature | `name` | Detects |
|---------|--------|---------|
| `SplitWordFeature` | `split-word-detection` | Accidentally split words |
| `JoinedWordFeature` | `joined-word-detection` | Accidentally joined words |
| `CapitalizationFeature` | `capitalization-detection` | Sentence-start, proper nouns, acronyms, title case |
| `GrammarPatternFeature` | `grammar-pattern-detection` | Articles, homophones, contractions, agreement, double negatives |
| `PunctuationFeature` | `punctuation-detection` | Missing/extra punctuation, quote style, bracket matching |
| `HomophoneFeature` | `homophone-detection` | there/their/they're, its/it's, etc. |
| `RedundancyFeature` | `redundancy-detection` | Redundant phrases ("ATM machine", "free gift") |
| `AbbreviationFeature` | `abbreviation-detection` | Abbreviation consistency |
| `TechnicalConsistencyFeature` | `technical-consistency` | Technical term consistency (JavaScript vs Javascript) |

Each feature has a corresponding `*Factory` class with presets:
```typescript
import { GrammarPatternFeatureFactory } from '@lilith/text-processing-utils';
const strict = GrammarPatternFeatureFactory.createStrict();
const relaxed = GrammarPatternFeatureFactory.createRelaxed();
const technical = GrammarPatternFeatureFactory.createTechnicalWriting();
const custom = GrammarPatternFeatureFactory.createCustom({ checkHomophones: true, checkArticles: false });
```
---
## Dictionary System
### DictionaryManager
Manages multiple named dictionaries with priority ordering.
```typescript
import { DictionaryManager, NodeDictionaryLoader } from '@lilith/text-processing-utils';

const manager = new DictionaryManager(new NodeDictionaryLoader('/path/to/data'));
await manager.initialize([
  { name: 'english', type: 'english', priority: 1 },
  { name: 'technical', type: 'technical', priority: 2 },
]);

manager.contains('function');  // true (checks all)
manager.contains('kubectl', ['technical']);  // true (checks specific)
manager.getSuggestions('functon', 5);  // ['function', ...]
manager.addWordToDictionary('myterm', 'custom');
```
### Dictionary Types
| Class | Loads From |
|-------|-----------|
| `EnglishDictionary` | `english-words.txt` via loader |
| `TechnicalDictionary` | `technical-terms.txt` via loader |
| `CustomDictionary` | In-memory, optionally seeded with word list |
### DictionaryDataLoader
```typescript
interface DictionaryDataLoader {
  loadText(path: string): Promise<string>;
  exists(path: string): Promise<boolean>;
}
```
Two implementations:
- `NodeDictionaryLoader(rootPath: string)`: `fs.readFile` based (Node.js)
- `FetchDictionaryLoader(baseUrl: string)`: HTTP `fetch` based (browser)
### DictionaryPersistence
Import/export dictionaries to disk as JSON or plain text.
```typescript
import { DictionaryPersistence } from '@lilith/text-processing-utils';
const persistence = new DictionaryPersistence('.dictionaries');
await persistence.saveDictionary(myDict);
await persistence.exportAsText(myDict, 'output.txt');
const imported = await persistence.importFromText('input.txt', 'my-dict');
await persistence.backupDictionaries('backup/');
```
---
## Utility Classes
### BloomFilter
Probabilistic set membership. Used internally for fast negative lookups.
```typescript
import { BloomFilter } from '@lilith/text-processing-utils';
const filter = new BloomFilter(10000, 0.01); // 10K items, 1% false positive rate
filter.add('hello');
filter.addMany(['world', 'foo']);
filter.mightContain('hello'); // true
filter.definitelyNotContains('xyz'); // true (guaranteed)
filter.getStats(); // { size, numHashes, itemCount, saturation, estimatedFPR }
// Serialize
const data = filter.export();
const restored = BloomFilter.import(data);
```
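
For a sense of what the two constructor arguments imply, the standard Bloom filter sizing formulas are m = -n·ln(p) / (ln 2)² bits and k = (m/n)·ln 2 hash functions. Whether this class uses exactly these formulas is an assumption; `bloomParameters` below is a hypothetical helper that only evaluates them.

```typescript
// Textbook Bloom filter sizing: bits (m) and hash count (k) for n items at target rate p.
function bloomParameters(
  expectedItems: number,
  falsePositiveRate: number,
): { bits: number; hashes: number } {
  const ln2 = Math.LN2;
  const bits = Math.ceil((-expectedItems * Math.log(falsePositiveRate)) / (ln2 * ln2));
  const hashes = Math.max(1, Math.round((bits / expectedItems) * ln2));
  return { bits, hashes };
}

const sizing = bloomParameters(10_000, 0.01);
console.log(sizing); // roughly 95.9K bits (about 11.7 KiB) and 7 hash functions
```

At about 9.6 bits per item, a 1% filter for 10K words is far smaller than the word list itself, which is why it works well as a cheap pre-check before a Trie or engine lookup.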
### LRUCache / TTLCache
```typescript
import { TTLCache } from '@lilith/text-processing-utils';
const cache = new TTLCache<string, number>(1000, 60_000); // 1K entries, 60s TTL
cache.set('key', 42);
cache.get('key'); // 42
cache.getStats(); // { size, maxSize, hits, misses, hitRate }
cache.prune(); // Remove expired entries, returns count removed
```
---
## Result Types
```typescript
interface SpellCheckResult {
  word: string;
  correct: boolean;
  suggestions: string[];
  confidence: number;
  position?: { start: number; end: number; line?: number; column?: number };
  correctionDecision?: CorrectionDecision;
}

interface BatchSpellCheckResult {
  errors: SpellCheckError[];
  stats: {
    totalWords: number;
    misspelledWords: number;
    correctedWords: number;
    ignoredWords: number;
    processingTime: number;
  };
}

interface SpellCheckError {
  type: 'misspelling' | 'grammar' | 'capitalization' | 'punctuation' | 'split-word' | 'joined-word';
  word: string;
  message: string;
  suggestions: string[];
  severity: 'error' | 'warning' | 'info';
  position: { start: number; end: number; line?: number; column?: number };
  confidence?: number;
  correctionAction?: string;
  splitWords?: string[];
}
```
---
## Correction Strategies
```typescript
import { AutoCorrector, ContextualCorrector } from '@lilith/text-processing-utils';

// Auto-correct above confidence threshold
const auto = new AutoCorrector(0.75);
auto.shouldApply('teh');  // true
auto.correct('teh', ['the', 'tea']);  // 'the'

// Context-aware correction
const contextual = new ContextualCorrector();
contextual.correct('teh', ['the', 'tea'], 'I went to teh store');  // 'the'
```
Both implement `CorrectionStrategy`:
```typescript
interface CorrectionStrategy {
  name: string;
  correct(word: string, suggestions: string[], context?: string): string | null;
  shouldApply(word: string, context?: string): boolean;
}
```
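
A custom strategy only needs to satisfy this interface. As a sketch, the hypothetical `PreserveCaseCorrector` below picks the top suggestion and carries over the original word's capitalization; it is not part of the package, and the interface is re-declared locally so the snippet is self-contained.

```typescript
interface CorrectionStrategy {
  name: string;
  correct(word: string, suggestions: string[], context?: string): string | null;
  shouldApply(word: string, context?: string): boolean;
}

// Hypothetical strategy: take the top suggestion but keep the original capitalization.
class PreserveCaseCorrector implements CorrectionStrategy {
  name = 'preserve-case';

  shouldApply(word: string): boolean {
    return /[A-Za-z]/.test(word); // skip tokens with no letters
  }

  correct(word: string, suggestions: string[]): string | null {
    const best = suggestions[0];
    if (best === undefined || word.length === 0) return null;
    // ALL CAPS input stays ALL CAPS.
    if (word.length > 1 && word === word.toUpperCase()) return best.toUpperCase();
    // Capitalized input keeps its leading capital.
    const first = word.charAt(0);
    if (first === first.toUpperCase() && first !== first.toLowerCase()) {
      return best.charAt(0).toUpperCase() + best.slice(1);
    }
    return best;
  }
}

const keepCase = new PreserveCaseCorrector();
console.log(keepCase.correct('Teh', ['the', 'tea'])); // 'The'
console.log(keepCase.correct('TEH', ['the'])); // 'THE'
console.log(keepCase.correct('teh', [])); // null
```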
---
## Web Worker Integration
For browser use, the spellcheck system runs in a Web Worker to avoid blocking the main thread.
```
Main Thread                              Worker Thread
─────────────                            ─────────────
useSpellcheck() hook                     spellcheck.worker.ts
  │                                      │
  ├─► postMessage({ type: 'init' }) ────►│
  │                                      ├─► new SymSpellEngine()
  │                                      ├─► engine.init() (loads WASM + dict ~2s)
  │                                      ├─► new SpellChecker({ engine })
  │◄── { type: 'ready' } ────────────────┤
  │                                      │
  ├─► postMessage({ type: 'check', ─────►│
  │     text: '...' })                   ├─► checker.checkText(text)
  │                                      │
  │◄── { type: 'result', ────────────────┤
  │     errors: [...] }                  │
  │                                      │
  └─► SpellcheckOverlay renders          │
        • Green underlines (auto-fix)    │
        • Cyan underlines (suggestions)  │
        • 5s countdown for approval      │
```
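
The message exchange in the diagram maps naturally onto discriminated unions. The sketch below is an assumption about how the protocol could be typed, not the contents of `spellcheck.worker.ts`: `WorkerRequest`, `WorkerResponse`, `OverlayError`, and `handleRequest` are hypothetical names, and the checker is replaced by a synchronous stand-in (the real `checkText` is async).

```typescript
interface OverlayError {
  word: string;
  suggestions: string[];
}

// Message shapes inferred from the diagram above.
type WorkerRequest = { type: 'init' } | { type: 'check'; text: string };
type WorkerResponse = { type: 'ready' } | { type: 'result'; errors: OverlayError[] };

// Pure handler so the protocol can be unit-tested without a real Worker.
function handleRequest(
  msg: WorkerRequest,
  checkText: (text: string) => OverlayError[],
): WorkerResponse {
  switch (msg.type) {
    case 'init':
      return { type: 'ready' };
    case 'check':
      return { type: 'result', errors: checkText(msg.text) };
  }
}

// Stand-in for the real checker, for illustration only.
const fakeCheck = (text: string): OverlayError[] =>
  text.includes('teh') ? [{ word: 'teh', suggestions: ['the'] }] : [];

console.log(handleRequest({ type: 'init' }, fakeCheck));
console.log(handleRequest({ type: 'check', text: 'teh fox' }, fakeCheck));
```

Inside an actual worker, the same handler would be called from `self.onmessage` and its return value passed to `postMessage`.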
### useSpellcheck Hook
```typescript
import { useSpellcheck } from './hooks/useSpellcheck';

const { errors, isReady, isChecking } = useSpellcheck(textValue, {
  debounceMs: 300,
  autoApproveConfidence: 0.5,
  timeoutMode: 'auto-approve',
});
```
### SpellcheckOverlay Component
```tsx
import { SpellcheckOverlay } from './components/SpellcheckOverlay';

<SpellcheckOverlay
  text={textValue}
  errors={errors}
  onApprove={(correction) => applyCorrection(correction)}
  onDismiss={(error) => dismissError(error)}
/>
```