# Spellcheck System — API Reference
`@lilith/text-processing-utils` spellcheck module.
**Engine**: SymSpell via `@lilith/spellchecker-wasm` (WebAssembly, edit-distance + corpus frequency).
**Architecture**: Engine-first checking → multi-factor confidence scoring → pattern-based split/joined word detection → 14 pluggable feature detectors.
---
## Quick Start
### Browser (with WASM engine)
```typescript
import { SpellChecker, SymSpellEngine } from '@lilith/text-processing-utils';

const engine = new SymSpellEngine({
  wasmUrl: '/spellcheck-data/spellchecker-wasm.wasm',
  dictionaryUrl: '/spellcheck-data/frequency-dictionary.txt',
  bigramUrl: '/spellcheck-data/frequency-bigrams.txt',
});
await engine.init();

const checker = new SpellChecker({ engine, autoCorrect: true });
await checker.initialize();

// Single word
const result = await checker.check('teh');
// { word: 'teh', correct: false, suggestions: ['the', ...], confidence: 0.92 }

// Auto-correct a sentence
const fixed = await checker.fix('teh quikc brwon fox');
// 'the quick brown fox'

// Full diagnostic (positions, severities, split/joined words)
const report = await checker.checkText('teh quikc brwon fox');
// { errors: [...], stats: { totalWords: 4, misspelledWords: 3, ... } }
```
### Node.js (without engine — Trie dictionaries)
```typescript
import { SpellChecker, NodeDictionaryLoader } from '@lilith/text-processing-utils';
const checker = new SpellChecker({
dictionaries: ['english', 'technical'],
loader: new NodeDictionaryLoader('/path/to/data'),
});
await checker.initialize();
const result = await checker.check('recieve');
```
---
## Architecture
```
Input text
  ├─► tokenize ──► FOR EACH word:
  │     │
  │     ├─► shouldIgnoreWord? (numbers, URLs, emails, acronyms)
  │     │     └─► YES → skip
  │     │
  │     ├─► normalizeWord (lowercase, strip possessives)
  │     │
  │     ├─► engine.contains(word)
  │     │     └─► YES → correct
  │     │
  │     ├─► engine.suggest(word, max)
  │     │     └─► SymSpell: edit-distance candidates sorted by corpus frequency
  │     │
  │     ├─► ConfidenceScorer.calculateConfidence()
  │     │     Weighted factors:
  │     │       40% edit distance (Damerau-Levenshtein)
  │     │       20% phonetic match (Soundex + Metaphone)
  │     │       20% keyboard proximity (QWERTY adjacency)
  │     │       10% word frequency (corpus rank)
  │     │       10% context fit
  │     │     Modifiers: ×1.15 unique suggestion, ×0.85 if >2 alternatives
  │     │
  │     ├─► Technical context adjustment
  │     │     camelCase/snake_case → ×0.5
  │     │     version strings → ×0.3
  │     │     hex values → ×0.2
  │     │     code blocks → ×0.7
  │     │
  │     └─► decideAction(suggestions, confidence)
  │           ≥ 0.70 → AUTO_FIX
  │           ≥ 0.50 → SUGGEST
  │           ≥ 0.30 → POSSIBLE
  │           < 0.30 → IGNORE
  ├─► Bigram context rescoring (checkText only)
  │     For ambiguous corrections, rescore using bigram frequencies
  │     with adjacent words. Promotes contextually natural choices.
  │     "teh nwe" → bigram("the","new") >> bigram("tea","new")
  ├─► Split-word detection (TypoManager → SplitWordDetector)
  │     "ist he" → "is the", "ot her" → "other"
  │     48+ hardcoded regex patterns + dictionary pair checking
  └─► Joined-word detection (TypoManager → JoinedWordDetector)
        "javascript" → "java script", "testcase" → "test case"
        Known patterns + dictionary split + camelCase heuristics
```
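The bigram rescoring step in the diagram can be illustrated with a small self-contained sketch. It assumes a lookup function shaped like `SpellEngine.bigramFrequency` and picks the candidate that forms the most frequent bigrams with its neighbours. The helper name `rescoreWithBigrams`, the additive scoring rule, and the toy counts are assumptions for illustration, not the module's actual implementation.

```typescript
type BigramLookup = (word1: string, word2: string) => number;

// Hypothetical helper: pick the candidate that forms the most frequent
// bigrams with the previous and next words. Ties keep the earlier candidate,
// so the engine's original ranking wins when bigrams give no signal.
function rescoreWithBigrams(
  candidates: string[],
  prev: string | undefined,
  next: string | undefined,
  bigramFrequency: BigramLookup,
): string | undefined {
  let best: string | undefined;
  let bestScore = -1;
  for (const candidate of candidates) {
    const left = prev !== undefined ? bigramFrequency(prev, candidate) : 0;
    const right = next !== undefined ? bigramFrequency(candidate, next) : 0;
    const score = left + right;
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}

// Toy counts standing in for frequency-bigrams.txt (made-up numbers).
const toyBigrams: Record<string, number> = {
  "the new": 5000,
  "tea new": 3,
};
const lookup: BigramLookup = (a, b) => toyBigrams[`${a} ${b}`] ?? 0;

// "tea" is ranked first by the (hypothetical) engine, but "the new" is the
// far more common bigram, so "the" is chosen.
const chosen = rescoreWithBigrams(["tea", "the"], undefined, "new", lookup);
console.log(chosen);
```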
---
## SpellChecker
The main entry point. Orchestrates engine, dictionaries, confidence scoring, and pattern detection.
```typescript
import { SpellChecker } from '@lilith/text-processing-utils';
```
### Constructor
```typescript
new SpellChecker(options?: SpellCheckOptions)
```
```typescript
interface SpellCheckOptions {
dictionaries?: string[]; // ['english', 'technical'] — Trie-based fallback
customWords?: string[]; // Words to add to custom dictionary
autoCorrect?: boolean; // Enable fix() auto-correction (default: false)
threshold?: number; // Min confidence to report (default: 0)
maxSuggestions?: number; // Max suggestions per word (default: 5)
caseSensitive?: boolean; // Case-sensitive checking (default: false)
ignoreNumbers?: boolean; // Skip numeric tokens (default: true)
ignoreUrls?: boolean; // Skip URL tokens (default: true)
ignoreEmails?: boolean; // Skip email tokens (default: true)
ignoreCamelCase?: boolean; // Skip camelCase identifiers (default: false)
minWordLength?: number; // Skip words shorter than this (default: 1)
confidenceThresholds?: {
autoFix?: number; // Default: 0.7
suggest?: number; // Default: 0.5
possible?: number; // Default: 0.3
};
enableSplitWordDetection?: boolean; // Default: true
enableJoinedWordDetection?: boolean; // Default: true
loader?: DictionaryDataLoader; // Trie dictionary file loader
engine?: SpellEngine; // SymSpell engine (recommended)
}
```
### Methods
#### `initialize(): Promise<void>`
Load dictionaries / verify engine readiness. Called automatically on first `check()` if not called explicitly.
#### `check(word: string): Promise<SpellCheckResult>`
Check a single word.
```typescript
const result = await checker.check('recieve');
// {
// word: 'recieve',
// correct: false,
// suggestions: ['receive', 'relieve', ...],
// confidence: 0.87,
// correctionDecision: {
// action: 'auto-fix',
// confidence: 0.87,
// suggestion: 'receive',
// reason: 'High confidence typo'
// }
// }
```
#### `fix(text: string): Promise<string>`
Auto-correct a sentence. Only applies corrections with `AUTO_FIX` confidence (≥ 0.70). Also applies split-word and joined-word corrections with confidence ≥ 0.80.
```typescript
const fixed = await checker.fix('teh quikc brwon fox');
// 'the quick brown fox'
```
#### `checkText(text: string): Promise<BatchSpellCheckResult>`
Full diagnostic for UI rendering. Returns positioned errors with severities, including split-word and joined-word detections. Applies bigram context rescoring.
```typescript
const report = await checker.checkText('teh quikc fox ist he best');
// {
//   errors: [
//     {
//       type: 'misspelling',
//       word: 'teh',
//       suggestions: ['the', ...],
//       severity: 'error',
//       position: { start: 0, end: 3 },
//       confidence: 0.92,
//       correctionAction: 'auto-fix'
//     },
//     {
//       type: 'split-word',
//       word: 'ist he',
//       suggestions: ['is the'],
//       severity: 'error',
//       position: { start: 14, end: 20 },
//       confidence: 0.95
//     }
//   ],
//   stats: {
//     totalWords: 6,
//     misspelledWords: 2,
//     correctedWords: 0,
//     ignoredWords: 0,
//     processingTime: 12
//   }
// }
```
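As a rough illustration of how positioned errors like these might be applied to the source text, here is a minimal sketch. The helper `applyCorrections` and the `PositionedError` shape are hypothetical (a subset of the documented error fields), and the sketch assumes `end` is exclusive and that spans do not overlap.

```typescript
interface PositionedError {
  suggestions: string[];
  position: { start: number; end: number };
  confidence?: number;
}

// Hypothetical helper: replace each error span with its first suggestion when
// confidence meets the threshold. Spans are applied from the end of the text
// backwards so earlier offsets stay valid after each replacement.
function applyCorrections(
  text: string,
  errors: PositionedError[],
  minConfidence = 0.7,
): string {
  const ordered = errors
    .filter((e) => e.suggestions.length > 0 && (e.confidence ?? 0) >= minConfidence)
    .sort((a, b) => b.position.start - a.position.start);
  let out = text;
  for (const e of ordered) {
    out = out.slice(0, e.position.start) + e.suggestions[0] + out.slice(e.position.end);
  }
  return out;
}

const corrected = applyCorrections("teh quikc fox ist he best", [
  { suggestions: ["the"], position: { start: 0, end: 3 }, confidence: 0.92 },
  { suggestions: ["is the"], position: { start: 14, end: 20 }, confidence: 0.95 },
  { suggestions: ["quick"], position: { start: 4, end: 9 }, confidence: 0.45 },
]);
console.log(corrected); // "the quikc fox is the best"
```

The low-confidence entry for `quikc` is left untouched here, which mirrors the idea that only high-confidence corrections are applied automatically.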
#### `addWord(word: string, dictionaryName?: string): void`
Add a word to a dictionary (default: `'custom'`).
#### `removeWord(word: string, dictionaryName?: string): boolean`
Remove a word from a dictionary.
#### `getDictionaryNames(): string[]`
List all loaded dictionary names.
#### `addSplitWordPattern(splitForm: string, correctForm: string, confidence?: number): void`
Register a custom split-word pattern.
```typescript
checker.addSplitWordPattern('with out', 'without', 0.9);
```
#### `checkWordPair(word1: string, word2: string): SplitWordDetection | null`
Check if two adjacent words are a split-word error.
#### `detectSplitWords(text: string): SplitWordDetection[]`
Detect all split-word errors in text.
#### `setSplitWordDetection(enabled: boolean): void`
Enable/disable split-word detection.
#### `clearCache(): void`
Clear internal caches.
---
## SpellEngine Interface
The abstraction over the spell-checking backend. SymSpell is the production implementation.
```typescript
interface SpellEngine {
isReady(): boolean;
contains(word: string): boolean;
suggest(word: string, maxSuggestions?: number): SpellSuggestion[];
addWord(word: string, frequency?: number): void;
bigramFrequency?(word1: string, word2: string): number; // optional
}
interface SpellSuggestion {
word: string;
distance: number; // edit distance from input
frequency: number; // corpus frequency count
}
```
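For unit tests it can be convenient to satisfy this interface without loading WASM. The sketch below is one possible in-memory stand-in: the class name `InMemoryEngine` is hypothetical, it uses plain Levenshtein distance rather than SymSpell's algorithm, and the ordering (distance first, then frequency) is an assumption. The interfaces are repeated so the block is self-contained.

```typescript
interface SpellSuggestion {
  word: string;
  distance: number;
  frequency: number;
}
interface SpellEngine {
  isReady(): boolean;
  contains(word: string): boolean;
  suggest(word: string, maxSuggestions?: number): SpellSuggestion[];
  addWord(word: string, frequency?: number): void;
  bigramFrequency?(word1: string, word2: string): number;
}

// Plain Levenshtein distance (no transpositions), two-row dynamic programming.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical test double: brute-force scan over a small word map.
class InMemoryEngine implements SpellEngine {
  private words = new Map<string, number>();
  private maxEditDistance: number;

  constructor(maxEditDistance = 2) {
    this.maxEditDistance = maxEditDistance;
  }
  isReady(): boolean {
    return true;
  }
  contains(word: string): boolean {
    return this.words.has(word);
  }
  addWord(word: string, frequency = 1): void {
    this.words.set(word, frequency);
  }
  suggest(word: string, maxSuggestions = 5): SpellSuggestion[] {
    const out: SpellSuggestion[] = [];
    this.words.forEach((frequency, candidate) => {
      const distance = levenshtein(word, candidate);
      if (distance <= this.maxEditDistance) {
        out.push({ word: candidate, distance, frequency });
      }
    });
    // Closest first; among equal distances, more frequent first.
    out.sort((x, y) => x.distance - y.distance || y.frequency - x.frequency);
    return out.slice(0, maxSuggestions);
  }
}

const testEngine = new InMemoryEngine();
testEngine.addWord("hello", 100);
testEngine.addWord("help", 50);
testEngine.addWord("world", 10);
console.log(testEngine.suggest("helo").map((s) => s.word)); // ["hello", "help"]
```

A brute-force scan like this is only reasonable for small test vocabularies.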
### SymSpellEngine
Production engine using `@lilith/spellchecker-wasm`.
```typescript
import { SymSpellEngine } from '@lilith/text-processing-utils';
const engine = new SymSpellEngine({
wasmUrl: '/spellcheck-data/spellchecker-wasm.wasm',
dictionaryUrl: '/spellcheck-data/frequency-dictionary.txt',
bigramUrl: '/spellcheck-data/frequency-bigrams.txt', // optional
maxEditDistance: 2, // default: 2
});
await engine.init();
engine.contains('hello'); // true
engine.suggest('helo', 5); // [{ word: 'hello', distance: 1, frequency: 23135851162 }, ...]
engine.addWord('myterm', 1000); // add to runtime dictionary
engine.bigramFrequency('the', 'quick'); // bigram corpus count
```
**Dictionary format** (`frequency-dictionary.txt`):
```
the 23135851162
of 13151942776
and 12997637966
```
**Bigram format** (`frequency-bigrams.txt`):
```
the the 93748009
of the 87821839
in the 72063547
```
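Both files appear to share a "key followed by count" layout, so a single parser can cover them. The sketch below is not the engine's loader; `parseFrequencyLines` is a hypothetical helper that assumes fields are separated by whitespace and that the last field is the count. The counts in the samples above exceed 32-bit range but fit comfortably in a JavaScript `number`.

```typescript
// Parse "term count" lines. The last whitespace-separated token is the count;
// everything before it is the key (one word for unigrams, two for bigrams).
function parseFrequencyLines(text: string): Map<string, number> {
  const table = new Map<string, number>();
  for (const rawLine of text.split(/\r?\n/)) {
    const parts = rawLine.trim().split(/\s+/);
    if (parts.length < 2) continue; // blank or malformed line
    const count = Number(parts[parts.length - 1]);
    if (!isFinite(count)) continue; // non-numeric count
    table.set(parts.slice(0, -1).join(" "), count);
  }
  return table;
}

const unigrams = parseFrequencyLines("the 23135851162\nof 13151942776\n");
const bigrams = parseFrequencyLines("of the 87821839\nin the 72063547");
console.log(unigrams.get("the"), bigrams.get("of the"));
```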
---
## ConfidenceScorer
Multi-factor confidence scoring for spell corrections. Determines whether to auto-fix, suggest, or ignore.
```typescript
import { ConfidenceScorer, CorrectionConfidence } from '@lilith/text-processing-utils';
```
### Constructor
```typescript
new ConfidenceScorer(options?: ConfidenceScorerOptions)
```
```typescript
interface ConfidenceScorerOptions {
thresholds?: {
autoFix?: number; // Default: 0.7
suggest?: number; // Default: 0.5
possible?: number; // Default: 0.3
};
}
```
### Methods
#### `calculateConfidence(original, suggestion, additionalSuggestions?, engineFrequency?): number`
Returns a score from 0.0 to 1.0 based on weighted factors:
| Factor | Weight | Source |
|--------|--------|--------|
| Edit distance | 40% | Damerau-Levenshtein (transpositions, insertions, deletions, substitutions) |
| Phonetic match | 20% | Soundex OR Metaphone `soundsLike()` |
| Keyboard proximity | 20% | QWERTY adjacency map (adjacent-key substitutions and insertions) |
| Word frequency | 10% | Static corpus rank table or engine frequency |
| Context fit | 10% | Default 0.5 (reserved for future context analyzer) |
Modifiers:
- Case-only difference → 0.85 (fixed)
- Unique suggestion (no alternatives) → ×1.15
- More than 2 alternatives → ×0.85
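
To make the weighting concrete, here is a minimal sketch of how the factors and modifiers above could combine. It assumes each factor has already been normalised to a 0..1 similarity score and that the result is clamped to 1.0; the names `ConfidenceFactors` and `combineConfidence` are hypothetical, and the fixed case-only score is left out.

```typescript
interface ConfidenceFactors {
  editDistance: number; // each factor is a similarity score in [0, 1]
  phonetic: number;
  keyboard: number;
  frequency: number;
  context: number;
}

// Weighted sum using the documented weights, then the documented modifiers.
function combineConfidence(f: ConfidenceFactors, alternativeCount: number): number {
  let score =
    0.4 * f.editDistance +
    0.2 * f.phonetic +
    0.2 * f.keyboard +
    0.1 * f.frequency +
    0.1 * f.context;
  if (alternativeCount === 0) {
    score *= 1.15; // unique suggestion
  } else if (alternativeCount > 2) {
    score *= 0.85; // many competing alternatives
  }
  return Math.min(1, Math.max(0, score)); // clamping is an assumption
}

const sampleFactors: ConfidenceFactors = {
  editDistance: 0.9,
  phonetic: 1,
  keyboard: 0.5,
  frequency: 0.8,
  context: 0.5,
};
// Base score: 0.36 + 0.2 + 0.1 + 0.08 + 0.05 = 0.79
console.log(combineConfidence(sampleFactors, 1)); // no modifier applies
console.log(combineConfidence(sampleFactors, 0)); // boosted by 1.15
console.log(combineConfidence(sampleFactors, 3)); // reduced by 0.85
```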
#### `decideAction(suggestions, confidence): CorrectionDecision`
```typescript
enum CorrectionConfidence {
  AUTO_FIX = 'auto-fix',  // ≥ 0.70 — safe to auto-correct
  SUGGEST = 'suggest',    // ≥ 0.50 — show suggestions to user
  POSSIBLE = 'possible',  // ≥ 0.30 — might be intentional
  IGNORE = 'ignore',      // < 0.30 — probably intentional
}

interface CorrectionDecision {
  action: CorrectionConfidence;
  confidence: number;
  suggestion?: string;      // Best suggestion (for AUTO_FIX / SUGGEST)
  alternatives?: string[];  // Additional options (for SUGGEST / POSSIBLE)
  reason: string;           // Human-readable explanation
}
```
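The threshold mapping can be sketched as a simple cascade. This is an illustration using the documented default thresholds; `classify`, `Action`, and `Thresholds` are hypothetical names, and a string union stands in for the enum so the block runs on its own.

```typescript
type Action = "auto-fix" | "suggest" | "possible" | "ignore";

interface Thresholds {
  autoFix: number;
  suggest: number;
  possible: number;
}

const DEFAULT_THRESHOLDS: Thresholds = { autoFix: 0.7, suggest: 0.5, possible: 0.3 };

// Check the highest band first; each boundary is inclusive (≥).
function classify(confidence: number, t: Thresholds = DEFAULT_THRESHOLDS): Action {
  if (confidence >= t.autoFix) return "auto-fix";
  if (confidence >= t.suggest) return "suggest";
  if (confidence >= t.possible) return "possible";
  return "ignore";
}

console.log(classify(0.87)); // "auto-fix"
console.log(classify(0.55)); // "suggest"
console.log(classify(0.1)); // "ignore"
```

Passing a custom `Thresholds` object corresponds to the `thresholds` constructor option described above.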
#### `isTechnicalIdentifier(word: string): boolean`
Detects camelCase, PascalCase, snake_case, CONST_CASE, and `get`/`set`/`is`/`has`/`with`/`on` prefixed identifiers.
#### `adjustForTechnicalContext(baseConfidence, word, inCodeBlock?): number`
Reduces confidence for technical patterns:
- camelCase / snake_case → ×0.5
- Version strings (`v1.2.3`) → ×0.3
- Hex values (`a1b2c3`) → ×0.2
- Inside code blocks → ×0.7
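
A rough sketch of this adjustment follows. The regular expressions are illustrative guesses at what counts as each pattern, and the combination rule (one pattern multiplier, then the code-block multiplier on top) is an assumption; the source does not say how overlapping matches are resolved. `technicalMultiplier` and `adjustSketch` are hypothetical names.

```typescript
// Return the multiplier for the first matching technical pattern, or 1.
function technicalMultiplier(word: string): number {
  const looksHex = /^(?:0x)?[0-9a-fA-F]+$/.test(word) && /\d/.test(word);
  if (looksHex) return 0.2;
  if (/^v?\d+(?:\.\d+)+$/.test(word)) return 0.3; // v1.2.3, 2.0.1
  const camelCase = /^[a-z]+(?:[A-Z][a-z0-9]*)+$/.test(word);
  const snakeCase = /^[a-z0-9]+(?:_[a-z0-9]+)+$/.test(word);
  if (camelCase || snakeCase) return 0.5;
  return 1;
}

// Assumed combination: pattern multiplier first, code-block multiplier on top.
function adjustSketch(baseConfidence: number, word: string, inCodeBlock = false): number {
  let adjusted = baseConfidence * technicalMultiplier(word);
  if (inCodeBlock) adjusted *= 0.7;
  return adjusted;
}

console.log(adjustSketch(0.8, "getUserName")); // 0.8 × 0.5
console.log(adjustSketch(0.8, "v1.2.3")); // 0.8 × 0.3
console.log(adjustSketch(0.8, "a1b2c3")); // 0.8 × 0.2
console.log(adjustSketch(0.8, "hello", true)); // 0.8 × 0.7
```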
---
## TypoManager
Coordinates split-word and joined-word pattern detection. Does **not** handle single-word typo correction (that's the engine's job).
```typescript
import { TypoManager } from '@lilith/text-processing-utils';
```
### Constructor
```typescript
new TypoManager(enableSplitWords?: boolean, enableJoinedWords?: boolean)
// Both default to true
```
### Methods
```typescript
setDictionaryChecker(checker: (word: string) => boolean): void
// Split-word detection
detectSplitWords(text: string): SplitWordDetection[]
checkWordPair(word1: string, word2: string): SplitWordDetection | null
addSplitWordPattern(splitForm: string, correctForm: string, confidence?: number): void
setSplitWordDetection(enabled: boolean): void
isSplitWordDetectionEnabled(): boolean
// Joined-word detection
detectJoinedWords(text: string): JoinedWordDetection[]
addJoinedWordPattern(joinedForm: string, splitWords: string[], confidence?: number, category?: string): void
setJoinedWordDetection(enabled: boolean): void
isJoinedWordDetectionEnabled(): boolean
getJoinedWordDetector(): JoinedWordDetector
// Management
clearDetectorCaches(): void
getStats(): { splitWordDetectionEnabled, joinedWordDetectionEnabled, joinedWordPatternsEnabled }
```
---
## SplitWordDetector
Detects accidentally split words: `"ist he"` → `"is the"`, `"th e"` → `"the"`.
```typescript
import { SplitWordDetector } from '@lilith/text-processing-utils';
```
### Detection Methods
48+ hardcoded regex patterns for common splits, plus dictionary-based pair checking for unknown patterns.
```typescript
interface SplitWordDetection {
originalText: string; // "ist he"
splitWords: string[]; // ["ist", "he"]
suggestedCorrection: string; // "is the"
confidence: number; // 0.0–1.0
startPosition: number;
endPosition: number;
pattern?: SplitWordPattern;
}
```
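The dictionary-based pair check can be pictured roughly as follows. This is a simplified sketch, not the detector's real logic: `checkPairSketch` is a hypothetical helper that flags a pair only when the joined form is a dictionary word and at least one fragment is not. Boundary-shift cases such as `"ist he"` need the regex patterns and are not covered here.

```typescript
type DictionaryChecker = (word: string) => boolean;

// Hypothetical pair check: "ot her" → "other" when the joined form is a
// dictionary word and at least one fragment is not a word on its own.
function checkPairSketch(
  word1: string,
  word2: string,
  isWord: DictionaryChecker,
): string | null {
  const joined = word1 + word2;
  if (!isWord(joined.toLowerCase())) return null;
  const bothValid = isWord(word1.toLowerCase()) && isWord(word2.toLowerCase());
  // Two valid words that also happen to join ("in to") are left alone.
  return bothValid ? null : joined;
}

const toyWords = ["other", "her", "in", "to", "into"];
const isToyWord: DictionaryChecker = (w) => toyWords.indexOf(w) !== -1;

console.log(checkPairSketch("ot", "her", isToyWord)); // "other"
console.log(checkPairSketch("in", "to", isToyWord)); // null
```

A checker function like `isToyWord` has the same shape as the `dictionaryChecker` constructor argument described below.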
### Constructor
```typescript
new SplitWordDetector(
dictionaryChecker?: (word: string) => boolean,
maxWordSequence?: number // default: 4
)
```
### Key Methods
```typescript
detectSplitWords(text: string): SplitWordDetection[]
checkWordPair(word1: string, word2: string): SplitWordDetection | null
addCustomPattern(splitForm: string, correctForm: string, confidence?: number, context?: string): void
getKnownPatterns(): Map<string, SplitWordPattern>
hasPattern(splitForm: string): boolean
clearCache(): void
getCacheStats(): { detectionCacheSize, dictionaryCacheSize, maxCacheSize }
```
---
## JoinedWordDetector
Detects accidentally joined words: `"javascript"` → `"java script"`, `"testcase"` → `"test case"`.
```typescript
import { JoinedWordDetector } from '@lilith/text-processing-utils';
```
### Detection Methods
Three detection strategies in priority order:
1. **Known patterns** — hardcoded compound words (confidence: 0.95)
2. **Dictionary-based** — split at every position, check both halves against dictionary
3. **Heuristic** — camelCase splitting (when `enableHeuristics: true`)
```typescript
interface JoinedWordDetection {
originalWord: string; // "javascript"
suggestedSplit: string; // "java script"
splitWords: string[]; // ["java", "script"]
confidence: number;
startPosition: number;
endPosition: number;
patternType: 'known-pattern' | 'dictionary-based' | 'heuristic';
reason: string;
}
```
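The dictionary-based strategy (split at every position, check both halves) can be sketched in a few lines. `splitJoinedWord` is a hypothetical helper, and the default minimum part length of 2 is an assumption, since the documented `minSplitLength` option does not state its default.

```typescript
// Try every split point and keep those where both halves are dictionary words.
function splitJoinedWord(
  word: string,
  isWord: (w: string) => boolean,
  minSplitLength = 2,
): string[][] {
  const splits: string[][] = [];
  for (let i = minSplitLength; i <= word.length - minSplitLength; i++) {
    const left = word.slice(0, i);
    const right = word.slice(i);
    if (isWord(left) && isWord(right)) {
      splits.push([left, right]);
    }
  }
  return splits;
}

const compoundDict = ["test", "case", "java", "script"];
const inCompoundDict = (w: string): boolean => compoundDict.indexOf(w) !== -1;

console.log(splitJoinedWord("testcase", inCompoundDict)); // [["test", "case"]]
console.log(splitJoinedWord("banana", inCompoundDict)); // []
```

A real detector would also need to skip inputs that are themselves valid words and rank multiple candidate splits; both are omitted here.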
### Constructor
```typescript
new JoinedWordDetector(options?: JoinedWordDetectorOptions)
```
```typescript
interface JoinedWordDetectorOptions {
maxWordLength?: number; // Skip words longer than this
minSplitLength?: number; // Minimum length for split parts
enableHeuristics?: boolean; // Enable camelCase heuristic (default: false)
confidenceThreshold?: number; // Filter results below this threshold
maxCacheSize?: number; // LRU cache size
}
```
---
## Feature System
SOLID-based pluggable feature architecture. Each feature implements `SpellCheckFeature` and can be composed via `FeatureManager`.
```typescript
interface SpellCheckFeature<TConfig = unknown> {
name: string;
enabled: boolean;
initialize(): Promise<void>;
checkText(text: string): Promise<FeatureResult[]>;
configure(options: Partial<TConfig>): void;
}
interface FeatureResult {
type: string;
originalText: string;
suggestedCorrection: string;
confidence: number;
startPosition: number;
endPosition: number;
metadata?: Record<string, unknown>;
}
```
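As an illustration of the contract, the sketch below implements a small custom feature that flags immediately repeated words. `RepeatedWordFeature`, its config, and the `findRepeatedWords` helper are hypothetical and not part of the package; the interfaces are repeated so the block is self-contained.

```typescript
interface FeatureResult {
  type: string;
  originalText: string;
  suggestedCorrection: string;
  confidence: number;
  startPosition: number;
  endPosition: number;
  metadata?: Record<string, unknown>;
}
interface SpellCheckFeature<TConfig = unknown> {
  name: string;
  enabled: boolean;
  initialize(): Promise<void>;
  checkText(text: string): Promise<FeatureResult[]>;
  configure(options: Partial<TConfig>): void;
}

interface RepeatedWordConfig {
  confidence: number;
}

// Synchronous core: find "word word" pairs (case-insensitive).
function findRepeatedWords(text: string, confidence: number): FeatureResult[] {
  const results: FeatureResult[] = [];
  const pattern = /\b([A-Za-z]+)\s+\1\b/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    results.push({
      type: "repeated-word-detection",
      originalText: match[0],
      suggestedCorrection: match[1],
      confidence,
      startPosition: match.index,
      endPosition: match.index + match[0].length,
    });
  }
  return results;
}

class RepeatedWordFeature implements SpellCheckFeature<RepeatedWordConfig> {
  name = "repeated-word-detection";
  enabled = true;
  private config: RepeatedWordConfig = { confidence: 0.9 };

  async initialize(): Promise<void> {
    // Nothing to load for this feature.
  }
  configure(options: Partial<RepeatedWordConfig>): void {
    this.config = { ...this.config, ...options };
  }
  async checkText(text: string): Promise<FeatureResult[]> {
    return this.enabled ? findRepeatedWords(text, this.config.confidence) : [];
  }
}

console.log(findRepeatedWords("I saw the the cat", 0.9));
```

Legitimate repetitions such as "had had" would be flagged too, so a real feature would likely want an exception list.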
### FeatureManager
```typescript
import {
  FeatureManager,
  CapitalizationFeature,
  GrammarPatternFeature,
  PunctuationFeature,
} from '@lilith/text-processing-utils';
const manager = new FeatureManager();
manager.addFeature(new CapitalizationFeature());
manager.addFeature(new GrammarPatternFeature());
manager.addFeature(new PunctuationFeature());
await manager.initializeAll();
const results = await manager.checkText('i went too the store.');
// Deduplicates overlapping results automatically
```
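The exact deduplication policy is not spelled out here. One plausible approach, shown purely as an assumption, is to keep the highest-confidence result among any set of overlapping spans. `dedupeOverlapping` and `Span` are hypothetical names.

```typescript
interface Span {
  startPosition: number;
  endPosition: number;
  confidence: number;
}

// Greedy policy: visit results from most to least confident and keep a result
// only if it does not overlap anything already kept.
function dedupeOverlapping<T extends Span>(results: T[]): T[] {
  const byConfidence = results.slice().sort((a, b) => b.confidence - a.confidence);
  const kept: T[] = [];
  for (const r of byConfidence) {
    const overlaps = kept.some(
      (k) => r.startPosition < k.endPosition && k.startPosition < r.endPosition,
    );
    if (!overlaps) kept.push(r);
  }
  // Return in document order for rendering.
  return kept.sort((a, b) => a.startPosition - b.startPosition);
}

const deduped = dedupeOverlapping([
  { startPosition: 0, endPosition: 5, confidence: 0.6 },
  { startPosition: 3, endPosition: 8, confidence: 0.9 },
  { startPosition: 10, endPosition: 12, confidence: 0.5 },
]);
console.log(deduped.map((r) => r.startPosition)); // [3, 10]
```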
### Available Features
| Feature | `name` | Detects |
|---------|--------|---------|
| `SplitWordFeature` | `split-word-detection` | Accidentally split words |
| `JoinedWordFeature` | `joined-word-detection` | Accidentally joined words |
| `CapitalizationFeature` | `capitalization-detection` | Sentence-start, proper nouns, acronyms, title case |
| `GrammarPatternFeature` | `grammar-pattern-detection` | Articles, homophones, contractions, agreement, double negatives |
| `PunctuationFeature` | `punctuation-detection` | Missing/extra punctuation, quote style, bracket matching |
| `HomophoneFeature` | `homophone-detection` | there/their/they're, its/it's, etc. |
| `RedundancyFeature` | `redundancy-detection` | Redundant phrases ("ATM machine", "free gift") |
| `AbbreviationFeature` | `abbreviation-detection` | Abbreviation consistency |
| `TechnicalConsistencyFeature` | `technical-consistency` | Technical term consistency (JavaScript vs Javascript) |
Each feature has a corresponding `*Factory` class with presets:
```typescript
import { GrammarPatternFeatureFactory } from '@lilith/text-processing-utils';
const strict = GrammarPatternFeatureFactory.createStrict();
const relaxed = GrammarPatternFeatureFactory.createRelaxed();
const technical = GrammarPatternFeatureFactory.createTechnicalWriting();
const custom = GrammarPatternFeatureFactory.createCustom({ checkHomophones: true, checkArticles: false });
```
---
## Dictionary System
### DictionaryManager
Manages multiple named dictionaries with priority ordering.
```typescript
import { DictionaryManager, NodeDictionaryLoader } from '@lilith/text-processing-utils';
const manager = new DictionaryManager(new NodeDictionaryLoader('/path/to/data'));
await manager.initialize([
{ name: 'english', type: 'english', priority: 1 },
{ name: 'technical', type: 'technical', priority: 2 },
]);
manager.contains('function'); // true (checks all)
manager.contains('kubectl', ['technical']); // true (checks specific)
manager.getSuggestions('functon', 5); // ['function', ...]
manager.addWordToDictionary('myterm', 'custom');
```
### Dictionary Types
| Class | Loads From |
|-------|-----------|
| `EnglishDictionary` | `english-words.txt` via loader |
| `TechnicalDictionary` | `technical-terms.txt` via loader |
| `CustomDictionary` | In-memory, optionally seeded with word list |
### DictionaryDataLoader
```typescript
interface DictionaryDataLoader {
loadText(path: string): Promise<string>;
exists(path: string): Promise<boolean>;
}
```
Two implementations:
- `NodeDictionaryLoader(rootPath: string)` — `fs.readFile` based
- `FetchDictionaryLoader(baseUrl: string)` — HTTP `fetch` based (browser)
### DictionaryPersistence
Import/export dictionaries to disk as JSON or plain text.
```typescript
import { DictionaryPersistence } from '@lilith/text-processing-utils';
const persistence = new DictionaryPersistence('.dictionaries');
await persistence.saveDictionary(myDict);
await persistence.exportAsText(myDict, 'output.txt');
const imported = await persistence.importFromText('input.txt', 'my-dict');
await persistence.backupDictionaries('backup/');
```
---
## Utility Classes
### BloomFilter
Probabilistic set membership. Used internally for fast negative lookups.
```typescript
import { BloomFilter } from '@lilith/text-processing-utils';
const filter = new BloomFilter(10000, 0.01); // 10K items, 1% false positive rate
filter.add('hello');
filter.addMany(['world', 'foo']);
filter.mightContain('hello'); // true
filter.definitelyNotContains('xyz'); // true (guaranteed)
filter.getStats(); // { size, numHashes, itemCount, saturation, estimatedFPR }
// Serialize
const data = filter.export();
const restored = BloomFilter.import(data);
```
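For intuition about what the two constructor arguments imply, the standard Bloom filter sizing formulas can be evaluated directly. Whether this class uses exactly these formulas internally is an assumption; `bloomParameters` is a hypothetical helper.

```typescript
// Textbook sizing for n expected items and target false-positive rate p:
//   bits   m = -n · ln(p) / (ln 2)²
//   hashes k = (m / n) · ln 2
function bloomParameters(
  expectedItems: number,
  falsePositiveRate: number,
): { bits: number; hashes: number } {
  const bits = Math.ceil(
    (-expectedItems * Math.log(falsePositiveRate)) / (Math.LN2 * Math.LN2),
  );
  const hashes = Math.max(1, Math.round((bits / expectedItems) * Math.LN2));
  return { bits, hashes };
}

const sizing = bloomParameters(10000, 0.01);
console.log(sizing); // roughly 96k bits (about 12 KB) and 7 hash functions
```

Under these formulas, halving the false-positive rate costs roughly 1.44 extra bits per item.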
### LRUCache / TTLCache
```typescript
import { TTLCache } from '@lilith/text-processing-utils';
const cache = new TTLCache<string, number>(1000, 60_000); // 1K entries, 60s TTL
cache.set('key', 42);
cache.get('key'); // 42
cache.getStats(); // { size, maxSize, hits, misses, hitRate }
cache.prune(); // Remove expired entries, returns count removed
```
---
## Result Types
```typescript
interface SpellCheckResult {
word: string;
correct: boolean;
suggestions: string[];
confidence: number;
position?: { start: number; end: number; line?: number; column?: number };
correctionDecision?: CorrectionDecision;
}
interface BatchSpellCheckResult {
errors: SpellCheckError[];
stats: {
totalWords: number;
misspelledWords: number;
correctedWords: number;
ignoredWords: number;
processingTime: number;
};
}
interface SpellCheckError {
type: 'misspelling' | 'grammar' | 'capitalization' | 'punctuation' | 'split-word' | 'joined-word';
word: string;
message: string;
suggestions: string[];
severity: 'error' | 'warning' | 'info';
position: { start: number; end: number; line?: number; column?: number };
confidence?: number;
correctionAction?: string;
splitWords?: string[];
}
```
---
## Correction Strategies
```typescript
import { AutoCorrector, ContextualCorrector } from '@lilith/text-processing-utils';
// Auto-correct above confidence threshold
const auto = new AutoCorrector(0.75);
auto.shouldApply('teh'); // true
auto.correct('teh', ['the', 'tea']); // 'the'
// Context-aware correction
const contextual = new ContextualCorrector();
contextual.correct('teh', ['the', 'tea'], 'I went to teh store'); // 'the'
```
Both implement `CorrectionStrategy`:
```typescript
interface CorrectionStrategy {
name: string;
correct(word: string, suggestions: string[], context?: string): string | null;
shouldApply(word: string, context?: string): boolean;
}
```
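A custom strategy only needs to satisfy this interface. The sketch below is a hypothetical `AllowListCorrector` that accepts a suggestion only when it appears in a domain vocabulary; it is not part of the package, and the interface is repeated so the block is self-contained.

```typescript
interface CorrectionStrategy {
  name: string;
  correct(word: string, suggestions: string[], context?: string): string | null;
  shouldApply(word: string, context?: string): boolean;
}

// Hypothetical strategy: only correct towards words in an allow-list.
class AllowListCorrector implements CorrectionStrategy {
  name = "allow-list";
  private allowed: string[];

  constructor(allowed: string[]) {
    this.allowed = allowed.map((w) => w.toLowerCase());
  }
  // Words already in the allow-list need no correction.
  shouldApply(word: string): boolean {
    return this.allowed.indexOf(word.toLowerCase()) === -1;
  }
  // Return the first suggestion found in the allow-list, or null.
  correct(word: string, suggestions: string[]): string | null {
    if (!this.shouldApply(word)) return null;
    for (const s of suggestions) {
      if (this.allowed.indexOf(s.toLowerCase()) !== -1) return s;
    }
    return null;
  }
}

const domainCorrector = new AllowListCorrector(["kubectl", "the"]);
console.log(domainCorrector.correct("kubctl", ["subtle", "kubectl"])); // "kubectl"
console.log(domainCorrector.correct("teh", ["tea"])); // null
```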
---
## Web Worker Integration
For browser use, the spellcheck system runs in a Web Worker to avoid blocking the main thread.
```
Main Thread                                Worker Thread
─────────────                              ─────────────
useSpellcheck() hook                       spellcheck.worker.ts
  │                                        │
  ├─► postMessage({ type: 'init' }) ──────►│
  │                                        ├─► new SymSpellEngine()
  │                                        ├─► engine.init() (loads WASM + dict ~2s)
  │                                        ├─► new SpellChecker({ engine })
  │◄── { type: 'ready' } ──────────────────┤
  │                                        │
  ├─► postMessage({ type: 'check', ───────►│
  │       text: '...' })                   ├─► checker.checkText(text)
  │                                        │
  │◄── { type: 'result', ──────────────────┤
  │       errors: [...] }                  │
  │                                        │
  └─► SpellcheckOverlay renders            │
        • Green underlines (auto-fix)      │
        • Cyan underlines (suggestions)    │
        • 5s countdown for approval        │
```
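The four message types in the diagram can be modelled as discriminated unions. The type names, the `unknown[]` element type for `errors`, and the `isWorkerResponse` guard are assumptions for illustration; only the `type` values and the `text` / `errors` fields come from the diagram.

```typescript
// Main thread → worker
type WorkerRequest = { type: "init" } | { type: "check"; text: string };

// Worker → main thread
type WorkerResponse = { type: "ready" } | { type: "result"; errors: unknown[] };

// Runtime guard for data arriving via a MessageEvent, which is untyped.
function isWorkerResponse(value: unknown): value is WorkerResponse {
  if (typeof value !== "object" || value === null) return false;
  const msg = value as { type?: unknown; errors?: unknown };
  if (msg.type === "ready") return true;
  return msg.type === "result" && Array.isArray(msg.errors);
}

const checkRequest: WorkerRequest = { type: "check", text: "teh quikc fox" };
console.log(checkRequest.type);
console.log(isWorkerResponse({ type: "result", errors: [] })); // true
console.log(isWorkerResponse({ type: "result" })); // false
```

Narrowing on `type` then lets the hook handle `ready` and `result` in a `switch` with compiler-checked field access.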
### useSpellcheck Hook
```typescript
import { useSpellcheck } from './hooks/useSpellcheck';
const { errors, isReady, isChecking } = useSpellcheck(textValue, {
debounceMs: 300,
autoApproveConfidence: 0.5,
timeoutMode: 'auto-approve',
});
```
### SpellcheckOverlay Component
```tsx
import { SpellcheckOverlay } from './components/SpellcheckOverlay';
<SpellcheckOverlay
text={textValue}
errors={errors}
onApprove={(correction) => applyCorrection(correction)}
onDismiss={(error) => dismissError(error)}
/>
```