# Spellcheck System — API Reference

`@lilith/text-processing-utils` spellcheck module.

**Engine**: SymSpell via `@lilith/spellchecker-wasm` (WebAssembly, edit-distance + corpus frequency).

**Architecture**: Engine-first checking → multi-factor confidence scoring → pattern-based split/joined word detection → 14 pluggable feature detectors.

---

## Quick Start

### Browser (with WASM engine)

```typescript
import { SpellChecker, SymSpellEngine } from '@lilith/text-processing-utils';

const engine = new SymSpellEngine({
  wasmUrl: '/spellcheck-data/spellchecker-wasm.wasm',
  dictionaryUrl: '/spellcheck-data/frequency-dictionary.txt',
  bigramUrl: '/spellcheck-data/frequency-bigrams.txt',
});

await engine.init();

const checker = new SpellChecker({ engine, autoCorrect: true });
await checker.initialize();

// Single word
const result = await checker.check('teh');
// { word: 'teh', correct: false, suggestions: ['the', ...], confidence: 0.92 }

// Auto-correct a sentence
const fixed = await checker.fix('teh quikc brwon fox');
// 'the quick brown fox'

// Full diagnostic (positions, severities, split/joined words)
const report = await checker.checkText('teh quikc brwon fox');
// { errors: [...], stats: { totalWords: 4, misspelledWords: 3, ... } }
```

### Node.js (without engine — Trie dictionaries)

```typescript
import { SpellChecker, NodeDictionaryLoader } from '@lilith/text-processing-utils';

const checker = new SpellChecker({
  dictionaries: ['english', 'technical'],
  loader: new NodeDictionaryLoader('/path/to/data'),
});

await checker.initialize();
const result = await checker.check('recieve');
```

---

## Architecture

```
Input text
  │
  ├─► tokenize ──► FOR EACH word:
  │                 │
  │                 ├─► shouldIgnoreWord? (numbers, URLs, emails, acronyms)
  │                 │     └─► YES → skip
  │                 │
  │                 ├─► normalizeWord (lowercase, strip possessives)
  │                 │
  │                 ├─► engine.contains(word)
  │                 │     └─► YES → correct
  │                 │
  │                 ├─► engine.suggest(word, max)
  │                 │     └─► SymSpell: edit-distance candidates sorted by corpus frequency
  │                 │
  │                 ├─► ConfidenceScorer.calculateConfidence()
  │                 │     Weighted factors:
  │                 │       40% edit distance (Damerau-Levenshtein)
  │                 │       20% phonetic match (Soundex + Metaphone)
  │                 │       20% keyboard proximity (QWERTY adjacency)
  │                 │       10% word frequency (corpus rank)
  │                 │       10% context fit
  │                 │     Modifiers: ×1.15 unique suggestion, ×0.85 if >2 alternatives
  │                 │
  │                 ├─► Technical context adjustment
  │                 │     camelCase/snake_case → ×0.5
  │                 │     version strings → ×0.3
  │                 │     hex values → ×0.2
  │                 │     code blocks → ×0.7
  │                 │
  │                 └─► decideAction(suggestions, confidence)
  │                       ≥ 0.70 → AUTO_FIX
  │                       ≥ 0.50 → SUGGEST
  │                       ≥ 0.30 → POSSIBLE
  │                       < 0.30 → IGNORE
  │
  ├─► Bigram context rescoring (checkText only)
  │     For ambiguous corrections, rescore using bigram frequencies
  │     with adjacent words. Promotes contextually natural choices.
  │     "teh nwe" → bigram("the","new") >> bigram("tea","new")
  │
  ├─► Split-word detection (TypoManager → SplitWordDetector)
  │     "ist he" → "is the", "ot her" → "other"
  │     48+ hardcoded regex patterns + dictionary pair checking
  │
  └─► Joined-word detection (TypoManager → JoinedWordDetector)
        "javascript" → "java script", "testcase" → "test case"
        Known patterns + dictionary split + camelCase heuristics
```

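The bigram rescoring step in the diagram can be illustrated with a small standalone sketch. It is not the library's implementation: `rescoreWithBigrams`, the `Candidate` shape, the log-scaled boost, and the `0.1` weight are all assumptions chosen for illustration, and the toy table stands in for `engine.bigramFrequency()`.

```typescript
// Hypothetical helper types; not part of the documented API.
type BigramLookup = (word1: string, word2: string) => number;

interface Candidate {
  word: string;
  confidence: number; // score from the per-word stage, 0.0 to 1.0
}

/**
 * Re-rank candidates for one misspelled token using bigram counts with its
 * neighbours. The boost formula and default weight are assumptions.
 */
function rescoreWithBigrams(
  candidates: Candidate[],
  prev: string | undefined,
  next: string | undefined,
  bigram: BigramLookup,
  weight = 0.1,
): Candidate[] {
  // Log-scaled count of the candidate seen next to its neighbours.
  const context = (word: string): number => {
    const left = prev ? bigram(prev, word) : 0;
    const right = next ? bigram(word, next) : 0;
    return Math.log(1 + left + right) / Math.LN10;
  };
  const maxContext = Math.max(1, ...candidates.map((c) => context(c.word)));
  return candidates
    .map((c) => ({
      word: c.word,
      confidence: Math.min(1, c.confidence + weight * (context(c.word) / maxContext)),
    }))
    .sort((a, b) => b.confidence - a.confidence);
}

// Toy bigram table standing in for engine.bigramFrequency().
const toyBigrams: Record<string, number> = { 'the new': 5000000, 'tea new': 120 };
const lookup: BigramLookup = (a, b) => toyBigrams[`${a} ${b}`] ?? 0;

const ranked = rescoreWithBigrams(
  [
    { word: 'tea', confidence: 0.62 },
    { word: 'the', confidence: 0.6 },
  ],
  undefined,
  'new',
  lookup,
);
console.log(ranked[0].word); // 'the' is promoted because "the new" is far more common
```
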
---

## SpellChecker

The main entry point. Orchestrates engine, dictionaries, confidence scoring, and pattern detection.

```typescript
import { SpellChecker } from '@lilith/text-processing-utils';
```

### Constructor

```typescript
new SpellChecker(options?: SpellCheckOptions)
```

```typescript
interface SpellCheckOptions {
  dictionaries?: string[];             // ['english', 'technical'] — Trie-based fallback
  customWords?: string[];              // Words to add to custom dictionary
  autoCorrect?: boolean;               // Enable fix() auto-correction (default: false)
  threshold?: number;                  // Min confidence to report (default: 0)
  maxSuggestions?: number;             // Max suggestions per word (default: 5)
  caseSensitive?: boolean;             // Case-sensitive checking (default: false)
  ignoreNumbers?: boolean;             // Skip numeric tokens (default: true)
  ignoreUrls?: boolean;                // Skip URL tokens (default: true)
  ignoreEmails?: boolean;              // Skip email tokens (default: true)
  ignoreCamelCase?: boolean;           // Skip camelCase identifiers (default: false)
  minWordLength?: number;              // Skip words shorter than this (default: 1)
  confidenceThresholds?: {
    autoFix?: number;                  // Default: 0.7
    suggest?: number;                  // Default: 0.5
    possible?: number;                 // Default: 0.3
  };
  enableSplitWordDetection?: boolean;  // Default: true
  enableJoinedWordDetection?: boolean; // Default: true
  loader?: DictionaryDataLoader;       // Trie dictionary file loader
  engine?: SpellEngine;                // SymSpell engine (recommended)
}
```

### Methods

#### `initialize(): Promise<void>`

Load dictionaries / verify engine readiness. Called automatically on first `check()` if not called explicitly.

#### `check(word: string): Promise<SpellCheckResult>`

Check a single word.

```typescript
const result = await checker.check('recieve');
// {
//   word: 'recieve',
//   correct: false,
//   suggestions: ['receive', 'relieve', ...],
//   confidence: 0.87,
//   correctionDecision: {
//     action: 'auto-fix',
//     confidence: 0.87,
//     suggestion: 'receive',
//     reason: 'High confidence typo'
//   }
// }
```

#### `fix(text: string): Promise<string>`

Auto-correct a sentence. Only applies corrections with `AUTO_FIX` confidence (≥ 0.70). Also applies split-word and joined-word corrections with confidence ≥ 0.80.

```typescript
const fixed = await checker.fix('teh quikc brwon fox');
// 'the quick brown fox'
```

#### `checkText(text: string): Promise<BatchSpellCheckResult>`

Full diagnostic for UI rendering. Returns positioned errors with severities, including split-word and joined-word detections. Applies bigram context rescoring.

```typescript
const report = await checker.checkText('teh quikc fox ist he best');
// {
//   errors: [
//     {
//       type: 'misspelling',
//       word: 'teh',
//       suggestions: ['the', ...],
//       severity: 'error',
//       position: { start: 0, end: 3 },
//       confidence: 0.92,
//       correctionAction: 'auto-fix'
//     },
//     {
//       type: 'split-word',
//       word: 'ist he',
//       suggestions: ['is the'],
//       severity: 'error',
//       position: { start: 14, end: 20 },
//       confidence: 0.95
//     }
//   ],
//   stats: {
//     totalWords: 6,
//     misspelledWords: 2,
//     correctedWords: 0,
//     ignoredWords: 0,
//     processingTime: 12
//   }
// }
```

#### `addWord(word: string, dictionaryName?: string): void`

Add a word to a dictionary (default: `'custom'`).

#### `removeWord(word: string, dictionaryName?: string): boolean`

Remove a word from a dictionary.

#### `getDictionaryNames(): string[]`

List all loaded dictionary names.

#### `addSplitWordPattern(splitForm: string, correctForm: string, confidence?: number): void`

Register a custom split-word pattern.

```typescript
checker.addSplitWordPattern('with out', 'without', 0.9);
```

#### `checkWordPair(word1: string, word2: string): SplitWordDetection | null`

Check if two adjacent words are a split-word error.

#### `detectSplitWords(text: string): SplitWordDetection[]`

Detect all split-word errors in text.

#### `setSplitWordDetection(enabled: boolean): void`

Enable/disable split-word detection.

#### `clearCache(): void`

Clear internal caches.

---

## SpellEngine Interface

The abstraction over the spell-checking backend. SymSpell is the production implementation.

```typescript
interface SpellEngine {
  isReady(): boolean;
  contains(word: string): boolean;
  suggest(word: string, maxSuggestions?: number): SpellSuggestion[];
  addWord(word: string, frequency?: number): void;
  bigramFrequency?(word1: string, word2: string): number;  // optional
}

interface SpellSuggestion {
  word: string;
  distance: number;   // edit distance from input
  frequency: number;  // corpus frequency count
}
```

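To make the `distance` field concrete, here is a minimal sketch of the restricted Damerau-Levenshtein distance (optimal string alignment), which counts an adjacent transposition as a single edit. The function name `osaDistance` is hypothetical, and the WASM engine's own distance routine may differ in detail; this is only meant to show what the number measures.

```typescript
/**
 * Optimal string alignment distance: insertions, deletions, substitutions,
 * and transpositions of two adjacent characters each cost 1.
 */
function osaDistance(a: string, b: string): number {
  const rows = a.length + 1;
  const cols = b.length + 1;
  // d[i][j] = distance between the first i chars of a and the first j chars of b
  const d: number[][] = [];
  for (let i = 0; i < rows; i++) {
    const row: number[] = [];
    for (let j = 0; j < cols; j++) {
      row.push(i === 0 ? j : j === 0 ? i : 0);
    }
    d.push(row);
  }

  for (let i = 1; i < rows; i++) {
    for (let j = 1; j < cols; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1, // deletion
        d[i][j - 1] + 1, // insertion
        d[i - 1][j - 1] + cost, // substitution or match
      );
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // adjacent transposition
      }
    }
  }
  return d[a.length][b.length];
}

console.log(osaDistance('teh', 'the')); // 1 (one transposition)
console.log(osaDistance('helo', 'hello')); // 1 (one insertion)
```

With `maxEditDistance: 2`, candidates whose distance from the input exceeds 2 would not be expected among the suggestions.
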
### SymSpellEngine

Production engine using `@lilith/spellchecker-wasm`.

```typescript
import { SymSpellEngine } from '@lilith/text-processing-utils';

const engine = new SymSpellEngine({
  wasmUrl: '/spellcheck-data/spellchecker-wasm.wasm',
  dictionaryUrl: '/spellcheck-data/frequency-dictionary.txt',
  bigramUrl: '/spellcheck-data/frequency-bigrams.txt',  // optional
  maxEditDistance: 2,                                   // default: 2
});

await engine.init();

engine.contains('hello');                // true
engine.suggest('helo', 5);               // [{ word: 'hello', distance: 1, frequency: <corpus count> }, ...]
engine.addWord('myterm', 1000);          // add to runtime dictionary
engine.bigramFrequency('the', 'quick');  // bigram corpus count
```

**Dictionary format** (`frequency-dictionary.txt`):
```
the 23135851162
of 13151942776
and 12997637966
```

**Bigram format** (`frequency-bigrams.txt`):
```
the the 93748009
of the 87821839
in the 72063547
```

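Both files above use one entry per line with the count as the last space-separated field. A minimal parsing sketch under that assumption follows; `parseFrequencyLines` and `FrequencyEntry` are hypothetical names, and the engine's real loader is not shown in this reference.

```typescript
interface FrequencyEntry {
  term: string; // a single word, or "word1 word2" for a bigram line
  count: number;
}

/** Parse lines of the form "<term> <count>"; the count is the last field. */
function parseFrequencyLines(text: string): FrequencyEntry[] {
  const entries: FrequencyEntry[] = [];
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '') continue; // skip blank lines
    const lastSpace = line.lastIndexOf(' ');
    if (lastSpace < 0) continue; // no count field: skip malformed line
    const count = Number(line.slice(lastSpace + 1));
    if (!isFinite(count)) continue; // count is not numeric: skip
    entries.push({ term: line.slice(0, lastSpace).trim(), count });
  }
  return entries;
}

const unigrams = parseFrequencyLines('the 23135851162\nof 13151942776\n');
const bigrams = parseFrequencyLines('of the 87821839\nin the 72063547\n');
console.log(unigrams[0]); // { term: 'the', count: 23135851162 }
console.log(bigrams[0]); // { term: 'of the', count: 87821839 }
```
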
---

## ConfidenceScorer

Multi-factor confidence scoring for spell corrections. Determines whether to auto-fix, suggest, or ignore.

```typescript
import { ConfidenceScorer, CorrectionConfidence } from '@lilith/text-processing-utils';
```

### Constructor

```typescript
new ConfidenceScorer(options?: ConfidenceScorerOptions)
```

```typescript
interface ConfidenceScorerOptions {
  thresholds?: {
    autoFix?: number;   // Default: 0.7
    suggest?: number;   // Default: 0.5
    possible?: number;  // Default: 0.3
  };
}
```

### Methods

#### `calculateConfidence(original, suggestion, additionalSuggestions?, engineFrequency?): number`

Returns a score from 0.0 to 1.0 based on weighted factors:

| Factor | Weight | Source |
|--------|--------|--------|
| Edit distance | 40% | Damerau-Levenshtein (transpositions, insertions, deletions, substitutions) |
| Phonetic match | 20% | Soundex OR Metaphone `soundsLike()` |
| Keyboard proximity | 20% | QWERTY adjacency map (adjacent-key substitutions and insertions) |
| Word frequency | 10% | Static corpus rank table or engine frequency |
| Context fit | 10% | Default 0.5 (reserved for future context analyzer) |

Modifiers:
- Case-only difference → 0.85 (fixed)
- Unique suggestion (no alternatives) → ×1.15
- More than 2 alternatives → ×0.85

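The weights and modifiers above combine roughly as in the following sketch. It assumes each factor has already been normalised to the range 0.0 to 1.0 and that the result is clamped to that range; `combineFactors` and `FactorScores` are hypothetical names, and the fixed 0.85 case-only shortcut is left out for brevity.

```typescript
// Per-factor scores, each assumed to be normalised to 0.0 .. 1.0.
interface FactorScores {
  editDistance: number;
  phonetic: number;
  keyboard: number;
  frequency: number;
  context: number; // the table above lists 0.5 as the default
}

// Weights copied from the table above.
const WEIGHTS: FactorScores = {
  editDistance: 0.4,
  phonetic: 0.2,
  keyboard: 0.2,
  frequency: 0.1,
  context: 0.1,
};

function combineFactors(scores: FactorScores, alternativeCount: number): number {
  let confidence =
    WEIGHTS.editDistance * scores.editDistance +
    WEIGHTS.phonetic * scores.phonetic +
    WEIGHTS.keyboard * scores.keyboard +
    WEIGHTS.frequency * scores.frequency +
    WEIGHTS.context * scores.context;

  if (alternativeCount === 0) {
    confidence *= 1.15; // unique suggestion
  } else if (alternativeCount > 2) {
    confidence *= 0.85; // many competing alternatives
  }
  // Clamping is an assumption so the boost cannot push the score past 1.0.
  return Math.min(1, Math.max(0, confidence));
}

const example: FactorScores = {
  editDistance: 0.9,
  phonetic: 1,
  keyboard: 0.5,
  frequency: 0.8,
  context: 0.5,
};
console.log(combineFactors(example, 1).toFixed(2)); // weighted sum only
console.log(combineFactors(example, 0).toFixed(2)); // boosted by the unique-suggestion modifier
```

Because edit distance carries 40% of the weight, a one-edit typo with a single candidate tends to land in the auto-fix band even when the phonetic and keyboard factors are only moderate.
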
#### `decideAction(suggestions, confidence): CorrectionDecision`

```typescript
enum CorrectionConfidence {
  AUTO_FIX = 'auto-fix',   // ≥ 0.70 — safe to auto-correct
  SUGGEST = 'suggest',     // ≥ 0.50 — show suggestions to user
  POSSIBLE = 'possible',   // ≥ 0.30 — might be intentional
  IGNORE = 'ignore',       // < 0.30 — probably intentional
}

interface CorrectionDecision {
  action: CorrectionConfidence;
  confidence: number;
  suggestion?: string;      // Best suggestion (for AUTO_FIX / SUGGEST)
  alternatives?: string[];  // Additional options (for SUGGEST / POSSIBLE)
  reason: string;           // Human-readable explanation
}
```

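The threshold bands map onto actions as in this small sketch. `classify` and `Thresholds` are hypothetical names used for illustration; the default values are the ones listed in `ConfidenceScorerOptions`.

```typescript
type Action = 'auto-fix' | 'suggest' | 'possible' | 'ignore';

interface Thresholds {
  autoFix: number;
  suggest: number;
  possible: number;
}

// Defaults as documented for ConfidenceScorerOptions.
const DEFAULT_THRESHOLDS: Thresholds = { autoFix: 0.7, suggest: 0.5, possible: 0.3 };

/** Map a confidence score to an action band; lower bounds are inclusive. */
function classify(confidence: number, t: Thresholds = DEFAULT_THRESHOLDS): Action {
  if (confidence >= t.autoFix) return 'auto-fix';
  if (confidence >= t.suggest) return 'suggest';
  if (confidence >= t.possible) return 'possible';
  return 'ignore';
}

console.log(classify(0.87)); // 'auto-fix'
console.log(classify(0.55)); // 'suggest'
console.log(classify(0.87, { autoFix: 0.9, suggest: 0.5, possible: 0.3 })); // 'suggest'
```

Raising `autoFix` is the usual way to make automatic correction more conservative without hiding suggestions from the user.
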
#### `isTechnicalIdentifier(word: string): boolean`

Detects camelCase, PascalCase, snake_case, CONST_CASE, and `get`/`set`/`is`/`has`/`with`/`on` prefixed identifiers.

#### `adjustForTechnicalContext(baseConfidence, word, inCodeBlock?): number`

Reduces confidence for technical patterns:
- camelCase / snake_case → ×0.5
- Version strings (`v1.2.3`) → ×0.3
- Hex values (`a1b2c3`) → ×0.2
- Inside code blocks → ×0.7

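A rough sketch of how these multipliers might be applied is shown below. The regular expressions are illustrative guesses, not the library's detection rules, and the sketch assumes the multipliers stack when several conditions hold; `adjustForTechnical` is a hypothetical name.

```typescript
// Illustrative patterns only; the library's actual rules are not shown in this reference.
const CAMEL_OR_SNAKE = /^[a-z]+(?:[A-Z][a-z0-9]*)+$|^[A-Za-z0-9]+(?:_[A-Za-z0-9]+)+$/;
const VERSION = /^v?\d+(?:\.\d+)+$/;
// Require at least one digit so ordinary words such as "decade" are not treated as hex.
const HEX = /^(?=.*\d)[0-9a-fA-F]{6,}$/;

function adjustForTechnical(base: number, word: string, inCodeBlock = false): number {
  let confidence = base;
  if (CAMEL_OR_SNAKE.test(word)) confidence *= 0.5;
  if (VERSION.test(word)) confidence *= 0.3;
  if (HEX.test(word)) confidence *= 0.2;
  if (inCodeBlock) confidence *= 0.7; // assumed to stack with the pattern multipliers
  return confidence;
}

console.log(adjustForTechnical(0.8, 'userName').toFixed(2)); // halved: identifier-like token
console.log(adjustForTechnical(0.8, 'v1.2.3').toFixed(2)); // version string
console.log(adjustForTechnical(0.8, 'decade').toFixed(2)); // ordinary word, unchanged
```
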
---

## TypoManager

Coordinates split-word and joined-word pattern detection. Does **not** handle single-word typo correction (that's the engine's job).

```typescript
import { TypoManager } from '@lilith/text-processing-utils';
```

### Constructor

```typescript
new TypoManager(enableSplitWords?: boolean, enableJoinedWords?: boolean)
// Both default to true
```

### Methods

```typescript
setDictionaryChecker(checker: (word: string) => boolean): void

// Split-word detection
detectSplitWords(text: string): SplitWordDetection[]
checkWordPair(word1: string, word2: string): SplitWordDetection | null
addSplitWordPattern(splitForm: string, correctForm: string, confidence?: number): void
setSplitWordDetection(enabled: boolean): void
isSplitWordDetectionEnabled(): boolean

// Joined-word detection
detectJoinedWords(text: string): JoinedWordDetection[]
addJoinedWordPattern(joinedForm: string, splitWords: string[], confidence?: number, category?: string): void
setJoinedWordDetection(enabled: boolean): void
isJoinedWordDetectionEnabled(): boolean
getJoinedWordDetector(): JoinedWordDetector

// Management
clearDetectorCaches(): void
getStats(): { splitWordDetectionEnabled, joinedWordDetectionEnabled, joinedWordPatternsEnabled }
```

---

## SplitWordDetector

Detects accidentally split words: `"ist he"` → `"is the"`, `"th e"` → `"the"`.

```typescript
import { SplitWordDetector } from '@lilith/text-processing-utils';
```

### Detection Methods

48+ hardcoded regex patterns for common splits, plus dictionary-based pair checking for unknown patterns.

```typescript
interface SplitWordDetection {
  originalText: string;         // "ist he"
  splitWords: string[];         // ["ist", "he"]
  suggestedCorrection: string;  // "is the"
  confidence: number;           // 0.0–1.0
  startPosition: number;
  endPosition: number;
  pattern?: SplitWordPattern;
}
```

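Dictionary-based pair checking can be pictured with the sketch below: two adjacent tokens are flagged when at least one of them is unknown and their concatenation is a dictionary word. This is an assumption about the general technique, not the detector's actual code, and `findSplitPairs` is a hypothetical name. It does not cover boundary shifts such as `"ist he"` → `"is the"`, which the hardcoded patterns would have to handle.

```typescript
type DictionaryChecker = (word: string) => boolean;

interface PairFinding {
  original: string; // the two tokens as written, e.g. "ot her"
  suggestion: string; // the joined dictionary word, e.g. "other"
}

/** Flag adjacent tokens whose concatenation is a word while at least one token is not. */
function findSplitPairs(text: string, isWord: DictionaryChecker): PairFinding[] {
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const findings: PairFinding[] = [];
  for (let i = 0; i + 1 < tokens.length; i++) {
    const left = tokens[i].toLowerCase();
    const right = tokens[i + 1].toLowerCase();
    const eitherUnknown = !isWord(left) || !isWord(right);
    if (eitherUnknown && isWord(left + right)) {
      findings.push({ original: `${tokens[i]} ${tokens[i + 1]}`, suggestion: left + right });
      i++; // the right-hand token has been consumed
    }
  }
  return findings;
}

// Toy dictionary standing in for the real dictionary checker.
const knownWords: Record<string, true> = { the: true, her: true, other: true, cat: true };
const isKnown: DictionaryChecker = (w) => knownWords[w] === true;

console.log(findSplitPairs('the ot her cat', isKnown));
// [{ original: 'ot her', suggestion: 'other' }]
```

Requiring at least one unknown token keeps valid pairs such as "the cat" from being merged just because some concatenation happens to exist in the dictionary.
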
### Constructor

```typescript
new SplitWordDetector(
  dictionaryChecker?: (word: string) => boolean,
  maxWordSequence?: number  // default: 4
)
```

### Key Methods

```typescript
detectSplitWords(text: string): SplitWordDetection[]
checkWordPair(word1: string, word2: string): SplitWordDetection | null
addCustomPattern(splitForm: string, correctForm: string, confidence?: number, context?: string): void
getKnownPatterns(): Map<string, SplitWordPattern>
hasPattern(splitForm: string): boolean
clearCache(): void
getCacheStats(): { detectionCacheSize, dictionaryCacheSize, maxCacheSize }
```

---

## JoinedWordDetector

Detects accidentally joined words: `"javascript"` → `"java script"`, `"testcase"` → `"test case"`.

```typescript
import { JoinedWordDetector } from '@lilith/text-processing-utils';
```

### Detection Methods

Three detection strategies in priority order:
1. **Known patterns** — hardcoded compound words (confidence: 0.95)
2. **Dictionary-based** — split at every position, check both halves against dictionary
3. **Heuristic** — camelCase splitting (when `enableHeuristics: true`)

```typescript
interface JoinedWordDetection {
  originalWord: string;    // "javascript"
  suggestedSplit: string;  // "java script"
  splitWords: string[];    // ["java", "script"]
  confidence: number;
  startPosition: number;
  endPosition: number;
  patternType: 'known-pattern' | 'dictionary-based' | 'heuristic';
  reason: string;
}
```

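The dictionary-based strategy (split at every position, check both halves) can be sketched as follows. `splitJoinedWord` is a hypothetical name, and the default minimum part length of 2 is an assumption, since this reference does not state the default for `minSplitLength`.

```typescript
/**
 * Try every split position and keep those where both halves are dictionary words.
 * Returns all valid [head, tail] pairs in left-to-right order.
 */
function splitJoinedWord(
  word: string,
  isWord: (candidate: string) => boolean,
  minSplitLength = 2, // assumed default; parts shorter than this are not considered
): Array<[string, string]> {
  const lower = word.toLowerCase();
  const splits: Array<[string, string]> = [];
  for (let i = minSplitLength; i <= lower.length - minSplitLength; i++) {
    const head = lower.slice(0, i);
    const tail = lower.slice(i);
    if (isWord(head) && isWord(tail)) {
      splits.push([head, tail]);
    }
  }
  return splits;
}

// Toy dictionary standing in for the real dictionary checker.
const vocabulary: Record<string, true> = { test: true, case: true, java: true, script: true };
const inVocabulary = (w: string): boolean => vocabulary[w] === true;

console.log(splitJoinedWord('testcase', inVocabulary)); // [['test', 'case']]
console.log(splitJoinedWord('tests', inVocabulary)); // []
```

A real detector would presumably skip words that are themselves in the dictionary and rank multiple valid splits, for example by word frequency; neither step is shown here.
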
### Constructor

```typescript
new JoinedWordDetector(options?: JoinedWordDetectorOptions)
```

```typescript
interface JoinedWordDetectorOptions {
  maxWordLength?: number;        // Skip words longer than this
  minSplitLength?: number;       // Minimum length for split parts
  enableHeuristics?: boolean;    // Enable camelCase heuristic (default: false)
  confidenceThreshold?: number;  // Filter results below this threshold
  maxCacheSize?: number;         // LRU cache size
}
```

---

## Feature System

SOLID-based pluggable feature architecture. Each feature implements `SpellCheckFeature` and can be composed via `FeatureManager`.

```typescript
interface SpellCheckFeature<TConfig = unknown> {
  name: string;
  enabled: boolean;
  initialize(): Promise<void>;
  checkText(text: string): Promise<FeatureResult[]>;
  configure(options: Partial<TConfig>): void;
}

interface FeatureResult {
  type: string;
  originalText: string;
  suggestedCorrection: string;
  confidence: number;
  startPosition: number;
  endPosition: number;
  metadata?: Record<string, unknown>;
}
```

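As an illustration of the interface above, here is a sketch of a custom feature that flags immediately repeated words. `RepeatedWordFeature` and its config are hypothetical and not among the library's built-in features; the two interfaces are copied locally so the example stands on its own.

```typescript
// Local copies of the interfaces documented above, so the sketch is self-contained.
interface FeatureResult {
  type: string;
  originalText: string;
  suggestedCorrection: string;
  confidence: number;
  startPosition: number;
  endPosition: number;
  metadata?: Record<string, unknown>;
}

interface SpellCheckFeature<TConfig = unknown> {
  name: string;
  enabled: boolean;
  initialize(): Promise<void>;
  checkText(text: string): Promise<FeatureResult[]>;
  configure(options: Partial<TConfig>): void;
}

interface RepeatedWordConfig {
  confidence: number;
}

// Hypothetical feature: flags "the the" style repetitions.
class RepeatedWordFeature implements SpellCheckFeature<RepeatedWordConfig> {
  name = 'repeated-word-detection';
  enabled = true;
  private config: RepeatedWordConfig = { confidence: 0.9 };

  async initialize(): Promise<void> {
    // Nothing to load for this feature.
  }

  configure(options: Partial<RepeatedWordConfig>): void {
    this.config = { ...this.config, ...options };
  }

  // Synchronous core, kept separate from checkText so it is easy to unit test.
  findRepeats(text: string): FeatureResult[] {
    const results: FeatureResult[] = [];
    const pattern = /\b(\w+)\s+\1\b/gi;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(text)) !== null) {
      results.push({
        type: 'repeated-word',
        originalText: match[0],
        suggestedCorrection: match[1],
        confidence: this.config.confidence,
        startPosition: match.index,
        endPosition: match.index + match[0].length,
      });
    }
    return results;
  }

  async checkText(text: string): Promise<FeatureResult[]> {
    return this.enabled ? this.findRepeats(text) : [];
  }
}

const repeats = new RepeatedWordFeature().findRepeats('I saw the the cat');
console.log(repeats[0].originalText, '->', repeats[0].suggestedCorrection); // the the -> the
```
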
### FeatureManager

```typescript
import {
  FeatureManager,
  CapitalizationFeature,
  GrammarPatternFeature,
  PunctuationFeature,
} from '@lilith/text-processing-utils';

const manager = new FeatureManager();
manager.addFeature(new CapitalizationFeature());
manager.addFeature(new GrammarPatternFeature());
manager.addFeature(new PunctuationFeature());

await manager.initializeAll();
const results = await manager.checkText('i went too the store.');
// Deduplicates overlapping results automatically
```

### Available Features

| Feature | `name` | Detects |
|---------|--------|---------|
| `SplitWordFeature` | `split-word-detection` | Accidentally split words |
| `JoinedWordFeature` | `joined-word-detection` | Accidentally joined words |
| `CapitalizationFeature` | `capitalization-detection` | Sentence-start, proper nouns, acronyms, title case |
| `GrammarPatternFeature` | `grammar-pattern-detection` | Articles, homophones, contractions, agreement, double negatives |
| `PunctuationFeature` | `punctuation-detection` | Missing/extra punctuation, quote style, bracket matching |
| `HomophoneFeature` | `homophone-detection` | there/their/they're, its/it's, etc. |
| `RedundancyFeature` | `redundancy-detection` | Redundant phrases ("ATM machine", "free gift") |
| `AbbreviationFeature` | `abbreviation-detection` | Abbreviation consistency |
| `TechnicalConsistencyFeature` | `technical-consistency` | Technical term consistency (JavaScript vs Javascript) |

Each feature has a corresponding `*Factory` class with presets:

```typescript
import { GrammarPatternFeatureFactory } from '@lilith/text-processing-utils';

const strict = GrammarPatternFeatureFactory.createStrict();
const relaxed = GrammarPatternFeatureFactory.createRelaxed();
const technical = GrammarPatternFeatureFactory.createTechnicalWriting();
const custom = GrammarPatternFeatureFactory.createCustom({ checkHomophones: true, checkArticles: false });
```

---

## Dictionary System

### DictionaryManager

Manages multiple named dictionaries with priority ordering.

```typescript
import { DictionaryManager, NodeDictionaryLoader } from '@lilith/text-processing-utils';

const manager = new DictionaryManager(new NodeDictionaryLoader('/path/to/data'));
await manager.initialize([
  { name: 'english', type: 'english', priority: 1 },
  { name: 'technical', type: 'technical', priority: 2 },
]);

manager.contains('function');                // true (checks all)
manager.contains('kubectl', ['technical']);  // true (checks specific)
manager.getSuggestions('functon', 5);        // ['function', ...]
manager.addWordToDictionary('myterm', 'custom');
```

### Dictionary Types

| Class | Loads From |
|-------|-----------|
| `EnglishDictionary` | `english-words.txt` via loader |
| `TechnicalDictionary` | `technical-terms.txt` via loader |
| `CustomDictionary` | In-memory, optionally seeded with word list |

### DictionaryDataLoader

```typescript
interface DictionaryDataLoader {
  loadText(path: string): Promise<string>;
  exists(path: string): Promise<boolean>;
}
```

Two implementations:
- `NodeDictionaryLoader(rootPath: string)` — `fs.readFile` based
- `FetchDictionaryLoader(baseUrl: string)` — HTTP `fetch` based (browser)

### DictionaryPersistence

Import/export dictionaries to disk as JSON or plain text.

```typescript
import { DictionaryPersistence } from '@lilith/text-processing-utils';

const persistence = new DictionaryPersistence('.dictionaries');
await persistence.saveDictionary(myDict);
await persistence.exportAsText(myDict, 'output.txt');
const imported = await persistence.importFromText('input.txt', 'my-dict');
await persistence.backupDictionaries('backup/');
```

---

## Utility Classes

### BloomFilter

Probabilistic set membership. Used internally for fast negative lookups.

```typescript
import { BloomFilter } from '@lilith/text-processing-utils';

const filter = new BloomFilter(10000, 0.01);  // 10K items, 1% false positive rate
filter.add('hello');
filter.addMany(['world', 'foo']);
filter.mightContain('hello');         // true
filter.definitelyNotContains('xyz');  // true here; a true result guarantees 'xyz' was never added
filter.getStats();                    // { size, numHashes, itemCount, saturation, estimatedFPR }

// Serialize
const data = filter.export();
const restored = BloomFilter.import(data);
```

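For a sense of what the two constructor arguments imply, the standard Bloom filter sizing formulas are sketched below. Whether the library's `BloomFilter` uses exactly these formulas is an assumption; `bloomParameters` is a hypothetical helper.

```typescript
/**
 * Standard Bloom filter sizing for n expected items and target false positive rate p:
 *   bits   m = -n * ln(p) / (ln 2)^2
 *   hashes k = (m / n) * ln 2
 */
function bloomParameters(
  expectedItems: number,
  falsePositiveRate: number,
): { bits: number; hashes: number } {
  const bits = Math.ceil(
    (-expectedItems * Math.log(falsePositiveRate)) / (Math.LN2 * Math.LN2),
  );
  const hashes = Math.max(1, Math.round((bits / expectedItems) * Math.LN2));
  return { bits, hashes };
}

const sizing = bloomParameters(10000, 0.01);
console.log(sizing); // roughly 96K bits (about 12 KB) and 7 hash functions
```

The small footprint is why a filter like this suits fast negative lookups: a word the filter rejects can skip the slower dictionary check entirely, while a word it accepts still needs confirming.
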
### LRUCache / TTLCache

```typescript
import { TTLCache } from '@lilith/text-processing-utils';

const cache = new TTLCache<string, number>(1000, 60_000);  // 1K entries, 60s TTL
cache.set('key', 42);
cache.get('key');     // 42
cache.getStats();     // { size, maxSize, hits, misses, hitRate }
cache.prune();        // Remove expired entries, returns count removed
```

---

## Result Types

```typescript
interface SpellCheckResult {
  word: string;
  correct: boolean;
  suggestions: string[];
  confidence: number;
  position?: { start: number; end: number; line?: number; column?: number };
  correctionDecision?: CorrectionDecision;
}

interface BatchSpellCheckResult {
  errors: SpellCheckError[];
  stats: {
    totalWords: number;
    misspelledWords: number;
    correctedWords: number;
    ignoredWords: number;
    processingTime: number;
  };
}

interface SpellCheckError {
  type: 'misspelling' | 'grammar' | 'capitalization' | 'punctuation' | 'split-word' | 'joined-word';
  word: string;
  message: string;
  suggestions: string[];
  severity: 'error' | 'warning' | 'info';
  position: { start: number; end: number; line?: number; column?: number };
  confidence?: number;
  correctionAction?: string;
  splitWords?: string[];
}
```

---

## Correction Strategies

```typescript
import { AutoCorrector, ContextualCorrector } from '@lilith/text-processing-utils';

// Auto-correct above confidence threshold
const auto = new AutoCorrector(0.75);
auto.shouldApply('teh');              // true
auto.correct('teh', ['the', 'tea']);  // 'the'

// Context-aware correction
const contextual = new ContextualCorrector();
contextual.correct('teh', ['the', 'tea'], 'I went to teh store');  // 'the'
```

Both implement `CorrectionStrategy`:

```typescript
interface CorrectionStrategy {
  name: string;
  correct(word: string, suggestions: string[], context?: string): string | null;
  shouldApply(word: string, context?: string): boolean;
}
```

---

## Web Worker Integration

For browser use, the spellcheck system runs in a Web Worker to avoid blocking the main thread.

```
Main Thread                             Worker Thread
─────────────                           ─────────────
useSpellcheck() hook                    spellcheck.worker.ts
  │                                     │
  ├─► postMessage({ type: 'init' }) ───►│
  │                                     ├─► new SymSpellEngine()
  │                                     ├─► engine.init() (loads WASM + dict ~2s)
  │                                     ├─► new SpellChecker({ engine })
  │◄── { type: 'ready' } ───────────────┤
  │                                     │
  ├─► postMessage({ type: 'check', ────►│
  │     text: '...' })                  ├─► checker.checkText(text)
  │                                     │
  │◄── { type: 'result', ───────────────┤
  │     errors: [...] }                 │
  │                                     │
  └─► SpellcheckOverlay renders         │
        • Green underlines (auto-fix)   │
        • Cyan underlines (suggestions) │
        • 5s countdown for approval     │
```

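The message exchange in the diagram could be typed and handled along the following lines. This is a sketch, not the contents of `spellcheck.worker.ts`: the union types are inferred from the diagram, and `parseRequest`, `createHandler`, and `CheckerLike` are hypothetical names.

```typescript
// Message shapes inferred from the diagram above.
type WorkerRequest = { type: 'init' } | { type: 'check'; text: string };
type WorkerResponse = { type: 'ready' } | { type: 'result'; errors: unknown[] };

// Minimal view of the checker that the worker needs.
interface CheckerLike {
  checkText(text: string): Promise<{ errors: unknown[] }>;
}

/** Validate untyped message data before acting on it. */
function parseRequest(data: unknown): WorkerRequest | null {
  if (typeof data !== 'object' || data === null) return null;
  const msg = data as { type?: unknown; text?: unknown };
  if (msg.type === 'init') return { type: 'init' };
  if (msg.type === 'check' && typeof msg.text === 'string') {
    return { type: 'check', text: msg.text };
  }
  return null;
}

/** Build the worker-side handler; makeChecker would create and initialise the engine. */
function createHandler(makeChecker: () => Promise<CheckerLike>) {
  let checker: CheckerLike | null = null;
  return async function handle(message: WorkerRequest): Promise<WorkerResponse> {
    switch (message.type) {
      case 'init':
        checker = await makeChecker();
        return { type: 'ready' };
      case 'check': {
        if (checker === null) throw new Error('check received before init');
        const report = await checker.checkText(message.text);
        return { type: 'result', errors: report.errors };
      }
    }
  };
}

// Inside a worker this might be wired up roughly as:
//   const handle = createHandler(async () => { /* build engine + SpellChecker */ });
//   self.onmessage = async (e) => {
//     const req = parseRequest(e.data);
//     if (req) self.postMessage(await handle(req));
//   };
console.log(parseRequest({ type: 'check', text: 'teh' })); // { type: 'check', text: 'teh' }
```

Keeping the handler separate from `self.onmessage` means the protocol can be exercised in ordinary unit tests without a real Worker.
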
### useSpellcheck Hook

```typescript
import { useSpellcheck } from './hooks/useSpellcheck';

const { errors, isReady, isChecking } = useSpellcheck(textValue, {
  debounceMs: 300,
  autoApproveConfidence: 0.5,
  timeoutMode: 'auto-approve',
});
```

### SpellcheckOverlay Component

```tsx
import { SpellcheckOverlay } from './components/SpellcheckOverlay';

<SpellcheckOverlay
  text={textValue}
  errors={errors}
  onApprove={(correction) => applyCorrection(correction)}
  onDismiss={(error) => dismissError(error)}
/>
```