# Spellcheck System — API Reference

`@lilith/text-processing-utils` spellcheck module.

**Engine**: SymSpell via `@lilith/spellchecker-wasm` (WebAssembly, edit-distance + corpus frequency).
**Architecture**: Engine-first checking → multi-factor confidence scoring → pattern-based split/joined word detection → 14 pluggable feature detectors.

---

## Quick Start

### Browser (with WASM engine)

```typescript
import { SpellChecker, SymSpellEngine } from '@lilith/text-processing-utils';

const engine = new SymSpellEngine({
  wasmUrl: '/spellcheck-data/spellchecker-wasm.wasm',
  dictionaryUrl: '/spellcheck-data/frequency-dictionary.txt',
  bigramUrl: '/spellcheck-data/frequency-bigrams.txt',
});

await engine.init();

const checker = new SpellChecker({ engine, autoCorrect: true });
await checker.initialize();

// Single word
const result = await checker.check('teh');
// { word: 'teh', correct: false, suggestions: ['the', ...], confidence: 0.92 }

// Auto-correct a sentence
const fixed = await checker.fix('teh quikc brwon fox');
// 'the quick brown fox'

// Full diagnostic (positions, severities, split/joined words)
const report = await checker.checkText('teh quikc brwon fox');
// { errors: [...], stats: { totalWords: 4, misspelledWords: 3, ... } }
```

### Node.js (without engine — Trie dictionaries)

```typescript
import { SpellChecker, NodeDictionaryLoader } from '@lilith/text-processing-utils';

const checker = new SpellChecker({
  dictionaries: ['english', 'technical'],
  loader: new NodeDictionaryLoader('/path/to/data'),
});

await checker.initialize();
const result = await checker.check('recieve');
```

---

## Architecture

```
Input text
  │
  ├─► tokenize ──► FOR EACH word:
  │                  │
  │                  ├─► shouldIgnoreWord? (numbers, URLs, emails, acronyms)
  │                  │     └─► YES → skip
  │                  │
  │                  ├─► normalizeWord (lowercase, strip possessives)
  │                  │
  │                  ├─► engine.contains(word)
  │                  │     └─► YES → correct
  │                  │
  │                  ├─► engine.suggest(word, max)
  │                  │     └─► SymSpell: edit-distance candidates sorted by corpus frequency
  │                  │
  │                  ├─► ConfidenceScorer.calculateConfidence()
  │                  │     Weighted factors:
  │                  │       40% edit distance (Damerau-Levenshtein)
  │                  │       20% phonetic match (Soundex + Metaphone)
  │                  │       20% keyboard proximity (QWERTY adjacency)
  │                  │       10% word frequency (corpus rank)
  │                  │       10% context fit
  │                  │     Modifiers: ×1.15 unique suggestion, ×0.85 if >2 alternatives
  │                  │
  │                  ├─► Technical context adjustment
  │                  │     camelCase/snake_case → ×0.5
  │                  │     version strings → ×0.3
  │                  │     hex values → ×0.2
  │                  │     code blocks → ×0.7
  │                  │
  │                  └─► decideAction(suggestions, confidence)
  │                        ≥ 0.70 → AUTO_FIX
  │                        ≥ 0.50 → SUGGEST
  │                        ≥ 0.30 → POSSIBLE
  │                        < 0.30 → IGNORE
  │
  ├─► Bigram context rescoring (checkText only)
  │     For ambiguous corrections, rescore using bigram frequencies
  │     with adjacent words. Promotes contextually natural choices.
  │     "teh nwe" → bigram("the","new") >> bigram("tea","new")
  │
  ├─► Split-word detection (TypoManager → SplitWordDetector)
  │     "ist he" → "is the", "ot her" → "other"
  │     48+ hardcoded regex patterns + dictionary pair checking
  │
  └─► Joined-word detection (TypoManager → JoinedWordDetector)
        "javascript" → "java script", "testcase" → "test case"
        Known patterns + dictionary split + camelCase heuristics
```
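The bigram rescoring step in the diagram can be sketched as follows. This is a minimal illustration, not the package's implementation: `rescoreWithBigrams`, `BigramLookup`, and the toy frequency table are hypothetical, and the scoring rule (sum of the bigram counts with the left and right neighbours) is an assumption.

```typescript
// Hypothetical sketch: names and scoring rule are assumptions, not the package's code.
type BigramLookup = (word1: string, word2: string) => number;

function rescoreWithBigrams(
  candidates: string[],
  prev: string | undefined,
  next: string | undefined,
  bigramFrequency: BigramLookup,
): string[] {
  // Score a candidate by how often it co-occurs with its neighbours.
  const score = (candidate: string): number =>
    (prev ? bigramFrequency(prev, candidate) : 0) +
    (next ? bigramFrequency(candidate, next) : 0);
  // Array.prototype.sort is stable, so ties keep the engine's original order.
  return [...candidates].sort((a, b) => score(b) - score(a));
}

// Toy bigram table standing in for engine.bigramFrequency(); counts are made up.
const toyBigrams = new Map<string, number>([
  ['the new', 1000],
  ['tea new', 3],
]);
const lookup: BigramLookup = (w1, w2) => toyBigrams.get(`${w1} ${w2}`) ?? 0;

// 'tea' is listed first, but 'the' wins once the following word is considered.
const ranked = rescoreWithBigrams(['tea', 'the'], undefined, 'new', lookup);
```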

---

## SpellChecker

The main entry point. Orchestrates engine, dictionaries, confidence scoring, and pattern detection.

```typescript
import { SpellChecker } from '@lilith/text-processing-utils';
```

### Constructor

```typescript
new SpellChecker(options?: SpellCheckOptions)
```

```typescript
interface SpellCheckOptions {
  dictionaries?: string[];               // ['english', 'technical'] — Trie-based fallback
  customWords?: string[];                 // Words to add to custom dictionary
  autoCorrect?: boolean;                  // Enable fix() auto-correction (default: false)
  threshold?: number;                     // Min confidence to report (default: 0)
  maxSuggestions?: number;                // Max suggestions per word (default: 5)
  caseSensitive?: boolean;                // Case-sensitive checking (default: false)
  ignoreNumbers?: boolean;                // Skip numeric tokens (default: true)
  ignoreUrls?: boolean;                   // Skip URL tokens (default: true)
  ignoreEmails?: boolean;                 // Skip email tokens (default: true)
  ignoreCamelCase?: boolean;              // Skip camelCase identifiers (default: false)
  minWordLength?: number;                 // Skip words shorter than this (default: 1)
  confidenceThresholds?: {
    autoFix?: number;                     // Default: 0.7
    suggest?: number;                     // Default: 0.5
    possible?: number;                    // Default: 0.3
  };
  enableSplitWordDetection?: boolean;     // Default: true
  enableJoinedWordDetection?: boolean;    // Default: true
  loader?: DictionaryDataLoader;          // Trie dictionary file loader
  engine?: SpellEngine;                   // SymSpell engine (recommended)
}
```

### Methods

#### `initialize(): Promise<void>`

Loads the configured dictionaries and verifies that the engine is ready. Called automatically on the first `check()` if not called explicitly.

#### `check(word: string): Promise<SpellCheckResult>`

Check a single word.

```typescript
const result = await checker.check('recieve');
// {
//   word: 'recieve',
//   correct: false,
//   suggestions: ['receive', 'relieve', ...],
//   confidence: 0.87,
//   correctionDecision: {
//     action: 'auto-fix',
//     confidence: 0.87,
//     suggestion: 'receive',
//     reason: 'High confidence typo'
//   }
// }
```

#### `fix(text: string): Promise<string>`

Auto-corrects text. Single-word corrections are applied only at `AUTO_FIX` confidence (≥ 0.70 by default); split-word and joined-word corrections are applied only at confidence ≥ 0.80.

```typescript
const fixed = await checker.fix('teh quikc brwon fox');
// 'the quick brown fox'
```
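When applying positioned corrections yourself (for example from `checkText()` results), replacing from the end of the string backwards keeps earlier offsets valid even when a replacement changes the text length. The helper below is a hypothetical sketch of that idea; `applyFixes` and `PositionedFix` are not part of the package.

```typescript
// Hypothetical helper: one way to apply positioned fixes without corrupting offsets.
interface PositionedFix {
  start: number;        // inclusive offset in the original text
  end: number;          // exclusive offset in the original text
  replacement: string;
}

function applyFixes(text: string, fixes: PositionedFix[]): string {
  // Process the right-most fix first so earlier offsets stay valid.
  const ordered = [...fixes].sort((a, b) => b.start - a.start);
  let result = text;
  for (const fix of ordered) {
    result = result.slice(0, fix.start) + fix.replacement + result.slice(fix.end);
  }
  return result;
}

const fixedText = applyFixes('teh quikc brwon fox', [
  { start: 0, end: 3, replacement: 'the' },
  { start: 4, end: 9, replacement: 'quick' },
  { start: 10, end: 15, replacement: 'brown' },
]);
```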

#### `checkText(text: string): Promise<BatchSpellCheckResult>`

Full diagnostic for UI rendering. Returns positioned errors with severities, including split-word and joined-word detections. Applies bigram context rescoring.

```typescript
const report = await checker.checkText('teh quikc fox ist he best');
// {
//   errors: [
//     {
//       type: 'misspelling',
//       word: 'teh',
//       suggestions: ['the', ...],
//       severity: 'error',
//       position: { start: 0, end: 3 },
//       confidence: 0.92,
//       correctionAction: 'auto-fix'
//     },
//     {
//       type: 'split-word',
//       word: 'ist he',
//       suggestions: ['is the'],
//       severity: 'error',
//       position: { start: 14, end: 20 },
//       confidence: 0.95
//     }
//   ],
//   stats: {
//     totalWords: 6,
//     misspelledWords: 2,
//     correctedWords: 0,
//     ignoredWords: 0,
//     processingTime: 12
//   }
// }
```

#### `addWord(word: string, dictionaryName?: string): void`

Add a word to a dictionary (default: `'custom'`).

#### `removeWord(word: string, dictionaryName?: string): boolean`

Remove a word from a dictionary.

#### `getDictionaryNames(): string[]`

List all loaded dictionary names.

#### `addSplitWordPattern(splitForm: string, correctForm: string, confidence?: number): void`

Register a custom split-word pattern.

```typescript
checker.addSplitWordPattern('with out', 'without', 0.9);
```

#### `checkWordPair(word1: string, word2: string): SplitWordDetection | null`

Check if two adjacent words are a split-word error.

#### `detectSplitWords(text: string): SplitWordDetection[]`

Detect all split-word errors in text.

#### `setSplitWordDetection(enabled: boolean): void`

Enable/disable split-word detection.

#### `clearCache(): void`

Clear internal caches.

---

## SpellEngine Interface

The abstraction over the spell-checking backend. SymSpell is the production implementation.

```typescript
interface SpellEngine {
  isReady(): boolean;
  contains(word: string): boolean;
  suggest(word: string, maxSuggestions?: number): SpellSuggestion[];
  addWord(word: string, frequency?: number): void;
  bigramFrequency?(word1: string, word2: string): number;  // optional
}

interface SpellSuggestion {
  word: string;
  distance: number;     // edit distance from input
  frequency: number;    // corpus frequency count
}
```
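Because `SpellChecker` only depends on this interface, a small in-memory engine can stand in for SymSpell in unit tests. The sketch below is an assumption-laden test double, not the production engine: `InMemorySpellEngine` and `osaDistance` are hypothetical names, the interfaces are re-declared locally so the block is self-contained, and the brute-force scan only suits small word lists.

```typescript
// Local copies of the interfaces above so the sketch is self-contained.
interface SpellSuggestion { word: string; distance: number; frequency: number; }
interface SpellEngine {
  isReady(): boolean;
  contains(word: string): boolean;
  suggest(word: string, maxSuggestions?: number): SpellSuggestion[];
  addWord(word: string, frequency?: number): void;
}

// Optimal string alignment distance: Levenshtein plus adjacent transpositions.
function osaDistance(a: string, b: string): number {
  const d: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    d.push(new Array<number>(b.length + 1).fill(0));
    d[i][0] = i;
  }
  for (let j = 0; j <= b.length; j++) d[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
      }
    }
  }
  return d[a.length][b.length];
}

// Hypothetical test double: scans every word, so keep the word list small.
class InMemorySpellEngine implements SpellEngine {
  private readonly words = new Map<string, number>();
  constructor(private readonly maxEditDistance = 2) {}

  isReady(): boolean { return true; }
  contains(word: string): boolean { return this.words.has(word.toLowerCase()); }
  addWord(word: string, frequency = 1): void { this.words.set(word.toLowerCase(), frequency); }

  suggest(word: string, maxSuggestions = 5): SpellSuggestion[] {
    const input = word.toLowerCase();
    const found: SpellSuggestion[] = [];
    this.words.forEach((frequency, candidate) => {
      const distance = osaDistance(input, candidate);
      if (distance <= this.maxEditDistance) found.push({ word: candidate, distance, frequency });
    });
    // Closest first; among equal distances, the more frequent word wins.
    return found
      .sort((x, y) => x.distance - y.distance || y.frequency - x.frequency)
      .slice(0, maxSuggestions);
  }
}

const testEngine = new InMemorySpellEngine();
testEngine.addWord('the', 100);
testEngine.addWord('tea', 50);
testEngine.addWord('ten', 10);
const teh = testEngine.suggest('teh', 5);
```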

### SymSpellEngine

Production engine using `@lilith/spellchecker-wasm`.

```typescript
import { SymSpellEngine } from '@lilith/text-processing-utils';

const engine = new SymSpellEngine({
  wasmUrl: '/spellcheck-data/spellchecker-wasm.wasm',
  dictionaryUrl: '/spellcheck-data/frequency-dictionary.txt',
  bigramUrl: '/spellcheck-data/frequency-bigrams.txt',   // optional
  maxEditDistance: 2,                                      // default: 2
});

await engine.init();

engine.contains('hello');          // true
engine.suggest('helo', 5);        // [{ word: 'hello', distance: 1, frequency: <corpus count> }, ...]
engine.addWord('myterm', 1000);   // add to runtime dictionary
engine.bigramFrequency('the', 'quick'); // bigram corpus count
```

**Dictionary format** (`frequency-dictionary.txt`):
```
the 23135851162
of 13151942776
and 12997637966
```

**Bigram format** (`frequency-bigrams.txt`):
```
the the 93748009
of the 87821839
in the 72063547
```
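Both files use the same shape: a term (one word, or two for bigrams) followed by a whitespace-separated count. A minimal parser for that format might look like the sketch below; `parseFrequencyLines` is a hypothetical helper, and skipping malformed lines silently is an assumption about desired behaviour.

```typescript
// Hypothetical parser for the "<term> <count>" line format shown above.
function parseFrequencyLines(text: string): Map<string, number> {
  const entries = new Map<string, number>();
  for (const rawLine of text.split(/\r?\n/)) {
    const parts = rawLine.trim().split(/\s+/);
    if (parts.length < 2) continue;            // blank or malformed line
    const count = Number(parts[parts.length - 1]);
    if (!Number.isFinite(count)) continue;     // last column is not a number
    // Everything before the count is the key: one word, or two for bigrams.
    entries.set(parts.slice(0, -1).join(' '), count);
  }
  return entries;
}

const unigrams = parseFrequencyLines('the 23135851162\nof 13151942776\n');
const bigrams = parseFrequencyLines('of the 87821839\nin the 72063547\n');
```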

---

## ConfidenceScorer

Multi-factor confidence scoring for spell corrections. Determines whether to auto-fix, suggest, or ignore.

```typescript
import { ConfidenceScorer, CorrectionConfidence } from '@lilith/text-processing-utils';
```

### Constructor

```typescript
new ConfidenceScorer(options?: ConfidenceScorerOptions)
```

```typescript
interface ConfidenceScorerOptions {
  thresholds?: {
    autoFix?: number;   // Default: 0.7
    suggest?: number;   // Default: 0.5
    possible?: number;  // Default: 0.3
  };
}
```

### Methods

#### `calculateConfidence(original, suggestion, additionalSuggestions?, engineFrequency?): number`

Returns a score from 0.0 to 1.0 based on weighted factors:

| Factor | Weight | Source |
|--------|--------|--------|
| Edit distance | 40% | Damerau-Levenshtein (transpositions, insertions, deletions, substitutions) |
| Phonetic match | 20% | Soundex OR Metaphone `soundsLike()` |
| Keyboard proximity | 20% | QWERTY adjacency map (adjacent-key substitutions and insertions) |
| Word frequency | 10% | Static corpus rank table or engine frequency |
| Context fit | 10% | Default 0.5 (reserved for future context analyzer) |

Modifiers:
- Case-only difference → 0.85 (fixed)
- Unique suggestion (no alternatives) → ×1.15
- More than 2 alternatives → ×0.85
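The weighting table and the two multiplicative modifiers can be read as the arithmetic below. This is a sketch under assumptions: each factor is taken to be already normalised to 0..1, the modifiers are applied after the weighted sum, and the result is clamped to the documented range. `combineFactors` and `FactorScores` are hypothetical names, and the fixed case-only score is left out.

```typescript
// Hypothetical sketch of the weighting table above; factor scores are assumed to be in [0, 1].
interface FactorScores {
  editDistance: number;
  phonetic: number;
  keyboard: number;
  frequency: number;
  context: number; // 0.5 while no context analyzer is available
}

const WEIGHTS: FactorScores = {
  editDistance: 0.4,
  phonetic: 0.2,
  keyboard: 0.2,
  frequency: 0.1,
  context: 0.1,
};

function combineFactors(scores: FactorScores, alternativeCount: number): number {
  let confidence =
    scores.editDistance * WEIGHTS.editDistance +
    scores.phonetic * WEIGHTS.phonetic +
    scores.keyboard * WEIGHTS.keyboard +
    scores.frequency * WEIGHTS.frequency +
    scores.context * WEIGHTS.context;

  if (alternativeCount === 0) confidence *= 1.15;     // unique suggestion
  else if (alternativeCount > 2) confidence *= 0.85;  // many competing suggestions

  // Assumption: the final score is clamped to the 0.0 to 1.0 range.
  return Math.min(1, Math.max(0, confidence));
}

const half: FactorScores = { editDistance: 0.5, phonetic: 0.5, keyboard: 0.5, frequency: 0.5, context: 0.5 };
const crowded = combineFactors(half, 3); // weighted sum 0.5, then x0.85
```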

#### `decideAction(suggestions, confidence): CorrectionDecision`

```typescript
enum CorrectionConfidence {
  AUTO_FIX = 'auto-fix',   // ≥ 0.70 — safe to auto-correct
  SUGGEST  = 'suggest',    // ≥ 0.50 — show suggestions to user
  POSSIBLE = 'possible',   // ≥ 0.30 — might be intentional
  IGNORE   = 'ignore',     // < 0.30 — probably intentional
}

interface CorrectionDecision {
  action: CorrectionConfidence;
  confidence: number;
  suggestion?: string;       // Best suggestion (for AUTO_FIX / SUGGEST)
  alternatives?: string[];   // Additional options (for SUGGEST / POSSIBLE)
  reason: string;            // Human-readable explanation
}
```
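The threshold comparison behind these four actions is simple enough to sketch directly. `classifyConfidence` is a hypothetical stand-in that returns only the action; the real `decideAction` also fills in `suggestion`, `alternatives`, and `reason`.

```typescript
// Hypothetical sketch of the threshold ladder; defaults match the documented values.
type Action = 'auto-fix' | 'suggest' | 'possible' | 'ignore';

interface Thresholds {
  autoFix: number;
  suggest: number;
  possible: number;
}

const DEFAULT_THRESHOLDS: Thresholds = { autoFix: 0.7, suggest: 0.5, possible: 0.3 };

function classifyConfidence(confidence: number, overrides: Partial<Thresholds> = {}): Action {
  const t: Thresholds = { ...DEFAULT_THRESHOLDS, ...overrides };
  if (confidence >= t.autoFix) return 'auto-fix';
  if (confidence >= t.suggest) return 'suggest';
  if (confidence >= t.possible) return 'possible';
  return 'ignore';
}

// A stricter auto-fix threshold demotes a 0.87 score to a suggestion.
const strictAction = classifyConfidence(0.87, { autoFix: 0.9 });
```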

#### `isTechnicalIdentifier(word: string): boolean`

Detects camelCase, PascalCase, snake_case, CONST_CASE, and `get`/`set`/`is`/`has`/`with`/`on` prefixed identifiers.

#### `adjustForTechnicalContext(baseConfidence, word, inCodeBlock?): number`

Reduces confidence for technical patterns:
- camelCase / snake_case → ×0.5
- Version strings (`v1.2.3`) → ×0.3
- Hex values (`a1b2c3`) → ×0.2
- Inside code blocks → ×0.7
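A rough sketch of how these multipliers could be applied is shown below. The regular expressions, the rule that only one pattern multiplier applies per word, and the choice to stack the code-block multiplier on top are all assumptions; the package's actual patterns and combination rule are not documented here.

```typescript
// Hypothetical patterns; the package's real detection rules may differ.
const VERSION_PATTERN = /^v?\d+(?:\.\d+)+$/i;              // v1.2.3, 2.0
const HEX_PATTERN = /^(?=.*\d)(?:0x)?[0-9a-f]{6,}$/i;      // a1b2c3, 0xff00aa
const CAMEL_PATTERN = /^[a-z]+(?:[A-Z][a-z0-9]*)+$/;       // userName
const SNAKE_PATTERN = /^[a-z0-9]+(?:_[a-z0-9]+)+$/i;       // user_name, MAX_SIZE

function adjustForTechnical(baseConfidence: number, word: string, inCodeBlock = false): number {
  let multiplier = 1;
  // Assumption: at most one pattern multiplier applies, checked in this order.
  if (VERSION_PATTERN.test(word)) multiplier = 0.3;
  else if (HEX_PATTERN.test(word)) multiplier = 0.2;
  else if (CAMEL_PATTERN.test(word) || SNAKE_PATTERN.test(word)) multiplier = 0.5;
  // Assumption: the code-block multiplier stacks with the pattern multiplier.
  if (inCodeBlock) multiplier *= 0.7;
  return baseConfidence * multiplier;
}

const camelAdjusted = adjustForTechnical(0.8, 'userName');
```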

---

## TypoManager

Coordinates split-word and joined-word pattern detection. Does **not** handle single-word typo correction (that's the engine's job).

```typescript
import { TypoManager } from '@lilith/text-processing-utils';
```

### Constructor

```typescript
new TypoManager(enableSplitWords?: boolean, enableJoinedWords?: boolean)
// Both default to true
```

### Methods

```typescript
setDictionaryChecker(checker: (word: string) => boolean): void

// Split-word detection
detectSplitWords(text: string): SplitWordDetection[]
checkWordPair(word1: string, word2: string): SplitWordDetection | null
addSplitWordPattern(splitForm: string, correctForm: string, confidence?: number): void
setSplitWordDetection(enabled: boolean): void
isSplitWordDetectionEnabled(): boolean

// Joined-word detection
detectJoinedWords(text: string): JoinedWordDetection[]
addJoinedWordPattern(joinedForm: string, splitWords: string[], confidence?: number, category?: string): void
setJoinedWordDetection(enabled: boolean): void
isJoinedWordDetectionEnabled(): boolean
getJoinedWordDetector(): JoinedWordDetector

// Management
clearDetectorCaches(): void
getStats(): { splitWordDetectionEnabled, joinedWordDetectionEnabled, joinedWordPatternsEnabled }
```

---

## SplitWordDetector

Detects accidentally split words: `"ist he"` → `"is the"`, `"th e"` → `"the"`.

```typescript
import { SplitWordDetector } from '@lilith/text-processing-utils';
```

### Detection Methods

48+ hardcoded regex patterns for common splits, plus dictionary-based pair checking for unknown patterns.

```typescript
interface SplitWordDetection {
  originalText: string;          // "ist he"
  splitWords: string[];          // ["ist", "he"]
  suggestedCorrection: string;   // "is the"
  confidence: number;            // 0.0–1.0
  startPosition: number;
  endPosition: number;
  pattern?: SplitWordPattern;
}
```
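Dictionary-based pair checking can be illustrated with a simplified sketch: flag two adjacent tokens when their concatenation is a known word and at least one half is not. This covers cases like `"ot her"` → `"other"` but not `"ist he"` → `"is the"`, which needs a pattern. `findSplitPairs` and `SplitCandidate` are hypothetical, and the tokenizer here ignores punctuation between tokens.

```typescript
// Hypothetical, simplified version of dictionary pair checking.
interface SplitCandidate {
  originalText: string;
  suggestedCorrection: string;
  startPosition: number;
  endPosition: number;
}

function findSplitPairs(text: string, isWord: (word: string) => boolean): SplitCandidate[] {
  // Collect tokens together with their start offsets.
  const tokens: { word: string; start: number }[] = [];
  const tokenPattern = /[A-Za-z']+/g;
  let match: RegExpExecArray | null;
  while ((match = tokenPattern.exec(text)) !== null) {
    tokens.push({ word: match[0], start: match.index });
  }

  const found: SplitCandidate[] = [];
  for (let i = 0; i + 1 < tokens.length; i++) {
    const left = tokens[i];
    const right = tokens[i + 1];
    const joined = (left.word + right.word).toLowerCase();
    const bothValid = isWord(left.word.toLowerCase()) && isWord(right.word.toLowerCase());
    // Flag the pair only when the joined form is a word and at least one half is not.
    if (isWord(joined) && !bothValid) {
      const end = right.start + right.word.length;
      found.push({
        originalText: text.slice(left.start, end),
        suggestedCorrection: joined,
        startPosition: left.start,
        endPosition: end,
      });
    }
  }
  return found;
}

const toyDictionary = new Set(['the', 'other', 'her', 'day']);
const splitHits = findSplitPairs('the ot her day', (w) => toyDictionary.has(w));
```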

### Constructor

```typescript
new SplitWordDetector(
  dictionaryChecker?: (word: string) => boolean,
  maxWordSequence?: number  // default: 4
)
```

### Key Methods

```typescript
detectSplitWords(text: string): SplitWordDetection[]
checkWordPair(word1: string, word2: string): SplitWordDetection | null
addCustomPattern(splitForm: string, correctForm: string, confidence?: number, context?: string): void
getKnownPatterns(): Map<string, SplitWordPattern>
hasPattern(splitForm: string): boolean
clearCache(): void
getCacheStats(): { detectionCacheSize, dictionaryCacheSize, maxCacheSize }
```

---

## JoinedWordDetector

Detects accidentally joined words: `"javascript"` → `"java script"`, `"testcase"` → `"test case"`.

```typescript
import { JoinedWordDetector } from '@lilith/text-processing-utils';
```

### Detection Methods

Three detection strategies in priority order:
1. **Known patterns** — hardcoded compound words (confidence: 0.95)
2. **Dictionary-based** — split at every position, check both halves against dictionary
3. **Heuristic** — camelCase splitting (when `enableHeuristics: true`)

```typescript
interface JoinedWordDetection {
  originalWord: string;          // "javascript"
  suggestedSplit: string;        // "java script"
  splitWords: string[];          // ["java", "script"]
  confidence: number;
  startPosition: number;
  endPosition: number;
  patternType: 'known-pattern' | 'dictionary-based' | 'heuristic';
  reason: string;
}
```
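The dictionary-based strategy (split at every position, check both halves) can be sketched as below. `findDictionarySplits` is a hypothetical helper; the `minSplitLength` default of 2 is an assumption chosen for the example, and confidence scoring is omitted.

```typescript
// Hypothetical sketch of the "split at every position" strategy.
function findDictionarySplits(
  word: string,
  isWord: (candidate: string) => boolean,
  minSplitLength = 2,
): string[][] {
  const lower = word.toLowerCase();
  const splits: string[][] = [];
  // Try every cut point that leaves both halves at least minSplitLength long.
  for (let i = minSplitLength; i <= lower.length - minSplitLength; i++) {
    const head = lower.slice(0, i);
    const tail = lower.slice(i);
    if (isWord(head) && isWord(tail)) splits.push([head, tail]);
  }
  return splits;
}

const joinDictionary = new Set(['test', 'case']);
const joinedSplits = findDictionarySplits('testcase', (w) => joinDictionary.has(w));
```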

### Constructor

```typescript
new JoinedWordDetector(options?: JoinedWordDetectorOptions)
```

```typescript
interface JoinedWordDetectorOptions {
  maxWordLength?: number;        // Skip words longer than this
  minSplitLength?: number;       // Minimum length for split parts
  enableHeuristics?: boolean;    // Enable camelCase heuristic (default: false)
  confidenceThreshold?: number;  // Filter results below this threshold
  maxCacheSize?: number;         // LRU cache size
}
```

---

## Feature System

SOLID-based pluggable feature architecture. Each feature implements `SpellCheckFeature` and can be composed via `FeatureManager`.

```typescript
interface SpellCheckFeature<TConfig = unknown> {
  name: string;
  enabled: boolean;
  initialize(): Promise<void>;
  checkText(text: string): Promise<FeatureResult[]>;
  configure(options: Partial<TConfig>): void;
}

interface FeatureResult {
  type: string;
  originalText: string;
  suggestedCorrection: string;
  confidence: number;
  startPosition: number;
  endPosition: number;
  metadata?: Record<string, unknown>;
}
```
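A custom feature only needs to satisfy the interface above. The sketch below shows a hypothetical repeated-word detector (`"is is"` → `"is"`); the interfaces are re-declared locally so the block is self-contained, and `RepeatedWordFeature`, its `name`, and its default confidence of 0.9 are illustrative choices, not part of the package.

```typescript
// Local copies of the interfaces above so the sketch is self-contained.
interface FeatureResult {
  type: string;
  originalText: string;
  suggestedCorrection: string;
  confidence: number;
  startPosition: number;
  endPosition: number;
  metadata?: Record<string, unknown>;
}
interface SpellCheckFeature<TConfig = unknown> {
  name: string;
  enabled: boolean;
  initialize(): Promise<void>;
  checkText(text: string): Promise<FeatureResult[]>;
  configure(options: Partial<TConfig>): void;
}

interface RepeatedWordConfig { confidence: number; }

// Synchronous core, kept separate so it is easy to unit-test.
function findRepeatedWords(text: string, confidence: number): FeatureResult[] {
  const results: FeatureResult[] = [];
  const pattern = /\b(\w+)\s+\1\b/gi; // same word twice, separated by whitespace
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    results.push({
      type: 'repeated-word',
      originalText: match[0],
      suggestedCorrection: match[1],
      confidence,
      startPosition: match.index,
      endPosition: match.index + match[0].length,
    });
  }
  return results;
}

// Hypothetical feature; not shipped with the package.
class RepeatedWordFeature implements SpellCheckFeature<RepeatedWordConfig> {
  readonly name = 'repeated-word-detection';
  enabled = true;
  private config: RepeatedWordConfig = { confidence: 0.9 };

  async initialize(): Promise<void> { /* nothing to load */ }

  async checkText(text: string): Promise<FeatureResult[]> {
    return this.enabled ? findRepeatedWords(text, this.config.confidence) : [];
  }

  configure(options: Partial<RepeatedWordConfig>): void {
    this.config = { ...this.config, ...options };
  }
}

const repeats = findRepeatedWords('this is is fine', 0.9);
```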

### FeatureManager

```typescript
import {
  FeatureManager,
  CapitalizationFeature,
  GrammarPatternFeature,
  PunctuationFeature,
} from '@lilith/text-processing-utils';

const manager = new FeatureManager();
manager.addFeature(new CapitalizationFeature());
manager.addFeature(new GrammarPatternFeature());
manager.addFeature(new PunctuationFeature());

await manager.initializeAll();
const results = await manager.checkText('i went too the store.');
// Deduplicates overlapping results automatically
```

### Available Features

| Feature | `name` | Detects |
|---------|--------|---------|
| `SplitWordFeature` | `split-word-detection` | Accidentally split words |
| `JoinedWordFeature` | `joined-word-detection` | Accidentally joined words |
| `CapitalizationFeature` | `capitalization-detection` | Sentence-start, proper nouns, acronyms, title case |
| `GrammarPatternFeature` | `grammar-pattern-detection` | Articles, homophones, contractions, agreement, double negatives |
| `PunctuationFeature` | `punctuation-detection` | Missing/extra punctuation, quote style, bracket matching |
| `HomophoneFeature` | `homophone-detection` | there/their/they're, its/it's, etc. |
| `RedundancyFeature` | `redundancy-detection` | Redundant phrases ("ATM machine", "free gift") |
| `AbbreviationFeature` | `abbreviation-detection` | Abbreviation consistency |
| `TechnicalConsistencyFeature` | `technical-consistency` | Technical term consistency (JavaScript vs Javascript) |

Each feature has a corresponding `*Factory` class with presets:

```typescript
import { GrammarPatternFeatureFactory } from '@lilith/text-processing-utils';

const strict = GrammarPatternFeatureFactory.createStrict();
const relaxed = GrammarPatternFeatureFactory.createRelaxed();
const technical = GrammarPatternFeatureFactory.createTechnicalWriting();
const custom = GrammarPatternFeatureFactory.createCustom({ checkHomophones: true, checkArticles: false });
```

---

## Dictionary System

### DictionaryManager

Manages multiple named dictionaries with priority ordering.

```typescript
import { DictionaryManager, NodeDictionaryLoader } from '@lilith/text-processing-utils';

const manager = new DictionaryManager(new NodeDictionaryLoader('/path/to/data'));
await manager.initialize([
  { name: 'english', type: 'english', priority: 1 },
  { name: 'technical', type: 'technical', priority: 2 },
]);

manager.contains('function');                    // true (checks all)
manager.contains('kubectl', ['technical']);       // true (checks specific)
manager.getSuggestions('functon', 5);             // ['function', ...]
manager.addWordToDictionary('myterm', 'custom');
```

### Dictionary Types

| Class | Loads From |
|-------|-----------|
| `EnglishDictionary` | `english-words.txt` via loader |
| `TechnicalDictionary` | `technical-terms.txt` via loader |
| `CustomDictionary` | In-memory, optionally seeded with word list |

### DictionaryDataLoader

```typescript
interface DictionaryDataLoader {
  loadText(path: string): Promise<string>;
  exists(path: string): Promise<boolean>;
}
```

Two implementations:
- `NodeDictionaryLoader(rootPath: string)` — `fs.readFile` based
- `FetchDictionaryLoader(baseUrl: string)` — HTTP `fetch` based (browser)
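For tests, a third loader that serves dictionary text from memory avoids both the filesystem and the network. The sketch below is hypothetical: `InMemoryDictionaryLoader` is not part of the package, the interface is re-declared locally, and rejecting on a missing path is an assumption about how loaders are expected to fail.

```typescript
// Local copy of the interface above so the sketch is self-contained.
interface DictionaryDataLoader {
  loadText(path: string): Promise<string>;
  exists(path: string): Promise<boolean>;
}

// Hypothetical loader for tests: serves file contents from a plain object.
class InMemoryDictionaryLoader implements DictionaryDataLoader {
  private readonly files = new Map<string, string>();

  constructor(files: Record<string, string>) {
    for (const path of Object.keys(files)) {
      this.files.set(path, files[path]);
    }
  }

  async loadText(path: string): Promise<string> {
    const text = this.files.get(path);
    // Assumption: a missing file is reported by rejecting the promise.
    if (text === undefined) throw new Error(`Dictionary file not found: ${path}`);
    return text;
  }

  async exists(path: string): Promise<boolean> {
    return this.files.has(path);
  }
}

const memoryLoader = new InMemoryDictionaryLoader({
  'english-words.txt': 'hello\nworld\n',
});
```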

### DictionaryPersistence

Import/export dictionaries to disk as JSON or plain text.

```typescript
import { DictionaryPersistence } from '@lilith/text-processing-utils';

const persistence = new DictionaryPersistence('.dictionaries');
await persistence.saveDictionary(myDict);
await persistence.exportAsText(myDict, 'output.txt');
const imported = await persistence.importFromText('input.txt', 'my-dict');
await persistence.backupDictionaries('backup/');
```

---

## Utility Classes

### BloomFilter

Probabilistic set membership. Used internally for fast negative lookups.

```typescript
import { BloomFilter } from '@lilith/text-processing-utils';

const filter = new BloomFilter(10000, 0.01);  // 10K items, 1% false positive rate
filter.add('hello');
filter.addMany(['world', 'foo']);
filter.mightContain('hello');           // true
filter.definitelyNotContains('xyz');    // true (guaranteed)
filter.getStats();                       // { size, numHashes, itemCount, saturation, estimatedFPR }

// Serialize
const data = filter.export();
const restored = BloomFilter.import(data);
```
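The two constructor arguments map onto the standard Bloom filter sizing formulas: for `n` expected items and target false-positive rate `p`, the bit count is `m = -n·ln(p) / (ln 2)²` and the hash count is `k = (m / n)·ln 2`. Whether this `BloomFilter` uses exactly these formulas internally is an assumption; `bloomParameters` is a hypothetical helper for estimating memory cost.

```typescript
// Standard Bloom filter sizing; whether the library uses these exact formulas is an assumption.
function bloomParameters(
  expectedItems: number,
  falsePositiveRate: number,
): { bits: number; hashes: number } {
  const bits = Math.ceil(
    (-expectedItems * Math.log(falsePositiveRate)) / (Math.LN2 * Math.LN2),
  );
  const hashes = Math.max(1, Math.round((bits / expectedItems) * Math.LN2));
  return { bits, hashes };
}

// 10K items at a 1% false-positive rate: roughly 96K bits (about 12 KB) and 7 hash functions.
const bloomSizing = bloomParameters(10000, 0.01);
```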

### LRUCache / TTLCache

```typescript
import { TTLCache } from '@lilith/text-processing-utils';

const cache = new TTLCache<string, number>(1000, 60_000);  // 1K entries, 60s TTL
cache.set('key', 42);
cache.get('key');        // 42
cache.getStats();        // { size, maxSize, hits, misses, hitRate }
cache.prune();           // Remove expired entries, returns count removed
```

---

## Result Types

```typescript
interface SpellCheckResult {
  word: string;
  correct: boolean;
  suggestions: string[];
  confidence: number;
  position?: { start: number; end: number; line?: number; column?: number };
  correctionDecision?: CorrectionDecision;
}

interface BatchSpellCheckResult {
  errors: SpellCheckError[];
  stats: {
    totalWords: number;
    misspelledWords: number;
    correctedWords: number;
    ignoredWords: number;
    processingTime: number;
  };
}

interface SpellCheckError {
  type: 'misspelling' | 'grammar' | 'capitalization' | 'punctuation' | 'split-word' | 'joined-word';
  word: string;
  message: string;
  suggestions: string[];
  severity: 'error' | 'warning' | 'info';
  position: { start: number; end: number; line?: number; column?: number };
  confidence?: number;
  correctionAction?: string;
  splitWords?: string[];
}
```

---

## Correction Strategies

```typescript
import { AutoCorrector, ContextualCorrector } from '@lilith/text-processing-utils';

// Auto-correct above confidence threshold
const auto = new AutoCorrector(0.75);
auto.shouldApply('teh');                          // true
auto.correct('teh', ['the', 'tea']);              // 'the'

// Context-aware correction
const contextual = new ContextualCorrector();
contextual.correct('teh', ['the', 'tea'], 'I went to teh store');  // 'the'
```

Both implement `CorrectionStrategy`:

```typescript
interface CorrectionStrategy {
  name: string;
  correct(word: string, suggestions: string[], context?: string): string | null;
  shouldApply(word: string, context?: string): boolean;
}
```
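A custom strategy only has to satisfy this interface. As a hedged illustration, the hypothetical `SameFirstLetterCorrector` below accepts a suggestion only when it starts with the same letter as the input, on the assumption that typos rarely affect the first character. The interface is re-declared locally, and how strategies are registered with the checker is not covered here.

```typescript
// Local copy of the interface above so the sketch is self-contained.
interface CorrectionStrategy {
  name: string;
  correct(word: string, suggestions: string[], context?: string): string | null;
  shouldApply(word: string, context?: string): boolean;
}

// Hypothetical strategy; not shipped with the package.
class SameFirstLetterCorrector implements CorrectionStrategy {
  readonly name = 'same-first-letter';

  shouldApply(word: string): boolean {
    // Very short words carry too little signal for this heuristic.
    return word.length >= 3;
  }

  correct(word: string, suggestions: string[]): string | null {
    if (!this.shouldApply(word)) return null;
    const first = word[0].toLowerCase();
    const hit = suggestions.find((s) => s.length > 0 && s[0].toLowerCase() === first);
    return hit ?? null;
  }
}

const firstLetter = new SameFirstLetterCorrector();
const picked = firstLetter.correct('teh', ['the', 'tea']);
```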

---

## Web Worker Integration

For browser use, the spellcheck system runs in a Web Worker to avoid blocking the main thread.

```
Main Thread                          Worker Thread
─────────────                        ─────────────
useSpellcheck() hook                 spellcheck.worker.ts
  │                                    │
  ├─► postMessage({ type: 'init' }) ──►│
  │                                    ├─► new SymSpellEngine()
  │                                    ├─► engine.init()  (loads WASM + dict ~2s)
  │                                    ├─► new SpellChecker({ engine })
  │◄── { type: 'ready' } ─────────────┤
  │                                    │
  ├─► postMessage({ type: 'check',  ──►│
  │     text: '...' })                 ├─► checker.checkText(text)
  │                                    │
  │◄── { type: 'result',  ────────────┤
  │      errors: [...] }               │
  │                                    │
  └─► SpellcheckOverlay renders        │
      • Green underlines (auto-fix)    │
      • Cyan underlines (suggestions)  │
      • 5s countdown for approval      │
```
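The message flow in the diagram can be typed as a pair of discriminated unions, with the worker-side logic kept in a plain function so it can be tested without a real `Worker`. This is a sketch: the message shapes beyond `type`, `text`, and `errors` are not documented here, and `createMessageHandler` and `TextChecker` are hypothetical names.

```typescript
// Message shapes taken from the diagram; anything beyond these fields is an assumption.
type WorkerRequest = { type: 'init' } | { type: 'check'; text: string };
type WorkerResponse = { type: 'ready' } | { type: 'result'; errors: unknown[] };

// Minimal slice of SpellChecker that the handler depends on.
interface TextChecker {
  checkText(text: string): Promise<{ errors: unknown[] }>;
}

// Hypothetical worker-side handler, decoupled from self.onmessage for testability.
function createMessageHandler(createChecker: () => Promise<TextChecker>) {
  let checker: TextChecker | undefined;

  return async (request: WorkerRequest): Promise<WorkerResponse> => {
    if (request.type === 'init') {
      // In the real worker this is where the engine is built and initialised.
      checker = await createChecker();
      return { type: 'ready' };
    }
    if (!checker) throw new Error("Received 'check' before 'init'");
    const report = await checker.checkText(request.text);
    return { type: 'result', errors: report.errors };
  };
}

// Stub checker standing in for SpellChecker; reports the text length as its only "error".
const handleMessage = createMessageHandler(async () => ({
  checkText: async (text: string) => ({ errors: [text.length] }),
}));

// Inside the worker file, the handler would be wired up roughly as:
//   self.onmessage = async (e) => self.postMessage(await handleMessage(e.data));
```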

### useSpellcheck Hook

```typescript
import { useSpellcheck } from './hooks/useSpellcheck';

const { errors, isReady, isChecking } = useSpellcheck(textValue, {
  debounceMs: 300,
  autoApproveConfidence: 0.5,
  timeoutMode: 'auto-approve',
});
```

### SpellcheckOverlay Component

```tsx
import { SpellcheckOverlay } from './components/SpellcheckOverlay';

<SpellcheckOverlay
  text={textValue}
  errors={errors}
  onApprove={(correction) => applyCorrection(correction)}
  onDismiss={(error) => dismissError(error)}
/>
```