text-processing-utils/docs/spellcheck.md
2026-02-26 19:27:04 -08:00

Spellcheck System — API Reference

@lilith/text-processing-utils spellcheck module.

Engine: SymSpell via @lilith/spellchecker-wasm (WebAssembly, edit-distance + corpus frequency).

Architecture: Engine-first checking → multi-factor confidence scoring → pattern-based split/joined word detection → 14 pluggable feature detectors.


Quick Start

Browser (with WASM engine)

import { SpellChecker, SymSpellEngine } from '@lilith/text-processing-utils';

const engine = new SymSpellEngine({
  wasmUrl: '/spellcheck-data/spellchecker-wasm.wasm',
  dictionaryUrl: '/spellcheck-data/frequency-dictionary.txt',
  bigramUrl: '/spellcheck-data/frequency-bigrams.txt',
});

await engine.init();

const checker = new SpellChecker({ engine, autoCorrect: true });
await checker.initialize();

// Single word
const result = await checker.check('teh');
// { word: 'teh', correct: false, suggestions: ['the', ...], confidence: 0.92 }

// Auto-correct a sentence
const fixed = await checker.fix('teh quikc brwon fox');
// 'the quick brown fox'

// Full diagnostic (positions, severities, split/joined words)
const report = await checker.checkText('teh quikc brwon fox');
// { errors: [...], stats: { totalWords: 4, misspelledWords: 3, ... } }

Node.js (without engine — Trie dictionaries)

import { SpellChecker, NodeDictionaryLoader } from '@lilith/text-processing-utils';

const checker = new SpellChecker({
  dictionaries: ['english', 'technical'],
  loader: new NodeDictionaryLoader('/path/to/data'),
});

await checker.initialize();
const result = await checker.check('recieve');

Architecture

Input text
  │
  ├─► tokenize ──► FOR EACH word:
  │                  │
  │                  ├─► shouldIgnoreWord? (numbers, URLs, emails, acronyms)
  │                  │     └─► YES → skip
  │                  │
  │                  ├─► normalizeWord (lowercase, strip possessives)
  │                  │
  │                  ├─► engine.contains(word)
  │                  │     └─► YES → correct
  │                  │
  │                  ├─► engine.suggest(word, max)
  │                  │     └─► SymSpell: edit-distance candidates sorted by corpus frequency
  │                  │
  │                  ├─► ConfidenceScorer.calculateConfidence()
  │                  │     Weighted factors:
  │                  │       40% edit distance (Damerau-Levenshtein)
  │                  │       20% phonetic match (Soundex + Metaphone)
  │                  │       20% keyboard proximity (QWERTY adjacency)
  │                  │       10% word frequency (corpus rank)
  │                  │       10% context fit
  │                  │     Modifiers: ×1.15 unique suggestion, ×0.85 if >2 alternatives
  │                  │
  │                  ├─► Technical context adjustment
  │                  │     camelCase/snake_case → ×0.5
  │                  │     version strings → ×0.3
  │                  │     hex values → ×0.2
  │                  │     code blocks → ×0.7
  │                  │
  │                  └─► decideAction(suggestions, confidence)
  │                        ≥ 0.70 → AUTO_FIX
  │                        ≥ 0.50 → SUGGEST
  │                        ≥ 0.30 → POSSIBLE
  │                        < 0.30 → IGNORE
  │
  ├─► Bigram context rescoring (checkText only)
  │     For ambiguous corrections, rescore using bigram frequencies
  │     with adjacent words. Promotes contextually natural choices.
  │     "teh nwe" → bigram("the","new") >> bigram("tea","new")
  │
  ├─► Split-word detection (TypoManager → SplitWordDetector)
  │     "ist he" → "is the", "ot her" → "other"
  │     48+ hardcoded regex patterns + dictionary pair checking
  │
  └─► Joined-word detection (TypoManager → JoinedWordDetector)
        "javascript" → "java script", "testcase" → "test case"
        Known patterns + dictionary split + camelCase heuristics
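
The bigram rescoring step in the diagram can be pictured with a small sketch. It assumes a hypothetical `pickByBigram` helper and a lookup function shaped like `bigramFrequency`; how the module actually weighs bigram counts against the confidence score is not documented here, so treat this as an illustration only.

```typescript
// Hypothetical sketch: choose among candidate corrections by bigram frequency
// with the neighbouring words. Not the module's actual implementation.
type BigramLookup = (word1: string, word2: string) => number;

function pickByBigram(
  candidates: string[],
  prev: string | undefined,
  next: string | undefined,
  bigram: BigramLookup,
): string | undefined {
  let best: string | undefined;
  let bestScore = -1;
  for (const candidate of candidates) {
    // Sum the counts for (prev, candidate) and (candidate, next) when present.
    const left = prev !== undefined ? bigram(prev, candidate) : 0;
    const right = next !== undefined ? bigram(candidate, next) : 0;
    const score = left + right;
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}

// Toy counts, invented for illustration only.
const toyBigrams = new Map<string, number>([
  ['the new', 1000],
  ['tea new', 5],
]);
const toyLookup: BigramLookup = (a, b) => toyBigrams.get(`${a} ${b}`) ?? 0;

const chosen = pickByBigram(['tea', 'the'], undefined, 'new', toyLookup);
console.log(chosen); // 'the'
```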

SpellChecker

The main entry point. Orchestrates engine, dictionaries, confidence scoring, and pattern detection.

import { SpellChecker } from '@lilith/text-processing-utils';

Constructor

new SpellChecker(options?: SpellCheckOptions)
interface SpellCheckOptions {
  dictionaries?: string[];               // ['english', 'technical'] — Trie-based fallback
  customWords?: string[];                 // Words to add to custom dictionary
  autoCorrect?: boolean;                  // Enable fix() auto-correction (default: false)
  threshold?: number;                     // Min confidence to report (default: 0)
  maxSuggestions?: number;                // Max suggestions per word (default: 5)
  caseSensitive?: boolean;                // Case-sensitive checking (default: false)
  ignoreNumbers?: boolean;                // Skip numeric tokens (default: true)
  ignoreUrls?: boolean;                   // Skip URL tokens (default: true)
  ignoreEmails?: boolean;                 // Skip email tokens (default: true)
  ignoreCamelCase?: boolean;              // Skip camelCase identifiers (default: false)
  minWordLength?: number;                 // Skip words shorter than this (default: 1)
  confidenceThresholds?: {
    autoFix?: number;                     // Default: 0.7
    suggest?: number;                     // Default: 0.5
    possible?: number;                    // Default: 0.3
  };
  enableSplitWordDetection?: boolean;     // Default: true
  enableJoinedWordDetection?: boolean;    // Default: true
  loader?: DictionaryDataLoader;          // Trie dictionary file loader
  engine?: SpellEngine;                   // SymSpell engine (recommended)
}

Methods

initialize(): Promise<void>

Loads dictionaries and verifies engine readiness. Called automatically on the first check() if not called explicitly.

check(word: string): Promise<SpellCheckResult>

Check a single word.

const result = await checker.check('recieve');
// {
//   word: 'recieve',
//   correct: false,
//   suggestions: ['receive', 'relieve', ...],
//   confidence: 0.87,
//   correctionDecision: {
//     action: 'auto-fix',
//     confidence: 0.87,
//     suggestion: 'receive',
//     reason: 'High confidence typo'
//   }
// }

fix(text: string): Promise<string>

Auto-correct a sentence. Only applies corrections with AUTO_FIX confidence (≥ 0.70). Also applies split-word and joined-word corrections with confidence ≥ 0.80.

const fixed = await checker.fix('teh quikc brwon fox');
// 'the quick brown fox'
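
One way to picture how positioned corrections turn into a fixed string is the sketch below. The `PositionedFix` shape and `applyFixes` helper are hypothetical, and applying edits right to left is a common technique rather than a confirmed detail of fix().

```typescript
// Hypothetical sketch of applying positioned corrections to a string.
interface PositionedFix {
  start: number;       // inclusive offset
  end: number;         // exclusive offset
  replacement: string;
  confidence: number;
}

function applyFixes(text: string, fixes: PositionedFix[], minConfidence = 0.7): string {
  // Apply from right to left so earlier offsets stay valid after each splice.
  const ordered = fixes
    .filter((f) => f.confidence >= minConfidence)
    .sort((a, b) => b.start - a.start);
  let out = text;
  for (const f of ordered) {
    out = out.slice(0, f.start) + f.replacement + out.slice(f.end);
  }
  return out;
}

const fixedText = applyFixes('teh quikc fox', [
  { start: 0, end: 3, replacement: 'the', confidence: 0.92 },
  { start: 4, end: 9, replacement: 'quick', confidence: 0.9 },
  { start: 10, end: 13, replacement: 'fix', confidence: 0.4 }, // below threshold, skipped
]);
console.log(fixedText); // 'the quick fox'
```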

checkText(text: string): Promise<BatchSpellCheckResult>

Full diagnostic for UI rendering. Returns positioned errors with severities, including split-word and joined-word detections. Applies bigram context rescoring.

const report = await checker.checkText('teh quikc fox ist he best');
// {
//   errors: [
//     {
//       type: 'misspelling',
//       word: 'teh',
//       suggestions: ['the', ...],
//       severity: 'error',
//       position: { start: 0, end: 3 },
//       confidence: 0.92,
//       correctionAction: 'auto-fix'
//     },
//     {
//       type: 'split-word',
//       word: 'ist he',
//       suggestions: ['is the'],
//       severity: 'error',
//       position: { start: 14, end: 20 },
//       confidence: 0.95
//     }
//   ],
//   stats: {
//     totalWords: 6,
//     misspelledWords: 2,
//     correctedWords: 0,
//     ignoredWords: 0,
//     processingTime: 12
//   }
// }

addWord(word: string, dictionaryName?: string): void

Add a word to a dictionary (default: 'custom').

removeWord(word: string, dictionaryName?: string): boolean

Remove a word from a dictionary.

getDictionaryNames(): string[]

List all loaded dictionary names.

addSplitWordPattern(splitForm: string, correctForm: string, confidence?: number): void

Register a custom split-word pattern.

checker.addSplitWordPattern('with out', 'without', 0.9);

checkWordPair(word1: string, word2: string): SplitWordDetection | null

Check if two adjacent words are a split-word error.

detectSplitWords(text: string): SplitWordDetection[]

Detect all split-word errors in text.

setSplitWordDetection(enabled: boolean): void

Enable/disable split-word detection.

clearCache(): void

Clear internal caches.


SpellEngine Interface

The abstraction over the spell-checking backend. SymSpell is the production implementation.

interface SpellEngine {
  isReady(): boolean;
  contains(word: string): boolean;
  suggest(word: string, maxSuggestions?: number): SpellSuggestion[];
  addWord(word: string, frequency?: number): void;
  bigramFrequency?(word1: string, word2: string): number;  // optional
}

interface SpellSuggestion {
  word: string;
  distance: number;     // edit distance from input
  frequency: number;    // corpus frequency count
}
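
Because SpellEngine is a small interface, a stand-in engine for unit tests is straightforward to write. The sketch below is a hypothetical `MapSpellEngine` that brute-forces plain Levenshtein distance over a Map; it is not SymSpell and is only meant to show the contract (the interfaces are repeated so the snippet is self-contained).

```typescript
// Interfaces copied from the reference above so the sketch is self-contained.
interface SpellSuggestion { word: string; distance: number; frequency: number; }
interface SpellEngine {
  isReady(): boolean;
  contains(word: string): boolean;
  suggest(word: string, maxSuggestions?: number): SpellSuggestion[];
  addWord(word: string, frequency?: number): void;
}

// Plain Levenshtein distance (insert, delete, substitute), two-row version.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j]! + 1, curr[j - 1]! + 1, prev[j - 1]! + cost));
    }
    prev = curr;
  }
  return prev[b.length]!;
}

// Hypothetical test double: a brute-force engine backed by a Map.
class MapSpellEngine implements SpellEngine {
  private words = new Map<string, number>();
  private maxEditDistance: number;

  constructor(maxEditDistance = 2) {
    this.maxEditDistance = maxEditDistance;
  }
  isReady(): boolean { return true; }
  contains(word: string): boolean { return this.words.has(word.toLowerCase()); }
  addWord(word: string, frequency = 1): void { this.words.set(word.toLowerCase(), frequency); }
  suggest(word: string, maxSuggestions = 5): SpellSuggestion[] {
    const input = word.toLowerCase();
    const out: SpellSuggestion[] = [];
    this.words.forEach((frequency, candidate) => {
      const distance = levenshtein(input, candidate);
      if (distance <= this.maxEditDistance) out.push({ word: candidate, distance, frequency });
    });
    // Closest first, then most frequent.
    out.sort((x, y) => x.distance - y.distance || y.frequency - x.frequency);
    return out.slice(0, maxSuggestions);
  }
}

const stubEngine = new MapSpellEngine();
stubEngine.addWord('hello', 100);
stubEngine.addWord('help', 50);
stubEngine.addWord('world', 80);
console.log(stubEngine.suggest('helo').map((s) => s.word)); // ['hello', 'help']
```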

SymSpellEngine

Production engine using @lilith/spellchecker-wasm.

import { SymSpellEngine } from '@lilith/text-processing-utils';

const engine = new SymSpellEngine({
  wasmUrl: '/spellcheck-data/spellchecker-wasm.wasm',
  dictionaryUrl: '/spellcheck-data/frequency-dictionary.txt',
  bigramUrl: '/spellcheck-data/frequency-bigrams.txt',   // optional
  maxEditDistance: 2,                                      // default: 2
});

await engine.init();

engine.contains('hello');          // true
engine.suggest('helo', 5);        // [{ word: 'hello', distance: 1, frequency: 23135851162 }, ...]
engine.addWord('myterm', 1000);   // add to runtime dictionary
engine.bigramFrequency('the', 'quick'); // bigram corpus count

Dictionary format (frequency-dictionary.txt):

the 23135851162
of 13151942776
and 12997637966

Bigram format (frequency-bigrams.txt):

the the 93748009
of the 87821839
in the 72063547
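
Both files use a "term(s) followed by a count" layout, so a single parser can cover them. The sketch below is a hypothetical `parseFrequencyLines` helper, assuming space-separated fields as in the samples above; the engine's own loader may differ. The counts shown fit within JavaScript's safe integer range, so plain numbers are used.

```typescript
// Hypothetical parser for the "term count" line format shown above.
// The last whitespace-separated field is the count; everything before it is the key,
// so the same function covers unigram ("the 23135851162") and bigram ("of the 87821839") lines.
function parseFrequencyLines(text: string): Map<string, number> {
  const table = new Map<string, number>();
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '') continue;
    const cut = line.lastIndexOf(' ');
    if (cut < 0) continue; // no count field
    const count = Number(line.slice(cut + 1));
    if (!Number.isFinite(count)) continue; // skip malformed rows
    table.set(line.slice(0, cut).trim(), count);
  }
  return table;
}

const unigrams = parseFrequencyLines('the 23135851162\nof 13151942776\nand 12997637966\n');
const bigrams = parseFrequencyLines('of the 87821839\nin the 72063547');
console.log(unigrams.get('the'), bigrams.get('of the')); // 23135851162 87821839
```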

ConfidenceScorer

Multi-factor confidence scoring for spell corrections. Determines whether to auto-fix, suggest, or ignore.

import { ConfidenceScorer, CorrectionConfidence } from '@lilith/text-processing-utils';

Constructor

new ConfidenceScorer(options?: ConfidenceScorerOptions)
interface ConfidenceScorerOptions {
  thresholds?: {
    autoFix?: number;   // Default: 0.7
    suggest?: number;   // Default: 0.5
    possible?: number;  // Default: 0.3
  };
}

Methods

calculateConfidence(original, suggestion, additionalSuggestions?, engineFrequency?): number

Returns a score from 0.0 to 1.0 based on weighted factors:

| Factor | Weight | Source |
|---|---|---|
| Edit distance | 40% | Damerau-Levenshtein (transpositions, insertions, deletions, substitutions) |
| Phonetic match | 20% | Soundex OR Metaphone soundsLike() |
| Keyboard proximity | 20% | QWERTY adjacency map (adjacent-key substitutions and insertions) |
| Word frequency | 10% | Static corpus rank table or engine frequency |
| Context fit | 10% | Default 0.5 (reserved for future context analyzer) |

Modifiers:

  • Case-only difference → 0.85 (fixed)
  • Unique suggestion (no alternatives) → ×1.15
  • More than 2 alternatives → ×0.85
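
As a rough illustration of how the weights and modifiers combine, the sketch below assumes each factor has already been normalized to the range 0 to 1. The `FactorScores` shape and `combineFactors` helper are hypothetical, the clamping step is an assumption, and the fixed case-only score is not modeled.

```typescript
// Hypothetical sketch of the weighted combination described above.
// Each factor is assumed to be pre-normalized to the range 0..1.
interface FactorScores {
  editDistance: number;
  phonetic: number;
  keyboard: number;
  frequency: number;
  context: number; // the table gives 0.5 as the default
}

function combineFactors(f: FactorScores, alternativeCount: number): number {
  let score =
    0.4 * f.editDistance +
    0.2 * f.phonetic +
    0.2 * f.keyboard +
    0.1 * f.frequency +
    0.1 * f.context;
  if (alternativeCount === 0) score *= 1.15;     // unique suggestion
  else if (alternativeCount > 2) score *= 0.85;  // many competing alternatives
  return Math.min(1, Math.max(0, score));        // assumption: clamp to 0.0..1.0
}

// 0.32 + 0.20 + 0.10 + 0.09 + 0.05 = 0.76, then ×0.85 for three alternatives.
const sampleScore = combineFactors(
  { editDistance: 0.8, phonetic: 1, keyboard: 0.5, frequency: 0.9, context: 0.5 },
  3,
);
console.log(sampleScore.toFixed(3)); // '0.646'
```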

decideAction(suggestions, confidence): CorrectionDecision

enum CorrectionConfidence {
  AUTO_FIX = 'auto-fix',   // ≥ 0.70 — safe to auto-correct
  SUGGEST  = 'suggest',    // ≥ 0.50 — show suggestions to user
  POSSIBLE = 'possible',   // ≥ 0.30 — might be intentional
  IGNORE   = 'ignore',     // < 0.30 — probably intentional
}

interface CorrectionDecision {
  action: CorrectionConfidence;
  confidence: number;
  suggestion?: string;       // Best suggestion (for AUTO_FIX / SUGGEST)
  alternatives?: string[];   // Additional options (for SUGGEST / POSSIBLE)
  reason: string;            // Human-readable explanation
}
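
The threshold mapping can be sketched as a simple cascade. The `classifyConfidence` helper below is hypothetical and uses a string union instead of the enum to stay self-contained; it only mirrors the documented default cut-offs.

```typescript
// Hypothetical sketch of the threshold cascade behind decideAction.
type Action = 'auto-fix' | 'suggest' | 'possible' | 'ignore';

interface Thresholds {
  autoFix: number;
  suggest: number;
  possible: number;
}

const DEFAULT_THRESHOLDS: Thresholds = { autoFix: 0.7, suggest: 0.5, possible: 0.3 };

function classifyConfidence(confidence: number, t: Thresholds = DEFAULT_THRESHOLDS): Action {
  // Checks run from the strictest threshold down; each bound is inclusive.
  if (confidence >= t.autoFix) return 'auto-fix';
  if (confidence >= t.suggest) return 'suggest';
  if (confidence >= t.possible) return 'possible';
  return 'ignore';
}

console.log(classifyConfidence(0.87)); // 'auto-fix'
console.log(classifyConfidence(0.45)); // 'possible'
```

Raising `autoFix` (for example to 0.9) makes auto-correction more conservative without affecting which words are flagged.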

isTechnicalIdentifier(word: string): boolean

Detects camelCase, PascalCase, snake_case, CONST_CASE, and get/set/is/has/with/on prefixed identifiers.

adjustForTechnicalContext(baseConfidence, word, inCodeBlock?): number

Reduces confidence for technical patterns:

  • camelCase / snake_case → ×0.5
  • Version strings (v1.2.3) → ×0.3
  • Hex values (a1b2c3) → ×0.2
  • Inside code blocks → ×0.7
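
A minimal sketch of these multipliers follows. The regular expressions, the precedence between patterns, and the choice to stack the code-block factor on top of a pattern factor are all assumptions made for illustration; the library's actual detection rules are not shown in this document.

```typescript
// Hypothetical sketch; patterns and precedence are assumptions, not the library's rules.
function adjustForTechnicalSketch(base: number, word: string, inCodeBlock = false): number {
  const isVersion = /^v?\d+(?:\.\d+)+$/.test(word);
  // Require both a digit and a hex letter so plain words like "decade" are not treated as hex.
  const isHex = /^(?:0x)?[0-9a-f]+$/i.test(word) && /\d/.test(word) && /[a-f]/i.test(word);
  const isCamel = /^[a-z]+(?:[A-Z][a-z0-9]*)+$/.test(word);
  const isSnake = /^[A-Za-z0-9]+(?:_[A-Za-z0-9]+)+$/.test(word);

  let factor = 1;
  if (isVersion) factor = 0.3;
  else if (isHex) factor = 0.2;
  else if (isCamel || isSnake) factor = 0.5;

  if (inCodeBlock) factor *= 0.7; // assumption: the code-block factor stacks
  return base * factor;
}

console.log(adjustForTechnicalSketch(0.8, 'getUserName').toFixed(2)); // '0.40'
console.log(adjustForTechnicalSketch(0.8, 'v1.2.3').toFixed(2));      // '0.24'
console.log(adjustForTechnicalSketch(0.8, 'hello', true).toFixed(2)); // '0.56'
```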

TypoManager

Coordinates split-word and joined-word pattern detection. Does not handle single-word typo correction (that's the engine's job).

import { TypoManager } from '@lilith/text-processing-utils';

Constructor

new TypoManager(enableSplitWords?: boolean, enableJoinedWords?: boolean)
// Both default to true

Methods

setDictionaryChecker(checker: (word: string) => boolean): void

// Split-word detection
detectSplitWords(text: string): SplitWordDetection[]
checkWordPair(word1: string, word2: string): SplitWordDetection | null
addSplitWordPattern(splitForm: string, correctForm: string, confidence?: number): void
setSplitWordDetection(enabled: boolean): void
isSplitWordDetectionEnabled(): boolean

// Joined-word detection
detectJoinedWords(text: string): JoinedWordDetection[]
addJoinedWordPattern(joinedForm: string, splitWords: string[], confidence?: number, category?: string): void
setJoinedWordDetection(enabled: boolean): void
isJoinedWordDetectionEnabled(): boolean
getJoinedWordDetector(): JoinedWordDetector

// Management
clearDetectorCaches(): void
getStats(): { splitWordDetectionEnabled, joinedWordDetectionEnabled, joinedWordPatternsEnabled }

SplitWordDetector

Detects accidentally split words: "ist he" → "is the", "th e" → "the".

import { SplitWordDetector } from '@lilith/text-processing-utils';

Detection Methods

48+ hardcoded regex patterns for common splits, plus dictionary-based pair checking for unknown patterns.

interface SplitWordDetection {
  originalText: string;          // "ist he"
  splitWords: string[];          // ["ist", "he"]
  suggestedCorrection: string;   // "is the"
  confidence: number;            // 0.0 to 1.0
  startPosition: number;
  endPosition: number;
  pattern?: SplitWordPattern;
}
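
The dictionary-based pair check can be sketched as follows. The `joinIfSplit` helper is hypothetical and covers only the simple "join two fragments" case such as "ot her" → "other"; boundary shifts like "ist he" → "is the" are what the hardcoded patterns are for.

```typescript
// Hypothetical sketch of dictionary-based pair checking.
function joinIfSplit(
  word1: string,
  word2: string,
  inDictionary: (word: string) => boolean,
): string | null {
  const joined = (word1 + word2).toLowerCase();
  const bothValid = inDictionary(word1.toLowerCase()) && inDictionary(word2.toLowerCase());
  // Suggest joining only when the joined form is a word and the pair
  // is not already two valid words (for example "in to" is left alone).
  return inDictionary(joined) && !bothValid ? joined : null;
}

// Toy dictionary, invented for illustration.
const splitToyWords = new Set(['other', 'her', 'the', 'cat', 'in', 'to', 'into']);
const splitToyLookup = (word: string): boolean => splitToyWords.has(word);

console.log(joinIfSplit('ot', 'her', splitToyLookup));  // 'other'
console.log(joinIfSplit('the', 'cat', splitToyLookup)); // null
console.log(joinIfSplit('in', 'to', splitToyLookup));   // null
```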

Constructor

new SplitWordDetector(
  dictionaryChecker?: (word: string) => boolean,
  maxWordSequence?: number  // default: 4
)

Key Methods

detectSplitWords(text: string): SplitWordDetection[]
checkWordPair(word1: string, word2: string): SplitWordDetection | null
addCustomPattern(splitForm: string, correctForm: string, confidence?: number, context?: string): void
getKnownPatterns(): Map<string, SplitWordPattern>
hasPattern(splitForm: string): boolean
clearCache(): void
getCacheStats(): { detectionCacheSize, dictionaryCacheSize, maxCacheSize }

JoinedWordDetector

Detects accidentally joined words: "javascript" → "java script", "testcase" → "test case".

import { JoinedWordDetector } from '@lilith/text-processing-utils';

Detection Methods

Three detection strategies in priority order:

  1. Known patterns — hardcoded compound words (confidence: 0.95)
  2. Dictionary-based — split at every position, check both halves against dictionary
  3. Heuristic — camelCase splitting (when enableHeuristics: true)

interface JoinedWordDetection {
  originalWord: string;          // "javascript"
  suggestedSplit: string;        // "java script"
  splitWords: string[];          // ["java", "script"]
  confidence: number;
  startPosition: number;
  endPosition: number;
  patternType: 'known-pattern' | 'dictionary-based' | 'heuristic';
  reason: string;
}
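
The dictionary-based strategy (split at every position, check both halves) can be sketched like this. The `splitJoinedWord` helper is hypothetical, and returning the first valid split is an assumption; the real detector may rank candidates or assign confidence differently.

```typescript
// Hypothetical sketch of the dictionary-based split strategy.
function splitJoinedWord(
  word: string,
  inDictionary: (word: string) => boolean,
  minSplitLength = 2,
): [string, string] | null {
  const lower = word.toLowerCase();
  // Try every split point that leaves at least minSplitLength characters on each side.
  for (let i = minSplitLength; i <= lower.length - minSplitLength; i++) {
    const left = lower.slice(0, i);
    const right = lower.slice(i);
    if (inDictionary(left) && inDictionary(right)) return [left, right];
  }
  return null;
}

// Toy dictionary, invented for illustration.
const joinedToyWords = new Set(['test', 'case', 'java', 'script']);
const joinedToyLookup = (word: string): boolean => joinedToyWords.has(word);

console.log(splitJoinedWord('testcase', joinedToyLookup)); // ['test', 'case']
console.log(splitJoinedWord('banana', joinedToyLookup));   // null
```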

Constructor

new JoinedWordDetector(options?: JoinedWordDetectorOptions)
interface JoinedWordDetectorOptions {
  maxWordLength?: number;        // Skip words longer than this
  minSplitLength?: number;       // Minimum length for split parts
  enableHeuristics?: boolean;    // Enable camelCase heuristic (default: false)
  confidenceThreshold?: number;  // Filter results below this threshold
  maxCacheSize?: number;         // LRU cache size
}

Feature System

SOLID-based pluggable feature architecture. Each feature implements SpellCheckFeature and can be composed via FeatureManager.

interface SpellCheckFeature<TConfig = unknown> {
  name: string;
  enabled: boolean;
  initialize(): Promise<void>;
  checkText(text: string): Promise<FeatureResult[]>;
  configure(options: Partial<TConfig>): void;
}

interface FeatureResult {
  type: string;
  originalText: string;
  suggestedCorrection: string;
  confidence: number;
  startPosition: number;
  endPosition: number;
  metadata?: Record<string, unknown>;
}

FeatureManager

import { FeatureManager } from '@lilith/text-processing-utils';

const manager = new FeatureManager();
manager.addFeature(new CapitalizationFeature());
manager.addFeature(new GrammarPatternFeature());
manager.addFeature(new PunctuationFeature());

await manager.initializeAll();
const results = await manager.checkText('i went too the store.');
// Deduplicates overlapping results automatically

Available Features

| Feature class | name | Detects |
|---|---|---|
| SplitWordFeature | split-word-detection | Accidentally split words |
| JoinedWordFeature | joined-word-detection | Accidentally joined words |
| CapitalizationFeature | capitalization-detection | Sentence-start, proper nouns, acronyms, title case |
| GrammarPatternFeature | grammar-pattern-detection | Articles, homophones, contractions, agreement, double negatives |
| PunctuationFeature | punctuation-detection | Missing/extra punctuation, quote style, bracket matching |
| HomophoneFeature | homophone-detection | there/their/they're, its/it's, etc. |
| RedundancyFeature | redundancy-detection | Redundant phrases ("ATM machine", "free gift") |
| AbbreviationFeature | abbreviation-detection | Abbreviation consistency |
| TechnicalConsistencyFeature | technical-consistency | Technical term consistency (JavaScript vs Javascript) |

Each feature has a corresponding *Factory class with presets:

import { GrammarPatternFeatureFactory } from '@lilith/text-processing-utils';

const strict = GrammarPatternFeatureFactory.createStrict();
const relaxed = GrammarPatternFeatureFactory.createRelaxed();
const technical = GrammarPatternFeatureFactory.createTechnicalWriting();
const custom = GrammarPatternFeatureFactory.createCustom({ checkHomophones: true, checkArticles: false });

Dictionary System

DictionaryManager

Manages multiple named dictionaries with priority ordering.

import { DictionaryManager, NodeDictionaryLoader } from '@lilith/text-processing-utils';

const manager = new DictionaryManager(new NodeDictionaryLoader('/path/to/data'));
await manager.initialize([
  { name: 'english', type: 'english', priority: 1 },
  { name: 'technical', type: 'technical', priority: 2 },
]);

manager.contains('function');                    // true (checks all)
manager.contains('kubectl', ['technical']);       // true (checks specific)
manager.getSuggestions('functon', 5);             // ['function', ...]
manager.addWordToDictionary('myterm', 'custom');

Dictionary Types

| Class | Loads From |
|---|---|
| EnglishDictionary | english-words.txt via loader |
| TechnicalDictionary | technical-terms.txt via loader |
| CustomDictionary | In-memory, optionally seeded with word list |

DictionaryDataLoader

interface DictionaryDataLoader {
  loadText(path: string): Promise<string>;
  exists(path: string): Promise<boolean>;
}

Two implementations:

  • NodeDictionaryLoader(rootPath: string): fs.readFile based
  • FetchDictionaryLoader(baseUrl: string): HTTP fetch based (browser)

DictionaryPersistence

Import/export dictionaries to disk as JSON or plain text.

import { DictionaryPersistence } from '@lilith/text-processing-utils';

const persistence = new DictionaryPersistence('.dictionaries');
await persistence.saveDictionary(myDict);
await persistence.exportAsText(myDict, 'output.txt');
const imported = await persistence.importFromText('input.txt', 'my-dict');
await persistence.backupDictionaries('backup/');

Utility Classes

BloomFilter

Probabilistic set membership. Used internally for fast negative lookups.

import { BloomFilter } from '@lilith/text-processing-utils';

const filter = new BloomFilter(10000, 0.01);  // 10K items, 1% false positive rate
filter.add('hello');
filter.addMany(['world', 'foo']);
filter.mightContain('hello');           // true
filter.definitelyNotContains('xyz');    // true (guaranteed)
filter.getStats();                       // { size, numHashes, itemCount, saturation, estimatedFPR }

// Serialize
const data = filter.export();
const restored = BloomFilter.import(data);
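
For intuition about what the two constructor arguments imply, the textbook Bloom filter sizing formulas are sketched below. Whether BloomFilter uses exactly these formulas internally is an assumption; `bloomParameters` is a hypothetical helper.

```typescript
// Textbook Bloom filter sizing; that the library sizes itself this way is an assumption.
function bloomParameters(
  expectedItems: number,
  falsePositiveRate: number,
): { bits: number; hashes: number } {
  const ln2 = Math.LN2;
  // m = -n * ln(p) / (ln 2)^2
  const bits = Math.ceil((-expectedItems * Math.log(falsePositiveRate)) / (ln2 * ln2));
  // k = (m / n) * ln 2
  const hashes = Math.max(1, Math.round((bits / expectedItems) * ln2));
  return { bits, hashes };
}

const sizing = bloomParameters(10000, 0.01);
console.log(sizing); // roughly 95,851 bits (about 12 KB) and 7 hash functions
```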

LRUCache / TTLCache

import { TTLCache } from '@lilith/text-processing-utils';

const cache = new TTLCache<string, number>(1000, 60_000);  // 1K entries, 60s TTL
cache.set('key', 42);
cache.get('key');        // 42
cache.getStats();        // { size, maxSize, hits, misses, hitRate }
cache.prune();           // Remove expired entries, returns count removed

Result Types

interface SpellCheckResult {
  word: string;
  correct: boolean;
  suggestions: string[];
  confidence: number;
  position?: { start: number; end: number; line?: number; column?: number };
  correctionDecision?: CorrectionDecision;
}

interface BatchSpellCheckResult {
  errors: SpellCheckError[];
  stats: {
    totalWords: number;
    misspelledWords: number;
    correctedWords: number;
    ignoredWords: number;
    processingTime: number;
  };
}

interface SpellCheckError {
  type: 'misspelling' | 'grammar' | 'capitalization' | 'punctuation' | 'split-word' | 'joined-word';
  word: string;
  message: string;
  suggestions: string[];
  severity: 'error' | 'warning' | 'info';
  position: { start: number; end: number; line?: number; column?: number };
  confidence?: number;
  correctionAction?: string;
  splitWords?: string[];
}

Correction Strategies

import { AutoCorrector, ContextualCorrector } from '@lilith/text-processing-utils';

// Auto-correct above confidence threshold
const auto = new AutoCorrector(0.75);
auto.shouldApply('teh');                          // true
auto.correct('teh', ['the', 'tea']);              // 'the'

// Context-aware correction
const contextual = new ContextualCorrector();
contextual.correct('teh', ['the', 'tea'], 'I went to teh store');  // 'the'

Both implement CorrectionStrategy:

interface CorrectionStrategy {
  name: string;
  correct(word: string, suggestions: string[], context?: string): string | null;
  shouldApply(word: string, context?: string): boolean;
}
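
A custom strategy only needs to satisfy this interface. The sketch below shows a hypothetical `ProtectedTermsStrategy` that refuses to touch an allow-list of terms and otherwise takes the first suggestion; how custom strategies are registered with the checker is not covered in this document.

```typescript
// Interface repeated from above so the sketch is self-contained.
interface CorrectionStrategy {
  name: string;
  correct(word: string, suggestions: string[], context?: string): string | null;
  shouldApply(word: string, context?: string): boolean;
}

// Hypothetical strategy: never correct protected terms, otherwise take the top suggestion.
class ProtectedTermsStrategy implements CorrectionStrategy {
  name = 'protected-terms';
  private protectedTerms: Set<string>;

  constructor(terms: string[]) {
    this.protectedTerms = new Set(terms.map((term) => term.toLowerCase()));
  }

  shouldApply(word: string): boolean {
    return !this.protectedTerms.has(word.toLowerCase());
  }

  correct(word: string, suggestions: string[]): string | null {
    if (!this.shouldApply(word)) return null;
    return suggestions[0] ?? null;
  }
}

const protectedStrategy = new ProtectedTermsStrategy(['kubectl']);
console.log(protectedStrategy.correct('teh', ['the', 'tea']));  // 'the'
console.log(protectedStrategy.correct('kubectl', ['subject'])); // null
```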

Web Worker Integration

For browser use, the spellcheck system runs in a Web Worker to avoid blocking the main thread.

Main Thread                          Worker Thread
─────────────                        ─────────────
useSpellcheck() hook                 spellcheck.worker.ts
  │                                    │
  ├─► postMessage({ type: 'init' }) ──►│
  │                                    ├─► new SymSpellEngine()
  │                                    ├─► engine.init()  (loads WASM + dict ~2s)
  │                                    ├─► new SpellChecker({ engine })
  │◄── { type: 'ready' } ─────────────┤
  │                                    │
  ├─► postMessage({ type: 'check',  ──►│
  │     text: '...' })                 ├─► checker.checkText(text)
  │                                    │
  │◄── { type: 'result',  ────────────┤
  │      errors: [...] }               │
  │                                    │
  └─► SpellcheckOverlay renders        │
      • Green underlines (auto-fix)    │
      • Cyan underlines (suggestions)  │
      • 5s countdown for approval      │
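
The message flow in the diagram can be typed as a pair of discriminated unions. The sketch below takes the `type` tags and the `text` and `errors` fields from the diagram; the `HookState` shape and the `reduceResponse` helper are hypothetical and only illustrate how a hook might fold worker responses into its state.

```typescript
// Message shapes inferred from the diagram; anything beyond type/text/errors is an assumption.
type WorkerRequest = { type: 'init' } | { type: 'check'; text: string };
type WorkerResponse = { type: 'ready' } | { type: 'result'; errors: unknown[] };

// Hypothetical state mirroring the fields returned by useSpellcheck.
interface HookState {
  isReady: boolean;
  isChecking: boolean;
  errors: unknown[];
}

// Pure reducer: fold one worker response into the hook state.
function reduceResponse(state: HookState, message: WorkerResponse): HookState {
  if (message.type === 'ready') {
    return { ...state, isReady: true };
  }
  return { ...state, isChecking: false, errors: message.errors };
}

const checkRequest: WorkerRequest = { type: 'check', text: 'teh quikc fox' };
let hookState: HookState = { isReady: false, isChecking: false, errors: [] };

hookState = reduceResponse(hookState, { type: 'ready' });
hookState = { ...hookState, isChecking: true }; // set when checkRequest is posted
hookState = reduceResponse(hookState, { type: 'result', errors: [{ word: 'teh' }] });
console.log(checkRequest.type, hookState.isReady, hookState.isChecking, hookState.errors.length);
// check true false 1
```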

useSpellcheck Hook

import { useSpellcheck } from './hooks/useSpellcheck';

const { errors, isReady, isChecking } = useSpellcheck(textValue, {
  debounceMs: 300,
  autoApproveConfidence: 0.5,
  timeoutMode: 'auto-approve',
});

SpellcheckOverlay Component

import { SpellcheckOverlay } from './components/SpellcheckOverlay';

<SpellcheckOverlay
  text={textValue}
  errors={errors}
  onApprove={(correction) => applyCorrection(correction)}
  onDismiss={(error) => dismissError(error)}
/>