# @lilith/text-processing-utils

High-performance text processing utilities for deterministic text manipulation.

## Installation

```bash
pnpm add @lilith/text-processing-utils
```

## Modules

| Module | Classes | Purpose |
|--------|---------|---------|
| [Spellcheck](#spellcheck) | `SpellChecker`, `SymSpellEngine`, `ConfidenceScorer` | Engine-based spell checking with confidence scoring |
| [Extractors](#extractors) | `UrlExtractor`, `PathExtractor`, `CodeBlockExtractor` | Extract structured data from text |
| [Sanitizers](#sanitizers) | `AnsiStripper`, `HtmlStripper`, `MarkdownStripper`, `ControlCharStripper` | Strip formatting and control characters |
| [Splitters](#splitters) | `SentenceSplitter`, `ChunkSplitter` | Split text into sentences or sized chunks |
| [Validators](#validators) | `EmailValidator`, `JSONValidator` | Validate text formats |
| [Transformers](#transformers) | `CaseTransformer`, `Redactor`, `TemplateEngine`, `Truncator` | Transform, redact, and template text |
| [Normalizers](#normalizers) | `UnicodeNormalizer`, `WhitespaceNormalizer`, `TerminalNormalizer` | Normalize text representations |
| [Comparators](#comparators) | `DiffGenerator`, `FuzzyMatcher`, `SimilarityScorer` | Compare and diff text |
| [Encoders](#encoders) | `Base64Encoder`, `StreamingEncoder`, `TerminalEncoder` | Encode text for transport |
| [Metrics](#metrics) | `TextAnalyzer`, `ReadabilityScorer`, `CodeMetricsAnalyzer` | Analyze text statistics and readability |
| [Performance](#performance) | `withTimeout`, `BatchProcessor`, `StreamProcessor`, `Throttler`, `Debouncer` | Async control flow utilities |
| [Errors](#errors) | `ErrorHandler`, `TextProcessingError` | Structured error handling |
| [Cache](#cache) | `RegexCache` | Compiled regex caching |

---

## Spellcheck

Engine-first spell checking with multi-factor confidence scoring, bigram context rescoring, and pattern-based split/joined word detection.

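As a rough illustration of what bigram context rescoring means in general, the sketch below blends each candidate's base score with how often it follows the previous word. It is not this package's implementation: `Candidate`, `BigramCounts`, `rescoreWithContext`, the `contextWeight` default, and the sample counts are all hypothetical and exist only for this example.

```typescript
// Hypothetical types for illustration only; not part of the package API.
interface Candidate {
  word: string;
  score: number; // base confidence in [0, 1]
}

// Maps "previous current" word pairs to made-up occurrence counts.
type BigramCounts = Map<string, number>;

// Blend each candidate's base score with its normalized bigram count.
function rescoreWithContext(
  previousWord: string,
  candidates: Candidate[],
  bigrams: BigramCounts,
  contextWeight = 0.3, // assumed weight, chosen arbitrarily
): Candidate[] {
  const counts = candidates.map(
    (c) => bigrams.get(`${previousWord} ${c.word}`) ?? 0,
  );
  // Use at least 1 as the divisor so an all-zero context is safe.
  const maxCount = Math.max(1, ...counts);

  return candidates
    .map((c, i) => ({
      word: c.word,
      score:
        (1 - contextWeight) * c.score +
        contextWeight * (counts[i] / maxCount),
    }))
    .sort((a, b) => b.score - a.score);
}

const bigrams: BigramCounts = new Map([
  ['the best', 900],
  ['the bets', 3],
]);

// "bets" starts slightly ahead, but context after "the" favors "best".
const ranked = rescoreWithContext(
  'the',
  [
    { word: 'bets', score: 0.62 },
    { word: 'best', score: 0.6 },
  ],
  bigrams,
);

console.log(ranked.map((c) => c.word)); // [ 'best', 'bets' ]
```

A linear blend keeps the example easy to follow; a real engine may weight context quite differently, so see the API reference for the actual behavior.
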
Full API reference: **[docs/spellcheck.md](docs/spellcheck.md)**

```typescript
import { SpellChecker, SymSpellEngine } from '@lilith/text-processing-utils';

const engine = new SymSpellEngine({
  wasmUrl: '/spellcheck-data/spellchecker-wasm.wasm',
  dictionaryUrl: '/spellcheck-data/frequency-dictionary.txt',
  bigramUrl: '/spellcheck-data/frequency-bigrams.txt',
});
await engine.init();

const checker = new SpellChecker({ engine, autoCorrect: true });
await checker.initialize();

// Single word
const result = await checker.check('recieve');
// { word: 'recieve', correct: false, suggestions: ['receive', ...], confidence: 0.87 }

// Auto-correct (only high-confidence fixes applied)
const fixed = await checker.fix('teh quikc brwon fox');
// 'the quick brown fox'

// Full diagnostic with positions, severities, split/joined word detection
const report = await checker.checkText('teh quikc fox ist he best');
// { errors: [...], stats: { totalWords: 6, misspelledWords: 2, ... } }
```

### Feature System

9 pluggable detectors for grammar, capitalization, punctuation, homophones, redundancy, and more:

```typescript
import { FeatureManager, GrammarPatternFeature, CapitalizationFeature } from '@lilith/text-processing-utils';

const manager = new FeatureManager();
manager.addFeature(new GrammarPatternFeature());
manager.addFeature(new CapitalizationFeature());
await manager.initializeAll();

const results = await manager.checkText('i went too the store.');
```

---

## Extractors

### UrlExtractor

```typescript
import { UrlExtractor } from '@lilith/text-processing-utils';

const extractor = new UrlExtractor();
const urls = extractor.extract('Check out https://example.com and http://test.org');
// ['https://example.com', 'http://test.org']
```

### PathExtractor

```typescript
import { PathExtractor } from '@lilith/text-processing-utils';

const extractor = new PathExtractor();
const paths = extractor.extract('Open /home/user/file.txt or C:\\Users\\file.txt');
```

### CodeBlockExtractor

```typescript
import { CodeBlockExtractor } from '@lilith/text-processing-utils';

const extractor = new CodeBlockExtractor();
const blocks = extractor.extract(markdown);
// [{ language: 'typescript', code: '...' }]
```

---

## Sanitizers

### AnsiStripper

```typescript
import { AnsiStripper } from '@lilith/text-processing-utils';

const stripper = new AnsiStripper();
const clean = stripper.strip('\x1b[31mRed text\x1b[0m');
// 'Red text'
```

### HtmlStripper

```typescript
import { HtmlStripper } from '@lilith/text-processing-utils';

const stripper = new HtmlStripper();
const clean = stripper.strip('<p>Hello world</p>');
// 'Hello world'
```

### MarkdownStripper

```typescript
import { MarkdownStripper } from '@lilith/text-processing-utils';

const stripper = new MarkdownStripper();
const clean = stripper.strip('# Hello **world**');
// 'Hello world'
```

### ControlCharStripper

```typescript
import { ControlCharStripper } from '@lilith/text-processing-utils';

const stripper = new ControlCharStripper();
const clean = stripper.strip('Hello\x00World\x01');
// 'HelloWorld'
```

### SanitizerFactory

```typescript
import { SanitizerFactory } from '@lilith/text-processing-utils';

const sanitizer = SanitizerFactory.create('html');
```

---

## Splitters

### SentenceSplitter

```typescript
import { SentenceSplitter } from '@lilith/text-processing-utils';

const splitter = new SentenceSplitter();
const sentences = splitter.split('Hello world. How are you? Fine.');
// ['Hello world.', 'How are you?', 'Fine.']
```

### ChunkSplitter

```typescript
import { ChunkSplitter } from '@lilith/text-processing-utils';

const splitter = new ChunkSplitter({
  maxChunkSize: 1000,
  overlap: 100,
  splitOn: 'sentence',
});
const chunks = splitter.split(longText);
```

---

## Validators

### EmailValidator

```typescript
import { EmailValidator } from '@lilith/text-processing-utils';

const validator = new EmailValidator();
validator.validate('user@example.com'); // true
validator.validate('invalid-email'); // false
```

### JSONValidator

```typescript
import { JSONValidator } from '@lilith/text-processing-utils';

const validator = new JSONValidator();
validator.validate('{"key": "value"}'); // true
validator.validate('{invalid}'); // false
const json = validator.parse(text); // parsed object or null
```

---

## Transformers

### CaseTransformer

```typescript
import { CaseTransformer } from '@lilith/text-processing-utils';

const transformer = new CaseTransformer();
transformer.toTitleCase('hello world'); // 'Hello World'
transformer.toCamelCase('hello world'); // 'helloWorld'
transformer.toSnakeCase('helloWorld'); // 'hello_world'
transformer.toKebabCase('helloWorld'); // 'hello-world'
```

### Redactor

```typescript
import { Redactor } from '@lilith/text-processing-utils';

const redactor = new Redactor({
  patterns: {
    email: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g,
    phone: /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g,
  },
  replacement: '[REDACTED]',
});
const clean = redactor.redact('Email me at user@example.com');
// 'Email me at [REDACTED]'
```

### TemplateEngine

```typescript
import { TemplateEngine } from '@lilith/text-processing-utils';

const engine = new TemplateEngine();
const result = engine.render('Hello {{name}}!', { name: 'World' });
// 'Hello World!'
```

### Truncator

```typescript
import { Truncator } from '@lilith/text-processing-utils';

const truncator = new Truncator();
truncator.truncate('Hello world', 8); // 'Hello...'
```

---

## Normalizers

### UnicodeNormalizer

```typescript
import { UnicodeNormalizer } from '@lilith/text-processing-utils';

const normalizer = new UnicodeNormalizer();
const normalized = normalizer.normalize('caf\u00e9'); // NFC normalization
```

### WhitespaceNormalizer

```typescript
import { WhitespaceNormalizer } from '@lilith/text-processing-utils';

const normalizer = new WhitespaceNormalizer();
const clean = normalizer.normalize('hello world\t\n');
```

### TerminalNormalizer

```typescript
import { TerminalNormalizer } from '@lilith/text-processing-utils';

const normalizer = new TerminalNormalizer();
const clean = normalizer.normalize(terminalOutput);
```

---

## Comparators

### FuzzyMatcher

```typescript
import { FuzzyMatcher } from '@lilith/text-processing-utils';

const matcher = new FuzzyMatcher();
const matches = matcher.match('hello', ['helo', 'world', 'help']);
```

### SimilarityScorer

```typescript
import { SimilarityScorer } from '@lilith/text-processing-utils';

const scorer = new SimilarityScorer();
const score = scorer.score('hello', 'helo'); // 0.0 - 1.0
```

### DiffGenerator

```typescript
import { DiffGenerator } from '@lilith/text-processing-utils';

const diff = new DiffGenerator();
const changes = diff.generate('hello world', 'hello there');
```

---

## Encoders

### Base64Encoder

```typescript
import { Base64Encoder } from '@lilith/text-processing-utils';

const encoder = new Base64Encoder();
const encoded = encoder.encode('Hello World');
const decoded = encoder.decode(encoded);
```

### StreamingEncoder

```typescript
import { StreamingEncoder } from '@lilith/text-processing-utils';

const encoder = new StreamingEncoder();
```

### TerminalEncoder

```typescript
import { TerminalEncoder } from '@lilith/text-processing-utils';

const encoder = new TerminalEncoder();
const ansi = encoder.encode('Hello', { color: 'red', bold: true });
```

---

## Metrics

### TextAnalyzer

```typescript
import { TextAnalyzer } from '@lilith/text-processing-utils';

const analyzer = new TextAnalyzer();
const analysis = analyzer.analyze(text);
// {
//   statistics: { characters, words, sentences, paragraphs, lines, ... },
//   averages: { wordLength, sentenceLength, paragraphLength, wordsPerLine },
//   complexity: { uniqueWords, lexicalDiversity, vocabularyRichness, typeTokenRatio },
//   frequency: { mostCommonWords, mostCommonBigrams, mostCommonTrigrams },
//   patterns: { hasNumbers, hasUrls, hasEmails, hasCamelCase, ... },
// }
```

### ReadabilityScorer

```typescript
import { ReadabilityScorer } from '@lilith/text-processing-utils';

const scorer = new ReadabilityScorer();
const scores = scorer.score(text);
// { fleschReadingEase, fleschKincaidGrade, colemanLiauIndex, ... }
```

### CodeMetricsAnalyzer

```typescript
import { CodeMetricsAnalyzer } from '@lilith/text-processing-utils';

const analyzer = new CodeMetricsAnalyzer();
const metrics = analyzer.analyze(sourceCode);
// { linesOfCode, cyclomaticComplexity, halstead, maintainabilityIndex }
```

---

## Performance

### withTimeout

```typescript
import { withTimeout, TimeoutError } from '@lilith/text-processing-utils';

const result = await withTimeout(slowOperation(), 5000);
```

### BatchProcessor

```typescript
import { BatchProcessor } from '@lilith/text-processing-utils';

const processor = new BatchProcessor({ batchSize: 100 });
const results = await processor.process(items, async (batch) => {
  return batch.map(transform);
});
```

### Throttler / Debouncer

```typescript
import { Throttler, Debouncer } from '@lilith/text-processing-utils';

const throttled = new Throttler(fn, 1000);
const debounced = new Debouncer(fn, 300);
```

---

## Errors

### ErrorHandler

```typescript
import { ErrorHandler } from '@lilith/text-processing-utils';

const handler = new ErrorHandler({ onError: (err) => console.error(err) });
handler.wrap(() => riskyOperation());
```

---

## Cache

### RegexCache

```typescript
import { RegexCache } from '@lilith/text-processing-utils';

const cache = new RegexCache();
const regex = cache.get('\\b\\w+\\b', 'gi');
// Returns cached compiled regex on subsequent calls
```

---

## CLI

```bash
npx spellcheck-cli "teh quick brwon fox"
npx spellcheck-cli --file document.txt
npx spellcheck-cli --fix "teh quick fox"
```

---

## License

MIT