# @lilith/text-processing-utils
High-performance text processing utilities for deterministic text manipulation.
## Installation
```bash
pnpm add @lilith/text-processing-utils
```
## Modules
| Module | Classes | Purpose |
|--------|---------|---------|
| [Spellcheck](#spellcheck) | `SpellChecker`, `SymSpellEngine`, `ConfidenceScorer` | Engine-based spell checking with confidence scoring |
| [Extractors](#extractors) | `UrlExtractor`, `PathExtractor`, `CodeBlockExtractor` | Extract structured data from text |
| [Sanitizers](#sanitizers) | `AnsiStripper`, `HtmlStripper`, `MarkdownStripper`, `ControlCharStripper` | Strip formatting and control characters |
| [Splitters](#splitters) | `SentenceSplitter`, `ChunkSplitter` | Split text into sentences or sized chunks |
| [Validators](#validators) | `EmailValidator`, `JSONValidator` | Validate text formats |
| [Transformers](#transformers) | `CaseTransformer`, `Redactor`, `TemplateEngine`, `Truncator` | Transform, redact, and template text |
| [Normalizers](#normalizers) | `UnicodeNormalizer`, `WhitespaceNormalizer`, `TerminalNormalizer` | Normalize text representations |
| [Comparators](#comparators) | `DiffGenerator`, `FuzzyMatcher`, `SimilarityScorer` | Compare and diff text |
| [Encoders](#encoders) | `Base64Encoder`, `StreamingEncoder`, `TerminalEncoder` | Encode text for transport |
| [Metrics](#metrics) | `TextAnalyzer`, `ReadabilityScorer`, `CodeMetricsAnalyzer` | Analyze text statistics and readability |
| [Performance](#performance) | `withTimeout`, `BatchProcessor`, `StreamProcessor`, `Throttler`, `Debouncer` | Async control flow utilities |
| [Errors](#errors) | `ErrorHandler`, `TextProcessingError` | Structured error handling |
| [Cache](#cache) | `RegexCache` | Compiled regex caching |
---
## Spellcheck
Engine-first spell checking with multi-factor confidence scoring, bigram context rescoring, and pattern-based split/joined word detection.
Full API reference: **[docs/spellcheck.md](docs/spellcheck.md)**
```typescript
import { SpellChecker, SymSpellEngine } from '@lilith/text-processing-utils';
const engine = new SymSpellEngine({
  wasmUrl: '/spellcheck-data/spellchecker-wasm.wasm',
  dictionaryUrl: '/spellcheck-data/frequency-dictionary.txt',
  bigramUrl: '/spellcheck-data/frequency-bigrams.txt',
});
await engine.init();
const checker = new SpellChecker({ engine, autoCorrect: true });
await checker.initialize();
// Single word
const result = await checker.check('recieve');
// { word: 'recieve', correct: false, suggestions: ['receive', ...], confidence: 0.87 }
// Auto-correct (only high-confidence fixes applied)
const fixed = await checker.fix('teh quikc brwon fox');
// 'the quick brown fox'
// Full diagnostic with positions, severities, split/joined word detection
const report = await checker.checkText('teh quikc fox ist he best');
// { errors: [...], stats: { totalWords: 6, misspelledWords: 2, ... } }
```
### Feature System
The feature system provides nine pluggable detectors covering grammar, capitalization, punctuation, homophones, redundancy, and more. Register only the ones you need:
```typescript
import { FeatureManager, GrammarPatternFeature, CapitalizationFeature } from '@lilith/text-processing-utils';
const manager = new FeatureManager();
manager.addFeature(new GrammarPatternFeature());
manager.addFeature(new CapitalizationFeature());
await manager.initializeAll();
const results = await manager.checkText('i went too the store.');
```
---
## Extractors
### UrlExtractor
```typescript
import { UrlExtractor } from '@lilith/text-processing-utils';
const extractor = new UrlExtractor();
const urls = extractor.extract('Check out https://example.com and http://test.org');
// ['https://example.com', 'http://test.org']
```
### PathExtractor
```typescript
import { PathExtractor } from '@lilith/text-processing-utils';
const extractor = new PathExtractor();
const paths = extractor.extract('Open /home/user/file.txt or C:\\Users\\file.txt');
```
### CodeBlockExtractor
```typescript
import { CodeBlockExtractor } from '@lilith/text-processing-utils';
const extractor = new CodeBlockExtractor();
const blocks = extractor.extract(markdown);
// [{ language: 'typescript', code: '...' }]
```
---
## Sanitizers
### AnsiStripper
```typescript
import { AnsiStripper } from '@lilith/text-processing-utils';
const stripper = new AnsiStripper();
const clean = stripper.strip('\x1b[31mRed text\x1b[0m');
// 'Red text'
```
### HtmlStripper
```typescript
import { HtmlStripper } from '@lilith/text-processing-utils';
const stripper = new HtmlStripper();
const clean = stripper.strip('<p>Hello <b>world</b></p>');
// 'Hello world'
```
### MarkdownStripper
```typescript
import { MarkdownStripper } from '@lilith/text-processing-utils';
const stripper = new MarkdownStripper();
const clean = stripper.strip('# Hello **world**');
// 'Hello world'
```
### ControlCharStripper
```typescript
import { ControlCharStripper } from '@lilith/text-processing-utils';
const stripper = new ControlCharStripper();
const clean = stripper.strip('Hello\x00World\x01');
// 'HelloWorld'
```
### SanitizerFactory
```typescript
import { SanitizerFactory } from '@lilith/text-processing-utils';
const sanitizer = SanitizerFactory.create('html');
```
---
## Splitters
### SentenceSplitter
```typescript
import { SentenceSplitter } from '@lilith/text-processing-utils';
const splitter = new SentenceSplitter();
const sentences = splitter.split('Hello world. How are you? Fine.');
// ['Hello world.', 'How are you?', 'Fine.']
```
### ChunkSplitter
```typescript
import { ChunkSplitter } from '@lilith/text-processing-utils';
const splitter = new ChunkSplitter({
  maxChunkSize: 1000,
  overlap: 100,
  splitOn: 'sentence',
});
const chunks = splitter.split(longText);
```
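To illustrate how `maxChunkSize` and `overlap` interact, here is a minimal standalone sketch of a sliding window with overlap. It is not the library's implementation: `chunkWithOverlap` is a hypothetical helper, it splits on raw characters rather than sentence boundaries, and it assumes both sizes are measured in characters.

```typescript
// Hypothetical helper for illustration; not part of @lilith/text-processing-utils.
function chunkWithOverlap(text: string, maxChunkSize: number, overlap: number): string[] {
  if (maxChunkSize <= 0) {
    throw new RangeError('maxChunkSize must be positive');
  }
  if (overlap < 0 || overlap >= maxChunkSize) {
    throw new RangeError('overlap must be in [0, maxChunkSize)');
  }

  const chunks: string[] = [];
  const step = maxChunkSize - overlap; // how far the window advances each time

  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChunkSize));
    // Stop once a window has reached the end of the text.
    if (start + maxChunkSize >= text.length) break;
  }
  return chunks;
}

const sample = chunkWithOverlap('abcdefghij', 4, 1);
// ['abcd', 'defg', 'ghij'] (each chunk repeats the last character of the previous one)
```

Each chunk starts `maxChunkSize - overlap` characters after the previous one, which is why the overlap must be smaller than the chunk size.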
---
## Validators
### EmailValidator
```typescript
import { EmailValidator } from '@lilith/text-processing-utils';
const validator = new EmailValidator();
validator.validate('user@example.com'); // true
validator.validate('invalid-email'); // false
```
### JSONValidator
```typescript
import { JSONValidator } from '@lilith/text-processing-utils';
const validator = new JSONValidator();
validator.validate('{"key": "value"}'); // true
validator.validate('{invalid}'); // false
const json = validator.parse(text); // parsed object or null
```
---
## Transformers
### CaseTransformer
```typescript
import { CaseTransformer } from '@lilith/text-processing-utils';
const transformer = new CaseTransformer();
transformer.toTitleCase('hello world'); // 'Hello World'
transformer.toCamelCase('hello world'); // 'helloWorld'
transformer.toSnakeCase('helloWorld'); // 'hello_world'
transformer.toKebabCase('helloWorld'); // 'hello-world'
```
### Redactor
```typescript
import { Redactor } from '@lilith/text-processing-utils';
const redactor = new Redactor({
  patterns: {
    email: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g,
    phone: /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g,
  },
  replacement: '[REDACTED]',
});
const clean = redactor.redact('Email me at user@example.com');
// 'Email me at [REDACTED]'
```
### TemplateEngine
```typescript
import { TemplateEngine } from '@lilith/text-processing-utils';
const engine = new TemplateEngine();
const result = engine.render('Hello {{name}}!', { name: 'World' });
// 'Hello World!'
```
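For readers curious about how `{{name}}` placeholders can be substituted, the following is a rough standalone sketch using a single regular expression. `renderTemplate` is a hypothetical function, not the library's `TemplateEngine`, and the choice to leave unknown placeholders untouched is an assumption; the README does not say how the real engine treats missing keys.

```typescript
// Hypothetical stand-in for illustration; not the library's TemplateEngine.
function renderTemplate(template: string, values: Record<string, string>): string {
  // Match {{ key }} with optional whitespace inside the braces.
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (placeholder: string, key: string) =>
    // Assumption: unknown keys are left as-is rather than replaced with ''.
    Object.prototype.hasOwnProperty.call(values, key) ? String(values[key]) : placeholder,
  );
}

renderTemplate('Hello {{name}}!', { name: 'World' }); // 'Hello World!'
renderTemplate('Hi {{ missing }}', {});               // 'Hi {{ missing }}'
```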
### Truncator
```typescript
import { Truncator } from '@lilith/text-processing-utils';
const truncator = new Truncator();
truncator.truncate('Hello world', 8); // 'Hello...'
```
---
## Normalizers
### UnicodeNormalizer
```typescript
import { UnicodeNormalizer } from '@lilith/text-processing-utils';
const normalizer = new UnicodeNormalizer();
const normalized = normalizer.normalize('cafe\u0301'); // NFC normalization: 'e' + U+0301 becomes the single code point U+00E9
```
### WhitespaceNormalizer
```typescript
import { WhitespaceNormalizer } from '@lilith/text-processing-utils';
const normalizer = new WhitespaceNormalizer();
const clean = normalizer.normalize('hello world\t\n');
```
### TerminalNormalizer
```typescript
import { TerminalNormalizer } from '@lilith/text-processing-utils';
const normalizer = new TerminalNormalizer();
const clean = normalizer.normalize(terminalOutput);
```
---
## Comparators
### FuzzyMatcher
```typescript
import { FuzzyMatcher } from '@lilith/text-processing-utils';
const matcher = new FuzzyMatcher();
const matches = matcher.match('hello', ['helo', 'world', 'help']);
```
### SimilarityScorer
```typescript
import { SimilarityScorer } from '@lilith/text-processing-utils';
const scorer = new SimilarityScorer();
const score = scorer.score('hello', 'helo'); // a similarity score between 0.0 and 1.0
```
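The README does not state which metric `SimilarityScorer` uses. As one common way to obtain a score in the 0.0 to 1.0 range, the sketch below normalizes Levenshtein edit distance by the longer string's length. Both `levenshtein` and `similarity` are hypothetical names for illustration, and the real scorer may use a different algorithm.

```typescript
// Hypothetical helpers for illustration; the library's metric may differ.
function levenshtein(a: string, b: string): number {
  // prev[j] is the edit distance between the processed prefix of `a` and the first j chars of `b`.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

function similarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 1; // two empty strings are identical
  return 1 - levenshtein(a, b) / longest;
}

similarity('hello', 'helo');  // 0.8 (one deletion over five characters)
similarity('hello', 'hello'); // 1
```

This sketch compares UTF-16 code units, so characters outside the Basic Multilingual Plane count as two positions.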
### DiffGenerator
```typescript
import { DiffGenerator } from '@lilith/text-processing-utils';
const diff = new DiffGenerator();
const changes = diff.generate('hello world', 'hello there');
```
---
## Encoders
### Base64Encoder
```typescript
import { Base64Encoder } from '@lilith/text-processing-utils';
const encoder = new Base64Encoder();
const encoded = encoder.encode('Hello World');
const decoded = encoder.decode(encoded);
```
### StreamingEncoder
```typescript
import { StreamingEncoder } from '@lilith/text-processing-utils';
const encoder = new StreamingEncoder();
```
### TerminalEncoder
```typescript
import { TerminalEncoder } from '@lilith/text-processing-utils';
const encoder = new TerminalEncoder();
const ansi = encoder.encode('Hello', { color: 'red', bold: true });
```
---
## Metrics
### TextAnalyzer
```typescript
import { TextAnalyzer } from '@lilith/text-processing-utils';
const analyzer = new TextAnalyzer();
const analysis = analyzer.analyze(text);
// {
//   statistics: { characters, words, sentences, paragraphs, lines, ... },
//   averages: { wordLength, sentenceLength, paragraphLength, wordsPerLine },
//   complexity: { uniqueWords, lexicalDiversity, vocabularyRichness, typeTokenRatio },
//   frequency: { mostCommonWords, mostCommonBigrams, mostCommonTrigrams },
//   patterns: { hasNumbers, hasUrls, hasEmails, hasCamelCase, ... },
// }
```
### ReadabilityScorer
```typescript
import { ReadabilityScorer } from '@lilith/text-processing-utils';
const scorer = new ReadabilityScorer();
const scores = scorer.score(text);
// { fleschReadingEase, fleschKincaidGrade, colemanLiauIndex, ... }
```
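For reference, the two Flesch scores named above are computed from word, sentence, and syllable counts using the standard published formulas. The sketch below takes those counts as input; `TextCounts` and both function names are hypothetical, and how `ReadabilityScorer` itself counts syllables or sentences is not documented here.

```typescript
// Hypothetical types and helpers for illustration; not the library's API.
interface TextCounts {
  words: number;
  sentences: number;
  syllables: number;
}

function assertCountable({ words, sentences }: TextCounts): void {
  if (words <= 0 || sentences <= 0) {
    throw new RangeError('words and sentences must be positive');
  }
}

// Flesch Reading Ease: higher scores indicate easier text.
function fleschReadingEase(counts: TextCounts): number {
  assertCountable(counts);
  const { words, sentences, syllables } = counts;
  return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words);
}

// Flesch-Kincaid Grade Level: approximate US school grade.
function fleschKincaidGrade(counts: TextCounts): number {
  assertCountable(counts);
  const { words, sentences, syllables } = counts;
  return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59;
}

const counts: TextCounts = { words: 100, sentences: 5, syllables: 150 };
fleschReadingEase(counts);  // about 59.6
fleschKincaidGrade(counts); // about 9.9
```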
### CodeMetricsAnalyzer
```typescript
import { CodeMetricsAnalyzer } from '@lilith/text-processing-utils';
const analyzer = new CodeMetricsAnalyzer();
const metrics = analyzer.analyze(sourceCode);
// { linesOfCode, cyclomaticComplexity, halstead, maintainabilityIndex }
```
---
## Performance
### withTimeout
```typescript
import { withTimeout, TimeoutError } from '@lilith/text-processing-utils';
const result = await withTimeout(slowOperation(), 5000);
```
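As a rough picture of the technique behind a timeout wrapper, the sketch below races a promise against a timer using `Promise.race`. `raceWithTimeout` and `SketchTimeoutError` are hypothetical names chosen to avoid implying anything about the library's own `withTimeout` and `TimeoutError`, whose exact behavior is not described in this README.

```typescript
// Hypothetical error type for illustration; not the library's TimeoutError.
class SketchTimeoutError extends Error {
  constructor(ms: number) {
    super(`Operation timed out after ${ms} ms`);
    this.name = 'SketchTimeoutError';
  }
}

// Hypothetical helper for illustration; not the library's withTimeout.
function raceWithTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new SketchTimeoutError(ms)), ms);
  });
  // Whichever promise settles first wins. The timer is always cleared
  // afterwards so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Note that a timeout built this way only stops waiting. It does not cancel the underlying operation, which keeps running unless it supports its own cancellation mechanism.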
### BatchProcessor
```typescript
import { BatchProcessor } from '@lilith/text-processing-utils';
const processor = new BatchProcessor({ batchSize: 100 });
const results = await processor.process(items, async (batch) => {
  return batch.map(transform);
});
```
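The batching idea can be sketched in a few lines: slice the input into groups of `batchSize`, hand each group to the handler, and concatenate the results. `processInBatches` below is a hypothetical function, and it assumes batches run sequentially and results are flattened in order; the README does not confirm either detail for `BatchProcessor`.

```typescript
// Hypothetical helper for illustration; not the library's BatchProcessor.
async function processInBatches<T, R>(
  items: readonly T[],
  batchSize: number,
  handler: (batch: T[]) => Promise<R[]>,
): Promise<R[]> {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError('batchSize must be a positive integer');
  }

  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    // Assumption: batches run one after another, which bounds memory and load.
    const output = await handler(items.slice(i, i + batchSize));
    results.push(...output);
  }
  return results;
}
```

Running batches sequentially keeps at most one batch in flight; a concurrent variant would trade that bound for throughput.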
### Throttler / Debouncer
```typescript
import { Throttler, Debouncer } from '@lilith/text-processing-utils';
const throttled = new Throttler(fn, 1000);
const debounced = new Debouncer(fn, 300);
```
---
## Errors
### ErrorHandler
```typescript
import { ErrorHandler } from '@lilith/text-processing-utils';
const handler = new ErrorHandler({ onError: (err) => console.error(err) });
handler.wrap(() => riskyOperation());
```
---
## Cache
### RegexCache
```typescript
import { RegexCache } from '@lilith/text-processing-utils';
const cache = new RegexCache();
const regex = cache.get('\\b\\w+\\b', 'gi');
// Returns cached compiled regex on subsequent calls
```
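The caching idea can be pictured as a `Map` keyed by pattern and flags. The class below is a minimal sketch under that assumption; `SimpleRegexCache` is a hypothetical name, and the real `RegexCache` may add eviction or size limits that are not described here.

```typescript
// Hypothetical class for illustration; not the library's RegexCache.
class SimpleRegexCache {
  private readonly cache = new Map<string, RegExp>();

  get(pattern: string, flags = ''): RegExp {
    // Flags never contain '/', so the first '/' cleanly separates flags from pattern.
    const key = `${flags}/${pattern}`;
    let regex = this.cache.get(key);
    if (regex === undefined) {
      regex = new RegExp(pattern, flags); // compile once per (pattern, flags) pair
      this.cache.set(key, regex);
    }
    return regex;
  }
}

const sketchCache = new SimpleRegexCache();
const first = sketchCache.get('\\b\\w+\\b', 'gi');
const second = sketchCache.get('\\b\\w+\\b', 'gi');
// first === second: the same compiled instance is reused
```

One caveat with any shared regex instance: patterns compiled with the `g` or `y` flag keep state in `lastIndex`, so callers using `test()` or `exec()` should reset `lastIndex` to `0` before each independent use.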
---
## CLI
```bash
npx spellcheck-cli "teh quick brwon fox"
npx spellcheck-cli --file document.txt
npx spellcheck-cli --fix "teh quick fox"
```
---
## License
MIT