Tegmen
A high-performance on-premise PII detection and masking solution
Overview
Tegmen is a production-ready token classification system designed for identifying and masking personally identifiable information (PII) in text data. Built for high-throughput data sanitization workflows, it offers on-premise deployment capabilities with enterprise-grade performance.
Key Features
- On-Premise Deployment: Run entirely within your infrastructure
- Lightweight Architecture: Optimized for edge deployment
- Fine-Tunable: Easily adapt to your specific data distributions
- Long Context Support: Process documents up to 128,000 tokens
- Configurable Detection: Tune precision/recall tradeoffs
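The card does not say how the precision/recall tradeoff is exposed. One common approach, sketched here as an assumption rather than as Tegmen's documented mechanism, is to filter detections by confidence score with a per-category cutoff. The helper `filter_by_threshold`, the threshold values, and the sample detections below are all hypothetical; the dict keys mirror what the `transformers` token-classification pipeline returns when an aggregation strategy is set.

```python
from typing import Dict, List


def filter_by_threshold(
    entities: List[dict],
    thresholds: Dict[str, float],
    default: float = 0.5,
) -> List[dict]:
    """Keep detections whose score meets the cutoff for their category.

    Higher cutoffs favor precision; lower cutoffs favor recall.
    """
    kept = []
    for ent in entities:
        cutoff = thresholds.get(ent["entity_group"], default)
        if ent["score"] >= cutoff:
            kept.append(ent)
    return kept


# Hand-written detections standing in for real pipeline output (illustrative scores).
sample = [
    {"entity_group": "private_person", "score": 0.98, "word": "John Smith"},
    {"entity_group": "private_email", "score": 0.62, "word": "john.smith@email.com"},
    {"entity_group": "secret", "score": 0.40, "word": "hunter2"},
]

# Strict on emails, lenient on secrets, default cutoff for everything else.
kept = filter_by_threshold(sample, {"private_email": 0.7, "secret": 0.3})
print([ent["entity_group"] for ent in kept])
```

Lowering a category's cutoff catches more borderline cases at the cost of more false positives, which is often the safer choice for credentials.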
Supported PII Categories
The model detects 8 categories of sensitive information:
| Category | Description |
|---|---|
| account_number | Financial account identifiers |
| private_address | Physical and mailing addresses |
| private_email | Email addresses |
| private_person | Personal names |
| private_phone | Phone numbers |
| private_url | URLs and web addresses |
| private_date | Birth dates and personal dates |
| secret | API keys, passwords, credentials |
Installation
pip install transformers torch
Quick Start
Using the Pipeline API
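from transformers import pipeline

# "simple" aggregation merges sub-word tokens into whole entity spans
detector = pipeline("token-classification", model="comethrusws/tegmen", aggregation_strategy="simple")
text = "Contact John Smith at john.smith@email.com"
results = detector(text)
# Each result is a dict holding the matched text and its predicted category
for item in results:
    print(f"Found: {item['word']} ({item['entity_group']})")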
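The pipeline example above only prints detections. To actually mask them, the character offsets in each result can be used to substitute placeholders. The following is a minimal sketch, not an official Tegmen utility: `mask_text` and the `[CATEGORY]` placeholder format are assumptions, and the `entities` list is hand-written to stand in for pipeline output so the snippet runs without downloading the model.

```python
from typing import List


def mask_text(text: str, entities: List[dict]) -> str:
    """Replace each detected span with a [CATEGORY] placeholder."""
    masked = text
    # Work right to left so earlier character offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group'].upper()}]"
        masked = masked[: ent["start"]] + placeholder + masked[ent["end"] :]
    return masked


text = "Contact John Smith at john.smith@email.com"
# Illustrative stand-in for `detector(text)`; real output also includes `score` and `word`.
entities = [
    {"entity_group": "private_person", "start": 8, "end": 18},
    {"entity_group": "private_email", "start": 22, "end": 42},
]

masked = mask_text(text, entities)
print(masked)  # Contact [PRIVATE_PERSON] at [PRIVATE_EMAIL]
```

With the real model, passing `detector(text)` in place of `entities` should work unchanged, assuming the detected spans do not overlap.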
Using the Model Directly
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("comethrusws/tegmen")
model = AutoModelForTokenClassification.from_pretrained("comethrusws/tegmen")
text = "My name is Alice and my email is alice@example.com"
inputs = tokenizer(text, return_tensors="pt")
# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)
# Pick the highest-scoring label for each token
predictions = outputs.logits.argmax(dim=-1)
labels = [model.config.id2label[p.item()] for p in predictions[0]]
print(labels)
Performance Specifications
- Architecture: Transformer encoder
- Parameters: 1.5B total / 50M active
- Context Window: 128,000 tokens
- Output Format: BIOES span tagging
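In BIOES tagging, each token is marked as the Beginning, Inside, or End of a multi-token span, as a Single-token span, or as Outside any span. When using the model directly, the per-token labels need to be collapsed into spans. The sketch below assumes labels of the form `B-private_person`, `S-private_email`, and `O`; the exact label strings in this model's `id2label` mapping are not given in this card, so check them before relying on this. The function `decode_bioes` is hypothetical.

```python
from typing import List, Tuple


def decode_bioes(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse BIOES tags into (category, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            # Single-token span
            spans.append((label, i, i + 1))
            start, current = None, None
        elif prefix == "B":
            # Open a new multi-token span
            start, current = i, label
        elif prefix == "I" and current == label:
            # Continue the open span
            pass
        elif prefix == "E" and current == label:
            # Close the open span
            spans.append((label, start, i + 1))
            start, current = None, None
        else:
            # "O" or a malformed sequence: drop any partially open span
            start, current = None, None
    return spans


tags = ["O", "B-private_person", "E-private_person", "O", "S-private_email"]
spans = decode_bioes(tags)
print(spans)  # [('private_person', 1, 3), ('private_email', 4, 5)]
```

The indices here are token positions, not character offsets; mapping back to characters would require the tokenizer's offset mapping.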
License
Apache License 2.0
Support
For enterprise support, contact SAGEA.