Tegmen

A high-performance, on-premises PII detection and masking solution

Overview

Tegmen is a production-ready token classification system that identifies personally identifiable information (PII) in text so that it can be masked. It is built for high-throughput data sanitization workflows and can be deployed entirely on-premises.

Key Features

  • On-Premises Deployment: Run entirely within your infrastructure
  • Lightweight Architecture: Optimized for edge deployment
  • Fine-Tunable: Easily adapt to your specific data distributions
  • Long Context Support: Process documents up to 128,000 tokens
  • Configurable Detection: Tune precision/recall tradeoffs
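One common way to trade precision against recall is to filter detections by confidence score. The sketch below assumes detections shaped like the output of the Transformers token-classification pipeline (dicts with `entity_group`, `score`, `word`, `start`, `end`). The helper `filter_by_score`, the sample detections, and the threshold values are hypothetical illustrations, not a documented Tegmen setting.

```python
def filter_by_score(entities, threshold=0.5):
    """Keep only detections whose confidence is at least `threshold`."""
    return [e for e in entities if float(e["score"]) >= threshold]


# Hypothetical detections in the shape returned by the pipeline
sample = [
    {"entity_group": "private_person", "score": 0.98,
     "word": "John Smith", "start": 8, "end": 18},
    {"entity_group": "private_email", "score": 0.62,
     "word": "john.smith@email.com", "start": 22, "end": 42},
    {"entity_group": "private_url", "score": 0.31,
     "word": "email.com", "start": 33, "end": 42},
]

# A high threshold favors precision; a low one favors recall
high_precision = filter_by_score(sample, threshold=0.9)
high_recall = filter_by_score(sample, threshold=0.3)
print(len(high_precision), len(high_recall))
```

A suitable threshold depends on the data and on whether missed PII or over-masking is the costlier error, so it is best chosen on a labeled validation set.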

Supported PII Categories

The model detects 8 categories of sensitive information:

| Category | Description |
|---|---|
| account_number | Financial account identifiers |
| private_address | Physical and mailing addresses |
| private_email | Email addresses |
| private_person | Personal names |
| private_phone | Phone numbers |
| private_url | URLs and web addresses |
| private_date | Birth dates and personal dates |
| secret | API keys, passwords, credentials |

Installation

pip install transformers torch

Quick Start

Using the Pipeline API

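from transformers import pipeline

# aggregation_strategy="simple" merges token-level predictions into entity spans
detector = pipeline(
    "token-classification",
    model="comethrusws/tegmen",
    aggregation_strategy="simple",
)

text = "Contact John Smith at john.smith@email.com"
results = detector(text)

# Each result carries the matched text and its predicted category
for item in results:
    print(f"Found: {item['word']} ({item['entity_group']})")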
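The pipeline only reports detections; masking is a separate step. The sketch below shows one way to do it, assuming each detection carries character offsets in `start` and `end` (as the aggregated pipeline output does with a fast tokenizer) and that spans do not overlap. The helper `mask_text`, the `[CATEGORY]` placeholder style, and the hard-coded detections are hypothetical and stand in for real model output.

```python
def mask_text(text, entities):
    """Replace each detected span with a [CATEGORY] placeholder."""
    masked = text
    # Work right to left so earlier offsets stay valid after each replacement
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group'].upper()}]"
        masked = masked[:ent["start"]] + placeholder + masked[ent["end"]:]
    return masked


text = "Contact John Smith at john.smith@email.com"

# Hypothetical detections; in practice pass the list returned by detector(text)
entities = [
    {"entity_group": "private_person", "start": 8, "end": 18},
    {"entity_group": "private_email", "start": 22, "end": 42},
]

print(mask_text(text, entities))
# prints: Contact [PRIVATE_PERSON] at [PRIVATE_EMAIL]
```

If the `entity_group` values returned by the model carry a tagging prefix, strip it before building the placeholder.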

Using the Model Directly

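import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("comethrusws/tegmen")
model = AutoModelForTokenClassification.from_pretrained("comethrusws/tegmen")

text = "My name is Alice and my email is alice@example.com"
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label for each token (any special tokens included)
predictions = outputs.logits.argmax(dim=-1)
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print each token next to its predicted label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")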

Performance Specifications

  • Architecture: Transformer encoder
  • Parameters: 1.5B total / 50M active
  • Context Window: 128,000 tokens
  • Output Format: BIOES span tagging
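In BIOES tagging, each token is labeled as the Beginning, Inside, or End of a multi-token span, as a Single-token span, or as Outside any span. The sketch below shows how such a tag sequence could be turned into token-level spans. The label format (`B-private_person`, `S-private_email`, and so on) is an assumption, since the exact label strings are not listed here, and `decode_bioes` is a hypothetical helper.

```python
def decode_bioes(tags):
    """Convert a BIOES tag sequence into (category, start, end) token spans.

    `end` is exclusive. Malformed sequences are handled leniently: an
    unfinished span is dropped when the sequence breaks.
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, category = tag.partition("-")
        if prefix == "S":
            # Single-token span
            spans.append((category, i, i + 1))
            start, current = None, None
        elif prefix == "B":
            # Open a new multi-token span
            start, current = i, category
        elif prefix == "I" and category == current:
            # Still inside the open span
            continue
        elif prefix == "E" and category == current:
            # Close the open span
            spans.append((category, start, i + 1))
            start, current = None, None
        else:
            # "O" or an inconsistent tag: abandon any open span
            start, current = None, None
    return spans


# Assumed label format: "<prefix>-<category>"
tags = ["O", "B-private_person", "E-private_person", "O", "S-private_email"]
print(decode_bioes(tags))
# prints: [('private_person', 1, 3), ('private_email', 4, 5)]
```

The spans are token indices; mapping them back to character positions requires the tokenizer's offset mapping.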

License

Apache License 2.0

Support

For enterprise support, contact SAGEA.
