Tegmen

A high-performance on-premise PII detection and masking solution

Overview

Tegmen is a production-ready token classification system that identifies and masks personally identifiable information (PII) in text. It is built for high-throughput data sanitization workflows and can be deployed on-premise with enterprise-grade performance.

Key Features

On-Premise Deployment: Run entirely within your infrastructure
Lightweight Architecture: Optimized for edge deployment
Fine-Tunable: Easily adapt to your specific data distributions
Long Context Support: Process documents up to 128,000 tokens
Configurable Detection: Tune precision/recall tradeoffs
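
One common way to tune the precision/recall tradeoff is to filter detections by confidence score. The sketch below assumes entity dictionaries in the shape returned by the Transformers token-classification pipeline (with `entity_group` and `score` keys). The `filter_by_score` helper, the sample detections, and the threshold values are hypothetical illustrations, not part of Tegmen itself, which may offer its own configuration mechanism.

```python
def filter_by_score(entities, thresholds, default=0.5):
    """Keep detections whose score meets the threshold for their category.

    `thresholds` maps a category name to a minimum score; categories that
    are not listed fall back to `default`. Higher thresholds favor
    precision, lower thresholds favor recall.
    """
    kept = []
    for ent in entities:
        min_score = thresholds.get(ent["entity_group"], default)
        if ent["score"] >= min_score:
            kept.append(ent)
    return kept


# Hand-written stand-ins for pipeline output (not real model predictions)
sample = [
    {"entity_group": "private_person", "score": 0.98, "word": "John Smith"},
    {"entity_group": "private_date", "score": 0.41, "word": "May"},
]

# Require high confidence for dates, use the default for everything else
strict = filter_by_score(sample, {"private_date": 0.9})
print([ent["word"] for ent in strict])
```

Per-category thresholds let you be strict about categories that produce many false positives while staying permissive about categories where a missed detection is costly.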

Supported PII Categories

The model detects 8 categories of sensitive information:

Category	Description
`account_number`	Financial account identifiers
`private_address`	Physical and mailing addresses
`private_email`	Email addresses
`private_person`	Personal names
`private_phone`	Phone numbers
`private_url`	URLs and web addresses
`private_date`	Birth dates and personal dates
`secret`	API keys, passwords, credentials

Installation

pip install transformers torch

Quick Start

Using the Pipeline API

from transformers import pipeline

detector = pipeline("token-classification", model="comethrusws/tegmen", aggregation_strategy="simple")

text = "Contact John Smith at john.smith@email.com"
results = detector(text)

for item in results:
    print(f"Found: {item['word']} ({item['entity_group']})")
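
The pipeline output can also drive masking. The following is a minimal sketch, assuming each result carries the `start` and `end` character offsets and the `entity_group` label that the Transformers pipeline returns with `aggregation_strategy="simple"`. The `mask_text` helper and the placeholder format are hypothetical, and the entities shown are hand-written stand-ins rather than real model predictions.

```python
def mask_text(text, entities):
    """Replace each detected span with a bracketed category placeholder."""
    masked = text
    # Work from right to left so earlier character offsets stay valid
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group'].upper()}]"
        masked = masked[:ent["start"]] + placeholder + masked[ent["end"]:]
    return masked


text = "Contact John Smith at john.smith@email.com"

# Stand-ins for detector(text); real output also includes "score" and "word"
entities = [
    {"entity_group": "private_person", "start": 8, "end": 18},
    {"entity_group": "private_email", "start": 22, "end": 42},
]

masked = mask_text(text, entities)
print(masked)  # Contact [PRIVATE_PERSON] at [PRIVATE_EMAIL]
```

In practice you would pass `detector(text)` as `entities`. This sketch assumes the spans do not overlap.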

Using the Model Directly

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("comethrusws/tegmen")
model = AutoModelForTokenClassification.from_pretrained("comethrusws/tegmen")

text = "My name is Alice and my email is alice@example.com"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

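# Pick the highest-scoring label ID for each token, then map IDs to label names
predictions = outputs.logits.argmax(dim=-1)
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print each token next to its predicted tag (including any special tokens)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")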

Performance Specifications

Architecture: Transformer encoder
Parameters: 1.5B total / 50M active
Context Window: 128,000 tokens
Output Format: BIOES span tagging
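
In BIOES tagging, each token is marked as the Beginning, Inside, or End of a multi-token span, as a Single-token span, or as Outside any span. The sketch below shows one way to turn per-token tags into spans. It assumes tags are written as `B-<category>`, `I-<category>`, `E-<category>`, `S-<category>`, and `O`. The exact label strings in `model.config.id2label` may differ, and the `bioes_to_spans` helper is hypothetical.

```python
def bioes_to_spans(tags):
    """Convert BIOES tags to (category, start, end) token spans.

    `end` is exclusive. Malformed sequences (for example a B- tag that is
    never closed by a matching E- tag) are dropped rather than repaired.
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, category = tag.partition("-")
        if prefix == "S":
            # Single-token span
            spans.append((category, i, i + 1))
            start, current = None, None
        elif prefix == "B":
            # Open a new multi-token span
            start, current = i, category
        elif prefix == "I" and current == category:
            # Still inside the open span
            continue
        elif prefix == "E" and current == category:
            # Close the open span
            spans.append((category, start, i + 1))
            start, current = None, None
        else:
            # "O" or an inconsistent tag: discard any open span
            start, current = None, None
    return spans


tags = ["O", "O", "O", "B-private_person", "E-private_person", "O", "S-private_email"]
spans = bioes_to_spans(tags)
print(spans)  # [('private_person', 3, 5), ('private_email', 6, 7)]
```

The pipeline's `aggregation_strategy` option performs its own grouping, so a decoder like this is only relevant when working with raw model output.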

License

Apache License 2.0

Support

For enterprise support, contact SAGEA.
