Toxicity Detection flags whether a response contains hateful or otherwise toxic content.

Categories of Toxicity

Types of Toxic Content

Hate Speech: Statements that demean, dehumanize, or attack individuals or groups based on identity factors like race, gender, or religion.

Offensive Content: Vulgar, abusive, or overly profane language used to provoke or insult.

Sexual Content: Explicit or inappropriate sexual statements that may be offensive or unsuitable in context.

Violence or Harm: Advocacy or description of physical harm, abuse, or violent actions.

Illegal or Unethical Guidance: Instructions or encouragement for illegal or unethical actions.

Manipulation or Exploitation: Language intended to deceive, exploit, or manipulate individuals for harmful purposes.
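
When building or integrating a detector, these categories map naturally onto a label schema. The sketch below is purely illustrative: the `ToxicityCategory` enum and the result shape are hypothetical structures mirroring the list above, not part of any published API.

```python
from enum import Enum

class ToxicityCategory(str, Enum):
    """Illustrative label schema mirroring the categories above (hypothetical)."""
    HATE_SPEECH = "hate_speech"
    OFFENSIVE_CONTENT = "offensive_content"
    SEXUAL_CONTENT = "sexual_content"
    VIOLENCE_OR_HARM = "violence_or_harm"
    ILLEGAL_OR_UNETHICAL_GUIDANCE = "illegal_or_unethical_guidance"
    MANIPULATION_OR_EXPLOITATION = "manipulation_or_exploitation"

# A detection result might pair a binary flag with the categories that fired, e.g.:
# {"flagged": True, "categories": [ToxicityCategory.HATE_SPEECH]}
```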

Calculation Method

Toxicity detection is computed in three steps:

1. Model Architecture

The detection system employs a Small Language Model (SLM) that leverages both open-source and internal datasets to identify toxic content across the categories listed above.
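
The internal SLM itself is not public, but the inference pattern can be sketched with an open-source stand-in. A minimal sketch, assuming the Hugging Face transformers library and using unitary/toxic-bert (a public multi-label toxicity classifier trained on the Jigsaw datasets) in place of the internal model:

```python
from transformers import pipeline

# Open-source stand-in for the internal SLM described above.
classifier = pipeline("text-classification", model="unitary/toxic-bert", top_k=None)

def is_toxic(response: str, threshold: float = 0.5) -> bool:
    """Flag a response if any toxicity label scores above the threshold."""
    # Batch input yields one list of {"label": ..., "score": ...} dicts per input.
    scores = classifier([response])[0]
    return any(s["score"] >= threshold for s in scores)

print(is_toxic("Have a great day!"))  # expected: False
```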

2. Performance Metrics

The model achieves 96% accuracy when evaluated against validation sets drawn from multiple established datasets.
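
To illustrate how an accuracy figure like this is computed, the sketch below scores a labeled validation file and reports accuracy with scikit-learn. The `val.csv` file and its column names are assumptions modeled on the Jigsaw schema, not a released artifact; `is_toxic` is the classifier sketch from step 1.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical validation split with a Jigsaw-style schema:
# a "comment_text" column and a binary "toxic" label.
val = pd.read_csv("val.csv")

# Score every validation example with the thresholded classifier.
predictions = [is_toxic(text) for text in val["comment_text"]]

print(f"accuracy: {accuracy_score(val['toxic'], predictions):.2%}")
```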

3. Validation Sources

The system’s effectiveness is verified using industry-standard benchmarks including the Toxic Comment Classification Challenge, Jigsaw Unintended Bias dataset, and Jigsaw Multilingual dataset for robust cross-cultural detection.

Toxic Comment Classification Challenge: Open-source dataset for toxic content detection.

Jigsaw Unintended Bias: Dataset focused on identifying biased toxic content.

Jigsaw Multilingual: Multi-language toxic content classification.
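
If you want to run a similar validation yourself, the public Jigsaw data can be adapted into the binary schema used in the sketches above. A minimal sketch, assuming the Toxic Comment Classification Challenge train.csv has been downloaded from Kaggle:

```python
import pandas as pd

# Columns in the Kaggle Toxic Comment Classification Challenge data:
# id, comment_text, toxic, severe_toxic, obscene, threat, insult, identity_hate
df = pd.read_csv("train.csv")

label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Collapse the multi-label annotations into a single binary flag.
df["any_toxic"] = df[label_cols].max(axis=1).astype(bool)

# Hold out a small sample as the val.csv used in the earlier accuracy sketch.
df.sample(n=1000, random_state=0)[["comment_text", "any_toxic"]] \
  .rename(columns={"any_toxic": "toxic"}) \
  .to_csv("val.csv", index=False)
```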

Optimizing Your AI System

Addressing Toxicity in Your System

When toxic content is detected in your system, consider these approaches:

Implement guardrails: Flag or block toxic responses before they are served to users (see the sketch at the end of this section).

Fine-tune models: Adjust model behavior to reduce toxic outputs.

Identify responses that contain toxic content and take preventive measures to ensure safe and appropriate AI interactions.
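
For the guardrail approach above, a minimal sketch wraps response generation with a toxicity check so nothing flagged reaches the user. The `generate_response` function and the fallback message are hypothetical placeholders; `is_toxic` is the classifier sketch from the Calculation Method section.

```python
FALLBACK = "I can't provide that response, but I'm happy to help another way."

def generate_response(prompt: str) -> str:
    """Hypothetical stand-in for your model's generation call."""
    return "Here is a helpful, polite answer."

def guarded_response(prompt: str) -> str:
    """Run the toxicity check before a response ever reaches the user."""
    response = generate_response(prompt)
    if is_toxic(response):
        # Log the flagged output for review, then serve a safe fallback.
        print(f"flagged response for prompt: {prompt!r}")
        return FALLBACK
    return response

print(guarded_response("What's the weather like?"))
```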