Data Policies

Honeyframe applies column-level masking to query results based on the column's classification, the user's role, and any masking policies attached to the dataset or organization. Masking is enforced post-query in Python so it works uniformly across every connector (PostgreSQL, Oracle, MySQL, MSSQL, etc.) without connector-specific SQL.

The masking engine lives at paas/backend/services/masking.py.

Concepts

A column has a semantic type, a sensitivity level, and a masking strategy. The platform classifies columns automatically based on column names matching known PII patterns; classifications can be overridden per dataset or per organization.

Sensitivity levels

Level	Default treatment
`critical`	Always masked unless the user holds an unmask role.
`high`	Masked for non-admin viewers by default.
`medium`	Masked for `viewer`-tier roles.
`low`	Not masked.

Masking strategies

Strategy	Behavior
`partial`	Show first/last few characters, mask the middle (`john****@gmail.com`).
`full`	Replace the value with a fixed mask (`***`).
`hash`	Replace with a deterministic hash so joins still work (`sha256(value)[:12]`).
`redact`	Drop the value entirely (returns `null`).
`none`	Pass through unchanged.

The supported strategies are enumerated in STRATEGIES = ("partial", "full", "hash", "redact", "none").
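As a rough illustration of the table above, the five strategies could be applied to a single value as follows. This is a minimal sketch, not the code in masking.py: the function name `apply_strategy`, the number of characters kept by `partial`, and the handling of `None` are all assumptions.

```python
import hashlib

STRATEGIES = ("partial", "full", "hash", "redact", "none")


def apply_strategy(value, strategy):
    """Mask one value. Hypothetical helper, not the engine's real API."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown masking strategy: {strategy!r}")
    if value is None or strategy == "none":
        return value  # nothing to mask, or pass-through
    if strategy == "redact":
        return None  # drop the value entirely
    if strategy == "full":
        return "***"  # fixed mask
    text = str(value)
    if strategy == "hash":
        # Deterministic, so equal inputs still join on the masked column.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    # "partial": assumed rule, keep a few characters and mask the rest.
    local, sep, domain = text.partition("@")
    if sep:
        return local[:4] + "****" + sep + domain
    if len(text) <= 4:
        return "*" * len(text)
    return text[:2] + "*" * (len(text) - 4) + text[-2:]
```

For example, `apply_strategy("john.doe@gmail.com", "partial")` yields `john****@gmail.com` under these assumptions.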

Default classifications

The PII_DEFAULTS table seeds the engine with classifications for common Indonesian healthcare and personal-data fields:

Semantic type	Default sensitivity	Default strategy	Unmask roles
`nik`	`critical`	`partial`	`admin`
`email`	`high`	`partial`	`admin`
`phone`	`high`	`partial`	`admin`
`address`	`high`	`partial`	`admin`
`name`	`medium`	`partial`	`admin`

Auto-classification kicks in when a query result column name matches a known semantic-type pattern (e.g. email, customer_email, email_address all map to email). Other columns default to no masking.
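Name-based classification of this kind can be sketched with a small pattern table. The patterns and the `classify_column` helper below are hypothetical; the engine's actual pattern list is not shown in this document and may differ.

```python
import re

# Hypothetical patterns, checked in insertion order: the first match wins,
# so "email_address" resolves to "email" before "address" is tried.
SEMANTIC_PATTERNS = {
    "nik": re.compile(r"(^|_)nik($|_)"),
    "email": re.compile(r"(^|_)email($|_)"),
    "phone": re.compile(r"(^|_)phone($|_)"),
    "address": re.compile(r"(^|_)address($|_)"),
    "name": re.compile(r"(^|_)name($|_)"),
}


def classify_column(column_name):
    """Return the semantic type for a column name, or None if nothing matches."""
    normalized = column_name.strip().lower()
    for semantic_type, pattern in SEMANTIC_PATTERNS.items():
        if pattern.search(normalized):
            return semantic_type
    return None  # unclassified columns are not masked
```

Matching on underscore-delimited tokens keeps a column such as `filename` from being classified as `name`.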

Resolution order

When the engine masks a query result, it resolves the rule for each column in this order — first match wins:

Dataset-level override — datasets.settings.masking[col_name] JSON object on the dataset record.
Org-level default — organizations.data_policies.masking_defaults[semantic_type] JSON object on the org record.
Auto-classify default — the entry in PII_DEFAULTS[semantic_type].
No rule → no masking.

This means a dataset owner can set a normally masked column to none for a specific dataset (e.g. an analyst-facing aggregate view), and an organization admin can tighten or loosen the default for every dataset in the org.
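The first-match-wins lookup can be sketched as below. The function name `resolve_rule`, the dict shape of `PII_DEFAULTS`, and the `"no-rule"` label are assumptions for illustration; the other reason labels follow the ones the engine logs.

```python
# Values taken from the default classifications table; the dict shape is assumed.
PII_DEFAULTS = {
    "nik": {"sensitivity": "critical", "strategy": "partial", "unmask_roles": ["admin"]},
    "email": {"sensitivity": "high", "strategy": "partial", "unmask_roles": ["admin"]},
    "phone": {"sensitivity": "high", "strategy": "partial", "unmask_roles": ["admin"]},
    "address": {"sensitivity": "high", "strategy": "partial", "unmask_roles": ["admin"]},
    "name": {"sensitivity": "medium", "strategy": "partial", "unmask_roles": ["admin"]},
}


def resolve_rule(col_name, semantic_type, dataset_settings, org_policies):
    """Return (rule, reason) for one column; rule is None when nothing applies."""
    # 1. Dataset-level override, keyed by column name.
    rule = dataset_settings.get("masking", {}).get(col_name)
    if rule is not None:
        return rule, "dataset-override"
    # 2. Org-level default, keyed by semantic type.
    rule = org_policies.get("masking_defaults", {}).get(semantic_type)
    if rule is not None:
        return rule, "org-default"
    # 3. Built-in default for the auto-classified semantic type.
    rule = PII_DEFAULTS.get(semantic_type)
    if rule is not None:
        return rule, "auto-classify"
    # 4. No rule, so the column is not masked.
    return None, "no-rule"
```

Note that the dataset-level lookup is keyed by column name while the other two are keyed by semantic type, which is why both are passed in.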

Setting a dataset-level rule

Update the dataset's settings.masking field via the dataset settings UI or the /api/datasets/{dataset_id} endpoint:

{
  "settings": {
    "masking": {
      "patient_email": {
        "strategy": "hash",
        "unmask_roles": ["admin", "cs_staff"]
      },
      "patient_phone": {
        "strategy": "redact"
      }
    }
  }
}

unmask_roles is a list of role strings that bypass the mask for this column. If omitted, the engine uses the default unmask roles from PII_DEFAULTS.
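The fallback for an omitted unmask_roles can be sketched as follows. The helper name `effective_unmask_roles` and the empty-list result for unclassified columns are assumptions, and `PII_DEFAULTS` is trimmed to two entries for brevity.

```python
# Trimmed, assumed shape; see the default classifications table for all entries.
PII_DEFAULTS = {
    "email": {"strategy": "partial", "unmask_roles": ["admin"]},
    "phone": {"strategy": "partial", "unmask_roles": ["admin"]},
}


def effective_unmask_roles(rule, semantic_type):
    """Roles that bypass the mask for a column governed by `rule`."""
    if "unmask_roles" in rule:
        return list(rule["unmask_roles"])  # explicit list on the rule wins
    default = PII_DEFAULTS.get(semantic_type, {})
    # Assumption: a column with no PII default has no unmask roles.
    return list(default.get("unmask_roles", []))
```

With the example above, `patient_email` would be unmasked for `admin` and `cs_staff`, while `patient_phone` would fall back to the `phone` default of `admin` only.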

Setting an org-level default

Update the organization's data_policies.masking_defaults:

{
  "data_policies": {
    "masking_defaults": {
      "email": {"strategy": "hash"},
      "phone": {"strategy": "full"}
    }
  }
}

Org-level defaults override PII_DEFAULTS for every dataset in the org that does not have its own dataset-level rule.

Per-project unmask roles

Some installs need finer-grained control, e.g. a customer-service team that should see unmasked phone numbers only on the projects they're assigned to. The engine honors an unmask_project_roles field on the rule. If the user is a member of a project and their role in that project is listed in unmask_project_roles, the column is unmasked only for queries scoped to that project.

{
  "phone": {
    "strategy": "partial",
    "unmask_roles": ["admin"],
    "unmask_project_roles": ["admin", "cs_staff"]
  }
}
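A minimal sketch of how the two role lists might combine into one unmask decision is shown below. The function `is_unmasked` and its parameters are hypothetical; in particular, passing the user's role in the query's project as `project_role` is an assumption about how project scoping reaches the engine.

```python
def is_unmasked(rule, user_roles, project_role=None):
    """Decide whether a column governed by `rule` is shown unmasked.

    user_roles: the user's platform-wide roles.
    project_role: the user's role in the project the query is scoped to,
    or None when the query is not project-scoped or the user is not a member.
    """
    # A platform-wide role listed in unmask_roles always bypasses the mask.
    if set(user_roles) & set(rule.get("unmask_roles", [])):
        return True
    # Otherwise a matching project role bypasses it for that project only.
    if project_role is not None and project_role in rule.get("unmask_project_roles", []):
        return True
    return False


phone_rule = {
    "strategy": "partial",
    "unmask_roles": ["admin"],
    "unmask_project_roles": ["admin", "cs_staff"],
}
```

Under this sketch, a `viewer` whose project role is `cs_staff` sees unmasked phone numbers in that project's queries and masked ones everywhere else.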

Where masking does (and doesn't) apply

Masking is enforced by the platform's SQL execution path (/api/chat, /api/datasets/{id}/explore, dashboard queries, and dataset preview). It is not enforced for:

Direct database access — anyone with a Postgres connection string sees the unmasked rows. Treat the masking engine as a UI-layer protection, not a data-layer one.
Raw connector exports — the data_api publishing surface does not run results through the masking engine. Sharing a dataset via data_api exposes the raw values.
Lakehouse Parquet files — files written by ingestion do not carry masking metadata. Anyone who can read the Parquet path sees the raw values.
dbt model output — dbt runs against the source connector directly; transformation output is unmasked.

For data-layer enforcement, use database-side row security or a separate read replica with masked columns materialized at ingestion time.

Row-level filters

Row-level filtering is not yet implemented as a first-class platform feature. The standard approach is to:

Define a dataset that includes only the rows a given audience should see (e.g. WHERE org_id = :user_org).
Share that dataset with the audience instead of the underlying table.
Use the dataset-level masking rules for column-level concerns.
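The first step can be illustrated with a parameterized query. The sketch below uses an in-memory SQLite database and a hypothetical `visits` table purely to show the `:user_org` predicate; how Honeyframe binds that parameter for a real dataset is not covered here.

```python
import sqlite3

# Hypothetical source table with rows belonging to two organizations.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (id INTEGER PRIMARY KEY, org_id INTEGER, patient_phone TEXT)"
)
conn.executemany(
    "INSERT INTO visits (org_id, patient_phone) VALUES (?, ?)",
    [(1, "0811"), (2, "0822"), (1, "0833")],
)

# The dataset definition carries the row predicate as a named parameter.
DATASET_SQL = (
    "SELECT id, org_id, patient_phone FROM visits "
    "WHERE org_id = :user_org ORDER BY id"
)


def rows_for_org(connection, user_org):
    """Return only the rows the given organization is allowed to see."""
    return connection.execute(DATASET_SQL, {"user_org": user_org}).fetchall()
```

Binding the value as a parameter, rather than formatting it into the SQL string, keeps the filter safe from injection.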

A planned row_filters field on the dataset record will allow declarative row predicates ({"region_id": "{user.region_id}"}) — track the roadmap for the rollout window.

Auditing masking decisions

The masking engine emits structured logs at INFO level for each query: the column names, the applied strategy, and the reason (auto-classify / dataset-override / org-default). These logs are not stored in the audit table by default. To capture them, configure the application's logger to ship them to your SIEM, or add log_audit(...) calls on the strategy-decision path in the masking engine.
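One way to forward these logs is to attach a handler to the engine's logger, as sketched below. The logger name, the `column` / `strategy` / `reason` record attributes, and the `JsonLineHandler` class are all assumptions; check the names the engine actually uses before relying on this.

```python
import json
import logging


class JsonLineHandler(logging.Handler):
    """Serialize each record to one JSON line and pass it to `sink`."""

    def __init__(self, sink):
        super().__init__(level=logging.INFO)
        self.sink = sink  # any callable, e.g. a SIEM client's send method

    def emit(self, record):
        payload = {
            "logger": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
            # Assumed structured fields, supplied via logging's `extra`.
            "column": getattr(record, "column", None),
            "strategy": getattr(record, "strategy", None),
            "reason": getattr(record, "reason", None),
        }
        self.sink(json.dumps(payload, sort_keys=True))


# Assumed logger name, derived from the module path of the masking engine.
masking_logger = logging.getLogger("paas.backend.services.masking")
masking_logger.setLevel(logging.INFO)

captured = []  # stand-in sink; replace with a real shipper
masking_logger.addHandler(JsonLineHandler(captured.append))

# Simulated decision log, in the shape this sketch assumes.
masking_logger.info(
    "masking decision",
    extra={"column": "patient_email", "strategy": "hash", "reason": "dataset-override"},
)
```

Here `captured` stands in for the transport; swapping it for a function that posts to a SIEM leaves the rest unchanged.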
