Datasets
A dataset is a queryable, governed view of data — a table, an external view, a CSV upload, or a dbt-built model. Datasets are the primary unit Honeyframe operates on: recipes consume them, dashboards visualize them, agents query them, and access control gates on them.
Two things become datasets automatically:
- dbt models discovered from the project's manifest — every staging, intermediate, and mart model lands in the dataset catalog with its layer tag.
- Sources declared in sources.yml — external tables behind a Connector appear with the source's schema and freshness metadata.
Anything else (a one-off CSV upload, a custom SQL view against a connector, an imported external schema) becomes a dataset through the + New Dataset flow.
Layers
Each dataset carries a layer tag aligned with the dbt convention:
| Layer | Meaning |
|---|---|
| source | Raw external table reached through a connector. Read-only. |
| staging | Lightly cleaned per-source projection (stg_*). |
| intermediate | Reusable transformation between staging and marts (int_*). |
| marts | Business-modeled output. The default surface for dashboards and agents. |
| views | Ad-hoc views (custom SQL datasets). Not part of the dbt build. |
The layer drives where the dataset lands on the Flow canvas and which schemas it materializes into. It does not by itself control access — that's the permission model.
Browsing datasets
The dataset list page (/datasets) serves four lenses:
- All — every dataset in the project (default).
- By layer — filter to one of the layers above.
- By search — substring match on name and description.
- Available — the union of dbt models, sources, and user-imported datasets the caller can read.
For each dataset, the row shows: name, layer, row count, freshness status (fresh / stale / error), last refresh, and ownership flag. Click through for the dataset detail page.
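The same lenses are available over the API. A minimal sketch with Python's requests; the base URL, token, the layer / search query-parameter names, and the response field names are assumptions to verify against your deployment:

```python
import requests

BASE = "https://honeyframe.example.com"            # hypothetical deployment URL
HEADERS = {"Authorization": "Bearer <API_TOKEN>"}  # substitute a real token

# List mart-layer datasets whose name or description mentions "orders".
# The "layer" and "search" parameter names mirror the list-page filters
# but are assumptions here.
resp = requests.get(
    f"{BASE}/api/datasets",
    params={"layer": "marts", "search": "orders"},
    headers=HEADERS,
)
resp.raise_for_status()

for ds in resp.json():
    # Field names are illustrative assumptions.
    print(ds["name"], ds.get("layer"), ds.get("freshness"))
```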
Detail page
The detail page has six tabs:
| Tab | What it shows |
|---|---|
| Overview | Description, ownership, lineage edges (upstream + downstream count). |
| Schema | Column list with name, data type, nullable, semantic type, masking. |
| Preview | Paginated rows with PII masking applied. |
| Profile | Per-column stats — distinct count, null %, min/max, top values. |
| Classification | PII classification per column with a one-click masking suggestion. |
| Sync | Last sync status, schedule, and a manual Sync now button. |
The Schema → Edit button opens an inline editor for column descriptions, semantic types, and masking config. Edits save to the dataset's row, not to dbt — the next dbt build does not overwrite the metadata.
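The tabs map onto the read endpoints listed in the API reference below. A small sketch that pulls the detail and profile payloads; the base URL, token, dataset name, and the response field names used in the loop are assumptions:

```python
import requests

BASE = "https://honeyframe.example.com"            # hypothetical deployment URL
HEADERS = {"Authorization": "Bearer <API_TOKEN>"}

name = "fct_orders"  # hypothetical mart dataset

# Full detail: schema, row count, freshness, lineage refs.
detail = requests.get(f"{BASE}/api/datasets/{name}", headers=HEADERS).json()

# Column-level statistics, as surfaced on the Profile tab.
profile = requests.get(f"{BASE}/api/datasets/{name}/profile", headers=HEADERS).json()

# Field names below ("columns", "data_type", "semantic_type") are assumptions.
for col in detail.get("columns", []):
    print(col["name"], col["data_type"], col.get("semantic_type"))
```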
Creating a dataset
Three creation paths in the UI:
From a connector
- + New Dataset → From connector.
- Pick a connector. The browser lists schemas (GET /api/datasets/browse/schemas), then tables in the selected schema (/browse/tables).
- Select a table or write custom SQL.
- Name it, pick a layer, save. The dataset appears immediately on the Flow canvas as a source node.
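The same flow can be scripted. A hedged sketch of browse-then-create with Python's requests; the connector name, query-parameter names, and request-body fields are assumptions, only the endpoint paths come from this page:

```python
import requests

BASE = "https://honeyframe.example.com"            # hypothetical deployment URL
HEADERS = {"Authorization": "Bearer <API_TOKEN>"}

# 1. Browse what the connector exposes. "connector" / "schema" parameter
#    names are assumptions.
schemas = requests.get(
    f"{BASE}/api/datasets/browse/schemas",
    params={"connector": "warehouse_pg"},          # hypothetical connector
    headers=HEADERS,
).json()

tables = requests.get(
    f"{BASE}/api/datasets/browse/tables",
    params={"connector": "warehouse_pg", "schema": "public"},
    headers=HEADERS,
).json()

# 2. Create the dataset. The request-body fields are illustrative assumptions.
resp = requests.post(
    f"{BASE}/api/datasets",
    json={
        "name": "raw_orders",
        "layer": "source",
        "connector": "warehouse_pg",
        "schema": "public",
        "table": "orders",
    },
    headers=HEADERS,
)
resp.raise_for_status()
```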
From a file
- + New Dataset → Upload file.
- Drag a CSV or Excel file. The platform infers column types from a sample of rows (POST /api/datasets/file/preview returns the inferred schema for review).
- Adjust column types if needed, name the dataset, save. The file is persisted into the platform's storage and becomes a queryable dataset.
CSV uploads are subject to the nginx client_max_body_size (default 200 MB). For larger files, write to an object-storage connector and import as a source.
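A scripted equivalent of the two-step upload; the form-field names ("name", "layer") and the file name are assumptions, the endpoints are the ones above:

```python
import requests

BASE = "https://honeyframe.example.com"            # hypothetical deployment URL
HEADERS = {"Authorization": "Bearer <API_TOKEN>"}

# 1. Ask for the inferred schema before committing the upload.
with open("customers.csv", "rb") as f:
    preview = requests.post(
        f"{BASE}/api/datasets/file/preview",
        files={"file": ("customers.csv", f, "text/csv")},
        headers=HEADERS,
    ).json()
print(preview)  # review the inferred column types here

# 2. Upload for real. Extra form fields are assumptions; the UI review
#    step supplies them interactively.
with open("customers.csv", "rb") as f:
    resp = requests.post(
        f"{BASE}/api/datasets/file/upload",
        files={"file": ("customers.csv", f, "text/csv")},
        data={"name": "customers_upload", "layer": "views"},
        headers=HEADERS,
    )
resp.raise_for_status()
```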
Importing a dbt model
dbt models are auto-discovered, but sometimes you want to surface a model explicitly with custom metadata (custom layer, a different display name, ownership). Import dbt model does that — it creates a dataset row that wraps the dbt model and lets you edit its surface metadata without touching the dbt project.
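A sketch of the import call; apart from the endpoint path, the body fields are illustrative assumptions:

```python
import requests

BASE = "https://honeyframe.example.com"            # hypothetical deployment URL
HEADERS = {"Authorization": "Bearer <API_TOKEN>"}

# Wrap an auto-discovered dbt model with custom surface metadata.
resp = requests.post(
    f"{BASE}/api/datasets/import",
    json={
        "model": "fct_orders",               # the dbt model to wrap (hypothetical)
        "display_name": "Orders (curated)",
        "layer": "marts",
        "owner": "analytics-team",
    },
    headers=HEADERS,
)
resp.raise_for_status()

# Undo later without touching the dbt project:
# requests.delete(f"{BASE}/api/datasets/import/fct_orders", headers=HEADERS)
```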
Semantic types and masking
The platform infers a semantic_type per column from the column name and data type. Recognised semantic types include email, phone, name, address, and national_id (the catalog ships with country-specific patterns — for example nik for Indonesia — and is extensible).
For each semantic type, you can set a masking method:
| Method | What it does |
|---|---|
| none | Pass through. |
| hash | SHA-256 the value. Stable for joins, irreversible. |
| redact | Replace with ***. |
| truncate | Keep the first N characters; replace the rest with *. |
Masking is applied at query time in the Python layer before results are JSON-serialized — the underlying SQL still selects the raw column, but operators without data.read_unmasked see only the masked value.
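A minimal Python sketch of the four methods' semantics; this is not the platform's query-layer implementation, just an illustration of the table above:

```python
import hashlib

def mask(value: str, method: str, keep: int = 4) -> str:
    """Illustrative masking, matching the semantics described above."""
    if method == "none":
        return value
    if method == "hash":
        # SHA-256: stable across rows (usable for joins), irreversible.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    if method == "redact":
        return "***"
    if method == "truncate":
        # Keep the first N characters, star out the rest.
        return value[:keep] + "*" * max(len(value) - keep, 0)
    raise ValueError(f"unknown masking method: {method}")

print(mask("jane.doe@example.com", "hash"))      # 64-char hex digest
print(mask("jane.doe@example.com", "truncate"))  # 'jane****************'
```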
The classification tab is a fast way to set masking on every column at once. The platform proposes a default policy based on the semantic type; review and apply.
Sharing datasets
Datasets can be shared across projects within an organization. From the dataset detail page → Share, pick a target project and a role (viewer / editor). The dataset appears in the target project under Datasets → Shared with me. Revoke from the same modal.
Two list endpoints reflect each side:
GET /api/datasets/shares/outgoing— datasets you've shared with other projects.GET /api/datasets/shares/incoming— datasets other projects have shared with you.
Sharing only grants read access. To grant write access on a shared dataset, the consuming project must build a recipe that reads from it and writes to a project-local dataset.
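Listing both sides from a script; the response field names are assumptions:

```python
import requests

BASE = "https://honeyframe.example.com"            # hypothetical deployment URL
HEADERS = {"Authorization": "Bearer <API_TOKEN>"}

outgoing = requests.get(f"{BASE}/api/datasets/shares/outgoing", headers=HEADERS).json()
incoming = requests.get(f"{BASE}/api/datasets/shares/incoming", headers=HEADERS).json()

# "dataset", "target_project", "source_project", "role" are illustrative fields.
for share in outgoing:
    print("shared out:", share.get("dataset"), "->", share.get("target_project"), share.get("role"))
for share in incoming:
    print("shared in: ", share.get("dataset"), "<-", share.get("source_project"), share.get("role"))
```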
Sync
For dataset types that need a refresh (file uploads, custom SQL backed by an external connector), the Sync tab shows status and triggers manual runs.
A sync acquires a row-level lock in honeyframe.job_runs to prevent overlapping runs of the same dataset. Concurrent calls return 409 Conflict with the in-flight run's ID. Sync history is queryable via GET /api/datasets/{name}/sync-status.
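A sketch of a manual sync that tolerates the 409 and then polls sync-status; the dataset name and the "status" field values are assumptions:

```python
import time
import requests

BASE = "https://honeyframe.example.com"            # hypothetical deployment URL
HEADERS = {"Authorization": "Bearer <API_TOKEN>"}

name = "customers_upload"  # hypothetical file-backed dataset

resp = requests.post(f"{BASE}/api/datasets/{name}/sync", headers=HEADERS)
if resp.status_code == 409:
    # Another run already holds the job_runs lock; the body carries its ID.
    print("sync already in flight:", resp.json())
else:
    resp.raise_for_status()
    while True:
        status = requests.get(
            f"{BASE}/api/datasets/{name}/sync-status", headers=HEADERS
        ).json()
        if status.get("status") not in ("running", "pending"):
            print(status)
            break
        time.sleep(5)
```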
For dbt-built datasets, "sync" means re-running the model — schedule that through the Flow page or the project's Schedules tab, not from the dataset detail page.
Maintenance
Two maintenance actions per dataset:
- Vacuum (POST /api/datasets/{name}/vacuum) — drops old soft-deleted rows and reclaims space. Safe to run anytime.
- Optimize (POST /api/datasets/{name}/optimize) — rebuilds indexes. Holds an exclusive lock on the underlying table for the duration; schedule outside business hours for large datasets.
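Both actions are plain POSTs; a short sketch (base URL, token, and dataset name are assumptions):

```python
import requests

BASE = "https://honeyframe.example.com"            # hypothetical deployment URL
HEADERS = {"Authorization": "Bearer <API_TOKEN>"}

name = "fct_orders"  # hypothetical dataset

# Vacuum is safe to run anytime.
requests.post(f"{BASE}/api/datasets/{name}/vacuum", headers=HEADERS).raise_for_status()

# Optimize holds an exclusive lock on the table; run it off-hours for large datasets.
requests.post(f"{BASE}/api/datasets/{name}/optimize", headers=HEADERS).raise_for_status()
```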
API reference
| Endpoint | Description |
|---|---|
| GET /api/datasets | List datasets (filter by layer / search). |
| GET /api/datasets/available | Union of dbt models + sources + imported datasets the caller can read. |
| GET /api/datasets/{name} | Full dataset detail (schema, row count, freshness, lineage refs). |
| GET /api/datasets/{name}/explore | Schema + a sample rows window with masking applied. |
| GET /api/datasets/{name}/preview | Paginated rows with stats. |
| GET /api/datasets/{name}/profile | Column-level statistics. |
| GET /api/datasets/{name}/classification | PII classification per column. |
| GET /api/datasets/{name}/sync-status | Last sync timestamp + status. |
| POST /api/datasets/{name}/sync | Trigger a manual sync. |
| POST /api/datasets | Create from a connector or custom SQL. |
| POST /api/datasets/file/preview | Pre-upload schema inference. |
| POST /api/datasets/file/upload | CSV / Excel upload. |
| POST /api/datasets/import | Wrap a dbt model as a managed dataset. |
| DELETE /api/datasets/import/{name} | Remove an imported dataset (does not touch the dbt model). |
| DELETE /api/datasets/{name} | Remove. Recipes that depend on the dataset will fail until repointed. |
| POST /api/datasets/{name}/vacuum | Drop soft-deleted rows. |
| POST /api/datasets/{name}/optimize | Rebuild indexes. |
| GET /api/datasets/browse/schemas | List schemas reachable through the project's connectors. |
| GET /api/datasets/browse/tables | List tables in a given schema. |
| GET /api/datasets/shares/outgoing | Datasets you've shared. |
| GET /api/datasets/shares/incoming | Datasets shared with you. |
Gotchas
- Oracle source discovery uses information_schema.columns and requires uppercase table/column names. If a discovery returns no columns, double-check case.
- File upload column inference is sampled, not exhaustive — heterogeneous columns (numbers mixed with text) may infer to text even if 99% of rows are numeric. Set the type explicitly in the upload review step before saving.
- Semantic type inference is lightweight and pattern-based. It catches obvious cases (user_email, phone_number) but misses non-standard column names. Run Classification explicitly for any dataset that holds PII.
- Dataset deletion is hard delete; there is no trash. Dependent recipes break until the dataset is recreated or the recipe is repointed.