Version: v0.0.29

Datasets

A dataset is a queryable, governed view of data — a table, an external view, a CSV upload, or a dbt-built model. Datasets are the primary unit Honeyframe operates on: recipes consume them, dashboards visualize them, agents query them, and access control gates on them.

Two things become datasets automatically:

  • dbt models discovered from the project's manifest — every staging, intermediate, and mart model lands in the dataset catalog with its layer tag.
  • Sources declared in sources.yml — external tables behind a Connector appear with the source's schema and freshness metadata.

Anything else (a one-off CSV upload, a custom SQL view against a connector, an imported external schema) becomes a dataset through the + New Dataset flow.

Layers

Each dataset carries a layer tag aligned with the dbt convention:

| Layer | Meaning |
| --- | --- |
| source | Raw external table reached through a connector. Read-only. |
| staging | Lightly cleaned per-source projection (stg_*). |
| intermediate | Reusable transformation between staging and marts (int_*). |
| marts | Business-modeled output. The default surface for dashboards and agents. |
| views | Ad-hoc views (custom SQL datasets). Not part of the dbt build. |

The layer drives where the dataset lands on the Flow canvas and which schemas it materializes into. It does not by itself control access — that's the permission model.

Browsing datasets

The dataset list page (/datasets) serves four lenses:

  • All — every dataset in the project (default).
  • By layer — filter to one of the layers above.
  • By search — substring match on name and description.
  • Available — the union of dbt models, sources, and user-imported datasets the caller can read.

For each dataset, the row shows: name, layer, row count, freshness status (fresh / stale / error), last refresh, and ownership flag. Click through for the dataset detail page.
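
For scripted browsing, a minimal sketch of the list endpoint is below. The base URL, the bearer-token header, the query-parameter names (layer, search), and the response fields are all assumptions for illustration; only the endpoint itself and the layer / search filters come from the API reference further down.

```python
import requests

BASE = "https://honeyframe.example.com"        # assumption: your deployment URL
HEADERS = {"Authorization": "Bearer <token>"}  # assumption: token-based auth

# Parameter names "layer" and "search" are assumed; the reference only says
# the list endpoint filters by layer / search.
resp = requests.get(
    f"{BASE}/api/datasets",
    params={"layer": "marts", "search": "orders"},
    headers=HEADERS,
)
resp.raise_for_status()

# Field names here are assumed from the columns shown in the list UI.
for ds in resp.json():
    print(ds["name"], ds.get("layer"), ds.get("freshness"))
```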

Detail page

The detail page has six tabs:

| Tab | What it shows |
| --- | --- |
| Overview | Description, ownership, lineage edges (upstream + downstream count). |
| Schema | Column list with name, data type, nullable, semantic type, masking. |
| Preview | Paginated rows with PII masking applied. |
| Profile | Per-column stats — distinct count, null %, min/max, top values. |
| Classification | PII classification per column with a one-click masking suggestion. |
| Sync | Last sync status, schedule, and a manual Sync now button. |

The Schema → Edit button opens an inline editor for column descriptions, semantic types, and masking config. Edits save to the datasets row, not to dbt — the next dbt build does not overwrite the metadata.

Creating a dataset

Three creation paths in the UI:

From a connector

  1. + New Dataset → From connector.
  2. Pick a connector. The browser lists schemas (GET /api/datasets/browse/schemas), then tables in the selected schema (/browse/tables).
  3. Select a table or write custom SQL.
  4. Name it, pick a layer, save. The dataset appears immediately on the Flow canvas as a source node.
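
The same flow can be scripted against the browse and create endpoints. The sketch below is a rough illustration only: the base URL, auth header, query-parameter names, and every field in the create payload are assumptions, since the API reference documents the endpoints but not their request shapes.

```python
import requests

BASE = "https://honeyframe.example.com"        # assumption: your deployment URL
HEADERS = {"Authorization": "Bearer <token>"}  # assumption: token-based auth

# 1. Browse what the connector exposes. The "connector" / "schema" parameter
#    names are assumptions; only the endpoints themselves are documented.
schemas = requests.get(f"{BASE}/api/datasets/browse/schemas",
                       params={"connector": "warehouse"}, headers=HEADERS).json()
tables = requests.get(f"{BASE}/api/datasets/browse/tables",
                      params={"connector": "warehouse", "schema": "raw"},
                      headers=HEADERS).json()
print(schemas, tables)

# 2. Create the dataset. Every field name in this payload is illustrative;
#    the reference only says "create from a connector or custom SQL".
payload = {
    "name": "raw_orders",
    "layer": "source",
    "connector": "warehouse",
    "schema": "raw",
    "table": "orders",
}
requests.post(f"{BASE}/api/datasets", json=payload, headers=HEADERS).raise_for_status()
```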

From a file

  1. + New Dataset → Upload file.
  2. Drag a CSV or Excel file. The platform infers column types from a sample of rows (POST /api/datasets/file/preview returns the inferred schema for review).
  3. Adjust column types if needed, name the dataset, save. The file is persisted into the platform's storage and becomes a queryable dataset.

CSV uploads are subject to the nginx client_max_body_size (default 200 MB). For larger files, write to an object-storage connector and import as a source.
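
A scripted version of the upload flow might look like the following. The multipart field name, the form data, and the auth details are assumptions; only the two endpoints are documented.

```python
import requests

BASE = "https://honeyframe.example.com"        # assumption: your deployment URL
HEADERS = {"Authorization": "Bearer <token>"}  # assumption: token-based auth

# 1. Schema inference before committing. The multipart field name "file"
#    is an assumption.
with open("customers.csv", "rb") as f:
    preview = requests.post(f"{BASE}/api/datasets/file/preview",
                            files={"file": f}, headers=HEADERS).json()
print(preview)  # review / adjust the inferred column types here

# 2. The actual upload. The "name" form field is likewise assumed.
with open("customers.csv", "rb") as f:
    resp = requests.post(f"{BASE}/api/datasets/file/upload",
                         files={"file": f}, data={"name": "customers"},
                         headers=HEADERS)
resp.raise_for_status()
```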

Importing a dbt model

dbt models are auto-discovered, but sometimes you want to surface a model explicitly with custom metadata (custom layer, a different display name, ownership). Import dbt model does that — it creates a dataset row that wraps the dbt model and lets you edit its surface metadata without touching the dbt project.
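
A hedged sketch of that import via the API, with every payload field name assumed (the reference documents the endpoint only as wrapping a dbt model):

```python
import requests

BASE = "https://honeyframe.example.com"        # assumption: your deployment URL
HEADERS = {"Authorization": "Bearer <token>"}  # assumption: token-based auth

# All payload field names are illustrative.
payload = {
    "model": "fct_orders",            # dbt model to wrap (hypothetical name)
    "display_name": "Orders (fact)",
    "layer": "marts",
}
requests.post(f"{BASE}/api/datasets/import", json=payload,
              headers=HEADERS).raise_for_status()

# Undo later without touching the dbt project:
# requests.delete(f"{BASE}/api/datasets/import/fct_orders", headers=HEADERS)
```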

Semantic types and masking

The platform infers a semantic_type per column from the column name and data type. Recognised semantic types include email, phone, name, address, and national_id (the catalog ships with country-specific patterns — for example nik for Indonesia — and is extensible).

For each semantic type, you can set a masking method:

| Method | What it does |
| --- | --- |
| none | Pass through. |
| hash | SHA-256 the value. Stable for joins, irreversible. |
| redact | Replace with ***. |
| truncate | Keep the first N characters; replace the rest with *. |

Masking is applied at query time in the Python layer before results are JSON-serialized — the underlying SQL still selects the raw column, but operators without data.read_unmasked see only the masked value.
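
To make the four methods concrete, here is a standalone Python illustration of their semantics. This is not the platform's implementation; the keep-N value for truncate is shown as 4 purely for the example.

```python
import hashlib

# Illustrative-only semantics of the four masking methods.
def mask(value: str, method: str, keep: int = 4) -> str:
    if method == "none":
        return value
    if method == "hash":
        # Stable across rows and datasets, so hashed values still join.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    if method == "redact":
        return "***"
    if method == "truncate":
        return value[:keep] + "*" * max(len(value) - keep, 0)
    raise ValueError(f"unknown masking method: {method}")

print(mask("jane@example.com", "hash"))      # 64-char hex digest
print(mask("081234567890", "truncate", 4))   # 0812********
```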

The classification tab is a fast way to set masking on every column at once. The platform proposes a default policy based on the semantic type; review and apply.

Sharing datasets

Datasets can be shared across projects within an organization. From the dataset detail page → Share, pick a target project and a role (viewer / editor). The dataset appears in the target project under Datasets → Shared with me. Revoke from the same modal.

Two list endpoints reflect each side:

  • GET /api/datasets/shares/outgoing — datasets you've shared with other projects.
  • GET /api/datasets/shares/incoming — datasets other projects have shared with you.

Sharing only grants read access. To grant write access on a shared dataset, the consuming project must build a recipe that reads from it and writes to a project-local dataset.
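
Both share lists are plain GETs; a minimal sketch follows, with the response shape assumed for illustration.

```python
import requests

BASE = "https://honeyframe.example.com"        # assumption: your deployment URL
HEADERS = {"Authorization": "Bearer <token>"}  # assumption: token-based auth

outgoing = requests.get(f"{BASE}/api/datasets/shares/outgoing", headers=HEADERS).json()
incoming = requests.get(f"{BASE}/api/datasets/shares/incoming", headers=HEADERS).json()

# Field names below are assumed for illustration.
for share in outgoing:
    print(share.get("dataset"), "->", share.get("project"), share.get("role"))
print(f"{len(incoming)} datasets shared with this project")
```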

Sync

For dataset types that need a refresh (file uploads, custom SQL backed by an external connector), the Sync tab shows status and triggers manual runs.

A sync acquires a row-level lock in honeyframe.job_runs to prevent overlapping runs of the same dataset. Concurrent calls return 409 Conflict with the in-flight run's ID. Sync history is queryable via GET /api/datasets/{name}/sync-status.
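
A minimal sketch of triggering a sync and handling the overlap case, using a hypothetical dataset named orders and assumed base URL and auth details:

```python
import requests

BASE = "https://honeyframe.example.com"        # assumption: your deployment URL
HEADERS = {"Authorization": "Bearer <token>"}  # assumption: token-based auth

resp = requests.post(f"{BASE}/api/datasets/orders/sync", headers=HEADERS)
if resp.status_code == 409:
    # Another run holds the lock; per the docs, the body carries its run ID.
    print("sync already in flight:", resp.json())
else:
    resp.raise_for_status()

# Check the outcome (or poll until it settles).
print(requests.get(f"{BASE}/api/datasets/orders/sync-status", headers=HEADERS).json())
```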

For dbt-built datasets, "sync" means re-running the model — schedule that through the Flow page or the project's Schedules tab, not from the dataset detail page.

Maintenance

Two maintenance actions per dataset:

  • Vacuum (POST /api/datasets/{name}/vacuum) — drops old soft-deleted rows and reclaims space. Safe to run anytime.
  • Optimize (POST /api/datasets/{name}/optimize) — rebuilds indexes. Holds an exclusive lock on the underlying table for the duration; schedule outside business hours for large datasets.
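
Both actions are bare POSTs; a short sketch with an assumed base URL, auth header, and a hypothetical dataset named orders:

```python
import requests

BASE = "https://honeyframe.example.com"        # assumption: your deployment URL
HEADERS = {"Authorization": "Bearer <token>"}  # assumption: token-based auth

# Vacuum is safe to run anytime.
requests.post(f"{BASE}/api/datasets/orders/vacuum", headers=HEADERS).raise_for_status()

# Optimize takes an exclusive lock on the table; run it off-hours for large datasets.
requests.post(f"{BASE}/api/datasets/orders/optimize", headers=HEADERS).raise_for_status()
```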

API reference

| Endpoint | Description |
| --- | --- |
| GET /api/datasets | List datasets (filter by layer / search). |
| GET /api/datasets/available | Union of dbt models + sources + imported datasets the caller can read. |
| GET /api/datasets/{name} | Full dataset detail (schema, row count, freshness, lineage refs). |
| GET /api/datasets/{name}/explore | Schema + a sample rows window with masking applied. |
| GET /api/datasets/{name}/preview | Paginated rows with stats. |
| GET /api/datasets/{name}/profile | Column-level statistics. |
| GET /api/datasets/{name}/classification | PII classification per column. |
| GET /api/datasets/{name}/sync-status | Last sync timestamp + status. |
| POST /api/datasets/{name}/sync | Trigger a manual sync. |
| POST /api/datasets | Create from a connector or custom SQL. |
| POST /api/datasets/file/preview | Pre-upload schema inference. |
| POST /api/datasets/file/upload | CSV / Excel upload. |
| POST /api/datasets/import | Wrap a dbt model as a managed dataset. |
| DELETE /api/datasets/import/{name} | Remove an imported dataset (does not touch the dbt model). |
| DELETE /api/datasets/{name} | Remove. Recipes that depend on the dataset will fail until repointed. |
| POST /api/datasets/{name}/vacuum | Drop soft-deleted rows. |
| POST /api/datasets/{name}/optimize | Rebuild indexes. |
| GET /api/datasets/browse/schemas | List schemas reachable through the project's connectors. |
| GET /api/datasets/browse/tables | List tables in a given schema. |
| GET /api/datasets/shares/outgoing | Datasets you've shared. |
| GET /api/datasets/shares/incoming | Datasets shared with you. |

Gotchas

  • Oracle source discovery uses information_schema.columns and requires uppercase table/column names. If a discovery returns no columns, double-check case.
  • File upload column inference is sampled, not exhaustive — heterogeneous columns (numbers mixed with text) may infer to text even if 99% of rows are numeric. Set the type explicitly in the upload review step before saving.
  • Semantic type inference is lightweight and pattern-based. It catches obvious cases (user_email, phone_number) but misses non-standard column names. Run Classification explicitly for any dataset that holds PII.
  • Dataset deletion is a hard delete; there is no trash. Dependent recipes break until the dataset is recreated or the recipe is repointed.