
Datasets API

The /api/datasets/* surface is where most analytics integrations land. It covers connector-backed datasets (rows queried on demand from PostgreSQL, BigQuery, etc.), Lakehouse-backed datasets (Parquet under $DATA_DIR), and uploaded files (CSV, Parquet, Excel).

For the conceptual model, see Connectors. The endpoints below cover the API surface; pair them with the OpenAPI spec at /openapi.json for full request/response schemas.

Listing and discovery

GET /api/datasets 🔒 auth

List datasets visible to the caller in the current org. Supports limit, offset, q (free-text), and sort query parameters.
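A minimal sketch of assembling the listing URL from the query parameters above (the helper name and base URL are illustrative, not part of the API):

```python
from urllib.parse import urlencode

def list_datasets_url(base, q=None, sort=None, limit=50, offset=0):
    """Build a /api/datasets listing URL with the documented query parameters."""
    params = {"limit": limit, "offset": offset}
    if q:
        params["q"] = q      # free-text search term
    if sort:
        params["sort"] = sort
    return f"{base}/api/datasets?{urlencode(params)}"
```
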

GET /api/datasets/available 🔒 auth

List datasets the caller can read for use as a recipe input or chart source. Filtered by per-dataset share grants.

GET /api/datasets/browse/schemas 🔒 auth

Browse the schemas (databases, namespaces) on a configured connector. Used by the dataset-creation wizard.

GET /api/datasets/browse/tables 🔒 auth

Browse the tables in a schema on a configured connector.

GET /api/datasets/{name} 🔒 auth

Full dataset metadata including schema, settings (masking rules, refresh policy), and provenance.

Creating datasets

POST /api/datasets 🔒 auth

Register a new dataset against a connector. Body includes name, connector_id, source_table (or source_query), and optional settings.
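A sketch of a request body using the fields listed above. The values are illustrative, and the `refresh_policy` settings key is an assumption — check /openapi.json for the real settings schema:

```json
{
  "name": "orders",
  "connector_id": "pg-main",
  "source_table": "public.orders",
  "settings": {
    "refresh_policy": "daily"
  }
}
```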

POST /api/datasets/import 🔒 auth

Import a single source table from a connector as a dataset, optionally with a column-rename map.

POST /api/datasets/import-batch 🔒 auth

Bulk import — point at a schema and import every table that matches a pattern.

DELETE /api/datasets/import/{name} 🔒 auth

Cancel an in-progress batch import.

File upload

POST /api/datasets/file/preview 🔒 auth

Upload a CSV / Parquet / Excel file as multipart/form-data and get a column-detection preview without committing it as a dataset. Useful for the upload UI's "does the platform parse this correctly?" step.

POST /api/datasets/file/upload 🔒 auth

Upload a file and commit it as a managed dataset. Subject to client_max_body_size (default 200 MB at the reverse proxy).
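A multipart/form-data payload can be assembled by hand when no HTTP client library is available. A sketch — the form field name `file` is an assumption, and the 200 MB guard mirrors the default proxy limit mentioned above:

```python
import uuid

MAX_BYTES = 200 * 1024 * 1024  # default client_max_body_size at the proxy

def multipart_body(filename, content: bytes, field="file"):
    """Assemble a multipart/form-data payload; returns (body, content_type).

    The field name "file" is a guess; see /openapi.json for the real one.
    """
    if len(content) > MAX_BYTES:
        raise ValueError("file exceeds the default 200 MB upload limit")
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + content + tail, f"multipart/form-data; boundary={boundary}"
```
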

Querying data

GET /api/datasets/{name}/preview 🔒 auth

Return the first N rows (default 100) of the dataset, with masking rules applied.

GET /api/datasets/{name}/explore 🔒 auth

Run a parameterized query against the dataset — projection, filter, group, order, limit. Returns rows + column metadata. The chart and table blocks on dashboards use this endpoint internally.
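Since explore is a plain GET, the query is expressed entirely in URL parameters. A hypothetical sketch — the parameter names (`select`, `filter`, `group`, `order`) are assumptions; the authoritative names are in /openapi.json:

```python
from urllib.parse import urlencode

def explore_url(base, dataset, select, filters=None, group_by=None,
                order_by=None, limit=100):
    """Build a hypothetical /explore URL; parameter names are illustrative."""
    params = {"select": ",".join(select), "limit": limit}
    if filters:
        params["filter"] = filters       # e.g. "country = 'DE'"
    if group_by:
        params["group"] = ",".join(group_by)
    if order_by:
        params["order"] = order_by
    return f"{base}/api/datasets/{dataset}/explore?{urlencode(params)}"
```
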

GET /api/datasets/{name}/profile 🔒 auth

Column-level statistics: cardinality, null rate, min/max, top values for categorical columns. Useful for auto-classification and data-quality dashboards.

GET /api/datasets/{name}/column-stats 🔒 auth

Same as /profile but returns rolling stats over a time window.

History and maintenance

GET /api/datasets/{name}/history 🔒 auth

Versioned history for Lakehouse-backed datasets — list of past commits with timestamp, row count, and the upstream recipe run that produced each version.

POST /api/datasets/{name}/vacuum 🔒 auth

Run a Delta Lake VACUUM on a Lakehouse-backed dataset to drop old file versions.

POST /api/datasets/{name}/optimize 🔒 auth

Run a Delta Lake OPTIMIZE (compact small files) on a Lakehouse-backed dataset.

Governance

GET /api/datasets/governance/summary 🔒 auth

Org-wide governance summary — counts by sensitivity level, masking coverage, share counts.

GET /api/datasets/{name}/classification 🔒 auth

Per-column classification (semantic types and sensitivity levels) for a dataset. Used by the data-policies UI.

Auth and access

All endpoints require a valid JWT. The legacy authorization layer gates writes on require_role("editor", "admin"); reads are open to all org members. The new layer (planned) will use dataset.read / dataset.readwrite per dataset — see Permissions Reference.

When a dataset has masking rules, the endpoints respect them: previews and explore queries return masked values; raw values are visible only to users in the column's unmask_roles. See Data Policies.

Pagination convention

List endpoints accept limit and offset. The maximum limit is 200; larger values are clamped. The response body includes total so clients can compute page count without an extra count query.

{
  "total": 1247,
  "items": [...]
}
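Under this convention a client can walk an entire listing with a simple offset loop. A sketch with the HTTP call injected as a callable so the paging logic stands alone (the clamp mirrors the documented maximum limit of 200):

```python
def iter_pages(fetch, limit=200):
    """Yield every item across pages.

    `fetch(limit, offset)` performs the HTTP call and returns a dict shaped
    like the response above: {"total": N, "items": [...]}.
    """
    limit = min(limit, 200)   # the server clamps larger values anyway
    offset = 0
    while True:
        page = fetch(limit, offset)
        yield from page["items"]
        offset += len(page["items"])
        if not page["items"] or offset >= page["total"]:
            break
```

Injecting `fetch` keeps the loop independent of any particular HTTP library and makes it trivial to test against a stub.
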

Errors

Errors follow the platform standard — see API overview. Dataset-specific errors:

Status  Error                  Cause
404     dataset_not_found      Dataset name doesn't exist in the current org.
409     dataset_name_taken     A dataset with that name already exists.
422     invalid_query          The query parameters didn't validate (bad column reference, malformed filter).
503     connector_unavailable  The underlying connector failed its health check; the dataset is temporarily unreadable.
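A client might surface these as typed exceptions. A sketch assuming the standard error body carries the error code in an `error` field (an assumption — see the API overview for the exact shape):

```python
class DatasetError(Exception):
    """Base class for dataset API errors."""

class DatasetNotFound(DatasetError): ...
class DatasetNameTaken(DatasetError): ...
class InvalidQuery(DatasetError): ...
class ConnectorUnavailable(DatasetError): ...

_BY_CODE = {
    "dataset_not_found": DatasetNotFound,
    "dataset_name_taken": DatasetNameTaken,
    "invalid_query": InvalidQuery,
    "connector_unavailable": ConnectorUnavailable,
}

def raise_for_error(status, body):
    """Raise the matching typed exception for a non-2xx dataset response."""
    if status < 400:
        return
    exc = _BY_CODE.get(body.get("error"), DatasetError)
    raise exc(body.get("error", f"http {status}"))
```
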