# Datasets API

The `/api/datasets/*` surface is where most analytics integrations land. It covers connector-backed datasets (rows queried on demand from PostgreSQL, BigQuery, etc.), Lakehouse-backed datasets (Parquet under `$DATA_DIR`), and uploaded files (CSV, Parquet, Excel).

For the conceptual model, see Connectors. The endpoints below cover the API surface; pair them with the OpenAPI spec at `/openapi.json` for full request/response schemas.
## Listing and discovery

| Endpoint | Description |
|---|---|
| `/api/datasets` 🔒 | List datasets visible to the caller in the current org. Supports `limit`, `offset`, `q` (free-text), and `sort` query parameters. |
| `/api/datasets/available` 🔒 | List datasets the caller can read for use as a recipe input or chart source. Filtered by per-dataset share grants. |
| `/api/datasets/browse/schemas` 🔒 | Browse the schemas (databases, namespaces) on a configured connector. Used by the dataset-creation wizard. |
| `/api/datasets/browse/tables` 🔒 | Browse the tables in a schema on a configured connector. |
| `/api/datasets/{name}` 🔒 | Full dataset metadata including schema, settings (masking rules, refresh policy), and provenance. |
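A minimal listing call, sketched with Python `requests`. Only the query parameters come from this page; the base URL, token handling, the `-created_at` sort value, and the `name` field on each returned item are illustrative assumptions.

```python
import requests

BASE_URL = "https://analytics.example.com"   # assumption: your deployment's host
HEADERS = {"Authorization": "Bearer <jwt>"}  # all dataset endpoints require a JWT

# First 50 datasets matching "sales"; the sort value is illustrative.
resp = requests.get(
    f"{BASE_URL}/api/datasets",
    headers=HEADERS,
    params={"limit": 50, "offset": 0, "q": "sales", "sort": "-created_at"},
)
resp.raise_for_status()
for item in resp.json()["items"]:
    print(item["name"])  # assumption: each item carries its dataset name
```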
## Creating datasets

| Endpoint | Description |
|---|---|
| `/api/datasets` 🔒 | Register a new dataset against a connector. Body includes `name`, `connector_id`, `source_table` (or `source_query`), and optional `settings`. |
| `/api/datasets/import` 🔒 | Import a single source table from a connector as a dataset, optionally with a column-rename map. |
| `/api/datasets/import-batch` 🔒 | Bulk import: point at a schema and import every table that matches a pattern. |
| `/api/datasets/import/{name}` 🔒 | Cancel an in-progress batch import. |
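Registering a connector-backed dataset might look like the following sketch. The body field names come from the table above; the connector id, table name, and the shape of `settings` are assumptions, and the endpoint is assumed to accept POST.

```python
import requests

BASE_URL = "https://analytics.example.com"   # assumption
HEADERS = {"Authorization": "Bearer <jwt>"}

body = {
    "name": "orders",
    "connector_id": "pg-main",                # assumption: id of an existing connector
    "source_table": "public.orders",
    "settings": {"refresh_policy": "daily"},  # illustrative only; see /openapi.json for the real shape
}
resp = requests.post(f"{BASE_URL}/api/datasets", headers=HEADERS, json=body)
resp.raise_for_status()
print(resp.json())
```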
## File upload

| Endpoint | Description |
|---|---|
| `/api/datasets/file/preview` 🔒 | Upload a CSV / Parquet / Excel file as `multipart/form-data` and get a column-detection preview without committing it as a dataset. Useful for the upload UI's "does the platform parse this correctly?" step. |
| `/api/datasets/file/upload` 🔒 | Upload a file and commit it as a managed dataset. Subject to `client_max_body_size` (default 200 MB at the reverse proxy). |
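The preview-then-upload flow as a sketch. The multipart field name `file` is an assumption (check `/openapi.json`), and both endpoints are assumed to take POST.

```python
import requests

BASE_URL = "https://analytics.example.com"   # assumption
HEADERS = {"Authorization": "Bearer <jwt>"}

# Preview first: column detection only, nothing is committed as a dataset.
with open("customers.csv", "rb") as f:
    preview = requests.post(
        f"{BASE_URL}/api/datasets/file/preview",
        headers=HEADERS,
        files={"file": ("customers.csv", f, "text/csv")},  # assumption: field is named "file"
    )
preview.raise_for_status()
print(preview.json())

# If the detected columns look right, commit the same file as a managed dataset.
# Files above client_max_body_size (default 200 MB) are rejected at the proxy.
with open("customers.csv", "rb") as f:
    upload = requests.post(
        f"{BASE_URL}/api/datasets/file/upload",
        headers=HEADERS,
        files={"file": ("customers.csv", f, "text/csv")},
    )
upload.raise_for_status()
```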
## Querying data

| Endpoint | Description |
|---|---|
| `/api/datasets/{name}/preview` 🔒 | Return the first N rows (default 100) of the dataset, with masking rules applied. |
| `/api/datasets/{name}/explore` 🔒 | Run a parameterized query against the dataset: projection, filter, group, order, limit. Returns rows plus column metadata. The chart and table blocks on dashboards use this endpoint internally. |
| `/api/datasets/{name}/profile` 🔒 | Column-level statistics: cardinality, null rate, min/max, and top values for categorical columns. Useful for auto-classification and data-quality dashboards. |
| `/api/datasets/{name}/column-stats` 🔒 | Same as `/profile`, but returns rolling stats over a time window. |
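An explore query, sketched against a hypothetical `orders` dataset. The parameter names (projection, filter, group, order, limit) are the ones listed above; the JSON shape of each is an assumption, so treat `/openapi.json` as the authoritative schema. The endpoint is assumed to take POST.

```python
import requests

BASE_URL = "https://analytics.example.com"   # assumption
HEADERS = {"Authorization": "Bearer <jwt>"}

# Sum revenue by region for 2024, top 100 rows. Field shapes are illustrative.
query = {
    "projection": ["region", {"column": "revenue", "agg": "sum"}],
    "filter": {"column": "order_date", "op": ">=", "value": "2024-01-01"},
    "group": ["region"],
    "order": [{"column": "revenue", "dir": "desc"}],
    "limit": 100,
}
resp = requests.post(f"{BASE_URL}/api/datasets/orders/explore", headers=HEADERS, json=query)
resp.raise_for_status()
result = resp.json()  # rows plus column metadata, masking rules applied
```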
## History and maintenance

| Endpoint | Description |
|---|---|
| `/api/datasets/{name}/history` 🔒 | Versioned history for Lakehouse-backed datasets: a list of past commits with timestamp, row count, and the upstream recipe run that produced each version. |
| `/api/datasets/{name}/vacuum` 🔒 | Run a Delta Lake VACUUM on a Lakehouse-backed dataset to drop old file versions. |
| `/api/datasets/{name}/optimize` 🔒 | Run a Delta Lake OPTIMIZE (compact small files) on a Lakehouse-backed dataset. |
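A periodic maintenance pass over a few Lakehouse-backed datasets might look like this sketch. It assumes both endpoints are invoked with an empty POST; whether vacuum accepts a retention window in the body is not stated on this page.

```python
import requests

BASE_URL = "https://analytics.example.com"   # assumption
HEADERS = {"Authorization": "Bearer <jwt>"}

# Compact small files first, then drop superseded file versions.
for name in ["orders", "customers"]:  # hypothetical dataset names
    requests.post(f"{BASE_URL}/api/datasets/{name}/optimize", headers=HEADERS).raise_for_status()
    requests.post(f"{BASE_URL}/api/datasets/{name}/vacuum", headers=HEADERS).raise_for_status()
```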
## Governance

| Endpoint | Description |
|---|---|
| `/api/datasets/governance/summary` 🔒 | Org-wide governance summary: counts by sensitivity level, masking coverage, share counts. |
| `/api/datasets/{name}/classification` 🔒 | Per-column classification (semantic types and sensitivity levels) for a dataset. Used by the data-policies UI. |
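A quick governance check under the same assumptions as the sketches above; the response field names are not documented on this page, so the payloads are simply printed.

```python
import requests

BASE_URL = "https://analytics.example.com"   # assumption
HEADERS = {"Authorization": "Bearer <jwt>"}

summary = requests.get(f"{BASE_URL}/api/datasets/governance/summary", headers=HEADERS)
summary.raise_for_status()
print(summary.json())  # counts by sensitivity level, masking coverage, share counts

classification = requests.get(f"{BASE_URL}/api/datasets/orders/classification", headers=HEADERS)
classification.raise_for_status()
print(classification.json())  # per-column semantic types and sensitivity levels
```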
## Auth and access

All endpoints require a valid JWT. The legacy authorization layer gates writes on `require_role("editor", "admin")`; reads are open to all org members. The new layer (planned) will use `dataset.read` / `dataset.readwrite` per dataset; see Permissions Reference.

When a dataset has masking rules, the endpoints respect them: previews and explore queries return masked values, and raw values are visible only to users in the column's `unmask_roles`. See Data Policies.
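In practice, a `requests.Session` with the JWT set once keeps client code terse; obtaining the token itself is deployment-specific and outside the scope of this page.

```python
import requests

session = requests.Session()
session.headers["Authorization"] = "Bearer <jwt>"  # a valid JWT for the current org

# Every /api/datasets/* call through this session now carries the header.
resp = session.get("https://analytics.example.com/api/datasets/available")  # assumption: host
resp.raise_for_status()
```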
## Pagination convention

List endpoints accept `limit` and `offset`. The maximum `limit` is 200; larger values are clamped. The response body includes `total` so clients can compute the page count without an extra count query.

```json
{
  "total": 1247,
  "items": [...]
}
```
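A paging loop built on that convention; `limit`, `offset`, `total`, and `items` are documented above, everything else is an assumption.

```python
import requests

BASE_URL = "https://analytics.example.com"   # assumption
HEADERS = {"Authorization": "Bearer <jwt>"}

# Walk /api/datasets in pages of 200 (the documented maximum limit).
items, offset, limit = [], 0, 200
while True:
    resp = requests.get(
        f"{BASE_URL}/api/datasets",
        headers=HEADERS,
        params={"limit": limit, "offset": offset},
    )
    resp.raise_for_status()
    page = resp.json()
    items.extend(page["items"])
    offset += limit
    if offset >= page["total"]:
        break
```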
## Errors

Errors follow the platform standard; see the API overview. Dataset-specific errors:

| Status | Error | Cause |
|---|---|---|
| 404 | `dataset_not_found` | The dataset name doesn't exist in the current org. |
| 409 | `dataset_name_taken` | A dataset with that name already exists. |
| 422 | `invalid_query` | The query parameters didn't validate (bad column reference, malformed filter). |
| 503 | `connector_unavailable` | The underlying connector failed its health check; the dataset is temporarily unreadable. |
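One way to handle the dataset-specific errors, assuming the error body carries the code in an `error` field; the exact error envelope lives in the API overview, not on this page.

```python
import time
import requests

BASE_URL = "https://analytics.example.com"   # assumption
HEADERS = {"Authorization": "Bearer <jwt>"}

resp = requests.get(f"{BASE_URL}/api/datasets/orders/preview", headers=HEADERS)
if resp.status_code == 503 and resp.json().get("error") == "connector_unavailable":
    # The connector failed its health check; back off briefly and retry once.
    time.sleep(5)
    resp = requests.get(f"{BASE_URL}/api/datasets/orders/preview", headers=HEADERS)
resp.raise_for_status()
rows = resp.json()
```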