Ingesting Data

Write data to your dataset tables using the /dataset-tables/{tableId}/rows endpoint. Before ingesting data, your table must exist with a defined schema—see the Schema Management guide to create tables.

Catalyzed accepts data in two formats:

| Format | Content-Type | Use Case |
| --- | --- | --- |
| JSON | application/json | Simple integration, readable, array of row objects. Buffered in memory. |
| Arrow IPC | application/vnd.apache.arrow.stream | High performance, typed columnar data. Streamed end-to-end: no size limit, no full-payload buffering. |

Choose a write mode based on how you want to modify the table:

| Mode | Description | Primary Key Required |
| --- | --- | --- |
| append | Insert new rows without checking for duplicates (fastest) | No |
| upsert | Update existing rows by primary key, insert new rows | Yes |
| overwrite | Replace all existing data in the table | No |
| delete | Remove rows matching the provided primary keys | Yes |

Specify the mode using the mode query parameter: ?mode=append, ?mode=upsert, etc.
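Several TypeScript snippets later in this guide call an ingestRows helper. That helper is not part of any SDK; it is a minimal sketch of a wrapper over this endpoint, and the function name, signature, and error handling are illustrative:

async function ingestRows(
  tableId: string,
  rows: unknown[],
  mode: "append" | "upsert" | "overwrite" | "delete" = "append",
) {
  // POST a JSON array of rows to the ingestion endpoint with the chosen mode.
  const response = await fetch(
    `https://api.catalyzed.ai/dataset-tables/${tableId}/rows?mode=${mode}`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.API_TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(rows),
    },
  );
  if (!response.ok) {
    throw new Error(`Ingestion failed with status ${response.status}`);
  }
  return response.json(); // ingestion metrics, described later in this guide
}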

Append mode is the simplest way to add data: it inserts rows without duplicate checking, making it the fastest option:

Append rows to a table

curl -X POST "https://api.catalyzed.ai/dataset-tables/KzaMsfA0LSw_Ld0KyaXIS/rows?mode=append" \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '[
{"id": "1", "name": "Alice", "email": "alice@example.com"},
{"id": "2", "name": "Bob", "email": "bob@example.com"}
]'

Upsert mode updates existing rows by primary key and inserts new rows. The table must have a primary key defined:

Upsert rows (update or insert)

curl -X POST "https://api.catalyzed.ai/dataset-tables/KzaMsfA0LSw_Ld0KyaXIS/rows?mode=upsert" \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '[
{"id": "1", "name": "Alice Updated", "email": "alice.new@example.com"},
{"id": "3", "name": "Charlie", "email": "charlie@example.com"}
]'

In this example, the row with id="1" is updated if it already exists, and the row with id="3" is inserted as new.

Overwrite mode replaces all existing data in the table with the new rows:

Overwrite entire table

curl -X POST "https://api.catalyzed.ai/dataset-tables/KzaMsfA0LSw_Ld0KyaXIS/rows?mode=overwrite" \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '[
{"id": "10", "name": "New User", "email": "new@example.com"}
]'

Delete mode removes rows matching the provided primary keys. The request body is a JSON array of primary-key values rather than row objects:

Delete rows by primary key

curl -X POST "https://api.catalyzed.ai/dataset-tables/KzaMsfA0LSw_Ld0KyaXIS/rows?mode=delete" \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '["1", "2", "3"]'

For large datasets or when working with typed columnar data, use Apache Arrow IPC stream format. Arrow IPC requests are streamed end-to-end — the API forwards data directly to the storage engine without buffering the full payload in memory. This means there is no practical size limit beyond what your network can sustain, and the server starts writing data before the upload completes.

Ingest using Arrow IPC

Arrow IPC is a binary format, so cURL is not practical here; use the TypeScript or Python Arrow libraries instead.
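A minimal TypeScript sketch using the apache-arrow package (the column names and values are illustrative):

import { tableFromArrays, tableToIPC } from "apache-arrow";

// Build a typed, columnar table in memory.
const table = tableFromArrays({
  id: ["1", "2"],
  name: ["Alice", "Bob"],
});

// Serialize to the Arrow IPC stream format and POST it.
await fetch(
  "https://api.catalyzed.ai/dataset-tables/KzaMsfA0LSw_Ld0KyaXIS/rows?mode=append",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.API_TOKEN}`,
      "Content-Type": "application/vnd.apache.arrow.stream",
    },
    body: tableToIPC(table, "stream"),
  },
);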

For datasets that don’t fit in memory, write Arrow IPC in batches. The API streams each batch to the storage engine as it arrives:

Stream large dataset with Arrow IPC

As above, cURL is not practical for binary data; the sketch below uses the TypeScript Arrow library.
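One way to sketch this in TypeScript is with RecordBatchStreamWriter, piping the writer's output straight into a streaming request body. This assumes a runtime with streaming upload support (Node 18+ with undici, hence duplex: "half"), and chunksOfSourceData is a hypothetical iterable standing in for however you produce each chunk:

import { RecordBatchStreamWriter, tableFromArrays } from "apache-arrow";

// Hypothetical source: any iterable of column-oriented chunks.
declare const chunksOfSourceData: Iterable<Record<string, unknown[]>>;

const writer = new RecordBatchStreamWriter();

// Start the upload first; batches stream out as they are written, so the
// full payload is never buffered in memory.
const pending = fetch(
  "https://api.catalyzed.ai/dataset-tables/KzaMsfA0LSw_Ld0KyaXIS/rows?mode=append",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.API_TOKEN}`,
      "Content-Type": "application/vnd.apache.arrow.stream",
    },
    body: writer.toDOMStream(),
    duplex: "half", // required for streaming request bodies in Node 18+
  },
);

for (const chunk of chunksOfSourceData) {
  for (const batch of tableFromArrays(chunk).batches) {
    writer.write(batch); // each batch is forwarded as soon as it is written
  }
}
writer.close();

const response = await pending;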

When to use Arrow IPC:

  • Datasets larger than 10MB or too large to buffer in memory
  • You already have data in columnar format (Pandas, Polars, DuckDB)
  • Type safety is critical (no JSON string/number ambiguity)
  • Maximum throughput is required — data streams end-to-end with no intermediate buffering

Tables with binary or largebinary columns accept binary data through both ingestion formats:

JSON ingestion — provide values as base64-encoded strings:

[
  {"id": "doc_1", "content": "SGVsbG8gV29ybGQ="},
  {"id": "doc_2", "content": "AQID"}
]

The API validates and decodes base64 automatically. Invalid base64 strings are rejected with a COERCION_FAILED error.
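For example, to base64-encode raw bytes for the JSON path (Node runtime; the bytes are illustrative):

// "Hello" as raw bytes, base64-encoded for JSON ingestion.
const bytes = new Uint8Array([72, 101, 108, 108, 111]);
const row = {
  id: "doc_1",
  content: Buffer.from(bytes).toString("base64"), // "SGVsbG8="
};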

Arrow IPC ingestion — provide binary data natively as Binary or LargeBinary Arrow arrays. No encoding needed — bytes are preserved as-is. This is the recommended path for high-throughput binary ingestion (files, embeddings, serialized payloads) since it avoids the ~33% base64 overhead.
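A sketch of building a Binary column in TypeScript; vectorFromArray with an explicit Binary type is one way to do this, and the column names are illustrative:

import { Binary, Table, Utf8, tableToIPC, vectorFromArray } from "apache-arrow";

// Build columns explicitly so `content` gets a Binary Arrow type.
const id = vectorFromArray(["doc_1", "doc_2"], new Utf8());
const content = vectorFromArray(
  [new Uint8Array([72, 101, 108, 108, 111]), new Uint8Array([1, 2, 3])],
  new Binary(),
);

// Bytes are sent as-is: no base64 round-trip, no ~33% size overhead.
const body = tableToIPC(new Table({ id, content }), "stream");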

All ingestion requests return metrics about the operation:

{
  "rows_affected": 100,
  "rows_inserted": 95,
  "rows_updated": 5,
  "rows_deleted": 0,
  "dataset_version": 42,
  "duration_ms": 150,
  "usage": {
    "bytes_read": 1024,
    "bytes_written": 2048
  }
}
| Field | Description |
| --- | --- |
| rows_affected | Total rows modified |
| rows_inserted | New rows added |
| rows_updated | Existing rows changed |
| rows_deleted | Rows removed |
| dataset_version | New version number after the operation |
| duration_ms | Time taken for the operation, in milliseconds |
| usage | Storage I/O metrics for billing |
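If you work in TypeScript, the response can be typed like this (the interface name is illustrative; the fields come from the table above):

// Illustrative type for the ingestion metrics response.
interface IngestionMetrics {
  rows_affected: number;
  rows_inserted: number;
  rows_updated: number;
  rows_deleted: number;
  dataset_version: number;
  duration_ms: number;
  usage: {
    bytes_read: number;
    bytes_written: number;
  };
}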

Control ingestion behavior with query parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| mode | append \| upsert \| overwrite \| delete | Write operation mode (required) |
| skip_validation | boolean | Skip schema validation for faster writes (optional) |

By default, incoming data is validated against the table schema. For trusted data sources, skip validation for faster writes:

curl -X POST "https://api.catalyzed.ai/dataset-tables/KzaMsfA0LSw_Ld0KyaXIS/rows?mode=append&skip_validation=true" \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '[{"id": "1", "name": "Alice"}]'

For JSON, send up to 100MB per request; for Arrow IPC there is no payload size limit, since data is streamed. Batch rows to amortize per-request overhead:

// Good: Batch 1000 rows (JSON)
const batch = rows.slice(0, 1000);
await ingestRows(tableId, batch);

// Better: Use Arrow IPC for large datasets (streamed, no size limit)
const arrowData = tableToIPC(table, "stream");
await fetch(url, {
  method: "POST",
  headers: { "Content-Type": "application/vnd.apache.arrow.stream" },
  body: arrowData,
});

// Avoid: Single row per request
for (const row of rows) {
  await ingestRows(tableId, [row]); // Too many HTTP requests
}

// For small batches (<1000 rows): JSON is simpler
const jsonResponse = await fetch(url, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(rows),
});

// For large batches (>10,000 rows) or streaming: Arrow IPC
const ipcResponse = await fetch(url, {
  method: "POST",
  headers: { "Content-Type": "application/vnd.apache.arrow.stream" },
  body: tableToIPC(table, "stream"),
});
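To apply the batching advice to an arbitrarily large array, a simple chunking loop over the illustrative ingestRows helper works:

// Send rows in fixed-size chunks to bound payload size and request count.
const CHUNK_SIZE = 1000;
for (let i = 0; i < rows.length; i += CHUNK_SIZE) {
  await ingestRows(tableId, rows.slice(i, i + CHUNK_SIZE));
}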

For batch imports or critical data, always use idempotency keys:

const key = `import-${dataSource}-${timestamp}-${batchId}`;
await fetch(`${url}?mode=append&idempotency_key=${key}`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(rows),
});
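The point of the key is that retries become safe. A sketch of a retry loop that reuses the same key (assuming the server deduplicates on idempotency_key, which is what the parameter implies):

// Retry transient failures with the SAME key so a request that actually
// succeeded server-side is not applied twice.
for (let attempt = 0; attempt < 3; attempt++) {
  const res = await fetch(`${url}?mode=append&idempotency_key=${key}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(rows),
  });
  if (res.ok) break;
  await new Promise((r) => setTimeout(r, 1000 * 2 ** attempt)); // backoff
}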

When loading data for the first time, use append mode—it’s the fastest:

# Initial load: use append
curl -X POST ".../rows?mode=append" -d '[...]'
# Subsequent updates: use upsert
curl -X POST ".../rows?mode=upsert" -d '[...]'

Track dataset_version in responses to detect concurrent writes:

const { dataset_version } = await ingestRows(tableId, rows);
console.log(`Data written at version ${dataset_version}`);
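If you need to notice interleaved writers, one sketch is to compare consecutive versions. This assumes dataset_version increments by one per committed write, which this guide implies but does not state; firstBatch and nextBatch are placeholders:

// Warn when the version advanced by more than one between our own writes,
// meaning another writer committed in between (assumption: +1 per write).
let lastVersion = (await ingestRows(tableId, firstBatch)).dataset_version;

const { dataset_version } = await ingestRows(tableId, nextBatch);
if (dataset_version !== lastVersion + 1) {
  console.warn("Concurrent write detected between batches");
}
lastVersion = dataset_version;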

Common errors and solutions:

| Error Code | Cause | Solution |
| --- | --- | --- |
| TABLE_NOT_FOUND | Table ID doesn't exist | Verify table ID and team access |
| INVALID_BODY | Request body is not an array | Send a JSON array of row objects |
| EMPTY_BODY | Array is empty | Include at least one row |
| TABLE_NOT_REGISTERED | Table not linked to data engine | Contact support (rare) |
| SCHEMA_VALIDATION_ERROR | Data doesn't match schema | Check field types and names |
| INGESTION_FAILED | Internal error during write | Retry the request; contact support if persistent |
try {
  const response = await fetch(url, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(rows),
  });

  if (!response.ok) {
    const error = await response.json();
    if (error.code === "SCHEMA_VALIDATION_ERROR") {
      console.error("Schema mismatch:", error.message);
      // Log problematic rows or field types
    }
    throw new Error(`Ingestion failed: ${error.message}`);
  }

  const result = await response.json();
  console.log(`Ingested ${result.rows_affected} rows`);
} catch (err) {
  console.error("Failed to ingest data:", err);
}
  • Querying Data - Read and analyze your ingested data with SQL
  • Schema Management - Create tables and evolve schemas safely
  • Tables - Learn about table schemas, indexes, and data types