
Query Optimization

Query performance in Catalyzed can be optimized through indexes, table maintenance, and query design. This guide covers strategies for improving query performance across datasets of all sizes.

  1. Identify slow queries using query statistics
  2. Analyze query plans with the explain endpoint
  3. Add appropriate indexes for frequent filters
  4. Maintain table statistics for optimal query planning
  5. Compact tables to optimize storage and I/O
  6. Monitor and iterate on query performance

Avoid SELECT * and only request the columns you need:

-- ❌ Inefficient: Returns all columns
SELECT * FROM products WHERE category = 'electronics'
-- ✅ Efficient: Returns only needed columns
SELECT product_id, name, price FROM products WHERE category = 'electronics'

Push filtering down to the database rather than filtering rows in application code:

-- ❌ Inefficient: Fetches all rows, filters in app
SELECT * FROM orders
-- Then filter in application: orders.filter(o => o.status === 'completed')
-- ✅ Efficient: Filters in database
SELECT * FROM orders WHERE status = 'completed'

Always Use LIMIT During Development

Prevent accidentally fetching millions of rows while developing:

-- ✅ Safe for exploration
SELECT * FROM large_table LIMIT 100

Ensure queries use the correct types to avoid implicit conversions:

-- ❌ String comparison on timestamp column
SELECT * FROM events WHERE created_at > '2024-01-15'
-- ✅ Proper timestamp comparison
SELECT * FROM events WHERE created_at > TIMESTAMP '2024-01-15 00:00:00'
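Combined, the tips above look like this in client code. This is a sketch: the `/queries` request shape (`sql` plus a `tables` mapping) follows this guide's examples, while the `buildQueryRequest` and `runQuery` helper names are assumptions, not part of the API:

```typescript
// Sketch of a query following the tips above: an explicit column list,
// a WHERE filter pushed into the database, and a LIMIT for exploration.
// The request fields mirror this guide's examples; the helpers are illustrative.
interface QueryRequest {
  sql: string;
  tables: Record<string, string>; // logical name -> table ID
}

function buildQueryRequest(tableId: string): QueryRequest {
  return {
    sql:
      "SELECT product_id, name, price FROM products " +
      "WHERE category = 'electronics' LIMIT 100",
    tables: { products: tableId },
  };
}

async function runQuery(apiToken: string, tableId: string): Promise<unknown> {
  const res = await fetch("https://api.catalyzed.ai/queries", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(buildQueryRequest(tableId)),
  });
  if (!res.ok) throw new Error(`Query failed: ${res.status}`);
  return res.json();
}
```

Restricting the column list and pushing the filter into the SQL keeps both bytes scanned and transfer size down.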

Indexes dramatically improve query performance for filtering, sorting, and joins.

Create btree indexes on columns frequently used in WHERE clauses:

Create a btree index

curl -X POST "https://api.catalyzed.ai/dataset-tables/table_abc123/indexes" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "indexName": "idx_category",
    "indexType": "btree",
    "columns": ["category"]
  }'

When to use btree indexes:

  • Equality filters: WHERE category = 'electronics'
  • Range queries: WHERE price BETWEEN 10 AND 100
  • Sorting: ORDER BY created_at DESC
  • Joins: JOIN orders ON products.id = orders.product_id

For semantic search and similarity queries, use vector indexes:

Index Type | Use Case | Trade-offs
ivf_pq | Standard vector search | Balanced recall/speed
ivf_hnsw_pq | High-recall requirements | Better recall, slower indexing
ivf_hnsw_sq | Memory-constrained environments | Lower memory usage, slightly lower recall

Create a vector index

curl -X POST "https://api.catalyzed.ai/dataset-tables/table_abc123/indexes" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "indexName": "idx_embedding",
    "indexType": "ivf_pq",
    "columns": ["embedding"]
  }'

Vector index sizing guidelines:

  • < 10,000 rows: Flat search acceptable (no index needed)
  • 10,000 - 100,000 rows: Index recommended for good performance
  • > 100,000 rows: Index required for acceptable query times
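These thresholds can be encoded as a small decision helper when deciding whether a table needs a vector index. The cutoffs come from the guidelines above; the function itself is just a sketch:

```typescript
// Maps a table's row count to the vector-index sizing guidance above.
type IndexAdvice = "flat-search-ok" | "index-recommended" | "index-required";

function vectorIndexAdvice(rowCount: number): IndexAdvice {
  if (rowCount < 10_000) return "flat-search-ok"; // flat search acceptable
  if (rowCount <= 100_000) return "index-recommended"; // good performance
  return "index-required"; // needed for acceptable query times
}
```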

See the Vector Search guide for detailed tuning strategies.

Indexes must be rebuilt after schema changes:

# 1. Drop the old index
curl -X DELETE "https://api.catalyzed.ai/dataset-tables/table_abc123/indexes/idx_category" \
  -H "Authorization: Bearer $API_TOKEN"

# 2. Recreate it
curl -X POST "https://api.catalyzed.ai/dataset-tables/table_abc123/indexes" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "indexName": "idx_category",
    "indexType": "btree",
    "columns": ["category"]
  }'
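Since a rebuild is a drop followed by an identical recreate, generating both requests from a single spec keeps the new index definition in sync with the old one. A sketch, with URL and body shapes taken from the curl examples above and the `rebuildPlan` helper being an assumption:

```typescript
// Produces the two requests needed to rebuild an index after a schema
// change: DELETE the old index, then POST an identical definition.
// URL and body shapes follow the curl examples in this guide.
interface RebuildStep {
  method: "DELETE" | "POST";
  url: string;
  body?: string;
}

function rebuildPlan(
  tableId: string,
  spec: { indexName: string; indexType: string; columns: string[] },
): RebuildStep[] {
  const base = `https://api.catalyzed.ai/dataset-tables/${tableId}/indexes`;
  return [
    { method: "DELETE", url: `${base}/${spec.indexName}` },
    { method: "POST", url: base, body: JSON.stringify(spec) },
  ];
}
```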

Table statistics help the query optimizer choose efficient execution plans.

Catalyzed collects statistics about your data including:

  • Column cardinality - Number of distinct values
  • Data distribution - Histogram of value frequencies
  • NULL counts - Percentage of NULL values
  • Min/max values - Range of data in columns

These statistics help the query planner:

  • Estimate result set sizes
  • Choose optimal join strategies
  • Select best index usage
  • Determine filter order

Compute table statistics

curl -X POST "https://api.catalyzed.ai/dataset-tables/table_abc123/statistics" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{}'

Recompute statistics when:

  • After large data loads - New data changes distribution significantly
  • After schema migrations - Column types or structure changes
  • Query plans look suboptimal - Optimizer makes poor choices
  • Periodic maintenance - Weekly or monthly for active tables

Performance impact: Statistics computation scans all data, so schedule during off-peak hours for large tables.

Compaction optimizes storage by merging small files into larger ones, improving I/O performance.

Catalyzed stores table data in multiple files. Over time, many small files accumulate from frequent writes. Compaction:

  • Merges small files into larger ones
  • Reduces file count and metadata overhead
  • Improves sequential read performance
  • Reclaims deleted row space

Compact a table

curl -X POST "https://api.catalyzed.ai/dataset-tables/table_abc123/compact" \
  -H "Authorization: Bearer $API_TOKEN"

Compact tables when:

  • After many small writes - Frequent upserts or appends
  • After bulk deletes - Reclaim space from deleted rows
  • Query performance degrades - Too many files slow down reads
  • Before long-running queries - Optimize for analytical workloads

Recommended frequency:

  • High-write tables: Weekly
  • Moderate-write tables: Monthly
  • Read-mostly tables: After significant data changes

Performance impact: Compaction runs in the background but may temporarily increase storage usage during the merge process.

Use the explain endpoint to understand query execution before running expensive queries:

Explain a query

curl -X POST "https://api.catalyzed.ai/queries/explain" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM products WHERE category = '\''electronics'\'' AND price > 100",
    "tables": {"products": "table_abc123"}
  }'

The response shows:

  • Execution plan - Step-by-step query execution
  • Estimated cost - Query complexity estimate
  • Index usage - Which indexes will be used
  • Filter pushdown - Where filters are applied

Look for:

  • Index scans - Good: Using indexes efficiently
  • ⚠️ Full table scans - Warning: May be slow on large tables
  • Filter pushdown - Good: Filters applied early
  • ⚠️ Large joins - Warning: May need join order optimization
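A client can scan plan output for these patterns automatically before running a query for real. The sketch below assumes the plan is available as plain text and that scan nodes are named as in the checklist above; adjust the patterns to the explain endpoint's actual output:

```typescript
// Flags warning signs in an explain-plan string, per the checklist above.
// The exact plan text emitted by the explain endpoint is an assumption here.
function planWarnings(planText: string): string[] {
  const warnings: string[] = [];
  if (/full table scan/i.test(planText)) {
    warnings.push("full table scan: may be slow on large tables");
  }
  if (!/index scan/i.test(planText)) {
    warnings.push("no index scan: check that expected indexes exist");
  }
  return warnings;
}
```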

Optimize data loading for best throughput:

Insert data in batches up to 100MB per request:

// Insert rows in batches; each request stays well under the 100MB limit.
const batchSize = 5000;
for (let i = 0; i < rows.length; i += batchSize) {
  const batch = rows.slice(i, i + batchSize);
  const res = await fetch(
    `https://api.catalyzed.ai/dataset-tables/table_abc123/rows?mode=append`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(batch),
    }
  );
  if (!res.ok) throw new Error(`Batch insert failed: ${res.status}`);
}

For datasets >10MB, use Arrow IPC format instead of JSON:

curl -X POST "https://api.catalyzed.ai/dataset-tables/table_abc123/rows?mode=append" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/vnd.apache.arrow.stream" \
  --data-binary @data.arrow

Benefits of Arrow IPC:

  • 5-10x faster than JSON for large payloads
  • No parsing overhead
  • Preserves type information
  • Supports streaming

Mode | Use Case | Performance
append | Initial loads, no duplicates | Fastest (no checks)
upsert | Incremental updates | Moderate (requires primary key lookups)
overwrite | Replace all data | Moderate (deletes + inserts)
delete | Remove specific rows | Moderate (requires primary key lookups)

See the Ingesting Data guide for details on write modes.

Include statistics in query responses to monitor performance:

curl -X POST "https://api.catalyzed.ai/queries" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM products LIMIT 100",
    "tables": {"products": "table_abc123"},
    "includeStats": true
  }'

Response includes:

{
  "rows": [...],
  "stats": {
    "executionTimeMs": 145,
    "planningTimeMs": 12,
    "bytesScanned": 524288,
    "rowsScanned": 1000
  }
}

Key metrics:

  • executionTimeMs - Total query time
  • planningTimeMs - Query planning overhead
  • bytesScanned - Data read from storage
  • rowsScanned - Rows examined before filtering
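In application code, these metrics can drive a simple slow-query check. The field names match the `stats` object above; the thresholds are illustrative placeholders to tune for your workload, not recommendations from this guide:

```typescript
// Flags queries whose stats suggest optimization is needed.
// Field names match the stats object shown above; thresholds are
// illustrative placeholders.
interface QueryStats {
  executionTimeMs: number;
  planningTimeMs: number;
  bytesScanned: number;
  rowsScanned: number;
}

function isSlow(stats: QueryStats, rowsReturned: number): boolean {
  const scanRatio =
    stats.rowsScanned > 0 ? rowsReturned / stats.rowsScanned : 1;
  // Slow wall-clock time, or scanning far more rows than are returned,
  // both point at missing indexes or filters.
  return stats.executionTimeMs > 1000 || scanRatio < 0.01;
}
```

A low returned-to-scanned ratio is often the first sign that a filtered column needs a btree index.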

Use this checklist to optimize query performance:

  • Use appropriate Arrow types (int64 vs string for IDs)
  • Define primary keys for tables with unique identifiers
  • Denormalize when appropriate to avoid complex joins
  • Create btree indexes on frequently filtered columns
  • Create vector indexes for tables with >10k rows and similarity search
  • Rebuild indexes after schema changes
  • Compute statistics after large data loads
  • Compact tables weekly (high-write) or monthly (moderate-write)
  • Monitor table file counts and sizes
  • Select only needed columns (avoid SELECT *)
  • Filter early with WHERE clauses
  • Use LIMIT during development and exploration
  • Use appropriate data types in filters
  • Batch writes (1,000-5,000 rows per request)
  • Use Arrow IPC for payloads >10MB
  • Choose appropriate write mode (append vs upsert)
  • Enable query statistics (includeStats: true)
  • Analyze slow queries with explain endpoint
  • Track query performance trends over time