
Query Optimization

Query performance in Catalyzed can be optimized through indexes, table maintenance, and query design. This guide covers strategies for improving query performance across datasets of all sizes.

  1. Identify slow queries using query statistics
  2. Analyze query plans with the explain endpoint
  3. Add appropriate indexes for frequent filters
  4. Maintain table statistics for optimal query planning
  5. Compact tables to optimize storage and I/O
  6. Monitor and iterate on query performance

Avoid SELECT * and only request the columns you need:

-- ❌ Inefficient: Returns all columns
SELECT * FROM products WHERE category = 'electronics'
-- ✅ Efficient: Returns only needed columns
SELECT product_id, name, price FROM products WHERE category = 'electronics'

Push filtering down to the database rather than filtering rows in application code:

-- ❌ Inefficient: Fetches all rows, filters in app
SELECT * FROM orders
-- Then filter in application: orders.filter(o => o.status === 'completed')
-- ✅ Efficient: Filters in database
SELECT * FROM orders WHERE status = 'completed'

Always Use LIMIT During Development

Prevent accidentally fetching millions of rows while developing:

-- ✅ Safe for exploration
SELECT * FROM large_table LIMIT 100

Ensure queries use the correct types to avoid implicit conversions:

-- ❌ String comparison on timestamp column
SELECT * FROM events WHERE created_at > '2024-01-15'
-- ✅ Proper timestamp comparison
SELECT * FROM events WHERE created_at > TIMESTAMP '2024-01-15 00:00:00'
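Combined, the tips above look like this in client code. This is a sketch: the `/queries` request shape (`sql` plus a `tables` mapping) follows this guide's examples, while the `buildQueryRequest` and `runQuery` helper names are assumptions, not part of the API:

```typescript
// Sketch of a query following the tips above: an explicit column list,
// a WHERE filter pushed into the database, and a LIMIT for exploration.
// The request fields mirror this guide's examples; the helpers are illustrative.
interface QueryRequest {
  sql: string;
  tables: Record<string, string>; // logical name -> table ID
}

function buildQueryRequest(tableId: string): QueryRequest {
  return {
    sql:
      "SELECT product_id, name, price FROM products " +
      "WHERE category = 'electronics' LIMIT 100",
    tables: { products: tableId },
  };
}

async function runQuery(apiToken: string, tableId: string): Promise<unknown> {
  const res = await fetch("https://api.catalyzed.ai/queries", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(buildQueryRequest(tableId)),
  });
  if (!res.ok) throw new Error(`Query failed: ${res.status}`);
  return res.json();
}
```

Restricting the column list and pushing the filter into the SQL keeps both bytes scanned and transfer size down.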

Indexes dramatically improve query performance for filtering, sorting, and joins.

Create btree indexes on columns frequently used in WHERE clauses:

Create a btree index

curl -X POST "https://api.catalyzed.ai/dataset-tables/table_abc123/indexes" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "indexName": "idx_category",
    "indexType": "btree",
    "columns": ["category"]
  }'

When to use btree indexes:

  • Equality filters: WHERE category = 'electronics'
  • Range queries: WHERE price BETWEEN 10 AND 100
  • Sorting: ORDER BY created_at DESC
  • Joins: JOIN orders ON products.id = orders.product_id

For semantic search and similarity queries, use vector indexes:

Index Type | Use Case | Trade-offs
ivf_pq | Standard vector search | Balanced recall/speed
ivf_hnsw_pq | High-recall requirements | Better recall, slower indexing
ivf_hnsw_sq | Memory-constrained environments | Lower memory usage, slightly lower recall

Create a vector index

curl -X POST "https://api.catalyzed.ai/dataset-tables/table_abc123/indexes" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "indexName": "idx_embedding",
    "indexType": "ivf_pq",
    "columns": ["embedding"]
  }'

Vector index sizing guidelines:

  • < 10,000 rows: Flat search acceptable (no index needed)
  • 10,000 - 100,000 rows: Index recommended for good performance
  • > 100,000 rows: Index required for acceptable query times
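These thresholds can be encoded as a small decision helper when deciding whether a table needs a vector index. The cutoffs come from the guidelines above; the function itself is just a sketch:

```typescript
// Maps a table's row count to the vector-index sizing guidance above.
type IndexAdvice = "flat-search-ok" | "index-recommended" | "index-required";

function vectorIndexAdvice(rowCount: number): IndexAdvice {
  if (rowCount < 10_000) return "flat-search-ok"; // flat search acceptable
  if (rowCount <= 100_000) return "index-recommended"; // good performance
  return "index-required"; // needed for acceptable query times
}
```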

See the Vector Search guide for detailed tuning strategies.

Indexes must be rebuilt after schema changes:

# 1. Drop the old index
curl -X DELETE "https://api.catalyzed.ai/dataset-tables/table_abc123/indexes/idx_category" \
  -H "Authorization: Bearer $API_TOKEN"

# 2. Recreate it
curl -X POST "https://api.catalyzed.ai/dataset-tables/table_abc123/indexes" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "indexName": "idx_category",
    "indexType": "btree",
    "columns": ["category"]
  }'
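Since a rebuild is a drop followed by an identical recreate, generating both requests from a single spec keeps the new index definition in sync with the old one. A sketch, with URL and body shapes taken from the curl examples above and the `rebuildPlan` helper being an assumption:

```typescript
// Produces the two requests needed to rebuild an index after a schema
// change: DELETE the old index, then POST an identical definition.
// URL and body shapes follow the curl examples in this guide.
interface RebuildStep {
  method: "DELETE" | "POST";
  url: string;
  body?: string;
}

function rebuildPlan(
  tableId: string,
  spec: { indexName: string; indexType: string; columns: string[] },
): RebuildStep[] {
  const base = `https://api.catalyzed.ai/dataset-tables/${tableId}/indexes`;
  return [
    { method: "DELETE", url: `${base}/${spec.indexName}` },
    { method: "POST", url: base, body: JSON.stringify(spec) },
  ];
}
```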

Table statistics help the query optimizer choose efficient execution plans.

Catalyzed collects statistics about your data including:

  • Column cardinality - Number of distinct values
  • Data distribution - Histogram of value frequencies
  • NULL counts - Percentage of NULL values
  • Min/max values - Range of data in columns

These statistics help the query planner:

  • Estimate result set sizes
  • Choose optimal join strategies
  • Select best index usage
  • Determine filter order

Compute table statistics

curl -X POST "https://api.catalyzed.ai/dataset-tables/table_abc123/statistics" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{}'

Recompute statistics when:

  • After large data loads - New data changes distribution significantly
  • After schema migrations - Column types or structure changes
  • Query plans look suboptimal - Optimizer makes poor choices
  • Periodic maintenance - Weekly or monthly for active tables

Performance impact: Statistics computation scans all data, so schedule during off-peak hours for large tables.

Compaction optimizes storage by merging small files into larger ones, improving I/O performance.

Catalyzed stores table data in multiple files. Over time, many small files accumulate from frequent writes. Compaction:

  • Merges small files into larger ones
  • Reduces file count and metadata overhead
  • Improves sequential read performance
  • Reclaims deleted row space

Compact a table

curl -X POST "https://api.catalyzed.ai/dataset-tables/table_abc123/compact" \
  -H "Authorization: Bearer $API_TOKEN"

Compact tables when:

  • After many small writes - Frequent upserts or appends
  • After bulk deletes - Reclaim space from deleted rows
  • Query performance degrades - Too many files slow down reads
  • Before long-running queries - Optimize for analytical workloads

Recommended frequency:

  • High-write tables: Weekly
  • Moderate-write tables: Monthly
  • Read-mostly tables: After significant data changes

Performance impact: Compaction runs in the background but may temporarily increase storage usage during the merge process.

Use the explain endpoint to understand query execution before running expensive queries:

Explain a query

curl -X POST "https://api.catalyzed.ai/queries/explain" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM products WHERE category = '\''electronics'\'' AND price > 100",
    "tables": {"products": "table_abc123"}
  }'

The response shows:

  • Execution plan - Step-by-step query execution
  • Estimated cost - Query complexity estimate
  • Index usage - Which indexes will be used
  • Filter pushdown - Where filters are applied

Look for:

  • Index scans - Good: Using indexes efficiently
  • ⚠️ Full table scans - Warning: May be slow on large tables
  • Filter pushdown - Good: Filters applied early
  • ⚠️ Large joins - Warning: May need join order optimization
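A client can scan plan output for these patterns automatically before running a query for real. The sketch below assumes the plan is available as plain text and that scan nodes are named as in the checklist above; adjust the patterns to the explain endpoint's actual output:

```typescript
// Flags warning signs in an explain-plan string, per the checklist above.
// The exact plan text emitted by the explain endpoint is an assumption here.
function planWarnings(planText: string): string[] {
  const warnings: string[] = [];
  if (/full table scan/i.test(planText)) {
    warnings.push("full table scan: may be slow on large tables");
  }
  if (!/index scan/i.test(planText)) {
    warnings.push("no index scan: check that expected indexes exist");
  }
  return warnings;
}
```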

Optimize data loading for best throughput:

Insert data in batches up to 100MB per request:

// Insert rows in batches; each request stays well under the 100MB limit.
const batchSize = 5000;
for (let i = 0; i < rows.length; i += batchSize) {
  const batch = rows.slice(i, i + batchSize);
  const res = await fetch(
    `https://api.catalyzed.ai/dataset-tables/table_abc123/rows?mode=append`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(batch),
    }
  );
  if (!res.ok) throw new Error(`Batch insert failed: ${res.status}`);
}

For datasets >10MB, use Arrow IPC format instead of JSON:

curl -X POST "https://api.catalyzed.ai/dataset-tables/table_abc123/rows?mode=append" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/vnd.apache.arrow.stream" \
  --data-binary @data.arrow

Benefits of Arrow IPC:

  • 5-10x faster than JSON for large payloads
  • No parsing overhead
  • Preserves type information
  • Supports streaming

Mode | Use Case | Performance
append | Initial loads, no duplicates | Fastest (no checks)
upsert | Incremental updates | Moderate (requires primary key lookups)
overwrite | Replace all data | Moderate (deletes + inserts)
delete | Remove specific rows | Moderate (requires primary key lookups)

See the Ingesting Data guide for details on write modes.

Include statistics in query responses to monitor performance:

curl -X POST "https://api.catalyzed.ai/queries" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM products LIMIT 100",
    "tables": {"products": "table_abc123"},
    "includeStats": true
  }'

Response includes:

{
  "rows": [...],
  "stats": {
    "executionTimeMs": 145,
    "planningTimeMs": 12,
    "bytesScanned": 524288,
    "rowsScanned": 1000
  }
}

Key metrics:

  • executionTimeMs - Total query time
  • planningTimeMs - Query planning overhead
  • bytesScanned - Data read from storage
  • rowsScanned - Rows examined before filtering
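In application code, these metrics can drive a simple slow-query check. The field names match the `stats` object above; the thresholds are illustrative placeholders to tune for your workload, not recommendations from this guide:

```typescript
// Flags queries whose stats suggest optimization is needed.
// Field names match the stats object shown above; thresholds are
// illustrative placeholders.
interface QueryStats {
  executionTimeMs: number;
  planningTimeMs: number;
  bytesScanned: number;
  rowsScanned: number;
}

function isSlow(stats: QueryStats, rowsReturned: number): boolean {
  const scanRatio =
    stats.rowsScanned > 0 ? rowsReturned / stats.rowsScanned : 1;
  // Slow wall-clock time, or scanning far more rows than are returned,
  // both point at missing indexes or filters.
  return stats.executionTimeMs > 1000 || scanRatio < 0.01;
}
```

A low returned-to-scanned ratio is often the first sign that a filtered column needs a btree index.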

Use this checklist to optimize query performance:

  • Use appropriate Arrow types (int64 vs string for IDs)
  • Define primary keys for tables with unique identifiers
  • Denormalize when appropriate to avoid complex joins
  • Create btree indexes on frequently filtered columns
  • Create vector indexes for tables with >10k rows and similarity search
  • Rebuild indexes after schema changes
  • Compute statistics after large data loads
  • Compact tables weekly (high-write) or monthly (moderate-write)
  • Monitor table file counts and sizes
  • Select only needed columns (avoid SELECT *)
  • Filter early with WHERE clauses
  • Use LIMIT during development and exploration
  • Use appropriate data types in filters
  • Batch writes (1,000-5,000 rows per request)
  • Use Arrow IPC for payloads >10MB
  • Choose appropriate write mode (append vs upsert)
  • Enable query statistics (includeStats: true)
  • Analyze slow queries with explain endpoint
  • Track query performance trends over time