Querying Data

Catalyzed supports standard SQL for querying your data. This guide covers common query patterns and best practices.

Executing Queries

Use the /queries endpoint to run SQL queries against your tables. The tables parameter maps the table names used in your SQL to their table IDs.

Basic SELECT query

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM users LIMIT 10",
    "tables": {"users": "KzaMsfA0LSw_Ld0KyaXIS"}
  }'

const response = await fetch("https://api.catalyzed.ai/queries", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: "SELECT * FROM users LIMIT 10",
    tables: { users: "KzaMsfA0LSw_Ld0KyaXIS" },
  }),
});
const data = await response.json();

response = requests.post(
    "https://api.catalyzed.ai/queries",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "sql": "SELECT * FROM users LIMIT 10",
        "tables": {"users": "KzaMsfA0LSw_Ld0KyaXIS"}
    }
)
data = response.json()

Joining Tables

Query across multiple tables by including them all in the tables mapping:

JOIN across tables

curl -X POST https://api.catalyzed.ai/queries \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT o.order_id, c.name, o.total FROM orders o JOIN customers c ON o.customer_id = c.id",
    "tables": {
      "orders": "Ednc5U676CO4hn-FqsXeA",
      "customers": "6fTBbbj4uv8TVMVh0gVch"
    }
  }'

const response = await fetch("https://api.catalyzed.ai/queries", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: `SELECT o.order_id, c.name, o.total
          FROM orders o
          JOIN customers c ON o.customer_id = c.id`,
    tables: {
      orders: "Ednc5U676CO4hn-FqsXeA",
      customers: "6fTBbbj4uv8TVMVh0gVch",
    },
  }),
});

response = requests.post(
    "https://api.catalyzed.ai/queries",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "sql": """
            SELECT o.order_id, c.name, o.total
            FROM orders o
            JOIN customers c ON o.customer_id = c.id
        """,
        "tables": {
            "orders": "Ednc5U676CO4hn-FqsXeA",
            "customers": "6fTBbbj4uv8TVMVh0gVch"
        }
    }
)

SQL Syntax Reference

SELECT Clauses

-- Select specific columns
SELECT name, email, created_at FROM users

-- Select all columns
SELECT * FROM users

-- Column aliases
SELECT name AS customer_name, email AS contact FROM users

-- Expressions
SELECT name, price * quantity AS total FROM orders

Filtering (WHERE)

-- Basic comparisons
SELECT * FROM users WHERE status = 'active'
SELECT * FROM orders WHERE amount > 100
SELECT * FROM orders WHERE created_at >= '2024-01-01'

-- Multiple conditions
SELECT * FROM orders WHERE status = 'active' AND amount > 100
SELECT * FROM orders WHERE status = 'active' OR status = 'pending'

-- NULL checks
SELECT * FROM users WHERE deleted_at IS NULL
SELECT * FROM orders WHERE notes IS NOT NULL

-- IN operator
SELECT * FROM orders WHERE status IN ('active', 'pending', 'review')

-- LIKE pattern matching
SELECT * FROM users WHERE email LIKE '%@example.com'
SELECT * FROM users WHERE name LIKE 'John%'

-- BETWEEN range
SELECT * FROM orders WHERE amount BETWEEN 100 AND 500

Sorting (ORDER BY)

-- Ascending (default)
SELECT * FROM orders ORDER BY created_at

-- Descending
SELECT * FROM orders ORDER BY created_at DESC

-- Multiple columns
SELECT * FROM orders ORDER BY status ASC, created_at DESC

-- With NULLS handling
SELECT * FROM orders ORDER BY completed_at NULLS LAST

Pagination (LIMIT/OFFSET)

-- First 10 rows
SELECT * FROM users LIMIT 10

-- Skip first 20, get next 10
SELECT * FROM users LIMIT 10 OFFSET 20

-- Recommended: use ORDER BY with pagination
SELECT * FROM users ORDER BY id LIMIT 10 OFFSET 20

Aggregations

-- Count
SELECT COUNT(*) FROM orders
SELECT COUNT(DISTINCT customer_id) FROM orders

-- Sum, Average, Min, Max
SELECT SUM(amount) FROM orders
SELECT AVG(amount) FROM orders WHERE status = 'completed'
SELECT MIN(created_at), MAX(created_at) FROM orders

-- Group by
SELECT status, COUNT(*) as count, SUM(amount) as total
FROM orders
GROUP BY status

-- Having (filter groups)
SELECT customer_id, COUNT(*) as order_count
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 5

Joins

-- INNER JOIN (only matching rows)
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id

-- LEFT JOIN (all from left, matching from right)
SELECT c.name, o.order_id
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id

-- Multiple joins
SELECT o.order_id, c.name, p.product_name
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id

Subqueries

-- Subquery in WHERE
SELECT * FROM orders
WHERE customer_id IN (
  SELECT id FROM customers WHERE status = 'premium'
)

-- Subquery in FROM
SELECT avg_amount, COUNT(*) as customer_count
FROM (
  SELECT customer_id, AVG(amount) as avg_amount
  FROM orders
  GROUP BY customer_id
) customer_averages
GROUP BY avg_amount

Common Table Expressions (CTEs)

WITH high_value_customers AS (
  SELECT customer_id, SUM(amount) as total_spent
  FROM orders
  GROUP BY customer_id
  HAVING SUM(amount) > 10000
)
SELECT c.name, hvc.total_spent
FROM high_value_customers hvc
JOIN customers c ON hvc.customer_id = c.id
ORDER BY hvc.total_spent DESC

Window Functions

-- Row number
SELECT name, amount,
  ROW_NUMBER() OVER (ORDER BY amount DESC) as rank
FROM orders

-- Partition by
SELECT customer_id, order_date, amount,
  SUM(amount) OVER (PARTITION BY customer_id) as customer_total
FROM orders

-- Running total
SELECT order_date, amount,
  SUM(amount) OVER (ORDER BY order_date) as running_total
FROM orders

Query Response Format

Queries return results in a structured JSON format:

{
  "queryId": "qry_abc123",
  "columns": [
    {"name": "id", "type": "Utf8"},
    {"name": "name", "type": "Utf8"},
    {"name": "email", "type": "Utf8"},
    {"name": "created_at", "type": "Timestamp(Microsecond, Some(\"UTC\"))"}
  ],
  "rows": [
    {"id": "1", "name": "Alice", "email": "alice@example.com", "created_at": "2024-01-15T10:00:00Z"},
    {"id": "2", "name": "Bob", "email": "bob@example.com", "created_at": "2024-01-15T11:00:00Z"}
  ],
  "rowCount": 2,
  "truncated": false
}

Query Options

Limit Results

Prevent accidentally returning too many rows:

{
  "sql": "SELECT * FROM users",
  "tables": {"users": "KzaMsfA0LSw_Ld0KyaXIS"},
  "maxRows": 1000
}

Get Execution Plan

Preview how a query will execute without running it:

curl -X POST https://api.catalyzed.ai/queries/explain \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT * FROM users WHERE status = '\''active'\''",
    "tables": {"users": "KzaMsfA0LSw_Ld0KyaXIS"}
  }'

Performance Tips

1. Use Specific Columns

-- Good: Select only needed columns
SELECT id, name, status FROM users

-- Avoid: Select all columns
SELECT * FROM users

2. Filter Early

-- Good: Filter in WHERE
SELECT * FROM orders WHERE created_at > '2024-01-01'

-- Avoid: Filter in application code
SELECT * FROM orders  -- then filter in code

3. Use LIMIT

-- Good: Always limit during development
SELECT * FROM orders WHERE status = 'active' LIMIT 100

-- Avoid: Unbounded queries
SELECT * FROM orders

4. Create Indexes

For columns frequently used in WHERE clauses:

curl -X POST https://api.catalyzed.ai/dataset-tables/KzaMsfA0LSw_Ld0KyaXIS/indexes \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "indexName": "idx_status",
    "columnName": "status",
    "indexType": "btree"
  }'

5. Use Appropriate Data Types

Use timestamp for dates, not strings
Use numeric types (int64, float64) for numbers, not strings
This enables efficient comparisons and aggregations

Error Handling

Syntax Errors

{
  "error": "QUERY_ERROR",
  "message": "SQL syntax error at position 45: unexpected token 'FORM'"
}

Missing Table Bindings

{
  "error": "VALIDATION_ERROR",
  "message": "Table 'orders' referenced in SQL but not found in tables mapping"
}

Timeout

Long-running queries may timeout. Use filters and limits to reduce data scanned.

SQL Translation

Convert natural language queries to SQL using the /queries/translate endpoint. The translator understands table schemas, column types, and supports semantic search functions.

Translate natural language to SQL

curl -X POST https://api.catalyzed.ai/queries/translate \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "tables": {
      "orders": "Ednc5U676CO4hn-FqsXeA",
      "customers": "6fTBbbj4uv8TVMVh0gVch"
    },
    "query": "Find customers who placed orders over $1000 last month",
    "includeExplanation": true
  }'

const response = await fetch("https://api.catalyzed.ai/queries/translate", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    tables: {
      orders: "Ednc5U676CO4hn-FqsXeA",
      customers: "6fTBbbj4uv8TVMVh0gVch",
    },
    query: "Find customers who placed orders over $1000 last month",
    includeExplanation: true,
  }),
});
const { sql, explanation, confidence } = await response.json();

response = requests.post(
    "https://api.catalyzed.ai/queries/translate",
    headers={"Authorization": f"Bearer {api_token}"},
    json={
        "tables": {
            "orders": "Ednc5U676CO4hn-FqsXeA",
            "customers": "6fTBbbj4uv8TVMVh0gVch"
        },
        "query": "Find customers who placed orders over $1000 last month",
        "includeExplanation": True
    }
)
result = response.json()
sql = result["sql"]
explanation = result.get("explanation")
confidence = result["confidence"]

Response:

{
  "sql": "SELECT DISTINCT c.name, c.email FROM customers c JOIN orders o ON c.id = o.customer_id WHERE o.amount > 1000 AND o.created_at >= date_trunc('month', CURRENT_DATE - INTERVAL '1 month')",
  "confidence": "high",
  "explanation": "This query joins customers and orders tables, filters for orders over $1000, and restricts to the previous month using date_trunc.",
  "warnings": [],
  "referencedTables": ["customers", "orders"],
  "referencedColumns": ["customers.name", "customers.email", "customers.id", "orders.customer_id", "orders.amount", "orders.created_at"],
  "usesVectorSearch": false
}

Translation Features

Schema Awareness — understands table structures, column types, and sample data
Semantic Search — auto-detects vector columns and enables knn_search(), knn_cosine(), knn_l2(), and text_to_embedding()
SQL Validation — validates generated SQL syntax via EXPLAIN (enabled by default)
Confidence Levels — returns "high", "medium", or "low" confidence for the translation
Explanations — optional natural language explanation of the generated SQL

Request Options

Option	Type	Description
`tables`	object	Map of table aliases (as used in SQL) to table IDs
`query`	string	Natural language query to translate (max 4000 chars)
`includeExplanation`	boolean	Include explanation of generated SQL (default: `false`)
`validateSyntax`	boolean	Validate SQL syntax before returning (default: `true`)

Example: Semantic Search Query

{
  "tables": {
    "products": "tbl_abc123"
  },
  "query": "Find products similar to wireless headphones",
  "includeExplanation": true
}

Generated SQL:

SELECT * FROM knn_cosine(
    'products',
    'embedding',
    text_to_embedding('wireless headphones'),
    10
)

For optimizing translation quality — table naming, column naming, confidence handling, and the translate→execute workflow — see the SQL Translation Guide.

See the SQL Translation API for the complete endpoint reference.

Vector Search

Catalyzed also supports semantic similarity search using vector embeddings. Use knn_search() to find similar documents, products, or any embedded content:

SELECT * FROM knn_cosine(
    'products',
    'embedding',
    text_to_embedding('wireless headphones'),
    10
)

See the Vector Search Guide for complete documentation on:

knn_search(), knn_cosine(), and knn_l2() functions
text_to_embedding() for query-time embeddings
Filtered vector search
Distance metrics and thresholds

Next Steps

Vector Search - Semantic similarity search with embeddings
Tables - Learn about table schemas and indexes
Schema Management - Evolve your schemas safely