SQL Query Engine Feature, Dev, and Test Plan

This document outlines the plan for adding SQL querying support to SeaweedFS, focusing on reading and analyzing data from Message Queue (MQ) topics.

Feature Plan

1. Goal

Provide a SQL query interface for SeaweedFS so that existing MQ topics can be analyzed in place. The interface covers:

- Basic querying of MQ topics with SELECT, WHERE, and aggregations
- Schema discovery and metadata operations (SHOW DATABASES, SHOW TABLES, DESCRIBE)
- In-place analytics on Parquet-stored messages without data movement

2. Key Features

Schema Discovery and Metadata:
- SHOW DATABASES - List all MQ namespaces
- SHOW TABLES - List all topics in a namespace
- DESCRIBE table_name - Show topic schema details
- Automatic schema detection from existing Parquet data
Basic Query Engine:
- SELECT support with WHERE, LIMIT, OFFSET
- Aggregation functions: COUNT(), SUM(), AVG(), MIN(), MAX()
- Temporal queries with timestamp-based filtering
User Interfaces:
- New CLI command weed sql with interactive shell mode
- Optional: Web UI for query execution and result visualization
Output Formats:
- JSON (default), CSV, Parquet for result sets
- Streaming results for large queries
- Pagination support for result navigation
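
To make the intended query semantics concrete, here is a minimal Python sketch of WHERE filtering, LIMIT/OFFSET pagination, and the five aggregates applied to in-memory rows. It is an illustration only, not SeaweedFS code: the dict-based row layout and the names `select` and `aggregate` are hypothetical, and skipping NULL values in aggregates follows common SQL behavior rather than anything this plan specifies.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, Optional

Row = Dict[str, object]  # hypothetical row representation: column name -> value


def select(rows: Iterable[Row], where: Callable[[Row], bool],
           limit: Optional[int] = None, offset: int = 0) -> Iterator[Row]:
    """Lazily filter rows, skip `offset` matches, then yield at most `limit` rows."""
    matched = (r for r in rows if where(r))
    stop = None if limit is None else offset + limit
    return islice(matched, offset, stop)


def aggregate(rows: Iterable[Row], column: str) -> Dict[str, object]:
    """Compute COUNT, SUM, AVG, MIN, MAX over the non-NULL values of one column in one pass."""
    count, total, lo, hi = 0, 0, None, None
    for r in rows:
        v = r.get(column)
        if v is None:
            continue  # assumption: NULLs are ignored, as in standard SQL aggregates
        count += 1
        total += v
        lo = v if lo is None or v < lo else lo
        hi = v if hi is None or v > hi else hi
    return {
        "count": count,
        "sum": total if count else None,
        "avg": total / count if count else None,
        "min": lo,
        "max": hi,
    }
```

Because `select` is a generator pipeline, a LIMIT stops reading input as soon as enough rows have matched, which is the property that makes streaming and pagination cheap.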

Development Plan

3. Data Source Integration

MQ Topic Connector (Primary):
- Build on existing weed/mq/logstore/read_parquet_to_log.go
- Implement efficient Parquet scanning with predicate pushdown
- Support schema evolution and backward compatibility
- Handle partition-based parallelism for scalable queries
Schema Registry Integration:
- Extend weed/mq/schema/schema.go for SQL metadata operations
- Read existing topic schemas for query planning
- Handle schema evolution during query execution
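
Predicate pushdown for timestamp filters can be sketched as pruning on per-row-group min/max statistics. The Python below is a simplified illustration under stated assumptions: `RowGroupStats` and `prune_row_groups` are hypothetical names, and the statistics are passed in directly rather than read from a Parquet footer.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical per-row-group summary of the timestamp column."""
    file: str
    min_ts: int  # smallest timestamp in the row group
    max_ts: int  # largest timestamp in the row group


def prune_row_groups(groups: Iterable[RowGroupStats],
                     ts_min: Optional[int] = None,
                     ts_max: Optional[int] = None) -> List[RowGroupStats]:
    """Return the row groups whose [min_ts, max_ts] range overlaps [ts_min, ts_max].

    Bounds are inclusive; None means unbounded on that side.
    """
    kept = []
    for g in groups:
        if ts_min is not None and g.max_ts < ts_min:
            continue  # every row is older than the requested window
        if ts_max is not None and g.min_ts > ts_max:
            continue  # every row is newer than the requested window
        kept.append(g)
    return kept
```

Pruning only rules out row groups that cannot match. Rows in the surviving groups still need the WHERE predicate evaluated individually.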

4. API & CLI Integration

CLI Command:
- New weed sql command with interactive shell mode (similar to weed shell)
- Support for script execution and result formatting
- Connection management for remote SeaweedFS clusters
gRPC API:
- Add SQL service to existing MQ broker gRPC interface
- Enable efficient query execution with streaming results
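
Result formatting with streaming output could look roughly like the following Python sketch, which emits one line per row instead of buffering the whole result set. The function names are hypothetical, and emitting JSON as one object per line is an assumption; the plan only states that JSON is the default format.

```python
import csv
import io
import json
from typing import Dict, Iterable, Iterator, Sequence


def stream_json(rows: Iterable[Dict[str, object]]) -> Iterator[str]:
    """Yield one JSON object per line so a client can print rows as they arrive."""
    for row in rows:
        yield json.dumps(row, sort_keys=True)


def _csv_line(values: Sequence[object]) -> str:
    """Encode one record with the csv module so commas and quotes are escaped correctly."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(values)
    return buf.getvalue().rstrip("\n")


def stream_csv(rows: Iterable[Dict[str, object]], columns: Sequence[str]) -> Iterator[str]:
    """Yield a header line, then one CSV line per row; missing columns become empty fields."""
    yield _csv_line(columns)
    for row in rows:
        yield _csv_line([row.get(c, "") for c in columns])
```

Both functions are generators, so the same code path can feed an interactive shell, a script, or a streaming RPC response.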

Example Usage Scenarios

Scenario 1: Schema Discovery and Metadata

-- List all namespaces (databases)
SHOW DATABASES;

-- List topics in a namespace
USE my_namespace;
SHOW TABLES;

-- View topic structure and discovered schema
DESCRIBE user_events;

Scenario 2: Data Querying

-- Basic filtering and projection
SELECT user_id, event_type, timestamp 
FROM user_events 
WHERE timestamp > 1640995200000 
LIMIT 100;

-- Aggregation queries  
SELECT COUNT(*) as event_count
FROM user_events 
WHERE timestamp >= 1640995200000;

-- More aggregation examples
SELECT MAX(timestamp), MIN(timestamp) 
FROM user_events;
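
The numeric literals in the WHERE clauses above read naturally as Unix epoch milliseconds: under that reading, 1640995200000 is 2022-01-01 00:00:00 UTC. The plan does not state the unit, so treat this as an assumption. A small Python helper (hypothetical, for illustration) for producing such literals:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt: datetime) -> int:
    """Convert a timezone-aware datetime to integer milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone before converting")
    # Integer division of timedeltas avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


cutoff = to_epoch_millis(datetime(2022, 1, 1, tzinfo=timezone.utc))
query = f"SELECT COUNT(*) AS event_count FROM user_events WHERE timestamp >= {cutoff};"
```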

Scenario 3: Analytics & Monitoring

-- Basic analytics
SELECT COUNT(*) as total_events
FROM user_events 
WHERE timestamp >= 1640995200000;

-- Simple monitoring
SELECT AVG(response_time) as avg_response
FROM api_logs
WHERE timestamp >= 1640995200000;

## Architecture Overview

SQL Query Flow:

                             1. Parse SQL      2. Plan & Optimize     3. Execute Query
    ┌─────────────┐    ┌──────────────┐    ┌─────────────────┐    ┌──────────────┐
    │   Client    │    │  SQL Parser  │    │  Query Planner  │    │  Execution   │
    │   (CLI)     │──→ │  PostgreSQL  │──→ │  & Optimizer    │──→ │   Engine     │
    │             │    │  (Custom)    │    │                 │    │              │
    └─────────────┘    └──────────────┘    └─────────────────┘    └──────────────┘
                                                    │                    │
                                                    │ Schema Lookup      │ Data Access
                                                    ▼                    ▼
                     ┌─────────────────────────────────────────────────────────────┐
                     │                       Schema Catalog                        │
                     │  • Namespace → Database mapping                             │
                     │  • Topic → Table mapping                                    │
                     │  • Schema version management                                │
                     └─────────────────────────────────────────────────────────────┘
                                                    ▲
                                                    │ Metadata
                                                    │
    ┌─────────────────────────────────────────────────────────────────────────────┐
    │                              MQ Storage Layer                               │
    │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      ▲  │
    │  │   Topic A   │  │   Topic B   │  │   Topic C   │  │     ...     │      │  │
    │  │  (Parquet)  │  │  (Parquet)  │  │  (Parquet)  │  │  (Parquet)  │      │  │
    │  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘      │  │
    └──────────────────────────────────────────────────────────────────────────│──┘
                                                                               │ Data Access
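
The Schema Catalog's two mappings (namespace to database, topic to table) can be sketched as a nested dictionary that backs the metadata commands. This Python is a toy model, not the planned implementation: the class, its method names, and the column type names are all hypothetical, and schema version management is left out.

```python
from typing import Dict, List, Tuple

Column = Tuple[str, str]  # (column name, type name); type names are illustrative


class SchemaCatalog:
    """Toy catalog: namespace -> topic -> ordered list of columns."""

    def __init__(self) -> None:
        self._namespaces: Dict[str, Dict[str, List[Column]]] = {}

    def register_topic(self, namespace: str, topic: str, columns: List[Column]) -> None:
        """Record (or overwrite) the schema discovered for a topic."""
        self._namespaces.setdefault(namespace, {})[topic] = list(columns)

    def show_databases(self) -> List[str]:
        """SHOW DATABASES: every namespace is exposed as a database."""
        return sorted(self._namespaces)

    def show_tables(self, namespace: str) -> List[str]:
        """SHOW TABLES: every topic in the namespace is exposed as a table."""
        return sorted(self._namespaces[namespace])

    def describe(self, namespace: str, topic: str) -> List[Column]:
        """DESCRIBE: return the topic's columns in schema order."""
        return list(self._namespaces[namespace][topic])


catalog = SchemaCatalog()
catalog.register_topic("my_namespace", "user_events",
                       [("user_id", "INT64"), ("event_type", "STRING"), ("timestamp", "INT64")])
```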



## Success Metrics

*   **Feature Completeness:** Support for all specified SELECT operations and metadata commands
*   **Performance:** 
    *   **Simple SELECT queries**: < 100ms latency for single-table queries with up to 3 WHERE predicates on ≤ 100K records
    *   **Complex queries**: < 1s latency for queries involving aggregations (COUNT, SUM, MAX, MIN) on ≤ 1M records
    *   **Time-range queries**: < 500ms for timestamp-based filtering on ≤ 500K records within 24-hour windows
*   **Scalability:** Handle topics with millions of messages efficiently
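
One way to check the latency targets above is a small timing harness. The Python sketch below is an assumption-laden illustration: the plan does not say whether the targets are averages or tail latencies, so this version compares the slowest observed run against the target, and `run_query` stands in for whatever callable actually executes a query.

```python
import time
from typing import Callable, List


def measure_latency_ms(run_query: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run the query `repeats` times and return each wall-clock duration in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def meets_target(samples_ms: List[float], target_ms: float) -> bool:
    """True if there is at least one sample and the slowest one is under the target."""
    return bool(samples_ms) and max(samples_ms) < target_ms
```

For example, `meets_target(measure_latency_ms(run_simple_select), 100)` would check the simple SELECT target, where `run_simple_select` is a placeholder for a real query call.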
