SQL Query Engine Feature, Dev, and Test Plan
This document outlines the plan for adding SQL querying support to SeaweedFS, focusing on reading and analyzing data from Message Queue (MQ) topics.
Feature Plan
1. Goal
Provide a SQL querying interface for SeaweedFS that supports analytics on existing MQ topics. This enables:
- Basic querying with SELECT, WHERE, aggregations on MQ topics
- Schema discovery and metadata operations (SHOW DATABASES, SHOW TABLES, DESCRIBE)
- In-place analytics on Parquet-stored messages without data movement
2. Key Features
- Schema Discovery and Metadata:
- `SHOW DATABASES` - List all MQ namespaces
- `SHOW TABLES` - List all topics in a namespace
- `DESCRIBE table_name` - Show topic schema details
- Automatic schema detection from existing Parquet data
- Basic Query Engine:
- `SELECT` support with `WHERE`, `LIMIT`, `OFFSET`
- Aggregation functions: `COUNT()`, `SUM()`, `AVG()`, `MIN()`, `MAX()`
- Temporal queries with timestamp-based filtering
- User Interfaces:
- New CLI command `weed sql` with interactive shell mode
- Optional: Web UI for query execution and result visualization
- Output Formats:
- JSON (default), CSV, Parquet for result sets
- Streaming results for large queries
- Pagination support for result navigation
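To make the output-format and pagination items above concrete, here is a minimal Python sketch that renders one page of a result set as JSON or CSV. The function name `render_page` and the row layout (a list of dicts) are assumptions for illustration only; they are not part of SeaweedFS.

```python
import csv
import io
import json


def render_page(rows, fmt="json", limit=100, offset=0):
    """Render one page of a result set (a list of dicts) as JSON or CSV text."""
    page = rows[offset:offset + limit]  # LIMIT/OFFSET-style pagination
    if fmt == "json":
        return json.dumps(page)
    if fmt == "csv":
        buf = io.StringIO()
        if page:
            # Column order follows the keys of the first row on the page.
            writer = csv.DictWriter(
                buf, fieldnames=list(page[0].keys()), lineterminator="\n"
            )
            writer.writeheader()
            writer.writerows(page)
        return buf.getvalue()
    raise ValueError(f"unsupported format: {fmt}")


# Hypothetical sample rows, not real topic data.
rows = [{"user_id": i, "event_type": "click"} for i in range(5)]
print(render_page(rows, fmt="csv", limit=2, offset=2))
```

Slicing a fully materialized list is only suitable for small results; a streaming implementation would skip and take rows from an iterator instead.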
Development Plan
3. Data Source Integration
- MQ Topic Connector (Primary):
- Build on existing `weed/mq/logstore/read_parquet_to_log.go`
- Implement efficient Parquet scanning with predicate pushdown
- Support schema evolution and backward compatibility
- Handle partition-based parallelism for scalable queries
- Schema Registry Integration:
- Extend `weed/mq/schema/schema.go` for SQL metadata operations
- Read existing topic schemas for query planning
- Handle schema evolution during query execution
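The predicate pushdown item above can be illustrated with a small sketch. Parquet files can carry per-row-group min/max statistics, which let a scanner skip whole row groups that cannot satisfy a filter. The `RowGroup` class and `scan_newer_than` function below are hypothetical stand-ins written in Python for brevity; they do not reflect the actual SeaweedFS reader.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group with min/max statistics."""
    min_ts: int
    max_ts: int
    rows: list  # list of dicts, each with a "timestamp" key


def scan_newer_than(row_groups, threshold):
    """Yield rows with timestamp > threshold, skipping groups that cannot match."""
    for group in row_groups:
        if group.max_ts <= threshold:
            continue  # pushdown: statistics prove no row in this group qualifies
        for row in group.rows:
            if row["timestamp"] > threshold:
                yield row


# Illustrative data only.
groups = [
    RowGroup(100, 199, [{"timestamp": 100}, {"timestamp": 199}]),
    RowGroup(200, 299, [{"timestamp": 200}, {"timestamp": 299}]),
]
matches = list(scan_newer_than(groups, 199))
```

In this example the first row group is never read row by row, because its maximum timestamp already fails the filter.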
4. API & CLI Integration
- CLI Command:
- New `weed sql` command with interactive shell mode (similar to `weed shell`)
- Support for script execution and result formatting
- Connection management for remote SeaweedFS clusters
- gRPC API:
- Add SQL service to existing MQ broker gRPC interface
- Enable efficient query execution with streaming results
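Script execution implies splitting a script into individual statements before running them. The sketch below shows one simple way to do that in Python, assuming statements end with `;` and string literals use single quotes. The function name `split_statements` is hypothetical, and the sketch deliberately ignores `--` comments, which a real implementation would need to handle.

```python
def split_statements(script):
    """Split a SQL script on semicolons, ignoring semicolons inside
    single-quoted string literals. Comments are not handled (assumption)."""
    statements, current, in_string = [], [], False
    for ch in script:
        if ch == "'":
            # A doubled quote ('') toggles twice, so the state is preserved.
            in_string = not in_string
        if ch == ";" and not in_string:
            text = "".join(current).strip()
            if text:
                statements.append(text)
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)  # final statement without a trailing semicolon
    return statements


statements = split_statements("USE my_namespace; SHOW TABLES;")
```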
Example Usage Scenarios
Scenario 1: Schema Discovery and Metadata
-- List all namespaces (databases)
SHOW DATABASES;
-- List topics in a namespace
USE my_namespace;
SHOW TABLES;
-- View topic structure and discovered schema
DESCRIBE user_events;
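The metadata commands above map naturally onto a namespace-to-topic-to-schema lookup. The following Python sketch models that mapping with an in-memory dictionary. The catalog contents, column types, and function names are all assumptions for illustration; the real schema would come from the MQ schema registry.

```python
# Hypothetical in-memory catalog: namespace -> topic -> list of (column, type).
CATALOG = {
    "my_namespace": {
        "user_events": [
            ("user_id", "INT64"),      # assumed type
            ("event_type", "STRING"),  # assumed type
            ("timestamp", "INT64"),    # assumed type
        ],
    },
}


def show_databases(catalog):
    """SHOW DATABASES: each MQ namespace is exposed as a database."""
    return sorted(catalog)


def show_tables(catalog, namespace):
    """SHOW TABLES: each topic in the namespace is exposed as a table."""
    return sorted(catalog[namespace])


def describe(catalog, namespace, topic):
    """DESCRIBE: return the topic's columns and types."""
    return list(catalog[namespace][topic])
```

For example, `show_tables(CATALOG, "my_namespace")` returns `["user_events"]`.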
Scenario 2: Data Querying
-- Basic filtering and projection
SELECT user_id, event_type, timestamp
FROM user_events
WHERE timestamp > 1640995200000
LIMIT 100;
-- Aggregation queries
SELECT COUNT(*) as event_count
FROM user_events
WHERE timestamp >= 1640995200000;
-- More aggregation examples
SELECT MAX(timestamp), MIN(timestamp)
FROM user_events;
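The numeric literals in these filters look like Unix epoch timestamps in milliseconds; that unit is an assumption based on their magnitude, not something this plan states. Under that assumption, the Python helpers below (hypothetical names) convert between a UTC datetime and the value used in a `WHERE timestamp > ...` clause.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt):
    """Convert a timezone-aware datetime to Unix epoch milliseconds."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(ms):
    """Convert Unix epoch milliseconds to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


print(from_epoch_millis(1640995200000).isoformat())  # 2022-01-01T00:00:00+00:00
```

So, if the unit is milliseconds, the example queries select events from the start of 2022 (UTC) onward.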
Scenario 3: Analytics & Monitoring
-- Basic analytics
SELECT COUNT(*) as total_events
FROM user_events
WHERE timestamp >= 1640995200000;
-- Simple monitoring
SELECT AVG(response_time) as avg_response
FROM api_logs
WHERE timestamp >= 1640995200000;
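When a topic is scanned partition by partition, aggregates such as `AVG` cannot be combined by averaging per-partition averages; each partition needs to report a mergeable partial state (count and sum) instead. The Python sketch below shows one way this could work. The function names and the sample values are hypothetical and are not taken from the SeaweedFS code base.

```python
from functools import reduce


def partial_aggregate(values):
    """Compute mergeable partial state for COUNT/SUM/MIN/MAX over one partition."""
    values = list(values)
    if not values:
        return {"count": 0, "sum": 0, "min": None, "max": None}
    return {
        "count": len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
    }


def merge(a, b):
    """Combine two partial states into one; the order of merging does not matter."""
    mins = [m for m in (a["min"], b["min"]) if m is not None]
    maxs = [m for m in (a["max"], b["max"]) if m is not None]
    return {
        "count": a["count"] + b["count"],
        "sum": a["sum"] + b["sum"],
        "min": min(mins) if mins else None,
        "max": max(maxs) if maxs else None,
    }


def finalize_avg(state):
    """AVG is derived at the end from the merged sum and count."""
    return state["sum"] / state["count"] if state["count"] else None


# Illustrative response_time values for three partitions (one is empty).
partitions = [[120, 80], [100], []]
total = reduce(merge, (partial_aggregate(p) for p in partitions))
avg_response = finalize_avg(total)
```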
## Architecture Overview
SQL Query Flow:
                     1. Parse SQL      2. Plan & Optimize     3. Execute Query
┌─────────────┐    ┌──────────────┐    ┌─────────────────┐    ┌──────────────┐
│   Client    │    │  SQL Parser  │    │  Query Planner  │    │  Execution   │
│    (CLI)    │──→ │  PostgreSQL  │──→ │   & Optimizer   │──→ │    Engine    │
│             │    │   (Custom)   │    │                 │    │              │
└─────────────┘    └──────────────┘    └─────────────────┘    └──────────────┘
                                                │                     │
                                                │ Schema Lookup       │ Data Access
                                                ▼                     ▼
                   ┌─────────────────────────────────────────────────────────────┐
                   │                       Schema Catalog                        │
                   │  • Namespace → Database mapping                             │
                   │  • Topic → Table mapping                                    │
                   │  • Schema version management                                │
                   └─────────────────────────────────────────────────────────────┘
                                                ▲
                                                │ Metadata
                                                │
┌─────────────────────────────────────────────────────────────────────────────┐
│                              MQ Storage Layer                               │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      ▲  │
│  │   Topic A   │  │   Topic B   │  │   Topic C   │  │     ...     │      │  │
│  │  (Parquet)  │  │  (Parquet)  │  │  (Parquet)  │  │  (Parquet)  │      │  │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘      │  │
└──────────────────────────────────────────────────────────────────────────│──┘
                                                                           │
                                                                      Data Access
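The three stages of the query flow (parse, plan, execute) can be sketched end to end for a single, deliberately tiny query shape. This Python example uses a regular expression in place of a real SQL parser and plain dictionaries in place of the schema catalog and MQ storage layer. Every name in it is hypothetical, and it is meant only to show how the stages hand work to each other.

```python
import re

# Hypothetical grammar: SELECT COUNT(*) FROM <table> [WHERE <column> >= <integer>]
_QUERY = re.compile(
    r"^\s*SELECT\s+COUNT\(\*\)\s+FROM\s+(\w+)"
    r"(?:\s+WHERE\s+(\w+)\s*>=\s*(\d+))?\s*;?\s*$",
    re.IGNORECASE,
)


def parse(sql):
    """Step 1: turn SQL text into a small dict-based AST."""
    match = _QUERY.match(sql)
    if match is None:
        raise ValueError(f"unsupported query: {sql!r}")
    table, column, bound = match.groups()
    where = None if column is None else (column, int(bound))
    return {"table": table, "where": where}


def plan(ast, catalog):
    """Step 2: resolve the table and filter column against the schema catalog."""
    if ast["table"] not in catalog:
        raise KeyError(f"unknown table: {ast['table']}")
    if ast["where"] is not None and ast["where"][0] not in catalog[ast["table"]]:
        raise KeyError(f"unknown column: {ast['where'][0]}")
    return ast  # no optimization in this sketch; the validated AST is the plan


def execute(query_plan, storage):
    """Step 3: scan the table's rows and count those passing the filter."""
    rows = storage[query_plan["table"]]
    where = query_plan["where"]
    if where is None:
        return len(rows)
    column, bound = where
    return sum(1 for row in rows if row[column] >= bound)


# Illustrative catalog and data only.
catalog = {"user_events": {"user_id", "timestamp"}}
storage = {"user_events": [{"user_id": 1, "timestamp": 5},
                           {"user_id": 2, "timestamp": 15}]}
sql = "SELECT COUNT(*) FROM user_events WHERE timestamp >= 10;"
result = execute(plan(parse(sql), catalog), storage)
```

Keeping the catalog lookup in the planning step means unknown tables and columns are rejected before any data is scanned.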
## Success Metrics
* **Feature Completeness:** Support for all specified SELECT operations and metadata commands
* **Performance:**
* **Simple SELECT queries**: < 100ms latency for single-table queries with up to 3 WHERE predicates on ≤ 100K records
* **Complex queries**: < 1s latency for queries involving aggregations (COUNT, SUM, MAX, MIN) on ≤ 1M records
* **Time-range queries**: < 500ms for timestamp-based filtering on ≤ 500K records within 24-hour windows
* **Scalability:** Handle topics with millions of messages efficiently