1
0
Fork 0
mirror of https://github.com/chrislusf/seaweedfs synced 2025-08-16 17:12:46 +02:00
seaweedfs/docker/admin_integration/EC-TESTING-README.md
Chris Lu 891a2fb6eb
Admin: misc improvements on admin server and workers. EC now works. (#7055)
* initial design

* added simulation as tests

* reorganized the codebase to move the simulation framework and tests into their own dedicated package

* integration test. ec worker task

* remove "enhanced" reference

* start master, volume servers, filer

Current Status
 Master: Healthy and running (port 9333)
 Filer: Healthy and running (port 8888)
 Volume Servers: All 6 servers running (ports 8080-8085)
🔄 Admin/Workers: Will start when dependencies are ready

* generate write load

* tasks are assigned

* admin start wtih grpc port. worker has its own working directory

* Update .gitignore

* working worker and admin. Task detection is not working yet.

* compiles, detection uses volumeSizeLimitMB from master

* compiles

* worker retries connecting to admin

* build and restart

* rendering pending tasks

* skip task ID column

* sticky worker id

* test canScheduleTaskNow

* worker reconnect to admin

* clean up logs

* worker register itself first

* worker can run ec work and report status

but:
1. one volume should not be repeatedly worked on.
2. ec shards needs to be distributed and source data should be deleted.

* move ec task logic

* listing ec shards

* local copy, ec. Need to distribute.

* ec is mostly working now

* distribution of ec shards needs improvement
* need configuration to enable ec

* show ec volumes

* interval field UI component

* rename

* integration test with vauuming

* garbage percentage threshold

* fix warning

* display ec shard sizes

* fix ec volumes list

* Update ui.go

* show default values

* ensure correct default value

* MaintenanceConfig use ConfigField

* use schema defined defaults

* config

* reduce duplication

* refactor to use BaseUIProvider

* each task register its schema

* checkECEncodingCandidate use ecDetector

* use vacuumDetector

* use volumeSizeLimitMB

* remove

remove

* remove unused

* refactor

* use new framework

* remove v2 reference

* refactor

* left menu can scroll now

* The maintenance manager was not being initialized when no data directory was configured for persistent storage.

* saving config

* Update task_config_schema_templ.go

* enable/disable tasks

* protobuf encoded task configurations

* fix system settings

* use ui component

* remove logs

* interface{} Reduction

* reduce interface{}

* reduce interface{}

* avoid from/to map

* reduce interface{}

* refactor

* keep it DRY

* added logging

* debug messages

* debug level

* debug

* show the log caller line

* use configured task policy

* log level

* handle admin heartbeat response

* Update worker.go

* fix EC rack and dc count

* Report task status to admin server

* fix task logging, simplify interface checking, use erasure_coding constants

* factor in empty volume server during task planning

* volume.list adds disk id

* track disk id also

* fix locking scheduled and manual scanning

* add active topology

* simplify task detector

* ec task completed, but shards are not showing up

* implement ec in ec_typed.go

* adjust log level

* dedup

* implementing ec copying shards and only ecx files

* use disk id when distributing ec shards

🎯 Planning: ActiveTopology creates DestinationPlan with specific TargetDisk
📦 Task Creation: maintenance_integration.go creates ECDestination with DiskId
🚀 Task Execution: EC task passes DiskId in VolumeEcShardsCopyRequest
💾 Volume Server: Receives disk_id and stores shards on specific disk (vs.store.Locations[req.DiskId])
📂 File System: EC shards and metadata land in the exact disk directory planned

* Delete original volume from all locations

* clean up existing shard locations

* local encoding and distributing

* Update docker/admin_integration/EC-TESTING-README.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* check volume id range

* simplify

* fix tests

* fix types

* clean up logs and tests

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-30 12:38:03 -07:00

438 lines
No EOL
11 KiB
Markdown

# SeaweedFS EC Worker Testing Environment
This Docker Compose setup provides a comprehensive testing environment for SeaweedFS Erasure Coding (EC) workers using **official SeaweedFS commands**.
## 📂 Directory Structure
The testing environment is located in `docker/admin_integration/` and includes:
```
docker/admin_integration/
├── Makefile # Main management interface
├── docker-compose-ec-test.yml # Docker compose configuration
├── EC-TESTING-README.md # This documentation
└── run-ec-test.sh # Quick start script
```
## 🏗️ Architecture
The testing environment uses **official SeaweedFS commands** and includes:
- **1 Master Server** (port 9333) - Coordinates the cluster with 50MB volume size limit
- **6 Volume Servers** (ports 8080-8085) - Distributed across 2 data centers and 3 racks for diversity
- **1 Filer** (port 8888) - Provides file system interface
- **1 Admin Server** (port 23646) - Detects volumes needing EC and manages workers using official `admin` command
- **3 EC Workers** - Execute erasure coding tasks using official `worker` command with task-specific working directories
- **1 Load Generator** - Continuously writes and deletes files using SeaweedFS shell commands
- **1 Monitor** - Tracks cluster health and EC progress using shell scripts
## ✨ New Features
### **Task-Specific Working Directories**
Each worker now creates dedicated subdirectories for different task types:
- `/work/erasure_coding/` - For EC encoding tasks
- `/work/vacuum/` - For vacuum cleanup tasks
- `/work/balance/` - For volume balancing tasks
This provides:
- **Organization**: Each task type gets isolated working space
- **Debugging**: Easy to find files/logs related to specific task types
- **Cleanup**: Can clean up task-specific artifacts easily
- **Concurrent Safety**: Different task types won't interfere with each other's files
## 🚀 Quick Start
### Prerequisites
- Docker and Docker Compose installed
- GNU Make installed
- At least 4GB RAM available for containers
- Ports 8080-8085, 8888, 9333, 23646 available
### Start the Environment
```bash
# Navigate to the admin integration directory
cd docker/admin_integration/
# Show available commands
make help
# Start the complete testing environment
make start
```
The `make start` command will:
1. Start all services using official SeaweedFS images
2. Configure workers with task-specific working directories
3. Wait for services to be ready
4. Display monitoring URLs and run health checks
### Alternative Commands
```bash
# Quick start aliases
make up # Same as 'make start'
# Development mode (higher load for faster testing)
make dev-start
# Build images without starting
make build
```
## 📋 Available Make Targets
Run `make help` to see all available targets:
### **🚀 Main Operations**
- `make start` - Start the complete EC testing environment
- `make stop` - Stop all services
- `make restart` - Restart all services
- `make clean` - Complete cleanup (containers, volumes, images)
### **📊 Monitoring & Status**
- `make health` - Check health of all services
- `make status` - Show status of all containers
- `make urls` - Display all monitoring URLs
- `make monitor` - Open monitor dashboard in browser
- `make monitor-status` - Show monitor status via API
- `make volume-status` - Show volume status from master
- `make admin-status` - Show admin server status
- `make cluster-status` - Show complete cluster status
### **📋 Logs Management**
- `make logs` - Show logs from all services
- `make logs-admin` - Show admin server logs
- `make logs-workers` - Show all worker logs
- `make logs-worker1/2/3` - Show specific worker logs
- `make logs-load` - Show load generator logs
- `make logs-monitor` - Show monitor logs
- `make backup-logs` - Backup all logs to files
### **⚖️ Scaling & Testing**
- `make scale-workers WORKERS=5` - Scale workers to 5 instances
- `make scale-load RATE=25` - Increase load generation rate
- `make test-ec` - Run focused EC test scenario
### **🔧 Development & Debug**
- `make shell-admin` - Open shell in admin container
- `make shell-worker1` - Open shell in worker container
- `make debug` - Show debug information
- `make troubleshoot` - Run troubleshooting checks
## 📊 Monitoring URLs
| Service | URL | Description |
|---------|-----|-------------|
| Master UI | http://localhost:9333 | Cluster status and topology |
| Filer | http://localhost:8888 | File operations |
| Admin Server | http://localhost:23646/ | Task management |
| Monitor | http://localhost:9999/status | Complete cluster monitoring |
| Volume Servers | http://localhost:8080-8085/status | Individual volume server stats |
Quick access: `make urls` or `make monitor`
## 🔄 How EC Testing Works
### 1. Continuous Load Generation
- **Write Rate**: 10 files/second (1-5MB each)
- **Delete Rate**: 2 files/second
- **Target**: Fill volumes to 50MB limit quickly
### 2. Volume Detection
- Admin server scans master every 30 seconds
- Identifies volumes >40MB (80% of 50MB limit)
- Queues EC tasks for eligible volumes
### 3. EC Worker Assignment
- **Worker 1**: EC specialist (max 2 concurrent tasks)
- **Worker 2**: EC + Vacuum hybrid (max 2 concurrent tasks)
- **Worker 3**: EC + Vacuum hybrid (max 1 concurrent task)
### 4. Comprehensive EC Process
Each EC task follows 6 phases:
1. **Copy Volume Data** (5-15%) - Stream .dat/.idx files locally
2. **Mark Read-Only** (20-25%) - Ensure data consistency
3. **Local Encoding** (30-60%) - Create 14 shards (10+4 Reed-Solomon)
4. **Calculate Placement** (65-70%) - Smart rack-aware distribution
5. **Distribute Shards** (75-90%) - Upload to optimal servers
6. **Verify & Cleanup** (95-100%) - Validate and clean temporary files
### 5. Real-Time Monitoring
- Volume analysis and EC candidate detection
- Worker health and task progress
- No data loss verification
- Performance metrics
## 📋 Key Features Tested
### ✅ EC Implementation Features
- [x] Local volume data copying with progress tracking
- [x] Local Reed-Solomon encoding (10+4 shards)
- [x] Intelligent shard placement with rack awareness
- [x] Load balancing across available servers
- [x] Backup server selection for redundancy
- [x] Detailed step-by-step progress tracking
- [x] Comprehensive error handling and recovery
### ✅ Infrastructure Features
- [x] Multi-datacenter topology (dc1, dc2)
- [x] Rack diversity (rack1, rack2, rack3)
- [x] Volume size limits (50MB)
- [x] Worker capability matching
- [x] Health monitoring and alerting
- [x] Continuous workload simulation
## 🛠️ Common Usage Patterns
### Basic Testing Workflow
```bash
# Start environment
make start
# Watch progress
make monitor-status
# Check for EC candidates
make volume-status
# View worker activity
make logs-workers
# Stop when done
make stop
```
### High-Load Testing
```bash
# Start with higher load
make dev-start
# Scale up workers and load
make scale-workers WORKERS=5
make scale-load RATE=50
# Monitor intensive EC activity
make logs-admin
```
### Debugging Issues
```bash
# Check port conflicts and system state
make troubleshoot
# View specific service logs
make logs-admin
make logs-worker1
# Get shell access for debugging
make shell-admin
make shell-worker1
# Check detailed status
make debug
```
### Development Iteration
```bash
# Quick restart after code changes
make restart
# Rebuild and restart
make clean
make start
# Monitor specific components
make logs-monitor
```
## 📈 Expected Results
### Successful EC Testing Shows:
1. **Volume Growth**: Steady increase in volume sizes toward 50MB limit
2. **EC Detection**: Admin server identifies volumes >40MB for EC
3. **Task Assignment**: Workers receive and execute EC tasks
4. **Shard Distribution**: 14 shards distributed across 6 volume servers
5. **No Data Loss**: All files remain accessible during and after EC
6. **Performance**: EC tasks complete within estimated timeframes
### Sample Monitor Output:
```bash
# Check current status
make monitor-status
# Output example:
{
"monitor": {
"uptime": "15m30s",
"master_addr": "master:9333",
"admin_addr": "admin:9900"
},
"stats": {
"VolumeCount": 12,
"ECTasksDetected": 3,
"WorkersActive": 3
}
}
```
## 🔧 Configuration
### Environment Variables
You can customize the environment by setting variables:
```bash
# High load testing
WRITE_RATE=25 DELETE_RATE=5 make start
# Extended test duration
TEST_DURATION=7200 make start # 2 hours
```
### Scaling Examples
```bash
# Scale workers
make scale-workers WORKERS=6
# Increase load generation
make scale-load RATE=30
# Combined scaling
make scale-workers WORKERS=4
make scale-load RATE=40
```
## 🧹 Cleanup Options
```bash
# Stop services only
make stop
# Remove containers but keep volumes
make down
# Remove data volumes only
make clean-volumes
# Remove built images only
make clean-images
# Complete cleanup (everything)
make clean
```
## 🐛 Troubleshooting
### Quick Diagnostics
```bash
# Run complete troubleshooting
make troubleshoot
# Check specific components
make health
make debug
make status
```
### Common Issues
**Services not starting:**
```bash
# Check port availability
make troubleshoot
# View startup logs
make logs-master
make logs-admin
```
**No EC tasks being created:**
```bash
# Check volume status
make volume-status
# Increase load to fill volumes faster
make scale-load RATE=30
# Check admin detection
make logs-admin
```
**Workers not responding:**
```bash
# Check worker registration
make admin-status
# View worker logs
make logs-workers
# Restart workers
make restart
```
### Performance Tuning
**For faster testing:**
```bash
make dev-start # Higher default load
make scale-load RATE=50 # Very high load
```
**For stress testing:**
```bash
make scale-workers WORKERS=8
make scale-load RATE=100
```
## 📚 Technical Details
### Network Architecture
- Custom bridge network (172.20.0.0/16)
- Service discovery via container names
- Health checks for all services
### Storage Layout
- Each volume server: max 100 volumes
- Data centers: dc1, dc2
- Racks: rack1, rack2, rack3
- Volume limit: 50MB per volume
### EC Algorithm
- Reed-Solomon RS(10,4)
- 10 data shards + 4 parity shards
- Rack-aware distribution
- Backup server redundancy
### Make Integration
- Color-coded output for better readability
- Comprehensive help system (`make help`)
- Parallel execution support
- Error handling and cleanup
- Cross-platform compatibility
## 🎯 Quick Reference
```bash
# Essential commands
make help # Show all available targets
make start # Start complete environment
make health # Check all services
make monitor # Open dashboard
make logs-admin # View admin activity
make clean # Complete cleanup
# Monitoring
make volume-status # Check for EC candidates
make admin-status # Check task queue
make monitor-status # Full cluster status
# Scaling & Testing
make test-ec # Run focused EC test
make scale-load RATE=X # Increase load
make troubleshoot # Diagnose issues
```
This environment provides a realistic testing scenario for SeaweedFS EC workers with actual data operations, comprehensive monitoring, and easy management through Make targets.