Chapter 7: Implementation for Human-Robot Musical Ensemble

Real-Time Multimodal Synchronization and Robotic Ensemble

Sutirtha Chakraborty
Maynooth University

🎵 Introduction

Chapter Overview

This chapter presents the culmination of our research journey, demonstrating the practical application of theoretical frameworks from previous chapters. We integrate:

  • Audio-based beat detection (Chapter 4)
  • Visual cues extraction (Chapter 5)
  • Multimodal synchronization approaches (Chapter 6)

Research Objective

Create an environment where human musicians and robotic agents interact musically, adapting to each other's:

  • 🎼 Tempo variations
  • 🤲 Gestural cues
  • 🎭 Non-verbal communication

Figure: A participant interacting with the system, demonstrating gesture-based synchronization.

Key Contributions

🔍 Real-time Integration

Combines visual pose estimation, audio beat detection, and Kuramoto synchronization in real time

🤖 Robotic Implementation

Physical robotic system that responds to human gestures and musical cues

📊 Experimental Validation

Comprehensive user studies demonstrating system robustness and adaptability

🏗️ System Overview

System Pipeline

📹 Data Acquisition
Live video & audio capture
⬇️
🤸 Pose Estimation
YOLO-based keypoint detection
⬇️
📈 Motion Analysis & BPM Inference
Temporal pattern analysis
⬇️
🔄 Kuramoto Synchronization
Phase alignment & coupling
⬇️
🎛️ Multimodal Integration
Audio + visual fusion
⬇️
🎹 MIDI Output & Robotic Control
Real-time instrument triggering

System Parameters

| Parameter | Description | Typical Values | Impact |
|---|---|---|---|
| Video Frame Rate | Input video capture rate | 25-30 fps | Higher = better temporal resolution |
| Pose Confidence Threshold | YOLO confidence for keypoint detection | 0.3-0.5 | Lower = more detections, higher noise |
| Motion Buffer Size | Frames stored for BPM estimation | 30 frames | Larger = more stable, slower adaptation |
| Natural Frequency | Baseline oscillator frequency (~120 BPM) | 2.0 Hz | Default tempo when no input is detected |
| Coupling Strength | Kuramoto oscillator coupling | 0.1-0.2 | Higher = stronger synchronization |
| MIDI Velocity | Note intensity for percussion | 90-110 (of 127) | Controls robotic strike force |
| BPM Range | Allowed tempo estimation range | 60-180 BPM | Filters unrealistic tempo estimates |
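
For reference, these parameters can be gathered into a single configuration object. The sketch below is illustrative only: the field names and defaults are chosen here to mirror the "Typical Values" column, not taken from the system's source code.

```python
from dataclasses import dataclass

@dataclass
class EnsembleConfig:
    # Illustrative defaults mirroring the "Typical Values" column above.
    frame_rate: int = 30              # fps of the incoming video stream
    pose_confidence: float = 0.4      # YOLO keypoint confidence threshold
    motion_buffer_size: int = 30      # frames kept for BPM estimation
    natural_freq_hz: float = 2.0      # ~120 BPM fallback oscillator frequency
    coupling_strength: float = 0.15   # Kuramoto coupling constant K
    midi_velocity: int = 100          # percussion note intensity (0-127)
    bpm_min: float = 60.0             # lower bound on accepted tempo estimates
    bpm_max: float = 180.0            # upper bound on accepted tempo estimates

config = EnsembleConfig()
```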

⚙️ Technical Components

🎯 Visual Pose Estimation

Technology: YOLO-based pose detection

Keypoints: Wrists, elbows, shoulders

Performance: 25 fps on GPU

Focus: Wrist motion for tempo inference

Why YOLO-Pose-Tiny?
Optimized for real-time performance while maintaining accuracy for gesture recognition
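
As a concrete illustration of this step, the sketch below extracts right-wrist coordinates from a video frame. It assumes the Ultralytics YOLO pose API and the COCO keypoint ordering (index 10 = right wrist); the chapter's YOLO-Pose-Tiny weights are represented here by a generic pose checkpoint.

```python
# Minimal wrist-keypoint extraction sketch (assumed Ultralytics API and a
# generic pose checkpoint standing in for the chapter's tiny pose model).
from ultralytics import YOLO
import cv2

model = YOLO("yolov8n-pose.pt")            # stand-in for YOLO-Pose-Tiny weights

def wrist_positions(frame, conf=0.4):
    """Return a list of (x, y) right-wrist coordinates, one per detected person."""
    results = model(frame, conf=conf, verbose=False)
    wrists = []
    for person in results[0].keypoints.xy:  # shape: (num_keypoints, 2) per person
        x, y = person[10].tolist()          # COCO index 10 = right wrist
        wrists.append((x, y))
    return wrists

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
if ok:
    print(wrist_positions(frame))
```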

📊 Motion-to-Tempo Conversion

Method: Peak detection algorithms

Buffer: Rolling position & timestamp data

Smoothing: Prevents sudden BPM jumps

BPM = 60 / mean(Δt_peaks)

Figure: System interface. The detected BPM of each participant is displayed in an oscillatory circle; the dotted red line represents the Kuramoto phase, and the circles in the top right show 4/4 beat bars.
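
A minimal sketch of this motion-to-tempo step is shown below: peaks in the normalized wrist trajectory are treated as beats, BPM is taken as 60 over the mean inter-peak interval, and exponential smoothing prevents sudden jumps. The prominence and smoothing values are illustrative assumptions, not values reported in this chapter.

```python
# Peak-detection tempo estimate: BPM = 60 / mean(Δt_peaks), with smoothing.
import numpy as np
from scipy.signal import find_peaks

def estimate_bpm(y_positions, timestamps, prev_bpm=120.0, alpha=0.3):
    """Estimate tempo from a rolling buffer of wrist heights and timestamps."""
    y = np.asarray(y_positions, dtype=float)
    y = (y - y.mean()) / (y.std() + 1e-9)           # normalize the motion buffer
    peaks, _ = find_peaks(y, prominence=0.5)        # candidate beat instants
    if len(peaks) < 2:
        return prev_bpm                             # too little motion: hold tempo
    intervals = np.diff(np.asarray(timestamps)[peaks])
    bpm = float(np.clip(60.0 / intervals.mean(), 60.0, 180.0))   # BPM range filter
    return alpha * bpm + (1 - alpha) * prev_bpm     # smoothing against BPM jumps
```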

🔄 Kuramoto Synchronization Model

Mathematical Foundation

Each participant is represented as an oscillator with:

  • Phase θᵢ: Current beat position
  • Frequency ωᵢ: BPMᵢ/60 (intrinsic tempo)
Phase Update Rule:
dθᵢ/dt = ωᵢ + (K/N) Σⱼ sin(θⱼ - θᵢ)

Where:
  • K = coupling strength
  • N = number of participants
  • Coupling term drives synchronization

Global Phase & Frequency

Global Phase: Circular mean of all participant phases

Global BPM: ω_global × 60

Convergence: Phases align over time through differential adjustment

2-3s Typical Convergence Time
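
The update rule and global readout can be sketched in a few lines, assuming phases kept in radians (so each ωᵢ = BPMᵢ/60, in cycles per second, is scaled by 2π) and a per-frame Euler step; both choices are implementation assumptions rather than details fixed by the chapter.

```python
# Kuramoto phase update and global phase/BPM readout (illustrative sketch).
import numpy as np

def kuramoto_step(theta, bpm, K=0.15, dt=1/30):
    """Advance all oscillator phases by one video frame."""
    omega = 2 * np.pi * np.asarray(bpm) / 60.0              # intrinsic frequency, rad/s
    N = len(theta)
    coupling = np.array([np.sum(np.sin(theta - t)) for t in theta])
    dtheta = omega + (K / N) * coupling                      # dθ_i/dt from the update rule
    return (theta + dtheta * dt) % (2 * np.pi)

def global_phase_and_bpm(theta, bpm):
    """Circular mean of the phases and the ensemble tempo (ω_global × 60)."""
    phase = np.angle(np.mean(np.exp(1j * np.asarray(theta)))) % (2 * np.pi)
    return phase, float(np.mean(bpm))
```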

🎹 MIDI Event Generation & Robotic Control

Rhythmic Pattern Implementation

| Instrument | 16-Step Pattern (steps 1-16) |
|---|---|
| Kick | 10000100 10000100 |
| Snare | 00010001 00010001 |
| Hi-hat | 11111111 11111111 |
| Clap | 00000100 00000100 |

Figure: Robotic arm interface connected via USB to a laptop, controlling a MIDI keyboard for synchronized performance.
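
One way to drive these patterns from the ensemble phase is sketched below: the global Kuramoto phase advances a 16-step counter (four sixteenth-note steps per beat), and each step whose pattern bit is 1 triggers the matching drum note. The General MIDI note numbers and the phase-to-step mapping are illustrative assumptions, not details given in the chapter.

```python
# Pattern-based triggering from the global Kuramoto phase (illustrative sketch).
import math

PATTERNS = {
    36: "1000010010000100",  # Kick   (GM bass drum)
    38: "0001000100010001",  # Snare  (GM acoustic snare)
    42: "1111111111111111",  # Hi-hat (GM closed hi-hat)
    39: "0000010000000100",  # Clap   (GM hand clap)
}

class StepSequencer:
    def __init__(self, patterns=PATTERNS, steps_per_beat=4, bar_steps=16):
        self.patterns = patterns
        self.steps_per_beat = steps_per_beat
        self.bar_steps = bar_steps
        self.step = 0
        self.prev_phase = 0.0

    def advance(self, theta_global):
        """Advance on each sixteenth-note phase boundary; return notes to strike."""
        boundary = 2 * math.pi / self.steps_per_beat     # one sixteenth note of phase
        crossed = int(theta_global // boundary) != int(self.prev_phase // boundary)
        self.prev_phase = theta_global
        if not crossed:
            return []
        self.step = (self.step + 1) % self.bar_steps
        return [note for note, pat in self.patterns.items() if pat[self.step] == "1"]
```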

🔧 Hardware Implementation

System Architecture

The robotic piano interface combines solenoid actuators, microcontroller control, and MIDI communication to create a responsive musical robot that can perform alongside human musicians.

🎹 Robotic Piano Interface

  • Base: Standard electronic keyboard
  • Actuators: One solenoid per key
  • Alignment: Precise key-to-solenoid mapping
  • Response: Tactile feedback simulation

⚡ Solenoid Actuators

  • Power: 12V supply for adequate force
  • Housing: Custom wooden enclosure
  • Driver: ULN2003A Darlington array
  • Protection: Current limiting & isolation

🧠 Control System

  • MCU: Arduino Mega 2560
  • I/O: Multiple digital output pins
  • Processing: Real-time MIDI handling
  • Communication: USB to computer

🔌 Power & Circuitry

  • Voltage: 12V dedicated supply
  • Current: Managed via driver array
  • Protection: Surge & overcurrent
  • Efficiency: On-demand activation

Figure: Circuit schematic of the solenoid control circuit, showing the ULN2003A driver array, the Arduino Mega 2560 interfacing, and the power distribution system.

Integration & Synchronization

Signal Flow

  1. MIDI Generation: Computer generates MIDI events
  2. USB Transfer: Commands sent to Mega 2560
  3. Signal Processing: MCU processes MIDI messages
  4. Driver Control: Digital signals to ULN2003A
  5. Solenoid Activation: Physical key strikes
<10ms MIDI to Physical Response Latency
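
On the computer side, the first two steps of this signal flow amount to emitting MIDI messages on a USB port. The sketch below uses the mido library as a stand-in; the chapter does not name the host-side software, and the default port and hold time are illustrative assumptions.

```python
# Host-side MIDI emission sketch: the Mega 2560 firmware maps each received
# note to a solenoid. mido, the default port, and the hold time are assumptions.
import time
import mido

def strike(port, note, velocity=100, hold_s=0.03):
    """Send a short Note On/Off pair; the MCU energizes the matching solenoid."""
    port.send(mido.Message("note_on", note=note, velocity=velocity))
    time.sleep(hold_s)                      # brief hold so the key is fully depressed
    port.send(mido.Message("note_off", note=note, velocity=0))

with mido.open_output() as port:            # default USB MIDI output device
    strike(port, note=60)                   # e.g., middle C on the robotic keyboard
```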

🔄 Algorithmic Description

Real-Time Multimodal Synchronization Algorithm
INPUT:
• Video frames F_t
• Pre-trained YOLO-Pose model M
• Coupling strength K
• MIDI patterns P
OUTPUT:
• Real-time MIDI events
• Robotic actuator commands
INITIALIZATION:
1. Set initial BPM estimates (BPM_i = 120) for all participants
2. Initialize buffers for wrist positions
3. Initialize oscillator phases θ_i and frequencies ω_i
4. Configure MIDI output devices and robotic interfaces
MAIN LOOP:
WHILE system_running:
  1. Video Acquisition: capture frame F_t
  2. Pose Inference: detect keypoints with M(F_t);
     extract wrist positions for each participant
  3. BPM Estimation: normalize and filter wrist motion data;
     compute BPM_i using peak detection
  4. Kuramoto Update: update oscillator phases θ_i using the coupling equation
  5. MIDI Scheduling: generate MIDI Note On/Off events based on θ_global
  6. Robotic Actuation: send MIDI messages to the robotic actuators
END WHILE
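
Tying the pieces together, a condensed Python sketch of this main loop is given below. It reuses the illustrative helpers sketched earlier in the chapter (wrist_positions, estimate_bpm, kuramoto_step, global_phase_and_bpm, StepSequencer, strike); the participant count, buffer length, and device handling are assumptions, not a transcription of the system's code.

```python
# Condensed main-loop sketch reusing the illustrative helpers defined above.
import time
from collections import deque
import cv2
import mido
import numpy as np

NUM_PARTICIPANTS = 2
buffers = [deque(maxlen=30) for _ in range(NUM_PARTICIPANTS)]   # (t, wrist_y) history
bpm = [120.0] * NUM_PARTICIPANTS                                # initial BPM_i = 120
theta = np.zeros(NUM_PARTICIPANTS)                              # oscillator phases
seq = StepSequencer()
cap = cv2.VideoCapture(0)

with mido.open_output() as port:
    while True:
        ok, frame = cap.read()                                  # 1. video acquisition
        if not ok:
            break
        now = time.time()
        for i, (x, y) in enumerate(wrist_positions(frame)[:NUM_PARTICIPANTS]):
            buffers[i].append((now, y))                         # 2. pose inference
        for i, buf in enumerate(buffers):                       # 3. BPM estimation
            if len(buf) == buf.maxlen:
                ts, ys = zip(*buf)
                bpm[i] = estimate_bpm(ys, ts, prev_bpm=bpm[i])
        theta = kuramoto_step(theta, bpm)                       # 4. Kuramoto update
        phase, _ = global_phase_and_bpm(theta, bpm)
        for note in seq.advance(phase):                         # 5. MIDI scheduling
            strike(port, note)                                  # 6. robotic actuation
```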

🧪 Experimental Setup and Evaluation

Research Objectives

Comprehensive evaluation of the multimodal synchronization framework's effectiveness, adaptability, and perceived responsiveness through controlled user studies.

Assess Synchronization Accuracy: Determine alignment between robotic percussionist and human-induced tempo/phase, measuring robustness to intentional tempo variations.
Evaluate Adaptation Speed: Measure system response time to abrupt tempo changes and stability maintenance during fluctuations.
Examine Perceived Responsiveness: Gather subjective feedback on whether the robot is perceived as an active partner vs. static metronome.
Explore Multi-Condition Influence: Investigate impact of environmental factors, multimodal cues, and ensemble size on synchronization quality.

👥 Participant Demographics

6 Total Participants
3M / 3F Gender Distribution
22-35 Age Range
2+ years Musical Experience

🏗️ Experimental Setup

📹 Video Capture

  • HD camera at 30 fps
  • 2.5m distance from participants
  • Upper body visibility ensured
  • Uniform overhead lighting

🥁 Robotic Drummer

  • Solenoid-driven percussion
  • Snare-like surface
  • Real-time MIDI control
  • Sub-10ms response time

🎵 Audio System

  • Reference click track (120 BPM)
  • Bimodal scenario support
  • High-quality playback
  • Synchronized with visual

👤 Participant Setup

  • Comfortable attire
  • Free gesture choice
  • Periodic motion required
  • Conducting patterns encouraged

📋 Experimental Conditions

1️⃣ Baseline (Visual-Only, Steady)

Duration: 3-4 minutes

Task: Steady tempo ~120 BPM

Input: Visual cues only

Purpose: Baseline synchronization measure

2️⃣ Tempo-Change (Visual-Only, Variable)

Pattern: 120 → 130 → 120 BPM

Timing: Changes at 60s and 130s marks

Transition: 10-second gradual change

Purpose: Test adaptation capability

3️⃣ Multimodal (Audio + Visual)

Audio: 120 BPM click track

Visual: Participant gestures offset by ±5 BPM from the click track

Conflict: Intentional audio-visual mismatch

Purpose: Test conflict resolution

4️⃣ Occlusion (Visual Interference)

Interference: 2-3 second camera obstruction

Tempo: Steady 120 BPM maintained

Challenge: Lost/noisy visual data

Purpose: Test robustness

📊 Results & Analysis

📈 Quantitative Metrics

🎯 Tempo Estimation Error (TEE)

$$\text{TEE} = |\text{BPM}_{\text{estimated}} - \text{BPM}_{\text{intended}}|$$

Measures accuracy of system's tempo estimation compared to participant's intended tempo.

🎵 Synchronization Accuracy (SyncAcc)

$$\text{SyncAcc} = \frac{1}{N} \sum_{i=1}^{N} |t_{\text{human},i} - t_{\text{robot},i}|$$

Mean absolute deviation between human beats and robotic drum hits.

⏱️ Adaptation Time (AT)

$$\text{AT} = t_{\text{settle}} - t_{\text{change}}$$

Duration for system to align within ±3 BPM of new target tempo.

🔄 Phase Variance (σ²_θ)

$$\sigma^2_\theta = \frac{1}{N} \sum_{i=1}^{N} (\theta_i - \bar{\theta})^2$$

Variability in oscillator phases, indicating synchronization stability.
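
For concreteness, the four metrics can be computed from logged time series as sketched below; the function and argument names are placeholders for the study's recorded data rather than the original analysis code.

```python
# Evaluation-metric sketch over logged time series (placeholder names).
import numpy as np

def tempo_estimation_error(bpm_estimated, bpm_intended):
    """TEE: absolute difference between estimated and intended tempo (BPM)."""
    return abs(bpm_estimated - bpm_intended)

def sync_accuracy(human_beats, robot_hits):
    """SyncAcc: mean absolute offset (s) between matched beat/hit times."""
    return float(np.mean(np.abs(np.asarray(human_beats) - np.asarray(robot_hits))))

def adaptation_time(t, bpm_series, t_change, bpm_target, tol=3.0):
    """AT: time from the tempo change until the estimate falls within ±3 BPM."""
    t = np.asarray(t)
    ok = (t >= t_change) & (np.abs(np.asarray(bpm_series) - bpm_target) <= tol)
    return float(t[ok][0] - t_change) if ok.any() else float("nan")

def phase_variance(theta):
    """σ²_θ: variance of the oscillator phases around their mean."""
    theta = np.asarray(theta)
    return float(np.mean((theta - theta.mean()) ** 2))
```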

📋 Quantitative Results

| Condition | TEE (BPM) | SyncAcc (ms) | AT (s) | Performance |
|---|---|---|---|---|
| Baseline (Visual) | 2.5 ± 0.9 | 18 ± 5 | N/A | 🟢 Excellent |
| Tempo-Change (Visual) | 3.1 ± 1.2 | 22 ± 6 | 2.8 ± 0.7 | 🟡 Good |
| Multimodal (Audio+Visual) | 2.2 ± 1.0 | 15 ± 4 | 2.5 ± 0.9 | 🟢 Best |
| Occlusion (Visual) | 3.5 ± 1.5 | 25 ± 8 | N/A | 🟡 Acceptable |

💭 Qualitative Feedback

| Condition | Responsiveness (1-5) | Naturalness (1-5) | Confidence (1-5) |
|---|---|---|---|
| Baseline (Visual) | 4.0 ± 0.6 | 3.9 ± 0.7 | 3.8 ± 0.5 |
| Tempo-Change (Visual) | 3.8 ± 0.8 | 3.7 ± 0.9 | 3.5 ± 0.6 |
| Multimodal (Audio+Visual) | 4.3 ± 0.5 | 4.1 ± 0.6 | 4.0 ± 0.6 |
| Occlusion (Visual) | 3.4 ± 1.0 | 3.2 ± 1.1 | 3.0 ± 1.0 |

🔍 Key Findings

Hypothesis Confirmed: ±3 BPM Accuracy
2.5s Adaptation Time (Target: <3s)
4.3/5 Peak Responsiveness (Multimodal)
15ms Best Sync Accuracy (Audio+Visual)

📝 Detailed Analysis

🎯 Tempo Estimation Performance

  • Baseline: 2.5 BPM error - excellent alignment
  • Tempo Changes: 3.1 BPM - good adaptation capability
  • Multimodal: 2.2 BPM - best performance with audio stabilization
  • Occlusions: 3.5 BPM - acceptable degradation

🎵 Synchronization Quality

  • Audio cues improved accuracy to ~15ms
  • Visual-only achieved stable ~18ms baseline
  • Tempo changes slightly worsened to ~22ms
  • Occlusions degraded to ~25ms but remained functional

⚡ Adaptation Capabilities

  • Quick response: 2.5-2.8s adaptation time
  • Minimal disruption to musical flow
  • Smooth transitions between tempo changes
  • Rapid recovery from visual interruptions

👥 User Perception

  • High responsiveness in stable conditions (4.0+/5)
  • Natural interaction feeling developed over time
  • Multimodal preference - robot seemed more "attentive"
  • Temporary confusion during occlusions but quick recovery

🚀 Advanced Observations

Multimodal Conflict Resolution

When audio and visual cues conflicted, the system settled on a compromise tempo; participants described this as the robot "negotiating" the tempo, creating a more musical and collaborative experience.

Group Performance Dynamics

In supplementary 2-3 participant tests, the Kuramoto model effectively synchronized multiple oscillators. Phase variance decreased from 1.2 to ~0.3 radians after 30 seconds, demonstrating robust ensemble synchronization.

🎯 Conclusion

🏆 Research Achievement

Successfully demonstrated the feasibility and practicality of integrating Kuramoto-based synchronization with multimodal cues to create compelling human-robot musical ensemble experiences.

✅ Validated Capabilities

🎯 Accuracy & Stability

  • Maintained tempo alignment within ±3 BPM
  • Synchronization accuracy below 20ms average
  • Stable performance under normal conditions

🔄 Adaptability

  • Rapid adaptation to tempo changes (2-3s)
  • Smooth handling of intentional variations
  • Quick recovery from disruptions

👥 User Experience

  • High perceived responsiveness (4.0+/5)
  • Natural interaction development
  • Positive collaborative experience

🛡️ Robustness

  • Resilience to visual occlusions
  • Effective multimodal integration
  • Functional degradation, not failure

🔮 Future Implications

🎼 Musical Applications

  • Complex ensembles: Larger groups with multiple robots
  • Genre diversity: Beyond percussion to melodic instruments
  • Expressive control: Dynamics and articulation adaptation
  • Composition tools: AI-assisted musical creation

🤖 Technical Advances

  • Enhanced sensing: Additional modalities (haptic, spatial)
  • Learning systems: Adaptive personal style recognition
  • Distributed performance: Remote ensemble capabilities
  • Mobile platforms: Portable robotic musicians

🎵 "Cyborg Philharmonic" Vision Realized

This chapter completes our journey from theoretical foundations to practical implementation, demonstrating that human-robot musical collaboration is not just possible, but can be natural, responsive, and genuinely collaborative.

The future of music may well include artificial performers who listen, adapt, and contribute as true ensemble partners.