research

papers, experience, and other research work


📄

Speech Emotion Recognition using Multimodal LLMs and Quality-Controlled TTS-based Data Augmentation for Iberian Languages

Status: Published | 2026

We propose multimodal LLM-based speech emotion recognition for low-resource Iberian languages and introduce a quality-controlled TTS augmentation pipeline with translation self-verification and filtering (ASR-WER, speaker similarity, emotion). The best setup improves mean F1 by 4.9 points over an MLP baseline, showing that filtered TTS data can effectively complement classical augmentation.
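For context, here is a minimal sketch of what such a quality gate can look like. The thresholds and the helper functions (transcribe, speaker_embedding, predict_emotion) are illustrative assumptions, not the paper's exact pipeline:

```python
# Minimal sketch of a quality gate for TTS-augmented training data.
# Thresholds and the helpers passed in (transcribe, speaker_embedding,
# predict_emotion) are illustrative assumptions, not the paper's values.
import numpy as np
from jiwer import wer  # word error rate between reference and hypothesis


def keep_synthetic_clip(audio, text, target_emotion, ref_speaker_emb,
                        transcribe, speaker_embedding, predict_emotion,
                        max_wer=0.2, min_spk_sim=0.75):
    """Return True if a TTS clip passes all three quality filters."""
    # 1) ASR-WER filter: the clip must be intelligible enough that an
    #    ASR system recovers the intended text.
    if wer(text.lower(), transcribe(audio).lower()) > max_wer:
        return False

    # 2) Speaker-similarity filter: cosine similarity between the clip's
    #    speaker embedding and a reference embedding.
    emb = speaker_embedding(audio)
    sim = np.dot(emb, ref_speaker_emb) / (
        np.linalg.norm(emb) * np.linalg.norm(ref_speaker_emb))
    if sim < min_spk_sim:
        return False

    # 3) Emotion filter: a pretrained SER model must agree with the label.
    return predict_emotion(audio) == target_emotion
```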

📄

A Personalized, Multimodal AI Assistant for Enhancing Museum Visitor Experience

Status: Accepted | 2025

Developed under the EIC ASSIST and ASTOUND projects, this work presents a multimodal, context-aware museum AI assistant integrating vision-language models, RAG, ASR/TTS, and a Neo4j knowledge graph. The system enables artwork recognition, personalization, and real-time interaction, validated with expert feedback from major Spanish museums.
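To illustrate the knowledge-graph retrieval step of such a RAG loop, here is a minimal sketch against a hypothetical Neo4j schema ((:Artwork)-[:CREATED_BY]->(:Artist)); the labels, properties, and credentials are assumptions, not the project's actual graph:

```python
# Minimal sketch of graph-grounded retrieval for a RAG pipeline.
# The Neo4j schema, property names, and credentials below are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))


def artwork_context(title: str) -> list[str]:
    """Fetch graph facts about a recognized artwork to ground the LLM prompt."""
    query = (
        "MATCH (a:Artwork {title: $title})-[:CREATED_BY]->(p:Artist) "
        "RETURN a.title AS title, a.year AS year, p.name AS artist"
    )
    with driver.session() as session:
        records = session.run(query, title=title)
        return [f"{r['title']} ({r['year']}) by {r['artist']}" for r in records]
```

The returned facts would then be concatenated into the vision-language model's prompt alongside the visitor's question.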

📄

NLPineers@NLU of Devanagari Script Languages: Hate Speech Detection using Ensembling of BERT-based Models

Status: Accepted | 2025 | arXiv

We study hate speech detection in Hindi and Nepali (CHIPSAL@COLING 2025), leveraging multilingual BERT-based ensembles. Our best system achieves 0.7762 recall (Rank 3/31) and competitive F1, highlighting the effectiveness of transformer ensembling for nuanced moderation tasks.
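A minimal sketch of the soft-voting idea behind such an ensemble, assuming placeholder checkpoints and binary labels; the paper's actual members and weighting may differ:

```python
# Minimal sketch of soft-voting over fine-tuned BERT-family classifiers.
# Checkpoints are placeholders; in practice each would be fine-tuned first.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINTS = ["bert-base-multilingual-cased", "xlm-roberta-base"]  # assumed

models = [AutoModelForSequenceClassification.from_pretrained(c, num_labels=2)
          for c in CHECKPOINTS]
tokenizers = [AutoTokenizer.from_pretrained(c) for c in CHECKPOINTS]


@torch.no_grad()
def ensemble_predict(text: str) -> int:
    """Average class probabilities across ensemble members (soft voting)."""
    probs = []
    for model, tok in zip(models, tokenizers):
        model.eval()
        inputs = tok(text, return_tensors="pt", truncation=True)
        probs.append(model(**inputs).logits.softmax(dim=-1))
    return int(torch.stack(probs).mean(dim=0).argmax(dim=-1))
```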

📄

Real-Time Scream Detection and Position Estimation for Worker Safety in Construction Sites

Status: Accepted | 2025 | arXiv

We combine Wav2Vec2-based scream detection with GCC-PHAT time-delay estimation and gradient-based localization for real-time position estimation in noisy construction sites, enabling improved worker safety monitoring.
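The time-delay step is classical; here is a minimal NumPy sketch of GCC-PHAT for one microphone pair (framing and delay smoothing, which a real deployment would need, are omitted):

```python
# Minimal sketch of GCC-PHAT time-delay estimation between two microphones.
import numpy as np


def gcc_phat(sig, ref, fs, max_tau=None, interp=16):
    """Estimate the delay (seconds) of `sig` relative to `ref` via GCC-PHAT."""
    n = len(sig) + len(ref)
    # Cross-power spectrum, whitened by its magnitude (the PHAT weighting),
    # so only phase information drives the correlation peak.
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=interp * n)
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    # Peak location relative to zero lag gives the time difference of arrival.
    return (np.argmax(np.abs(cc)) - max_shift) / float(interp * fs)
```

Delays estimated for several microphone pairs would then feed the gradient-based position solver.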

📄

Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024

Status: Published | 2024 | arXiv

An ensemble of speech foundation models with Squeeze-and-Excitation Aggregation achieves 1.79% pooled EER, securing 3rd place in the CtrSVDD 2024 challenge.
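A minimal PyTorch sketch of one way Squeeze-and-Excitation style gating can aggregate per-model embeddings; the dimensions and reduction factor are assumptions, not the submission's exact architecture:

```python
# Minimal sketch of Squeeze-and-Excitation (SE) aggregation over the
# per-model embeddings of an ensemble; dimensions are illustrative.
import torch
import torch.nn as nn


class SEAggregation(nn.Module):
    """Reweight each foundation model's embedding before pooling."""

    def __init__(self, n_models: int, dim: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(n_models, n_models // reduction),
            nn.ReLU(),
            nn.Linear(n_models // reduction, n_models),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_models, dim) -- one embedding per ensemble member.
        s = x.mean(dim=-1)   # squeeze: per-model summary, (batch, n_models)
        w = self.gate(s)     # excitation: per-model gate in (0, 1)
        return (x * w.unsqueeze(-1)).sum(dim=1)  # weighted sum over members


fused = SEAggregation(n_models=4, dim=768)(torch.randn(2, 4, 768))  # (2, 768)
```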

📄

Attention Isn’t All You Need for Emotion Recognition: Domain Features Outperform Transformers on the EAV Dataset

Status: Preprint | 2026 | arXiv

A systematic study of multimodal emotion recognition shows that complex attention mechanisms underperform on small datasets. Domain-informed features (delta MFCCs, EEG frequency features) outperform transformers, improving accuracy by up to 7.62 points over baselines.
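As a pointer to what the acoustic side of such domain features looks like, here is a minimal librosa sketch of MFCCs with first- and second-order deltas; the sample rate and coefficient count are assumed, not the study's exact settings:

```python
# Minimal sketch of delta-MFCC feature extraction with librosa.
# Sample rate and n_mfcc are illustrative assumptions.
import librosa
import numpy as np


def mfcc_with_deltas(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Return a (3 * n_mfcc, frames) stack of MFCC, delta, delta-delta."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    d1 = librosa.feature.delta(mfcc)           # first-order temporal change
    d2 = librosa.feature.delta(mfcc, order=2)  # second-order (acceleration)
    return np.vstack([mfcc, d1, d2])
```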