research
papers, experience, and other research works
Speech Emotion Recognition using Multimodal LLMs and Quality-Controlled TTS-based Data Augmentation for Iberian Languages
We propose multimodal LLM-based speech emotion recognition for low-resource Iberian languages and introduce a quality-controlled TTS data augmentation pipeline with translation self-verification and filtering on ASR word error rate, speaker similarity, and predicted emotion. The best setup improves mean F1 by 4.9 points over an MLP baseline, showing that filtered TTS data can effectively complement classical augmentation.
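As a rough illustration of the kind of three-way quality gate described above, here is a minimal Python sketch. The threshold values and function name are illustrative assumptions, not figures or code from the paper.

```python
# Hedged sketch of a quality gate for TTS-generated training clips:
# keep a synthetic sample only if it clears the ASR-WER, speaker-similarity,
# and emotion checks. Thresholds are assumed for illustration.
def passes_quality_gate(
    asr_wer: float,             # WER of an ASR transcript vs. the reference text
    speaker_similarity: float,  # cosine similarity to a reference speaker embedding
    emotion_confidence: float,  # classifier confidence for the intended emotion
    max_wer: float = 0.2,       # assumed threshold
    min_spk_sim: float = 0.7,   # assumed threshold
    min_emo_conf: float = 0.5,  # assumed threshold
) -> bool:
    """Return True if the synthetic clip survives all three filters."""
    return (
        asr_wer <= max_wer
        and speaker_similarity >= min_spk_sim
        and emotion_confidence >= min_emo_conf
    )

# Example: a clip with WER 0.1, speaker similarity 0.85, and emotion
# confidence 0.9 would be retained for augmentation.
assert passes_quality_gate(0.1, 0.85, 0.9)
```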
A Personalized, Multimodal AI Assistant for Enhancing Museum Visitor Experience
Developed under the EIC ASSIST and ASTOUND projects, this work presents a multimodal, context-aware museum AI assistant integrating vision-language models, RAG, ASR/TTS, and a Neo4j knowledge graph. The system enables artwork recognition, personalization, and real-time interaction, validated with expert feedback from major Spanish museums.
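To make the knowledge-graph retrieval step concrete, here is a hedged sketch of querying a Neo4j graph for artwork facts to ground an LLM answer (the RAG pattern). The graph schema (Artwork/Artist nodes, a CREATED relationship), connection details, and prompt wiring are assumptions for illustration, not the project's actual code.

```python
# Hedged sketch: retrieve artwork context from a Neo4j knowledge graph to
# ground a language-model response. Schema and credentials are assumed.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def artwork_context(title: str) -> str:
    """Fetch facts about an artwork and format them as grounding text."""
    query = (
        "MATCH (a:Artwork {title: $title})<-[:CREATED]-(artist:Artist) "
        "RETURN a.title AS title, a.year AS year, artist.name AS artist"
    )
    with driver.session() as session:
        record = session.run(query, title=title).single()
    if record is None:
        return "No information found."
    return f"{record['title']} ({record['year']}) by {record['artist']}"

# The retrieved snippet would then be prepended to the visitor's question
# before calling the language model.
```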
NLPineers@NLU of Devanagari Script Languages: Hate Speech Detection using Ensembling of BERT-based Models
We study hate speech detection in Hindi and Nepali for the CHIPSAL@COLING 2025 shared task, using ensembles of multilingual BERT-based models. Our best system achieves 0.7762 recall (rank 3 of 31) and competitive F1, highlighting the effectiveness of transformer ensembling for nuanced moderation tasks.
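A minimal sketch of the ensembling idea follows: run several fine-tuned BERT-family classifiers on the same text and average their class probabilities. The checkpoint names are placeholders, not the models used in the paper.

```python
# Hedged sketch of probability-averaging over BERT-family classifiers.
# In practice the checkpoints would be task-fine-tuned models; the names
# below are placeholders for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINTS = ["bert-base-multilingual-cased", "xlm-roberta-base"]  # placeholders

def ensemble_predict(text: str, num_labels: int = 2) -> int:
    probs = []
    for name in CHECKPOINTS:
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModelForSequenceClassification.from_pretrained(
            name, num_labels=num_labels
        )
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        probs.append(torch.softmax(logits, dim=-1))
    # Average probabilities across models, then take the argmax class.
    return int(torch.stack(probs).mean(dim=0).argmax(dim=-1))
```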
Real-Time Scream Detection and Position Estimation for Worker Safety in Construction Sites
We combine Wav2Vec2-based scream detection with GCC-PHAT time-delay estimation and gradient-based source localization to estimate positions in real time on noisy construction sites, improving worker safety monitoring.
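GCC-PHAT is a standard building block, so a short sketch may help: it whitens the cross-power spectrum of two microphone signals so that only phase (i.e., delay) information survives, then picks the correlation peak. Sampling rate and interpolation factor below are illustrative assumptions.

```python
# Hedged sketch of GCC-PHAT time-delay estimation between two microphones.
import numpy as np

def gcc_phat(sig: np.ndarray, ref: np.ndarray, fs: int, interp: int = 16) -> float:
    """Estimate the delay (seconds) of `sig` relative to `ref`."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=interp * n)  # interpolated cross-correlation
    max_shift = interp * n // 2
    cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)
```

The estimated pairwise delays across a microphone array are what a downstream localizer (here, gradient-based) converts into a position estimate.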
Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024
An ensemble of speech foundation models with Squeeze-and-Excitation Aggregation achieves 1.79% pooled EER, securing 3rd place in the CtrSVDD 2024 challenge.
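Below is a hedged PyTorch sketch of Squeeze-and-Excitation aggregation over a stack of foundation-model feature maps: learn a gate per stacked layer and collapse the stack into one representation. Tensor shapes and the reduction factor are assumptions, not the challenge system's exact architecture.

```python
# Hedged sketch of Squeeze-and-Excitation (SE) aggregation across the
# hidden-state stack of a speech foundation model. Dimensions are assumed.
import torch
import torch.nn as nn

class SELayerAggregation(nn.Module):
    def __init__(self, num_layers: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(num_layers, num_layers // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(num_layers // reduction, num_layers),
            nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_layers, time, dim) — one feature map per layer
        squeezed = feats.mean(dim=(2, 3))       # squeeze: (batch, num_layers)
        weights = self.gate(squeezed)           # excitation: per-layer gates
        weighted = feats * weights[:, :, None, None]
        return weighted.sum(dim=1)              # aggregated: (batch, time, dim)
```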
Attention Isn't All You Need for Emotion Recognition: Domain Features Outperform Transformers on the EAV Dataset
A systematic study of multimodal emotion recognition shows that complex attention mechanisms underperform on small datasets. Domain-informed features (delta MFCCs, EEG frequency features) outperform transformers, improving accuracy by up to 7.62 points over baselines.
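For reference, a minimal sketch of the audio side of those domain features, assuming librosa is available: MFCCs stacked with their first- and second-order deltas. Parameters are illustrative, not the paper's configuration.

```python
# Hedged sketch: MFCCs plus delta (velocity) and delta-delta (acceleration)
# coefficients, stacked into one feature matrix. Parameters are assumed.
import numpy as np
import librosa

def mfcc_with_deltas(path: str, n_mfcc: int = 13) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)            # first-order deltas
    delta2 = librosa.feature.delta(mfcc, order=2)  # second-order deltas
    return np.vstack([mfcc, delta, delta2])        # shape: (3 * n_mfcc, frames)
```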