research
papers, experience, and other research works
Speech Emotion Recognition using Multimodal LLMs and Quality-Controlled TTS-based Data Augmentation for Iberian Languages
We propose multimodal LLM-based speech emotion recognition for low-resource Iberian languages and introduce a quality-controlled TTS augmentation pipeline with translation self-verification and filtering on ASR word error rate (WER), speaker similarity, and emotion. The best setup improves mean F1 by 4.9 points over an MLP baseline, showing that filtered TTS data can effectively complement classical augmentation.
thaulab@EEUCA 2026: Who Said What to Whom? A Targeting-Aware Neural-Symbolic Pipeline for Gaming Toxicity Detection
A targeting-aware neural-symbolic pipeline for gaming toxicity detection that jointly identifies what was said, who said it, and who was targeted. The system placed 3rd out of 35 teams at the EEUCA 2026 Shared Task, co-located with ACL 2026.
Mixture of Phonetic Experts Based Low-Rank Adaptation of Conformer Models for Accented English Speech Recognition
We propose a Mixture of Phonetic Experts approach combined with Low-Rank Adaptation (LoRA) of Conformer-based ASR models to improve robustness on accented English speech, targeting domain adaptation without full fine-tuning.
A Personalized, Multimodal AI Assistant for Enhancing Museum Visitor Experience
Developed under the EIC ASSIST and ASTOUND projects, this work presents a multimodal, context-aware museum AI assistant integrating vision-language models, RAG, ASR/TTS, and a Neo4j knowledge graph. The system enables artwork recognition, personalization, and real-time interaction, validated with expert feedback from major Spanish museums.
NLPineers@ NLU of Devanagari Script Languages: Hate Speech Detection using Ensembling of BERT-based Models
We study hate speech detection in Hindi and Nepali (CHIPSAL@COLING 2025), leveraging multilingual BERT-based ensembles. Our best system achieves 0.7762 recall (Rank 3/31) and competitive F1, highlighting the effectiveness of transformer ensembling for nuanced moderation tasks.
Real-Time Scream Detection and Position Estimation for Worker Safety in Construction Sites
We combine Wav2Vec2-based scream detection with GCC-PHAT time-delay estimation and gradient-based localization for real-time position estimation in noisy construction sites, enabling improved worker safety monitoring.
Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024
An ensemble of speech foundation models with Squeeze-and-Excitation Aggregation achieves 1.79% pooled EER, securing 3rd place in the CtrSVDD 2024 challenge.
Attention Isn't All You Need for Emotion Recognition: Domain Features Outperform Transformers on the EAV Dataset
A systematic study of multimodal emotion recognition shows that complex attention mechanisms underperform on small datasets. Domain-informed features (delta MFCCs, EEG frequency features) outperform transformers, improving accuracy by up to 7.62 points over baselines.