research

papers, experience, and other research works


📄

Speech Emotion Recognition using Multimodal LLMs and Quality-Controlled TTS-based Data Augmentation for Iberian Languages

Status: Published|2026

We propose multimodal LLM-based speech emotion recognition for low-resource Iberian languages and introduce a quality-controlled TTS augmentation pipeline with translation self-verification and filtering (ASR-WER, speaker similarity, emotion). The best setup improves mean F1 by 4.9 points over an MLP baseline, showing that filtered TTS data can effectively complement classical augmentation.
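As a rough illustration of the filtering step, the sketch below computes word error rate between the intended text and an ASR transcript of the synthesized audio, then keeps a sample only if it passes all three checks. The thresholds, the field names (`asr_text`, `speaker_sim`, `predicted_emotion`), and the helper `keep_sample` are hypothetical and not taken from the paper; the speaker-similarity and emotion values are assumed to come from external models.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)


def keep_sample(sample, max_wer=0.2, min_speaker_sim=0.7):
    """Return True if a synthesized sample passes all three filters.

    The thresholds are illustrative assumptions, not the paper's values.
    """
    wer = word_error_rate(sample["text"], sample["asr_text"])
    return (wer <= max_wer
            and sample["speaker_sim"] >= min_speaker_sim
            and sample["predicted_emotion"] == sample["target_emotion"])
```

A sample whose transcript drifts from the intended text, whose voice no longer matches the target speaker, or whose detected emotion differs from the requested one is dropped before training.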

📄

thaulab@EEUCA 2026: Who Said What to Whom? A Targeting-Aware Neural-Symbolic Pipeline for Gaming Toxicity Detection

Status: Published|2026

A targeting-aware neural-symbolic pipeline for gaming toxicity detection that jointly identifies what was said, who said it, and who was targeted. The system placed 3rd out of 35 teams at the EEUCA 2026 Shared Task, co-located with ACL 2026.

📄

Mixture of Phonetic Experts Based Low-Rank Adaptation of Conformer Models for Accented English Speech Recognition

Status: Under Review|2026

We propose a Mixture of Phonetic Experts approach combined with Low-Rank Adaptation (LoRA) of Conformer-based ASR models to improve robustness on accented English speech, targeting domain adaptation without full fine-tuning.

📄

A Personalized, Multimodal AI Assistant for Enhancing Museum Visitor Experience

Status: Accepted|2025

Developed under the EIC ASSIST and ASTOUND projects, this work presents a multimodal, context-aware museum AI assistant integrating vision-language models, RAG, ASR/TTS, and a Neo4j knowledge graph. The system enables artwork recognition, personalization, and real-time interaction, validated with expert feedback from major Spanish museums.

📄

NLPineers@ NLU of Devanagari Script Languages: Hate Speech Detection using Ensembling of BERT-based Models

Status: Accepted|2025|arXiv

We study hate speech detection in Hindi and Nepali (CHIPSAL@COLING 2025), leveraging multilingual BERT-based ensembles. Our best system achieves 0.7762 recall (Rank 3/31) and competitive F1, highlighting the effectiveness of transformer ensembling for nuanced moderation tasks.

📄

Real-Time Scream Detection and Position Estimation for Worker Safety in Construction Sites

Status: Accepted|2025|arXiv

We combine Wav2Vec2-based scream detection with GCC-PHAT time-delay estimation and gradient-based localization for real-time position estimation in noisy construction sites, enabling improved worker safety monitoring.
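For readers unfamiliar with GCC-PHAT, the following is a minimal standard-library sketch of the time-delay estimation idea only: the cross-power spectrum of two microphone signals is normalized to unit magnitude (the phase transform), and the peak of its inverse transform gives the delay in samples. It uses a naive O(n²) DFT for clarity, so it only suits short signals. The function names are illustrative, and this is not the paper's implementation; the Wav2Vec2 detector and the gradient-based localization are not shown.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (forward or inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_delay(sig, ref, eps=1e-12):
    """Estimate how many samples `sig` lags `ref` (negative if it leads)."""
    n = 2 * max(len(sig), len(ref))            # zero-pad to avoid wrap-around
    a = list(sig) + [0.0] * (n - len(sig))
    b = list(ref) + [0.0] * (n - len(ref))
    cross = [x * y.conjugate() for x, y in zip(dft(a), dft(b))]
    whitened = [c / (abs(c) + eps) for c in cross]   # PHAT weighting
    cc = [v.real for v in dft(whitened, inverse=True)]
    peak = max(range(n), key=lambda i: cc[i])
    return peak if peak <= n // 2 else peak - n      # map to signed lag
```

Dividing the sample delay by the sampling rate gives the time difference of arrival between a microphone pair, which a localization step can then consume.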

📄

Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024

Status: Published|2024|arXiv

An ensemble of speech foundation models with Squeeze-and-Excitation Aggregation achieves 1.79% pooled EER, securing 3rd place in the CtrSVDD 2024 challenge.
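As background on the metric, the equal error rate (EER) is the operating point where the false acceptance rate and the false rejection rate coincide. The sketch below is a simple threshold sweep over detector scores, assuming higher scores mean "bona fide"; the function name is illustrative, and "pooled" is assumed here to mean that scores from all conditions are combined before a single EER is computed. It is not the challenge's official scoring script.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    A trial is accepted as bona fide when its score is >= the threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for th in thresholds:
        far = sum(s >= th for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < th for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Take the midpoint where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

On a finite score set the two error rates rarely meet exactly, so the sweep reports the midpoint at the closest crossing.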

📄

Attention Isn't All You Need for Emotion Recognition: Domain Features Outperform Transformers on the EAV Dataset

Status: Preprint|2026|arXiv

A systematic study of multimodal emotion recognition shows that complex attention mechanisms underperform on small datasets. Domain-informed features (delta MFCCs, EEG frequency features) outperform transformers, improving accuracy by up to 7.62 points over baselines.
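To make "delta MFCCs" concrete, the sketch below applies the commonly used regression formula for delta features to a sequence of per-frame coefficient vectors, repeating edge frames at the boundaries. The function name and the window half-width of 2 are assumptions for illustration; the study's actual feature extraction settings are not given here.

```python
def delta_features(frames, width=2):
    """Delta coefficients via d_t = sum_n n*(c[t+n] - c[t-n]) / (2*sum_n n^2).

    `frames` is a list of equal-length coefficient lists, one per time frame.
    Edge frames are repeated where the window runs past either end.
    """
    denom = 2 * sum(n * n for n in range(1, width + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                nxt = frames[min(t + n, last)][d]
                prv = frames[max(t - n, 0)][d]
                acc += n * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out
```

The deltas describe how each coefficient changes over time, so a steadily rising coefficient yields a constant positive delta away from the edges, and a constant one yields zero.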