HiTZ's Comprehensive Basque Language AI Models and Datasets on Hugging Face

Tools & Engineering

The Engineer

7 Feb 2024 · 3 min read

HiTZ's release of Basque language AI models and datasets on Hugging Face offers researchers a valuable resource for advancing NLP in under-resourced languages, covering text-to-speech, speech-to-text, multimodal, and instruction-tuned applications.

HiTZ, a research group focused on the Basque language, has recently released an extensive collection of models and datasets on Hugging Face. This collection is a treasure trove for researchers and practitioners working in natural language processing (NLP) and machine learning (ML), particularly those interested in under-resourced languages like Basque.

Key Highlights

Diverse Model Types: The collection includes text-to-speech (TTS), speech-to-text (ASR), multimodal, and instruction-tuned models.
Rich Datasets: A variety of datasets for pretraining, evaluation, and specific tasks such as lemmatization and medical translation.
Multilingual Support: Models like Multilingual TruthfulQA and Medical-mT5 extend the utility beyond Basque.

Technical Overview

Latxa Instruct

Purpose: An instruction-tuned model specifically designed for generating high-quality responses to user queries in Basque.
Architecture: Built on top of a transformer architecture, fine-tuned on a large corpus of instructional text.
Use Case: Ideal for chatbots and conversational agents that need to understand and respond to complex instructions.

TTS (Text-to-Speech)

Purpose: Converts text into spoken language in Basque.
Architecture: Utilizes a deep neural network, likely based on Tacotron or similar architectures, trained on high-quality audio data.
Use Case: Enhances accessibility for visually impaired users and improves user experience in voice-enabled applications.

Latxa VL (Vision-Language)

Purpose: A multimodal model that can process both visual and textual information.
Architecture: Combines a vision encoder (e.g., ResNet) with a language decoder (e.g., transformer).
Use Case: Useful for tasks like image captioning, visual question answering, and cross-modal retrieval.

Cap&Punct

Purpose: Adds capitalization and punctuation to text.
Architecture: A sequence-to-sequence model trained on a dataset of raw and corrected text.
Use Case: Improves the readability of unformatted or poorly structured text in documents and transcripts.

ASR Datasets

Content: Includes a variety of datasets for training and evaluating automatic speech recognition models.
Use Case: Essential for developing and testing ASR systems, particularly those focused on Basque.

Merge and Conquer

Purpose: A method for merging multiple datasets to create a more robust training set.
Architecture: Involves data cleaning, normalization, and augmentation techniques.
Use Case: Enhances the performance of models by providing a diverse and comprehensive dataset.

Notable Datasets

EusCrawl

Content: A large web-crawled dataset in Basque.
Use Case: Suitable for pretraining language models and expanding the corpus of available text data.

BERTeus

Purpose: A BERT model fine-tuned on Basque text.
Architecture: Based on the transformer architecture, trained on a diverse set of Basque texts.
Use Case: Effective for a wide range of NLP tasks, including sentiment analysis, named entity recognition, and question answering.

Pretraining Datasets

Content: Various datasets used for pretraining language models.
Use Case: Essential for initializing models with a strong foundation in the Basque language.

Additional Models and Tools

Whisper: A speech-to-text model that can transcribe audio into text in multiple languages, including Basque.
Pyannote: A toolkit for speaker diarization and voice activity detection.
Nvidia NeMo: A framework for building conversational AI applications.
Medical-mT5: A multilingual transformer model fine-tuned on medical data.

Conclusion

HiTZ's collection on Hugging Face is a significant contribution to the field of NLP, particularly for under-resourced languages like Basque. The diverse range of models and datasets provides researchers and practitioners with powerful tools to advance their work in language processing and machine learning. Whether you're building a chatbot, developing a speech recognition system, or working on multimodal applications, this collection offers valuable resources to support your projects.