Multi-Label Classification Model From Scratch: Step-by-Step Tutorial
Community Article · Published January 8, 2024 · Valerii Vasylevskyi

This tutorial will guide you through each step of creating an efficient ML model for multi-label text classification. We will use DeBERTa as a base model, currently the best choice among encoder models, and fine-tune it on our dataset. This dataset contains 3,140 meticulously validated training examples of significant business events in the biotech industry. Although it covers a specific domain, the dataset is universal and extremely beneficial for various business data classification tasks. Our team open-sourced the dataset, aiming to transcend the limitations of existing benchmarks, which are more academic than practical. By the end of this tutorial, you will have an actionable model that surpasses most of the popular solutions in this field.
Text classification is one of the most widely required NLP tasks. Despite its simple formulation, for most real business use cases it's a complicated task that requires expertise to collect high-quality datasets while also training performant and accurate models. Multi-label classification is even more complicated and problematic. This research area is significantly underrepresented in the ML community, and public datasets are often too simple to train actionable models. Moreover, engineers often benchmark models on sentiment analysis and other relatively simplistic tasks that don't represent their real problem-solving capacity. In this blog, we will train a multi-label classification model on an open-source dataset collected by our team to show that everyone can develop a better solution.
Before starting the project, please make sure that you have installed the following packages:
```
!pip install datasets transformers evaluate sentencepiece accelerate
```
Recently, we have open-sourced a dataset specifically tailored to the biotech news sector, aiming to transcend the limitations of existing benchmarks. This dataset is rich in complex content, comprising thousands of biotech news articles covering various significant business events, thus providing a more nuanced view of information extraction challenges.
The dataset encompasses 31 classes, including a 'None' category, to cover various events and information types such as event organisation, executive statements, regulatory approvals, hiring announcements, and more.
Key aspects of the dataset:

- Size: 3,140 meticulously validated training examples.
- Labels: 31 classes, including a 'None' category, covering significant business events such as regulatory approvals, executive statements, and hiring announcements.
You can find more details about the dataset in our article.
Guidance
First of all, let's load the dataset and preprocess the classes:
```python
from datasets import load_dataset

dataset = load_dataset('knowledgator/events_classification_biotech')

classes = [class_ for class_ in dataset['train'].features['label 1'].names if class_]
class2id = {class_: id for id, class_ in enumerate(classes)}
id2class = {id: class_ for class_, id in class2id.items()}
```
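Before going further, it's worth a quick sanity check that the classes and mappings look right. This is a small inspection snippet, not part of the original pipeline, and the printed values are illustrative:

```python
# Inspect the label set and the mappings we just built
print(len(classes))           # 31 classes for this dataset
print(classes[:5])            # first few class names
print(class2id[classes[0]])   # -> 0

# Peek at one raw record (field names as used in the preprocessing below)
sample = dataset['train'][0]
print(sample['title'])
print(sample['all_labels'])   # comma-separated labels for this article
```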
After that, we tokenise the dataset and process the labels for multi-label classification. First, we initialise the tokeniser. In this tutorial, we use the DeBERTa model, currently the best choice among encoder-based models.
```python
from transformers import AutoTokenizer

model_path = 'microsoft/deberta-v3-small'
tokenizer = AutoTokenizer.from_pretrained(model_path)
```
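To confirm the tokeniser loads correctly, you can encode a short string. The example sentence is made up, and the exact keys returned depend on the tokenizer class:

```python
# Encode a toy sentence; expect input_ids and attention_mask
# (plus token_type_ids for the DeBERTa family)
encoding = tokenizer("Biotech startup raises Series A funding.", truncation=True)
print(list(encoding.keys()))
print(encoding['input_ids'][:10])
```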
Then, we tokenise the text and encode the labels as multi-hot vectors:
```python
def preprocess_function(example):
    # Combine the title and body of the article into a single input text
    text = f"{example['title']}.\n{example['content']}"
    # Turn the comma-separated label string into a multi-hot vector
    all_labels = example['all_labels'].split(', ')
    labels = [0. for i in range(len(classes))]
    for label in all_labels:
        label_id = class2id[label]
        labels[label_id] = 1.
    example = tokenizer(text, truncation=True)
    example['labels'] = labels
    return example

tokenized_dataset = dataset.map(preprocess_function)
```
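A quick way to verify the preprocessing is to check that each example now carries a multi-hot vector with one entry per class. Again, this check is just an illustration:

```python
# Each processed example should have a 0/1 entry per class
processed = tokenized_dataset['train'][0]
print(len(processed['labels']))   # == len(classes)
print(sum(processed['labels']))   # number of active labels for this article
print([id2class[i] for i, v in enumerate(processed['labels']) if v == 1.])
```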
After that, we initialize the DataCollatorWithPadding. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation instead of padding the whole dataset to the maximum length.
```python
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```
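To see the dynamic padding in action, you can collate a handful of examples and inspect the resulting tensor shape. This is a sketch for illustration; the second dimension depends on the longest text in the sampled batch:

```python
# Collate a small batch; sequences are padded only to the longest one in it
features = [
    {k: tokenized_dataset['train'][i][k] for k in ('input_ids', 'attention_mask', 'labels')}
    for i in range(4)
]
batch = data_collator(features)
print(batch['input_ids'].shape)  # e.g. torch.Size([4, <longest sequence in batch>])
```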
Implementing metrics during training is super helpful for monitoring model performance over time. It can help avoid overfitting and build a more general model.
```python
import evaluate
import numpy as np

clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = sigmoid(predictions)
    predictions = (predictions > 0.5).astype(int).reshape(-1)
    return clf_metrics.compute(
        predictions=predictions,
        references=labels.astype(int).reshape(-1)
    )
```
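You can also sanity-check compute_metrics on random inputs before training starts. The shapes mirror what the Trainer passes in (raw logits and multi-hot references); the numbers themselves are meaningless:

```python
# Fake a batch of logits and multi-hot references to exercise the metrics
dummy_logits = np.random.randn(8, len(classes))
dummy_labels = np.random.randint(0, 2, size=(8, len(classes))).astype(float)
print(compute_metrics((dummy_logits, dummy_labels)))
# -> {'accuracy': ..., 'f1': ..., 'precision': ..., 'recall': ...}
```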
Let's initialise our model and pass all necessary details about our classification task, such as the number of labels, class names and their IDs, and type of classification.
```python
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    num_labels=len(classes),
    id2label=id2class,
    label2id=class2id,
    problem_type="multi_label_classification",
)
```
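A quick look at the config confirms that the task type and label mappings were registered. This step is optional, but it catches mismatched mappings early:

```python
# The config now carries everything needed for multi-label inference
print(model.config.problem_type)   # 'multi_label_classification'
print(model.config.num_labels)     # 31
print(model.config.id2label[0])    # first class name
```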
Next, we must configure the training arguments, and then we can begin the training process.
```python
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=3,
    per_device_eval_batch_size=3,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
```
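Once training finishes, a minimal inference sketch looks like this. We apply a sigmoid and the same 0.5 threshold used in compute_metrics; the example text is made up:

```python
import torch

text = "The company announced FDA approval for its new drug."
inputs = tokenizer(text, truncation=True, return_tensors="pt").to(model.device)

# Multi-label inference: a sigmoid per class instead of a softmax over classes
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits)[0]

predicted_classes = [id2class[i] for i, p in enumerate(probs) if p > 0.5]
print(predicted_classes)
```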
If you've carefully followed each step of this guide, you now possess a highly effective tool. The training procedure is straightforward, yields remarkable results, and relies exclusively on open-source solutions.
Should you have any inquiries or suggestions on how to streamline the process further and enhance the results, please feel free to leave a comment!
Or join the Discord community, where we actively engage with everyone!
Conclusion
In this article, we have demonstrated how to build a multi-label text classifier from scratch using Transformers and Datasets. We leveraged the DeBERTa encoder model and fine-tuned it on a custom biotech news dataset with 31 labels. After tokenising the text and encoding the multi-hot labels, we set up the training loop with helpful metrics for monitoring. The model can now accurately classify biotech news articles into multiple relevant categories. This end-to-end workflow showcases how new datasets can be prepared and fed into powerful pre-trained models like DeBERTa to create custom text classifiers for real-world applications. Now you can add more data and labels to train a model tailored to your specific requirements and reach your professional goals. And we're happy to help with your research & development efforts!
FAQ