Multi-Label Classification Model From Scratch: Step-by-Step Tutorial
Community Article · Published January 8, 2024 · Valerii Vasylevskyi

This tutorial will guide you through each step of creating an efficient ML model for multi-label text classification. We will use DeBERTa as a base model, currently the best choice among encoder models, and fine-tune it on our dataset. This dataset contains 3,140 meticulously validated training examples of significant business events in the biotech industry. Although it covers a specific domain, the dataset is universal and extremely beneficial for various business data classification tasks. Our team open-sourced the dataset, aiming to transcend the limitations of existing benchmarks, which are more academic than practical. By the end of this tutorial, you will have an actionable model that surpasses most of the popular solutions in this field.
Text classification is one of the most widely required NLP tasks. Despite its simple formulation, for most real business use cases it's a complicated task that requires expertise to collect high-quality datasets while also training performant and accurate models. Multi-label classification is even more complicated and problematic. This research area is significantly underrepresented in the ML community, and public datasets are often too simple to train actionable models. Moreover, engineers often benchmark models on sentiment analysis and other relatively simplistic tasks that don't represent their real problem-solving capacity. In this blog, we will train a multi-label classification model on an open-source dataset collected by our team to show that everyone can develop a better solution.
Before starting the project, please make sure that you have installed the following packages:
```
!pip install datasets transformers evaluate sentencepiece accelerate
```
Recently, we have open-sourced a dataset specifically tailored to the biotech news sector, aiming to transcend the limitations of existing benchmarks. This dataset is rich in complex content, comprising thousands of biotech news articles covering various significant business events, thus providing a more nuanced view of information extraction challenges.
The dataset encompasses 31 classes, including a 'None' category, to cover various events and information types such as event organisation, executive statements, regulatory approvals, hiring announcements, and more.
Key aspects of the dataset:

- Size: 3,140 meticulously validated training examples.
- Labels: 31 classes, including a 'None' category, covering significant business events such as regulatory approvals, executive statements, and hiring announcements.
You can find more details about the dataset in our article.
Guidance
First of all, let's load the dataset and preprocess the classes:
```python
from datasets import load_dataset

dataset = load_dataset('knowledgator/events_classification_biotech')

classes = [class_ for class_ in dataset['train'].features['label 1'].names if class_]
class2id = {class_: id for id, class_ in enumerate(classes)}
id2class = {id: class_ for class_, id in class2id.items()}
```
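Before going further, it's worth a quick sanity check that the classes and mappings look right. This is a small inspection snippet, not part of the original pipeline, and the printed values are illustrative:

```python
# Inspect the label set and the mappings we just built
print(len(classes))           # 31 classes for this dataset
print(classes[:5])            # first few class names
print(class2id[classes[0]])   # -> 0

# Peek at one raw record (field names as used in the preprocessing below)
sample = dataset['train'][0]
print(sample['title'])
print(sample['all_labels'])   # comma-separated labels for this article
```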
After that, we tokenise the dataset and process the labels for multi-label classification. First, we initialise the tokeniser. In this tutorial, we use the DeBERTa model, currently the best choice among encoder-based models.
```python
from transformers import AutoTokenizer

model_path = 'microsoft/deberta-v3-small'
tokenizer = AutoTokenizer.from_pretrained(model_path)
```
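To confirm the tokeniser loads correctly, you can encode a short string. The example sentence is made up, and the exact keys returned depend on the tokenizer class:

```python
# Encode a toy sentence; expect input_ids and attention_mask
# (plus token_type_ids for the DeBERTa family)
encoding = tokenizer("Biotech startup raises Series A funding.", truncation=True)
print(list(encoding.keys()))
print(encoding['input_ids'][:10])
```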
Then, we tokenise the text and encode the labels as multi-hot vectors:
```python
def preprocess_function(example):
    # Combine the title and body of the article into a single input text
    text = f"{example['title']}.\n{example['content']}"
    # Turn the comma-separated label string into a multi-hot vector
    all_labels = example['all_labels'].split(', ')
    labels = [0. for i in range(len(classes))]
    for label in all_labels:
        label_id = class2id[label]
        labels[label_id] = 1.
    example = tokenizer(text, truncation=True)
    example['labels'] = labels
    return example

tokenized_dataset = dataset.map(preprocess_function)
```
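A quick way to verify the preprocessing is to check that each example now carries a multi-hot vector with one entry per class. Again, this check is just an illustration:

```python
# Each processed example should have a 0/1 entry per class
processed = tokenized_dataset['train'][0]
print(len(processed['labels']))   # == len(classes)
print(sum(processed['labels']))   # number of active labels for this article
print([id2class[i] for i, v in enumerate(processed['labels']) if v == 1.])
```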
After that, we initialize the DataCollatorWithPadding. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation instead of padding the whole dataset to the maximum length.
```python
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```
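To see the dynamic padding in action, you can collate a handful of examples and inspect the resulting tensor shape. This is a sketch for illustration; the second dimension depends on the longest text in the sampled batch:

```python
# Collate a small batch; sequences are padded only to the longest one in it
features = [
    {k: tokenized_dataset['train'][i][k] for k in ('input_ids', 'attention_mask', 'labels')}
    for i in range(4)
]
batch = data_collator(features)
print(batch['input_ids'].shape)  # e.g. torch.Size([4, <longest sequence in batch>])
```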
Implementing metrics during training is super helpful for monitoring model performance over time. It can help avoid overfitting and build a more general model.
```python
import evaluate
import numpy as np

clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = sigmoid(predictions)
    predictions = (predictions > 0.5).astype(int).reshape(-1)
    return clf_metrics.compute(
        predictions=predictions,
        references=labels.astype(int).reshape(-1)
    )
```
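You can also sanity-check compute_metrics on random inputs before training starts. The shapes mirror what the Trainer passes in (raw logits and multi-hot references); the numbers themselves are meaningless:

```python
# Fake a batch of logits and multi-hot references to exercise the metrics
dummy_logits = np.random.randn(8, len(classes))
dummy_labels = np.random.randint(0, 2, size=(8, len(classes))).astype(float)
print(compute_metrics((dummy_logits, dummy_labels)))
# -> {'accuracy': ..., 'f1': ..., 'precision': ..., 'recall': ...}
```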
Let's initialise our model and pass all necessary details about our classification task, such as the number of labels, class names and their IDs, and type of classification.
```python
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    num_labels=len(classes),
    id2label=id2class,
    label2id=class2id,
    problem_type="multi_label_classification",
)
```
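A quick look at the config confirms that the task type and label mappings were registered. This step is optional, but it catches mismatched mappings early:

```python
# The config now carries everything needed for multi-label inference
print(model.config.problem_type)   # 'multi_label_classification'
print(model.config.num_labels)     # 31
print(model.config.id2label[0])    # first class name
```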
Next, we must configure the training arguments, and then we can begin the training process.
```python
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=3,
    per_device_eval_batch_size=3,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
```
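Once training finishes, a minimal inference sketch looks like this. We apply a sigmoid and the same 0.5 threshold used in compute_metrics; the example text is made up:

```python
import torch

text = "The company announced FDA approval for its new drug."
inputs = tokenizer(text, truncation=True, return_tensors="pt").to(model.device)

# Multi-label inference: a sigmoid per class instead of a softmax over classes
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits)[0]

predicted_classes = [id2class[i] for i, p in enumerate(probs) if p > 0.5]
print(predicted_classes)
```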
If you've carefully followed each step of this guide, you now possess a highly effective tool. The training procedure is straightforward, yields remarkable results, and relies exclusively on open-source solutions.
Should you have any inquiries or suggestions on how to streamline the process further and enhance the results, please feel free to leave a comment!
Or join the Discord community, where we actively engage with everyone!
Conclusion
In this article, we have demonstrated how to build a multi-label text classifier from scratch using Transformers and Datasets. We leveraged the DeBERTa encoder model and fine-tuned it on a custom biotech news dataset with 31 labels. After tokenising the text and encoding the multi-hot labels, we set up the training loop with helpful metrics for monitoring. The model can now accurately classify biotech news articles into multiple relevant categories. This end-to-end workflow showcases how new datasets can be prepared and fed into powerful pre-trained models like DeBERTa to create custom text classifiers for real-world applications. Now you can add more data and labels to train a model tailored to your specific requirements and reach your professional goals. And we're happy to help with your research & development efforts!
FAQ