00129 Text Classification (Windows 11)
Preface
This post shows how to do text classification.
Hugging Face GitHub homepage: https://github.com/huggingface
Text classification is a common NLP task that assigns a label or class to text. One of the most popular forms of text classification is sentiment analysis, which assigns a label like 🙂 positive, 🙁 negative, or 😐 neutral to a sequence of text.
This guide will show you how to:
- Finetune DistilBERT on the IMDb dataset to determine whether a movie review is positive or negative.
- Use your finetuned model for inference.
The task illustrated in this tutorial is supported by the following model architectures:
ALBERT, BART, BERT, BigBird, BigBird-Pegasus, BioGpt, BLOOM, CamemBERT, CANINE, CodeLlama, ConvBERT, CTRL, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, ESM, Falcon, FlauBERT, FNet, Funnel Transformer, Gemma, GPT-Sw3, OpenAI GPT-2, GPTBigCode, GPT Neo, GPT NeoX, GPT-J, I-BERT, LayoutLM, LayoutLMv2, LayoutLMv3, LED, LiLT, LLaMA, Longformer, LUKE, MarkupLM, mBART, MEGA, Megatron-BERT, Mistral, Mixtral, MobileBERT, MPNet, MPT, MRA, MT5, MVP, Nezha, Nyströmformer, OpenLlama, OpenAI GPT, OPT, Perceiver, Persimmon, Phi, PLBart, QDQBert, Qwen2, Reformer, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, SqueezeBERT, StableLm, Starcoder2, T5, TAPAS, Transformer-XL, UMT5, XLM, XLM-RoBERTa, XLM-RoBERTa-XL, XLNet, X-MOD, YOSO
Before you begin, make sure you have all the necessary libraries installed:
pip install transformers datasets evaluate accelerate
We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:
from huggingface_hub import notebook_login
notebook_login()
Operating system: Windows 11 Home (Chinese edition)
Reference documentation
Load IMDb dataset
Start by loading the IMDb dataset from the 🤗 Datasets library:
from datasets import load_dataset
imdb = load_dataset("imdb")
Then take a look at an example:
1 | "test"][0] imdb[ |
There are two fields in this dataset:
- text: the movie review text.
- label: a value that is either 0 for a negative review or 1 for a positive review.
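If you want to double-check that schema before preprocessing, a quick inspection works; this is a minimal sketch assuming the imdb object loaded above:
print(imdb)                    # split names and row counts
print(imdb["train"].features)  # 'text' should be a string feature, 'label' a two-class label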
Preprocess
The next step is to load a DistilBERT tokenizer to preprocess the text field:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
Create a preprocessing function to tokenize text and truncate sequences to be no longer than DistilBERT's maximum input length:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [~datasets.Dataset.map] function. You can speed up map by setting batched=True to process multiple elements of the dataset at once:
tokenized_imdb = imdb.map(preprocess_function, batched=True)
Now create a batch of examples using [DataCollatorWithPadding]. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
Evaluate
Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):
import evaluate
accuracy = evaluate.load("accuracy")
Then create a function that passes your predictions and labels to [~evaluate.EvaluationModule.compute] to calculate the accuracy:
import numpy as np
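A minimal version of that function, assuming the accuracy metric loaded in the previous step and the NumPy import above, looks roughly like this:
def compute_metrics(eval_pred):
    # eval_pred is a (predictions, labels) pair; the predictions are raw logits
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)  # pick the highest-scoring class per example
    return accuracy.compute(predictions=predictions, references=labels)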
Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.
Train
Before you start training your model, create a map of the expected ids to their labels with id2label and label2id:
1 | 0: "NEGATIVE", 1: "POSITIVE"} id2label = { |
You're ready to start training your model now! Load DistilBERT with [AutoModelForSequenceClassification] along with the number of expected labels, and the label mappings:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
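A minimal sketch of the model-loading call, assuming the base distilbert-base-uncased checkpoint and the label maps defined above:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",  # assumed base checkpoint; use whichever DistilBERT checkpoint you prefer
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)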
At this point, only three steps remain:
- Define your training hyperparameters in [TrainingArguments]. The only required parameter is output_dir, which specifies where to save your model. You'll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [Trainer] will evaluate the accuracy and save the training checkpoint.
- Pass the training arguments to [Trainer] along with the model, dataset, tokenizer, data collator, and compute_metrics function.
- Call [~Trainer.train] to finetune your model.
# output_dir is any folder (and Hub repo) name you like; on older transformers releases eval_strategy is spelled evaluation_strategy
training_args = TrainingArguments(output_dir="my_awesome_model", eval_strategy="epoch", save_strategy="epoch", load_best_model_at_end=True, push_to_hub=True)
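A minimal sketch of the remaining two steps, assuming the model, tokenized dataset, tokenizer, data collator, and compute_metrics function defined earlier:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()  # starts finetuning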
[Trainer] applies dynamic padding by default when you pass tokenizer to it, so in that case you don't need to specify a data collator explicitly.
Once training is completed, share your model to the Hub with the [~transformers.Trainer.push_to_hub] method so everyone can use your model:
trainer.push_to_hub()
For a more in-depth example of how to finetune a model for text classification, take a look at the corresponding PyTorch notebook or TensorFlow notebook.
Inference
Grab some text you’d like to run inference on:
1 | "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three." text = |
The simplest way to try out your finetuned model for inference is to use it in a [pipeline]. Instantiate a pipeline for sentiment analysis with your model, and pass your text to it:
from transformers import pipeline
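A minimal sketch of that call, using "my_awesome_model" as a placeholder for your finetuned checkpoint (a local output directory or a Hub repo id):
classifier = pipeline("sentiment-analysis", model="my_awesome_model")  # placeholder checkpoint id
classifier(text)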
You can also manually replicate the results of the pipeline if you'd like:
Tokenize the text and return PyTorch tensors:
from transformers import AutoTokenizer
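A minimal sketch, again with "my_awesome_model" as a placeholder for your finetuned checkpoint:
tokenizer = AutoTokenizer.from_pretrained("my_awesome_model")  # placeholder checkpoint id
inputs = tokenizer(text, return_tensors="pt")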
Pass your inputs to the model and return the logits:
from transformers import AutoModelForSequenceClassification
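A minimal sketch, with the same placeholder checkpoint and the inputs from the previous step:
import torch

model = AutoModelForSequenceClassification.from_pretrained("my_awesome_model")  # placeholder checkpoint id
with torch.no_grad():
    logits = model(**inputs).logits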
Get the class with the highest probability, and use the model's id2label mapping to convert it to a text label:
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]
Conclusion
My 129th blog post is finished, so happy!
Today, too, is a day full of hope.