#Artificial Intelligence #Transformer #fast.ai

Transformer Models with fastai and blurr

Mag. Dr. Bernhard Mayr, MBA 20.9.2021

Deep learning models built on the transformer architecture, above all encoder models such as BERT and decoder models such as GPT-2 and GPT-3, are behind the current successes of deep learning. Especially for applications in NLU (Natural Language Understanding), transformer models have become indispensable.

With transformer models, NLU tasks such as the sentiment analysis shown below can be solved quickly and reliably.

However, these models usually require a great deal of training, which is why it is rarely feasible to train deep transformer models such as GPT-x from scratch ourselves.

This is where transfer learning comes in: we take models that have already been pretrained and continue training them for our specific use case.

We can integrate these pretrained models into our own deep learning projects with the help of Hugging Face.

Hugging Face Model Hub

The Hugging Face Model Hub is, so to speak, the "marketplace" for pretrained transformer models.

There you can search by model architecture or by the task the model should perform and pick out a suitable model.
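If you prefer to query the Hub from code rather than the web UI, the huggingface_hub package offers a list_models API. The snippet below is only a minimal sketch under that assumption (it is not part of the original walkthrough); the filter and sort parameters mirror the filters of the web interface.

from huggingface_hub import HfApi

api = HfApi()
# List a few models tagged for the "text-classification" task, sorted by download count.
for model_info in api.list_models(filter="text-classification", sort="downloads", direction=-1, limit=5):
    print(model_info.modelId)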


To be able to integrate Hugging Face models into our fastai project, we install the transformers library:

!pip install -qq datasets transformers[sentencepiece]

Behind the pipeline API

The simplest and fastest way to work with Hugging Face transformer models is the pipeline API. In what follows we will dig a little deeper and analyze what happens behind the scenes when the pipeline API is used.

First we import pipeline from the transformers library.

from transformers import pipeline

Now we can create a classifier with the pipeline API. We pass the task the classifier is supposed to perform as an argument to the pipeline call.

In this example we want to run a sentiment analysis, so we pass sentiment-analysis.

If we display the underlying model with classifier.model.name_or_path, we see that pipeline has chosen the following checkpoint: distilbert-base-uncased-finetuned-sst-2-english.

classifier = pipeline("sentiment-analysis")
classifier(["I've been waiting for a HuggingFace course my whole life.",
            "I hate this really much"])
print(classifier.model.name_or_path)
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)



distilbert-base-uncased-finetuned-sst-2-english

Step 1: Tokenization

Before we feed the text input into our model, the first step is tokenization: the text is prepared as a sequence of tokens for the model. In the process the tokenizer adds special tokens, for example markers for the beginning and end of each sequence.

We store the chosen model name in a checkpoint variable so that we can refer to it again later.

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

Using the AutoTokenizer class, we let the transformers library pick the tokenizer that matches our checkpoint.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Now we can apply our tokenizer. It returns a list of input_ids together with an attention_mask. The attention_mask tells the model which positions contain real tokens and which are only padding, so that the padded positions are ignored.

raw_inputs = [
  "I've been waiting for a HuggingFace course my whole life.",
  "I hate this so much!"
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
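To make the added special tokens visible, we can map the produced ids back to tokens. This quick sanity check is not part of the original walkthrough, but it shows the [CLS]/[SEP] markers and the [PAD] tokens whose positions the attention_mask sets to zero.

# Map the input_ids back to tokens to inspect special tokens and padding.
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][1].tolist()))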

Step 2: Sending the inputs through the model

After running our input data through the tokenizer, we can feed it into our transformer model. For this we use the AutoModel class, which selects the matching model architecture, and load the pretrained weights from our stored checkpoint with the from_pretrained method.

from transformers import AutoModel
model = AutoModel.from_pretrained(checkpoint)
Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Now we feed our model with the output of the tokenizer. The plain AutoModel only returns hidden states with 768 values per token, so for our classification we switch to AutoModelForSequenceClassification, whose final layer has two output channels (positive / negative sentiment).

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
torch.Size([2, 16, 768])
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

We can now print the entire model architecture:

print(model)
DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (1): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (2): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (3): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (4): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (5): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
      )
    )
  )
  (pre_classifier): Linear(in_features=768, out_features=768, bias=True)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
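As a small addition (not in the original post), counting the parameters of the printed model gives a feel for its size; DistilBERT-base has roughly 66 million parameters plus the small classification head.

# Count the trainable parameters of the sequence-classification model.
print(sum(p.numel() for p in model.parameters() if p.requires_grad))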

The output of our model consists of logits. That means we still have to post-process the model output in order to obtain the desired values (0 or 1) as the model's prediction.

print(outputs.logits.shape)
print(outputs.logits)
torch.Size([2, 2])
tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward>)

Processing Outputs

First we convert the logits into probabilities using the softmax function.

import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=1)
print(predictions)
tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)

Now we can map these probabilities to the model's labels via id2label.

model.config.id2label
{0: 'NEGATIVE', 1: 'POSITIVE'}
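Putting the two post-processing steps together, the following minimal sketch (not in the original post) combines the softmax probabilities with id2label into readable predictions.

# Pick the most probable class per input and look up its label.
for probs in predictions:
    label_id = int(torch.argmax(probs))
    print(model.config.id2label[label_id], float(probs[label_id]))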

Behind the pipeline: blurr

Now we take one step down in abstraction and look at how we can solve the same problem one level below the pipeline API with the help of blurr.

blurr

A library that integrates huggingface transformers with version 2 of the fastai framework

blurr on GitHub

For this we install the required dependencies:

!pip install -qq fastai
!pip install -qq ohmeow-blurr

… and import the required modules:

from fastai.text.all import *
from blurr.utils import *
from blurr.data.core import *
from blurr.modeling.core import *

Preparations

Here, too, we set our checkpoint to the desired model architecture.

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

As training data for our sentiment task we use the IMDb sample dataset of movie reviews. It stores movie reviews together with their sentiment label.

Using the untar_data method from the fastai framework, we download and extract the dataset.

We then load the data directly into a pandas DataFrame with read_csv.

We see three columns here (label, text and is_valid):

path = untar_data(URLs.IMDB_SAMPLE)
imdb_df = pd.read_csv(path/'texts.csv')

imdb_df.head()
  | label | text | is_valid
0 | negative | Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff! | False
1 | positive | This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som... | False
2 | negative | Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.<br /><br />Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li... | False
3 | positive | Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production.<br /><br />Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie "Duty, Honor, Country" are not just mere words blathered from the lips of a high-brassed offic... | False
4 | negative | This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr... | False

Next we fetch the required objects for our transformer architecture from Hugging Face with get_hf_objects. Again we need our checkpoint and the desired model class.

We get back the architecture name, the configuration, the tokenizer and our DL model.

hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(checkpoint, model_cls=AutoModelForSequenceClassification)

print(hf_arch)
print(type(hf_config))
print(type(hf_tokenizer))
print(type(hf_model))
distilbert
<class 'transformers.models.distilbert.configuration_distilbert.DistilBertConfig'>
<class 'transformers.models.distilbert.tokenization_distilbert_fast.DistilBertTokenizerFast'>
<class 'transformers.models.distilbert.modeling_distilbert.DistilBertForSequenceClassification'>

Next we build our DataBlock following the fastai framework. As the independent variable we choose a special block of type HF_TextBlock, to which we pass the previously retrieved objects of our transformer architecture; as the dependent variable we use a CategoryBlock, since we want our model to predict positive or negative.

With these two blocks we can then create the dataloaders for our fastai learner.

blocks = (HF_TextBlock(hf_arch,hf_config, hf_tokenizer, 
                       hf_model, max_length=None, padding=True,
                       truncation=True), 
          CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ColReader('text'), get_y=ColReader('label'),
                   splitter=ColSplitter())
dls = dblock.dataloaders(imdb_df, bs=4)

Let's look at an example batch from our training data.

dls.show_batch(dataloaders=dls, max_n=2)
  | text | target
0 | raising victor vargas : a review < br / > < br / > you know, raising victor vargas is like sticking your hands into a big, steaming bowl of oatmeal. it's warm and gooey, but you're not sure if it feels right. try as i might, no matter how warm and gooey raising victor vargas became i was always aware that something didn't quite feel right. victor vargas suffers from a certain overconfidence on the director's part. apparently, the director thought that the ethnic backdrop of a latino family on the lower east side, and an idyllic storyline would make the film critic proof. he was right, but it didn't fool me. raising victor vargas is the story about a seventeen - year old boy called, you guessed it, victor vargas ( victor rasuk ) who lives his teenage years chasing more skirt than the rolling stones could do in all the years they've toured. the movie starts off in ` ugly fat'donna's bedroom where victor is sure to seduce her, but a cry from outside disrupts his plans when his best - friend harold ( kevin rivera ) comes - a - looking for him. caught in the attempt by harold and his sister, victor vargas runs off for damage control. yet even with the embarrassing implication that he's been boffing the homeliest girl in the neighborhood, nothing dissuades young victor from going off on the hunt for more fresh meat. on a hot, new york city day they make way to the local public swimming pool where victor's eyes catch a glimpse of the lovely young nymph judy ( judy marte ), who's not just pretty, but a strong and independent too. the relationship that develops between victor and judy becomes the focus of the film. the story also focuses on victor's family that is comprised of his grandmother or abuelita ( altagracia guzman ), his brother nino ( also played by real life brother to victor, silvestre rasuk ) and his sister vicky ( krystal rodriguez ). the action follows victor between scenes with judy and scenes with his family. victor tries to cope with being an oversexed pimp - daddy, his feelings for judy and his grandmother's conservative catholic upbringing. < br / > < br / > the problems that arise from raising victor vargas are a few, but glaring errors. throughout the film you get to know certain characters like vicky, nino, grandma, judy and even | negative
1 | although recognized as the best film treatment of the difficulties of having a house in the country built ( or bought ) to your specifications, it is not the first, nor the last. in 1940 jack benny and ann sheridan were the leads in the film version of the comedy george washington slept here by george s. kaufman and moss hart. and about fifteen years ago shelly long and tom hanks had the lead in the money pit. the former was about moving into an 18th century country house that... err, needs work. the latter was about building your dream house - in the late 1980s. although the two films have their moments, both are not as good as blandings, which was based on an autobiographical novel of the same name. < br / > < br / > jim blandings and his wife muriel ( cary grant and myrna loy ) are noticing the tight corners of their apartment, which they share with their two daughters joan and betsy ( sharyn moffett and connie marshall ). although blandings has a good income as an advertising executive ( in 1948 he is making $ 15, 000. 00 a year, which was like making $ 90, 000. 00 today ), and lives in a luxury apartment - which in the new york city of that day he rents! - he feels they should seek something better. he and muriel take a drive into the country ( connecticut ) and soon find an old ruin that both imagine can be fixed up as that dream house they want. < br / > < br / > and they both fall into the financial worm hole that buying land and construction can lead to. for one thing, they are so gung ho about the idea of building a home like this they fail to heed warning after warning by their wise, if cynical friend and lawyer bill cole ( melvin douglas, in a nicely sardonic role ). for example, jim buys land from a connecticut dealer ( ian wolfe, sucking his chops quietly ), with a check before double checking the correct cost for the land in that part of connecticut. bill points out he's paid about five or six thousand dollars more for the land than it is worth. there are problems about water supply that both blandings just never think about, such as hard and soft water - which leads to the zis - zis water softening machine. they find that the designs they have in mind, and have worked out with their architect ( reginald denny ), can't be dropped cheaply at a spur of the | positive
xb, yb = dls.one_batch()

We see that the batch again contains input_ids together with an attention_mask.

xb
{'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]], device='cuda:0'),
 'input_ids': tensor([[  101,  6274,  5125,  ...,  1998,  2130,   102],
         [  101,  1996,  4497,  ...,  2090,  1005,   102],
         [  101,  1045,  2428,  ...,  3272,  1010,   102],
         [  101,  1045, 12524,  ...,  1999,  2008,   102]], device='cuda:0')}
len(xb), xb['input_ids'].shape, xb['attention_mask'].shape, len(xb['input_ids']), yb.shape
(2, torch.Size([4, 512]), torch.Size([4, 512]), 4, torch.Size([4]))
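As a quick check (a sketch that is not part of the original walkthrough), we can decode the first sequence of the batch back to text with the Hugging Face tokenizer:

# Decode the first padded/truncated sequence of the batch back into text.
print(hf_tokenizer.decode(xb["input_ids"][0], skip_special_tokens=True)[:200])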

Feeding the inputs into the Hugging Face model

Now we can push our input data through the model and compute the output vectors. (These again come back as logits and therefore have to be converted accordingly before display.)

hf_model.cuda()
outputs = hf_model(**xb)
print(outputs.logits.shape)
print(outputs.logits)
torch.Size([4, 2])
tensor([[-1.0525,  1.2515],
        [ 0.9266, -0.6908],
        [ 2.7509, -2.2911],
        [ 4.4430, -3.5959]], device='cuda:0', grad_fn=<AddmmBackward>)

Output processing (postprocessing)

Just as before, we apply the softmax function to obtain probabilities. These can again be combined with the model's labels to get meaningful results.

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
tensor([[9.0788e-02, 9.0921e-01],
        [8.3443e-01, 1.6557e-01],
        [9.9358e-01, 6.4191e-03],
        [9.9968e-01, 3.2256e-04]], device='cuda:0', grad_fn=<SoftmaxBackward>)
model.config.id2label
{0: 'NEGATIVE', 1: 'POSITIVE'}

FastAI training with BLURR

Now that we have created our model, we can wrap this Hugging Face model in HF_BaseModelWrapper and use it with a fastai Learner.

With freeze we freeze all but the last layer group of our model for training.

model = HF_BaseModelWrapper(hf_model)

learn = Learner(dls,
                model,
                opt_func=partial(OptimWrapper, opt=torch.optim.Adam),
                loss_func=CrossEntropyLossFlat(),
                metrics=[accuracy],
                cbs=[HF_BaseModelCallback],
                splitter=hf_splitter)

learn.freeze()
learn.show_results(learner=learn, max_n=2, trunc_at=500)
  | text | target | prediction
0 | the trouble with the book, " memoirs of a geisha " is that it had japanese surfaces but underneath the surfaces it was all an american man's way of thinking. reading the book is like watching a magnificent ballet with great music, sets, and costumes yet performed by barnyard animals dressed in those costumesso far from japanese ways of thinking were the characters. < br / > < br / > the movie isn't about japan or real geisha. it is a story about a few american men's mistaken ideas about japan an | negative | negative
1 | < br / > < br / > i'm sure things didn't exactly go the same way in the real life of homer hickam as they did in the film adaptation of his book, rocket boys, but the movie " october sky " ( an anagram of the book's title ) is good enough to stand alone. i have not read hickam's memoirs, but i am still able to enjoy and understand their film adaptation. the film, directed by joe johnston and written by lewis colick, records the story of teenager homer hickam ( jake gyllenhaal ), beginning in oct | positive | positive
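Note that this walkthrough only evaluates the SST-2 checkpoint, which is already fine-tuned for sentiment; no additional training run is shown. If we wanted to fine-tune it further on the IMDb sample, a minimal sketch with the Learner defined above could look like the following (the learning rates and epoch counts are placeholder choices, not values from the original post).

# Train only the unfrozen layer group first, then unfreeze and fine-tune the whole model.
learn.fit_one_cycle(1, lr_max=1e-3)
learn.unfreeze()
learn.fit_one_cycle(1, lr_max=slice(1e-5, 1e-4))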

Now we can use blurr_predict to run sentiment predictions for arbitrary text sequences with our model.

learn.blurr_predict([
    "I've been waiting for a HuggingFace course my whole life",
    "I hate this so much!"
])
[(('positive',), (#1) [tensor(1)], (#1) [tensor([0.0484, 0.9516])]),
 (('negative',), (#1) [tensor(0)], (#1) [tensor([9.9946e-01, 5.4418e-04])])]

Reusing the model

If we want to reuse our trained model later, we have to save it. This means saving both the model and the generated tokenizer, because further inputs must be processed with the same tokenizer that was used when training the model.

We end up with a set of files containing the information about the model, the tokenizer and the vocabulary:

!mkdir -p 'my_model'
learn.model.hf_model.save_pretrained('my_model')
hf_tokenizer.save_pretrained('my_model')
('my_model/tokenizer_config.json',
 'my_model/special_tokens_map.json',
 'my_model/vocab.txt',
 'my_model/added_tokens.json',
 'my_model/tokenizer.json')
hf_model is learn.model.hf_model
True
!ls -lsha 'my_model'
total 257M
4.0K drwxr-xr-x 2 root root 4.0K Sep 21 12:06 .
4.0K drwxr-xr-x 1 root root 4.0K Sep 21 12:06 ..
4.0K -rw-r--r-- 1 root root  763 Sep 21 12:06 config.json
256M -rw-r--r-- 1 root root 256M Sep 21 12:06 pytorch_model.bin
4.0K -rw-r--r-- 1 root root  112 Sep 21 12:06 special_tokens_map.json
4.0K -rw-r--r-- 1 root root  405 Sep 21 12:06 tokenizer_config.json
456K -rw-r--r-- 1 root root 456K Sep 21 12:06 tokenizer.json
228K -rw-r--r-- 1 root root 227K Sep 21 12:06 vocab.txt

After saving, we can load our model again with get_hf_objects:

hf_arch2, hf_config2, hf_tokenizer2, hf_model2 = BLURR.get_hf_objects('my_model', model_cls=AutoModelForSequenceClassification)

print(hf_arch2)
print(type(hf_config2))
print(type(hf_tokenizer2))
print(type(hf_model2))
distilbert
<class 'transformers.models.distilbert.configuration_distilbert.DistilBertConfig'>
<class 'transformers.models.distilbert.tokenization_distilbert_fast.DistilBertTokenizerFast'>
<class 'transformers.models.distilbert.modeling_distilbert.DistilBertForSequenceClassification'>
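As an alternative check (again a sketch, not part of the original post), the saved directory can also be loaded directly with the plain transformers pipeline API:

from transformers import pipeline

clf = pipeline("sentiment-analysis", model="my_model", tokenizer="my_model")
print(clf("I really enjoyed this movie."))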