
Flair NLP and flaiR for Social Science

Flair NLP is an open-source Natural Language Processing (NLP) library developed by Zalando Research. Known for its state-of-the-art solutions, it excels in contextual string embeddings, Named Entity Recognition (NER), and Part-of-Speech (POS) tagging. Flair offers robust text analysis tools through multiple embedding approaches, including Flair contextual string embeddings, transformer-based embeddings from Hugging Face, and classic word embeddings such as GloVe and fastText. Additionally, it provides pre-trained models for various languages and seamless integration with fine-tuned transformers hosted on Hugging Face.

flaiR bridges these powerful NLP features from Python to the R environment, making advanced text analysis accessible to social science researchers by combining Flair’s ease of use with R’s familiar interface and integration with popular R packages such as quanteda.

Sentence and Token

Sentence and Token are Flair’s two fundamental classes for handling text.

Sentence

A Sentence in Flair is an object that contains a sequence of Token objects and can be annotated with labels, such as named entity and part-of-speech tags. It can also store embeddings for the sentence as a whole, as well as various kinds of linguistic annotations.

Here’s a simple example of how you create a Sentence:

# Creating a Sentence object
library(flaiR)
string <- "What I see in UCD today, what I have seen of UCD in its impact on my own life and the life of Ireland."
Sentence <- flair_data()$Sentence
sentence <- Sentence(string)

Sentence[26] in the printed output below means that the sentence contains a total of 26 tokens.

print(sentence)
#> Sentence[26]: "What I see in UCD today, what I have seen of UCD in its impact on my own life and the life of Ireland."

Token

When you use Flair to handle text data, Sentence and Token objects often play central roles. When you create a Sentence object, it automatically tokenizes the text, removing the need to create Token objects manually.

Unlike R, which indexes from 1, Python indexes from 0. Therefore, when looping over the tokens, we use seq_along(sentence) - 1. The output should look like this:

# The Sentence object has automatically created and contains multiple Token objects
# We can iterate through the Sentence object to view each Token

for (i in seq_along(sentence) - 1) {
  print(sentence[[i]])
}
#> Token[0]: "What"
#> Token[1]: "I"
#> Token[2]: "see"
#> Token[3]: "in"
#> Token[4]: "UCD"
#> Token[5]: "today"
#> Token[6]: ","
#> Token[7]: "what"
#> Token[8]: "I"
#> Token[9]: "have"
#> Token[10]: "seen"
#> Token[11]: "of"
#> Token[12]: "UCD"
#> Token[13]: "in"
#> Token[14]: "its"
#> Token[15]: "impact"
#> Token[16]: "on"
#> Token[17]: "my"
#> Token[18]: "own"
#> Token[19]: "life"
#> Token[20]: "and"
#> Token[21]: "the"
#> Token[22]: "life"
#> Token[23]: "of"
#> Token[24]: "Ireland"
#> Token[25]: "."

Or you can directly use the $tokens attribute to print all tokens.

print(sentence$tokens)
#> [[1]]
#> Token[0]: "What"
#> 
#> [[2]]
#> Token[1]: "I"
#> 
#> [[3]]
#> Token[2]: "see"
#> 
#> [[4]]
#> Token[3]: "in"
#> 
#> [[5]]
#> Token[4]: "UCD"
#> 
#> [[6]]
#> Token[5]: "today"
#> 
#> [[7]]
#> Token[6]: ","
#> 
#> [[8]]
#> Token[7]: "what"
#> 
#> [[9]]
#> Token[8]: "I"
#> 
#> [[10]]
#> Token[9]: "have"
#> 
#> [[11]]
#> Token[10]: "seen"
#> 
#> [[12]]
#> Token[11]: "of"
#> 
#> [[13]]
#> Token[12]: "UCD"
#> 
#> [[14]]
#> Token[13]: "in"
#> 
#> [[15]]
#> Token[14]: "its"
#> 
#> [[16]]
#> Token[15]: "impact"
#> 
#> [[17]]
#> Token[16]: "on"
#> 
#> [[18]]
#> Token[17]: "my"
#> 
#> [[19]]
#> Token[18]: "own"
#> 
#> [[20]]
#> Token[19]: "life"
#> 
#> [[21]]
#> Token[20]: "and"
#> 
#> [[22]]
#> Token[21]: "the"
#> 
#> [[23]]
#> Token[22]: "life"
#> 
#> [[24]]
#> Token[23]: "of"
#> 
#> [[25]]
#> Token[24]: "Ireland"
#> 
#> [[26]]
#> Token[25]: "."

Retrieve the Token

Python’s get_token(n) method retrieves the Token object at a given position, counting from 1. Alternatively, we can index a specific token with [], which follows Python’s zero-based indexing.

# Python method: get_token() counts from 1
sentence$get_token(5)
#> Token[4]: "UCD"
# indexing with [] is zero-based, as in Python
sentence[6]
#> Token[6]: ","

Each word (and punctuation mark) in the text is treated as an individual Token object. Token objects store the text itself, any available linguistic information (such as part-of-speech or named entity tags), and embeddings (if a model has generated them).

While you do not need to create Token objects manually, understanding how to manage them is useful in situations where you want to fine-tune the tokenization process. For example, you can control tokenization precisely by constructing the tokens of a Sentence object yourself.

This makes Flair very flexible when handling text data: automatic tokenization supports rapid development, while manual tokenization remains available when finer control is needed.
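
As a small illustration of this control, the Sentence constructor accepts a use_tokenizer argument; with use_tokenizer = FALSE, the text is split on whitespace only, so the token boundaries are exactly the ones you supply (a minimal sketch):

library(flaiR)
Sentence <- flair_data()$Sentence

# whitespace-only tokenization: pre-tokenized text keeps the supplied boundaries
sentence <- Sentence("New York is big .", use_tokenizer = FALSE)
print(sentence)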

Annotate POS and NER Tags

The add_label(label_type, value) method assigns a label to a token. In the Universal POS tagset, sentence[10] (“seen”) can be tagged as VERB; it is the past participle of “see”.

sentence[10]$add_label('manual-pos', 'VERB')
print(sentence[10])
#> Token[10]: "seen" → VERB (1.0000)

We can also add a NER (Named Entity Recognition) tag to sentence[4], “UCD”, identifying it as an organization (a university in Dublin).

sentence[4]$add_label('ner', 'ORG')
print(sentence[4])
#> Token[4]: "UCD" → ORG (1.0000)

If we print the sentence object again, Sentence[26] reports the 26 tokens and appends the two annotations we just added: → [“UCD”/ORG, “seen”/VERB].

print(sentence)
#> Sentence[26]: "What I see in UCD today, what I have seen of UCD in its impact on my own life and the life of Ireland." → ["UCD"/ORG, "seen"/VERB]
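
To work with these annotations programmatically, get_labels() collects the labels attached to the sentence and its tokens; value and score are standard attributes of Flair’s Label objects (a sketch):

labs <- sentence$get_labels()
for (i in seq_along(labs)) {
  cat("label:", labs[[i]]$value, "| score:", labs[[i]]$score, "\n")
}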

Corpus

The Corpus object in Flair is a fundamental data structure that represents a dataset of text samples, usually comprising a training set, a development (or validation) set, and a test set. It is designed to work smoothly with Flair’s models for tasks like named entity recognition, text classification, and more.

Attributes:

  • train: A list of Sentence objects forming the training dataset.
  • dev (or development): A list of Sentence objects forming the development (or validation) dataset.
  • test: A list of Sentence objects forming the test dataset.

Important Methods:

  • downsample: This method allows you to downsample (reduce) the number of sentences in the train, dev, and test splits.
  • obtain_statistics: This method gives a quick overview of the statistics of the corpus, including the number of sentences and the distribution of labels.
  • make_vocab_dictionary: Used to create a vocabulary dictionary from the corpus.

library(flaiR)
Corpus <- flair_data()$Corpus
Sentence <- flair_data()$Sentence
# Create some example sentences
train <- list(Sentence('This is a training example.'))
dev <-  list(Sentence('This is a validation example.'))
test <- list(Sentence('This is a test example.'))

# Create a corpus using the custom data splits
corp <-  Corpus(train = train, dev = dev, test = test)

The $obtain_statistics() method of the Corpus object provides an overview of the dataset statistics. It returns a JSON string describing the training, validation (development), and test splits that make up the corpus. In R, you can use the jsonlite package to parse and pretty-print it.

library(jsonlite)
data <- fromJSON(corp$obtain_statistics())
formatted_str <- toJSON(data, pretty=TRUE)
print(formatted_str)
#> {
#>   "TRAIN": {
#>     "dataset": ["TRAIN"],
#>     "total_number_of_documents": [1],
#>     "number_of_documents_per_class": {},
#>     "number_of_tokens_per_tag": {},
#>     "number_of_tokens": {
#>       "total": [6],
#>       "min": [6],
#>       "max": [6],
#>       "avg": [6]
#>     }
#>   },
#>   "TEST": {
#>     "dataset": ["TEST"],
#>     "total_number_of_documents": [1],
#>     "number_of_documents_per_class": {},
#>     "number_of_tokens_per_tag": {},
#>     "number_of_tokens": {
#>       "total": [6],
#>       "min": [6],
#>       "max": [6],
#>       "avg": [6]
#>     }
#>   },
#>   "DEV": {
#>     "dataset": ["DEV"],
#>     "total_number_of_documents": [1],
#>     "number_of_documents_per_class": {},
#>     "number_of_tokens_per_tag": {},
#>     "number_of_tokens": {
#>       "total": [6],
#>       "min": [6],
#>       "max": [6],
#>       "avg": [6]
#>     }
#>   }
#> }

In R

Below, we use data from the article The Temporal Focus of Campaign Communication by Stefan Müller, published in the Journal of Politics in 2020, as an example.

First, we apply the Sentence constructor to each element of cc_muller$text, producing a list of Sentence objects. Then we recode cc_muller$class_pro_retro as a factor. Note that R and Python handle numbers differently: R represents numeric values as doubles by default, so it is advisable to convert class labels into factors or strings before passing them to Python. Lastly, we use the map2 function from the purrr package to attach the corresponding label to each Sentence via the $add_label method.

library(purrr)
#> 
#> Attaching package: 'purrr'
#> The following object is masked from 'package:jsonlite':
#> 
#>     flatten
data(cc_muller)
# the `Sentence` constructor tokenizes each text
text <- lapply(cc_muller$text, Sentence)
# recode the outcome as a factor
labels <- as.factor(cc_muller$class_pro_retro)
# `$add_label` attaches the corresponding class to each Sentence
text <- map2(text, labels, ~ .x$add_label("classification", .y), .progress = TRUE)

To perform a train-test split using base R, we can follow these steps:

set.seed(2046)
sample <- sample(c(TRUE, FALSE), length(text), replace=TRUE, prob=c(0.8, 0.2))
train  <- text[sample]
test   <- text[!sample]
sprintf("Corpus object sizes - Train: %d |  Test: %d", length(train), length(test))
#> [1] "Corpus object sizes - Train: 4710 |  Test: 1148"

If you don’t provide a dev set, Flair will not carve one out of your test set. Instead, when only the train and test sets are supplied, Flair automatically holds out a fraction of the train set (10% by default) to use as a dev set (#2259). This provides a mechanism for model selection and helps prevent overfitting on the train set.

The Corpus constructor selects the dev split randomly. To ensure reproducibility, we need to set the seed in the flair framework: import the top-level module flair and call $set_seed(1964L).

flair <- import_flair()
flair$set_seed(1964L)
corp <- Corpus(train=train, 
                 # dev=test,
                 test=test)
#> 2025-01-19 17:55:55,972 No dev split found. Using 10% (i.e. 471 samples) of the train split as dev data
sprintf("Corpus object sizes - Train: %d | Test: %d | Dev: %d", 
        length(corp$train), 
        length(corp$test),
        length(corp$dev))
#> [1] "Corpus object sizes - Train: 4239 | Test: 1148 | Dev: 471"

Later sections contain more of this kind of Corpus processing; after that, we focus on advanced NLP applications.

 


Sequence Tagging

Tag Entities in Text

Let’s run named entity recognition over the following example sentence: “I love Berlin and New York”. To do this, all you need to do is make a Sentence object for this text, load a pre-trained model and use it to predict tags for the object.

NER Models

| ID | Task | Language | Training Dataset | Accuracy | Contributor / Notes |
|----|------|----------|------------------|----------|---------------------|
| ‘ner’ | NER (4-class) | English | Conll-03 | 93.03 (F1) | |
| ‘ner-fast’ | NER (4-class) | English | Conll-03 | 92.75 (F1) | (fast model) |
| ‘ner-large’ | NER (4-class) | English / Multilingual | Conll-03 | 94.09 (F1) | (large model) |
| ‘ner-pooled’ | NER (4-class) | English | Conll-03 | 93.24 (F1) | (memory inefficient) |
| ‘ner-ontonotes’ | NER (18-class) | English | Ontonotes | 89.06 (F1) | |
| ‘ner-ontonotes-fast’ | NER (18-class) | English | Ontonotes | 89.27 (F1) | (fast model) |
| ‘ner-ontonotes-large’ | NER (18-class) | English / Multilingual | Ontonotes | 90.93 (F1) | (large model) |
| ‘ar-ner’ | NER (4-class) | Arabic | AQMAR & ANERcorp (curated) | 86.66 (F1) | |
| ‘da-ner’ | NER (4-class) | Danish | Danish NER dataset | | AmaliePauli |
| ‘de-ner’ | NER (4-class) | German | Conll-03 | 87.94 (F1) | |
| ‘de-ner-large’ | NER (4-class) | German / Multilingual | Conll-03 | 92.31 (F1) | |
| ‘de-ner-germeval’ | NER (4-class) | German | Germeval | 84.90 (F1) | |
| ‘de-ner-legal’ | NER (legal text) | German | LER dataset | 96.35 (F1) | |
| ‘fr-ner’ | NER (4-class) | French | WikiNER (aij-wikiner-fr-wp3) | 95.57 (F1) | mhham |
| ‘es-ner-large’ | NER (4-class) | Spanish | CoNLL-03 | 90.54 (F1) | mhham |
| ‘nl-ner’ | NER (4-class) | Dutch | CoNLL 2002 | 92.58 (F1) | |
| ‘nl-ner-large’ | NER (4-class) | Dutch | Conll-03 | 95.25 (F1) | |
| ‘nl-ner-rnn’ | NER (4-class) | Dutch | CoNLL 2002 | 90.79 (F1) | |
| ‘ner-ukrainian’ | NER (4-class) | Ukrainian | NER-UK dataset | 86.05 (F1) | dchaplinsky |

Source: https://flairnlp.github.io/docs/tutorial-basics/tagging-entities
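
Any model in the table can be loaded by its ID string. A minimal sketch using the lighter ‘ner-fast’ variant (the model downloads on first use; SequenceTagger is the same class used in the tagging example below):

SequenceTagger <- flair_models()$SequenceTagger
tagger_fast <- SequenceTagger$load('ner-fast')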

POS Models

| ID | Task | Language | Training Dataset | Accuracy | Contributor / Notes |
|----|------|----------|------------------|----------|---------------------|
| ‘pos’ | POS-tagging | English | Ontonotes | 98.19 (Accuracy) | |
| ‘pos-fast’ | POS-tagging | English | Ontonotes | 98.1 (Accuracy) | (fast model) |
| ‘upos’ | POS-tagging (universal) | English | Ontonotes | 98.6 (Accuracy) | |
| ‘upos-fast’ | POS-tagging (universal) | English | Ontonotes | 98.47 (Accuracy) | (fast model) |
| ‘pos-multi’ | POS-tagging | Multilingual | UD Treebanks | 96.41 (average acc.) | (12 languages) |
| ‘pos-multi-fast’ | POS-tagging | Multilingual | UD Treebanks | 92.88 (average acc.) | (12 languages) |
| ‘ar-pos’ | POS-tagging | Arabic (+dialects) | combination of corpora | | |
| ‘de-pos’ | POS-tagging | German | UD German - HDT | 98.50 (Accuracy) | |
| ‘de-pos-tweets’ | POS-tagging | German | German Tweets | 93.06 (Accuracy) | stefan-it |
| ‘da-pos’ | POS-tagging | Danish | Danish Dependency Treebank | | AmaliePauli |
| ‘ml-pos’ | POS-tagging | Malayalam | 30000 Malayalam sentences | 83 | sabiqueqb |
| ‘ml-upos’ | POS-tagging | Malayalam | 30000 Malayalam sentences | 87 | sabiqueqb |
| ‘pt-pos-clinical’ | POS-tagging | Portuguese | PUCPR | 92.39 | LucasFerroHAILab for clinical texts |
| ‘pos-ukrainian’ | POS-tagging | Ukrainian | Ukrainian UD | 97.93 (F1) | dchaplinsky |

Source: https://flairnlp.github.io/docs/tutorial-basics/part-of-speech-tagging
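
POS models load the same way. For instance, ‘upos’ produces Universal POS tags rather than Penn Treebank tags (a sketch; the model downloads on first use):

Classifier <- flair_nn()$Classifier
upos_tagger <- Classifier$load('upos')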

# attach flaiR in R
library(flaiR)

# make a sentence

Sentence <- flair_data()$Sentence
sentence <- Sentence('I love Berlin and New York.')

# load the NER tagger
SequenceTagger <- flair_models()$SequenceTagger
tagger <- SequenceTagger$load('flair/ner-english')  
#> 2025-01-19 17:55:56,601 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>

# run NER over sentence
tagger$predict(sentence)

To print all annotations:

# print the sentence with all annotations
print(sentence)
#> Sentence[7]: "I love Berlin and New York." → ["Berlin"/LOC, "New York"/LOC]

Use a for loop to print each entity annotation. Here, sentence$get_labels() is converted to an R list, so we can iterate over it with R’s 1-based seq_along directly.

for (i in seq_along(sentence$get_labels())) {
  print(sentence$get_labels()[[i]])
}
#> 'Span[2:3]: "Berlin"'/'LOC' (0.9812)
#> 'Span[4:6]: "New York"'/'LOC' (0.9957)
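
Alternatively, entity spans can be retrieved directly with get_spans('ner'); each span bundles the matched tokens with its tag and confidence:

spans <- sentence$get_spans('ner')
for (i in seq_along(spans)) {
  print(spans[[i]])
}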

Tag Part-of-Speech

We use the standard pos model (flair/pos-english on Hugging Face) for POS tagging.

# attach flaiR in R
library(flaiR)

# make a sentence
Sentence <- flair_data()$Sentence
sentence <- Sentence('I love Berlin and New York.')

# load the POS tagger
Classifier <- flair_nn()$Classifier
tagger <- Classifier$load('pos')
#> 2025-01-19 17:55:57,606 SequenceTagger predicts: Dictionary with 53 tags: <unk>, O, UH, ,, VBD, PRP, VB, PRP$, NN, RB, ., DT, JJ, VBP, VBG, IN, CD, NNS, NNP, WRB, VBZ, WDT, CC, TO, MD, VBN, WP, :, RP, EX, JJR, FW, XX, HYPH, POS, RBR, JJS, PDT, NNPS, RBS, AFX, WP$, -LRB-, -RRB-, ``, '', LS, $, SYM, ADD

Penn Treebank POS Tags Reference

| Tag | Description | Example |
|-----|-------------|---------|
| DT | Determiner | the, a, these |
| NN | Noun, singular | cat, tree |
| NNS | Noun, plural | cats, trees |
| NNP | Proper noun, singular | John, London |
| NNPS | Proper noun, plural | Americans |
| VB | Verb, base form | take |
| VBD | Verb, past tense | took |
| VBG | Verb, gerund/present participle | taking |
| VBN | Verb, past participle | taken |
| VBP | Verb, non-3rd person singular present | take |
| VBZ | Verb, 3rd person singular present | takes |
| JJ | Adjective | big |
| RB | Adverb | quickly |
| O | Other | - |
| , | Comma | , |
| . | Period | . |
| : | Colon | : |
| -LRB- | Left bracket | ( |
| -RRB- | Right bracket | ) |
| `` | Opening quotation | `` |
| '' | Closing quotation | '' |
| HYPH | Hyphen | - |
| CD | Cardinal number | 1, 2, 3 |
| IN | Preposition | in, on, at |
| PRP | Personal pronoun | I, you, he |
| PRP$ | Possessive pronoun | my, your |
| UH | Interjection | oh, wow |
| FW | Foreign word | café |
| SYM | Symbol | +, % |

# run POS tagging over the sentence
tagger$predict(sentence)
# print the sentence with all annotations
print(sentence)
#> Sentence[7]: "I love Berlin and New York." → ["I"/PRP, "love"/VBP, "Berlin"/NNP, "and"/CC, "New"/NNP, "York"/NNP, "."/.]

Use a for loop to print out each POS tag.

for (i in seq_along(sentence$get_labels())) {
  print(sentence$get_labels()[[i]])
}
#> 'Token[0]: "I"'/'PRP' (1.0)
#> 'Token[1]: "love"'/'VBP' (1.0)
#> 'Token[2]: "Berlin"'/'NNP' (0.9999)
#> 'Token[3]: "and"'/'CC' (1.0)
#> 'Token[4]: "New"'/'NNP' (1.0)
#> 'Token[5]: "York"'/'NNP' (1.0)
#> 'Token[6]: "."'/'.' (1.0)
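
If you prefer working with data frames, the token-level labels can be collected into one; data_point, value, and score are standard attributes of Flair’s Label objects in recent versions (a sketch):

labs <- sentence$get_labels()
pos_df <- data.frame(
  token = sapply(labs, function(l) l$data_point$text),
  tag   = sapply(labs, function(l) l$value),
  score = sapply(labs, function(l) l$score)
)
pos_df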

Detect Sentiment

Let’s run sentiment analysis over the same sentence to determine whether it is POSITIVE or NEGATIVE.

You can do this with essentially the same code as above. Instead of loading the 'ner' model, you now load the 'sentiment' model:

# attach flaiR in R
library(flaiR)

# make a sentence
Sentence <- flair_data()$Sentence
sentence <- Sentence('I love Berlin and New York.')

# load the Classifier tagger from flair.nn module
Classifier <- flair_nn()$Classifier
tagger <- Classifier$load('sentiment')

# run sentiment analysis over sentence
tagger$predict(sentence)
# print the sentence with all annotations
print(sentence)
#> Sentence[7]: "I love Berlin and New York." → POSITIVE (0.9982)
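
To use the prediction downstream rather than just printing it, read the label off the sentence object; in recent Flair versions, tag and score are convenience properties holding the first label’s value and confidence (a sketch):

# extract the predicted label and its confidence
cat("label:", sentence$tag, "| score:", sentence$score, "\n")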

Dealing with Dataframe

Parts-of-Speech Tagging Across Full DataFrame

You can apply Part-of-Speech (POS) tagging across an entire DataFrame using Flair’s pre-trained models. Let’s walk through an example using the pos-fast model. First, let’s load our required packages and sample data:

library(flaiR)
data(uk_immigration)
uk_immigration <- uk_immigration[1:2,]

For POS tagging, we’ll use Flair’s pre-trained pos-fast model, which offers a good balance between speed and accuracy. For more pre-trained models, see Flair’s POS tagging documentation (https://flairnlp.github.io/docs/tutorial-basics/part-of-speech-tagging). There are two ways to load the POS tagger:

  • Load with tag dictionary display (default):
tagger_pos <- load_tagger_pos("pos-fast")
#> Loading POS tagger model: pos-fast
#> 2025-01-19 17:56:01,256 SequenceTagger predicts: Dictionary with 53 tags: <unk>, O, UH, ,, VBD, PRP, VB, PRP$, NN, RB, ., DT, JJ, VBP, VBG, IN, CD, NNS, NNP, WRB, VBZ, WDT, CC, TO, MD, VBN, WP, :, RP, EX, JJR, FW, XX, HYPH, POS, RBR, JJS, PDT, NNPS, RBS, AFX, WP$, -LRB-, -RRB-, ``, '', LS, $, SYM, ADD
#> 
#> POS Tagger Dictionary:
#> ========================================
#> Total tags: 53
#> ----------------------------------------
#> Special:       <unk>, O, <START>, <STOP> 
#> Nouns:         PRP, PRP$, NN, NNS, NNP, WP, EX, NNPS, WP$ 
#> Verbs:         VBD, VB, VBP, VBG, VBZ, MD, VBN 
#> Adjectives:    JJ, JJR, POS, JJS 
#> Adverbs:       RB, WRB, RBR, RBS 
#> Determiners:   DT, WDT, PDT 
#> Prepositions:  IN, TO 
#> Conjunctions:  CC 
#> Numbers:       CD 
#> Punctuation:   <unk>, ,, ., :, HYPH, -LRB-, -RRB-, ``, '', $, NFP, <START>, <STOP> 
#> Others:        UH, FW, XX, LS, $, SYM, ADD 
#> ----------------------------------------
#> Common POS Tag Meanings:
#> NN*: Nouns (NNP: Proper, NNS: Plural)
#> VB*: Verbs (VBD: Past, VBG: Gerund)
#> JJ*: Adjectives (JJR: Comparative)
#> RB*: Adverbs
#> PRP: Pronouns, DT: Determiners
#> IN: Prepositions, CC: Conjunctions
#> ========================================

This will show you all available POS tags grouped by categories (nouns, verbs, adjectives, etc.).

  • Load without tag display for a cleaner output:
pos_tagger <- load_tagger_pos("pos-fast", show_tags = FALSE)
#> Loading POS tagger model: pos-fast
#> 2025-01-19 17:56:01,616 SequenceTagger predicts: Dictionary with 53 tags: <unk>, O, UH, ,, VBD, PRP, VB, PRP$, NN, RB, ., DT, JJ, VBP, VBG, IN, CD, NNS, NNP, WRB, VBZ, WDT, CC, TO, MD, VBN, WP, :, RP, EX, JJR, FW, XX, HYPH, POS, RBR, JJS, PDT, NNPS, RBS, AFX, WP$, -LRB-, -RRB-, ``, '', LS, $, SYM, ADD

Now we can process our texts:

results <- get_pos(texts = uk_immigration$text,
                   doc_ids = uk_immigration$speaker,
                   show.text_id = TRUE,
                   tagger = pos_tagger)

head(results, n = 10)
#>               doc_id token_id
#>               <char>    <num>
#>  1: Philip Hollobone        0
#>  2: Philip Hollobone        1
#>  3: Philip Hollobone        2
#>  4: Philip Hollobone        3
#>  5: Philip Hollobone        4
#>  6: Philip Hollobone        5
#>  7: Philip Hollobone        6
#>  8: Philip Hollobone        7
#>  9: Philip Hollobone        8
#> 10: Philip Hollobone        9
#>  [text_id column truncated for display: with show.text_id = TRUE, every row
#>  repeats the full text of the speech being tagged]
#>          token    tag  score
#>         <char> <char>  <num>
#>  1:          I    PRP 1.0000
#>  2:      thank    VBP 0.9992
#>  3:        Mr.    NNP 1.0000
#>  4:    Speaker    NNP 1.0000
#>  5:        for     IN 1.0000
#>  6:     giving    VBG 1.0000
#>  7:         me    PRP 1.0000
#>  8: permission     NN 0.9999
#>  9:         to     TO 0.9999
#> 10:       hold     VB 1.0000
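
get_pos() returns its results as a data.table (as the printed column types suggest), so token-level summaries are one-liners. For example, tag frequencies per speaker:

library(data.table)
# count tags per document, most frequent first
results[, .N, by = .(doc_id, tag)][order(doc_id, -N)]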

Tagging Entities Across Full DataFrame

This section focuses on performing Named Entity Recognition (NER) on text stored in a dataframe. The goal is to identify and tag named entities within text organized in a structured dataframe.

I load the flaiR package and use the built-in uk_immigration dataset, which contains UK parliamentary discussions about immigration. For demonstration purposes, I take only the first two rows.

Load a pre-trained NER model. For more pre-trained models, see https://flairnlp.github.io/docs/tutorial-basics/tagging-entities.

library(flaiR)
data(uk_immigration)
uk_immigration <- head(uk_immigration, n = 2)

Next, I load the latest model hosted and maintained on Hugging Face by the Flair NLP team. For more Flair NER models, you can visit the official Flair NLP page on Hugging Face (https://huggingface.co/flair).

# Load model without displaying tags
# tagger <- load_tagger_ner("flair/ner-english-large", show_tags = FALSE)

library(flaiR)
tagger_ner <- load_tagger_ner("flair/ner-english-ontonotes")
#> 2025-01-19 17:56:07,751 SequenceTagger predicts: Dictionary with 75 tags: O, S-PERSON, B-PERSON, E-PERSON, I-PERSON, S-GPE, B-GPE, E-GPE, I-GPE, S-ORG, B-ORG, E-ORG, I-ORG, S-DATE, B-DATE, E-DATE, I-DATE, S-CARDINAL, B-CARDINAL, E-CARDINAL, I-CARDINAL, S-NORP, B-NORP, E-NORP, I-NORP, S-MONEY, B-MONEY, E-MONEY, I-MONEY, S-PERCENT, B-PERCENT, E-PERCENT, I-PERCENT, S-ORDINAL, B-ORDINAL, E-ORDINAL, I-ORDINAL, S-LOC, B-LOC, E-LOC, I-LOC, S-TIME, B-TIME, E-TIME, I-TIME, S-WORK_OF_ART, B-WORK_OF_ART, E-WORK_OF_ART, I-WORK_OF_ART, S-FAC
#> 
#> NER Tagger Dictionary:
#> ========================================
#> Total tags: 75
#> Model: flair/ner-english-ontonotes
#> ----------------------------------------
#> Special        : O, <START>, <STOP>
#> Person         : S-PERSON, B-PERSON, E-PERSON, I-PERSON
#> Organization   : S-ORG, B-ORG, E-ORG, I-ORG
#> Location       : S-GPE, B-GPE, E-GPE, I-GPE, S-LOC, B-LOC, E-LOC, I-LOC
#> Time           : S-DATE, B-DATE, E-DATE, I-DATE, S-TIME, B-TIME, E-TIME, I-TIME
#> Numbers        : S-CARDINAL, B-CARDINAL, E-CARDINAL, I-CARDINAL, S-MONEY, B-MONEY, E-MONEY, I-MONEY, S-PERCENT, B-PERCENT, E-PERCENT, I-PERCENT, S-ORDINAL, B-ORDINAL, E-ORDINAL, I-ORDINAL
#> Groups         : S-NORP, B-NORP, E-NORP, I-NORP
#> Facilities     : S-FAC, B-FAC, E-FAC, I-FAC
#> Products       : S-PRODUCT, B-PRODUCT, E-PRODUCT, I-PRODUCT
#> Events         : S-EVENT, B-EVENT, E-EVENT, I-EVENT
#> Art            : S-WORK_OF_ART, B-WORK_OF_ART, E-WORK_OF_ART, I-WORK_OF_ART
#> Languages      : S-LANGUAGE, B-LANGUAGE, E-LANGUAGE, I-LANGUAGE
#> Laws           : S-LAW, B-LAW, E-LAW, I-LAW
#> ----------------------------------------
#> Tag scheme: BIOES
#> B-: Beginning of multi-token entity
#> I-: Inside of multi-token entity
#> O: Outside (not part of any entity)
#> E-: End of multi-token entity
#> S-: Single token entity
#> ========================================

If you are on a Mac with Apple Silicon (M1/M2), you can set the model to run on the MPS device for faster processing; otherwise Flair falls back to the CPU, as the message in the output below indicates. For other pre-trained models, check the Flair documentation website.

Now I’m ready to process the text:

results <- get_entities(texts = uk_immigration$text,
                        doc_ids = uk_immigration$speaker,
                        tagger = tagger_ner,
                        batch_size = 2,
                        verbose = FALSE)
#> CPU is used.
head(results, n = 10)
#>               doc_id                          entity    tag     score
#>               <char>                          <char> <char>     <num>
#>  1: Philip Hollobone                           today   DATE 0.9843613
#>  2: Philip Hollobone                    Conservative   NORP 0.9976857
#>  3: Philip Hollobone Liberal Democrat Front Benchers    ORG 0.7668477
#>  4: Philip Hollobone                       Kettering    GPE 0.9885774
#>  5: Philip Hollobone                            Sikh   NORP 0.9939976
#>  6: Philip Hollobone                       Kettering    GPE 0.9955219
#>  7: Philip Hollobone                       Kettering    GPE 0.9948049
#>  8: Philip Hollobone             some 40 or 50 years   DATE 0.8059650
#>  9: Philip Hollobone                         British   NORP 0.9986913
#> 10: Philip Hollobone                    recent years   DATE 0.8596769
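
Because the result is a data.table, the extracted entities are easy to filter and sort. A sketch:

library(data.table)
# keep only geopolitical entities
results[tag == "GPE"]
# or rank all recognized entities by model confidence
results[order(-score)]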

 


Highlight Entities with Colors

This tutorial demonstrates how to use the flaiR package to identify and highlight named entities (such as names, locations, organizations) in text.

Step 1 Create Text with Named Entities

First, we load the flaiR package and work with a sample text:

library(flaiR)
data("uk_immigration")
uk_immigration <- uk_immigration[30,]
tagger_ner <- load_tagger_ner("flair/ner-english-fast")
#> 2025-01-19 17:56:16,562 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>
#> 
#> NER Tagger Dictionary:
#> ========================================
#> Total tags: 20
#> Model: flair/ner-english-fast
#> ----------------------------------------
#> Special        : <unk>, O, <START>, <STOP>
#> Organization   : S-ORG, B-ORG, E-ORG, I-ORG
#> Location       : S-LOC, B-LOC, E-LOC, I-LOC
#> Misc           : S-MISC, B-MISC, I-MISC, E-MISC
#> ----------------------------------------
#> Tag scheme: BIOES
#> B-: Beginning of multi-token entity
#> I-: Inside of multi-token entity
#> O: Outside (not part of any entity)
#> E-: End of multi-token entity
#> S-: Single token entity
#> ========================================

result <- get_entities(uk_immigration$text,
                       tagger = tagger_ner,
                       show.text_id = FALSE
                       )
#> CPU is used.
#> Warning in check_texts_and_ids(texts, doc_ids): doc_ids is NULL.
#> Auto-assigning doc_ids.

Step 2 Highlight the Named Entities

Use the highlight_text function to color-code the identified entities:

highlighted_text <- highlight_text(text = uk_immigration$text, 
                                   entities_mapping = map_entities(result))
highlighted_text
I am grateful to the Minister for that most helpful intervention. Perhaps I should bring us back on track. Although asylum is an important issue for all hon. Members and our constituents, the number of asylum claims is small compared with immigration as a whole. I believe that asylum claims are now running at a rate of about 30,000 a year, which is only 10 per cent. of net foreign migration. The big problem in this country is legal immigration, which brings us back to the population projections of 70 million. I understand that a migrant now arrives on our shores every minute. We must build a new home every six minutes for new migrants. Immigration will add 7 million to the population of England (LOC) in the next 15 to 20 years, which is seven times the population of Birmingham (LOC). Immigration directly added a million people to the UK (LOC)'s population in the years 2003 to 2007. There was a net inflow of 2.3 million people to the UK (LOC) between 1991 and 2006, and 8 per cent. came from the new east European (MISC) members of the EU (ORG). I differ from my party in not agreeing with the free movement of people across EU (ORG) borders. Effectively, we have, by agreement of our Government, uncontrolled immigration within the EU (ORG). The Government told us that there would be 13,000 new arrivals from the new entrant eastern European (MISC) EU (ORG) countries, but the figure was approaching 1 million at its peak-perhaps the Minister will confirm what the figure was-which was a world apart from the 13,000 that we were told about, and it has placed huge strain on local infrastructure.

Explanation:

  • load_tagger_ner("flair/ner-english-fast") loads the pre-trained NER model
  • get_entities() identifies named entities in the text
  • map_entities() maps the identified entities to colors
  • highlight_text() marks the original text using these colors

Each type of entity (such as person names, locations, organization names) will be displayed in a different color, making the named entities in the text immediately visible.

 


The Overview of Embedding

All word embedding classes inherit from the TokenEmbeddings class and provide an embed() method that you call to embed your text. The complexity of the embedding process is hidden behind this interface: users simply instantiate the embedding class they need and call embed() on their text.

Here are the types of embeddings currently supported in FlairNLP:

| Class | Type | Paper |
|-------|------|-------|
| BytePairEmbeddings | Subword-level word embeddings | Heinzerling and Strube (2018) |
| CharacterEmbeddings | Task-trained character-level embeddings of words | Lample et al. (2016) |
| ELMoEmbeddings | Contextualized word-level embeddings | Peters et al. (2018) |
| FastTextEmbeddings | Word embeddings with subword features | Bojanowski et al. (2017) |
| FlairEmbeddings | Contextualized character-level embeddings | Akbik et al. (2018) |
| OneHotEmbeddings | Standard one-hot embeddings of text or tags | - |
| PooledFlairEmbeddings | Pooled variant of FlairEmbeddings | Akbik et al. (2019) |
| TransformerWordEmbeddings | Embeddings from pretrained transformers (BERT, XLM, GPT, RoBERTa, XLNet, DistilBERT, etc.) | Devlin et al. (2018); Radford et al. (2018); Liu et al. (2019); Dai et al. (2019); Yang et al. (2019); Lample and Conneau (2019) |
| WordEmbeddings | Classic word embeddings | - |
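
All of these classes share the same embed() interface. As a minimal sketch, classic GloVe vectors can be loaded through WordEmbeddings (the vectors download on first use):

library(flaiR)
WordEmbeddings <- flair_embeddings()$WordEmbeddings
Sentence <- flair_data()$Sentence

# instantiate classic GloVe embeddings and embed a sentence
glove <- WordEmbeddings('glove')
sentence <- Sentence('The grass is green .')
glove$embed(sentence)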

 


Byte Pair Embeddings

Please note that this document for R is a conversion of the Flair NLP document implemented in Python.

BytePairEmbeddings are word embeddings that operate at the subword level. They can embed any word by breaking it down into subwords and looking up their corresponding embeddings. This technique was introduced by Heinzerling and Strube (2018), who demonstrated that BytePairEmbeddings achieve comparable accuracy to traditional word embeddings while requiring only a fraction of the model size. This makes them an excellent choice for training compact models.

To initialize BytePairEmbeddings, you need to specify:

  • A language code (275 languages supported)
  • A subword vocabulary size (the syllables argument)
  • The number of dimensions (options: 50, 100, 200, or 300)

library(flaiR)

# Initialize embedding
BytePairEmbeddings <- flair_embeddings()$BytePairEmbeddings

# Create BytePairEmbeddings with specified parameters
embedding <- BytePairEmbeddings(
    language = "en",        # language code (e.g., "en" for English)
    dim = 50L,              # embedding dimensions: options are 50L, 100L, 200L, or 300L
    syllables = 100000L     # subword vocabulary size
)

# Create a sample sentence
Sentence <- flair_data()$Sentence
sentence = Sentence('The grass is green .')

# Embed words in the sentence
embedding$embed(sentence)
#> [[1]]
#> Sentence[5]: "The grass is green ."

# Print embeddings. Note: each vector has length 2 * dim (here 100), since
# Flair concatenates a word's first and last subword embeddings.
for (i in 1:length(sentence$tokens)) {
    token <- sentence$tokens[[i]]
    cat("\nWord:", token$text, "\n")
    
    # Convert embedding to R vector and print
    # Python index starts from 0, so use i-1
    embedding_vector <- sentence[i-1]$embedding$numpy()
    cat("Embedding shape:", length(embedding_vector), "\n")
    cat("First 5 values:", head(embedding_vector, 5), "\n")
    cat("-------------------\n")
}
#> 
#> Word: The 
#> Embedding shape: 100 
#> First 5 values: -0.585645 0.55233 -0.335385 -0.117119 -0.3433 
#> -------------------
#> 
#> Word: grass 
#> Embedding shape: 100 
#> First 5 values: 0.370427 -0.717806 -0.489089 0.384228 0.68443 
#> -------------------
#> 
#> Word: is 
#> Embedding shape: 100 
#> First 5 values: -0.186592 0.52804 -1.011618 0.416936 -0.166446 
#> -------------------
#> 
#> Word: green 
#> Embedding shape: 100 
#> First 5 values: -0.075467 -0.874228 0.20425 1.061623 -0.246111 
#> -------------------
#> 
#> Word: . 
#> Embedding shape: 100 
#> First 5 values: -0.214652 0.212236 -0.607079 0.512853 -0.325556 
#> -------------------

More information can be found on the byte pair embeddings web page. BytePairEmbeddings also have a multilingual model capable of embedding any word in any language. You can instantiate it with:

embedding <- BytePairEmbeddings('multi')

You can also load custom BytePairEmbeddings by passing paths to the model_file_path and embedding_file_path arguments. These correspond, respectively, to a SentencePiece model file and an embedding file (Word2Vec plain text or Gensim binary).
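
A minimal sketch (the file paths below are placeholders, not shipped files):

# load custom byte pair embeddings; both paths are placeholder assumptions
custom_embedding <- BytePairEmbeddings(
    model_file_path = "path/to/your/sentencepiece.model",
    embedding_file_path = "path/to/your/embeddings.bin"
)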

Flair Embeddings

The following example manual is translated into R from Flair NLP by Zalando Research. In Flair, using embeddings is straightforward. Here’s an example code snippet of how to use Flair’s contextual string embeddings:

library(flaiR)
FlairEmbeddings <- flair_embeddings()$FlairEmbeddings
# init embedding
flair_embedding_forward <- FlairEmbeddings('news-forward')

# create a sentence
Sentence <- flair_data()$Sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
flair_embedding_forward$embed(sentence)
#> [[1]]
#> Sentence[5]: "The grass is green ."

The following pre-trained contextual string embedding models are available, where X in each ID is replaced by forward or backward:

ID Language Embedding
‘multi-X’ 300+ JW300 corpus, as proposed by Agić and Vulić (2019). The corpus is licensed under CC-BY-NC-SA
‘multi-X-fast’ English, German, French, Italian, Dutch, Polish Mix of corpora (Web, Wikipedia, Subtitles, News), CPU-friendly
‘news-X’ English Trained with 1 billion word corpus
‘news-X-fast’ English Trained with 1 billion word corpus, CPU-friendly
‘mix-X’ English Trained with mixed corpus (Web, Wikipedia, Subtitles)
‘ar-X’ Arabic Added by @stefan-it: Trained with Wikipedia/OPUS
‘bg-X’ Bulgarian Added by @stefan-it: Trained with Wikipedia/OPUS
‘bg-X-fast’ Bulgarian Added by @stefan-it: Trained with various sources (Europarl, Wikipedia or SETimes)
‘cs-X’ Czech Added by @stefan-it: Trained with Wikipedia/OPUS
‘cs-v0-X’ Czech Added by @stefan-it: LM embeddings (earlier version)
‘de-X’ German Trained with mixed corpus (Web, Wikipedia, Subtitles)
‘de-historic-ha-X’ German (historical) Added by @stefan-it: Historical German trained over Hamburger Anzeiger
‘de-historic-wz-X’ German (historical) Added by @stefan-it: Historical German trained over Wiener Zeitung
‘de-historic-rw-X’ German (historical) Added by @redewiedergabe: Historical German trained over 100 million tokens
‘es-X’ Spanish Added by @iamyihwa: Trained with Wikipedia
‘es-X-fast’ Spanish Added by @iamyihwa: Trained with Wikipedia, CPU-friendly
‘es-clinical-X’ Spanish (clinical) Added by @matirojasg: Trained with Wikipedia
‘eu-X’ Basque Added by @stefan-it: Trained with Wikipedia/OPUS
‘eu-v0-X’ Basque Added by @stefan-it: LM embeddings (earlier version)
‘fa-X’ Persian Added by @stefan-it: Trained with Wikipedia/OPUS
‘fi-X’ Finnish Added by @stefan-it: Trained with Wikipedia/OPUS
‘fr-X’ French Added by @mhham: Trained with French Wikipedia
‘he-X’ Hebrew Added by @stefan-it: Trained with Wikipedia/OPUS
‘hi-X’ Hindi Added by @stefan-it: Trained with Wikipedia/OPUS
‘hr-X’ Croatian Added by @stefan-it: Trained with Wikipedia/OPUS
‘id-X’ Indonesian Added by @stefan-it: Trained with Wikipedia/OPUS
‘it-X’ Italian Added by @stefan-it: Trained with Wikipedia/OPUS
‘ja-X’ Japanese Added by @frtacoa: Trained with 439M words of Japanese Web crawls (2048 hidden states, 2 layers)
‘nl-X’ Dutch Added by @stefan-it: Trained with Wikipedia/OPUS
‘nl-v0-X’ Dutch Added by @stefan-it: LM embeddings (earlier version)
‘no-X’ Norwegian Added by @stefan-it: Trained with Wikipedia/OPUS
‘pl-X’ Polish Added by @borchmann: Trained with web crawls (Polish part of CommonCrawl)
‘pl-opus-X’ Polish Added by @stefan-it: Trained with Wikipedia/OPUS
‘pt-X’ Portuguese Added by @ericlief: LM embeddings
‘sl-X’ Slovenian Added by @stefan-it: Trained with Wikipedia/OPUS
‘sl-v0-X’ Slovenian Added by @stefan-it: Trained with various sources (Europarl, Wikipedia and OpenSubtitles2018)
‘sv-X’ Swedish Added by @stefan-it: Trained with Wikipedia/OPUS
‘sv-v0-X’ Swedish Added by @stefan-it: Trained with various sources (Europarl, Wikipedia or OpenSubtitles2018)
‘ta-X’ Tamil Added by @stefan-it
‘pubmed-X’ English Added by @jessepeng: Trained with 5% of PubMed abstracts until 2015 (1150 hidden states, 3 layers)
‘de-impresso-hipe-v1-X’ German (historical) In-domain data (Swiss and Luxembourgish newspapers) for CLEF HIPE Shared task. More information on the shared task can be found in this paper
‘en-impresso-hipe-v1-X’ English (historical) In-domain data (Chronicling America material) for CLEF HIPE Shared task. More information on the shared task can be found in this paper
‘fr-impresso-hipe-v1-X’ French (historical) In-domain data (Swiss and Luxembourgish newspapers) for CLEF HIPE Shared task. More information on the shared task can be found in this paper
‘am-X’ Amharic Based on 6.5m Amharic text corpus crawled from different sources. See this paper and the official GitHub Repository for more information.
‘uk-X’ Ukrainian Added by @dchaplinsky: Trained with UberText corpus.

Source: https://github.com/flairNLP/flair/blob/master/resources/docs/embeddings/FLAIR_EMBEDDINGS.md#flair-embeddings

So, if you want to load embeddings from the German forward LM model, instantiate the method as follows:

flair_de_forward <- FlairEmbeddings('de-forward')

And if you want to load embeddings from the Bulgarian backward LM model, instantiate the method as follows:

flair_bg_backward <- FlairEmbeddings('bg-backward')

 


We recommend combining both forward and backward Flair embeddings. Depending on the task, we also recommend adding standard word embeddings into the mix. So, our recommended StackedEmbedding for most English tasks is:

FlairEmbeddings <- flair_embeddings()$FlairEmbeddings
WordEmbeddings <- flair_embeddings()$WordEmbeddings
StackedEmbeddings <- flair_embeddings()$StackedEmbeddings

# create a StackedEmbedding object that combines glove and forward/backward flair embeddings
stacked_embeddings <- StackedEmbeddings(list(WordEmbeddings("glove"),
                                             FlairEmbeddings("news-forward"),
                                             FlairEmbeddings("news-backward")))

That’s it! Now just use this embedding like all the other embeddings, i.e. call the embed() method over your sentences.

# create a sentence
Sentence <- flair_data()$Sentence
sentence = Sentence('The grass is green .')
# just embed a sentence using the StackedEmbedding as you would with any single embedding.
stacked_embeddings$embed(sentence)
# now check out the embedded tokens.
# Note that Python is indexing from 0. In an R for loop, using seq_along(sentence) - 1 achieves the same effect.
for (i in  seq_along(sentence)-1) {
  print(sentence[i])
  print(sentence[i]$embedding)
}
#> Token[0]: "The"
#> tensor([-0.0382, -0.2449,  0.7281,  ..., -0.0065, -0.0053,  0.0090])
#> Token[1]: "grass"
#> tensor([-0.8135,  0.9404, -0.2405,  ...,  0.0354, -0.0255, -0.0143])
#> Token[2]: "is"
#> tensor([-5.4264e-01,  4.1476e-01,  1.0322e+00,  ..., -5.3691e-04,
#>         -9.6750e-03, -2.7541e-02])
#> Token[3]: "green"
#> tensor([-0.6791,  0.3491, -0.2398,  ..., -0.0007, -0.1333,  0.0161])
#> Token[4]: "."
#> tensor([-0.3398,  0.2094,  0.4635,  ...,  0.0005, -0.0177,  0.0032])

Words are now embedded using a concatenation of three different embeddings. This combination often gives state-of-the-art accuracy.
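
As a quick sanity check, the stacked vector concatenates GloVe (100 dimensions) with the forward and backward Flair embeddings (2048 dimensions each), so each token embedding should have 100 + 2048 + 2048 = 4196 entries:

# stacked embedding length: 100 (GloVe) + 2048 (forward) + 2048 (backward) = 4196
length(sentence[0]$embedding$numpy())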

 


Pooled Flair Embeddings

We also developed a pooled variant of the FlairEmbeddings. These embeddings differ in that they constantly evolve over time, even at prediction time (i.e. after training is complete). This means that the same words in the same sentence at two different points in time may have different embeddings.

PooledFlairEmbeddings manage a ‘global’ representation of each distinct word by using a pooling operation over all past occurrences. More details on how this works may be found in Akbik et al. (2019).

You can instantiate and use PooledFlairEmbeddings like any other embedding:

# initiate embedding from Flair NLP
PooledFlairEmbeddings <- flair_embeddings()$PooledFlairEmbeddings
flair_embedding_forward <- PooledFlairEmbeddings('news-forward')

# create a sentence object
sentence <- Sentence('The grass is green .')

# embed words in sentence
flair_embedding_forward$embed(sentence)
#> [[1]]
#> Sentence[5]: "The grass is green ."

Note that while we get some of our best results with PooledFlairEmbeddings, they are very inefficient memory-wise since they keep past embeddings of all words in memory. In many cases, regular FlairEmbeddings will be nearly as good but with much lower memory requirements.

 


Transformer Embeddings

Please note that content and examples in this section have been extensively revised from the TransformerWordEmbeddings official documentation. Flair supports various Transformer-based architectures like BERT or XLNet from HuggingFace, with two classes TransformerWordEmbeddings (to embed words or tokens) and TransformerDocumentEmbeddings (to embed documents).

 

Embedding Words with Transformers

For instance, to load a standard BERT transformer model, do:

library(flaiR)
# initiate embedding and load BERT model from Hugging Face
TransformerWordEmbeddings <- flair_embeddings()$TransformerWordEmbeddings
embedding <- TransformerWordEmbeddings('bert-base-uncased')

# create a sentence
Sentence <- flair_data()$Sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding$embed(sentence)
#> [[1]]
#> Sentence[5]: "The grass is green ."

If instead you want to use RoBERTa, do:

TransformerWordEmbeddings <- flair_embeddings()$TransformerWordEmbeddings
embedding <- TransformerWordEmbeddings('roberta-base')
sentence <- Sentence('The grass is green .')
embedding$embed(sentence)
#> [[1]]
#> Sentence[5]: "The grass is green ."

{flaiR} interacts with Flair NLP (Zalando Research), allowing you to use pre-trained models from Hugging Face, where you can search for models to use.
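
For example, any model id you find there can be passed directly to the constructor (a sketch; distilbert-base-uncased is one such publicly hosted model):

# load a model found on Hugging Face by its id
embedding <- TransformerWordEmbeddings('distilbert-base-uncased')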

Embedding Documents with Transformers

To embed a whole sentence as one (instead of each word in the sentence), simply use the TransformerDocumentEmbeddings instead:

TransformerDocumentEmbeddings <- flair_embeddings()$TransformerDocumentEmbeddings
embedding <- TransformerDocumentEmbeddings('roberta-base')
sentence <- Sentence('The grass is green .')
embedding$embed(sentence)
#> [[1]]
#> Sentence[5]: "The grass is green ."

Arguments

There are several options that you can set when you initialize the TransformerWordEmbeddings and TransformerDocumentEmbeddings classes:

Argument Default Description
model bert-base-uncased The string identifier of the transformer model you want to use (see above)
layers all Defines the layers of the Transformer-based model that produce the embedding
subtoken_pooling first See Pooling operation section.
layer_mean True See Layer mean section.
fine_tune False Whether or not embeddings are fine-tuneable.
allow_long_sentences True Whether or not texts longer than maximal sequence length are supported.
use_context False Set to True to include context outside of sentences. This can greatly increase accuracy on some tasks, but slows down embedding generation.
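
For instance (a sketch based on the arguments above), enabling use_context trades embedding speed for potentially higher accuracy on some tasks:

# enable document-level context outside the sentence (slower, can be more accurate)
embedding <- TransformerWordEmbeddings('bert-base-uncased', use_context = TRUE)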

Layers

The layers argument controls which transformer layers are used for the embedding. If you set this value to ‘-1,-2,-3,-4’, the top 4 layers are used to make an embedding. If you set it to ‘-1’, only the last layer is used. If you set it to “all”, then all layers are used. This affects the length of an embedding, since layers are just concatenated.

Sentence <- flair_data()$Sentence
TransformerWordEmbeddings <- flair_embeddings()$TransformerWordEmbeddings
sentence = Sentence('The grass is green.')

# use only the last layer
embeddings <- TransformerWordEmbeddings('bert-base-uncased', layers='-1', layer_mean = FALSE)
embeddings$embed(sentence)
#> [[1]]
#> Sentence[5]: "The grass is green."
print(sentence[0]$embedding$size())
#> torch.Size([768])


sentence$clear_embeddings()
# use last two layers
embeddings <- TransformerWordEmbeddings('bert-base-uncased', layers='-1,-2', layer_mean = FALSE)
embeddings$embed(sentence)
#> [[1]]
#> Sentence[5]: "The grass is green."
print(sentence[0]$embedding$size())
#> torch.Size([1536])

sentence$clear_embeddings()
# use ALL layers
embeddings = TransformerWordEmbeddings('bert-base-uncased', layers='all', layer_mean=FALSE)
embeddings$embed(sentence)
#> [[1]]
#> Sentence[5]: "The grass is green."
print(sentence[0]$embedding$size())
#> torch.Size([9984])

Here’s an example of how it might be done. You can import torch directly through reticulate, since it was already installed in Python as a dependency of flair.

# import the torch module that ships with the flair installation
torch <- reticulate::import('torch')

# Creating torch.Size objects with integer dimensions
torch$Size(list(768L))
#> torch.Size([768])
torch$Size(list(1536L))
#> torch.Size([1536])
torch$Size(list(9984L))
#> torch.Size([9984])

Notice the L after the numbers in the list? It ensures that R treats the numbers as integers. If you generate these numbers dynamically (e.g., through computation), make sure they are integers before creating the size object. Also note that the size of the embedding increases with the number of layers used (but ONLY if layer_mean is set to FALSE; otherwise the length is always the same).
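
For example (a sketch; the arithmetic mirrors the BERT sizes shown above):

# 13 layers (12 transformer layers plus the embedding layer) of 768 dimensions
# each, concatenated, give the 9984 seen above
n_layers <- 13
dims_per_layer <- 768
total <- n_layers * dims_per_layer    # this product is a double in R
torch$Size(list(as.integer(total)))   # coerce to integer before passing to Python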

Pooling Operation

Most of the Transformer-based models use subword tokenization. E.g. the following token puppeteer could be tokenized into the subwords: pupp, ##ete and ##er.

We implement different pooling operations for these subwords to generate the final token representation:

  • first: only the embedding of the first subword is used
  • last: only the embedding of the last subword is used
  • first_last: embeddings of the first and last subwords are concatenated and used
  • mean: a torch.mean over all subword embeddings is calculated and used

You can choose which one to use by passing this in the constructor:

# use first and last subtoken for each word
embeddings = TransformerWordEmbeddings('bert-base-uncased', subtoken_pooling='first_last')
embeddings$embed(sentence)
#> [[1]]
#> Sentence[5]: "The grass is green."
print(sentence[0]$embedding$size())
#> torch.Size([9984])

Layer Mean

The Transformer-based models have a certain number of layers. By default, all layers you select are concatenated as explained above. Alternatively, you can set layer_mean=True to do a mean over all selected layers. The resulting vector will then always have the same dimensionality as a single layer:

# initiate embedding from transformer. This model will be downloaded from Flair NLP huggingface.
embeddings <- TransformerWordEmbeddings('bert-base-uncased', layers="all", layer_mean=TRUE)

# create a sentence object
sentence = Sentence("The Oktoberfest is the world's largest Volksfest .")

# embed words in sentence
embeddings$embed(sentence)
#> [[1]]
#> Sentence[9]: "The Oktoberfest is the world's largest Volksfest ."
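
With layer_mean=TRUE, the resulting vector keeps the width of a single layer; for bert-base-uncased that is 768. A quick check (a sketch):

# the mean over all layers keeps a single layer's dimensionality (768 for BERT base)
print(sentence[0]$embedding$size())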

Fine-tuneable or Not

In some setups, you may wish to fine-tune the transformer embeddings. In this case, set fine_tune=True in the init method. When fine-tuning, you should also only use the topmost layer, so it is best to set layers='-1'.

# fine-tuneable embeddings using only the topmost layer
TransformerWordEmbeddings <- flair_embeddings()$TransformerWordEmbeddings
embeddings <- TransformerWordEmbeddings('bert-base-uncased', fine_tune=TRUE, layers='-1')
embeddings$embed(sentence)
#> [[1]]
#> Sentence[9]: "The Oktoberfest is the world's largest Volksfest ."

This will print a tensor that now has a gradient function and can be fine-tuned if you use it in a training routine.

print(sentence[0]$embedding)
#> tensor([-6.5871e-01,  1.0410e-01,  3.4632e-01, -3.3775e-01, -2.1013e-01,
#>         -1.3036e-02,  5.1998e-01,  1.6574e+00, -5.2521e-02, -4.8632e-02,
#>         -7.8968e-01, -9.5547e-01, -1.9723e-01,  9.4999e-01, -1.0337e+00,
#>          8.6668e-02,  9.8104e-02,  5.6511e-02,  3.1075e-02,  2.4157e-01,
#>         -1.1427e-01, -2.3692e-01, -2.0700e-01,  7.7985e-01,  2.5460e-01,
#>         -5.0831e-03, -2.4110e-01,  2.2436e-01, -7.3249e-02, -8.1094e-01,
#>         -1.8778e-01,  2.1219e-01, -5.9514e-01,  6.3129e-02, -4.8880e-01,
#>         -3.2300e-02, -1.9125e-02, -1.0991e-01, -1.5604e-02,  4.3068e-01,
#>         -1.7968e-01, -5.4499e-01,  7.0608e-01, -4.0512e-01,  1.7761e-01,
#>         -8.5820e-01,  2.3438e-02, -1.4981e-01, -9.0368e-01, -2.1097e-01,
#>         -3.3535e-01,  1.4920e-01, -7.4529e-03,  1.0239e+00, -6.1777e-02,
#>          3.3913e-01,  8.5811e-02,  6.9401e-01, -7.7483e-02,  3.1484e-01,
#>         -4.3921e-01,  1.2933e+00,  5.7995e-03, -7.0992e-01,  2.7525e-01,
#>          8.8792e-01,  2.6303e-03,  1.3640e+00,  5.6885e-01, -2.4904e-01,
#>         -4.5158e-02, -1.7575e-01, -3.4730e-01,  5.8362e-02, -2.0346e-01,
#>         -1.2505e+00, -3.0592e-01, -3.6105e-02, -2.4066e-01, -5.1250e-01,
#>          2.6930e-01,  1.4068e-01,  3.4056e-01,  7.3297e-01,  2.6848e-01,
#>          2.4303e-01, -9.4885e-01, -9.0367e-01, -1.3184e-01,  6.7348e-01,
#>         -3.2994e-02,  4.7660e-01, -7.1617e-03, -3.4141e-01,  6.8473e-01,
#>         -4.4869e-01, -4.9831e-01, -8.0143e-01,  1.4073e+00,  5.3251e-01,
#>          2.4643e-01, -4.2529e-01,  9.1615e-02,  6.4496e-01,  1.7931e-01,
#>         -2.1473e-01,  1.5447e-01, -3.2978e-01,  1.0799e-01, -1.9402e+00,
#>         -5.0380e-01, -2.7636e-01, -1.1227e-01,  1.1576e-01,  2.5885e-01,
#>         -1.7916e-01,  6.6166e-01, -9.6098e-01, -5.1242e-01, -3.5424e-01,
#>          2.1383e-01,  6.6456e-01,  2.5498e-01,  3.7250e-01, -1.1821e+00,
#>         -4.9551e-01, -2.0858e-01,  1.1511e+00, -1.0366e-02, -1.0682e+00,
#>          3.7277e-01,  6.4048e-01,  2.3308e-01, -9.3824e-01,  9.5015e-02,
#>          5.7904e-01,  6.3969e-01,  8.2359e-02, -1.4075e-01,  3.0107e-01,
#>          3.5827e-03, -4.4684e-01, -2.6913e+00, -3.3933e-01,  2.8731e-03,
#>         -1.3639e-01, -7.1054e-01, -1.1048e+00,  2.2374e-01,  1.1830e-01,
#>          4.8416e-01, -2.9110e-01, -6.7650e-01,  2.3202e-01, -1.0123e-01,
#>         -1.9174e-01,  4.9959e-02,  5.2067e-01,  1.3272e+00,  6.8250e-01,
#>          5.5332e-01, -1.0886e+00,  4.5160e-01, -1.5010e-01, -9.8074e-01,
#>          8.5110e-02,  1.6498e-01,  6.6032e-01,  1.0815e-02,  1.8952e-01,
#>         -5.6607e-01, -1.3743e-02,  9.1171e-01,  2.7812e-01,  2.9551e-01,
#>         -3.5637e-01,  3.2030e-01,  5.6738e-01, -1.5707e-01,  3.5326e-01,
#>         -4.7747e-01,  7.8646e-01,  1.3765e-01,  2.2440e-01,  4.2422e-01,
#>         -2.6504e-01,  2.2014e-02, -6.7154e-01, -8.7998e-02,  1.4284e-01,
#>          4.0983e-01,  1.0933e-02, -1.0704e+00, -1.9350e-01,  6.0051e-01,
#>          5.0545e-02,  1.1434e-02, -8.0243e-01, -6.6871e-01,  5.3953e-01,
#>         -5.9856e-01, -1.6915e-01, -3.5307e-01,  4.4568e-01, -7.2761e-01,
#>          1.1629e+00, -3.1553e-01, -7.9747e-01, -2.0582e-01,  3.7320e-01,
#>          5.9379e-01, -3.1898e-01, -1.6932e-01, -6.2492e-01,  5.7047e-01,
#>         -2.9779e-01, -5.9106e-01,  8.5436e-02, -2.1839e-01, -2.2214e-01,
#>          7.9233e-01,  8.0537e-01, -5.9785e-01,  4.0474e-01,  3.9266e-01,
#>          5.8169e-01, -5.2506e-01,  6.9786e-01,  1.1163e-01,  8.7435e-02,
#>          1.7549e-01,  9.1438e-02,  5.8816e-01,  6.4338e-01, -2.7138e-01,
#>         -5.3449e-01, -1.0168e+00, -5.1338e-02,  3.0099e-01, -7.6696e-02,
#>         -2.1126e-01,  5.8143e-01,  1.3599e-01,  6.2759e-01, -6.2810e-01,
#>          5.9966e-01,  3.5836e-01, -3.0706e-02,  1.5563e-01, -1.4016e-01,
#>         -2.0155e-01, -1.3755e+00, -9.1876e-02, -6.9892e-01,  7.9438e-02,
#>         -4.2926e-01,  3.7988e-01,  7.6741e-01,  5.3094e-01,  8.5981e-01,
#>          4.4185e-02, -6.3507e-01,  3.9587e-01, -3.6635e-01, -7.0770e-01,
#>          8.3675e-04, -3.0055e-01,  2.1360e-01, -4.1649e-01,  6.9457e-01,
#>         -6.2715e-01, -5.1101e-01,  3.0331e-01, -2.3804e+00, -1.0566e-02,
#>         -9.4488e-01,  4.3317e-02,  2.4188e-01,  1.9204e-02,  1.5718e-03,
#>         -3.0374e-01,  3.1933e-01, -7.4432e-01,  1.4599e-01, -5.2102e-01,
#>         -5.2269e-01,  1.3274e-01, -2.8936e-01,  4.1706e-02,  2.6143e-01,
#>         -4.4796e-01,  7.3136e-01,  6.3893e-02,  4.7398e-01, -5.1062e-01,
#>         -1.3705e-01,  2.0763e-01, -3.9115e-01,  2.8822e-01, -3.5283e-01,
#>          3.4881e-02, -3.3602e-01,  1.7210e-01,  1.3537e-02, -5.3036e-01,
#>          1.2847e-01, -4.5576e-01, -3.7251e-01, -3.2254e+00, -3.1650e-01,
#>         -2.6144e-01, -9.4983e-02,  2.7651e-02, -2.3750e-01,  3.1001e-01,
#>          1.1428e-01, -1.2870e-01, -4.7496e-01,  4.4594e-01, -3.6138e-01,
#>         -3.1009e-01, -9.9613e-02,  5.3968e-01,  1.2840e-02,  1.4507e-01,
#>         -2.5181e-01,  1.9310e-01,  4.1073e-01,  5.9776e-01, -2.5585e-01,
#>          5.7184e-02, -5.1505e-01, -6.8708e-02,  4.7767e-01, -1.2078e-01,
#>         -5.0894e-01, -9.2884e-01,  7.8471e-01,  2.0216e-01,  4.3243e-01,
#>          3.2803e-01, -1.0122e-01,  3.3529e-01, -1.2183e-01, -5.5060e-01,
#>          3.5427e-01,  7.4558e-02, -3.1411e-01, -1.7512e-01,  2.2485e-01,
#>          4.2295e-01,  7.7110e-02,  1.8063e+00,  7.6634e-03, -1.1083e-02,
#>         -2.8603e-02,  7.7143e-02,  8.2345e-02,  8.0272e-02, -1.1858e+00,
#>          2.0523e-01,  3.4053e-01,  2.0424e-01, -2.0574e-02,  3.0466e-01,
#>         -2.1858e-01,  6.3737e-01, -5.6264e-01,  1.4153e-01,  2.4319e-01,
#>         -5.6688e-01,  7.2375e-02, -2.9329e-01,  4.6561e-02,  1.8977e-01,
#>          2.4977e-01,  9.1892e-01,  1.1346e-01,  3.8588e-01, -3.5543e-01,
#>         -1.3380e+00, -8.5644e-01, -5.5443e-01, -7.2317e-01, -2.9225e-01,
#>         -1.4389e-01,  6.9715e-01, -5.9852e-01, -6.8932e-01, -6.0952e-01,
#>          1.8234e-01, -7.5841e-02,  3.6445e-01, -3.8286e-01,  2.6545e-01,
#>         -2.6569e-01, -4.9999e-01, -3.8354e-01, -2.2809e-01,  8.8314e-01,
#>          2.9041e-01,  5.4803e-01, -1.0668e+00,  4.7406e-01,  7.8804e-02,
#>         -1.1559e+00, -3.0649e-01,  6.0479e-02, -7.1279e-01, -4.3335e-01,
#>         -8.2428e-04, -1.0236e-01,  3.5497e-01,  1.8665e-01,  1.2045e-01,
#>          1.2071e-01,  6.2911e-01,  3.1421e-01, -2.1635e-01, -8.9416e-01,
#>          6.6361e-01, -9.2981e-01,  6.9193e-01, -2.5403e-01, -2.5835e-02,
#>          1.2342e+00, -6.5908e-01,  7.5741e-01,  2.9014e-01,  3.0760e-01,
#>         -1.0249e+00, -2.7089e-01,  4.6132e-01,  6.1510e-02,  2.5385e-01,
#>         -5.2075e-01, -3.5107e-01,  3.3694e-01, -2.5047e-01, -2.7855e-01,
#>          2.0280e-01, -1.5703e-01,  4.1619e-02,  1.4451e-01, -1.6666e-01,
#>         -3.0519e-01, -9.4271e-02, -1.7083e-01,  5.2454e-01,  2.4524e-01,
#>          2.0731e-01,  3.7948e-01,  9.7358e-02, -3.2452e-02,  5.5792e-01,
#>         -2.4703e-01,  5.2864e-01,  5.6343e-01, -1.9198e-01, -8.3369e-02,
#>         -6.5377e-01, -5.4104e-01,  1.8289e-01, -4.9146e-01,  6.6422e-01,
#>         -5.2808e-01, -1.4797e-01, -4.5527e-02, -3.9593e-01,  1.2841e-01,
#>         -7.8591e-01, -3.7563e-02,  6.1912e-01,  3.2458e-01,  3.7858e-01,
#>          1.8744e-01, -5.0738e-01,  8.0223e-02, -3.1468e-02, -1.5145e-01,
#>          1.6657e-01, -5.2251e-01, -2.5940e-01, -3.8505e-01, -7.4941e-02,
#>          3.9530e-01, -2.1742e-01, -1.7113e-01, -5.2492e-01, -7.7780e-02,
#>         -6.9759e-01,  2.2570e-01, -1.2935e-01,  3.0749e-01, -1.3554e-01,
#>          6.0182e-02, -1.1479e-01,  4.7263e-01,  3.7957e-01,  8.9523e-01,
#>         -3.6411e-01, -6.6355e-01, -7.6647e-01, -1.4479e+00, -5.2238e-01,
#>          2.3336e-02, -4.5736e-01,  5.9981e-01,  6.8699e-01,  4.2190e-02,
#>          1.5894e-01,  2.0743e-02,  9.2333e-02, -7.2747e-01,  1.2388e-01,
#>         -4.7257e-01, -2.9889e-01,  4.8955e-01, -9.1618e-01, -1.9497e-01,
#>         -1.4157e-01, -1.7472e-01,  4.9251e-02, -2.2263e-01,  6.1700e-01,
#>         -2.4691e-01,  6.0936e-01,  3.6134e-01,  4.3398e-01, -2.7615e-01,
#>         -2.6582e-01, -1.3132e-01, -4.4156e-02,  5.3686e-01,  1.2956e-01,
#>         -6.4218e-01, -1.5820e-01, -1.0249e+00, -9.3591e-03, -3.5060e-01,
#>          3.6650e-01,  4.9503e-01,  7.4325e-01,  9.6526e-02,  4.3141e-01,
#>          3.9511e-02, -7.0727e-02,  6.2696e-01,  1.3066e-01,  1.0243e-01,
#>          3.3839e-01,  1.9224e-01,  4.8800e-01, -2.1052e-01,  3.9523e-02,
#>          7.7568e-01, -1.2005e-01, -1.1262e-01,  8.7001e-02,  2.7273e-01,
#>         -4.6830e-02, -2.4966e-01, -3.2083e-01, -2.6389e-01,  1.6225e-01,
#>          2.8800e-01, -1.0799e-01, -1.0841e-01,  6.6873e-01,  3.4369e-01,
#>          5.8675e-01,  9.2084e-01, -1.8131e-01,  5.6371e-02, -5.7125e-01,
#>          3.1048e-01,  3.1629e-02,  1.2097e+00,  4.4492e-01, -2.3792e-01,
#>         -9.9342e-02, -5.0657e-01, -3.1333e-02,  1.5045e-01,  3.1493e-01,
#>         -4.1287e-01, -1.8618e-01, -4.2639e-02,  1.8266e+00,  4.8565e-01,
#>          6.3892e-01, -2.9107e-01, -3.2557e-01,  1.1088e-01, -1.3213e+00,
#>          7.1113e-01,  2.3618e-01,  2.1473e-01,  1.6360e-01, -5.2535e-01,
#>          3.4322e-01,  9.0777e-01,  1.8697e-01, -3.0531e-01,  2.7574e-01,
#>          5.1452e-01, -2.6733e-01,  2.4208e-01, -3.3234e-01,  6.3520e-01,
#>          2.5884e-01, -5.7923e-01,  3.0204e-01,  4.1746e-02,  4.7538e-02,
#>         -6.7038e-01,  4.6699e-01, -1.6951e-01, -1.5161e-01, -1.2805e-01,
#>         -4.3990e-01,  1.0177e+00, -3.8138e-01,  4.3114e-01, -7.5444e-03,
#>          2.7385e-01,  4.6314e-01, -8.6565e-02, -7.9458e-01,  1.4370e-02,
#>          2.6016e-01,  9.2574e-03,  9.3968e-01,  7.9679e-01,  3.3147e-03,
#>         -5.6733e-01,  2.9052e-01, -9.5894e-02,  1.8630e-01,  1.4475e-01,
#>          1.8935e-01,  5.1735e-01, -1.2187e+00, -1.3298e-01, -4.3538e-01,
#>         -6.5398e-01, -2.9286e-01,  1.3199e-01,  3.9075e-01,  9.0172e-01,
#>          9.9439e-01,  6.2783e-01, -1.6103e-01,  1.4153e-03, -9.1476e-01,
#>          7.7760e-01,  1.2264e+00,  8.1482e-02,  6.6732e-01, -7.4576e-01,
#>         -1.0470e-01, -6.7781e-01,  8.0405e-01,  3.6676e-02,  3.6362e-01,
#>          4.4962e-01,  8.9600e-01, -1.8275e+00,  6.7828e-01, -9.4118e-03,
#>          3.8665e-01, -2.2149e-02,  7.4756e-02,  3.7438e-01, -1.2696e-01,
#>         -5.3397e-01, -3.5782e-01,  3.0400e-01,  7.7663e-01, -1.9122e-01,
#>         -1.3041e-01, -2.1522e-01,  1.1086e+00,  1.0237e+00, -4.7553e-02,
#>         -3.9538e-01,  1.1568e+00, -4.2549e-01, -2.5641e-02,  2.1993e-01,
#>         -4.7488e-01, -7.7624e-02, -5.5211e-01, -5.3169e-01, -5.3790e-02,
#>         -6.0536e-01,  4.2789e-01, -3.8606e-01,  9.8630e-01,  4.3331e-01,
#>          4.8414e-01, -1.3519e-01, -6.5505e-01, -2.2913e-01, -3.1254e-01,
#>          1.2920e-01, -7.7761e-02, -3.1123e-01,  8.2576e-01,  8.6486e-01,
#>         -3.4766e-01, -3.8491e-01,  3.5732e-02,  3.7518e-01, -3.7511e-01,
#>          5.2371e-01, -7.9721e-01,  3.3401e-01,  8.3976e-01, -3.2525e-01,
#>         -3.0268e-01, -1.3558e-01,  2.2812e-01,  1.5632e-01,  3.1584e-01,
#>          9.3903e-02, -3.8647e-01, -1.0177e-01, -2.8833e-01,  3.6028e-01,
#>          2.2565e-01, -1.5595e-01, -4.4974e-01, -5.0904e-01,  4.5058e-01,
#>          7.9031e-01,  2.7041e-01, -3.6712e-01, -3.9090e-01,  2.3358e-01,
#>          1.2162e+00, -1.1371e+00, -8.2702e-01, -9.2749e-02,  5.8958e-01,
#>          4.4429e-02, -2.3344e-01, -5.6492e-01,  4.9407e-01, -4.0301e-01,
#>          5.0950e-01, -1.6741e-01, -4.0176e+00, -8.2092e-01, -3.9132e-01,
#>         -2.9754e-01, -2.6798e-01, -2.5174e-01,  6.6283e-01, -5.7531e-02,
#>          7.7359e-01,  2.5238e-01,  2.5732e-02,  1.7694e-01,  9.4648e-02,
#>          2.6886e-01,  9.3711e-01, -8.3930e-02])

More Models

Please have a look at the awesome HuggingFace for all supported pre-trained models!

 


Classic Word Embeddings

Classic word embeddings are static and word-level, meaning that each distinct word gets exactly one pre-computed embedding. Most embeddings fall under this class, including the popular GloVe or Komninos embeddings.

Simply instantiate the WordEmbeddings class and pass a string identifier of the embedding you wish to load. So, if you want to use GloVe embeddings, pass the string ‘glove’ to the constructor:

library(flaiR)
# initiate embedding with glove
WordEmbeddings <- flair_embeddings()$WordEmbeddings
glove_embedding <-  WordEmbeddings('glove')

Now, create an example sentence and call the embedding’s embed() method. You can also pass a list of sentences to this method since some embedding types make use of batching to increase speed.

library(flaiR)
# initiate a sentence object
Sentence <- flair_data()$Sentence

# create sentence object.
sentence = Sentence('The grass is green .')

# embed a sentence using glove.
glove_embedding$embed(sentence)
#> [[1]]
#> Sentence[5]: "The grass is green ."

The following loop prints the tokens and their embeddings. GloVe embeddings are PyTorch vectors of dimensionality 100.

# view embedded tokens.
for (token in seq_along(sentence)-1) {
  print(sentence[token])
  print(sentence[token]$embedding$numpy())
}
#> Token[0]: "The"
#>   [1] -0.038194 -0.244870  0.728120 -0.399610  0.083172  0.043953
#>   [7] -0.391410  0.334400 -0.575450  0.087459  0.287870 -0.067310
#>  [13]  0.309060 -0.263840 -0.132310 -0.207570  0.333950 -0.338480
#>  [19] -0.317430 -0.483360  0.146400 -0.373040  0.345770  0.052041
#>  [25]  0.449460 -0.469710  0.026280 -0.541550 -0.155180 -0.141070
#>  [31] -0.039722  0.282770  0.143930  0.234640 -0.310210  0.086173
#>  [37]  0.203970  0.526240  0.171640 -0.082378 -0.717870 -0.415310
#>  [43]  0.203350 -0.127630  0.413670  0.551870  0.579080 -0.334770
#>  [49] -0.365590 -0.548570 -0.062892  0.265840  0.302050  0.997750
#>  [55] -0.804810 -3.024300  0.012540 -0.369420  2.216700  0.722010
#>  [61] -0.249780  0.921360  0.034514  0.467450  1.107900 -0.193580
#>  [67] -0.074575  0.233530 -0.052062 -0.220440  0.057162 -0.158060
#>  [73] -0.307980 -0.416250  0.379720  0.150060 -0.532120 -0.205500
#>  [79] -1.252600  0.071624  0.705650  0.497440 -0.420630  0.261480
#>  [85] -1.538000 -0.302230 -0.073438 -0.283120  0.371040 -0.252170
#>  [91]  0.016215 -0.017099 -0.389840  0.874240 -0.725690 -0.510580
#>  [97] -0.520280 -0.145900  0.827800  0.270620
#> Token[1]: "grass"
#>   [1] -0.8135300  0.9404200 -0.2404800 -0.1350100  0.0556780  0.3362500
#>   [7]  0.0802090 -0.1014800 -0.5477600 -0.3536500  0.0733820  0.2586800
#>  [13]  0.1986600 -0.1432800  0.2507000  0.4281400  0.1949800  0.5345600
#>  [19]  0.7424100  0.0578160 -0.3178100  0.9435900  0.8145000 -0.0823750
#>  [25]  0.6165800  0.7284400 -0.3262300 -1.3641000  0.1232000  0.5372800
#>  [31] -0.5122800  0.0245900  1.0822001 -0.2295900  0.6038500  0.5541500
#>  [37] -0.9609900  0.4803300  0.0022260  0.5591300 -0.1636500 -0.8468100
#>  [43]  0.0740790 -0.6215700  0.0259670 -0.5162100 -0.0524620 -0.1417700
#>  [49] -0.0161230 -0.4971900 -0.5534500 -0.4037100  0.5095600  1.0276000
#>  [55] -0.0840000 -1.1179000  0.3225700  0.4928100  0.9487600  0.2040300
#>  [61]  0.5388300  0.8397200 -0.0688830  0.3136100  1.0450000 -0.2266900
#>  [67] -0.0896010 -0.6427100  0.6442900 -1.1001000 -0.0095814  0.2668200
#>  [73] -0.3230200 -0.6065200  0.0479150 -0.1663700  0.8571200  0.2335500
#>  [79]  0.2539500  1.2546000  0.5471600 -0.1979600 -0.7186300  0.2076000
#>  [85] -0.2587500 -0.3649900  0.0834360  0.6931700  0.1573700  1.0931000
#>  [91]  0.0912950 -1.3773000 -0.2717000  0.7070800  0.1872000 -0.3307200
#>  [97] -0.2835900  0.1029600  1.2228000  0.8374100
#> Token[2]: "is"
#>   [1] -0.5426400  0.4147600  1.0322000 -0.4024400  0.4669100  0.2181600
#>   [7] -0.0748640  0.4733200  0.0809960 -0.2207900 -0.1280800 -0.1144000
#>  [13]  0.5089100  0.1156800  0.0282110 -0.3628000  0.4382300  0.0475110
#>  [19]  0.2028200  0.4985700 -0.1006800  0.1326900  0.1697200  0.1165300
#>  [25]  0.3135500  0.2571300  0.0927830 -0.5682600 -0.5297500 -0.0514560
#>  [31] -0.6732600  0.9253300  0.2693000  0.2273400  0.6636500  0.2622100
#>  [37]  0.1971900  0.2609000  0.1877400 -0.3454000 -0.4263500  0.1397500
#>  [43]  0.5633800 -0.5690700  0.1239800 -0.1289400  0.7248400 -0.2610500
#>  [49] -0.2631400 -0.4360500  0.0789080 -0.8414600  0.5159500  1.3997000
#>  [55] -0.7646000 -3.1452999 -0.2920200 -0.3124700  1.5129000  0.5243500
#>  [61]  0.2145600  0.4245200 -0.0884110 -0.1780500  1.1876000  0.1057900
#>  [67]  0.7657100  0.2191400  0.3582400 -0.1163600  0.0932610 -0.6248300
#>  [73] -0.2189800  0.2179600  0.7405600 -0.4373500  0.1434300  0.1471900
#>  [79] -1.1605000 -0.0505080  0.1267700 -0.0143950 -0.9867600 -0.0912970
#>  [85] -1.2054000 -0.1197400  0.0478470 -0.5400100  0.5245700 -0.7096300
#>  [91] -0.3252800 -0.1346000 -0.4131400  0.3343500 -0.0072412  0.3225300
#>  [97] -0.0442190 -1.2969000  0.7621700  0.4634900
#> Token[3]: "green"
#>   [1] -0.67907000  0.34908000 -0.23984000 -0.99651998  0.73782003
#>   [6] -0.00065911  0.28009999  0.01728700 -0.36063001  0.03695500
#>  [11] -0.40395001  0.02409200  0.28957999  0.40496999  0.69992000
#>  [16]  0.25268999  0.80350000  0.04937000  0.15561999 -0.00632860
#>  [21] -0.29414001  0.14727999  0.18977000 -0.51791000  0.36985999
#>  [26]  0.74581999  0.08268900 -0.72601002 -0.40939000 -0.09782200
#>  [31] -0.14095999  0.71121001  0.61932999 -0.25014001  0.42250001
#>  [36]  0.48458001 -0.51915002  0.77125001  0.36684999  0.49652001
#>  [41] -0.04129800 -1.46829998  0.20038000  0.18591000  0.04986000
#>  [46] -0.17523000 -0.35528001  0.94152999 -0.11898000 -0.51902997
#>  [51] -0.01188700 -0.39186001 -0.17478999  0.93450999 -0.58930999
#>  [56] -2.77010012  0.34522000  0.86532998  1.08080006 -0.10291000
#>  [61] -0.09122000  0.55092001 -0.39473000  0.53675997  1.03830004
#>  [66] -0.40658000  0.24590001 -0.26797000 -0.26036000 -0.14150999
#>  [71] -0.12022000  0.16234000 -0.74320000 -0.64727998  0.04713300
#>  [76]  0.51642001  0.19898000  0.23919000  0.12549999  0.22471000
#>  [81]  0.82612997  0.07832800 -0.57020003  0.02393400 -0.15410000
#>  [86] -0.25738999  0.41262001 -0.46967000  0.87914002  0.72628999
#>  [91]  0.05386200 -1.15750003 -0.47835001  0.20139000 -1.00510001
#>  [96]  0.11515000 -0.96609002  0.12960000  0.18388000 -0.03038300
#> Token[4]: "."
#>   [1] -0.3397900  0.2094100  0.4634800 -0.6479200 -0.3837700  0.0380340
#>   [7]  0.1712700  0.1597800  0.4661900 -0.0191690  0.4147900 -0.3434900
#>  [13]  0.2687200  0.0446400  0.4213100 -0.4103200  0.1545900  0.0222390
#>  [19] -0.6465300  0.2525600  0.0431360 -0.1944500  0.4651600  0.4565100
#>  [25]  0.6858800  0.0912950  0.2187500 -0.7035100  0.1678500 -0.3507900
#>  [31] -0.1263400  0.6638400 -0.2582000  0.0365420 -0.1360500  0.4025300
#>  [37]  0.1428900  0.3813200 -0.1228300 -0.4588600 -0.2528200 -0.3043200
#>  [43] -0.1121500 -0.2618200 -0.2248200 -0.4455400  0.2991000 -0.8561200
#>  [49] -0.1450300 -0.4908600  0.0082973 -0.1749100  0.2752400  1.4401000
#>  [55] -0.2123900 -2.8434999 -0.2795800 -0.4572200  1.6386000  0.7880800
#>  [61] -0.5526200  0.6500000  0.0864260  0.3901200  1.0632000 -0.3537900
#>  [67]  0.4832800  0.3460000  0.8417400  0.0987070 -0.2421300 -0.2705300
#>  [73]  0.0452870 -0.4014700  0.1139500  0.0062226  0.0366730  0.0185180
#>  [79] -1.0213000 -0.2080600  0.6407200 -0.0687630 -0.5863500  0.3347600
#>  [85] -1.1432000 -0.1148000 -0.2509100 -0.4590700 -0.0968190 -0.1794600
#>  [91] -0.0633510 -0.6741200 -0.0688950  0.5360400 -0.8777300  0.3180200
#>  [97] -0.3924200 -0.2339400  0.4729800 -0.0288030
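
As noted earlier, embed() also accepts a list of sentences, which lets some embedding types batch for speed. A minimal sketch:

# embed several sentences in one call; batching can increase speed
s1 <- Sentence('The grass is green .')
s2 <- Sentence('The sky is blue .')
glove_embedding$embed(list(s1, s2))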

You choose which pre-trained embeddings to load by passing the appropriate id string to the constructor of the WordEmbeddings class. Typically, you use the two-letter language code to initialize an embedding, so ‘en’ for English, ‘de’ for German, and so on. By default, this initializes FastText embeddings trained over Wikipedia. To use FastText embeddings trained over web crawls instead, append ‘-crawl’: for example, ‘de-crawl’ loads embeddings trained over German web crawls.

For English, we provide a few more options, so here you can choose between instantiating ‘en-glove’, ‘en-extvec’ and so on.

Supported Models:

The following embeddings are currently supported:

ID Language Embedding
‘en-glove’ (or ‘glove’) English GloVe embeddings
‘en-extvec’ (or ‘extvec’) English Komninos embeddings
‘en-crawl’ (or ‘crawl’) English FastText embeddings over Web crawls
‘en-twitter’ (or ‘twitter’) English Twitter embeddings
‘en-turian’ (or ‘turian’) English Turian embeddings (small)
‘en’ (or ‘en-news’ or ‘news’) English FastText embeddings over news and wikipedia data
‘de’ German German FastText embeddings
‘nl’ Dutch Dutch FastText embeddings
‘fr’ French French FastText embeddings
‘it’ Italian Italian FastText embeddings
‘es’ Spanish Spanish FastText embeddings
‘pt’ Portuguese Portuguese FastText embeddings
‘ro’ Romanian Romanian FastText embeddings
‘ca’ Catalan Catalan FastText embeddings
‘sv’ Swedish Swedish FastText embeddings
‘da’ Danish Danish FastText embeddings
‘no’ Norwegian Norwegian FastText embeddings
‘fi’ Finnish Finnish FastText embeddings
‘pl’ Polish Polish FastText embeddings
‘cz’ Czech Czech FastText embeddings
‘sk’ Slovak Slovak FastText embeddings
‘sl’ Slovenian Slovenian FastText embeddings
‘sr’ Serbian Serbian FastText embeddings
‘hr’ Croatian Croatian FastText embeddings
‘bg’ Bulgarian Bulgarian FastText embeddings
‘ru’ Russian Russian FastText embeddings
‘ar’ Arabic Arabic FastText embeddings
‘he’ Hebrew Hebrew FastText embeddings
‘tr’ Turkish Turkish FastText embeddings
‘fa’ Persian Persian FastText embeddings
‘ja’ Japanese Japanese FastText embeddings
‘ko’ Korean Korean FastText embeddings
‘zh’ Chinese Chinese FastText embeddings
‘hi’ Hindi Hindi FastText embeddings
‘id’ Indonesian Indonesian FastText embeddings
‘eu’ Basque Basque FastText embeddings

So, if you want to load German FastText embeddings, instantiate as follows:
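
de_embedding <- WordEmbeddings('de')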

Alternatively, if you want to load German FastText embeddings trained over crawls, instantiate as follows:
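
de_embedding <- WordEmbeddings('de-crawl')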

 


Embedding Examples

Flair is a very popular natural language processing library, providing a variety of embedding methods for text representation. Flair Embeddings is a word embedding framework developed by Zalando. It focuses on word-level representation and can capture contextual information of words, allowing the same word to have different embeddings in different contexts. Unlike traditional word embeddings (such as Word2Vec or GloVe), Flair can dynamically generate word embeddings based on context and has achieved excellent results in various NLP tasks. Below are some key points about Flair Embeddings:

Context-Aware

Flair is a dynamic word embedding technique that can understand the meaning of words based on context. In contrast, static word embeddings, such as Word2Vec or GloVe, provide a fixed embedding for each word without considering its context in a sentence.

Therefore, context-sensitive embedding techniques, such as Flair, can capture the meaning of words in specific sentences more accurately, thus enhancing the performance of language models in various tasks.

Example:

Consider the following two English sentences:

  • “I am interested in the bank of the river.”
  • “I need to go to the bank to withdraw money.”

Here, the word “bank” has two different meanings. In the first sentence, it refers to the edge or shore of a river. In the second sentence, it refers to a financial institution.

For static embeddings, the word “bank” might have an embedding that lies somewhere between these two meanings because it doesn’t consider context. But for dynamic embeddings like Flair, “bank” in the first sentence will have an embedding related to rivers, and in the second sentence, it will have an embedding related to finance.

# Initialize Flair embeddings
FlairEmbeddings <- flair_embeddings()$FlairEmbeddings
Sentence <- flair_data()$Sentence
flair_embedding_forward <- FlairEmbeddings('news-forward')

# Define the two sentences
sentence1 <-  Sentence("I am interested in the bank of the river.")
sentence2 <-  Sentence("I need to go to the bank to withdraw money.")

# Get the embeddings

flair_embedding_forward$embed(sentence1)
#> [[1]]
#> Sentence[10]: "I am interested in the bank of the river."
flair_embedding_forward$embed(sentence2)
#> [[1]]
#> Sentence[11]: "I need to go to the bank to withdraw money."

# Extract the embedding for "bank" from the sentences
bank_embedding_sentence1 = sentence1[5]$embedding  # "bank" is the sixth word (index 5)
bank_embedding_sentence2 = sentence2[6]$embedding  # "bank" is the seventh word (index 6)

Same word, same surface form, but essentially different vector representations. In this way, you can see how the dynamic embeddings for “bank” in the two sentences differ based on context. Although we extracted the embeddings here, in reality they are high-dimensional vectors, so printing them would show a long list of numbers. For a more intuitive view of the difference, you can compute the cosine similarity or another metric between the two embeddings.

This is just a simple demonstration. In practice, you can also combine multiple embedding techniques, such as WordEmbeddings and FlairEmbeddings, to get richer word vectors.

library(lsa)
#> Loading required package: SnowballC
cosine(as.numeric( bank_embedding_sentence1$numpy()), 
       as.numeric( bank_embedding_sentence2$numpy()))
#>           [,1]
#> [1,] 0.7329552

Character-Based

Flair uses a character-level language model, meaning it can generate embeddings for rare words or even misspelled words. This is an important feature because it allows the model to understand and process words that have never appeared in the training data. Flair uses a bidirectional LSTM (Long Short-Term Memory) network that operates at a character level. This allows it to feed individual characters into the LSTM instead of words.
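
For instance (a sketch reusing the news-forward model loaded above), even a misspelled token receives a full embedding, because the model composes it from characters:

# the misspelled token "grene" still gets a complete 2048-dimensional vector
s <- Sentence("The grass is grene .")
flair_embedding_forward$embed(s)
length(s[3]$embedding$numpy())   # token index 3 is "grene"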

Multilingual Support

Flair provides various pre-trained character-level language models, supporting contextual word embeddings for multiple languages. It allows you to easily combine different word embeddings (e.g., Flair Embeddings, Word2Vec, GloVe, etc.) to create powerful stacked embeddings.

Classic Word Embeddings

In Flair, the simplest form of embeddings that still contains semantic information about the word are called classic word embeddings. These embeddings are pre-trained and non-contextual.

Let’s retrieve a few word embeddings and use FastText embeddings with the following code. To do so, we simply instantiate a WordEmbeddings class by passing in the ID of the embedding of our choice. Then, we simply wrap our text into a Sentence object, and call the embed(sentence) method on our WordEmbeddings class.

WordEmbeddings <- flair_embeddings()$WordEmbeddings
Sentence <- flair_data()$Sentence
embedding <- WordEmbeddings('crawl')
sentence <- Sentence("one two three one") 
embedding$embed(sentence) 
#> [[1]]
#> Sentence[4]: "one two three one"

for (i in seq_along(sentence$tokens)) {
  print(head(sentence$tokens[[i]]$embedding))
}
#> tensor([-0.0535, -0.0368, -0.2851, -0.0381, -0.0486,  0.2383])
#> tensor([ 0.0282, -0.0786, -0.1236,  0.1756, -0.1199,  0.0964])
#> tensor([-0.0920, -0.0690, -0.1475,  0.2313, -0.0872,  0.0799])
#> tensor([-0.0535, -0.0368, -0.2851, -0.0381, -0.0486,  0.2383])

Flair supports a range of classic word embeddings, each offering unique features and application scopes. Below is an overview, detailing the ID required to load each embedding and its corresponding language.

Embedding Type ID Language
GloVe glove English
Komninos extvec English
Twitter twitter English
Turian (small) turian English
FastText (crawl) crawl English
FastText (news & Wikipedia) ar Arabic
FastText (news & Wikipedia) bg Bulgarian
FastText (news & Wikipedia) ca Catalan
FastText (news & Wikipedia) cz Czech
FastText (news & Wikipedia) da Danish
FastText (news & Wikipedia) de German
FastText (news & Wikipedia) es Spanish
FastText (news & Wikipedia) en English
FastText (news & Wikipedia) eu Basque
FastText (news & Wikipedia) fa Persian
FastText (news & Wikipedia) fi Finnish
FastText (news & Wikipedia) fr French
FastText (news & Wikipedia) he Hebrew
FastText (news & Wikipedia) hi Hindi
FastText (news & Wikipedia) hr Croatian
FastText (news & Wikipedia) id Indonesian
FastText (news & Wikipedia) it Italian
FastText (news & Wikipedia) ja Japanese
FastText (news & Wikipedia) ko Korean
FastText (news & Wikipedia) nl Dutch
FastText (news & Wikipedia) no Norwegian
FastText (news & Wikipedia) pl Polish
FastText (news & Wikipedia) pt Portuguese
FastText (news & Wikipedia) ro Romanian
FastText (news & Wikipedia) ru Russian
FastText (news & Wikipedia) sl Slovenian
FastText (news & Wikipedia) sk Slovak
FastText (news & Wikipedia) sr Serbian
FastText (news & Wikipedia) sv Swedish
FastText (news & Wikipedia) tr Turkish
FastText (news & Wikipedia) zh Chinese

Contextual Embeddings

The idea behind contextual string embeddings is that each word embedding should be defined not only by its syntactic-semantic meaning but also by the context it appears in. What this means is that each word will have a different embedding for every context it appears in. Each pre-trained Flair model offers a forward version and a backward version. Let’s assume you are processing a language that, just like this text, uses a left-to-right script. The forward version takes into account the context that occurs before the word, on the left-hand side. The backward version works in the opposite direction: it takes into account the context after the word, on the right-hand side. If this is true, then two identical words that appear at the beginning of two different sentences should have identical forward embeddings, because their context is null. Let’s test this out:

Because we are using a forward model, it only takes into account the context that occurs before a word. Since our word has no context on the left-hand side of its position in either sentence, the two embeddings should be identical, and the check below confirms that they are.

FlairEmbeddings <- flair_embeddings()$FlairEmbeddings
embedding <- FlairEmbeddings('news-forward')
s1 <- Sentence("nice shirt") 
s2 <- Sentence("nice pants") 

embedding$embed(s1) 
#> [[1]]
#> Sentence[2]: "nice shirt"
embedding$embed(s2) 
#> [[1]]
#> Sentence[2]: "nice pants"
cat(" s1 sentence:", paste(s1[0], sep = ""), "\n", "s2 sentence:", paste(s2[0], sep = ""))
#>  s1 sentence: Token[0]: "nice" 
#>  s2 sentence: Token[0]: "nice"

We test whether the two 2048-dimensional embeddings of nice are identical element by element: we count the positions at which they agree and compare that count with the vector length. If it equals 2048, the two embeddings are exactly the same, which should theoretically be the case.

length(s1[0]$embedding$numpy()) == sum(s1[0]$embedding$numpy() ==  s2[0]$embedding$numpy())
#> [1] TRUE

Now we create two new sentence objects, placing different words, very and pretty, in front of nice.

s1 <- Sentence("very nice shirt") 
s2 <- Sentence("pretty nice pants") 

embedding$embed(s1) 
#> [[1]]
#> Sentence[3]: "very nice shirt"
embedding$embed(s2) 
#> [[1]]
#> Sentence[3]: "pretty nice pants"

This time the two embeddings of nice are not identical, because its left-hand context differs (very vs. pretty), so the check returns FALSE.

length(s1[0]$embedding$numpy()) == sum(s1[0]$embedding$numpy() ==  s2[0]$embedding$numpy())
#> [1] FALSE

The measure of similarity between two vectors in an inner product space is known as cosine similarity. The formula for calculating cosine similarity between two vectors, such as vectors A and B, is as follows:

\[ \text{Cosine Similarity} = \frac{\sum_{i} (A_i \cdot B_i)}{\sqrt{\sum_{i} (A_i^2)} \cdot \sqrt{\sum_{i} (B_i^2)}} \]

library(lsa)
vector1 <- as.numeric(s1[0]$embedding$numpy())
vector2 <- as.numeric(s2[0]$embedding$numpy())

We can observe that the similarity between the two embeddings of nice is about 0.56.

cosine_similarity <- cosine(vector1, vector2)
print(cosine_similarity)
#>           [,1]
#> [1,] 0.5571664

Extracting Embeddings from BERT

First, we use the TransformerWordEmbeddings class to download BERT; more transformer models can be found on Flair NLP’s Hugging Face page.

library(flaiR)
TransformerWordEmbeddings <- flair_embeddings()$TransformerWordEmbeddings("bert-base-uncased")
embedding <- TransformerWordEmbeddings$embed(sentence)

Next, we traverse each token in the sentence and print them.

# Iterate through each token in the sentence, printing them. 
# Utilize reticulate::py_str(token) to view each token, given that the sentence is a Python object.
for (i in seq_along(sentence$tokens)) {
  cat("Token: ", reticulate::py_str(sentence$tokens[[i]]), "\n")
  # Access the embedding of the token, converting it to an R object, 
  # and print the first 10 elements of the vector.
  token_embedding <- sentence$tokens[[i]]$embedding
  print(head(token_embedding, 10))
}
#> Token:  Token[0]: "one" 
#> tensor([-0.0535, -0.0368, -0.2851, -0.0381, -0.0486,  0.2383, -0.1200,  0.2620,
#>         -0.0575,  0.0228])
#> Token:  Token[1]: "two" 
#> tensor([ 0.0282, -0.0786, -0.1236,  0.1756, -0.1199,  0.0964, -0.1327,  0.4449,
#>         -0.0264, -0.1168])
#> Token:  Token[2]: "three" 
#> tensor([-0.0920, -0.0690, -0.1475,  0.2313, -0.0872,  0.0799, -0.0901,  0.4403,
#>         -0.0103, -0.1494])
#> Token:  Token[3]: "one" 
#> tensor([-0.0535, -0.0368, -0.2851, -0.0381, -0.0486,  0.2383, -0.1200,  0.2620,
#>         -0.0575,  0.0228])

Visualizing Embeddings

Word Embeddings (GloVe)

  • GloVe embeddings are PyTorch vectors of dimensionality 100.

  • For English, Flair provides a few more options. Here, you can use en-glove and en-extvec with the WordEmbeddings class.

# Initialize Text Processing Tools ---------------------------
# Import Sentence class for text operations
Sentence <- flair_data()$Sentence

# Configure GloVe Embeddings --------------------------------
# Load WordEmbeddings class and initialize GloVe model
WordEmbeddings <- flair_embeddings()$WordEmbeddings
embedding <- WordEmbeddings("glove") 

# Text Processing and Embedding -----------------------------
# Create sentence with semantic relationship pairs
sentence <- Sentence("King Queen man woman Paris London apple orange Taiwan Dublin Bamberg") 

# Apply GloVe embeddings to the sentence
embedding$embed(sentence)
#> [[1]]
#> Sentence[11]: "King Queen man woman Paris London apple orange Taiwan Dublin Bamberg"

# Extract embeddings into matrix format
sen_df <- process_embeddings(sentence, 
                          verbose = TRUE)
#> Extracting token embeddings...
#> Converting embeddings to matrix format...Processing completed in 0.004 seconds
#> Generated embedding matrix with 11 tokens and 100 dimensions

# Dimensionality Reduction ---------------------------------
# Set random seed for reproducibility
set.seed(123)

# Apply PCA to reduce dimensions to 3 components
pca_result <- prcomp(sen_df, center = TRUE, scale. = TRUE)

# Extract first three principal components
word_embeddings_matrix <- as.data.frame(pca_result$x[,1:3])
word_embeddings_matrix
#>                PC1       PC2         PC3
#> King    -2.9120910  1.285200 -1.95053854
#> Queen   -2.2413804  2.266714 -1.09020972
#> man     -5.6381902  2.984461  3.55462010
#> woman   -6.4891003  2.458607  3.56693660
#> Paris    3.0702212  5.039061 -2.65962020
#> London   5.3196216  4.368433 -2.60726627
#> apple    0.3362535 -8.679358 -0.44752722
#> orange  -0.0485467 -4.404101  0.77151480
#> Taiwan  -2.7993829 -4.149287 -6.33296039
#> Dublin   5.8994096  1.063291 -0.09271925
#> Bamberg  5.5031854 -2.233020  7.28777009

2D Plot

library(ggplot2)
glove_plot2D <- ggplot(word_embeddings_matrix, aes(x = PC1, y = PC2, color = PC3, 
                                             label = rownames(word_embeddings_matrix))) +
  geom_point(size = 3) + 
  geom_text(vjust = 1.5, hjust = 0.5) +  
  scale_color_gradient(low = "blue", high = "red") + 
  theme_minimal() +  
  labs(title = "", x = "PC1", y = "PC2", color = "PC3") 
  # guides(color = "none")  
glove_plot2D

3D Plot

plotly in R API: https://plotly.com/r/

library(plotly)
glove_plot3D <- plot_ly(data = word_embeddings_matrix, 
                  x = ~PC1, y = ~PC2, z = ~PC3, 
                  type = "scatter3d", mode = "markers",
                  marker = list(size = 5), 
                  text = rownames(word_embeddings_matrix), hoverinfo = 'text')

glove_plot3D

Stacked Embeddings Method (GloVe + Backward/Forward FlairEmbeddings or More)

# Initialize Embeddings -----------------------------
# Load embedding types from flaiR
WordEmbeddings <- flair_embeddings()$WordEmbeddings
FlairEmbeddings <- flair_embeddings()$FlairEmbeddings
StackedEmbeddings <- flair_embeddings()$StackedEmbeddings

# Configure Embeddings ----------------------------
# Initialize GloVe word embeddings
glove_embedding <- WordEmbeddings('glove')

# Initialize Flair contextual embeddings
flair_embedding_forward <- FlairEmbeddings('news-forward')
flair_embedding_backward <- FlairEmbeddings('news-backward')

# Initialize GloVe for individual use
embedding <- WordEmbeddings("glove") 

# Create stacked embeddings combining GloVe and bidirectional Flair
stacked_embeddings <- StackedEmbeddings(c(glove_embedding,
                                         flair_embedding_forward,
                                         flair_embedding_backward))

# Text Processing --------------------------------
# Load Sentence class from flaiR
Sentence <- flair_data()$Sentence

# Create test sentence with semantic relationships
sentence <- Sentence("King Queen man woman Paris London apple orange Taiwan Dublin Bamberg") 

# Apply embeddings and extract features ----------
# Embed text using stacked embeddings
stacked_embeddings$embed(sentence)

# Extract embeddings matrix with processing details
sen_df <- process_embeddings(sentence, 
                           verbose = TRUE)
#> Extracting token embeddings...
#> Converting embeddings to matrix format...Processing completed in 0.005 seconds
#> Generated embedding matrix with 11 tokens and 4196 dimensions

# Dimensionality Reduction -----------------------
set.seed(123)

# Perform PCA for visualization
pca_result <- prcomp(sen_df, center = TRUE, scale. = TRUE)

# Extract first three principal components
word_embeddings_matrix <- as.data.frame(pca_result$x[,1:3])
word_embeddings_matrix
#>                PC1         PC2         PC3
#> King     -8.607464  67.2291073  32.4862897
#> Queen     1.757712  12.0477242 -26.8302591
#> man      70.603192  -6.6184846  13.1651727
#> woman    22.532037  -8.1126281  -0.9074069
#> Paris   -11.395620  -0.3051653 -17.5197047
#> London   -8.709174  -2.7450590 -14.1780509
#> apple    -8.739480 -15.7725165  -6.3796332
#> orange  -25.178335 -38.8501335  51.4907560
#> Taiwan   -9.132397  -5.0252071 -11.0918852
#> Dublin  -10.925014  -3.3407305 -10.1367702
#> Bamberg -12.205457   1.4930931 -10.0985081
# 2D Plot
library(ggplot2)

stacked_plot2D <- ggplot(word_embeddings_matrix, aes(x = PC1, y = PC2, color = PC3, 
                                             label = rownames(word_embeddings_matrix))) +
  geom_point(size = 2) + 
  geom_text(vjust = 1.5, hjust = 0.5) +  
  scale_color_gradient(low = "blue", high = "red") + 
  theme_minimal() +  
  labs(title = "", x = "PC1", y = "PC2", color = "PC3") 

stacked_plot2D

Transformer Embeddings (BERT or More)

# Load Required Package ----------------------------
library(flaiR)

# Initialize BERT and Text Processing --------------
# Import Sentence class for text operations
Sentence <- flair_data()$Sentence

# Initialize BERT model (base uncased version)
TransformerWordEmbeddings <- flair_embeddings()$TransformerWordEmbeddings("bert-base-uncased")

# Text Processing and Embedding --------------------
# Create sentence with semantic relationship pairs
sentence <- Sentence("King Queen man woman Paris London apple orange Taiwan Dublin Bamberg") 

# Apply BERT embeddings to the sentence
TransformerWordEmbeddings$embed(sentence)
#> [[1]]
#> Sentence[11]: "King Queen man woman Paris London apple orange Taiwan Dublin Bamberg"

# Extract embeddings into matrix format
sen_df <- process_embeddings(sentence, verbose = TRUE)
#> Extracting token embeddings...
#> Converting embeddings to matrix format...Processing completed in 0.004 seconds
#> Generated embedding matrix with 11 tokens and 768 dimensions

# Dimensionality Reduction ------------------------
# Set random seed for reproducibility
set.seed(123)

# Apply PCA to reduce dimensions to 3 components
pca_result <- prcomp(sen_df, center = TRUE, scale. = TRUE)

# Extract first three principal components
word_embeddings_matrix <- as.data.frame(pca_result$x[,1:3])
word_embeddings_matrix
#>                 PC1        PC2        PC3
#> King    -11.6842542   4.264797  -2.539303
#> Queen   -18.6452604  13.278048  -9.781884
#> man     -18.3566181  10.571207  -1.609842
#> woman   -10.6397232  -2.072423   2.264132
#> Paris     6.9079152  -9.409334 -10.378022
#> London   10.7817678  -8.882271 -11.447016
#> apple    -0.6539754  -8.305123  12.104275
#> orange   -3.4409765  -4.756996  19.216482
#> Taiwan    4.5377967 -12.949604   4.782874
#> Dublin   13.4845075 -11.327307  -7.191567
#> Bamberg  27.7088205  29.589006   4.579870
# 2D Plot
library(ggplot2)

bert_plot2D <- ggplot(word_embeddings_matrix, aes(x = PC1, y = PC2, color = PC3, 
                                             label = rownames(word_embeddings_matrix))) +
  geom_point(size = 2) + 
  geom_text(vjust = 1.5, hjust = 0.5) +  
  scale_color_gradient(low = "blue", high = "red") + 
  theme_minimal() +  
  labs(title = "", x = "PC1", y = "PC2", color = "PC3") 
  # guides(color = "none") 

bert_plot2D
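
You can also compare tokens directly in the original 768-dimensional BERT space rather than in the PCA projection. Below is a minimal sketch with a hand-rolled helper (cos_sim is our own function, not part of flaiR):

# cosine similarity between two token vectors in the embedding matrix
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cos_sim(as.numeric(sen_df["King", ]), as.numeric(sen_df["Queen", ]))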

Embedding Models Comparison

library(ggpubr)

figure <- ggarrange(glove_plot2D, stacked_plot2D, bert_plot2D,
                   labels = c("Glove", "Stacked Embedding", "BERT"),
                   ncol = 3, nrow = 1,
                   common.legend = TRUE,
                   legend = "bottom",
                   font.label = list(size = 8))

figure

 


Training a Binary Classifier

In this section, we’ll train a sentiment analysis model that can categorize text as either positive or negative. This case study is adapted from pages 116 to 130 of Tadej Magajna’s book, ‘Natural Language Processing with Flair’. The process for training text classifiers in Flair mirrors the process followed for sequence labeling models. Specifically, the steps to train text classifiers are:

  • Load a tagged corpus and compute the label dictionary map.
  • Prepare the document embeddings.
  • Initialize the TextClassifier class.
  • Train the model.

Loading a Tagged Corpus

Training text classification models requires a set of text documents (typically, sentences or paragraphs) where each document is associated with one or more classification labels. To train our sentiment analysis text classification model, we will use the famous Internet Movie Database (IMDb) dataset, which contains 50,000 movie reviews, each labeled as either positive or negative. References to this dataset are already baked into Flair, so loading the dataset couldn’t be easier:

library(flaiR)
# load IMDB from flair_datasets module
Corpus <- flair_data()$Corpus
IMDB <- flair_datasets()$IMDB
# initialize the corpus (we will downsample it to 5% below)
corpus = IMDB()
#> 2025-01-19 17:57:08,728 Reading data from /Users/yenchiehliao/.flair/datasets/imdb_v4-rebalanced
#> 2025-01-19 17:57:08,728 Train: /Users/yenchiehliao/.flair/datasets/imdb_v4-rebalanced/train.txt
#> 2025-01-19 17:57:08,728 Dev: None
#> 2025-01-19 17:57:08,728 Test: None
#> 2025-01-19 17:57:09,355 No test split found. Using 10% (i.e. 5000 samples) of the train split as test data
#> 2025-01-19 17:57:09,370 No dev split found. Using 10% (i.e. 4500 samples) of the train split as dev data
#> 2025-01-19 17:57:09,370 Initialized corpus /Users/yenchiehliao/.flair/datasets/imdb_v4-rebalanced (label type name is 'sentiment')
corpus$downsample(0.05)
#> <flair.datasets.document_classification.IMDB object at 0x5322085e0>

Print the sizes of the corpus object as follows: test: %d | train: %d | dev: %d.

test_size <- length(corpus$test)
train_size <- length(corpus$train)
dev_size <- length(corpus$dev)
output <- sprintf("Corpus object sizes - Test: %d | Train: %d | Dev: %d", test_size, train_size, dev_size)
print(output)
#> [1] "Corpus object sizes - Test: 250 | Train: 2025 | Dev: 225"
lbl_type = 'sentiment'
label_dict = corpus$make_label_dictionary(label_type=lbl_type)
#> 2025-01-19 17:57:09,471 Computing label dictionary. Progress:
#> 2025-01-19 17:57:13,825 Dictionary created for label 'sentiment' with 2 values: POSITIVE (seen 1014 times), NEGATIVE (seen 1011 times)

Loading the Embeddings

flaiR covers all the different types of document embeddings that we can use. Here, we simply use DocumentPoolEmbeddings, which require no training prior to training the classification model itself:

DocumentPoolEmbeddings <- flair_embeddings()$DocumentPoolEmbeddings
WordEmbeddings <- flair_embeddings()$WordEmbeddings
glove = WordEmbeddings('glove')
document_embeddings = DocumentPoolEmbeddings(glove)
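
To get a feel for what DocumentPoolEmbeddings produces, here is a minimal sketch (the example sentence is ours): mean-pooling the GloVe vectors of a document yields a single vector with the same dimensionality as GloVe (100).

Sentence <- flair_data()$Sentence
s <- Sentence("A wonderful film with a terrific cast.")
document_embeddings$embed(s)
# the pooled document vector is 100-dimensional
length(s$embedding$tolist())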

Initializing the TextClassifier

# initiate TextClassifier
TextClassifier <- flair_models()$TextClassifier
classifier <- TextClassifier(document_embeddings,
                             label_dictionary = label_dict,
                             label_type = lbl_type)

The $to method allows you to set the device to CPU, GPU, or a specific MPS device on Mac (such as mps:0, mps:1, mps:2).

classifier$to(flair_device("mps")) 
TextClassifier(
  (embeddings): DocumentPoolEmbeddings(
    fine_tune_mode=none, pooling=mean
    (embeddings): StackedEmbeddings(
      (list_embedding_0): WordEmbeddings(
        'glove'
        (embedding): Embedding(400001, 100)
      )
    )
  )
  (decoder): Linear(in_features=100, out_features=2, bias=True)
  (dropout): Dropout(p=0.0, inplace=False)
  (locked_dropout): LockedDropout(p=0.0)
  (word_dropout): WordDropout(p=0.0)
  (loss_function): CrossEntropyLoss()
)

Training the Model

Training the text classifier model involves two simple steps:

  • Defining the model trainer class by passing in the classifier model and the corpus.
  • Setting off the training process by passing in the required training hyperparameters.

It is worth noting that the ‘L’ in numbers like 32L and 5L is used in R to denote that the number is an integer. Without the ‘L’ suffix, numbers in R are treated as numeric, which are by default double-precision floating-point numbers. In contrast, Python determines the type based on the value of the number itself. Whole numbers (e.g., 5 or 32) are of type int, while numbers with decimal points (e.g., 5.0) are of type float. Floating-point numbers in both languages are representations of real numbers but can have some approximation due to the way they are stored in memory.
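A quick illustration in plain R:

# 'L' creates an integer; a bare number is a double
is.integer(32L)  # TRUE
is.integer(32)   # FALSE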

# initiate ModelTrainer
ModelTrainer <- flair_trainers()$ModelTrainer

# fit the model
trainer <- ModelTrainer(classifier, corpus)

# start to train
# note: the 'L' in 32L is used in R to denote that the number is an integer.
trainer$train('classifier',
              learning_rate=0.1,
              mini_batch_size=32L,
              # specifies how embeddings are stored in RAM, ie."cpu", "cuda", "gpu", "mps".
              # embeddings_storage_mode = "mps",
              max_epochs=10L)
#> 2025-01-19 17:57:15,786 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:57:15,786 Model: "TextClassifier(
#>   (embeddings): DocumentPoolEmbeddings(
#>     fine_tune_mode=none, pooling=mean
#>     (embeddings): StackedEmbeddings(
#>       (list_embedding_0): WordEmbeddings(
#>         'glove'
#>         (embedding): Embedding(400001, 100)
#>       )
#>     )
#>   )
#>   (decoder): Linear(in_features=100, out_features=2, bias=True)
#>   (dropout): Dropout(p=0.0, inplace=False)
#>   (locked_dropout): LockedDropout(p=0.0)
#>   (word_dropout): WordDropout(p=0.0)
#>   (loss_function): CrossEntropyLoss()
#>   (weights): None
#>   (weight_tensor) None
#> )"
#> 2025-01-19 17:57:15,786 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:57:15,786 Corpus: 2025 train + 225 dev + 250 test sentences
#> 2025-01-19 17:57:15,786 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:57:15,786 Train:  2025 sentences
#> 2025-01-19 17:57:15,786         (train_with_dev=False, train_with_test=False)
#> 2025-01-19 17:57:15,786 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:57:15,786 Training Params:
#> 2025-01-19 17:57:15,787  - learning_rate: "0.1" 
#> 2025-01-19 17:57:15,787  - mini_batch_size: "32"
#> 2025-01-19 17:57:15,787  - max_epochs: "10"
#> 2025-01-19 17:57:15,787  - shuffle: "True"
#> 2025-01-19 17:57:15,787 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:57:15,787 Plugins:
#> 2025-01-19 17:57:15,787  - AnnealOnPlateau | patience: '3', anneal_factor: '0.5', min_learning_rate: '0.0001'
#> 2025-01-19 17:57:15,787 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:57:15,787 Final evaluation on model from best epoch (best-model.pt)
#> 2025-01-19 17:57:15,787  - metric: "('micro avg', 'f1-score')"
#> 2025-01-19 17:57:15,787 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:57:15,787 Computation:
#> 2025-01-19 17:57:15,787  - compute on device: cpu
#> 2025-01-19 17:57:15,787  - embedding storage: cpu
#> 2025-01-19 17:57:15,787 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:57:15,787 Model training base path: "classifier"
#> 2025-01-19 17:57:15,787 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:57:15,787 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:57:16,611 epoch 1 - iter 6/64 - loss 1.03482102 - time (sec): 0.82 - samples/sec: 233.27 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:17,459 epoch 1 - iter 12/64 - loss 0.97625112 - time (sec): 1.67 - samples/sec: 229.70 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:18,597 epoch 1 - iter 18/64 - loss 0.97130605 - time (sec): 2.81 - samples/sec: 205.04 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:19,494 epoch 1 - iter 24/64 - loss 0.96554917 - time (sec): 3.71 - samples/sec: 207.22 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:20,360 epoch 1 - iter 30/64 - loss 0.95172536 - time (sec): 4.57 - samples/sec: 209.97 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:21,288 epoch 1 - iter 36/64 - loss 0.93596985 - time (sec): 5.50 - samples/sec: 209.42 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:22,160 epoch 1 - iter 42/64 - loss 0.93621534 - time (sec): 6.37 - samples/sec: 210.90 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:23,028 epoch 1 - iter 48/64 - loss 0.93329721 - time (sec): 7.24 - samples/sec: 212.13 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:23,914 epoch 1 - iter 54/64 - loss 0.92732978 - time (sec): 8.13 - samples/sec: 212.64 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:25,014 epoch 1 - iter 60/64 - loss 0.91905521 - time (sec): 9.23 - samples/sec: 208.09 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:25,350 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:57:25,351 EPOCH 1 done: loss 0.9183 - lr: 0.100000
#> 2025-01-19 17:57:26,540 DEV : loss 0.969170331954956 - f1-score (micro avg)  0.4533
#> 2025-01-19 17:57:27,284  - 0 epochs without improvement
#> 2025-01-19 17:57:27,286 saving best model
#> 2025-01-19 17:57:27,742 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:57:28,633 epoch 2 - iter 6/64 - loss 0.89565008 - time (sec): 0.89 - samples/sec: 215.76 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:29,471 epoch 2 - iter 12/64 - loss 0.91454447 - time (sec): 1.73 - samples/sec: 222.17 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:30,413 epoch 2 - iter 18/64 - loss 0.87032511 - time (sec): 2.67 - samples/sec: 215.69 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:31,788 epoch 2 - iter 24/64 - loss 0.89476166 - time (sec): 4.04 - samples/sec: 189.88 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:32,833 epoch 2 - iter 30/64 - loss 0.89790537 - time (sec): 5.09 - samples/sec: 188.60 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:33,749 epoch 2 - iter 36/64 - loss 0.88154723 - time (sec): 6.01 - samples/sec: 191.81 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:34,676 epoch 2 - iter 42/64 - loss 0.89160718 - time (sec): 6.93 - samples/sec: 193.85 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:35,640 epoch 2 - iter 48/64 - loss 0.88830253 - time (sec): 7.90 - samples/sec: 194.50 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:36,507 epoch 2 - iter 54/64 - loss 0.88188778 - time (sec): 8.76 - samples/sec: 197.16 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:37,352 epoch 2 - iter 60/64 - loss 0.87024007 - time (sec): 9.61 - samples/sec: 199.81 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:37,949 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:57:37,949 EPOCH 2 done: loss 0.8670 - lr: 0.100000
#> 2025-01-19 17:57:39,164 DEV : loss 0.6613298654556274 - f1-score (micro avg)  0.56
#> 2025-01-19 17:57:40,142  - 0 epochs without improvement
#> 2025-01-19 17:57:40,143 saving best model
#> 2025-01-19 17:57:40,564 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:57:41,486 epoch 3 - iter 6/64 - loss 0.97559432 - time (sec): 0.92 - samples/sec: 208.37 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:42,411 epoch 3 - iter 12/64 - loss 0.96988676 - time (sec): 1.85 - samples/sec: 208.00 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:43,299 epoch 3 - iter 18/64 - loss 0.95166033 - time (sec): 2.73 - samples/sec: 210.67 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:44,141 epoch 3 - iter 24/64 - loss 0.94594647 - time (sec): 3.58 - samples/sec: 214.71 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:45,034 epoch 3 - iter 30/64 - loss 0.92877954 - time (sec): 4.47 - samples/sec: 214.81 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:45,966 epoch 3 - iter 36/64 - loss 0.90791758 - time (sec): 5.40 - samples/sec: 213.29 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:46,949 epoch 3 - iter 42/64 - loss 0.90284721 - time (sec): 6.38 - samples/sec: 210.50 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:47,878 epoch 3 - iter 48/64 - loss 0.89997107 - time (sec): 7.31 - samples/sec: 210.03 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:48,761 epoch 3 - iter 54/64 - loss 0.89072626 - time (sec): 8.20 - samples/sec: 210.82 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:49,898 epoch 3 - iter 60/64 - loss 0.89468155 - time (sec): 9.33 - samples/sec: 205.71 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:50,259 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:57:50,260 EPOCH 3 done: loss 0.8853 - lr: 0.100000
#> 2025-01-19 17:57:51,672 DEV : loss 0.9235679507255554 - f1-score (micro avg)  0.4533
#> 2025-01-19 17:57:52,159  - 1 epochs without improvement
#> 2025-01-19 17:57:52,161 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:57:53,106 epoch 4 - iter 6/64 - loss 0.82624254 - time (sec): 0.94 - samples/sec: 203.19 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:53,959 epoch 4 - iter 12/64 - loss 0.90411119 - time (sec): 1.80 - samples/sec: 213.56 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:55,042 epoch 4 - iter 18/64 - loss 0.85366339 - time (sec): 2.88 - samples/sec: 199.89 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:55,900 epoch 4 - iter 24/64 - loss 0.83558531 - time (sec): 3.74 - samples/sec: 205.38 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:56,780 epoch 4 - iter 30/64 - loss 0.85402448 - time (sec): 4.62 - samples/sec: 207.82 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:57,610 epoch 4 - iter 36/64 - loss 0.85074003 - time (sec): 5.45 - samples/sec: 211.42 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:58,572 epoch 4 - iter 42/64 - loss 0.83589199 - time (sec): 6.41 - samples/sec: 209.65 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:57:59,508 epoch 4 - iter 48/64 - loss 0.84322071 - time (sec): 7.35 - samples/sec: 209.06 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:00,355 epoch 4 - iter 54/64 - loss 0.83743314 - time (sec): 8.19 - samples/sec: 210.89 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:01,270 epoch 4 - iter 60/64 - loss 0.83712479 - time (sec): 9.11 - samples/sec: 210.77 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:01,853 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:58:01,854 EPOCH 4 done: loss 0.8385 - lr: 0.100000
#> 2025-01-19 17:58:03,036 DEV : loss 0.902965784072876 - f1-score (micro avg)  0.4533
#> 2025-01-19 17:58:03,839  - 2 epochs without improvement
#> 2025-01-19 17:58:03,841 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:58:04,777 epoch 5 - iter 6/64 - loss 0.83386131 - time (sec): 0.94 - samples/sec: 205.26 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:05,637 epoch 5 - iter 12/64 - loss 0.81110337 - time (sec): 1.80 - samples/sec: 213.86 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:06,518 epoch 5 - iter 18/64 - loss 0.81918713 - time (sec): 2.68 - samples/sec: 215.20 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:07,421 epoch 5 - iter 24/64 - loss 0.82163843 - time (sec): 3.58 - samples/sec: 214.54 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:08,367 epoch 5 - iter 30/64 - loss 0.80355368 - time (sec): 4.53 - samples/sec: 212.15 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:09,668 epoch 5 - iter 36/64 - loss 0.82123225 - time (sec): 5.83 - samples/sec: 197.71 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:10,579 epoch 5 - iter 42/64 - loss 0.83147700 - time (sec): 6.74 - samples/sec: 199.47 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:11,466 epoch 5 - iter 48/64 - loss 0.82776047 - time (sec): 7.62 - samples/sec: 201.46 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:12,335 epoch 5 - iter 54/64 - loss 0.82137293 - time (sec): 8.49 - samples/sec: 203.45 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:13,244 epoch 5 - iter 60/64 - loss 0.82507927 - time (sec): 9.40 - samples/sec: 204.20 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:13,859 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:58:13,860 EPOCH 5 done: loss 0.8266 - lr: 0.100000
#> 2025-01-19 17:58:15,043 DEV : loss 0.8814549446105957 - f1-score (micro avg)  0.4533
#> 2025-01-19 17:58:15,775  - 3 epochs without improvement
#> 2025-01-19 17:58:15,778 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:58:16,730 epoch 6 - iter 6/64 - loss 0.86599507 - time (sec): 0.95 - samples/sec: 201.65 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:17,617 epoch 6 - iter 12/64 - loss 0.83119471 - time (sec): 1.84 - samples/sec: 208.79 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:18,510 epoch 6 - iter 18/64 - loss 0.83965523 - time (sec): 2.73 - samples/sec: 210.80 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:19,388 epoch 6 - iter 24/64 - loss 0.82790930 - time (sec): 3.61 - samples/sec: 212.72 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:20,239 epoch 6 - iter 30/64 - loss 0.82493706 - time (sec): 4.46 - samples/sec: 215.17 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:21,176 epoch 6 - iter 36/64 - loss 0.82221150 - time (sec): 5.40 - samples/sec: 213.40 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:22,076 epoch 6 - iter 42/64 - loss 0.82115991 - time (sec): 6.30 - samples/sec: 213.41 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:22,959 epoch 6 - iter 48/64 - loss 0.82263479 - time (sec): 7.18 - samples/sec: 213.89 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:23,894 epoch 6 - iter 54/64 - loss 0.81930931 - time (sec): 8.12 - samples/sec: 212.90 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:24,738 epoch 6 - iter 60/64 - loss 0.81072097 - time (sec): 8.96 - samples/sec: 214.28 - lr: 0.100000 - momentum: 0.000000
#> 2025-01-19 17:58:25,342 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:58:25,342 EPOCH 6 done: loss 0.8158 - lr: 0.100000
#> 2025-01-19 17:58:26,564 DEV : loss 0.7792848944664001 - f1-score (micro avg)  0.5556
#> 2025-01-19 17:58:27,369  - 4 epochs without improvement (above 'patience')-> annealing learning_rate to [0.05]
#> 2025-01-19 17:58:27,372 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:58:28,302 epoch 7 - iter 6/64 - loss 0.62141190 - time (sec): 0.93 - samples/sec: 206.66 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:29,161 epoch 7 - iter 12/64 - loss 0.60887330 - time (sec): 1.79 - samples/sec: 214.74 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:30,066 epoch 7 - iter 18/64 - loss 0.60901921 - time (sec): 2.69 - samples/sec: 213.85 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:30,950 epoch 7 - iter 24/64 - loss 0.62312240 - time (sec): 3.58 - samples/sec: 214.66 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:32,162 epoch 7 - iter 30/64 - loss 0.64444851 - time (sec): 4.79 - samples/sec: 200.43 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:33,033 epoch 7 - iter 36/64 - loss 0.64126529 - time (sec): 5.66 - samples/sec: 203.52 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:33,938 epoch 7 - iter 42/64 - loss 0.65158008 - time (sec): 6.57 - samples/sec: 204.71 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:34,892 epoch 7 - iter 48/64 - loss 0.65823784 - time (sec): 7.52 - samples/sec: 204.26 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:35,851 epoch 7 - iter 54/64 - loss 0.65939955 - time (sec): 8.48 - samples/sec: 203.80 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:36,873 epoch 7 - iter 60/64 - loss 0.66072118 - time (sec): 9.50 - samples/sec: 202.11 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:37,460 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:58:37,460 EPOCH 7 done: loss 0.6589 - lr: 0.050000
#> 2025-01-19 17:58:38,670 DEV : loss 0.6003721356391907 - f1-score (micro avg)  0.6978
#> 2025-01-19 17:58:39,431  - 0 epochs without improvement
#> 2025-01-19 17:58:39,433 saving best model
#> 2025-01-19 17:58:39,830 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:58:40,762 epoch 8 - iter 6/64 - loss 0.65252212 - time (sec): 0.93 - samples/sec: 206.13 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:41,702 epoch 8 - iter 12/64 - loss 0.63390189 - time (sec): 1.87 - samples/sec: 205.19 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:42,589 epoch 8 - iter 18/64 - loss 0.63412659 - time (sec): 2.76 - samples/sec: 208.80 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:43,478 epoch 8 - iter 24/64 - loss 0.63389341 - time (sec): 3.65 - samples/sec: 210.60 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:44,393 epoch 8 - iter 30/64 - loss 0.63322895 - time (sec): 4.56 - samples/sec: 210.44 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:45,532 epoch 8 - iter 36/64 - loss 0.63415459 - time (sec): 5.70 - samples/sec: 202.07 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:46,428 epoch 8 - iter 42/64 - loss 0.63968469 - time (sec): 6.60 - samples/sec: 203.72 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:47,332 epoch 8 - iter 48/64 - loss 0.65097602 - time (sec): 7.50 - samples/sec: 204.76 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:48,184 epoch 8 - iter 54/64 - loss 0.65496666 - time (sec): 8.35 - samples/sec: 206.85 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:49,079 epoch 8 - iter 60/64 - loss 0.64874915 - time (sec): 9.25 - samples/sec: 207.60 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:49,687 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:58:49,687 EPOCH 8 done: loss 0.6494 - lr: 0.050000
#> 2025-01-19 17:58:50,971 DEV : loss 0.6876248717308044 - f1-score (micro avg)  0.5467
#> 2025-01-19 17:58:51,758  - 1 epochs without improvement
#> 2025-01-19 17:58:51,759 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:58:52,705 epoch 9 - iter 6/64 - loss 0.64467014 - time (sec): 0.95 - samples/sec: 203.00 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:53,618 epoch 9 - iter 12/64 - loss 0.64915569 - time (sec): 1.86 - samples/sec: 206.68 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:54,532 epoch 9 - iter 18/64 - loss 0.63960192 - time (sec): 2.77 - samples/sec: 207.79 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:55,468 epoch 9 - iter 24/64 - loss 0.62827631 - time (sec): 3.71 - samples/sec: 207.09 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:56,406 epoch 9 - iter 30/64 - loss 0.63651646 - time (sec): 4.65 - samples/sec: 206.62 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:57,212 epoch 9 - iter 36/64 - loss 0.63269677 - time (sec): 5.45 - samples/sec: 211.28 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:58,079 epoch 9 - iter 42/64 - loss 0.64019648 - time (sec): 6.32 - samples/sec: 212.68 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:59,018 epoch 9 - iter 48/64 - loss 0.64170534 - time (sec): 7.26 - samples/sec: 211.61 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:58:59,958 epoch 9 - iter 54/64 - loss 0.63846545 - time (sec): 8.20 - samples/sec: 210.78 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:59:01,073 epoch 9 - iter 60/64 - loss 0.63938779 - time (sec): 9.31 - samples/sec: 206.16 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:59:01,438 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:59:01,438 EPOCH 9 done: loss 0.6397 - lr: 0.050000
#> 2025-01-19 17:59:02,642 DEV : loss 0.5814307332038879 - f1-score (micro avg)  0.7156
#> 2025-01-19 17:59:03,402  - 0 epochs without improvement
#> 2025-01-19 17:59:03,404 saving best model
#> 2025-01-19 17:59:03,809 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:59:04,641 epoch 10 - iter 6/64 - loss 0.60317001 - time (sec): 0.83 - samples/sec: 231.02 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:59:05,539 epoch 10 - iter 12/64 - loss 0.61682722 - time (sec): 1.73 - samples/sec: 222.06 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:59:06,427 epoch 10 - iter 18/64 - loss 0.61251386 - time (sec): 2.62 - samples/sec: 220.11 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:59:07,569 epoch 10 - iter 24/64 - loss 0.61089186 - time (sec): 3.76 - samples/sec: 204.32 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:59:08,433 epoch 10 - iter 30/64 - loss 0.62348493 - time (sec): 4.62 - samples/sec: 207.65 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:59:09,306 epoch 10 - iter 36/64 - loss 0.61944086 - time (sec): 5.50 - samples/sec: 209.61 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:59:10,243 epoch 10 - iter 42/64 - loss 0.62064590 - time (sec): 6.43 - samples/sec: 208.90 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:59:11,219 epoch 10 - iter 48/64 - loss 0.62229136 - time (sec): 7.41 - samples/sec: 207.32 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:59:12,235 epoch 10 - iter 54/64 - loss 0.62610477 - time (sec): 8.42 - samples/sec: 205.11 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:59:13,184 epoch 10 - iter 60/64 - loss 0.62748551 - time (sec): 9.37 - samples/sec: 204.82 - lr: 0.050000 - momentum: 0.000000
#> 2025-01-19 17:59:13,757 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:59:13,757 EPOCH 10 done: loss 0.6305 - lr: 0.050000
#> 2025-01-19 17:59:15,000 DEV : loss 0.5568863153457642 - f1-score (micro avg)  0.7244
#> 2025-01-19 17:59:15,874  - 0 epochs without improvement
#> 2025-01-19 17:59:15,876 saving best model
#> 2025-01-19 17:59:16,683 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:59:16,684 Loading model from best epoch ...
#> 2025-01-19 17:59:18,120 
#> Results:
#> - F-score (micro) 0.712
#> - F-score (macro) 0.7078
#> - Accuracy 0.712
#> 
#> By class:
#>               precision    recall  f1-score   support
#> 
#>     NEGATIVE     0.6541    0.8595    0.7429       121
#>     POSITIVE     0.8132    0.5736    0.6727       129
#> 
#>     accuracy                         0.7120       250
#>    macro avg     0.7336    0.7166    0.7078       250
#> weighted avg     0.7362    0.7120    0.7067       250
#> 
#> 2025-01-19 17:59:18,120 ----------------------------------------------------------------------------------------------------
#> $test_score
#> [1] 0.712

Loading and Using the Classifiers

After training the text classification model, the resulting classifier will already be stored in memory as part of the classifier variable. It is possible, however, that your R session (and the Python session flaiR embeds) exited after training. If so, you’ll need to load the model back into memory with the following:

TextClassifier <- flair_models()$TextClassifier
classifier <- TextClassifier$load('classifier/best-model.pt')

We import the Sentence class so that we can generate predictions on some example text inputs.

Sentence <- flair_data()$Sentence
sentence <- Sentence("great")
classifier$predict(sentence)
print(sentence$labels)
#> [[1]]
#> 'Sentence[1]: "great"'/'POSITIVE' (1.0)
sentence <- Sentence("sad")
classifier$predict(sentence)
print(sentence$labels)
#> [[1]]
#> 'Sentence[1]: "sad"'/'NEGATIVE' (0.6663)

 


Training RNNs

Here, we train a sentiment analysis model to categorize text. In this case, we also include a pipeline that uses Recurrent Neural Networks (RNNs), which are particularly effective for tasks involving sequential data. This section also shows you how to implement one of the most powerful features in flaiR: stacked embeddings. You can stack multiple embeddings from different layers and let the classifier learn from the different types of features. In Flair NLP, and with the flaiR package, this is very easy to accomplish.

Import Necessary Modules

library(flaiR)
WordEmbeddings <- flair_embeddings()$WordEmbeddings
FlairEmbeddings <- flair_embeddings()$FlairEmbeddings
DocumentRNNEmbeddings <- flair_embeddings()$DocumentRNNEmbeddings
TextClassifier <- flair_models()$TextClassifier
ModelTrainer <- flair_trainers()$ModelTrainer

Get the IMDB Corpus

The IMDB movie review dataset, a commonly used benchmark for sentiment analysis, is used here. The $downsample(0.1) method keeps only 10% of the dataset, allowing for a faster demonstration.

# load the IMDB file and downsize it to 0.1
IMDB <- flair_datasets()$IMDB
corpus <- IMDB()$downsample(0.1) 
#> 2025-01-19 17:59:18,604 Reading data from /Users/yenchiehliao/.flair/datasets/imdb_v4-rebalanced
#> 2025-01-19 17:59:18,604 Train: /Users/yenchiehliao/.flair/datasets/imdb_v4-rebalanced/train.txt
#> 2025-01-19 17:59:18,604 Dev: None
#> 2025-01-19 17:59:18,604 Test: None
#> 2025-01-19 17:59:19,199 No test split found. Using 10% (i.e. 5000 samples) of the train split as test data
#> 2025-01-19 17:59:19,217 No dev split found. Using 10% (i.e. 4500 samples) of the train split as dev data
#> 2025-01-19 17:59:19,218 Initialized corpus /Users/yenchiehliao/.flair/datasets/imdb_v4-rebalanced (label type name is 'sentiment')
# create the label dictionary
lbl_type <- 'sentiment'
label_dict <- corpus$make_label_dictionary(label_type=lbl_type)
#> 2025-01-19 17:59:19,236 Computing label dictionary. Progress:
#> 2025-01-19 17:59:28,293 Dictionary created for label 'sentiment' with 2 values: POSITIVE (seen 2056 times), NEGATIVE (seen 1994 times)

Stacked Embeddings

This is one of Flair’s most powerful features: it allows several embeddings to be combined so that the model can learn from complementary feature representations. Three types of embeddings are utilized here: GloVe embeddings and two types of Flair embeddings (forward and backward). Word embeddings are used to convert words into vectors.

# make a list of word embeddings
word_embeddings <- list(WordEmbeddings('glove'),
                        FlairEmbeddings('news-forward-fast'),
                        FlairEmbeddings('news-backward-fast'))

# initialize the document embeddings
document_embeddings <- DocumentRNNEmbeddings(word_embeddings, 
                                             hidden_size = 512L,
                                             reproject_words = TRUE,
                                             reproject_words_dimension = 256L)
# create a Text Classifier with the embeddings and label dictionary
classifier <- TextClassifier(document_embeddings, 
                            label_dictionary=label_dict, label_type=lbl_type)

# initialize the text classifier trainer with our corpus
trainer <- ModelTrainer(classifier, corpus)

Start the Training

For the sake of this example, we set max_epochs to 5. You might want to increase this for better performance.

It is worth noting that the learning rate is a parameter that determines the step size at each iteration while moving towards a minimum of the loss function. A smaller learning rate could slow down the learning process, but it could lead to more precise convergence. mini_batch_size determines the number of samples that will be used to compute the gradient at each step. The ‘L’ in 32L is used in R to denote that the number is an integer.

patience is a hyper-parameter used in conjunction with early stopping to avoid overfitting. It determines the number of epochs the training process will tolerate without improvement before stopping. Setting max_epochs to 5 means the algorithm will make five passes through the dataset.

# note: the 'L' in 32L is used in R to denote that the number is an integer.
trainer$train('models/sentiment',
              learning_rate=0.1,
              mini_batch_size=32L,
              patience=5L,
              max_epochs=5L)  

Apply the Trained Model for Prediction

sentence <- "This movie was really exciting!"
classifier$predict(sentence)
print(sentence.labels)

Finetune Transformers

We use data from The Temporal Focus of Campaign Communication (2020 JOP) as an example. Let’s assume we receive the data for training from different times. First, suppose you have a dataset of 1000 entries called cc_muller_old. On another day, with the help of nice friends, you receive another set of data, adding 2000 entries in a dataset called cc_muller_new. Both subsets are from data(cc_muller). We will show how to fine-tune a transformer model with cc_muller_old, and then continue with another round of fine-tuning using cc_muller_new.

Fine-tuning a Transformer Model

Step 1 Load Necessary Modules from Flair

Load necessary classes from flair package.

# Sentence is a class for holding a text sentence
Sentence <- flair_data()$Sentence

# Corpus is a class for text corpora
Corpus <- flair_data()$Corpus

# TransformerDocumentEmbeddings is a class for loading transformer document embeddings
TransformerDocumentEmbeddings <- flair_embeddings()$TransformerDocumentEmbeddings

# TextClassifier is a class for text classification
TextClassifier <- flair_models()$TextClassifier

# ModelTrainer is a class for training and evaluating models
ModelTrainer <- flair_trainers()$ModelTrainer

Step 2 Convert the Data to Sentence and Corpus

We use purrr to turn each text into a Sentence object from flair_data(), then use map2 to attach the labels, and finally use Corpus to assemble the corpus.

library(purrr)

data(cc_muller)
cc_muller_old <- cc_muller[1:1000,]

old_text <- map(cc_muller_old$text, Sentence)
old_labels <- as.character(cc_muller_old$class)

old_text <- map2(old_text, old_labels, ~ {
  .x$add_label("classification", .y)
  .x
})
print(length(old_text))
#> [1] 1000
set.seed(2046)
sample <- sample(c(TRUE, FALSE), length(old_text), replace=TRUE, prob=c(0.8, 0.2))
old_train  <- old_text[sample]
old_test   <- old_text[!sample]

test_id <- sample(c(TRUE, FALSE), length(old_test), replace=TRUE, prob=c(0.5, 0.5))
old_dev    <- old_test[!test_id]
old_test   <- old_test[test_id]
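
As a quick check (optional), print the resulting split sizes before assembling the corpus:

length(old_train)
length(old_test)
length(old_dev)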

If you do not provide a development set (dev set) while using Flair, it will automatically split the training data into training and development datasets. The training set is used for fitting the model, the test set is used only for evaluating its final performance, and the development set is used for adjusting model parameters and preventing overfitting, or in other words, for early stopping of the model.

old_corpus <- Corpus(train = old_train, test = old_test)
#> 2025-01-19 17:59:31,009 No dev split found. Using 10% (i.e. 80 samples) of the train split as dev data

Step 3 Load the distilbert Transformer

document_embeddings <- TransformerDocumentEmbeddings('distilbert-base-uncased', fine_tune=TRUE)

First, the $make_label_dictionary function is used to automatically create a label dictionary for the classification task. The label dictionary is a mapping from label to index, which is used to map the labels to a tensor of label indices. Besides classification tasks, flaiR also supports other label types for training custom models. From the cc_muller dataset: Future (seen 423 times), Present (seen 262 times), Past (seen 131 times).

old_label_dict <- old_corpus$make_label_dictionary(label_type="classification")
#> 2025-01-19 17:59:32,368 Computing label dictionary. Progress:
#> 2025-01-19 17:59:32,380 Dictionary created for label 'classification' with 3 values: Future (seen 380 times), Present (seen 232 times), Past (seen 111 times)

TextClassifier is used to create a text classifier. The classifier takes the document embeddings (loaded from 'distilbert-base-uncased' on Hugging Face) and the label dictionary as input. The label type is also specified as classification.

old_classifier <- TextClassifier(document_embeddings,
                                 label_dictionary = old_label_dict, 
                                 label_type='classification')

Step 4 Start Training

ModelTrainer is used to train the model.

old_trainer <- ModelTrainer(model = old_classifier, corpus = old_corpus)
old_trainer$train("vignettes/inst/muller-campaign-communication",  
                  learning_rate=0.02,              
                  mini_batch_size=8L,              
                  anneal_with_restarts = TRUE,
                  save_final_model=TRUE,
                  max_epochs=1L)   
#> 2025-01-19 17:59:32,537 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:59:32,537 Model: "TextClassifier(
#>   (embeddings): TransformerDocumentEmbeddings(
#>     (model): DistilBertModel(
#>       (embeddings): Embeddings(
#>         (word_embeddings): Embedding(30523, 768, padding_idx=0)
#>         (position_embeddings): Embedding(512, 768)
#>         (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#>         (dropout): Dropout(p=0.1, inplace=False)
#>       )
#>       (transformer): Transformer(
#>         (layer): ModuleList(
#>           (0-5): 6 x TransformerBlock(
#>             (attention): DistilBertSdpaAttention(
#>               (dropout): Dropout(p=0.1, inplace=False)
#>               (q_lin): Linear(in_features=768, out_features=768, bias=True)
#>               (k_lin): Linear(in_features=768, out_features=768, bias=True)
#>               (v_lin): Linear(in_features=768, out_features=768, bias=True)
#>               (out_lin): Linear(in_features=768, out_features=768, bias=True)
#>             )
#>             (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#>             (ffn): FFN(
#>               (dropout): Dropout(p=0.1, inplace=False)
#>               (lin1): Linear(in_features=768, out_features=3072, bias=True)
#>               (lin2): Linear(in_features=3072, out_features=768, bias=True)
#>               (activation): GELUActivation()
#>             )
#>             (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#>           )
#>         )
#>       )
#>     )
#>   )
#>   (decoder): Linear(in_features=768, out_features=3, bias=True)
#>   (dropout): Dropout(p=0.0, inplace=False)
#>   (locked_dropout): LockedDropout(p=0.0)
#>   (word_dropout): WordDropout(p=0.0)
#>   (loss_function): CrossEntropyLoss()
#>   (weights): None
#>   (weight_tensor) None
#> )"
#> 2025-01-19 17:59:32,537 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:59:32,537 Corpus: 723 train + 80 dev + 85 test sentences
#> 2025-01-19 17:59:32,537 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:59:32,537 Train:  723 sentences
#> 2025-01-19 17:59:32,537         (train_with_dev=False, train_with_test=False)
#> 2025-01-19 17:59:32,537 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:59:32,537 Training Params:
#> 2025-01-19 17:59:32,537  - learning_rate: "0.02" 
#> 2025-01-19 17:59:32,537  - mini_batch_size: "8"
#> 2025-01-19 17:59:32,538  - max_epochs: "1"
#> 2025-01-19 17:59:32,538  - shuffle: "True"
#> 2025-01-19 17:59:32,538 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:59:32,538 Plugins:
#> 2025-01-19 17:59:32,538  - AnnealOnPlateau | patience: '3', anneal_factor: '0.5', min_learning_rate: '0.0001'
#> 2025-01-19 17:59:32,538 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:59:32,538 Final evaluation on model from best epoch (best-model.pt)
#> 2025-01-19 17:59:32,538  - metric: "('micro avg', 'f1-score')"
#> 2025-01-19 17:59:32,538 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:59:32,538 Computation:
#> 2025-01-19 17:59:32,538  - compute on device: cpu
#> 2025-01-19 17:59:32,538  - embedding storage: cpu
#> 2025-01-19 17:59:32,538 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:59:32,538 Model training base path: "vignettes/inst/muller-campaign-communication"
#> 2025-01-19 17:59:32,538 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:59:32,538 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:59:35,241 epoch 1 - iter 9/91 - loss 1.15913804 - time (sec): 2.70 - samples/sec: 26.64 - lr: 0.020000 - momentum: 0.000000
#> 2025-01-19 17:59:37,824 epoch 1 - iter 18/91 - loss 1.09388583 - time (sec): 5.29 - samples/sec: 27.24 - lr: 0.020000 - momentum: 0.000000
#> 2025-01-19 17:59:40,444 epoch 1 - iter 27/91 - loss 0.98490597 - time (sec): 7.91 - samples/sec: 27.32 - lr: 0.020000 - momentum: 0.000000
#> 2025-01-19 17:59:43,101 epoch 1 - iter 36/91 - loss 0.90666179 - time (sec): 10.56 - samples/sec: 27.27 - lr: 0.020000 - momentum: 0.000000
#> 2025-01-19 17:59:45,659 epoch 1 - iter 45/91 - loss 0.86723306 - time (sec): 13.12 - samples/sec: 27.44 - lr: 0.020000 - momentum: 0.000000
#> 2025-01-19 17:59:48,056 epoch 1 - iter 54/91 - loss 0.82877369 - time (sec): 15.52 - samples/sec: 27.84 - lr: 0.020000 - momentum: 0.000000
#> 2025-01-19 17:59:51,125 epoch 1 - iter 63/91 - loss 0.82633138 - time (sec): 18.59 - samples/sec: 27.12 - lr: 0.020000 - momentum: 0.000000
#> 2025-01-19 17:59:53,371 epoch 1 - iter 72/91 - loss 0.77545721 - time (sec): 20.83 - samples/sec: 27.65 - lr: 0.020000 - momentum: 0.000000
#> 2025-01-19 17:59:55,599 epoch 1 - iter 81/91 - loss 0.75503454 - time (sec): 23.06 - samples/sec: 28.10 - lr: 0.020000 - momentum: 0.000000
#> 2025-01-19 17:59:57,828 epoch 1 - iter 90/91 - loss 0.73309229 - time (sec): 25.29 - samples/sec: 28.47 - lr: 0.020000 - momentum: 0.000000
#> 2025-01-19 17:59:57,950 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:59:57,951 EPOCH 1 done: loss 0.7313 - lr: 0.020000
#> 2025-01-19 17:59:58,735 DEV : loss 0.4210357666015625 - f1-score (micro avg)  0.8375
#> 2025-01-19 17:59:58,737  - 0 epochs without improvement
#> 2025-01-19 17:59:58,739 saving best model
#> 2025-01-19 17:59:59,547 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 17:59:59,548 Loading model from best epoch ...
#> 2025-01-19 18:00:01,467 
#> Results:
#> - F-score (micro) 0.8235
#> - F-score (macro) 0.8246
#> - Accuracy 0.8235
#> 
#> By class:
#>               precision    recall  f1-score   support
#> 
#>       Future     0.8125    0.9070    0.8571        43
#>      Present     0.7826    0.6667    0.7200        27
#>         Past     0.9286    0.8667    0.8966        15
#> 
#>     accuracy                         0.8235        85
#>    macro avg     0.8412    0.8134    0.8246        85
#> weighted avg     0.8235    0.8235    0.8205        85
#> 
#> 2025-01-19 18:00:01,467 ----------------------------------------------------------------------------------------------------
#> $test_score
#> [1] 0.8235294

Continue Fine-tuning with New Dataset

Now, we can continue to fine-tune the already fine-tuned model with an additional 2000 entries. Let’s say we receive another 2000 entries called cc_muller_new and fine-tune the previous model with them. The steps are the same as before. In this case, we don’t need to split the dataset again: we use the entire 2000 entries as the training set and keep the old_test set to evaluate how well our refined model performs.

Step 1 Load the muller-campaign-communication Model

Load the model (old_model) you fine-tuned in the previous stage and fine-tune it with the new data, new_corpus.

old_model <- TextClassifier$load("vignettes/inst/muller-campaign-communication/best-model.pt")

Step 2 Convert the New Data to Sentence and Corpus

library(purrr)
cc_muller_new <- cc_muller[1001:3000,]
new_text <- map(cc_muller_new$text, Sentence)
new_labels <- as.character(cc_muller_new$class)

new_text <- map2(new_text, new_labels, ~ {
  .x$add_label("classification", .y)
  .x
})
new_corpus <- Corpus(train=new_text, test=old_test)
#> 2025-01-19 18:00:03,729 No dev split found. Using 10% (i.e. 200 samples) of the train split as dev data

Step 3 Create a New Model Trainer with the Old Model and New Corpus

new_trainer <- ModelTrainer(old_model, new_corpus)

Step 4 Train the New Model

new_trainer$train("vignettes/inst/new-muller-campaign-communication",
                  learning_rate=0.002, 
                  mini_batch_size=8L,  
                  max_epochs=1L)    
#> 2025-01-19 18:00:03,821 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 18:00:03,822 Model: "TextClassifier(
#>   (embeddings): TransformerDocumentEmbeddings(
#>     (model): DistilBertModel(
#>       (embeddings): Embeddings(
#>         (word_embeddings): Embedding(30523, 768, padding_idx=0)
#>         (position_embeddings): Embedding(512, 768)
#>         (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#>         (dropout): Dropout(p=0.1, inplace=False)
#>       )
#>       (transformer): Transformer(
#>         (layer): ModuleList(
#>           (0-5): 6 x TransformerBlock(
#>             (attention): DistilBertSdpaAttention(
#>               (dropout): Dropout(p=0.1, inplace=False)
#>               (q_lin): Linear(in_features=768, out_features=768, bias=True)
#>               (k_lin): Linear(in_features=768, out_features=768, bias=True)
#>               (v_lin): Linear(in_features=768, out_features=768, bias=True)
#>               (out_lin): Linear(in_features=768, out_features=768, bias=True)
#>             )
#>             (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#>             (ffn): FFN(
#>               (dropout): Dropout(p=0.1, inplace=False)
#>               (lin1): Linear(in_features=768, out_features=3072, bias=True)
#>               (lin2): Linear(in_features=3072, out_features=768, bias=True)
#>               (activation): GELUActivation()
#>             )
#>             (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#>           )
#>         )
#>       )
#>     )
#>   )
#>   (decoder): Linear(in_features=768, out_features=3, bias=True)
#>   (dropout): Dropout(p=0.0, inplace=False)
#>   (locked_dropout): LockedDropout(p=0.0)
#>   (word_dropout): WordDropout(p=0.0)
#>   (loss_function): CrossEntropyLoss()
#>   (weights): None
#>   (weight_tensor) None
#> )"
#> 2025-01-19 18:00:03,822 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 18:00:03,822 Corpus: 1800 train + 200 dev + 85 test sentences
#> 2025-01-19 18:00:03,822 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 18:00:03,822 Train:  1800 sentences
#> 2025-01-19 18:00:03,822         (train_with_dev=False, train_with_test=False)
#> 2025-01-19 18:00:03,822 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 18:00:03,823 Training Params:
#> 2025-01-19 18:00:03,823  - learning_rate: "0.002" 
#> 2025-01-19 18:00:03,823  - mini_batch_size: "8"
#> 2025-01-19 18:00:03,823  - max_epochs: "1"
#> 2025-01-19 18:00:03,823  - shuffle: "True"
#> 2025-01-19 18:00:03,823 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 18:00:03,823 Plugins:
#> 2025-01-19 18:00:03,823  - AnnealOnPlateau | patience: '3', anneal_factor: '0.5', min_learning_rate: '0.0001'
#> 2025-01-19 18:00:03,823 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 18:00:03,823 Final evaluation on model from best epoch (best-model.pt)
#> 2025-01-19 18:00:03,823  - metric: "('micro avg', 'f1-score')"
#> 2025-01-19 18:00:03,824 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 18:00:03,824 Computation:
#> 2025-01-19 18:00:03,824  - compute on device: cpu
#> 2025-01-19 18:00:03,824  - embedding storage: cpu
#> 2025-01-19 18:00:03,824 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 18:00:03,824 Model training base path: "vignettes/inst/new-muller-campaign-communication"
#> 2025-01-19 18:00:03,824 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 18:00:03,824 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 18:00:09,250 epoch 1 - iter 22/225 - loss 0.47686889 - time (sec): 5.43 - samples/sec: 32.44 - lr: 0.002000 - momentum: 0.000000
#> 2025-01-19 18:00:15,460 epoch 1 - iter 44/225 - loss 0.47116109 - time (sec): 11.64 - samples/sec: 30.25 - lr: 0.002000 - momentum: 0.000000
#> 2025-01-19 18:00:21,372 epoch 1 - iter 66/225 - loss 0.45405770 - time (sec): 17.55 - samples/sec: 30.09 - lr: 0.002000 - momentum: 0.000000
#> 2025-01-19 18:00:27,612 epoch 1 - iter 88/225 - loss 0.42943539 - time (sec): 23.79 - samples/sec: 29.60 - lr: 0.002000 - momentum: 0.000000
#> 2025-01-19 18:00:33,126 epoch 1 - iter 110/225 - loss 0.41669404 - time (sec): 29.30 - samples/sec: 30.03 - lr: 0.002000 - momentum: 0.000000
#> 2025-01-19 18:00:39,205 epoch 1 - iter 132/225 - loss 0.41539682 - time (sec): 35.38 - samples/sec: 29.85 - lr: 0.002000 - momentum: 0.000000
#> 2025-01-19 18:00:45,387 epoch 1 - iter 154/225 - loss 0.41510456 - time (sec): 41.56 - samples/sec: 29.64 - lr: 0.002000 - momentum: 0.000000
#> 2025-01-19 18:00:51,676 epoch 1 - iter 176/225 - loss 0.41127963 - time (sec): 47.85 - samples/sec: 29.42 - lr: 0.002000 - momentum: 0.000000
#> 2025-01-19 18:00:56,708 epoch 1 - iter 198/225 - loss 0.40708528 - time (sec): 52.88 - samples/sec: 29.95 - lr: 0.002000 - momentum: 0.000000
#> 2025-01-19 18:01:02,230 epoch 1 - iter 220/225 - loss 0.39587183 - time (sec): 58.40 - samples/sec: 30.13 - lr: 0.002000 - momentum: 0.000000
#> 2025-01-19 18:01:03,442 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 18:01:03,442 EPOCH 1 done: loss 0.3941 - lr: 0.002000
#> 2025-01-19 18:01:05,382 DEV : loss 0.40370461344718933 - f1-score (micro avg)  0.85
#> 2025-01-19 18:01:05,387  - 0 epochs without improvement
#> 2025-01-19 18:01:05,389 saving best model
#> 2025-01-19 18:01:06,176 ----------------------------------------------------------------------------------------------------
#> 2025-01-19 18:01:06,176 Loading model from best epoch ...
#> 2025-01-19 18:01:08,031 
#> Results:
#> - F-score (micro) 0.8706
#> - F-score (macro) 0.8779
#> - Accuracy 0.8706
#> 
#> By class:
#>               precision    recall  f1-score   support
#> 
#>       Future     0.8837    0.8837    0.8837        43
#>      Present     0.7931    0.8519    0.8214        27
#>         Past     1.0000    0.8667    0.9286        15
#> 
#>     accuracy                         0.8706        85
#>    macro avg     0.8923    0.8674    0.8779        85
#> weighted avg     0.8755    0.8706    0.8718        85
#> 
#> 2025-01-19 18:01:08,032 ----------------------------------------------------------------------------------------------------
#> $test_score
#> [1] 0.8705882

Model Performance Metrics: Pre and Post Fine-tuning

After fine-tuning for 1 epoch, the model showed improved performance on the same test set.

Evaluation Metric   Pre-finetune   Post-finetune   Improvement
F-score (micro)     0.7294         0.8471          +0.1177
F-score (macro)     0.7689         0.8583          +0.0894
Accuracy            0.7294         0.8471          +0.1177

More R tutorials and documentation are available here.

Using Your Own Fine-tuned Model in flaiR

This section demonstrates how to utilize your custom fine-tuned model in flaiR for text classification tasks. Let’s explore this process step by step.

Setting Up Your Environment

First, we need to load the flaiR package and prepare our model:

library(flaiR)
classifier <- flair_models()$TextClassifier$load('vignettes/inst/new-muller-campaign-communication/best-model.pt')

It’s important to verify your model’s compatibility with $model_card. You can check this by examining the version requirements:

print(classifier$model_card)
#> $flair_version
#> [1] "0.15.0"
#> 
#> $pytorch_version
#> [1] "2.5.1"
#> 
#> $transformers_version
#> [1] "4.48.0"
#> 
#> $training_parameters
#> $training_parameters$base_path
#> PosixPath('vignettes/inst/new-muller-campaign-communication')
#> 
#> $training_parameters$learning_rate
#> [1] 0.002
#> 
#> $training_parameters$decoder_learning_rate
#> NULL
#> 
#> $training_parameters$mini_batch_size
#> [1] 8
#> 
#> $training_parameters$eval_batch_size
#> [1] 64
#> 
#> $training_parameters$mini_batch_chunk_size
#> NULL
#> 
#> $training_parameters$max_epochs
#> [1] 1
#> 
#> $training_parameters$optimizer
#> [1] "torch.optim.sgd.SGD"
#> 
#> $training_parameters$train_with_dev
#> [1] FALSE
#> 
#> $training_parameters$train_with_test
#> [1] FALSE
#> 
#> $training_parameters$max_grad_norm
#> [1] 5
#> 
#> $training_parameters$reduce_transformer_vocab
#> [1] FALSE
#> 
#> $training_parameters$main_evaluation_metric
#> $training_parameters$main_evaluation_metric[[1]]
#> [1] "micro avg"
#> 
#> $training_parameters$main_evaluation_metric[[2]]
#> [1] "f1-score"
#> 
#> 
#> $training_parameters$monitor_test
#> [1] FALSE
#> 
#> $training_parameters$monitor_train_sample
#> [1] 0
#> 
#> $training_parameters$use_final_model_for_eval
#> [1] FALSE
#> 
#> $training_parameters$gold_label_dictionary_for_eval
#> NULL
#> 
#> $training_parameters$exclude_labels
#> list()
#> 
#> $training_parameters$sampler
#> NULL
#> 
#> $training_parameters$shuffle
#> [1] TRUE
#> 
#> $training_parameters$shuffle_first_epoch
#> [1] TRUE
#> 
#> $training_parameters$embeddings_storage_mode
#> [1] "cpu"
#> 
#> $training_parameters$epoch
#> [1] 1
#> 
#> $training_parameters$save_final_model
#> [1] TRUE
#> 
#> $training_parameters$save_optimizer_state
#> [1] FALSE
#> 
#> $training_parameters$save_model_each_k_epochs
#> [1] 0
#> 
#> $training_parameters$create_file_logs
#> [1] TRUE
#> 
#> $training_parameters$create_loss_file
#> [1] TRUE
#> 
#> $training_parameters$write_weights
#> [1] FALSE
#> 
#> $training_parameters$use_amp
#> [1] FALSE
#> 
#> $training_parameters$multi_gpu
#> [1] FALSE
#> 
#> $training_parameters$plugins
#> $training_parameters$plugins[[1]]
#> $training_parameters$plugins[[1]]$`__cls__`
#> [1] "flair.trainers.plugins.functional.anneal_on_plateau.AnnealingPlugin"
#> 
#> $training_parameters$plugins[[1]]$base_path
#> [1] "vignettes/inst/new-muller-campaign-communication"
#> 
#> $training_parameters$plugins[[1]]$min_learning_rate
#> [1] 1e-04
#> 
#> $training_parameters$plugins[[1]]$anneal_factor
#> [1] 0.5
#> 
#> $training_parameters$plugins[[1]]$patience
#> [1] 3
#> 
#> $training_parameters$plugins[[1]]$initial_extra_patience
#> [1] 0
#> 
#> $training_parameters$plugins[[1]]$anneal_with_restarts
#> [1] FALSE
#> 
#> 
#> 
#> $training_parameters$kwargs
#> $training_parameters$kwargs$lr
#> [1] 0.002

# Check required versions
print(classifier$model_card$transformers_version)  # Required transformers version
#> [1] "4.48.0"
print(classifier$model_card$flair_version)         # Required Flair version
#> [1] "0.15.0"

Making Predictions

To make predictions, we first need to prepare our text by creating a Sentence object. This is a key component in Flair’s architecture that handles text processing:

# Get the Sentence class from flaiR
Sentence <- flair_data()$Sentence

# Create a Sentence object with your text
sentence <- Sentence("And to boost the housing we need, we will start to build a new generation of garden cities.")

# Make prediction
classifier$predict(sentence)

Access Prediction Results

Get predicted label

prediction <- sentence$labels[[1]]$value
print(prediction)
#> [1] "Future"

Get confidence score

confidence <- sentence$labels[[1]]$score 
print(confidence)
#> [1] 0.9971271
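
To classify several documents at once, the same pattern can be wrapped in a loop. A minimal sketch, assuming the classifier and Sentence class loaded above; the second example text is made up for illustration:

# Classify a character vector of texts with the loaded model (a sketch;
# assumes `classifier` and `Sentence` from the steps above)
texts <- c("And to boost the housing we need, we will start to build a new generation of garden cities.",
           "Our record over the past five years speaks for itself.")

results <- do.call(rbind, lapply(texts, function(txt) {
  s <- Sentence(txt)
  classifier$predict(s)
  data.frame(text  = txt,
             label = s$labels[[1]]$value,
             score = s$labels[[1]]$score)
}))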



Extending conText’s Embedding Regression

conText is a fast, flexible, and transparent framework for estimating context-specific word and short-document embeddings using ‘a la carte’ embedding regression, developed by Rodriguez et al (2022) and Rodriguez et al (2024).

In this case study, we demonstrate how to use the conText package alongside other embedding frameworks by working through the example in Rodriguez et al.’s Quick Start Guide. While conText ships with its own cross-lingual ALC embeddings, we extend its capabilities by integrating it with flaiR. This integration shows how to:

  • Access flaiR’s powerful embedding models

  • Connect with any transformer-based embedding model from HuggingFace via flaiR

We follow the example directly from Rodriguez et al.’s Quick Start Guide for this case study. Note that results obtained using alternative embedding frameworks may deviate from the original implementation and should be interpreted with caution; these comparative results are intended primarily for reference and educational use.

When you load the conText package, you’ll find three pre-loaded datasets: cr_sample_corpus, cr_glove_subset, and cr_transform, which the package’s tutorial uses to demonstrate preprocessing steps. For this exercise, we only use cr_sample_corpus and explore three other embedding frameworks:

  • FastText en-crawl embeddings

  • Flair NLP contextual embeddings (as described in Akbik et al.’s COLING 2018 paper)

  • Embeddings extracted from transformer models such as BERT

Build Document-Embedding-Matrix with Other Embedding Frameworks

Step 1 Tokenize Text with quanteda and conText

First, let’s tokenize cr_sample_corpus using the tokens_context function from the conText package.

library(conText)   # provides cr_sample_corpus, tokens_context(), dem(), conText()
library(quanteda)  # provides tokens(), dfm(), fcm(), and friends

# tokenize corpus removing unnecessary (i.e. semantically uninformative) elements
toks <- tokens(cr_sample_corpus, remove_punct = TRUE, remove_symbols = TRUE,
               remove_numbers = TRUE, remove_separators = TRUE)

# clean out stopwords and words with 2 or fewer characters
toks_nostop <- tokens_select(toks, pattern = stopwords("en"), selection = "remove", min_nchar = 3)

# only use features that appear at least 5 times in the corpus
feats <- dfm(toks_nostop, tolower = TRUE, verbose = FALSE) %>% dfm_trim(min_termfreq = 5) %>% featnames()

# leave the pads so that non-adjacent words will not become adjacent
toks_nostop_feats <- tokens_select(toks_nostop, feats, padding = TRUE)

# build a tokenized corpus of contexts surrounding the target term "immigration"
immig_toks <- tokens_context(x = toks_nostop_feats, pattern = "immigr*", window = 6L)
#> 125 instances of "immigrant" found.
#> 288 instances of "immigrants" found.
#> 924 instances of "immigration" found.

# build document-feature matrix
immig_dfm <- dfm(immig_toks)

Step 2 Import Embedding Tools

To facilitate the loading of different embedding types, we import the following classes from flaiR: WordEmbeddings, FlairEmbeddings, TransformerWordEmbeddings, StackedEmbeddings, and Sentence. These components enable us to work with classic word embeddings such as GloVe and FastText, Flair’s contextual embeddings, and transformer-based embeddings from the HuggingFace library.
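
A minimal sketch of this import step, assuming flaiR’s flair_embeddings() and flair_data() accessors and Flair’s ‘en-crawl’ FastText vectors (the object names are ours):

# A sketch of the import step (assumes flaiR exposes the Flair classes
# through flair_embeddings() and flair_data())
WordEmbeddings <- flair_embeddings()$WordEmbeddings
FlairEmbeddings <- flair_embeddings()$FlairEmbeddings
TransformerWordEmbeddings <- flair_embeddings()$TransformerWordEmbeddings
StackedEmbeddings <- flair_embeddings()$StackedEmbeddings
Sentence <- flair_data()$Sentence

# Load the FastText crawl vectors ('en-crawl') used in the next step
fasttext_embeddings <- WordEmbeddings('en-crawl')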

Initialize a Flair Sentence object by concatenating the cr_glove_subset row names. The collapse parameter ensures proper tokenization by adding space delimiters. Then embed the sentence text using the loaded FastText embeddings.

# Combine all text into a single string and create a Flair sentence
sentence <- Sentence(paste(rownames(cr_glove_subset), collapse = " "))

# Apply FastText embeddings to the sentence
fasttext_embeddings$embed(sentence)
#> [[1]]
#> Sentence[500]: "people bill immigration president can american country law states one united speaker time going just now senate years want congress work get border know said make security many think act need children today come house also americans like year dont say america new state take way jobs senator federal government first back even amendment important well legislation support right women reform colleagues million committee laws system percent vote floor last every nation program made issue care immigrants see department national health world good republican families court workers executive administration much help illegal enforcement republicans thank fact members working number opportunity thats believe pass great never legal obama homeland money family across two put look process came may violence order give rule job done able pay must day economy home community budget tax public without passed things better since majority texas let justice another hope policy lets provide debate history ago sure presidents action week young actually nations amendments bipartisan says long issues comprehensive next something place life doesnt thing lot point chairman economic business leadership service constitution tell status dream cant yield future office billion visa find military still member talk together school rights continue part madam amnesty days secure bring best yet lives democrats broken times side nothing ask protect allow problem around making clear body foreign communities including coming use rise simply heard forward end programs leader already saying individuals general power funding case whether debt course keep question worked needs washington war change got address secretary parents little least hard three citizens different education judiciary ever strong immigrant always means current report child less really services political given increase might enough create countries domestic human supreme authority brought reason kind past become aliens understand millions judge enforce benefits constitutional high stand others criminal serve stop middle real matter talking makes someone proud month safety line used college live didnt class called call far borders victims ensure district thousands california taxes rules obamacare deal goes representatives took insurance spending companies actions small instead read resources friends mexico seen illegally policies trying labor party example crisis bills votes ice special wage went taken men asked told comes serious maybe students businesses getting words group chamber senators taking leaders wish dollars citizenship sense problems close information social friend stay offer record away months set urge person minimum democratic cost local voted numbers join energy served happen aisle found congressional wanted according plan kids try chance major almost powers trade undocumented critical poverty start article single left certainly cases access patrol individual honor remember income attorney within officers wrong speak company second hundreds impact created hours respect arizona though move hearing employees gentleman anyone along known meet chair full southern big trillion level four ability experience face provisions man provides fix open theyre efforts asian university decision groups industry allowed higher often agree white visas held upon path certain share isnt looking employers consider colleague average woman anything however balance former city interest control crime weeks free gone unemployment idea true"

The process_embeddings function from flaiR extracts the embedded token vectors from a sentence object and arranges them into a structured matrix: tokens are represented as rows, embedding dimensions as columns, and each row is labeled with its corresponding token text.

fasttext_subset <- process_embeddings(sentence, verbose = TRUE)
#> Extracting token embeddings...
#> Converting embeddings to matrix format...Processing completed in 3.527 seconds
#> Generated embedding matrix with 500 tokens and 300 dimensions

Step 3 Compute Context-Specific Word Embeddings Using FastText

Create a feature co-occurrence matrix (FCM) from tokenized text and transform pre-trained FastText embeddings using co-occurrence information.

# Create a feature co-occurrence matrix (FCM) from tokenized text
toks_fcm <- fcm(toks_nostop_feats, 
                context = "window",    
                window = 6,            
                count = "frequency",   
                tri = FALSE)           

# Transform pre-trained FastText embeddings using co-occurrence information
ft_transform <- compute_transform(
    x = toks_fcm,                    
    pre_trained = fasttext_subset,        
    weighting = 'log'                
)

Calculate Document Embedding Matrix (DEM) using transformed FastText embeddings.

# Calculate Document Embedding Matrix (DEM) using transformed FastText embeddings
immig_dem_ft <- dem(x = immig_dfm, 
                    pre_trained = fasttext_subset, 
                    transform = TRUE, 
                    transform_matrix = ft_transform, 
                    verbose = TRUE)

Each document inherits its corresponding docvars:

head(immig_dem_ft@docvars)
#>       pattern party gender nominate_dim1
#> 1  immigrants     D      F        -0.759
#> 2 immigration     D      F        -0.402
#> 3   immigrant     D      F        -0.402
#> 4   immigrant     D      F        -0.402
#> 5   immigrant     D      F        -0.402
#> 6 immigration     D      F        -0.402

Step 4 Embedding Regression

set.seed(2021L)
library(conText)
ft_model <- conText(formula = immigration ~ party + gender,
                    data = toks_nostop_feats,
                    pre_trained = fasttext_subset,
                    transform = TRUE, 
                    transform_matrix = ft_transform, 
                    confidence_level = 0.95,
                    permute = TRUE, 
                    jackknife = TRUE,
                    num_permutations = 100,
                    window = 6, case_insensitive = TRUE,
                    verbose = FALSE)
#> Note: These values are not regression coefficients. Check out the Quick Start Guide for help with interpretation: 
#> https://github.com/prodriguezsosa/conText/blob/master/vignettes/quickstart.md
#> 
#>   coefficient normed.estimate std.error  lower.ci upper.ci p.value
#> 1     party_R        1.169415 0.1589605 0.8574491 1.481381       0
#> 2    gender_M        0.837367 0.1433172 0.5561016 1.118632       0

Extract D-dimensional beta coefficients.

# The intercept in this case is the FastText embedding for female Democrats
# beta coefficients can be combined to get each group's FastText embedding
DF_wv <- ft_model['(Intercept)',]  # (D)emocrat - (F)emale 
DM_wv <- ft_model['(Intercept)',] + ft_model['gender_M',] # (D)emocrat - (M)ale 
RF_wv <- ft_model['(Intercept)',] + ft_model['party_R',]  # (R)epublican - (F)emale 
RM_wv <- ft_model['(Intercept)',] + ft_model['party_R',] + ft_model['gender_M',] # (R)epublican - (M)ale

Find nearest neighbors for the female and male Democrat embeddings.

nns(rbind(DF_wv,DM_wv), 
    N = 10, 
    pre_trained = fasttext_subset, 
    candidates = ft_model@features)
#> $DM_wv
#> # A tibble: 10 × 4
#>    target feature      rank value
#>    <chr>  <chr>       <int> <dbl>
#>  1 DM_wv  immigration     1 0.850
#>  2 DM_wv  immigrants      2 0.746
#>  3 DM_wv  immigrant       3 0.698
#>  4 DM_wv  education       4 0.589
#>  5 DM_wv  government      5 0.587
#>  6 DM_wv  legislation     6 0.579
#>  7 DM_wv  city            7 0.569
#>  8 DM_wv  law             8 0.566
#>  9 DM_wv  reform          9 0.559
#> 10 DM_wv  citizenship    10 0.557
#> 
#> $DF_wv
#> # A tibble: 10 × 4
#>    target feature      rank value
#>    <chr>  <chr>       <int> <dbl>
#>  1 DF_wv  immigration     1 0.779
#>  2 DF_wv  immigrants      2 0.667
#>  3 DF_wv  immigrant       3 0.620
#>  4 DF_wv  education       4 0.574
#>  5 DF_wv  government      5 0.556
#>  6 DF_wv  economic        6 0.551
#>  7 DF_wv  citizenship     7 0.549
#>  8 DF_wv  legislation     8 0.540
#>  9 DF_wv  issues          9 0.531
#> 10 DF_wv  laws           10 0.530

library(ggplot2)
ggplot(ft_model@normed_coefficients, aes(x = coefficient, y = normed.estimate)) +
  geom_errorbar(aes(ymin = lower.ci, ymax = upper.ci), width = 0.2) +
  geom_point(size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  theme_minimal() +
  labs(
    title = "Estimated Coefficients with 95% CIs",
    x = "Variables",
    y = "Normalized Estimate"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Exploring Document-Embedding Matrix with conText Functions

Check dimensions of the resulting matrix.

# Calculate average document embeddings for immigration-related texts
immig_wv_ft <- matrix(colMeans(immig_dem_ft), 
                   ncol = ncol(immig_dem_ft)) %>%  `rownames<-`("immigration")
dim(immig_wv_ft)
#> [1]   1 300

To get group-specific embeddings, average within party.

immig_wv_ft_party <- dem_group(immig_dem_ft, 
                               groups = immig_dem_ft@docvars$party)
dim(immig_wv_ft_party)
#> [1]   2 300

Find nearest neighbors by party

# find nearest neighbors by party
# setting as_list = TRUE returns a separate tibble for each group (so results can be indexed by party)
immig_nns_ft <- nns(immig_wv_ft_party, 
                    pre_trained = fasttext_subset, 
                    N = 5, 
                    candidates = immig_wv_ft_party@features, 
                    as_list = TRUE)

Check out the results for Republicans.

immig_nns_ft[["R"]]
#> # A tibble: 5 × 4
#>   target feature      rank value
#>   <chr>  <chr>       <int> <dbl>
#> 1 R      immigration     1 0.837
#> 2 R      immigrants      2 0.779
#> 3 R      immigrant       3 0.721
#> 4 R      government      4 0.590
#> 5 R      laws            5 0.588

Check out the results for Democrats.

immig_nns_ft[["D"]]
#> # A tibble: 5 × 4
#>   target feature      rank value
#>   <chr>  <chr>       <int> <dbl>
#> 1 D      immigration     1 0.849
#> 2 D      immigrants      2 0.817
#> 3 D      immigrant       3 0.764
#> 4 D      education       4 0.621
#> 5 D      families        5 0.612

Compute the cosine similarity between each party’s embedding and a specific set of features.

cos_sim(immig_wv_ft_party, 
        pre_trained = fasttext_subset, 
        features = c('reform', 'enforcement'), as_list = FALSE)
#>   target     feature     value
#> 1      D      reform 0.5306484
#> 2      R      reform 0.5425264
#> 3      D enforcement 0.5102908
#> 4      R enforcement 0.5647015

Compute the ratio of cosine similarities between the two parties’ embeddings over candidate features with nns_ratio.

# Republican
nns_ratio(x = immig_wv_ft_party, 
          N = 15, 
          numerator = "R", 
          candidates = immig_wv_ft_party@features, 
          pre_trained = fasttext_subset, 
          verbose = FALSE)
#>        feature     value
#> 1       coming 1.3142709
#> 2       things 1.2022012
#> 3    political 1.1506919
#> 4  enforcement 1.1066269
#> 5       aliens 1.0194435
#> 6       issues 1.0124816
#> 7   government 0.9908574
#> 8        visas 0.9904320
#> 9         laws 0.9874748
#> 10 immigration 0.9849872
#> 11 legislation 0.9666258
#> 12  immigrants 0.9526017
#> 13 citizenship 0.9524804
#> 14   education 0.9448045
#> 15   immigrant 0.9435372
#> 16         law 0.9222851
#> 17   community 0.8816272
#> 18    citizens 0.8722014
#> 19        city 0.8544540
#> 20    families 0.7845902
#> 21 communities 0.7798629
# Democrat
nns_ratio(x = immig_wv_ft_party, 
          N = 15, 
          numerator = "D", 
          candidates = immig_wv_ft_party@features, 
          pre_trained = fasttext_subset, 
          verbose = FALSE)
#>        feature     value
#> 1  communities 1.2822766
#> 2     families 1.2745507
#> 3         city 1.1703380
#> 4     citizens 1.1465242
#> 5    community 1.1342663
#> 6          law 1.0842634
#> 7    immigrant 1.0598416
#> 8    education 1.0584201
#> 9  citizenship 1.0498903
#> 10  immigrants 1.0497567
#> 11 legislation 1.0345265
#> 12 immigration 1.0152416
#> 13        laws 1.0126841
#> 14       visas 1.0096605
#> 15  government 1.0092270
#> 16      issues 0.9876722
#> 17      aliens 0.9809273
#> 18 enforcement 0.9036469
#> 19   political 0.8690423
#> 20      things 0.8318075
#> 21      coming 0.7608781

Compute the cosine similarity between each party’s embedding and a set of tokenized contexts using ncs.

immig_ncs <- ncs(x = immig_wv_ft_party, 
                 contexts_dem = immig_dem_ft, 
                 contexts = immig_toks, 
                 N = 5, 
                 as_list = TRUE)

# nearest contexts to Republican embedding of target term
# note, these may include contexts originating from Democrat speakers
immig_ncs[["R"]]
#> # A tibble: 5 × 4
#>   target context                                                rank value
#>   <chr>  <chr>                                                 <int> <dbl>
#> 1 R      reminded difficult achieve consensus issues reform s…     1 0.536
#> 2 R      laws last years eight states adopted enforcement mea…     2 0.496
#> 3 R      well exactly right way stop illegal reducing demand …     3 0.492
#> 4 R      speeches border enforcement speeches stopping illega…     4 0.482
#> 5 R      welcome legal immigrants legal process illegal simpl…     5 0.472
immig_ncs[["D"]]
#> # A tibble: 5 × 4
#>   target context                                                rank value
#>   <chr>  <chr>                                                 <int> <dbl>
#> 1 D      heritage principles demand courage reform broken sys…     1 0.571
#> 2 D      speaker almost year since twothirds comprehensive re…     2 0.562
#> 3 D      senate accepted indeed need thorough comprehensive r…     3 0.562
#> 4 D      state union consideration bill provide comprehensive…     4 0.536
#> 5 D      cover house republican failure bring comprehensive r…     5 0.534

Comparative Analysis of A La Carte, Flair Stacked, and BERT Embeddings

Build A La Carte Document-Embedding-Matrix

# build a document-embedding-matrix
immig_dem <- dem(x = immig_dfm, pre_trained = cr_glove_subset, transform = TRUE, transform_matrix = cr_transform, verbose = TRUE)

set.seed(2021L)
alc_model <- conText(formula = immigration ~ party + gender,
                     data = toks_nostop_feats,
                     pre_trained = cr_glove_subset,
                     transform = TRUE, 
                     transform_matrix = cr_transform,
                     jackknife = TRUE, 
                     confidence_level = 0.95,
                     permute = TRUE, 
                     num_permutations = 100,
                     window = 6, 
                     case_insensitive = TRUE,
                     verbose = FALSE)
#> Note: These values are not regression coefficients. Check out the Quick Start Guide for help with interpretation: 
#> https://github.com/prodriguezsosa/conText/blob/master/vignettes/quickstart.md
#> 
#>   coefficient normed.estimate std.error lower.ci upper.ci p.value
#> 1     party_R        2.960860 0.2398964 2.490054 3.431665       0
#> 2    gender_M        2.303401 0.2290150 1.853950 2.752851       0

Document Embedding Matrix Construction Using Flair Contextual Stacked Embeddings

We reuse the embedding classes imported from flaiR in Step 2 above: WordEmbeddings, FlairEmbeddings, TransformerWordEmbeddings, StackedEmbeddings, and Sentence. These components let us combine classic word embeddings, Flair’s contextual embeddings, and transformer-based embeddings from the HuggingFace library.

Combine three different types of embeddings into a single stacked embedding model:

  • FastText embeddings: capture general word semantics
  • Forward Flair: captures contextual information reading text left-to-right
  • Backward Flair: captures contextual information reading text right-to-left
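
The forward and backward Flair language models need to be loaded before stacking. A minimal sketch, assuming Flair’s standard ‘news-forward’ and ‘news-backward’ model ids and the FlairEmbeddings class imported in Step 2:

# Load Flair's pre-trained character language models (a sketch;
# 'news-forward' and 'news-backward' are standard Flair model ids)
flair_forward  <- FlairEmbeddings('news-forward')
flair_backward <- FlairEmbeddings('news-backward')
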
stacked_embeddings  <- StackedEmbeddings(list(
  fasttext_embeddings,    
  flair_forward,          
  flair_backward        
))
# Step 1: Create a Flair Sentence object from the text
sentence <- Sentence(paste(rownames(cr_glove_subset), collapse = " "))

# Step 2: Generate embeddings using our stacked model
stacked_embeddings$embed(sentence)

# Step 3: Extract and store embeddings for each token
stacked_subset <- process_embeddings(sentence, verbose = TRUE)
#> Extracting token embeddings...
#> Converting embeddings to matrix format...Processing completed in 3.123 seconds
#> Generated embedding matrix with 500 tokens and 4396 dimensions

# Step 4: Compute transformation matrix

st_transform <- compute_transform(
   x = toks_fcm,               
   pre_trained = stacked_subset,
   weighting = 'log'          
)

# Step 5: Generate document embeddings matrix
immig_dem_st <- dem(
   x = immig_dfm,              
   pre_trained = stacked_subset,
   transform = TRUE,          
   transform_matrix = st_transform,
   verbose = TRUE             
)

# Step 6: Fit conText model for analysis
set.seed(2021L)                 
st_model <- conText(formula = immigration ~ party + gender,  
                    data = toks_nostop_feats,                
                    pre_trained = stacked_subset,          
                    transform = TRUE,                      
                    transform_matrix = st_transform,       
                    jackknife = TRUE,                      
                    confidence_level = 0.95,             
                    permute = TRUE,                      
                    num_permutations = 100,               
                    window = 6,                          
                    case_insensitive = TRUE,              
                    verbose = FALSE)
#> Note: These values are not regression coefficients. Check out the Quick Start Guide for help with interpretation: 
#> https://github.com/prodriguezsosa/conText/blob/master/vignettes/quickstart.md
#> 
#>   coefficient normed.estimate std.error  lower.ci upper.ci p.value
#> 1     party_R        121.4567  96.63111 -68.18549 311.0988    0.32
#> 2    gender_M        226.2376 102.94598  24.20230 428.2730    0.07

Document Embedding Matrix Construction with BERT

BERT embeddings provide powerful contextual representations through their bidirectional transformer architecture. These embeddings are good at understanding context from both directions within text, generating deep contextual representations through multiple transformer layers, and leveraging pre-training on large text corpora to achieve strong performance across NLP tasks. The classic BERT base model generates 768-dimensional embeddings for each token, providing rich semantic representations.

By utilizing the Flair framework, we can also seamlessly integrate:

  • Multiple BERT variants like RoBERTa and DistilBERT
  • Cross-lingual models such as XLM-RoBERTa
  • Domain-adapted BERT models
  • Any transformer model available on HuggingFace

# Initialize BERT base uncased model embeddings from HuggingFace
bert_embeddings <- TransformerWordEmbeddings('bert-base-uncased')
# Step 1: Create a Flair Sentence object from the text
sentence <- Sentence(paste(rownames(cr_glove_subset), collapse = " "))

# Step 2: Generate embeddings using the BERT model from HuggingFace
bert_embeddings$embed(sentence)
#> [[1]]
#> Sentence[500]: "people bill immigration president can american country law states one united speaker time going just now senate years want congress work get border know said make security many think act need children today come house also americans like year dont say america new state take way jobs senator federal government first back even amendment important well legislation support right women reform colleagues million committee laws system percent vote floor last every nation program made issue care immigrants see department national health world good republican families court workers executive administration much help illegal enforcement republicans thank fact members working number opportunity thats believe pass great never legal obama homeland money family across two put look process came may violence order give rule job done able pay must day economy home community budget tax public without passed things better since majority texas let justice another hope policy lets provide debate history ago sure presidents action week young actually nations amendments bipartisan says long issues comprehensive next something place life doesnt thing lot point chairman economic business leadership service constitution tell status dream cant yield future office billion visa find military still member talk together school rights continue part madam amnesty days secure bring best yet lives democrats broken times side nothing ask protect allow problem around making clear body foreign communities including coming use rise simply heard forward end programs leader already saying individuals general power funding case whether debt course keep question worked needs washington war change got address secretary parents little least hard three citizens different education judiciary ever strong immigrant always means current report child less really services political given increase might enough create countries domestic human supreme authority brought reason kind past become aliens understand millions judge enforce benefits constitutional high stand others criminal serve stop middle real matter talking makes someone proud month safety line used college live didnt class called call far borders victims ensure district thousands california taxes rules obamacare deal goes representatives took insurance spending companies actions small instead read resources friends mexico seen illegally policies trying labor party example crisis bills votes ice special wage went taken men asked told comes serious maybe students businesses getting words group chamber senators taking leaders wish dollars citizenship sense problems close information social friend stay offer record away months set urge person minimum democratic cost local voted numbers join energy served happen aisle found congressional wanted according plan kids try chance major almost powers trade undocumented critical poverty start article single left certainly cases access patrol individual honor remember income attorney within officers wrong speak company second hundreds impact created hours respect arizona though move hearing employees gentleman anyone along known meet chair full southern big trillion level four ability experience face provisions man provides fix open theyre efforts asian university decision groups industry allowed higher often agree white visas held upon path certain share isnt looking employers consider colleague average woman anything however balance former city interest control crime weeks free gone unemployment idea true"

# Step 3: Extract and store embeddings for each token
bert_subset <- process_embeddings(sentence, verbose = TRUE)
#> Extracting token embeddings...
#> Converting embeddings to matrix format...Processing completed in 2.798 seconds
#> Generated embedding matrix with 500 tokens and 768 dimensions

# Step 4: Compute transformation matrix 
bt_transform <- compute_transform(x = toks_fcm,     
                                  pre_trained = bert_subset,
                                  weighting = 'log')

# Step 5: Generate document embeddings matrix
immig_dem_bt <- dem(x = immig_dfm,
                    pre_trained = bert_subset,
                    transform = TRUE,          
                    transform_matrix = bt_transform,
                    verbose = TRUE)

# Step 6: Fit conText model for analysis
set.seed(2021L)                 
bt_model <- conText(formula = immigration ~ party + gender,  
                    data = toks_nostop_feats,                
                    pre_trained = bert_subset,          
                    transform = TRUE,                      
                    transform_matrix = bt_transform,       
                    jackknife = TRUE,                      
                    confidence_level = 0.95,             
                    permute = TRUE,                      
                    num_permutations = 100,               
                    window = 6,                          
                    case_insensitive = TRUE,              
                    verbose = FALSE)
#> Note: These values are not regression coefficients. Check out the Quick Start Guide for help with interpretation: 
#> https://github.com/prodriguezsosa/conText/blob/master/vignettes/quickstart.md
#> 
#>   coefficient normed.estimate std.error   lower.ci  upper.ci p.value
#> 1     party_R        356.6358  260.9004 -155.39102  868.6627    0.20
#> 2    gender_M        571.7140  276.0197   30.01485 1113.4131    0.07

Comparison of Different Embedding Approaches

While this tutorial doesn’t determine a definitive best approach, it’s important to understand the key distinctions between word embedding methods. BERT, FastText, Flair Stacked Embeddings, and GloVe can be categorized into two groups: dynamic and static embeddings.

Dynamic embeddings, particularly BERT and Flair, adapt their word representations based on context using high-dimensional vector spaces (BERT uses 768 dimensions in its base model). BERT employs self-attention mechanisms and subword tokenization, while Flair uses character-level modeling. Both effectively handle out-of-vocabulary words through these mechanisms.

However, there is a notable difference between the original case study and our adaptation. Whereas the original supplies pre-computed vectors for a selected vocabulary, we directly extract individual word vectors from BERT and Flair (forward/backward) embeddings for the same set of words. This does not truly exploit BERT’s and Flair’s capability of modeling context. A more meaningful approach would be to extract embeddings at the quasi-sentence or paragraph level, or alternatively to pool the entire document before extracting embeddings, as sketched below.
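
As a rough illustration of that alternative, each retrieved context window can be embedded as its own Sentence, so the contextual model sees genuine local context rather than one concatenated pseudo-sentence. A minimal sketch, assuming immig_toks, bert_embeddings, process_embeddings, and Sentence from the steps above:

# Embed each context window around "immigr*" as its own Sentence
# (a sketch: first five contexts only, to keep the example fast)
contexts <- sapply(as.list(immig_toks[1:5]), paste, collapse = " ")

context_matrices <- lapply(contexts, function(txt) {
  s <- Sentence(txt)
  bert_embeddings$embed(s)
  process_embeddings(s)  # tokens x 768 matrix for this context window
})

# Pool token vectors within each context to get one vector per document
context_vectors <- do.call(rbind, lapply(context_matrices, colMeans))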

These context-based approaches stand in stark contrast to GloVe’s methodology, which relies on pre-computed global word-word co-occurrence statistics to generate static word vectors.



Cite

@Manual{,
  title = {Flair NLP and flaiR for Social Science},
  author = {Yen-Chieh Liao and Sohini Timbadia and Stefan Müller},
  year = {2024},
  url = {https://davidycliao.github.io/flaiR/articles/tutorial.html}
}