
Generic Approach Using Part-of-Speech Tagging

library(flaiR)
data("uk_immigration")
uk_immigration <- head(uk_immigration, 2)

Download the pos (English) part-of-speech tagging model from FlairNLP on Hugging Face.

tagger_pos <- load_tagger_pos("pos")
#> 2024-09-23 11:43:30,060 SequenceTagger predicts: Dictionary with 53 tags: <unk>, O, UH, ,, VBD, PRP, VB, PRP$, NN, RB, ., DT, JJ, VBP, VBG, IN, CD, NNS, NNP, WRB, VBZ, WDT, CC, TO, MD, VBN, WP, :, RP, EX, JJR, FW, XX, HYPH, POS, RBR, JJS, PDT, NNPS, RBS, AFX, WP$, -LRB-, -RRB-, ``, '', LS, $, SYM, ADD
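
The loaded tagger also exposes its tag inventory programmatically. A minimal sketch, assuming the underlying Flair Python attributes are reachable through reticulate as usual:

# Retrieve the 53 tags listed above from the tagger's label dictionary.
# label_dictionary and get_items() come from the Flair Python API.
tagger_pos$label_dictionary$get_items()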

Flair NLP operates on the PyTorch framework, so we can use the $to method to set the device for the Flair Python library. flair_device("cpu") lets you choose the CPU, a CUDA device (such as cuda:0, cuda:1, or cuda:2), or an MPS device on Mac (such as mps:0, mps:1, or mps:2). For information on accelerated PyTorch training on Mac, see https://developer.apple.com/metal/pytorch/; for more about CUDA, see https://developer.nvidia.com/cuda-zone.

tagger_pos$to(flair_device("mps"))
#> SequenceTagger(
#>   (embeddings): StackedEmbeddings(
#>     (list_embedding_0): FlairEmbeddings(
#>       (lm): LanguageModel(
#>         (drop): Dropout(p=0.05, inplace=False)
#>         (encoder): Embedding(300, 100)
#>         (rnn): LSTM(100, 2048)
#>         (decoder): Linear(in_features=2048, out_features=300, bias=True)
#>       )
#>     )
#>     (list_embedding_1): FlairEmbeddings(
#>       (lm): LanguageModel(
#>         (drop): Dropout(p=0.05, inplace=False)
#>         (encoder): Embedding(300, 100)
#>         (rnn): LSTM(100, 2048)
#>         (decoder): Linear(in_features=2048, out_features=300, bias=True)
#>       )
#>     )
#>   )
#>   (word_dropout): WordDropout(p=0.05)
#>   (locked_dropout): LockedDropout(p=0.5)
#>   (embedding2nn): Linear(in_features=4096, out_features=4096, bias=True)
#>   (rnn): LSTM(4096, 256, batch_first=True, bidirectional=True)
#>   (linear): Linear(in_features=512, out_features=53, bias=True)
#>   (loss_function): ViterbiLoss()
#>   (crf): CRF()
#> )
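
If no MPS or CUDA device is available, the same $to method moves the model back to the CPU. A minimal sketch reusing the flair_device() helper described above:

tagger_pos$to(flair_device("cpu"))       # fall back to the CPU
# tagger_pos$to(flair_device("cuda:0"))  # or the first CUDA GPU, if present
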
results <- get_pos(uk_immigration$text, 
                   uk_immigration$speaker, tagger_pos, 
                   show.text_id = FALSE,
                   gc.active = FALSE)
print(results)
#>                doc_id token_id text_id   token    tag precision
#>                <char>    <num>  <lgcl>  <char> <char>     <num>
#>   1: Philip Hollobone        0      NA       I    PRP    1.0000
#>   2: Philip Hollobone        1      NA   thank    VBP    0.9996
#>   3: Philip Hollobone        2      NA     Mr.    NNP    1.0000
#>   4: Philip Hollobone        3      NA Speaker    NNP    1.0000
#>   5: Philip Hollobone        4      NA     for     IN    1.0000
#>  ---                                                           
#> 440:  Stewart Jackson       66      NA parties    NNS    1.0000
#> 441:  Stewart Jackson       67      NA      in     IN    1.0000
#> 442:  Stewart Jackson       68      NA    this     DT    1.0000
#> 443:  Stewart Jackson       69      NA country     NN    1.0000
#> 444:  Stewart Jackson       70      NA       ?      .    0.9949
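
The returned object prints as a data.table, so standard data.table operations apply directly. A minimal sketch, assuming the data.table package is attached, that tabulates tag frequencies per speaker:

library(data.table)
# Count how often each POS tag occurs per document (speaker),
# then sort within each document by descending frequency.
tag_counts <- results[, .N, by = .(doc_id, tag)][order(doc_id, -N)]
head(tag_counts)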

Batch Processing

By default, the batch_size parameter is set to 5. Start with this default, then experiment with other batch sizes to find what works best for your use case, monitoring memory usage and processing time as you go. If you have access to a GPU, larger batch sizes can exploit GPU parallelism, but setting the batch size too large can lead to out-of-memory errors. Ultimately, the choice of batch size should balance memory constraints, processing efficiency, and the specific requirements of your part-of-speech tagging task; a timing sketch follows the example output below.

batch_process_results  <- get_pos_batch(uk_immigration$text,
                                        uk_immigration$speaker, 
                                        tagger_pos, 
                                        show.text_id = FALSE,
                                        batch_size = 10,
                                        verbose = TRUE)
#> CPU is used.
#> Processing batch starting at index: 1
print(batch_process_results)
#>                doc_id token_id text_id   token    tag precision
#>                <char>    <num>  <lgcl>  <char> <char>     <num>
#>   1: Philip Hollobone        0      NA       I    PRP    1.0000
#>   2: Philip Hollobone        1      NA   thank    VBP    0.9996
#>   3: Philip Hollobone        2      NA     Mr.    NNP    1.0000
#>   4: Philip Hollobone        3      NA Speaker    NNP    1.0000
#>   5: Philip Hollobone        4      NA     for     IN    1.0000
#>  ---                                                           
#> 448:             <NA>        0      NA      NA    NNP    0.8859
#> 449:             <NA>        0      NA      NA    NNP    0.8859
#> 450:             <NA>        0      NA      NA    NNP    0.8859
#> 451:             <NA>        0      NA      NA    NNP    0.8859
#> 452:             <NA>        0      NA      NA    NNP    0.8859
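
To choose a batch size empirically, as suggested above, you can time a few candidates with base R's system.time(). A minimal sketch reusing the arguments from the call above:

# Compare elapsed wall-clock time across candidate batch sizes.
for (bs in c(5, 10, 20)) {
  elapsed <- system.time(
    get_pos_batch(uk_immigration$text, uk_immigration$speaker,
                  tagger_pos,
                  show.text_id = FALSE,
                  batch_size = bs,
                  verbose = FALSE)
  )["elapsed"]
  cat("batch_size =", bs, "-> elapsed:", elapsed, "seconds\n")
}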