Generic Approach Using Pre-trained NER English Model

library(flaiR)
data("uk_immigration")
uk_immigration <- head(uk_immigration, 10)

Use load_tagger_ner to load the pre-trained NER model. The model is downloaded from Flair's Hugging Face repository, so make sure you have an internet connection the first time you run it. Once downloaded, the model is cached in the .flair directory on your device. As long as the cached files are not manually removed, running the command again will not trigger another download.

tagger_ner <- load_tagger_ner("ner")
#> 2024-09-23 11:41:06,048 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>
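
To see whether anything is already cached before loading a model, a check along these lines may help. It is a minimal sketch that assumes Flair's default cache root, ~/.flair, optionally relocated through the FLAIR_CACHE_ROOT environment variable. The helpers flair_cache_dir and list_cached_files are hypothetical names introduced here and are not part of flaiR.

```r
# Hypothetical helper: resolve the directory where Flair caches downloads.
# Assumption: Flair's default root (~/.flair) unless FLAIR_CACHE_ROOT is set.
flair_cache_dir <- function() {
  root <- Sys.getenv("FLAIR_CACHE_ROOT", unset = "")
  if (nzchar(root)) root else file.path(path.expand("~"), ".flair")
}

# Hypothetical helper: list cached files (character(0) if nothing is cached).
list_cached_files <- function(cache_dir = flair_cache_dir()) {
  if (!dir.exists(cache_dir)) return(character(0))
  list.files(cache_dir, recursive = TRUE)
}

cached <- list_cached_files()
if (length(cached) == 0) {
  message("No cached Flair files found; the first load will download the model.")
} else {
  message(length(cached), " cached file(s) found in ", flair_cache_dir())
}
```

The check only inspects the file system, so it runs without Python or an internet connection.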

Flair NLP runs on the PyTorch framework, so you can use the $to method to move a model to a specific device. flair_device() lets you select the CPU ("cpu"), a CUDA device (such as cuda:0, cuda:1, cuda:2), or an MPS device on Mac (such as mps:0, mps:1, mps:2). For more on accelerated PyTorch training on Mac, see https://developer.apple.com/metal/pytorch/. For more on CUDA, see https://developer.nvidia.com/cuda-zone.

tagger_ner$to(flair_device("mps"))
SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.05, inplace=False)
        (encoder): Embedding(300, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=300, bias=True)
      )
    )
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.05, inplace=False)
        (encoder): Embedding(300, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=300, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4096, out_features=4096, bias=True)
  (rnn): LSTM(4096, 256, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=512, out_features=53, bias=True)
  (loss_function): ViterbiLoss()
  (crf): CRF()
)
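
When the same script has to run on machines with different hardware, the device string can be chosen programmatically. The sketch below is one possible approach, not part of flaiR: pick_device_name is a hypothetical helper, and the preference order (CUDA, then MPS, then CPU) is an assumption.

```r
# Hypothetical helper: choose a device string from availability flags.
# Assumed preference order: CUDA first, then MPS, then CPU.
pick_device_name <- function(cuda_available = FALSE, mps_available = FALSE) {
  if (isTRUE(cuda_available)) {
    "cuda:0"
  } else if (isTRUE(mps_available)) {
    "mps"
  } else {
    "cpu"
  }
}

pick_device_name(cuda_available = FALSE, mps_available = TRUE)
#> [1] "mps"

# Assuming reticulate and a PyTorch build that exposes torch.backends.mps,
# the flags could be queried from Python and passed on to flair_device():
# torch <- reticulate::import("torch")
# device_name <- pick_device_name(torch$cuda$is_available(),
#                                 torch$backends$mps$is_available())
# tagger_ner$to(flair_device(device_name))
```

Keeping the decision in a plain R function makes it easy to test without a GPU.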

For faster computation, leave show.text_id at its default value, FALSE.

results <- get_entities(uk_immigration$text,
                        uk_immigration$speaker, 
                        tagger_ner,
                        show.text_id = FALSE
                        )
print(results)
#>               doc_id                          entity    tag
#>               <char>                          <char> <char>
#>  1: Philip Hollobone                    Conservative    ORG
#>  2: Philip Hollobone Liberal Democrat Front Benchers    ORG
#>  3: Philip Hollobone                    Back Benches   MISC
#>  4: Philip Hollobone                       Kettering    LOC
#>  5: Philip Hollobone                            Sikh   MISC
#>  6: Philip Hollobone                       Kettering    LOC
#>  7: Philip Hollobone                       Kettering    LOC
#>  8: Philip Hollobone                         British   MISC
#>  9: Philip Hollobone                  United Kingdom    LOC
#> 10: Philip Hollobone                          Norman   MISC
#> 11: Philip Hollobone                  United Kingdom    LOC
#> 12:  Stewart Jackson                          Friend    PER
#> 13:  Stewart Jackson        Archbishop of Canterbury    ORG
#> 14:  Stewart Jackson                           Carey    PER
#> 15: Philip Hollobone                          Friend    PER
#> 16: Philip Hollobone                  United Kingdom    LOC
#> 17: Philip Hollobone                              UK    LOC
#> 18: Philip Hollobone                          Europe    LOC
#> 19: Philip Hollobone                           Malta    LOC
#> 20:  Stewart Jackson                         Barking    LOC
#> 21:  Stewart Jackson                        Dagenham    LOC
#> 22:  Stewart Jackson                British National    ORG
#> 23:  Stewart Jackson                    Conservative    ORG
#> 24:  Stewart Jackson                          Friend    PER
#> 25:  Stewart Jackson                      Folkestone    LOC
#> 26:  Stewart Jackson                           Hythe    LOC
#> 27:  Stewart Jackson                          Howard    PER
#> 28: Philip Hollobone                          Friend    PER
#> 29: Philip Hollobone                         Shipley    PER
#> 30: Philip Hollobone                   Philip Davies    PER
#> 31: Philip Hollobone                        Solihull    LOC
#> 32: Philip Hollobone                     Lorely Burt    ORG
#> 33: Philip Hollobone                    Peterborough    LOC
#> 34: Philip Hollobone                         Jackson    PER
#> 35: Philip Hollobone                          Friend    PER
#> 36:    Philip Davies                          Friend    PER
#> 37:    Philip Davies                      Government    ORG
#> 38: Philip Hollobone                       Kettering    LOC
#> 39: Philip Hollobone                      Government    ORG
#> 40: Philip Hollobone                       Kettering    LOC
#> 41: Philip Hollobone                       Kettering    LOC
#> 42: Philip Hollobone               Migrationwatch UK    ORG
#> 43: Philip Hollobone                      Carshalton    LOC
#> 44: Philip Hollobone                      Wallington    LOC
#> 45: Philip Hollobone                       Tom Brake    PER
#> 46: Philip Hollobone                            <NA>   <NA>
#> 47:      Phil Woolas                       Gentleman    PER
#> 48:      Phil Woolas                      Carshalton    LOC
#> 49:      Phil Woolas                      Wallington    LOC
#> 50:      Phil Woolas                       Tom Brake    PER
#>               doc_id                          entity    tag
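
The result has one row per entity, so it can be summarised with ordinary data-frame tools. The sketch below rebuilds a few rows copied from the output above as a plain data frame (results_demo is a name introduced here), drops the rows whose entity is NA (presumably texts in which no entity was detected), and counts entities per speaker and tag.

```r
# A few rows copied from the printed output, rebuilt as a plain data frame
results_demo <- data.frame(
  doc_id = c("Philip Hollobone", "Philip Hollobone", "Philip Hollobone",
             "Phil Woolas", "Phil Woolas"),
  entity = c("Conservative", "Kettering", NA, "Carshalton", "Tom Brake"),
  tag    = c("ORG", "LOC", NA, "LOC", "PER"),
  stringsAsFactors = FALSE
)

# Drop placeholder rows that carry no entity
found <- results_demo[!is.na(results_demo$entity), ]

# Count entities per speaker and tag, keeping only combinations that occur
tag_counts <- as.data.frame(table(doc_id = found$doc_id, tag = found$tag),
                            responseName = "n", stringsAsFactors = FALSE)
tag_counts <- tag_counts[tag_counts$n > 0, ]
print(tag_counts)
```

The same steps should apply to the full results object, since it has the same doc_id, entity, and tag columns.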

Batch Processing

Processing texts one at a time can be inefficient and memory-intensive. Processing all texts at once, on the other hand, can exceed available memory, especially when each document is large. Parsing the documents in smaller batches offers a compromise between the two, improving efficiency while keeping memory use under control.

By default, the batch_size parameter is set to 5. Start with this default, then experiment with other values while monitoring memory usage and processing time to find what works best for your use case. If you have access to a GPU, larger batch sizes may let you take advantage of GPU parallelism, but a batch size that is too large can cause out-of-memory errors. The right choice balances memory constraints, processing efficiency, and the requirements of your entity extraction task.
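
To make the idea concrete, the sketch below shows how a set of documents can be split into consecutive batches. It illustrates the general technique only and is not flaiR's internal implementation; make_batches is a hypothetical helper.

```r
# Hypothetical helper: split document indices into consecutive batches
make_batches <- function(n_docs, batch_size = 5) {
  stopifnot(n_docs >= 1, batch_size >= 1)
  idx <- seq_len(n_docs)
  # Documents 1..batch_size fall into group 1, the next batch_size into group 2, ...
  unname(split(idx, ceiling(idx / batch_size)))
}

batches <- make_batches(n_docs = 10, batch_size = 5)
length(batches)
#> [1] 2
batches[[2]]
#> [1]  6  7  8  9 10
```

With 10 documents and a batch size of 5, this yields two batches. When the document count is not a multiple of the batch size, the last batch is simply shorter.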

batch_process_time <- system.time({
    batch_process_results  <- get_entities_batch(uk_immigration$text,
                                                 uk_immigration$speaker, 
                                                 tagger_ner, 
                                                 show.text_id = FALSE,
                                                 batch_size = 5)
    gc()
})
#> CPU is used.
#> Processing batch 1 out of 2...
#> Processing batch 2 out of 2...
print(batch_process_time)
#>    user  system elapsed 
#>  51.513   1.328  64.105
print(batch_process_results)
#>               doc_id                          entity    tag text_id
#>               <char>                          <char> <char>  <lgcl>
#>  1: Philip Hollobone                    Conservative    ORG      NA
#>  2: Philip Hollobone Liberal Democrat Front Benchers    ORG      NA
#>  3: Philip Hollobone                    Back Benches   MISC      NA
#>  4: Philip Hollobone                       Kettering    LOC      NA
#>  5: Philip Hollobone                            Sikh   MISC      NA
#>  6: Philip Hollobone                       Kettering    LOC      NA
#>  7: Philip Hollobone                       Kettering    LOC      NA
#>  8: Philip Hollobone                         British   MISC      NA
#>  9: Philip Hollobone                  United Kingdom    LOC      NA
#> 10: Philip Hollobone                          Norman   MISC      NA
#> 11: Philip Hollobone                  United Kingdom    LOC      NA
#> 12:  Stewart Jackson                          Friend    PER      NA
#> 13:  Stewart Jackson        Archbishop of Canterbury    ORG      NA
#> 14:  Stewart Jackson                           Carey    PER      NA
#> 15: Philip Hollobone                          Friend    PER      NA
#> 16: Philip Hollobone                  United Kingdom    LOC      NA
#> 17: Philip Hollobone                              UK    LOC      NA
#> 18: Philip Hollobone                          Europe    LOC      NA
#> 19: Philip Hollobone                           Malta    LOC      NA
#> 20:  Stewart Jackson                         Barking    LOC      NA
#> 21:  Stewart Jackson                        Dagenham    LOC      NA
#> 22:  Stewart Jackson                British National    ORG      NA
#> 23:  Stewart Jackson                    Conservative    ORG      NA
#> 24:  Stewart Jackson                          Friend    PER      NA
#> 25:  Stewart Jackson                      Folkestone    LOC      NA
#> 26:  Stewart Jackson                           Hythe    LOC      NA
#> 27:  Stewart Jackson                          Howard    PER      NA
#> 28: Philip Hollobone                          Friend    PER      NA
#> 29: Philip Hollobone                         Shipley    PER      NA
#> 30: Philip Hollobone                   Philip Davies    PER      NA
#> 31: Philip Hollobone                        Solihull    LOC      NA
#> 32: Philip Hollobone                     Lorely Burt    ORG      NA
#> 33: Philip Hollobone                    Peterborough    LOC      NA
#> 34: Philip Hollobone                         Jackson    PER      NA
#> 35: Philip Hollobone                          Friend    PER      NA
#> 36:    Philip Davies                          Friend    PER      NA
#> 37:    Philip Davies                      Government    ORG      NA
#> 38: Philip Hollobone                       Kettering    LOC      NA
#> 39: Philip Hollobone                      Government    ORG      NA
#> 40: Philip Hollobone                       Kettering    LOC      NA
#> 41: Philip Hollobone                       Kettering    LOC      NA
#> 42: Philip Hollobone               Migrationwatch UK    ORG      NA
#> 43: Philip Hollobone                      Carshalton    LOC      NA
#> 44: Philip Hollobone                      Wallington    LOC      NA
#> 45: Philip Hollobone                       Tom Brake    PER      NA
#> 46: Philip Hollobone                            <NA>   <NA>      NA
#> 47:      Phil Woolas                       Gentleman    PER      NA
#> 48:      Phil Woolas                      Carshalton    LOC      NA
#> 49:      Phil Woolas                      Wallington    LOC      NA
#> 50:      Phil Woolas                       Tom Brake    PER      NA
#>               doc_id                          entity    tag text_id
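
To compare batch sizes empirically, the timing pattern above can be wrapped in a small loop. The sketch below is one way to do it: time_batch_sizes is a hypothetical helper, and dummy_run is a stand-in workload so the example runs without a model.

```r
# Hypothetical helper: time a function once per candidate batch size.
# `run` must accept a single argument, the batch size.
time_batch_sizes <- function(run, batch_sizes = c(2, 5, 10)) {
  elapsed <- vapply(batch_sizes, function(bs) {
    unname(system.time(run(bs))["elapsed"])
  }, numeric(1))
  data.frame(batch_size = batch_sizes, elapsed_sec = elapsed)
}

# Stand-in workload so the sketch runs without a tagger
dummy_run <- function(bs) Sys.sleep(0.01)
timings <- time_batch_sizes(dummy_run)
print(timings)

# With flaiR loaded, `run` could wrap the call shown above, for example:
# time_batch_sizes(function(bs) {
#   get_entities_batch(uk_immigration$text, uk_immigration$speaker,
#                      tagger_ner, show.text_id = FALSE, batch_size = bs)
# })
```

A single run per batch size is noisy, so repeating each measurement a few times gives a more reliable comparison.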