Tagging Named Entities with Flair Standard Models
David (Yen-Chieh) Liao
Postdoc at Text & Policy Research Group and SPIRe in UCD
Source: vignettes/get_entities.Rmd
Generic Approach Using Pre-trained NER English Model
Use load_tagger_ner() to load the pre-trained NER model. The model is
downloaded from Flair's Hugging Face repository, so make sure you have an
internet connection. Once downloaded, the model is cached in the .flair
directory on your device, so running the command again will not trigger
another download unless the cached files have been manually removed.
library(flaiR)
tagger_ner <- load_tagger_ner("ner")
#> 2024-09-23 11:41:06,048 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>
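To see whether a model is already cached before relying on an offline run, the cache directory can be inspected from R. The sketch below assumes the cache sits at ~/.flair in your home directory (Flair's usual default); the variable names are illustrative and not part of flaiR.

```r
# Assumption: the Flair cache lives in ~/.flair; adjust the path if yours differs.
cache_dir <- file.path(path.expand("~"), ".flair")

if (dir.exists(cache_dir)) {
  # List every cached file together with its size in megabytes.
  files <- list.files(cache_dir, recursive = TRUE, full.names = TRUE)
  cached <- data.frame(
    file = basename(files),
    size_mb = round(file.size(files) / 1024^2, 1)
  )
  print(cached)
} else {
  message("No cache found at ", cache_dir,
          "; the model will be downloaded on first use.")
}
```

If the directory exists but is empty, the data frame simply has zero rows, and the next call to load_tagger_ner() would be expected to download the model again.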
Flair NLP is built on the PyTorch framework, so you can use the $to()
method to set the device on which the model runs. The flair_device()
function lets you select the CPU ("cpu"), CUDA devices (such as cuda:0,
cuda:1, cuda:2), or specific MPS devices on Mac (such as mps:0, mps:1,
mps:2). For information on accelerated PyTorch training on Mac, see
https://developer.apple.com/metal/pytorch/. For more about CUDA, see
https://developer.nvidia.com/cuda-zone.
# Move the tagger loaded above to the chosen device
tagger_ner$to(flair_device("mps"))
SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.05, inplace=False)
        (encoder): Embedding(300, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=300, bias=True)
      )
    )
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.05, inplace=False)
        (encoder): Embedding(300, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=300, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4096, out_features=4096, bias=True)
  (rnn): LSTM(4096, 256, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=512, out_features=53, bias=True)
  (loss_function): ViterbiLoss()
  (crf): CRF()
)
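The device choice described above can be made explicit with a small fallback rule. The helper below is hypothetical (choose_device is not a flaiR function) and only turns two availability flags into one of the device strings mentioned earlier; how you obtain those flags is left open.

```r
# Hypothetical helper, not part of flaiR: prefer CUDA, then MPS, then CPU.
choose_device <- function(cuda_available = FALSE, mps_available = FALSE) {
  if (isTRUE(cuda_available)) {
    "cuda:0"
  } else if (isTRUE(mps_available)) {
    "mps"
  } else {
    "cpu"
  }
}

# Example: a Mac with MPS support but no CUDA device.
device_name <- choose_device(cuda_available = FALSE, mps_available = TRUE)
print(device_name)
```

The resulting string could then be passed on in the same way as the literal above, for example flair_device(device_name). Falling back to "cpu" keeps the code usable on machines without an accelerator.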
For faster computation, keep show.text_id at its default value of FALSE.
results <- get_entities(uk_immigration$text,
                        uk_immigration$speaker,
                        tagger_ner,
                        show.text_id = FALSE
                        )
print(results)
#>               doc_id                          entity    tag
#>               <char>                          <char> <char>
#>  1: Philip Hollobone                    Conservative    ORG
#>  2: Philip Hollobone Liberal Democrat Front Benchers    ORG
#>  3: Philip Hollobone                    Back Benches   MISC
#>  4: Philip Hollobone                       Kettering    LOC
#>  5: Philip Hollobone                            Sikh   MISC
#>  6: Philip Hollobone                       Kettering    LOC
#>  7: Philip Hollobone                       Kettering    LOC
#>  8: Philip Hollobone                         British   MISC
#>  9: Philip Hollobone                  United Kingdom    LOC
#> 10: Philip Hollobone                          Norman   MISC
#> 11: Philip Hollobone                  United Kingdom    LOC
#> 12:  Stewart Jackson                          Friend    PER
#> 13:  Stewart Jackson        Archbishop of Canterbury    ORG
#> 14:  Stewart Jackson                           Carey    PER
#> 15: Philip Hollobone                          Friend    PER
#> 16: Philip Hollobone                  United Kingdom    LOC
#> 17: Philip Hollobone                              UK    LOC
#> 18: Philip Hollobone                          Europe    LOC
#> 19: Philip Hollobone                           Malta    LOC
#> 20:  Stewart Jackson                         Barking    LOC
#> 21:  Stewart Jackson                        Dagenham    LOC
#> 22:  Stewart Jackson                British National    ORG
#> 23:  Stewart Jackson                    Conservative    ORG
#> 24:  Stewart Jackson                          Friend    PER
#> 25:  Stewart Jackson                      Folkestone    LOC
#> 26:  Stewart Jackson                           Hythe    LOC
#> 27:  Stewart Jackson                          Howard    PER
#> 28: Philip Hollobone                          Friend    PER
#> 29: Philip Hollobone                         Shipley    PER
#> 30: Philip Hollobone                   Philip Davies    PER
#> 31: Philip Hollobone                        Solihull    LOC
#> 32: Philip Hollobone                     Lorely Burt    ORG
#> 33: Philip Hollobone                    Peterborough    LOC
#> 34: Philip Hollobone                         Jackson    PER
#> 35: Philip Hollobone                          Friend    PER
#> 36:    Philip Davies                          Friend    PER
#> 37:    Philip Davies                      Government    ORG
#> 38: Philip Hollobone                       Kettering    LOC
#> 39: Philip Hollobone                      Government    ORG
#> 40: Philip Hollobone                       Kettering    LOC
#> 41: Philip Hollobone                       Kettering    LOC
#> 42: Philip Hollobone               Migrationwatch UK    ORG
#> 43: Philip Hollobone                      Carshalton    LOC
#> 44: Philip Hollobone                      Wallington    LOC
#> 45: Philip Hollobone                       Tom Brake    PER
#> 46: Philip Hollobone                            <NA>   <NA>
#> 47:      Phil Woolas                       Gentleman    PER
#> 48:      Phil Woolas                      Carshalton    LOC
#> 49:      Phil Woolas                      Wallington    LOC
#> 50:      Phil Woolas                       Tom Brake    PER
#>               doc_id                          entity    tag
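A table like this is usually summarised before further analysis. The sketch below rebuilds a handful of rows from the output as a plain data frame (so it runs without the tagger) and counts tags per speaker. It treats the <NA> row as a text in which no entity was tagged, which is an interpretation of the output rather than documented behaviour.

```r
# A few rows copied from the printed output, rebuilt as a plain data frame.
entities <- data.frame(
  doc_id = c("Philip Hollobone", "Philip Hollobone", "Philip Hollobone",
             "Stewart Jackson", "Stewart Jackson", "Philip Hollobone"),
  entity = c("Conservative", "Kettering", "Sikh", "Friend", "Carey", NA),
  tag    = c("ORG", "LOC", "MISC", "PER", "PER", NA)
)

# Drop rows without an entity, then cross-tabulate speaker by tag.
found <- entities[!is.na(entities$entity), ]
tag_counts <- table(found$doc_id, found$tag)
print(tag_counts)
```

The same two steps should carry over to the full results object, since it exposes the same doc_id, entity, and tag columns.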
Batch Processing
Processing texts individually can be both inefficient and memory-intensive. Processing all the texts at once, on the other hand, can exceed available memory, especially when each document in the dataset is large. Parsing the documents in smaller batches offers a compromise between the two: it improves efficiency while keeping memory use manageable.
By default, batch_size is set to 5. Start with this default and then experiment with other values, monitoring memory usage and processing time, to find the one that works best for your use case. If you have access to a GPU, larger batch sizes can take advantage of GPU parallelism, but a batch size that is too large can cause out-of-memory errors. The right batch size balances memory constraints, processing efficiency, and the requirements of your entity extraction task.
batch_process_time <- system.time({
  batch_process_results <- get_entities_batch(uk_immigration$text,
                                              uk_immigration$speaker,
                                              tagger_ner,
                                              show.text_id = FALSE,
                                              batch_size = 5)
  gc()
})
#> CPU is used.
#> Processing batch 1 out of 2...
#> Processing batch 2 out of 2...
print(batch_process_time)
#>    user  system elapsed
#>  51.513   1.328  64.105
print(batch_process_results)
#>               doc_id                          entity    tag text_id
#>               <char>                          <char> <char>  <lgcl>
#>  1: Philip Hollobone                    Conservative    ORG      NA
#>  2: Philip Hollobone Liberal Democrat Front Benchers    ORG      NA
#>  3: Philip Hollobone                    Back Benches   MISC      NA
#>  4: Philip Hollobone                       Kettering    LOC      NA
#>  5: Philip Hollobone                            Sikh   MISC      NA
#>  6: Philip Hollobone                       Kettering    LOC      NA
#>  7: Philip Hollobone                       Kettering    LOC      NA
#>  8: Philip Hollobone                         British   MISC      NA
#>  9: Philip Hollobone                  United Kingdom    LOC      NA
#> 10: Philip Hollobone                          Norman   MISC      NA
#> 11: Philip Hollobone                  United Kingdom    LOC      NA
#> 12:  Stewart Jackson                          Friend    PER      NA
#> 13:  Stewart Jackson        Archbishop of Canterbury    ORG      NA
#> 14:  Stewart Jackson                           Carey    PER      NA
#> 15: Philip Hollobone                          Friend    PER      NA
#> 16: Philip Hollobone                  United Kingdom    LOC      NA
#> 17: Philip Hollobone                              UK    LOC      NA
#> 18: Philip Hollobone                          Europe    LOC      NA
#> 19: Philip Hollobone                           Malta    LOC      NA
#> 20:  Stewart Jackson                         Barking    LOC      NA
#> 21:  Stewart Jackson                        Dagenham    LOC      NA
#> 22:  Stewart Jackson                British National    ORG      NA
#> 23:  Stewart Jackson                    Conservative    ORG      NA
#> 24:  Stewart Jackson                          Friend    PER      NA
#> 25:  Stewart Jackson                      Folkestone    LOC      NA
#> 26:  Stewart Jackson                           Hythe    LOC      NA
#> 27:  Stewart Jackson                          Howard    PER      NA
#> 28: Philip Hollobone                          Friend    PER      NA
#> 29: Philip Hollobone                         Shipley    PER      NA
#> 30: Philip Hollobone                   Philip Davies    PER      NA
#> 31: Philip Hollobone                        Solihull    LOC      NA
#> 32: Philip Hollobone                     Lorely Burt    ORG      NA
#> 33: Philip Hollobone                    Peterborough    LOC      NA
#> 34: Philip Hollobone                         Jackson    PER      NA
#> 35: Philip Hollobone                          Friend    PER      NA
#> 36:    Philip Davies                          Friend    PER      NA
#> 37:    Philip Davies                      Government    ORG      NA
#> 38: Philip Hollobone                       Kettering    LOC      NA
#> 39: Philip Hollobone                      Government    ORG      NA
#> 40: Philip Hollobone                       Kettering    LOC      NA
#> 41: Philip Hollobone                       Kettering    LOC      NA
#> 42: Philip Hollobone               Migrationwatch UK    ORG      NA
#> 43: Philip Hollobone                      Carshalton    LOC      NA
#> 44: Philip Hollobone                      Wallington    LOC      NA
#> 45: Philip Hollobone                       Tom Brake    PER      NA
#> 46: Philip Hollobone                            <NA>   <NA>      NA
#> 47:      Phil Woolas                       Gentleman    PER      NA
#> 48:      Phil Woolas                      Carshalton    LOC      NA
#> 49:      Phil Woolas                      Wallington    LOC      NA
#> 50:      Phil Woolas                       Tom Brake    PER      NA
#>               doc_id                          entity    tag text_id
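The way batch_size divides a set of texts can be pictured with base R alone. The sketch below is not flaiR's internal implementation; it uses ten stand-in strings (an assumed input size) and splits them into consecutive groups of at most batch_size items, printing a progress message per group in the style of the log above.

```r
# Ten stand-in texts; replace with your own character vector.
texts <- paste("document", 1:10)
batch_size <- 5

# Assign each text a batch number: 1,1,1,1,1,2,2,2,2,2 for ten texts.
batch_id <- ceiling(seq_along(texts) / batch_size)
batches <- split(texts, batch_id)

for (i in seq_along(batches)) {
  message("Processing batch ", i, " out of ", length(batches), "...")
  # A tagger would be applied to batches[[i]] here.
}
```

With a length that is not a multiple of batch_size, the last group is simply shorter, so no text is dropped.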