Part-of-Speech Tagging with Flair Standard Models
David (Yen-Chieh) Liao
Postdoc at Text & Policy Research Group and SPIRe in UCD
Source: vignettes/get_pos.Rmd
Generic Approach to Part-of-Speech Tagging
Download the pos part-of-speech tagging model from FlairNLP on Hugging Face.
tagger_pos <- load_tagger_pos("pos")
#> 2024-09-23 11:43:30,060 SequenceTagger predicts: Dictionary with 53 tags: <unk>, O, UH, ,, VBD, PRP, VB, PRP$, NN, RB, ., DT, JJ, VBP, VBG, IN, CD, NNS, NNP, WRB, VBZ, WDT, CC, TO, MD, VBN, WP, :, RP, EX, JJR, FW, XX, HYPH, POS, RBR, JJS, PDT, NNPS, RBS, AFX, WP$, -LRB-, -RRB-, ``, '', LS, $, SYM, ADD
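To inspect the tag inventory yourself, you can reach into the underlying Python object. A minimal sketch, assuming the wrapped flair SequenceTagger exposes label_dictionary as in the Python API:
# List the tags the model can predict (the same 53 tags as logged above).
tagger_pos$label_dictionary$get_items()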
Flair NLP operates on the PyTorch framework. As such, we can use the $to method to set the device for the Flair Python library. The flair_device() helper lets you select whether to use the CPU ("cpu"), CUDA devices (such as cuda:0, cuda:1, cuda:2), or MPS devices on Mac (such as mps:0, mps:1, mps:2). For information on accelerated PyTorch training on Mac, see https://developer.apple.com/metal/pytorch/. For more about CUDA, visit https://developer.nvidia.com/cuda-zone.
tagger_pos$to(flair_device("mps"))
SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.05, inplace=False)
        (encoder): Embedding(300, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=300, bias=True)
      )
    )
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.05, inplace=False)
        (encoder): Embedding(300, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=300, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4096, out_features=4096, bias=True)
  (rnn): LSTM(4096, 256, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=512, out_features=53, bias=True)
  (loss_function): ViterbiLoss()
  (crf): CRF()
)
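The same pattern applies to other devices. A short sketch, where the second line assumes a CUDA-capable machine; the CPU call is always safe:
# Move the tagger back to the CPU, or onto the first CUDA GPU.
tagger_pos$to(flair_device("cpu"))
tagger_pos$to(flair_device("cuda:0"))  # assumes a CUDA device is available
With the device configured, tag the sample texts: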
results <- get_pos(uk_immigration$text,
                   uk_immigration$speaker,
                   tagger_pos,
                   show.text_id = FALSE,
                   gc.active = FALSE)
print(results)
#> doc_id token_id text_id token tag precision
#> <char> <num> <lgcl> <char> <char> <num>
#> 1: Philip Hollobone 0 NA I PRP 1.0000
#> 2: Philip Hollobone 1 NA thank VBP 0.9996
#> 3: Philip Hollobone 2 NA Mr. NNP 1.0000
#> 4: Philip Hollobone 3 NA Speaker NNP 1.0000
#> 5: Philip Hollobone 4 NA for IN 1.0000
#> ---
#> 440: Stewart Jackson 66 NA parties NNS 1.0000
#> 441: Stewart Jackson 67 NA in IN 1.0000
#> 442: Stewart Jackson 68 NA this DT 1.0000
#> 443: Stewart Jackson 69 NA country NN 1.0000
#> 444: Stewart Jackson 70 NA ? . 0.9949
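The printed column types above suggest the result is a data.table, so standard data.table operations apply. A small follow-up sketch, assuming the data.table package is installed; the 0.99 threshold is an arbitrary illustration:
library(data.table)
# Count how often each POS tag occurs per speaker.
results[, .N, by = .(doc_id, tag)][order(doc_id, -N)]
# Keep only tokens tagged with high confidence.
results[precision >= 0.99]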
Batch Processing
By default, the batch_size parameter is set to 5. Start with this default and then experiment with different batch sizes to find the one that works best for your use case, monitoring memory usage and processing time to guide the decision (see the timing sketch after the output below). If you have access to a GPU, try larger batch sizes to take advantage of GPU parallelism, but be careful not to set the batch size so large that it causes out-of-memory errors. Ultimately, the choice of batch size should balance memory constraints, processing efficiency, and the specific requirements of your part-of-speech tagging task.
batch_process_results <- get_pos_batch(uk_immigration$text,
                                       uk_immigration$speaker,
                                       tagger_pos,
                                       show.text_id = FALSE,
                                       batch_size = 10,
                                       verbose = TRUE)
#> CPU is used.
#> Processing batch starting at index: 1
print(batch_process_results)
#> doc_id token_id text_id token tag precision
#> <char> <num> <lgcl> <char> <char> <num>
#> 1: Philip Hollobone 0 NA I PRP 1.0000
#> 2: Philip Hollobone 1 NA thank VBP 0.9996
#> 3: Philip Hollobone 2 NA Mr. NNP 1.0000
#> 4: Philip Hollobone 3 NA Speaker NNP 1.0000
#> 5: Philip Hollobone 4 NA for IN 1.0000
#> ---
#> 448: <NA> 0 NA NA NNP 0.8859
#> 449: <NA> 0 NA NA NNP 0.8859
#> 450: <NA> 0 NA NA NNP 0.8859
#> 451: <NA> 0 NA NA NNP 0.8859
#> 452: <NA> 0 NA NA NNP 0.8859
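The final rows of the batch output have missing doc_id and token values; if such rows are not meaningful for your analysis, drop them before proceeding. A minimal clean-up sketch, assuming data.table semantics:
# Remove rows whose doc_id is missing (verify these rows are safe to
# discard in your application first).
batch_process_results <- batch_process_results[!is.na(doc_id)]
To choose a batch size empirically, you can time a few candidate values. A rough sketch using base R's system.time(); the candidate sizes are arbitrary:
# Compare elapsed time across a few batch sizes.
for (bs in c(5, 10, 25)) {
  elapsed <- system.time(
    get_pos_batch(uk_immigration$text, uk_immigration$speaker,
                  tagger_pos, batch_size = bs, verbose = FALSE)
  )[["elapsed"]]
  cat("batch_size =", bs, "-> elapsed:", elapsed, "seconds\n")
}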