Returns a data.table of part-of-speech (POS) tags and related data for the given texts, processing the texts in batches.
Usage
get_pos_batch(
texts,
doc_ids,
tagger = NULL,
language = NULL,
show.text_id = FALSE,
gc.active = FALSE,
batch_size = 5,
device = "cpu",
verbose = TRUE
)
Arguments
- texts
A character vector containing texts to be processed.
- doc_ids
A character vector containing document ids.
- tagger
A POS tagger object, such as one returned by load_tagger_pos() (default is NULL).
- language
The language of the texts (default is NULL).
- show.text_id
A logical value. If TRUE, the resulting data table includes the actual text from which each token was extracted. This is useful for verification and traceability but may increase the size of the output. Default is FALSE.
- gc.active
A logical value. If TRUE, runs the garbage collector after processing all texts. This can help in freeing up memory by releasing unused memory space, especially when processing a large number of texts. Default is FALSE.
- batch_size
An integer specifying the size of each batch. Default is 5.
- device
A character string specifying the computation device. Default is "cpu". GPU devices are specified as follows:
"cuda" or "cuda:0" ("mps" or "mps:0" on Mac M1/M2): Refers to the first GPU in the system. If there is only one GPU, specifying "cuda" or "cuda:0" allocates computations to this GPU.
"cuda:1" ("mps:1"): Refers to the second GPU in the system, allowing allocation of specific computations to this GPU.
"cuda:2" ("mps:2"): Refers to the third GPU in the system, and so on for systems with more GPUs.
- verbose
A logical value. If TRUE, the function prints batch processing progress updates. Default is TRUE.
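To make the effect of batch_size concrete, the following base R sketch splits a character vector into consecutive batches. The helper make_batches() is hypothetical and not part of the package; it only illustrates one plausible way texts could be grouped, and the actual internal batching may differ.

```r
# Hypothetical helper (not part of the package): split a vector into
# consecutive batches of at most `batch_size` elements.
make_batches <- function(x, batch_size = 5) {
  stopifnot(batch_size >= 1)
  if (length(x) == 0) return(list())
  # Elements 1..batch_size get group 1, the next batch_size get group 2, etc.
  unname(split(x, ceiling(seq_along(x) / batch_size)))
}

texts <- c("text one", "text two", "text three")
batches <- make_batches(texts, batch_size = 2)
length(batches)  # 2 batches: the first holds two texts, the second holds one
```

Under this scheme, a larger batch_size means fewer batches, each holding more texts at once, which generally trades memory for throughput.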
Value
A data.table containing the following columns:
doc_id
The document identifier corresponding to each text.
token_id
The token number in the original text, indicating the position of the token.
text_id
The actual text input passed to the function (if show.text_id is TRUE).
token
The individual word or token from the text that was POS tagged.
tag
The part-of-speech tag assigned to the token by the Flair library.
precision
A confidence score (numeric) for the assigned POS tag.
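As an illustration of how the returned columns might be used, the sketch below builds a mock table with the documented column names and summarizes it with base R. All values are invented for illustration (including the tags, the scores, and the assumption that token_id starts at 0); they are not actual output of get_pos_batch().

```r
# Mock result with the documented columns. Every value here is made up
# for illustration; real output comes from get_pos_batch().
res <- data.frame(
  doc_id    = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  token_id  = c(0, 1, 2, 0, 1),  # starting index is an assumption
  token     = c("UCD", "is", "one", "Essex", "is"),
  tag       = c("NNP", "VBZ", "CD", "NNP", "VBZ"),
  precision = c(0.99, 0.98, 0.97, 0.99, 0.98),
  stringsAsFactors = FALSE
)

# Count how often each tag occurs in each document.
tag_counts <- table(res$doc_id, res$tag)

# Keep only tokens whose confidence score falls below a chosen threshold.
low_conf <- res[res$precision < 0.975, ]
```

The same operations apply to a data.table, since a data.table is also a data.frame.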
Examples
if (FALSE) { # \dontrun{
library(reticulate)
library(flaiR)
tagger_pos_fast <- load_tagger_pos('pos-fast')
texts <- c("UCD is one of the best universities in Ireland.",
"Essex is not in the Russell Group, but it is famous for political science research.",
"TCD is the oldest university in Ireland.")
doc_ids <- c("doc1", "doc2", "doc3")
# Using the batch_size parameter
get_pos_batch(texts, doc_ids, tagger_pos_fast, batch_size = 2)
} # }
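Because each element of doc_ids labels the text at the same position, it can be worth checking the two vectors before calling the function. The helper check_inputs() below is a hypothetical sketch, not part of the package, and the function itself may or may not perform similar checks.

```r
# Hypothetical pre-flight check (not part of the package): texts must be a
# character vector, and there must be exactly one id per text.
check_inputs <- function(texts, doc_ids) {
  stopifnot(is.character(texts))
  stopifnot(length(texts) == length(doc_ids))
  if (anyDuplicated(doc_ids) > 0) {
    warning("doc_ids contains duplicated values")
  }
  invisible(TRUE)
}

texts <- c("TCD is the oldest university in Ireland.",
           "UCD is one of the best universities in Ireland.")
doc_ids <- c("doc1", "doc2")
ok <- check_inputs(texts, doc_ids)
```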