Batch Processing of Part-of-Speech Tagging — get_pos_batch

This function tags the given texts in batches and returns a data table of part-of-speech (POS) tags and related data.

Usage

get_pos_batch(
  texts,
  doc_ids,
  tagger = NULL,
  language = NULL,
  show.text_id = FALSE,
  gc.active = FALSE,
  batch_size = 5,
  device = "cpu",
  verbose = TRUE
)

Arguments

texts

A character vector containing texts to be processed.

doc_ids

A character vector of document identifiers, one for each element of texts.

tagger

A tagger object, such as one returned by load_tagger_pos() (default is NULL).

language

The language of the texts (default is NULL).

show.text_id

A logical value. If TRUE, the resulting data table includes the actual text from which each token was taken. This is useful for verification and traceability but may increase the size of the output. Default is FALSE.

gc.active

A logical value. If TRUE, runs the garbage collector after all texts have been processed. This can free unused memory, which helps when processing a large number of texts. Default is FALSE.

batch_size

An integer specifying the size of each batch. Default is 5.

device

A character string specifying the computation device. Default is "cpu".

"cuda" or "cuda:0" ("mps" or "mps:0" on Mac M1/M2): Refers to the first GPU in the system. If there is only one GPU, specifying "cuda" or "cuda:0" allocates computations to this GPU.
"cuda:1" ("mps:1"): Refers to the second GPU in the system, allowing allocation of specific computations to this GPU.
"cuda:2" ("mps:2"): Refers to the third GPU in the system, and so on for systems with more GPUs.

verbose

A logical value. If TRUE, the function prints batch processing progress updates. Default is TRUE.
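
To make the batch_size argument more concrete, the sketch below shows one way a set of texts and their document ids could be divided into consecutive batches in base R. The helper make_batches() is hypothetical and written only for illustration; it is not part of flaiR and is not necessarily how get_pos_batch() splits its input internally.

```r
# Hypothetical helper (not a flaiR function): split inputs into consecutive
# batches of at most `batch_size` elements each.
make_batches <- function(texts, doc_ids, batch_size = 5) {
  stopifnot(length(texts) == length(doc_ids), batch_size >= 1)
  # Batch number for every element: 1, 1, ..., 2, 2, ...
  batch_index <- ceiling(seq_along(texts) / batch_size)
  # Pair each batch of texts with the matching batch of document ids
  Map(
    function(t, d) list(texts = t, doc_ids = d),
    split(texts, batch_index),
    split(doc_ids, batch_index)
  )
}

texts <- c("First text.", "Second text.", "Third text.")
doc_ids <- c("doc1", "doc2", "doc3")

batches <- make_batches(texts, doc_ids, batch_size = 2)
length(batches)        # number of batches
batches[[2]]$doc_ids   # ids in the final, smaller batch
```

With three texts and batch_size = 2, the last batch holds a single text. A larger batch_size means fewer batches, typically at the cost of more memory per batch.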

Value

A data.table containing the following columns:

doc_id: The document identifier corresponding to each text.
token_id: The token number in the original text, indicating the position of the token.
text_id: The original input text passed to the function (included only if show.text_id is TRUE).
token: The individual word or token from the text that was POS tagged.
tag: The part-of-speech tag assigned to the token by the Flair library.
precision: A confidence score (numeric) for the assigned POS tag.
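
As an illustration of how a result with these columns might be used, the sketch below filters rows by confidence score and counts tags per document. The rows in pos_result are invented placeholders that only mimic the documented column layout; they are not real tagger output, and the 0.9 threshold is an arbitrary choice. A plain data.frame is used so that the example runs without flaiR or data.table; the same base R indexing should also work on the returned data.table.

```r
# Invented rows that mimic the documented columns (not real tagger output).
pos_result <- data.frame(
  doc_id    = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  token_id  = c(0, 1, 2, 0, 1),
  token     = c("UCD", "is", "one", "Essex", "is"),
  tag       = c("NNP", "VBZ", "CD", "NNP", "VBZ"),
  precision = c(0.99, 0.98, 0.71, 0.97, 0.99),
  stringsAsFactors = FALSE
)

# Keep only tokens whose tag confidence meets an arbitrary threshold
confident <- pos_result[pos_result$precision >= 0.9, ]

# Cross-tabulate the remaining tags by document
tag_counts <- table(confident$doc_id, confident$tag)
tag_counts
```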

Examples

if (FALSE) { # \dontrun{
library(reticulate)
library(flaiR)
tagger_pos_fast <- load_tagger_pos('pos-fast')
texts <- c("UCD is one of the best universities in Ireland.",
           "Essex is not in the Russell Group, but it is famous for political science research.",
           "TCD is the oldest university in Ireland.")
doc_ids <- c("doc1", "doc2", "doc3")

# Using the batch_size parameter
get_pos_batch(texts, doc_ids, tagger_pos_fast, batch_size = 2)
} # }