Extract Named Entities from a Batch of Texts — get_entities_batch

This function processes texts in batches, extracts named entities with a Flair NER tagger, and returns them as a data.table.

Usage

get_entities_batch(
  texts,
  doc_ids,
  tagger = NULL,
  language = "en",
  show.text_id = FALSE,
  gc.active = FALSE,
  batch_size = 5,
  device = "cpu",
  verbose = TRUE
)

Arguments

texts

A character vector of texts to process.

doc_ids

A vector of document IDs corresponding to each text.

tagger

A pre-loaded Flair NER tagger. If NULL (the default), a tagger is loaded based on the provided language.

language

A character string specifying the language of the texts. Default is "en" (English).

show.text_id

Logical, whether to include the text ID in the output. Default is FALSE.

gc.active

Logical, whether to activate garbage collection after processing each batch. Default is FALSE.

batch_size

An integer specifying the size of each batch. Default is 5.
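
To make the effect of batch_size concrete, the following base R sketch shows one way six texts could be grouped into batches of at most five. It only illustrates the idea of batching; it is not taken from the package source, and the variable names batch_index and batches are assumptions introduced here.

```r
# Illustrative only: group six texts into batches of at most `batch_size`.
# This is not the package's internal code.
texts <- c("text 1", "text 2", "text 3", "text 4", "text 5", "text 6")
doc_ids <- paste0("doc", seq_along(texts))
batch_size <- 5

# Each text needs exactly one document ID.
stopifnot(length(texts) == length(doc_ids))

# Map consecutive positions to batch numbers 1, 2, ...
batch_index <- ceiling(seq_along(texts) / batch_size)
batches <- split(seq_along(texts), batch_index)

# Report which documents fall into which batch.
for (i in seq_along(batches)) {
  idx <- batches[[i]]
  message(sprintf("Batch %d: %s", i, paste(doc_ids[idx], collapse = ", ")))
}
```

With the default batch_size of 5, six texts fall into two batches: the first holds five documents and the second holds the remaining one.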

device

A character string specifying the computation device. It can be either "cpu" or a string representation of a GPU device number; for instance, "0" corresponds to the first GPU. If a GPU device number is provided, the function attempts to use that GPU. The default is "cpu".

"cuda" or "cuda:0" ("mps" or "mps:0" on Mac M1/M2): Refers to the first GPU in the system. If there is only one GPU, specifying "cuda" or "cuda:0" allocates computations to this GPU.
"cuda:1" ("mps:1"): Refers to the second GPU in the system, allowing allocation of specific computations to this GPU.
"cuda:2" ("mps:2"): Refers to the third GPU in the system, and so on for systems with more GPUs.
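
The accepted device strings described above can be summarized with a small check. The helper below, is_valid_device, is hypothetical and not part of the package; it is a minimal sketch that assumes the only valid forms are "cpu", a bare device number, and "cuda" or "mps" with an optional index. Whether get_entities_batch validates its device argument this way is not stated here.

```r
# Hypothetical helper (not part of the package): check whether a device
# string matches one of the forms described in this section.
is_valid_device <- function(device) {
  is.character(device) && length(device) == 1 &&
    grepl("^(cpu|[0-9]+|(cuda|mps)(:[0-9]+)?)$", device)
}

# A few strings to try against the pattern.
candidates <- c("cpu", "0", "cuda", "cuda:1", "mps:0", "gpu", "cuda:")
checks <- vapply(candidates, is_valid_device, logical(1))
print(checks)
```

A check like this can catch a mistyped device string before a model is loaded, which is usually the slow step.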

verbose

A logical value. If TRUE, the function prints batch processing progress updates. Default is TRUE.

Value

A data.table containing the extracted entities, their corresponding tags, and document IDs.

Examples

if (FALSE) { # \dontrun{
library(reticulate)
library(flaiR)

texts <- c("UCD is one of the best universities in Ireland.",
           "UCD has a good campus but is very far from
           my apartment in Dublin.",
           "Essex is famous for social science research.",
           "Essex is not in the Russell Group, but it is
           famous for political science research.",
           "TCD is the oldest university in Ireland.",
           "TCD is similar to Oxford.")
doc_ids <- c("doc1", "doc2", "doc3", "doc4", "doc5", "doc6")
# Load NER ("ner") model
tagger_ner <- load_tagger_ner('ner')
results <- get_entities_batch(texts, doc_ids, tagger_ner)
print(results)
} # }