Extract Named Entities from Texts with Batch Processing

This function extracts named entities from a set of texts using the Flair NLP library, processing the texts in batches. It works with both standard NER and OntoNotes taggers, and offers a configurable batch size and optional GPU acceleration.

Usage

get_entities(
  texts,
  doc_ids = NULL,
  tagger,
  show.text_id = FALSE,
  gc.active = FALSE,
  batch_size = 5,
  device = "cpu",
  verbose = FALSE
)

Arguments

texts

A character vector containing the texts to process.

doc_ids

A character or numeric vector of document IDs, one for each element of texts. Default is NULL.

tagger

A Flair tagger object for named entity recognition. Required; there is no default. Create one with load_tagger_ner(), choosing the model to load, for example:

Standard NER: tagger_ner <- load_tagger_ner('ner')
OntoNotes: tagger_ner <- load_tagger_ner('flair/ner-english-ontonotes')
Large model: tagger_ner <- load_tagger_ner('flair/ner-english-large')

show.text_id

A logical value. If TRUE, includes the actual text from which the entity was extracted. Default is FALSE.

gc.active

A logical value. If TRUE, runs the garbage collector after processing texts. Default is FALSE.

batch_size

An integer specifying the size of each batch. Set to 1 for single-text processing. Default is 5.
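The effect of batch_size can be pictured with a short base R sketch. This is not flaiR's internal code: the helper name split_into_batches is hypothetical, and it only illustrates how a character vector might be grouped into consecutive batches of at most batch_size texts.

```r
# Hypothetical helper (not part of flaiR): group texts into consecutive
# batches of at most `batch_size` elements.
split_into_batches <- function(texts, batch_size = 5) {
  stopifnot(is.character(texts), batch_size >= 1)
  if (length(texts) == 0) return(list())
  # Batch index for each text: 1, 1, ..., 2, 2, ...
  batch_index <- ceiling(seq_along(texts) / batch_size)
  unname(split(texts, batch_index))
}

texts <- paste("Text number", 1:12)
batches <- split_into_batches(texts, batch_size = 5)
lengths(batches)  # batch sizes: 5 5 2
```

With batch_size = 1 the same helper yields one text per batch, which corresponds to the single-text processing mentioned above.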

device

A character string specifying the computation device ("cpu", "cuda:0", "cuda:1", etc.). Default is "cpu". Note: MPS (Apple Silicon Macs such as M1/M2) is not yet fully supported; requesting it falls back to CPU.
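The device rules described here can be sketched as a small validation step to run before calling get_entities(). The helper resolve_device is hypothetical and not part of flaiR; it only mirrors the documented behaviour (CPU by default, indexed CUDA devices, MPS falling back to CPU). Accepting a plain "cuda" without an index is an assumption based on what PyTorch allows.

```r
# Hypothetical helper (not part of flaiR): validate a device string and
# fall back to "cpu" for MPS, mirroring the note in the argument description.
resolve_device <- function(device = "cpu") {
  stopifnot(is.character(device), length(device) == 1)
  if (identical(device, "mps")) {
    warning("MPS is not fully supported; falling back to CPU.")
    return("cpu")
  }
  # Accept "cpu", "cuda", or "cuda:<index>" such as "cuda:0"
  if (!grepl("^(cpu|cuda(:[0-9]+)?)$", device)) {
    stop("Unknown device: ", device)
  }
  device
}

device_gpu <- resolve_device("cuda:0")                 # "cuda:0"
device_mps <- suppressWarnings(resolve_device("mps"))  # "cpu"
```

The resolved string can then be passed as the device argument.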

verbose

A logical value. If TRUE, prints processing progress. Default is FALSE.

Value

A data table with columns:

doc_id: Character or numeric. The ID of the document from which the entity was extracted.
text_id: Character. The complete text from which the entity was extracted. Only included when show.text_id = TRUE.
entity: Character. The actual named entity text that was extracted. Will be NA if no entity was found.
tag: Character. The category of the named entity.
score: Numeric. Confidence score of the prediction.
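A table with these columns can be filtered and summarised with ordinary R indexing. The sketch below builds a placeholder data frame by hand; the rows are made up for illustration and are not output from a Flair model. It also assumes that tag and score are NA whenever entity is NA, which the description above does not state.

```r
# Placeholder rows with the documented columns; values are illustrative
# only and are NOT produced by a Flair tagger.
results <- data.frame(
  doc_id = c("doc1", "doc1", "doc2", "doc3"),
  entity = c("Entity A", "Entity B", "Entity C", NA),
  tag    = c("PER", "ORG", "LOC", NA),   # assumption: NA when no entity
  score  = c(0.99, 0.80, 0.95, NA),      # assumption: NA when no entity
  stringsAsFactors = FALSE
)

# Drop rows for documents in which no entity was found
found <- results[!is.na(results$entity), ]

# Keep only predictions at or above a chosen confidence threshold
confident <- found[found$score >= 0.9, ]

# Count the extracted entities per tag
tag_counts <- table(found$tag)
```

The 0.9 threshold is arbitrary; a suitable cut-off depends on the model and the task.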

Examples

if (FALSE) { # \dontrun{
library(reticulate)
library(flaiR)

# Using standard NER model
tagger_std <- load_tagger_ner('ner')

texts <- c(
  "John Smith works at Google in New York.",
  "The Eiffel Tower was built in 1889."
)
doc_ids <- c("doc1", "doc2")

results <- get_entities(
  texts = texts,
  doc_ids = doc_ids,
  tagger = tagger_std,
  batch_size = 2,
  verbose = TRUE
)
} # }