This function extracts named entities from texts using the Flair NLP library, processing the texts in batches. It supports both the standard NER and OntoNotes models, with a configurable batch size and optional GPU acceleration.
Usage
get_entities(
texts,
doc_ids = NULL,
tagger,
show.text_id = FALSE,
gc.active = FALSE,
batch_size = 5,
device = "cpu",
verbose = FALSE
)
Arguments
- texts
A character vector containing the texts to process.
- doc_ids
A character or numeric vector of document IDs, one for each element of texts. Default is NULL.
- tagger
A Flair tagger object for named entity recognition. Must be provided by the user. Can be created using load_tagger_ner() with different models:
Standard NER: tagger_ner <- load_tagger_ner('ner')
OntoNotes: tagger_ner <- load_tagger_ner('flair/ner-english-ontonotes')
Large model: tagger_ner <- load_tagger_ner('flair/ner-english-large')
- show.text_id
A logical value. If TRUE, adds a text_id column containing the complete text from which each entity was extracted. Default is FALSE.
- gc.active
A logical value. If TRUE, runs the garbage collector after processing texts. Default is FALSE.
- batch_size
An integer specifying the number of texts to process in each batch. Set to 1 for single-text processing. Default is 5.
- device
A character string specifying the computation device ("cpu", "cuda:0", "cuda:1", etc.). Default is "cpu". Note: MPS (Mac M1/M2) is currently not fully supported and will default to CPU.
- verbose
A logical value. If TRUE, prints processing progress. Default is FALSE.
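To illustrate what batch_size controls, the following minimal sketch splits a character vector into consecutive batches in base R. The helper make_batches() is hypothetical and not part of flaiR, and this is not necessarily how get_entities() batches texts internally; it only shows how the number of texts per batch relates to the number of batches.

```r
# Hypothetical helper (not part of flaiR): split texts into consecutive
# batches of at most `batch_size` elements.
make_batches <- function(texts, batch_size = 5) {
  stopifnot(batch_size >= 1)
  if (length(texts) == 0) return(list())
  # Texts 1..batch_size get index 1, the next batch_size get index 2, etc.
  batch_index <- ceiling(seq_along(texts) / batch_size)
  unname(split(texts, batch_index))
}

texts <- c("Text one.", "Text two.", "Text three.", "Text four.", "Text five.")
batches <- make_batches(texts, batch_size = 2)

# Five texts with batch_size = 2 give three batches; the last holds one text.
lengths(batches)
```

A larger batch_size means fewer batches, each holding more texts; batch_size = 1 puts every text in its own batch.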
Value
A data table with columns:
- doc_id
Character or numeric. The ID of the document from which the entity was extracted.
- text_id
Character. The complete text from which the entity was extracted. Only included when show.text_id = TRUE.
- entity
Character. The actual named entity text that was extracted. Will be NA if no entity was found.
- tag
Character. The category of the named entity, for example PER, ORG, LOC, or MISC with the standard NER model.
- score
Numeric. Confidence score of the prediction.
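The columns above can be filtered with ordinary subsetting. The sketch below builds a mock result by hand (the values are invented for illustration and are not real model output) and assumes that tag and score are also NA when no entity was found. It uses a plain data.frame so that it runs without flaiR; the same subsetting syntax also works on a data.table.

```r
# Mock result with the documented columns; all values are invented.
results <- data.frame(
  doc_id = c("doc1", "doc1", "doc1", "doc2", "doc3"),
  entity = c("John Smith", "Google", "New York", "Eiffel Tower", NA),
  tag    = c("PER", "ORG", "LOC", "LOC", NA),
  score  = c(0.99, 0.97, 0.95, 0.62, NA),
  stringsAsFactors = FALSE
)

# Drop rows for documents in which no entity was found (entity is NA).
found <- results[!is.na(results$entity), ]

# Keep only predictions at or above a chosen confidence threshold.
min_score <- 0.9
confident <- found[found$score >= min_score, ]

# Count the remaining entities per tag.
tag_counts <- table(confident$tag)
tag_counts
```

The threshold of 0.9 is arbitrary; a suitable value depends on the model and on how costly false positives are for the analysis.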
Examples
if (FALSE) { # \dontrun{
library(reticulate)
library(flaiR)
# Using standard NER model
tagger_std <- load_tagger_ner('ner')
texts <- c(
"John Smith works at Google in New York.",
"The Eiffel Tower was built in 1889."
)
doc_ids <- c("doc1", "doc2")
results <- get_entities(
texts = texts,
doc_ids = doc_ids,
tagger = tagger_std,
batch_size = 2,
verbose = TRUE
)
} # }