Named entity extraction from text in Python



Introduction

The ability to understand and process natural language is often regarded as one of the advantages humans have over computers. A literate person can readily comprehend written text and derive meaning from its grammar, vocabulary and punctuation. As effortless as this seems for humans, getting a computer to make sense of even a simple text requires either a rule-based, logic-driven approach or machine learning.

Natural Language Processing (NLP) in Python has become commonplace, thanks to mature, open source, industrial-grade libraries with a gentle learning curve. Among the open source NLP libraries available for Python, Natural Language Toolkit (NLTK) and spaCy stand out for their large communities and wealth of resources.

NLTK vs spaCy

NLTK is an open source suite of Python modules, data sets and examples for Natural Language Processing, with strong support for research and development. It provides easy-to-use interfaces to numerous corpora and lexical resources such as WordNet, and includes text processing libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning, as well as wrappers for industrial-strength NLP libraries. The source code is on GitHub at https://github.com/nltk/nltk.

spaCy, on the other hand, is better suited to large scale Natural Language Processing tasks and is commonly used in industry and enterprise applications, with support for over 50 languages. spaCy can be used alongside Python's other AI libraries and works with TensorFlow, PyTorch, scikit-learn and Gensim. Building linguistically advanced statistical models for a variety of NLP problems is generally easier with spaCy than with NLTK. More information on spaCy can be found at https://spacy.io/.

Named Entity Recognition

Named entity recognition (NER) is a subtask of information extraction. It involves identifying named entities in text and classifying them into pre-defined categories. These categories include names of persons, locations, expressions of time, organizations, quantities, monetary values and so on. NER has real world uses in various Natural Language Processing problems.
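
To make "identifying and classifying" concrete, here is a minimal sketch of the simplest possible approach: looking tokens up in a hand-written dictionary. The `GAZETTEER` contents and the `toy_ner` function are made up for illustration; this is not how spaCy or NLTK work internally, which rely on trained statistical models.

```python
import re

# Hypothetical, hand-written lookup table: lowercase word -> entity category.
# A real system would use a trained model or a far larger dictionary.
GAZETTEER = {
    "john": "PERSON",
    "toyota": "ORG",
    "toronto": "GPE",
}


def toy_ner(text):
    """Return (token, category) pairs for tokens found in the lookup table."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [(tok, GAZETTEER[tok.lower()]) for tok in tokens if tok.lower() in GAZETTEER]


print(toy_ner("John bought a Toyota in Toronto"))
# [('John', 'PERSON'), ('Toyota', 'ORG'), ('Toronto', 'GPE')]
```

A lookup like this cannot handle unseen names or ambiguous words, which is the gap that trained models are meant to fill.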

Some sample uses include:
  • Building a minimalistic search engine, where you might want to identify locations, names or even products in search queries.
  • Tweet mining, to determine whether a tweet mentions locations or persons of interest.
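
The tweet mining case can be sketched as a simple filter, assuming an NER step has already produced (text, label) pairs for each tweet. The tweets, the entity pairs and the `mentions_any` helper below are hypothetical and only illustrate the idea; the label names follow the spaCy scheme described later in this article.

```python
def mentions_any(entities, wanted_labels):
    """Return True if any (text, label) pair carries one of the wanted labels."""
    return any(label in wanted_labels for _, label in entities)


# Hypothetical tweets, each paired with the entities an NER step might return.
tweets = [
    ("Traffic is terrible in Toronto today", [("Toronto", "GPE")]),
    ("Coffee first, then everything else", []),
]

# Keep only tweets that mention a location or a person.
of_interest = [text for text, ents in tweets
               if mentions_any(ents, {"GPE", "LOC", "PERSON"})]
print(of_interest)
# ['Traffic is terrible in Toronto today']
```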

Named Entity Recognition using spaCy

spaCy's English named entity recognition models were trained on the OntoNotes 5 corpus and support the following entity types:

Type         Description
PERSON       People, including fictional.
NORP         Nationalities or religious or political groups.
FAC          Buildings, airports, highways, bridges, etc.
ORG          Companies, agencies, institutions, etc.
GPE          Countries, cities, states.
LOC          Non-GPE locations, mountain ranges, bodies of water.
PRODUCT      Objects, vehicles, foods, etc. (Not services.)
EVENT        Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART  Titles of books, songs, etc.
LAW          Named documents made into laws.
LANGUAGE     Any named language.
DATE         Absolute or relative dates or periods.
TIME         Times smaller than a day.
PERCENT      Percentage, including “%”.
MONEY        Monetary values, including unit.
QUANTITY     Measurements, as of weight or distance.
ORDINAL      “first”, “second”, etc.
CARDINAL     Numerals that do not fall under another type.
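
Because entity labels are compared as plain strings, a misspelled label such as `TIM` or a label from a different tagging scheme such as `GEO` never matches and fails silently. A small check against the table above can catch this early. The `SPACY_LABELS` set and the `unknown_labels` helper are illustrative names, not part of spaCy.

```python
# The 18 entity types from the table above.
SPACY_LABELS = frozenset({
    "PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT",
    "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT",
    "MONEY", "QUANTITY", "ORDINAL", "CARDINAL",
})


def unknown_labels(labels):
    """Return, in sorted order, the labels that are not in the table."""
    return sorted(set(labels) - SPACY_LABELS)


print(unknown_labels(["TIM", "DATE", "GPE", "GEO"]))
# ['GEO', 'TIM']
```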

Install the spaCy library with pip and download the small English model by running the commands below in a terminal.

            pip install spacy
            python -m spacy download en_core_web_sm
        

We will pass the text “John bought a Toyota camry 2019 model in Toronto in January 2020 at a cost of $38000” to spaCy to recognize the entities in it. Recognizing entities in a text using spaCy involves three steps.

First step: Importing spaCy and loading the pre-trained model

            import spacy

            spacy_nlp = spacy.load('en_core_web_sm')
        

If you are using the medium or large pre-trained spaCy model, use 'en_core_web_md' or 'en_core_web_lg' instead of 'en_core_web_sm'.

Second step: Constructing a spaCy document from the text

            text = "John bought a Toyota camry 2019 model in Toronto in January 2020 at a cost of $38000"
            doc = spacy_nlp(text.strip())
        

Third step: Accessing the recognized entities through the ents property of the constructed doc object

            for ent in doc.ents:
                print(ent.text, ent.label_)
        

The snippet below shows how the entities are extracted using the spaCy supported entity types.

        # create sets to hold the entities of each group
        named_entities = set()
        money_entities = set()
        organization_entities = set()
        location_entities = set()
        time_indicator_entities = set()

        for ent in doc.ents:
            entry = ent.lemma_.lower()
            # time indicator entities (dates and times of day)
            if ent.label_ in ["TIME", "DATE"]:
                time_indicator_entities.add(entry)
            # money value entities
            elif ent.label_ in ["MONEY"]:
                money_entities.add(entry)
            # organization entities
            elif ent.label_ in ["ORG"]:
                organization_entities.add(entry)
            # geopolitical and other location entities
            elif ent.label_ in ["GPE", "LOC"]:
                location_entities.add(entry)
            # people, works of art and events
            elif ent.label_ in ["PERSON", "WORK_OF_ART", "EVENT"]:
                named_entities.add(entry.title())
        

The recognized entities can be printed to the console.

        print(f"named entities - {named_entities}")
        print(f"money entities - {money_entities}")
        print(f"location entities - {location_entities}")
        print(f"time indicator entities - {time_indicator_entities}")
        print(f"organization entities - {organization_entities}")

        -------------------------------------------------------------

                    named entities - {'John'}
                    money entities - {'38000'}
                    location entities - {'toronto'}
                    time indicator entities - {'2019', 'january 2020'}
                    organization entities - {'toyota'}
        

In the output, John was extracted as a named entity, 38000 as a money entity, Toronto as a location entity, Toyota as an organization entity, and 2019 and January 2020 as time indicator entities.

Complete source code listing is below.

            import spacy


            class NamedEntityExtractor:
                """
                Performs named entity recognition from texts
                """

                def __init__(self, model: str = 'en_core_web_sm'):
                    # load the spaCy model once so it is reused across calls
                    self.spacy_nlp = spacy.load(model)

                def extract(self, text: str) -> None:
                    """
                    Performs named entity recognition from text and prints the result
                    :param text: Text to extract
                    """
                    # parse text into spacy document
                    doc = self.spacy_nlp(text.strip())

                    # create sets to hold the entities of each group
                    named_entities = set()
                    money_entities = set()
                    organization_entities = set()
                    location_entities = set()
                    time_indicator_entities = set()

                    for ent in doc.ents:
                        entry = ent.lemma_.lower()
                        # time indicator entities (dates and times of day)
                        if ent.label_ in ["TIME", "DATE"]:
                            time_indicator_entities.add(entry)
                        # money value entities
                        elif ent.label_ in ["MONEY"]:
                            money_entities.add(entry)
                        # organization entities
                        elif ent.label_ in ["ORG"]:
                            organization_entities.add(entry)
                        # geopolitical and other location entities
                        elif ent.label_ in ["GPE", "LOC"]:
                            location_entities.add(entry)
                        # people, works of art and events
                        elif ent.label_ in ["PERSON", "WORK_OF_ART", "EVENT"]:
                            named_entities.add(entry.title())

                    print(f"named entities - {named_entities}")
                    print(f"money entities - {money_entities}")
                    print(f"location entities - {location_entities}")
                    print(f"time indicator entities - {time_indicator_entities}")
                    print(f"organization entities - {organization_entities}")


            if __name__ == '__main__':
                named_entity_extractor = NamedEntityExtractor()
                text = "John bought a Toyota camry 2019 model in Toronto in January 2020 at a cost of $38000"
                named_entity_extractor.extract(text)
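
In an application it is usually more convenient to return the grouped entities than to print them. The sketch below is one possible variant, not part of the original listing: the `BUCKETS` mapping and the `group_entities` function are assumptions made for illustration. It only relies on each entity having `label_` and `lemma_` attributes, so spaCy's `doc.ents` can be passed in directly; the stand-in objects are there only so the example runs without a downloaded model.

```python
from collections import defaultdict
from types import SimpleNamespace

# Assumed grouping of spaCy labels into the five groups used in this article.
BUCKETS = {
    "PERSON": "named", "WORK_OF_ART": "named", "EVENT": "named",
    "MONEY": "money",
    "ORG": "organization",
    "GPE": "location", "LOC": "location",
    "DATE": "time", "TIME": "time",
}


def group_entities(ents):
    """Group lowercased entity lemmas by bucket, skipping unmapped labels."""
    grouped = defaultdict(set)
    for ent in ents:
        bucket = BUCKETS.get(ent.label_)
        if bucket is not None:
            grouped[bucket].add(ent.lemma_.lower())
    return dict(grouped)


# Hypothetical stand-ins that mimic the two attributes read above.
# With spaCy installed, call group_entities(doc.ents) instead.
fake_ents = [
    SimpleNamespace(label_="PERSON", lemma_="John"),
    SimpleNamespace(label_="GPE", lemma_="Toronto"),
    SimpleNamespace(label_="DATE", lemma_="January 2020"),
    SimpleNamespace(label_="CARDINAL", lemma_="3"),  # unmapped, so skipped
]
print(group_entities(fake_ents))
```

Returning a plain dictionary keeps the function easy to test and lets the caller decide how to display or store the result.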
        

Named entity extraction has numerous real world uses, such as those mentioned earlier. spaCy makes it easy to add this functionality to your enterprise application.



