Name DB Generator: Extract Names from Text using NLTK
This python script uses NLTK to extract names from a provided text file and shows them in a searchable format along with the context the name is used in and the sentiment of the sentence.
There are two difficult problems in Computer Science and Life: Cache invalidation and Naming things. I put this script together to help me name my daughter. The results of this script when run on the Bible are viewable here: http://namedb.herokuapp.com/
There are two pieces to this Project:
Part 1: Name Extractor Python Script
The first step is to pull out and store the list of names in the text. We do this with a python script using NLTK’s Vadar library and store the extracted names into an SQLite database.
For my purposes, I used the json formatted bible found here as input text. You can ofcourse point the script to any text you wish.
Some mistakes are seen in the extract of names:
- Plurals and singular variants of the same name are recongized as different names.
- NLTK detects a lot of words as proper nouns though they are not names. For example, the words: Accordingly, Afterward, Affliction and Against were all detected as nouns.
- The Vadar library certainly isn’t tuned for Bibilical English as some very negative sentences are detected as ‘neutral’
NLTK and Vadar aren’t perfect, but they still did a decent job getting a majority of the names out and it’s a decent place to start for sentiment analysis as well.
View Source for Name Extractor on Github
Part 2: Express UI
The second part is an ExpressJS UI for the database so that the names can be viewed and browsed through easily.