I'm currently developing a tool that detects addresses (or any other pattern, such as jobs or sports teams) in a text.
Here is what I'm currently doing:
1/ Splitting the text into words
2/ Stemming the words
Users can create categories (job, sports team, address...) and manually assign a sentence to a category.
Each stemmed word of this sentence is stored in the database with its score incremented (+1).
When I process a new document, I compute a score for each sentence by summing the scores of all the words in it.
Example:
I live in Brown Street, in London
=> (live+1, Brown +1, Street+1, London+1)
Then the next time I see
I live in Orange Street, in London
the score will be 3 (live+1, Street+1, London+1), so I can say "this sentence might be an address". If the user validates it, I update the words (live+1, orange+1, street+1, london+1). If they say "inaccurate", all the words are downvoted.
I think that with more runs I will be able to detect addresses, since "Street" and "London" will have large scores (the same goes for zip codes, etc.).
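To make the idea concrete, here is a minimal sketch of the scoring and feedback loop for a single category. It is only an illustration: the names (`tokens`, `CategoryScores`, `STOP_WORDS`) are made up for this post, lowercasing stands in for real stemming, and the tiny stop-word set is just enough to reproduce the example above.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list, just enough for the example sentences.
STOP_WORDS = {"i", "in"}


def tokens(sentence):
    """Split into lowercase words and drop stop words.

    Lowercasing is only a stand-in for a real stemmer.
    """
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return sorted(set(words) - STOP_WORDS)


class CategoryScores:
    """Per-word scores for one category (e.g. "address")."""

    def __init__(self):
        self.scores = defaultdict(int)  # word -> score

    def score(self, sentence):
        # Sum the stored scores of the words; unknown words count as 0.
        return sum(self.scores.get(w, 0) for w in tokens(sentence))

    def feedback(self, sentence, accurate):
        # +1 on every word if the user validates, -1 if "inaccurate".
        delta = 1 if accurate else -1
        for w in tokens(sentence):
            self.scores[w] += delta


address = CategoryScores()
address.feedback("I live in Brown Street, in London", accurate=True)
print(address.score("I live in Orange Street, in London"))  # 3
address.feedback("I live in Orange Street, in London", accurate=True)
```

After the second validation, "live", "street" and "london" are at 2 while "brown" and "orange" stay at 1, which is the effect I am counting on.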
My questions are:
First, what do you think of this approach? Second, context is ignored with this approach. A sentence containing both Street and London should get a higher score: if I detect Street and London in the same sentence, it is likely an address.
How can I store this information in a database? I'm currently using a relational database (MySQL), but I'm afraid the size will become huge if I store a link between every pair of words.
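For reference, the layout I have in mind looks roughly like the sketch below. It is an assumption, not something I have running: `sqlite3` in memory stands in for MySQL, and the table and function names (`pair_score`, `update_pairs`, `pair_total`) are hypothetical. Only pairs that actually occur together in a labelled sentence get a row, and each pair is stored once with the words in alphabetical order.

```python
import sqlite3
from itertools import combinations

# In-memory SQLite as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pair_score (
        category TEXT NOT NULL,
        word_a   TEXT NOT NULL,
        word_b   TEXT NOT NULL,
        score    INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (category, word_a, word_b)
    )
""")


def sentence_pairs(words):
    # Unordered pairs of distinct words, always stored as (smaller, larger).
    return combinations(sorted(set(words)), 2)


def update_pairs(category, words, delta):
    """Add delta to every word pair of one labelled sentence."""
    for a, b in sentence_pairs(words):
        conn.execute(
            "INSERT OR IGNORE INTO pair_score (category, word_a, word_b) "
            "VALUES (?, ?, ?)",
            (category, a, b),
        )
        conn.execute(
            "UPDATE pair_score SET score = score + ? "
            "WHERE category = ? AND word_a = ? AND word_b = ?",
            (delta, category, a, b),
        )


def pair_total(category, words):
    """Sum the stored scores of all word pairs found in a sentence."""
    total = 0
    for a, b in sentence_pairs(words):
        row = conn.execute(
            "SELECT score FROM pair_score "
            "WHERE category = ? AND word_a = ? AND word_b = ?",
            (category, a, b),
        ).fetchone()
        if row is not None:
            total += row[0]
    return total


update_pairs("address", ["live", "brown", "street", "london"], +1)
update_pairs("address", ["live", "orange", "street", "london"], +1)
print(pair_total("address", ["street", "london"]))  # 2
```

A sentence of n distinct words touches n(n-1)/2 rows, so the table grows with the pairs that really co-occur rather than with every possible pair in the vocabulary. I still don't know whether that stays manageable on a large corpus, which is why I'm asking.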
Is this what is called a neural network? What is the best way to store it?
Do you have any tips to improve my detection algorithm?