Traditional machine learning projects use numeric and textual data stored in conventional databases. Developing intelligent applications based purely on text data is extremely challenging. Why is that so? In the first place, the text data available in the world far exceeds the numeric data held in conventional databases. So the question is whether we can extract useful information from this huge corpus of text, which can run into several terabytes or even petabytes. At these sizes, the whole perspective of machine learning changes. In traditional databases the number of columns is quite low, so the number of features for machine learning is also small: generally in the tens and at most a few hundred. NLP applications have no columns as structured databases do, so each word in the text corpus becomes a candidate feature for model training. Training a model with millions of features is impractical. So, to develop ML applications, the first and major requirement is to reduce the feature count by reducing the vocabulary. The other major requirement is to convert the text data into a numeric format, because a machine works only with numbers. That is where NLP becomes distinct from model development on structured databases. Once the text data is pre-processed into a minimal set of features that represents the entire corpus, the rest of the model development process remains the same as the traditional one, referred to here as Good Old Fashioned AI.
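The two requirements described above, shrinking the vocabulary and turning text into numbers, can be illustrated with a minimal sketch. It assumes a simple bag-of-words representation, which is only one of several possible encodings. The stop-word list, the toy corpus, and the function names `tokenize`, `build_vocabulary`, and `vectorize` are hypothetical and chosen purely for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real projects use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents, max_features):
    """Keep only the most frequent non-stop-word tokens as features."""
    counts = Counter(
        token
        for doc in documents
        for token in tokenize(doc)
        if token not in STOP_WORDS
    )
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked[:max_features]]


def vectorize(text, vocabulary):
    """Turn one document into a list of word counts, one slot per feature."""
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocabulary]


# Toy corpus (illustrative only).
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets.",
]

vocabulary = build_vocabulary(corpus, max_features=3)
vectors = [vectorize(doc, vocabulary) for doc in corpus]

print(vocabulary)
print(vectors)
```

Capping `max_features` is the crudest form of vocabulary reduction: here the eight distinct content words in the toy corpus collapse to three features. Note that "cat" and "cats" remain separate features in this sketch; techniques such as stemming or lemmatization would merge them and reduce the vocabulary further.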
Specification: Natural Language Processing using Python