The first part of this article walked through an example of creating a model and introduced the most important steps in ML: data preparation, feature selection, model training (selection of model parameters), and final evaluation of results (AUC, Precision, Recall, etc.).

Now let’s look at a real example of using Azure ML Studio to solve a practical ML problem. This project was successfully implemented and is already used for prediction based on CV/opportunity text.

The first thing we need in machine learning is data. All data required for training is stored in Azure Document DB (Cosmos DB). To simplify data preparation and later training, we import the data and store it as a dataset in Azure ML Studio.

Once the import is done, we can visualize it.

Also, because Document DB contains additional fields, we should select specific fields like skill, description…

As a final step, we store the dataset as a .csv file in Azure ML Studio storage.
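Outside Studio, the field selection and CSV export could be approximated with a few lines of Python. This is only a sketch: the sample documents, the `_etag` and `id` fields, and the helper name `to_csv` are made up for illustration; only `skill` and `description` come from the text above.

```python
import csv
import io

# Hypothetical documents as they might come back from a Cosmos DB query;
# only "skill" and "description" are field names taken from the article.
documents = [
    {"id": "1", "skill": "python", "description": "Build ML models", "_etag": "x1"},
    {"id": "2", "skill": "sql", "description": "Maintain reports", "_etag": "x2"},
]

FIELDS = ["skill", "description"]


def to_csv(docs, fields):
    """Keep only the requested fields and serialize the rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for doc in docs:
        # Missing fields become empty strings so every row has the same columns.
        writer.writerow({name: doc.get(name, "") for name in fields})
    return buffer.getvalue()


csv_text = to_csv(documents, FIELDS)
```

The resulting text can be written to a file and uploaded as a dataset; the extra fields never reach the training data.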

A dataset usually requires some preprocessing before it can be analyzed. For example, we might have noticed that some rows have missing values in some columns. These missing values need to be cleaned so the model can analyze the data correctly. In this case, we’ll remove any rows that have missing values. Then, we clean the text using the Preprocess Text module. The cleaning reduces the noise in the dataset, helps you find the most important features, and improves the accuracy of the final model. We remove stop words (common words such as “the” or “a”), numbers, special characters, duplicated characters, email addresses, and URLs. We also convert the text to lowercase, lemmatize the words, and detect sentence boundaries, which are then indicated by the “|||” symbol in the pre-processed text.
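To make these cleaning steps concrete, here is a rough Python approximation. It is not what the Preprocess Text module runs internally: the regular expressions, the tiny stop-word list, and the function names are assumptions, and lemmatization and sentence-boundary detection are left out because they need a language library.

```python
import re

# A tiny illustrative stop-word list; a real list is much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "at"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # three or more of the same character


def drop_incomplete(rows, required):
    """Remove rows where any required field is missing or blank."""
    return [r for r in rows if all((r.get(f) or "").strip() for f in required)]


def clean_text(text):
    """Lowercase, strip URLs/emails/digits/symbols, and drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)  # removes numbers and special characters
    text = REPEAT_RE.sub(r"\1\1", text)  # "sooooon" -> "soon"
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)
```

The order matters: URLs and email addresses are removed before punctuation is stripped, otherwise their fragments would survive as ordinary words.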

As we can see, this experiment uses the built-in Preprocess Text module, in contrast to the part 1 example, where we used an R script. The module lets us clean the text by removing stop words, numbers, special symbols, etc.

The main goal of this step is to obtain cleaned text (the opportunity description) without stop words, numbers, emails, URLs, etc. For example, here is a description before preprocessing:

And after:

To build a model for text data, we typically have to convert free-form text into numeric feature vectors. In our experiment, we use the Extract N-Gram Features from Text module to transform the text data into such a format. This module takes a column of whitespace-separated words and computes a dictionary of words, or N-grams of words, that appear in your dataset. Then, it counts how many times each word, or N-gram, appears in each record, and creates feature vectors from those counts. In our experiment we set the N-gram size to 2, so our feature vectors include single words and combinations of two consecutive words.
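The counting described above can be sketched in plain Python. The function names here are illustrative and not part of Azure ML Studio; `max_n=2` mirrors the N-gram size used in the experiment.

```python
from collections import Counter


def extract_ngrams(text, max_n=2):
    """Return counts of all word n-grams of length 1..max_n in one record."""
    tokens = text.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def build_vocabulary(records, max_n=2):
    """Map every n-gram seen in the corpus to a column index."""
    vocab = {}
    for text in records:
        for gram in extract_ngrams(text, max_n):
            vocab.setdefault(gram, len(vocab))
    return vocab


def to_vector(text, vocab, max_n=2):
    """Turn one record into a dense count vector aligned with vocab."""
    vector = [0] * len(vocab)
    for gram, count in extract_ngrams(text, max_n).items():
        if gram in vocab:
            vector[vocab[gram]] = count
    return vector
```

For example, the record "python developer" contributes three features: "python", "developer", and the bigram "python developer".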

We apply TF-IDF (Term Frequency Inverse Document Frequency) weighting to N-gram counts. This approach increases the weight of words that appear frequently in a single record but are rare across the entire dataset. Other options include binary, TF, and graph weighting.
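As a rough illustration of the idea, the sketch below uses the textbook form, count × log(N / document frequency). The exact formula and any normalization used by the module may differ, so treat this as an assumption; it also works on single words only to stay short.

```python
import math
from collections import Counter


def tf_idf(records):
    """Weight each term count by log(N / document frequency)."""
    tokenized = [text.split() for text in records]
    n_docs = len(tokenized)

    # Document frequency: in how many records does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weighted = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weighted.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted
```

A word that occurs in every record gets a weight of zero, while a word repeated within one record and absent elsewhere gets the highest weight.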

Such text features often have high dimensionality. For example, if your corpus has 100,000 unique words, your feature space will have 100,000 dimensions, or more if N-grams are used. The Extract N-Gram Features module gives you a set of options to reduce the dimensionality. You can choose to exclude words that are short or long, or too uncommon or too frequent to have significant predictive value. In our experiment, we exclude N-grams that appear in fewer than 5 records.
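A minimal sketch of this kind of pruning follows. The `min_doc_freq=5` default matches the threshold used in our experiment; the length limits and the function name are illustrative assumptions.

```python
from collections import Counter


def prune_vocabulary(records, min_doc_freq=5, min_length=1, max_length=25):
    """Keep terms that occur in enough records and have a sensible length."""
    doc_freq = Counter()
    for text in records:
        # Count each term once per record, however often it repeats there.
        doc_freq.update(set(text.split()))
    return {
        term
        for term, freq in doc_freq.items()
        if freq >= min_doc_freq and min_length <= len(term) <= max_length
    }
```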

Also, we use feature selection to keep only the features that are most correlated with our prediction target. We use Chi-Squared feature selection to pick 50,000 features. We can view the vocabulary of selected words or N-grams by clicking the right output of the Extract N-Gram Features module.
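To show what Chi-Squared selection measures, here is a small sketch that scores each term by the chi-squared statistic of the contingency table "term present or absent" versus "class". It is a simplified stand-in, not the module's implementation, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def chi_squared_scores(records, labels):
    """Score each term by the chi-squared statistic of (term present?) x (class)."""
    n = len(records)
    class_totals = Counter(labels)
    present = defaultdict(Counter)  # term -> class -> records containing term
    for text, label in zip(records, labels):
        for term in set(text.split()):
            present[term][label] += 1

    scores = {}
    for term, by_class in present.items():
        n_present = sum(by_class.values())
        n_absent = n - n_present
        score = 0.0
        for label, class_n in class_totals.items():
            # One cell for "term present" and one for "term absent" per class.
            for observed, row_total in (
                (by_class[label], n_present),
                (class_n - by_class[label], n_absent),
            ):
                expected = row_total * class_n / n
                if expected > 0:
                    score += (observed - expected) ** 2 / expected
        scores[term] = score
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Terms spread evenly across classes score close to zero, while terms concentrated in one class score high and survive the selection.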

We use the Multiclass Neural Network module to create a neural network model that can be used to predict a target that has multiple values. Also, we use the Tune Model Hyperparameters module to build and test models using different combinations of settings, in order to determine the optimum hyperparameters for the given prediction task and data.
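Conceptually, a hyperparameter sweep tries combinations of settings and keeps the best-scoring one. The sketch below shows an exhaustive grid version of that idea. The parameter names, their values, and `toy_score` are purely illustrative; in a real run the scoring function would train a model on one split and measure accuracy on another.

```python
import itertools


def sweep(param_grid, score_fn):
    """Try every combination in param_grid and return the best one with its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space; names and values are illustrative only.
grid = {
    "hidden_nodes": [50, 100, 200],
    "learning_rate": [0.01, 0.1],
}


def toy_score(params):
    """Stand-in for 'train on one split, measure accuracy on the other'."""
    return 1.0 - abs(params["hidden_nodes"] - 100) / 1000 - abs(params["learning_rate"] - 0.1)


best_params, best_score = sweep(grid, toy_score)
```

The number of combinations grows multiplicatively with each added parameter, which is why sweeps are often limited to a random subset of the grid.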

First, we split the data into a training dataset and a scoring dataset. That is, we set aside part of the data that is not included in the training process but is used as test data to calculate accuracy.

In the second step, we also split the training dataset for Tune Model Hyperparameters. After that, we configure the main model of our experiment, the Multiclass Neural Network.

After training, we can score and evaluate the model to analyze the resulting accuracy.
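The two-level split can be sketched as follows. The 80/20 and 75/25 proportions, the seeds, and the integer stand-in data are assumptions for illustration; the article does not state the fractions used in the experiment.

```python
import random


def split(rows, fraction, seed=0):
    """Shuffle a copy of rows and split it into two parts at the given fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for the prepared records

# First split: hold out test data that never takes part in training.
train_rows, test_rows = split(rows, 0.8, seed=1)

# Second split: divide the training part again for hyperparameter tuning.
tune_train_rows, tune_valid_rows = split(train_rows, 0.75, seed=2)
```

Keeping the test rows out of both training and tuning is what makes the final accuracy figure an honest estimate.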
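For a multiclass model, the evaluation boils down to comparing predicted labels with actual ones. A minimal sketch, with a hypothetical `evaluate` helper, computing overall accuracy plus per-class precision and recall:

```python
from collections import Counter


def evaluate(actual, predicted):
    """Return overall accuracy plus per-class precision and recall."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    true_pos = Counter(a for a, p in zip(actual, predicted) if a == p)
    predicted_n = Counter(predicted)
    actual_n = Counter(actual)

    per_class = {}
    for label in sorted(set(actual) | set(predicted)):
        per_class[label] = {
            "precision": true_pos[label] / predicted_n[label] if predicted_n[label] else 0.0,
            "recall": true_pos[label] / actual_n[label] if actual_n[label] else 0.0,
        }
    return correct / len(actual), per_class
```

Per-class figures are worth checking alongside accuracy, because a model can score well overall while performing poorly on rare classes.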

As we can see, in a few fairly simple steps we have built a model that solves a practical problem. Of course, some steps have been omitted, for example, searching for the optimal text weighting method (TF or TF-IDF) and its parameters:

Or comparing the quality of different models:

In this article, we tried to show an example of using ML and Azure ML Studio to solve a practical problem. Of course, we showed it only briefly: developing models for Natural Language Processing is a topic for a whole book, and the article could easily turn into a textbook on ML, NLP, and Azure ML Studio.

But the main point is that we have shown it is possible to build such models simply and quickly with only basic knowledge of ML, and that the demonstrated tool, Azure ML Studio, makes this possible.

Of course, we did not cover all aspects of ML and ML Studio. We did not consider other types of models or show how to select model parameters effectively, from simple logistic regression to neural networks. Nor did we show how to build multi-layer neural networks or networks with different types of activation functions. All of this can be covered in the next article.
