Automated feature creation and processing for machine learning




One of the essential tasks in machine learning is feature creation: the process of constructing new features from existing data, on which a machine learning model is then trained. This step is as important as the choice of model, because the algorithm learns only from the data we give it, and creating features relevant to the task is crucial for gaining insight.

 

Usually, feature engineering is a tedious, manual process, relying on domain knowledge, intuition, and extensive data manipulation. It is time-consuming because each new feature usually requires several steps to build, often using information from multiple tables.

 

We can broadly group the operations of feature creation into two categories:

 

Transformations

A transformation acts on a single table, creating new features from one or more of its existing columns. Deriving the logarithm of annual income from a monthly salary column would be an example.

Aggregations

Aggregations are performed across tables: they group observations and calculate statistics over them. An example would be taking all of a client's outstanding mortgages and loans and computing statistics on that client's total financial liabilities.

Performing these operations by hand, repeatedly, on multi-table datasets is inefficient. Automated feature engineering helps the data scientist by automatically creating many candidate features from a dataset, from which the relevant ones can be selected and used for training. One good way to do this is with the Featuretools Python library. This open-source library automatically creates many features from a set of related tables. It is based on a method known as Deep Feature Synthesis (DFS), which stacks multiple transformation and aggregation operations (called feature primitives) to create features from data spread across many tables.
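The two kinds of operations can be sketched in plain pandas with two hypothetical tables, clients and loans (all column names and values here are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical clients table with a monthly salary column
clients = pd.DataFrame({
    "client_id": [1, 2, 3],
    "monthly_salary": [4000.0, 5500.0, 3200.0],
})

# Transformation: acts on a single table, deriving a new column
# (log of annual income) from an existing one (monthly salary)
clients["log_annual_income"] = np.log(clients["monthly_salary"] * 12)

# Hypothetical loans table; each client can have several loans
loans = pd.DataFrame({
    "client_id": [1, 1, 2, 3, 3, 3],
    "balance": [1000.0, 2500.0, 800.0, 1200.0, 600.0, 300.0],
})

# Aggregation: groups the child table per client and computes statistics
liabilities = loans.groupby("client_id")["balance"].agg(["sum", "mean"])
print(liabilities)
```

Automated feature engineering essentially enumerates and stacks many such operations for us.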

 

Let us first go through some fundamental concepts of Featuretools.

 

The two main concepts in Featuretools are entities and EntitySets. An entity is simply a table. An EntitySet is a collection of tables and the relationships between them. The minimal input to DFS is a set of entities, a list of relationships, and the target_entity to calculate features for. The output of DFS is a feature matrix and the corresponding list of feature definitions. We show how it works by taking a prototype example on publicly available data and replicating it in our environment.

 

Example:

 

In this dataset, there are three tables, or entities:

 

users

Unique users, identified by user_id, each of whom shares multiple posts. Each user has a zipcode, a session count, and a join date.

posts

Posts shared by a user; each post has a post_id.

interactions

Likes, views, and comments on each shared post. Each post has multiple interactions.

 

The tables are related through the user_id and post_id variables. We can create an empty EntitySet in Featuretools as follows:

 

 

Now we have to add entities. Each entity must have an index, a column whose values are all unique. The index in the users dataframe is the user_id; the index in the posts dataframe is the post_id. However, the interactions dataframe has no unique index. When we add this entity to the EntitySet, we need to pass the parameter make_index=True and specify the name of the new index.


 

 

Second, we specify how the entities are related. When two entities have a one-to-many relationship, we call the "one" entity the "parent entity." For example, in our dataset the users dataframe is a parent of the posts dataframe, which in turn is the parent of the interactions dataframe. To specify a relationship in Featuretools, we only need to name the variable that links the two tables: users and posts are linked via the user_id variable, and posts and interactions via the post_id variable.


 

 

Operations of feature creation are called feature primitives in Featuretools. These primitives can be used by themselves or combined to create features. To make features with specified primitives, we use the ft.dfs function (short for deep feature synthesis). We pass in the entity set; the target_entity, which is the table we want to add features to; the selected trans_primitives (transformations); and agg_primitives (aggregations). For example, we may be interested in the month a user joined and each user's interaction count. Featuretools builds features automatically by stacking these primitives.



 




 

We now have many new features describing a user's behavior, built by combining, or stacking, multiple feature primitives - an example of deep feature synthesis. We can modify the code appropriately to generate even more second-order features.




 

Next steps

Automated feature engineering has solved the problem of creating features, but it has created another: too many features, not all of them relevant to the task we want to train our model on. Having too many features can lead to poor model performance, because the less useful features overwhelm those that are more important.

This is known as the curse of dimensionality: as the number of features increases, it becomes exponentially more difficult for a model to learn the mapping between features and targets. This problem is dealt with using feature reduction techniques such as Principal Component Analysis (PCA), SelectKBest, and autoencoders. An often-used, powerful technique is LASSO (least absolute shrinkage and selection operator), a regression method that performs both feature selection and regularization.
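A small sketch of LASSO-based selection on synthetic data (the data, the alpha value, and the zero threshold below are arbitrary choices for illustration): the L1 penalty drives the coefficients of uninformative columns exactly to zero, so the surviving columns are the selected features.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # 10 candidate features
# Only columns 0 and 1 actually drive the target; the rest are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The L1 penalty (alpha) shrinks irrelevant coefficients exactly to zero
model = Lasso(alpha=0.1).fit(X, y)
selected = [i for i, coef in enumerate(model.coef_) if abs(coef) > 1e-6]
print(selected)
```

The same pattern applies to an automatically generated feature matrix: fit, inspect the nonzero coefficients, and keep only those columns.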

 

A good data scientist can thus combine these two steps, automated feature creation and feature selection (via domain knowledge or feature reduction techniques), to achieve the desired results.


