Keywords: Data Science, Statistics Consulting, Automatic Feature Engineering

Introduction

We will use the Auto MPG Data Set ( https://archive.ics.uci.edu/ml/datasets/Auto+MPG ) to demonstrate automatic feature engineering. This dataset is well suited because it mixes categorical and numerical attributes, which is what one typically encounters in practice.

Motivation

Although feature engineering is the first step in many machine learning pipelines, there seems to be no method which works well across multiple datasets and machine learning algorithms. Many algorithms require the data to be in numerical format; for example, clustering is difficult with categorical data ( https://datascience.stackexchange.com/a/24/42229 ). If the data were in numerical vector format, we could do dimensionality reduction (for example with UMAP), visualize and plot the data after the reduction, observe the number of clusters and then run, say, k-means clustering. Likewise, with numerical vectors we could use SVMs, neural networks or linear regression for regression, and SVMs, neural networks or logistic regression for classification. One could argue that Random Forests already deal with mixed categorical and numerical data frames, but why restrict ourselves to a single method?

Proposed Method

The proposed method goes back to the mathematician Jean Bourgain (https://en.wikipedia.org/wiki/Jean_Bourgain), who proved in 1985 that every finite metric space can be embedded with a "small" distortion (of order \(O(\log n)\) for an \(n\)-point space) into Euclidean space (original work: https://link.springer.com/article/10.1007/BF02776078 ; see http://www.cs.toronto.edu/~avner/teaching/S6-2414/LN2.pdf for an introduction). This means the following: if we have a distance between two data points, we can embed those data points in Euclidean space, and hence we automatically get numerical features for them. What remains is to define a distance between the points. I suggest the following general method:

For numerical features, scale them and use the cosine distance. For categorical features use the trivial metric (0 if the same category, 1 if not), and for sets or lists of categories (for example a list of words) use the Jaccard distance. Having those distances \(d_i\), with values between 0 and 1, we can simply combine them as \(d = \sqrt{ \sum_i d_i^2 }\). Of course the right distance depends on the application, but this suggestion is a first step.
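As a small illustration (the two toy records below are invented values, not rows of the Auto MPG data), the combined distance could be computed like this:

# Toy illustration of the combined distance; the example records are invented.
import math
from scipy.spatial import distance

num1, num2 = [1.2, -0.3, 0.5], [0.9, -0.1, 0.7]          # already scaled numerical features
cat1, cat2 = "usa", "japan"                               # a categorical feature
words1, words2 = {"chevrolet", "malibu"}, {"chevrolet", "monte", "carlo"}  # word sets

d_num = distance.cosine(num1, num2)                       # cosine distance
d_cat = 0.0 if cat1 == cat2 else 1.0                      # trivial metric
d_jac = 1.0 - len(words1 & words2) / float(len(words1 | words2))  # Jaccard distance

d = math.sqrt(d_num**2 + d_cat**2 + d_jac**2)
print(d)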

Proposed Method in Action

Let us apply this method to the Auto MPG Data Set to do the following: 1) regression, 2) visualization of the cars, 3) dimensionality reduction, 4) clustering, 5) computation of similar cars. Yes, all of this can be done with mixed categorical and numerical data frames.

I do not want to go into details, but the general procedure is this:

  1. Data transformation (a short code sketch of these steps follows after the list):
     1. Define the categorical and the numerical features.
     2. NAs in categorical features are set to the category “NACAT”.
     3. NAs in numerical features are set to the median.
     4. Scale the numerical variables (variance = 1, mean = 0).
     5. Bourgain embedding (using the metrics defined above).

  2. Analyse the data, now that it is in numerical format, and apply your favourite machine learning algorithm.
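The data transformation can be done, for example, with pandas. The following sketch is only an illustration (pandas is not used in the original scripts); it reuses the same file name and column indices as the embedding script below:

# Sketch of the preprocessing steps above; the use of pandas is my own choice.
import pandas as pd

df = pd.read_csv("tt.csv", header=None)

cat_cols = [0, 1, 2]   # categorical features, as in the script below
num_cols = [5, 6, 7]   # numerical features, as in the script below

# 2. NAs in categorical features become the extra category "NACAT"
df[cat_cols] = df[cat_cols].fillna("NACAT")

# 3. NAs in numerical features are replaced by the column median
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# 4. scale the numerical features to mean 0 and variance 1
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()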

Scripts

Below is the Python script which does the Bourgain embedding. The module bourgain.py, in which the embedding method itself is implemented, can be requested from me (info@orges-leka.de).

#!/usr/bin/python
import os,csv,re,math,random
from scipy.spatial import distance

from bourgain import BourgainEmbedding as BE

numFeat = [5,6,7] # define numerical features
catFeat = [0,1,2] # define categorical features
jaccFeat = [3] # define set / list features

def jaccardDist(set1,set2):
    # 1 - |intersection| / |union|; float() avoids integer division under Python 2
    return 1.0 - len(set1.intersection(set2))/float(len(set1.union(set2)))

def dist(prod1,prod2):
    # cosine distance on the (scaled) numerical features
    vec1 = [ float(prod1[i]) for i in numFeat]
    vec2 = [ float(prod2[i]) for i in numFeat]
    d1 = distance.cosine(vec1,vec2)
    # trivial metric on the categorical features: 0 if equal, 1 otherwise
    ds = [1-(prod1[cf]==prod2[cf]) for cf in catFeat]
    ds.insert(0,d1)
    # Jaccard distance on the set/list features (here: the words of the car name)
    for jf in jaccFeat:
        ds.append( jaccardDist( set(prod1[jf].split(" ")), set(prod2[jf].split(" "))))
    return math.sqrt(sum([dd**2 for dd in ds]))

filename = "./tt.csv"

def main():
    csvFile = open(filename,"r")
    csvReader = csv.reader(csvFile,dialect=csv.excel)
    newFile = open("./tt-features.csv","w")
    newCW = csv.writer(newFile,dialect=csv.excel)
    lines = []
    for line in csvReader:
        lines.append(line)
    csvFile.close()
    random.seed(12345)
    randomsubset = random.sample(lines,len(lines))
    be = BE(dist)
    be.fit(randomsubset,verbose=True)
    featureVectors = be.predict(lines,verbose=True)
    for i in range(len(lines)):
        fv = featureVectors[i]
        fv.insert(0,lines[i][4]) # mpg- for regression
        fv.insert(0,lines[i][3]) # name of car
        print(fv)
        newCW.writerow(fv)
    newFile.close()

if __name__ == '__main__':
    main()
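Since bourgain.py is only available on request, here is a rough, self-contained sketch of how a Bourgain-style embedding with a similar fit/predict interface can be implemented. It only illustrates the idea (random subsets of the data, one coordinate per subset given by the distance to that subset) and is not identical to the module used above:

# Rough illustration of a Bourgain-style embedding; bourgain.py itself may differ in details.
import math, random

class SketchBourgainEmbedding:
    def __init__(self, dist):
        self.dist = dist      # metric on the raw records, e.g. the dist() function above
        self.anchors = []     # the random subsets chosen during fit

    def fit(self, points, verbose=False):
        n = len(points)
        L = max(1, int(math.ceil(math.log(n, 2))))
        for t in range(1, L + 1):
            for _ in range(L):
                # keep each point with probability 2^(-t); this yields O(log^2 n) subsets
                subset = [p for p in points if random.random() < 2.0 ** (-t)]
                if not subset:
                    subset = [random.choice(points)]
                self.anchors.append(subset)
        if verbose:
            print("number of coordinates:", len(self.anchors))
        return self

    def predict(self, points, verbose=False):
        # one coordinate per subset A: the distance from x to A, i.e. min over a in A of dist(x, a)
        vectors = []
        for idx, x in enumerate(points):
            if verbose and idx % 50 == 0:
                print("embedding point", idx)
            vectors.append([min(self.dist(x, a) for a in A) for A in self.anchors])
        return vectors

The number of coordinates grows like \(O(\log^2 n)\), which matches the dimension of the randomized construction in Bourgain's proof.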

R script for computing the most similar cars (k-nearest neighbours):

library(FNN)
d <- read.csv("tt-features.csv",sep=",",header=F)
set.seed(12345)
rs <- sample(seq(1,dim(d)[1]))
N <- dim(d)[1]
vars <- seq(3,dim(d)[2])   # feature columns; V1 is the car name, V2 is mpg
k <- 3
knns <- get.knnx(data=d[,vars],query=d[,vars],k=k)

simProds <- data.frame()
for(i in seq(1,N)){
    # nearest neighbours of car i (note: the nearest neighbour is the car itself)
    knn <- knns$nn.index[i,]
    result <- d[knn,c("V1","V2")]     # names and mpg values of the neighbours
    qr <- d[i,c("V1","V2")]           # name and mpg of the query car
    names(result) <- c("V3","V4")
    simProds <- rbind(simProds,cbind(qr,result))
}


write.table(simProds,"similarCars.csv",sep=",",row.names=F,col.names=F)
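The visualization, dimensionality reduction and clustering steps mentioned above can be run on the same feature file, for example with UMAP and k-means. The following sketch assumes the umap-learn, scikit-learn and matplotlib packages and is not part of the original scripts; the number of clusters is chosen arbitrarily for illustration:

# Sketch: UMAP for a 2D plot and k-means clustering on the embedded features (illustration only).
import csv
import umap
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

with open("tt-features.csv") as f:
    rows = list(csv.reader(f))
X = [[float(v) for v in row[2:]] for row in rows]   # columns 0 and 1 are car name and mpg

X2 = umap.UMAP(n_components=2, random_state=12345).fit_transform(X)
labels = KMeans(n_clusters=5, random_state=12345).fit_predict(X2)   # 5 clusters chosen arbitrarily

plt.scatter(X2[:, 0], X2[:, 1], c=labels, s=10)
plt.title("Auto MPG cars after Bourgain embedding and UMAP")
plt.savefig("cars-umap.png")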

Results of the similarity computation

Here are some similar cars:

I am not an expert in cars, but the first similarity result shows that the Chevrolet Chevelle Malibu (https://de.wikipedia.org/wiki/Chevrolet_Chevelle) and the Chevrolet Monte Carlo (https://de.wikipedia.org/wiki/Chevrolet_Monte_Carlo) are also related according to their Wikipedia articles, so although this is no mathematical proof, the method cannot be that bad. ;)