Feature Preprocessing on Kaggle

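I recently got started with data science and wanted to try Kaggle on my own. After working through the beginner Titanic and House Prices projects, I found that writing a basic baseline is manageable, but getting the details right, and reaching a score worth showing, still takes some theoretical analysis.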

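This post introduces the principles and techniques used in Kaggle competitions. Everything here comes from the Coursera course listed below; since the course is taught in English and is fairly easy to follow, the notes are written in English directly.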

  • Reference
    How to Win a Data Science Competition: Learn from Top Kagglers

Features: numeric, categorical, ordinal, datetime, coordinate, text

Numeric features

For preprocessing purposes, models fall into two groups: tree-based models and non-tree-based models.

 

Scaling

For example: if we apply the KNN algorithm to the instances below, as we see in the second row, we calculate the distance between the instance and the object. The feature with the largest scale clearly dominates the distance.

 

Tree-based models don't depend on scaling

Non-tree-based models depend heavily on scaling

How to do

sklearn:

  1. To [0,1]
    sklearn.preprocessing.MinMaxScaler
    X = ( X-X.min( ) )/( X.max()-X.min() )
  2. To mean=0, std=1
    sklearn.preprocessing.StandardScaler
    X = ( X-X.mean( ) )/X.std()

    • If you want to use KNN, you can go one step further: the larger a feature's scale, the more it influences the distance, and so the more important it is for KNN. So you can tune the scaling factor of each feature to boost the features that seem more important and check whether this helps
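
The two scalers and the KNN weighting idea can be sketched as follows. This is a minimal illustration on made-up data: the array `X` and the value of `weight` are assumptions chosen for the example, not values from the course.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: column 0 is on a small scale, column 1 on a large one.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])

# Euclidean distance between the first two rows before scaling:
# the large-scale column dominates it almost entirely.
raw_dist = np.linalg.norm(X[0] - X[1])

X_minmax = MinMaxScaler().fit_transform(X)      # each column mapped to [0, 1]
X_standard = StandardScaler().fit_transform(X)  # each column: mean 0, std 1

# After scaling, both columns contribute comparably to the distance.
scaled_dist = np.linalg.norm(X_minmax[0] - X_minmax[1])

# Boosting one feature for KNN: multiply it by a weight.
weight = 3.0  # hypothetical value; in practice it would be tuned on validation data
X_weighted = X_minmax * np.array([weight, 1.0])
```

Note that `StandardScaler` divides by the population standard deviation, so its output can differ slightly from the pandas expression `X.std()`, which uses the sample standard deviation by default.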

Outliers

Outliers can make a model's fit deviate from the trend followed by the rest of the data.

We can clip feature values between a chosen lower bound and upper bound
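
A minimal sketch of clipping, assuming the bounds are taken as the 1st and 99th percentiles of the feature (a common choice, but the percentiles and the data here are assumptions for illustration):

```python
import numpy as np

# Hypothetical feature with one extreme outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])

# Choose the bounds from the data itself (assumed: 1st and 99th percentiles).
lower, upper = np.percentile(x, [1, 99])

# Values below `lower` become `lower`, values above `upper` become `upper`.
x_clipped = np.clip(x, lower, upper)
```

On real data the bounds would be computed on the training set and then reused unchanged on the test set.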

  • Rank Transformation

If we have outliers, it behaves better than min-max scaling: it makes the spacing between sorted values equal, which moves the outliers closer to the other objects

Linear models, KNN, and neural networks will benefit from this method.

rank([-100, 0, 1e5]) == [0,1,2]  
rank([1000,1,10]) == [2,0,1]

scipy:

scipy.stats.rankdata
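
The two rank examples above can be reproduced with `scipy.stats.rankdata`. It returns 1-based ranks, so 1 is subtracted here to match the 0-based notation used in these notes:

```python
from scipy.stats import rankdata

# rankdata starts counting at 1; subtract 1 to get 0-based ranks.
ranks_a = rankdata([-100, 0, 1e5]) - 1
ranks_b = rankdata([1000, 1, 10]) - 1
```

By default tied values receive the average of their ranks. To rank test data consistently, one option is to concatenate train and test before ranking.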

  • Other method

    1. Log transform: np.log(1 + x)
    2. Raising to the power < 1: np.sqrt(x + 2/3)
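
Both transforms are one-liners in NumPy. A small sketch on made-up values (the array `x` is an assumption for illustration); `np.log1p(x)` computes `np.log(1 + x)`:

```python
import numpy as np

# Hypothetical non-negative feature spanning several orders of magnitude.
x = np.array([0.0, 9.0, 99.0, 9999.0])

x_log = np.log1p(x)          # same as np.log(1 + x)
x_sqrt = np.sqrt(x + 2 / 3)  # power transform with exponent 1/2
```

Both pull large values closer to the rest of the data while spreading out the values near zero.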

Feature Generation

Depends on

a. Prior knowledge
b. Exploratory data analysis


Ordinal features

Examples:

  • Ticket class: 1,2,3
  • Driver’s license: A, B, C, D
  • Education: kindergarten, school, undergraduate, bachelor, master, doctoral

Processing

1.Label Encoding

  • Alphabetical (sorted)
    [S,C,Q] -> [3, 1, 2]

sklearn.preprocessing.LabelEncoder

  • Order of appearance
    [S,C,Q] -> [1, 2, 3]

pandas.factorize

Both variants work fine with tree-based models, because trees can split on the encoded feature and extract most of the useful values in the categories on their own. Non-tree-based models, on the other hand, usually can't use this feature effectively.
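
A short sketch comparing the two label encodings on a made-up `embarked` series (the data is an assumption for illustration). Both functions return 0-based codes, so the numbers are shifted by one relative to the 1-based notation above:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column.
embarked = pd.Series(["S", "C", "Q", "S"])

# Alphabetical (sorted) order: C -> 0, Q -> 1, S -> 2
alphabetical = LabelEncoder().fit_transform(embarked)

# Order of appearance: S -> 0, C -> 1, Q -> 2
codes, uniques = pd.factorize(embarked)
```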

2.Frequency Encoding
[S,C,Q] -> [0.5, 0.3, 0.2]

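encoding = titanic.groupby('Embarked').size()  
encoding = encoding/len(titanic)  
titanic['enc'] = titanic.Embarked.map(encoding)

If several categories have the same frequency, they become indistinguishable after this encoding. A rank transform can be applied to deal with such ties:

from scipy.stats import rankdata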

Frequency encoding can also help linear models:
if the frequency of a category is correlated with the target value, a linear model can make use of this dependency.

3.One-hot Encoding

pandas.get_dummies

It gives every category of a feature its own new column and is often used for non-tree-based models.
With many categories it can slow down tree-based models, and the resulting matrix is mostly zeros, so it is usually stored as a sparse matrix. Most libraries, such as XGBoost and LightGBM, can work with sparse matrices directly.
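
A minimal sketch of dense and sparse one-hot encoding on a made-up column (the DataFrame is an assumption for illustration). `sklearn.preprocessing.OneHotEncoder` is used for the sparse variant; it is not mentioned in the notes above but returns a SciPy sparse matrix by default:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column.
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Dense one-hot encoding: one column per category (C, Q, S).
dense = pd.get_dummies(df["Embarked"])

# Sparse one-hot encoding: only the non-zero entries are stored.
sparse = OneHotEncoder().fit_transform(df[["Embarked"]])
```

Each row has exactly one non-zero entry, so the sparse form saves the most memory when a feature has many categories.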

Feature generation

Interactions of categorical features can help linear models and KNN

By concatenating string
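
A small sketch of an interaction feature built by string concatenation, followed by one-hot encoding. The column names `pclass` and `sex` and the rows are assumptions for illustration:

```python
import pandas as pd

# Hypothetical data with two categorical features.
df = pd.DataFrame({
    "pclass": [3, 1, 3],
    "sex": ["male", "female", "female"],
})

# Concatenate the two features as strings to form one combined category.
df["pclass_sex"] = df["pclass"].astype(str) + "_" + df["sex"]

# One-hot encode the combination so a linear model gets one weight per pair.
interactions = pd.get_dummies(df["pclass_sex"])
```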



Datetime and Coordinates

Date and time

1.Periodicity
2.Time since

a. Row-independent moment  
For example: since 00:00:00 UTC, 1 January 1970
b. Row-dependent important moment  
Number of days left until the next holiday / time passed since the last holiday.

3.Difference between dates

We can add a date_diff feature which indicates the number of days between two events
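
The three kinds of datetime features can be sketched with pandas. The column names `purchase` and `last_call` and the dates are assumptions made up for this example:

```python
import pandas as pd

# Hypothetical data: two dated events per row.
df = pd.DataFrame({
    "purchase": pd.to_datetime(["2017-01-02 08:30", "2017-03-15 21:00"]),
    "last_call": pd.to_datetime(["2017-01-10", "2017-03-20"]),
})

# 1. Periodicity: position inside a repeating cycle.
df["weekday"] = df["purchase"].dt.dayofweek  # Monday is 0
df["hour"] = df["purchase"].dt.hour

# 2. Time since a row-independent moment (the Unix epoch).
df["days_since_epoch"] = (df["purchase"] - pd.Timestamp("1970-01-01")).dt.days

# 3. Difference between two dates in the same row, in whole days.
df["date_diff"] = (df["last_call"] - df["purchase"]).dt.days
```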

Coordinates

1.Interesting places from train/test data or additional data

Generate the distance from the instance to a notable place, such as a flat or an old building (anything that is meaningful)

2.Aggregated statistics

For example, the price of surrounding buildings

3.Rotation

Sometimes rotating the coordinates lets the model classify the instances more precisely.
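
A minimal sketch of two coordinate features: rotated coordinates and the distance to a landmark. The points, the 45-degree angle, and the landmark position are all assumptions for illustration. Rotation can help trees in particular, because their splits are axis-aligned and a diagonal boundary may become a single split after rotating.

```python
import numpy as np

# Hypothetical (x, y) coordinates lying on a diagonal line.
coords = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])

# Rotate every point by 45 degrees counter-clockwise.
theta = np.deg2rad(45)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
rotated = coords @ rotation.T

# Distance from each point to a hypothetical place of interest.
landmark = np.array([1.0, 0.0])
dist_to_landmark = np.linalg.norm(coords - landmark, axis=1)
```

After the rotation all three points share the same first coordinate, so the diagonal line has become parallel to an axis.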



Missing data

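Hidden NaN, numeric

A histogram of the feature can reveal a value that was used to fill in missing entries, for example an isolated spike at -1 while all other values lie in a different range.

In that case -1 is clearly a hidden NaN: it carries no meaning for this feature.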
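
Once a hidden NaN has been identified, it can be turned back into a real missing value. A minimal sketch, assuming the placeholder is -1 and using a made-up series:

```python
import numpy as np
import pandas as pd

# Hypothetical feature whose real values lie in [0, 1]; -1 marks missing entries.
s = pd.Series([0.2, -1.0, 0.7, -1.0, 0.5])

# Replace the placeholder with NaN so later steps treat it as missing.
s_clean = s.replace(-1.0, np.nan)
n_missing = s_clean.isnull().sum()
```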

Fillna approaches

1.-999,-1,etc(outside the feature range)

It is useful because it gives trees the possibility to put missing values into a separate category. The downside is that the performance of linear models and neural networks can suffer.

2.mean,median

This method is usually beneficial for simple linear models and neural networks. But for trees it can become harder to select the objects which had missing values in the first place.

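3.Reconstruct:

  • Isnull

Add a binary isnull feature indicating which rows had a missing value.

  • Prediction

Reconstruct the missing value from the other data where possible.

  • Replace the missing data with the mean or median grouped by another feature.

But this can go wrong: if the missing values were already filled with a placeholder such as -999 before the statistics are computed, the placeholder distorts the group means.

The way to handle this is to ignore missing values while calculating the mean for each category.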
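
The isnull indicator, the group-wise fill, and the pitfall of filling too early can be sketched together. The DataFrame and its column names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric feature with a missing value, one category.
df = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "value": [1.0, np.nan, 4.0, 6.0],
})

# Binary indicator of which rows were missing.
df["value_isnull"] = df["value"].isnull().astype(int)

# Group means computed on the raw column: pandas skips NaN by default.
group_mean = df.groupby("category")["value"].transform("mean")
df["value_filled"] = df["value"].fillna(group_mean)

# Counter-example: filling with -999 first distorts the mean of category A.
bad_mean = df["value"].fillna(-999).groupby(df["category"]).mean()
```

Here the mean of category A is 1.0 when the missing value is ignored, but -499.0 when the placeholder is averaged in.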

  • Treating values which are not present in the train data

Generate a new feature indicating the number of occurrences of each value in the data (frequency)
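
One way to do this is to count occurrences over train and test together, so that a category seen only in the test set still gets a number. A minimal sketch on made-up frames (the names `train`, `test`, and `cat` are assumptions):

```python
import pandas as pd

# Hypothetical data: category "C" appears only in the test set.
train = pd.DataFrame({"cat": ["A", "A", "B"]})
test = pd.DataFrame({"cat": ["A", "C", "C"]})

# Count occurrences over train and test combined.
counts = pd.concat([train["cat"], test["cat"]]).value_counts()

train["cat_count"] = train["cat"].map(counts)
test["cat_count"] = test["cat"].map(counts)
```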


  • XGBoost can handle NaN

4.Remove rows with missing values

This one is possible, but it can lead to loss of important samples and a quality decrease.


Text

Bag of words

Text preprocessing

1.Lowercase

2.Lemmatization and Stemming

3.Stopwords

Examples:
1.Articles or prepositions
2.Very common words

sklearn.feature_extraction.text.CountVectorizer:
max_df

  • max_df : float in range [0.0, 1.0] or int, default=1.0
    When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

CountVectorizer

The number of times a term occurs in a given document

sklearn.feature_extraction.text.CountVectorizer
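
A small sketch of `CountVectorizer` with lowercasing and `max_df`. The three documents and the threshold 0.9 are assumptions for illustration: "the" occurs in every document, so its document frequency exceeds the threshold and it is dropped as a corpus-specific stop word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tiny corpus.
docs = ["The cat sat", "The dog sat", "The bird flew"]

# lowercase=True is the default; max_df=0.9 drops terms found in more
# than 90% of the documents.
vectorizer = CountVectorizer(lowercase=True, max_df=0.9)
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

vocab = sorted(vectorizer.vocabulary_)
```

Each row of `counts` holds the number of times each remaining term occurs in the corresponding document.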

TF-IDF

It re-weights the count features into floating-point values suitable for use by a classifier

  • Term frequency
    tf = 1 / x.sum(axis=1) [:,None]
    x = x * tf

  • Inverse Document Frequency
    idf = np.log(x.shape[0] / (x > 0).sum(0))
    x = x * idf
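
The two steps above can be run end to end on a small count matrix. The matrix `x` is made up for illustration (rows are documents, columns are terms):

```python
import numpy as np

# Hypothetical document-term count matrix: 3 documents, 3 terms.
x = np.array([[2.0, 0.0, 2.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 4.0]])

# Term frequency: normalise each row so it sums to 1.
tf = 1 / x.sum(axis=1)[:, None]
x_tf = x * tf

# Inverse document frequency: rarer terms get a larger weight.
idf = np.log(x.shape[0] / (x > 0).sum(axis=0))
x_tfidf = x_tf * idf
```

The second term appears in only one document, so it receives the largest IDF weight. `sklearn.feature_extraction.text.TfidfVectorizer` uses a smoothed variant of this formula by default, so its numbers differ from this plain version.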

N-gram


sklearn.feature_extraction.text.CountVectorizer:
ngram_range, analyzer

  • ngram_range : tuple (min_n, max_n)
    The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
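
A short sketch of word-level and character-level n-grams with `CountVectorizer`. The example sentence is an assumption; note that the default tokenizer drops single-character words such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Word unigrams and bigrams.
word_vec = CountVectorizer(ngram_range=(1, 2))
word_vec.fit(["this is a sentence"])
word_features = sorted(word_vec.vocabulary_)

# Character bigrams, selected with the analyzer parameter.
char_vec = CountVectorizer(analyzer="char", ngram_range=(2, 2))
char_vec.fit(["this"])
char_features = sorted(char_vec.vocabulary_)
```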

Embeddings(~word2vec)

It converts each word to a vector in some sophisticated space, which usually has several hundred dimensions

a. Relatively small vectors

b. Values in the vector can be interpreted only in some cases

c. Words with similar meanings often have similar embeddings


 

Reposted from: https://www.cnblogs.com/bjwu/p/8970821.html
