Code: https://github.com/abhayspawar/featexp — see the write-up "Industry | How to reach the top 2% in a Kaggle competition? A post on feature-exploration experience".
Reference: https://www.cnblogs.com/jasonfreak/p/5448385.html
Feature engineering is essentially an engineering activity: its goal is to extract as much useful information as possible from raw data, in the form of features that algorithms and models can use.
Class | Category | Description |
---|---|---|
preprocessing.StandardScaler | Scaling (making features dimensionless) | Standardization: works column-wise on the feature matrix, transforming feature values so that they follow a standard normal distribution |
preprocessing.MinMaxScaler | Scaling (making features dimensionless) | Min-max scaling: rescales feature values into the [0, 1] interval based on the minimum and maximum |
preprocessing.Normalizer | Normalization | Works row-wise on the feature matrix, rescaling each sample vector to a unit vector |
preprocessing.Binarizer | Binarization | Splits quantitative features at a given threshold |
preprocessing.OneHotEncoder | One-hot encoding | Converts categorical (qualitative) data into quantitative data |
preprocessing.Imputer | Missing-value imputation | Fills in missing values, e.g. with the mean |
preprocessing.PolynomialFeatures | Polynomial transformation | Generates polynomial feature combinations |
preprocessing.FunctionTransformer | Custom transformation | Applies a user-supplied univariate function to transform the data |
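A minimal sketch of most of these transformers, assuming scikit-learn is installed; the iris dataset and the threshold value are only placeholders:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import (StandardScaler, MinMaxScaler, Normalizer, Binarizer,
                                   OneHotEncoder, PolynomialFeatures, FunctionTransformer)

iris = load_iris()
X = iris.data                                        # shape (150, 4), quantitative features

X_std = StandardScaler().fit_transform(X)            # column-wise: zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(X)           # column-wise: rescaled to [0, 1]
X_norm = Normalizer().fit_transform(X)               # row-wise: each sample becomes a unit vector
X_bin = Binarizer(threshold=3.0).fit_transform(X)    # 1 if value > 3.0 else 0 (threshold is arbitrary)
X_onehot = OneHotEncoder().fit_transform(iris.target.reshape(-1, 1))  # categorical -> one-hot
X_poly = PolynomialFeatures(degree=2).fit_transform(X)                # polynomial combinations
X_log = FunctionTransformer(np.log1p).fit_transform(X)                # custom univariate transform
```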
A simple variance-based method:
\[
D(x) = E(x^2)-[E(x)]^2
\]
In Python this can be written as:

```python
import math
import numpy

nlist = [1.0, 2.0, 3.0, 4.0, 5.0]   # example data

N = len(nlist)
narray = numpy.array(nlist)
sum1 = narray.sum()                 # sum of x
narray2 = narray * narray
sum2 = narray2.sum()                # sum of x^2
mean = sum1 / N                     # E(x)
var = sum2 / N - mean ** 2          # D(x) = E(x^2) - [E(x)]^2
stdv = math.sqrt(var)

print(mean)
print(var)
print(stdv)
```
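As a quick cross-check (not in the original notes), numpy's built-ins compute the same population statistics:

```python
import numpy as np

nlist = [1.0, 2.0, 3.0, 4.0, 5.0]    # same example data as above
print(np.mean(nlist))                # E(x)
print(np.var(nlist))                 # population variance (ddof=0), matches var above
print(np.std(nlist))                 # population standard deviation, matches stdv above
```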
Class | Category | Description |
---|---|---|
feature_selection.VarianceThreshold | Filter | Variance-threshold selection: drops features whose variance is below a threshold |
feature_selection.SelectKBest | Filter | Selects the K best features according to a scoring function (e.g. the chi-square test) |
feature_selection.RFE | Wrapper | Recursive feature elimination: recursively trains a base model (e.g. sklearn.linear_model.LogisticRegression) and removes the features with the smallest weights from the feature set |
feature_selection.SelectFromModel | Embedded | Trains a base model and keeps the features with the larger weights |
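A minimal sketch of the four selectors on the iris data, assuming scikit-learn; the threshold, K, and number of features to keep are arbitrary choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

X_var = VarianceThreshold(threshold=0.3).fit_transform(X)      # Filter: drop low-variance features
X_kbest = SelectKBest(chi2, k=2).fit_transform(X, y)           # Filter: top-2 features by chi-square score
X_rfe = RFE(LogisticRegression(max_iter=1000),
            n_features_to_select=2).fit_transform(X, y)        # Wrapper: recursive feature elimination
X_model = SelectFromModel(LogisticRegression(max_iter=1000)).fit_transform(X, y)  # Embedded: keep large-weight features
```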
Class | Category | Description |
---|---|---|
decomposition.PCA | PCA | Principal component analysis: an unsupervised dimensionality-reduction method that maps samples so that they are as spread out (high-variance) as possible after projection |
lda.LDA | LDA | Linear discriminant analysis: a supervised dimensionality-reduction method that maps samples so that they are as easy to classify as possible after projection |
Two points to consider: whether a feature varies at all (its variance), and how correlated the feature is with the target (for the chi-square test, the smaller \(\chi^2\) is, the more independent the two variables are; the larger it is, the more correlated they are). Depending on how the selection is carried out, feature selection falls into the three kinds of methods listed above (Filter, Wrapper, Embedded):

For example, say you have 10 features and a model a that uses these 10 features to predict some target xxx.

Wrapper: run the 10 features through that model, randomly drop some of them, and look at the effect of each feature (similar to a feature-importance measure).

Embedded: use a separate model (whose training target may not even be xxx) to learn a weight for each feature, filter the features by those weights, and then feed the remaining features into model a (for example, if a is a complex DNN, you could first feed the features into an LR/GBDT trained on some label, and use the learned weights to filter the features).
pca.explained_variance_ratio_: percentage of variance explained by each of the selected components.
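A minimal sketch of both reductions on iris, assuming scikit-learn; note that in recent versions LDA lives in sklearn.discriminant_analysis.LinearDiscriminantAnalysis rather than the lda.LDA path in the table above:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)              # unsupervised: keeps the directions of largest variance
print(pca.explained_variance_ratio_)      # fraction of variance explained by each component

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)           # supervised: keeps the directions that best separate the classes
```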
Reference: http://www.cnblogs.com/jasonfreak/p/5448462.html
Package | Class or method | Description |
---|---|---|
sklearn.pipeline | Pipeline | Sequential (pipeline) processing |
sklearn.pipeline | FeatureUnion | Parallel processing |
sklearn.grid_search | GridSearchCV | Grid search for hyperparameter tuning |
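A minimal sketch combining the three, assuming scikit-learn; in recent versions GridSearchCV has moved from sklearn.grid_search to sklearn.model_selection, which is what the import below uses:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Parallel processing: concatenate PCA components and polynomial features
union = FeatureUnion([('pca', PCA(n_components=2)),
                      ('poly', PolynomialFeatures(degree=2))])

# Sequential processing: scale -> feature union -> classifier
pipe = Pipeline([('scale', StandardScaler()),
                 ('union', union),
                 ('clf', LogisticRegression(max_iter=1000))])

# Grid search over pipeline hyperparameters (step name + '__' + parameter name)
grid = GridSearchCV(pipe,
                    param_grid={'union__pca__n_components': [1, 2, 3],
                                'clf__C': [0.1, 1.0, 10.0]},
                    cv=3)
grid.fit(X, y)
print(grid.best_params_)
```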
Official docs: https://www.tensorflow.org/guide/feature_columns?hl=zh-cn
Reference: https://blog.csdn.net/cjopengler/article/details/78161748
See also https://zhuanlan.zhihu.com/p/41663141
Feature columns are created by calling the tf.feature_column module. They fall into two broad categories: categorical columns and dense columns.
Using indicator_column, the sparse tensor produced by a categorical column can be converted into a dense one-hot or multi-hot tensor.
demo:
```python
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.python.feature_column.feature_column import _LazyBuilder

def test_shared_embedding_column_with_hash_bucket():
    # 4 sample rows
    color_data = {'color': [[2, 2], [5, 5], [0, -1], [0, 0]],
                  'color2': [[2], [5], [-1], [0]]}
    builder = _LazyBuilder(color_data)
    color_column = feature_column.categorical_column_with_hash_bucket('color', 7, dtype=tf.int32)
    color_column_tensor = color_column._get_sparse_tensors(builder)
    color_column2 = feature_column.categorical_column_with_hash_bucket('color2', 7, dtype=tf.int32)
    color_column_tensor2 = color_column2._get_sparse_tensors(builder)
    with tf.Session() as session:
        session.run(tf.global_variables_initializer())
        session.run(tf.tables_initializer())
        print('not use input_layer' + '_' * 40)
        print(session.run([color_column_tensor.id_tensor]))
        print(session.run([color_column_tensor2.id_tensor]))

    # Convert the sparse categorical columns into dense tensors via a shared embedding of size 3
    color_column_embed = feature_column.shared_embedding_columns([color_column2, color_column], 3, combiner='sum')
    print(type(color_column_embed))
    color_dense_tensor = feature_column.input_layer(color_data, color_column_embed)
    with tf.Session() as session:
        session.run(tf.global_variables_initializer())
        session.run(tf.tables_initializer())
        print('use input_layer' + '_' * 40)
        print(session.run(color_dense_tensor))

test_shared_embedding_column_with_hash_bucket()
```
Another example:
```python
def test_categorical_column_with_vocabulary_list():
    # 4 sample rows
    color_data = {'color': [['R', 'R'], ['G', 'R'], ['B', 'G'], ['A', 'A']]}
    builder = _LazyBuilder(color_data)
    color_column = feature_column.categorical_column_with_vocabulary_list(
        'color', ['R', 'G', 'B'], dtype=tf.string, default_value=-1
    )
    color_column_tensor = color_column._get_sparse_tensors(builder)
    with tf.Session() as session:
        session.run(tf.global_variables_initializer())
        session.run(tf.tables_initializer())
        print(session.run([color_column_tensor.id_tensor]))

    # Convert the sparse tensor into a dense one-hot (here multi-hot) tensor
    color_column_identy = feature_column.indicator_column(color_column)
    color_dense_tensor = feature_column.input_layer(color_data, [color_column_identy])
    with tf.Session() as session:
        session.run(tf.global_variables_initializer())
        session.run(tf.tables_initializer())
        print('use input_layer' + '_' * 40)
        print(session.run([color_dense_tensor]))

test_categorical_column_with_vocabulary_list()
```
For feature crossing on multi-valued features, see the excellent code at https://github.com/Lapis-Hong/wide_deep/blob/master/python/lib/dataset.py#L152:
```python
_CSV_COLUMNS = [
    "ad_account_id", "education", "bid",
    "show"]
_CSV_COLUMN_DEFAULTS = [
    ['-1'], ['-1'], [0.0],
    [0.0]]

# Only ad_product_id is shown here; the other column objects (ad_account_id, bid, ...)
# are defined the same way in the full script.
ad_product_id = tf.feature_column.categorical_column_with_hash_bucket(
    'ad_product_id', hash_bucket_size=12000)

base_columns = [
    ad_account_id, ]
crossed_columns = [
    tf.feature_column.crossed_column(
        ['ad_product_id', 'education_cross2'],
        hash_bucket_size=2000),
]
wide_columns = base_columns + crossed_columns
deep_columns = [
    bid,
]

def input_fn(data_file, num_epochs, shuffle, batch_size):
    def parse_csv(value):
        tf.logging.info('Parsing {}'.format(data_file))
        columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)
        features = dict(zip(_CSV_COLUMNS, columns))
        # features = {"bid": csv_decode_obj, "show": csv_decode_obj, ...}
        # split the multi-valued string column (e.g. "a:b:c") into a variable-length feature;
        # index 5 refers to the education column in the full CSV of the original script
        features["education_cross2"] = tf.string_split(columns[5:6], delimiter=":").values
        labels = features.pop('show')
        return features, labels
```
However, batching this still runs into problems, so it needs to be changed as follows (note: if the first element passed to padded_shapes is a dict, it is sorted by key, so "bid" ends up after "age"):
```python
_CSV_COLUMNS = [
    "ad_account_id", "education", "age", "bid",
    "show"]
_CSV_COLUMN_DEFAULTS = [
    ['-1'], ['-1'], ['-1'], [0.0],
    [0.0]]

ad_product_id = tf.feature_column.categorical_column_with_hash_bucket(
    'ad_product_id', hash_bucket_size=12000)

base_columns = [
    ad_account_id, ]
crossed_columns = [
    tf.feature_column.crossed_column(
        ['ad_product_id', 'education'],
        hash_bucket_size=2000),
]
wide_columns = base_columns + crossed_columns
deep_columns = [
    bid,
]

def input_fn(data_file, num_epochs, shuffle, batch_size):
    def parse_csv(value):
        tf.logging.info('Parsing {}'.format(data_file))
        columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)
        features = dict(zip(_CSV_COLUMNS, columns))
        # features = {"bid": csv_decode_obj, "show": csv_decode_obj, ...}
        # indices 5 and 6 refer to the education / age columns in the full CSV of the original script
        features["education"] = tf.string_split(columns[5:6], delimiter=":").values
        features["age"] = tf.string_split(columns[6:7], delimiter=":").values
        labels = features.pop('show')
        return features, labels

    # Extract lines from input files using the Dataset API.
    # _NUM_EXAMPLES is defined elsewhere in the original script.
    dataset = tf.data.TextLineDataset(data_file)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=_NUM_EXAMPLES['train'])
    dataset = dataset.map(parse_csv, num_parallel_calls=5)

    g_features = _CSV_COLUMNS[:-1]
    padded_dict = {k: [] for k in g_features}
    # [-1] pads each variable-length feature to the longest length in the batch; using a fixed
    # size (e.g. pad to length 5) can raise
    # "Attempted to pad to a smaller size than the input element."
    padded_dict["age"] = [-1]
    padded_dict["education"] = [-1]
    # The label is a scalar, so it needs no padding (the original code had an identical
    # if/else on mode == "pred" here).
    padded_dict = (padded_dict, [])
    dataset = dataset.padded_batch(batch_size, padded_shapes=padded_dict)
    # , drop_remainder=True)  # .filter(lambda fea, lab: tf.equal(tf.shape(lab)[0], batch_size))

    # We call repeat after shuffling, rather than before, to prevent separate
    # epochs from blending together.
    dataset = dataset.repeat(num_epochs)
    # dataset = dataset.batch(batch_size)  # cannot be used when there are variable-length
    # features; use padded_batch above instead
```
In addition, there are some experiments and practice code at https://github.com/daiwk/grace_t/tree/master/python/grace_t/basic_demos.
Reference: https://zhuanlan.zhihu.com/p/32699487
Reference: https://github.com/PaddlePaddle/models/blob/develop/PaddleRec/ctr/preprocess.py
```python
import collections
import sys

# continous_clip is a module-level list of per-feature clip thresholds
# defined elsewhere in the original preprocess.py script.

class CategoryDictGenerator:
    """
    Generate dictionary for each of the categorical features
    """

    def __init__(self, num_feature):
        self.dicts = []
        self.num_feature = num_feature
        for i in range(0, num_feature):
            self.dicts.append(collections.defaultdict(int))

    def build(self, datafile, categorial_features, cutoff=0):
        with open(datafile, 'r') as f:
            for line in f:
                features = line.rstrip('\n').split('\t')
                for i in range(0, self.num_feature):
                    if features[categorial_features[i]] != '':
                        self.dicts[i][features[categorial_features[i]]] += 1
        for i in range(0, self.num_feature):
            # keep only categories with frequency >= cutoff, sort by frequency (desc) then key
            self.dicts[i] = filter(lambda x: x[1] >= cutoff,
                                   self.dicts[i].items())
            self.dicts[i] = sorted(self.dicts[i], key=lambda x: (-x[1], x[0]))
            vocabs, _ = list(zip(*self.dicts[i]))
            # map categories to ids starting from 1; 0 is reserved for unknown values
            self.dicts[i] = dict(zip(vocabs, range(1, len(vocabs) + 1)))
            self.dicts[i]['<unk>'] = 0

    def gen(self, idx, key):
        if key not in self.dicts[idx]:
            res = self.dicts[idx]['<unk>']
        else:
            res = self.dicts[idx][key]
        return res

    def dicts_sizes(self):
        return list(map(len, self.dicts))


class ContinuousFeatureGenerator:
    """
    Normalize the integer features to [0, 1] by min-max normalization
    """

    def __init__(self, num_feature):
        self.num_feature = num_feature
        self.min = [sys.maxsize] * num_feature
        self.max = [-sys.maxsize] * num_feature

    def build(self, datafile, continous_features):
        with open(datafile, 'r') as f:
            for line in f:
                features = line.rstrip('\n').split('\t')
                for i in range(0, self.num_feature):
                    val = features[continous_features[i]]
                    if val != '':
                        val = int(val)
                        # clip outliers to the per-feature threshold
                        if val > continous_clip[i]:
                            val = continous_clip[i]
                        self.min[i] = min(self.min[i], val)
                        self.max[i] = max(self.max[i], val)

    def gen(self, idx, val):
        if val == '':
            return 0.0
        val = float(val)
        return (val - self.min[idx]) / (self.max[idx] - self.min[idx])
```
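A hedged usage sketch of the two generators; the file path, column indices, and continous_clip values below are made up for illustration, the real ones live in the original preprocess.py:

```python
# Hypothetical toy data: columns are [int_feat0, int_feat1, cat_feat0, cat_feat1], tab-separated.
with open('toy_train.txt', 'w') as f:          # 'toy_train.txt' is a made-up path
    f.write('3\t27\tabc\txyz\n')
    f.write('5\t150\tabc\tuvw\n')
    f.write('1\t42\tdef\txyz\n')

continous_clip = [100, 100]                    # made-up clip thresholds
continous_features = [0, 1]                    # column indices of the continuous features
categorial_features = [2, 3]                   # column indices of the categorical features

cont_gen = ContinuousFeatureGenerator(num_feature=2)
cont_gen.build('toy_train.txt', continous_features)

cat_gen = CategoryDictGenerator(num_feature=2)
cat_gen.build('toy_train.txt', categorial_features, cutoff=0)

# Encode one raw row into model-ready values
raw = '3\t27\tabc\txyz'.split('\t')
cont_vals = [cont_gen.gen(i, raw[continous_features[i]]) for i in range(2)]  # floats in [0, 1]
cat_ids = [cat_gen.gen(i, raw[categorial_features[i]]) for i in range(2)]    # ids, 0 = <unk>
print(cont_vals, cat_ids)
```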
Speed up scikit-learn training by more than 100x: the US cashback company Ibotta open-sources the sk-dist framework.
https://github.com/Ibotta/sk-dist