Traditional Models in Recommender Systems: LightGBM + FFM Fusion

1 FFM in Depth: Principles and Practice



  • After One-Hot encoding, the features of most samples are sparse.
  • One-Hot encoding also makes the feature space very large.
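
Both points can be checked on a toy example. The following is a minimal sketch in plain Python; the field names and values are invented for illustration, and one_hot is a hypothetical helper rather than a function from any library.

```python
# Toy categorical data: three fields, values invented for illustration.
rows = [
    {"user_city": "beijing", "item_cat": "shoes", "device": "ios"},
    {"user_city": "shanghai", "item_cat": "books", "device": "android"},
    {"user_city": "beijing", "item_cat": "phones", "device": "android"},
    {"user_city": "shenzhen", "item_cat": "shoes", "device": "web"},
]

def one_hot(rows):
    """Hypothetical helper: one-hot encode a list of dicts of categorical values."""
    # One output column per distinct (field, value) pair, in sorted order.
    columns = sorted({(field, value) for row in rows for field, value in row.items()})
    col_index = {col: i for i, col in enumerate(columns)}
    encoded = []
    for row in rows:
        vec = [0] * len(columns)
        for field, value in row.items():
            vec[col_index[(field, value)]] = 1
        encoded.append(vec)
    return columns, encoded

columns, encoded = one_hot(rows)
n_fields = len(rows[0])
n_columns = len(columns)
nonzero = sum(sum(vec) for vec in encoded)
sparsity = 1 - nonzero / (len(rows) * n_columns)

print(n_fields, "fields expand to", n_columns, "one-hot columns")
print("fraction of zero entries: %.2f" % sparsity)
```

Each row has exactly one non-zero entry per field, so the fraction of zeros grows as the number of distinct values per field grows.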

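Observing a large amount of sample data also shows that certain features, once combined with each other, correlate more strongly with the label. FFM is mainly used to estimate on-site CTR and CVR, that is, the probability that a user clicks an item and the conversion rate after the click.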
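
FFM models such feature combinations directly: every pair of features contributes an interaction term, and each feature keeps a separate latent vector for each field its partner may belong to. The sketch below computes only this second-order term for one sample, following the standard FFM formulation. It is not the libffm implementation and it leaves out the bias and linear terms; the names ffm_interaction and latent, the toy sizes, and the random weights are all assumptions made for illustration.

```python
import math
import random

def ffm_interaction(features, latent):
    """Second-order FFM term for one sample.

    features: list of (field_id, feat_id, value) triples.
    latent:   latent[feat_id][field_id] is the k-dimensional vector that
              feature feat_id uses when paired with a feature of field_id.
    """
    score = 0.0
    for a in range(len(features)):
        f1, j1, x1 = features[a]
        for b in range(a + 1, len(features)):
            f2, j2, x2 = features[b]
            # Feature j1 uses its vector for j2's field, and vice versa.
            w1 = latent[j1][f2]
            w2 = latent[j2][f1]
            dot = sum(p * q for p, q in zip(w1, w2))
            score += dot * x1 * x2
    return score

# Toy setup (assumed sizes): 3 fields, 4 features, latent dimension k = 2.
n_fields, n_features, k = 3, 4, 2
rng = random.Random(0)
latent = [[[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_fields)]
          for _ in range(n_features)]

# One sample in (field_id, feat_id, value) form, i.e. "0:0:1 1:2:1 2:3:0.5".
sample = [(0, 0, 1.0), (1, 2, 1.0), (2, 3, 0.5)]
score = ffm_interaction(sample, latent)
ctr = 1.0 / (1.0 + math.exp(-score))  # sigmoid maps the score to a probability
print("score=%.4f, predicted CTR=%.4f" % (score, ctr))
```

In a trained model the latent vectors are learned from click logs; here they are random, so the printed probability carries no meaning beyond showing the data flow.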


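To use FFM, every feature must be converted to the "field_id:feat_id:value" format, where field_id is the index of the field the feature belongs to, feat_id is the index of the feature, and value is the feature's value.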


2 Example




label field_id:feature_id:value field_id:feature_id:value field_id:feature_id:value …

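  • field_id is the id of each feature field.
  • feature_id is the id of each feature value (sequential encoding or hash encoding can be used).
  • value: when the field is not a continuous feature, value=1; for a continuous feature, value is the feature's own value.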
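
The hash-encoding option mentioned above can be sketched as follows. This is a minimal illustration, not a reference implementation: hashed_feature_id and to_ffm_line are hypothetical helpers, md5 is chosen only because it is stable across processes, and the bucket count is an arbitrary assumption. Hashing can map two different features to the same id, so the bucket count should be large relative to the number of distinct features.

```python
import hashlib

def hashed_feature_id(field_name, raw_value, n_buckets=2 ** 20):
    """Map a (field, value) pair to a stable id in [0, n_buckets)."""
    key = ("%s=%s" % (field_name, raw_value)).encode("utf-8")
    # Python's built-in hash() is salted per process for strings,
    # so a digest is used to keep ids reproducible between runs.
    digest = hashlib.md5(key).hexdigest()
    return int(digest, 16) % n_buckets

def to_ffm_line(label, row, field_ids, continuous=(), n_buckets=2 ** 20):
    """Build one 'label field_id:feature_id:value ...' line from a dict row."""
    parts = [str(label)]
    for name, value in row.items():
        if name in continuous:
            # Continuous field: one id per field, value is the raw number.
            feat_id = hashed_feature_id(name, "", n_buckets)
            parts.append("%d:%d:%s" % (field_ids[name], feat_id, value))
        else:
            # Categorical field: one id per (field, value) pair, value is 1.
            feat_id = hashed_feature_id(name, value, n_buckets)
            parts.append("%d:%d:1" % (field_ids[name], feat_id))
    return " ".join(parts)

field_ids = {"category_feature": 0, "continuous_feature": 1}
row = {"category_feature": "x", "continuous_feature": 1.1}
line = to_ffm_line(0, row, field_ids,
                   continuous={"continuous_feature"}, n_buckets=1000)
print(line)
```

Compared with sequential encoding, no feature dictionary has to be stored or shared between training and prediction, at the cost of possible collisions.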

For data in a pandas DataFrame:

label  category_feature  continuous_feature  vector_feature
=====  ================  ==================  ==============
0           x               1.1               1 2
1           y               1.2               3 4 5   
0           x               2.2               6 7 8 9



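def FFMFormat(df, label, path, train_len, category_feature=None, continuous_feature=None):
    """Write the first train_len rows of df to train.ffm and the rest to test.ffm."""
    category_feature = category_feature or []
    continuous_feature = continuous_feature or []
    feature_index = 0
    feat_index = {}  # feature name -> feature id
    # path is used as a prefix, e.g. 'data/' gives 'data/train.ffm'
    with open(path + 'train.ffm', 'w') as train, open(path + 'test.ffm', 'w') as test:
        for i in range(df.shape[0]):
            feats = []
            field_index = 0
            # Categorical: one feature id per distinct (column, value) pair, value fixed to 1
            for feat in category_feature:
                t = feat + '_' + str(df[feat].iloc[i])
                if t not in feat_index:
                    feat_index[t] = feature_index
                    feature_index = feature_index + 1
                feats.append('%s:%s:%s' % (field_index, feat_index[t], 1))
                field_index = field_index + 1

            # Continuous: one feature id per column, reused on every row
            for feat in continuous_feature:
                if feat not in feat_index:
                    feat_index[feat] = feature_index
                    feature_index = feature_index + 1
                feats.append('%s:%s:%s' % (field_index, feat_index[feat], df[feat].iloc[i]))
                field_index = field_index + 1

            print('%s %s' % (df[label].iloc[i], ' '.join(feats)))

            if i < train_len:
                train.write('%s %s\n' % (df[label].iloc[i], ' '.join(feats)))
            else:
                # Test rows are written without a label
                test.write('%s\n' % ' '.join(feats))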

Note that the leaf-node data produced by LightGBM is discrete, so it is passed in as category_feature.
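
To make the leaf-node step concrete, the sketch below turns a matrix of leaf indices into FFM lines. In LightGBM such a matrix, with one row per sample and one column per tree, can be obtained by calling predict with pred_leaf=True; here a small hand-written matrix stands in for it so that the example runs without a trained model. Treating each tree as one field and each (tree, leaf) pair as one feature is an assumption of this sketch (a common choice for GBDT + FFM fusion), and leaves_to_ffm is a hypothetical helper name.

```python
def leaves_to_ffm(labels, leaf_matrix):
    """Encode leaf indices as FFM lines.

    Assumption of this sketch: field = tree index,
    feature = (tree, leaf) pair, value = 1.
    """
    feat_index = {}  # (tree_id, leaf_id) -> feature id
    lines = []
    for label, leaves in zip(labels, leaf_matrix):
        parts = [str(label)]
        for tree_id, leaf_id in enumerate(leaves):
            key = (tree_id, leaf_id)
            if key not in feat_index:
                feat_index[key] = len(feat_index)
            parts.append("%d:%d:1" % (tree_id, feat_index[key]))
        lines.append(" ".join(parts))
    return lines, feat_index

# Stand-in for the (n_samples, n_trees) leaf-index matrix of a 3-tree model.
leaf_matrix = [
    [3, 0, 5],
    [1, 0, 5],
    [3, 2, 4],
]
labels = [0, 1, 0]

lines, feat_index = leaves_to_ffm(labels, leaf_matrix)
for line in lines:
    print(line)
# 0 0:0:1 1:1:1 2:2:1
# 1 0:3:1 1:1:1 2:2:1
# 0 0:0:1 1:4:1 2:5:1
```

The same leaf index in two different trees refers to different leaves, which is why the feature key includes the tree index.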

3 Kaggle: Pandas to libffm


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.datasets import make_classification

Another CTR competition, so I suspect libffm will play its part; after all, it is an atomic bomb for this kind of stuff.
A scikit-learn inspired script to convert pandas DataFrames into libFFM-style data.

The script is fairly hacky (hey, that's Kaggle) and takes a little while to run on a huge dataset.
The key to using this class is setting up the feature dtypes correctly for output (amend transform to suit your needs).

Example below:


class FFMFormatPandas:
    def __init__(self):
        self.field_index_ = None
        self.feature_index_ = None
        self.y = None

    def fit(self, df, y=None):
        self.y = y
        df_ffm = df[df.columns.difference([self.y])]
        if self.field_index_ is None:
            self.field_index_ = {col: i for i, col in enumerate(df_ffm)}

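        if self.feature_index_ is not None:
            # Refit: continue numbering after the largest id already assigned
            last_idx = max(list(self.feature_index_.values())) + 1

        if self.feature_index_ is None:
            self.feature_index_ = dict()
            last_idx = 0

        # Index feature columns only; the label column is excluded
        for col in df_ffm.columns:
            vals = df[col].unique()
            for val in vals:
                if pd.isnull(val):
                    continue
                name = '{}_{}'.format(col, val)
                if name not in self.feature_index_:
                    self.feature_index_[name] = last_idx
                    last_idx += 1
            # One extra id per column, used by transform for numeric columns
            if col not in self.feature_index_:
                self.feature_index_[col] = last_idx
                last_idx += 1
        return self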

    def fit_transform(self, df, y=None):
        self.fit(df, y)
        return self.transform(df)

    def transform_row_(self, row, t):
        ffm = []
        # The first token is the label; use 0 as a placeholder when there is none
        if self.y is not None:
            ffm.append(str(row[self.y]))
        else:
            ffm.append(str(0))

        for col, val in row.loc[row.index != self.y].to_dict().items():
            col_type = t[col]
            name = '{}_{}'.format(col, val)
            if col_type.kind == 'O':
                # Object (string) column: categorical, value fixed to 1
                ffm.append('{}:{}:1'.format(self.field_index_[col], self.feature_index_[name]))
            elif col_type.kind == 'i':
                # Integer column: numeric, one feature id per column
                ffm.append('{}:{}:{}'.format(self.field_index_[col], self.feature_index_[col], val))
            # Other dtypes (e.g. float) are skipped; extend as needed
        return ' '.join(ffm)

    def transform(self, df):
        t = df.dtypes.to_dict()
        return pd.Series({idx: self.transform_row_(row, t) for idx, row in df.iterrows()})

########################### Let's build some data and test ###########################

train, y = make_classification(n_samples=100, n_features=5, n_informative=2, n_redundant=2, n_classes=2, random_state=42)

train=pd.DataFrame(train, columns=['int1','int2','int3','s1','s2'])
train['int1'] = train['int1'].map(int)
train['int2'] = train['int2'].map(int)
train['int3'] = train['int3'].map(int)
train['s1'] = round(np.log(abs(train['s1'] + 1))).map(str)
train['s2'] = round(np.log(abs(train['s2'] + 1))).map(str)
train['clicked'] = y

ffm_train = FFMFormatPandas()
ffm_train_data = ffm_train.fit_transform(train, y='clicked')
print('Base data')
print(train[0:10])
print('FFM data')
print(ffm_train_data[0:10])
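
Once the rows are in libffm format, they are usually written to a text file with one sample per line. The sketch below shows one way to do that with a basic format check. The helpers write_ffm_file and is_valid_ffm_line are hypothetical names, and the sample lines are made up rather than taken from the output above.

```python
import os
import tempfile

def is_valid_ffm_line(line):
    """Return True if line looks like 'label field:feature:value ...'."""
    tokens = line.split()
    if len(tokens) < 2:
        return False
    try:
        float(tokens[0])  # label must be numeric
        for token in tokens[1:]:
            field, feature, value = token.split(":")
            if int(field) < 0 or int(feature) < 0:
                return False
            float(value)
    except ValueError:
        # Wrong number of parts or a non-numeric part
        return False
    return True

def write_ffm_file(lines, path):
    """Write FFM lines to path, one sample per line; reject malformed lines."""
    with open(path, "w") as f:
        for line in lines:
            if not is_valid_ffm_line(line):
                raise ValueError("malformed FFM line: %r" % line)
            f.write(line + "\n")

sample_lines = ["0 0:0:1 1:3:1 2:7:2", "1 0:1:1 1:3:1 2:7:-1"]
out_path = os.path.join(tempfile.mkdtemp(), "train.ffm")
write_ffm_file(sample_lines, out_path)

with open(out_path) as f:
    written = f.read().splitlines()
print(written)
```

For the Series returned by FFMFormatPandas, the corresponding call would be write_ffm_file(ffm_train_data.tolist(), 'train.ffm').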

