Loop over a dictionary of lists and update the corresponding columns - pandas

[Published]: 2020-11-25 23:12:22

[Question]:

I have a DataFrame df and a dictionary of lists, as shown below.

 Date                  Tea_Good       Tea_bad    coffee_good      coffee_bad
2020-02-01             3              1           10                7
2020-02-02             3              1           10                7
2020-02-03             3              1           10                7
2020-02-04             3              1           10                7
2020-02-05             6              1           10                7
2020-02-06             6              2           10                11
2020-02-07             6              2           5                 11
2020-02-08             6              2           5                 11
2020-02-09             9              2           5                 11
2020-02-10             9              2           4                 11
2020-02-11             9              2           4                 11   
2020-02-12             9              2           4                 11         
2020-02-13             9              2           4                 11 
2020-02-14             9              2           4                 11
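
For anyone who wants to reproduce the problem, the table above can be rebuilt roughly as follows. This is only a sketch: it assumes the Date column should be a real datetime column (the original post does not show its dtype) and copies the values row by row from the table.

```python
import pandas as pd

# Rebuild the sample frame from the table above.
# Assumption: 'Date' is stored as datetime64, one row per calendar day.
df = pd.DataFrame({
    "Date": pd.date_range("2020-02-01", periods=14, freq="D"),
    "Tea_Good": [3, 3, 3, 3, 6, 6, 6, 6, 9, 9, 9, 9, 9, 9],
    "Tea_bad": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    "coffee_good": [10, 10, 10, 10, 10, 10, 5, 5, 5, 4, 4, 4, 4, 4],
    "coffee_bad": [7, 7, 7, 7, 7, 11, 11, 11, 11, 11, 11, 11, 11, 11],
})
```

Note the inconsistent capitalization of the column names (Tea_Good, Tea_bad, coffee_good); any solution has to map the dictionary keys onto these exact labels.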

dict

rf = {
    "tea": [
        {
            "type": "linear",
            "from": "2020-02-01T20:00:00.000Z",
            "to": "2020-02-03T20:00:00.000Z",
            "days": 3,
            "coef": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
            "case": "bad"
        },
        {
            "type": "polynomial",
            "from": "2020-02-08T20:00:00.000Z",
            "to": "2020-02-10T20:00:00.000Z",
            "days": 3,
            "coef": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
            "case": "good"
        }
    ],
    "coffee": [
        {
            "type": "quadratic",
            "from": "2020-02-01T20:00:00.000Z",
            "to": "2020-02-10T20:00:00.000Z",
            "days": 10,
            "coef": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
            "case": "good"
        },
        {
            "type": "constant",
            "from": "2020-02-11T20:00:00.000Z",
            "to": "2020-02-13T20:00:00.000Z",
            "days": 5,
            "coef": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
            "case": "bad"
        }
    ]
}

Explanation:

The dictionary contains two keys:

1. "tea"
2. "coffee"

Based on each key and the values of its entries, I want to update the columns of df.

1. Which column?
If key == "tea" and "case" == "bad" update the Tea_bad column

2. When? 
"from": "2020-02-01T20:00:00.000Z",
"to": "2020-02-03T20:00:00.000Z"

3. How?
if "type": "linear",
when  "from": "2020-02-01T20:00:00.000Z"
t = 0,
a0 = coef[0]
a1 = coef[1]
a2 = coef[2]
a3 = coef[3]
a4 = coef[4]
a5 = coef[5]

df.loc[(df['Date'] >= start_date) & (df['Date'] <= end_date), 'Tea_bad'] = a0 + a1 * t.
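
The three steps above (which column, when, how) can be sketched for a single linear rule as below. This is a minimal illustration rather than the asker's or the answerers' code: the helper name apply_linear is hypothetical, and it assumes that t counts whole days starting at 0 on the first row that falls inside the window.

```python
import pandas as pd

def apply_linear(df, column, start, end, coef):
    """Overwrite `column` with a0 + a1 * t for rows whose Date lies in [start, end].

    Assumption: t is the number of whole days since the first row inside
    the window, so the first updated row gets t = 0.
    """
    # Parse the ISO strings (trailing 'Z' means UTC), then drop the timezone
    # so they compare cleanly with a timezone-naive 'Date' column.
    start = pd.to_datetime(start, utc=True).tz_localize(None)
    end = pd.to_datetime(end, utc=True).tz_localize(None)

    out = df.copy()
    mask = out["Date"].between(start, end)
    if not mask.any():
        return out  # nothing falls inside the window

    t = (out.loc[mask, "Date"] - out.loc[mask, "Date"].min()).dt.days
    out[column] = out[column].astype(float)  # the new values are fractional
    out.loc[mask, column] = coef[0] + coef[1] * t
    return out

demo = pd.DataFrame({
    "Date": pd.date_range("2020-02-01", periods=5, freq="D"),
    "Tea_bad": [1, 1, 1, 1, 1],
})
result = apply_linear(
    demo, "Tea_bad",
    "2020-02-01T20:00:00.000Z", "2020-02-03T20:00:00.000Z",
    [0.1] * 6,
)
```

Because the dates in df are at midnight and the window starts at 20:00, the row for 2020-02-01 is not inside the window; only 2020-02-02 and 2020-02-03 are updated here.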

I tried the code below, but it does not work. Feel free to ignore it and implement a solution your own way.

def rf_user_input(df, REQUEST_OBJ):
    '''
        This functions returns the tea_coffee dataframe with the user input functions for tea, coffee

        params: data : tea_coffee dataframe uploaded from user
                request_object_api: The api should contain the below params
                    start_date: start date of the user function for rf
                    end_date : end date of the user function for the rf
                    label : 'constant', 'linear', 'quadratic', 'polynomial', 'exponential', 'df'
                    coef : list with 6 indexes [a0,a1,a2,a3,a4,a5]

        return: rf computed with user inputs
    '''
    # df.days.iloc[(df[df.Date==start_date].index[0])]
    df = df.sort_values(by='Date')
    df['days'] = (df['Date'] - df.at[0, 'Date']).dt.days + 1

    REQUIRED_KEYS = ["tea", "coffee"]

    for teacoffee_category in REQUIRED_KEYS:
        print(f" teacoffee_category - {teacoffee_category}")
        if teacoffee_category in REQUEST_OBJ.keys():
            param_obj_list = REQUEST_OBJ[teacoffee_category]

            for params_obj in param_obj_list:
                # Do the data processing
                goodbad_catgeory = params_obj['case']
                kind = teacoffee_category + '_' + goodbad_catgeory
                start_date, end_date, label, coef, n_days = params_obj['from'], params_obj['to'], params_obj['type'], \
                                                            params_obj['coef'], params_obj['days']

                start_date = DT.datetime.strptime(start_date, "%Y-%m-%dT%H:%M:%S.%fZ")
                end_date = DT.datetime.strptime(end_date, "%Y-%m-%dT%H:%M:%S.%fZ")
                print(f" start date - {start_date}")
                print(f" end date - {end_date}")

                # Additional n_days code - Start
                first_date = df['Date'].min()
                period_days = (start_date - first_date)
                print(f" period day - {period_days}")
                # Additional n_days code - End

                # Checking 'start_date' , 'end_date' and 'n_days' conditions
                
                # If the start_date and end_date is null return the calibration df as it is
                if (start_date == 0) & (end_date == 0):
                    return df
                
                if (start_date == 0) & (end_date != 0) & (n_days == 0):
                    return df
                
                if (start_date != 0) & (end_date == 0) & (n_days == 0):
                    return df

                # if start date, end date and n_days are non zero then consider start date and n_days
                if (start_date != 0) & (end_date != 0) & (n_days != 0):
                    #n_days = (end_date - start_date).days
                    #n_days = (end_date - start_date).days
                    end_date = start_date + DT.timedelta(days=n_days)
                
                if (start_date != 0) & (end_date != 0) & (n_days == 0) :
                    n_days = (end_date - start_date)
                    print(f" n day = {n_days}")
                    end_date = end_date
                
                if (start_date != 0) & (end_date == 0) & (n_days != 0) :
                    #n_days = (end_date - start_date)
                    #print(f" n day = n_days")
                    end_date = start_date + DT.timedelta(days=n_days)
                    
                if (start_date == 0) & (end_date != 0) & (n_days != 0) :
                    start_date = end_date - DT.timedelta(days=n_days)
                    
                




                if (n_days != 0) & (start_date != 0):
                    end_date = start_date + DT.timedelta(days=n_days)

                    # If the start_date and end_date is null return the calibration df as it is

                if len(coef) == 6:
                        # Coefficients Index Initializations
                    a0 = coef[0]
                    a1 = coef[1]
                    a2 = coef[2]
                    a3 = coef[3]
                    a4 = coef[4]
                    a5 = coef[5]

                    # Constant
                    if label == 'constant':
                        if kind == 'tea_good':
                            df.loc[
                                (df['Date'] >= start_date) & (df['Date'] <= end_date), 'Tea_Good'] = a0 + (df['days']) - period_days
                        elif kind == 'tea_bad':
                            df.loc[
                                (df['Date'] >= start_date) & (df['Date'] <= end_date), 'Tea_bad'] = a0 + df['days'] - period_days
                        elif kind == 'coffee_good':
                            df.loc[
                                (df['Date'] >= start_date) & (df['Date'] <= end_date), 'coffee_good'] = a0 + df['days'] - period_days
                        elif kind == 'coffee_bad':
                            df.loc[
                                (df['Date'] >= start_date) & (df['Date'] <= end_date), 'coffee_bad'] = a0 + df['days'] - period_days

                    # Linear
                    if label == 'linear':
                        if kind == 'tea_good':
                            df.loc[
                                (df['Date'] >= start_date) & (df['Date'] <= end_date), 'Tea_Good'] = a0 + (
                                    a1 * ((df['days']) - period_days))
                        elif kind == 'tea_bad':
                            df.loc[
                                (df['Date'] >= start_date) & (df['Date'] <= end_date), 'Tea_bad'] = a0 + (
                                    a1 * ((df['days']) - period_days))
                        elif kind == 'coffee_good':
                            df.loc[
                                (df['Date'] >= start_date) & (df['Date'] <= end_date), 'coffee_good'] = a0 + (
                                    a1 * ((df['days']) - period_days))
                        elif kind == 'coffee_bad':
                            df.loc[
                                (df['Date'] >= start_date) & (df['Date'] <= end_date), 'coffee_bad'] = a0 + (
                                    a1 * ((df['days']) - period_days))

                    # Quadratic
                    if label == 'quadratic':
                        if kind == 'tea_good':
                            df.loc[
                                (df['Date'] >= start_date) & (df['Date'] <= end_date), 'Tea_Good'] = a0 + (
                                    a1 * ((df['days']) - period_days)) + (a2 * ((df['days']) - period_days) ** 2)
                        elif kind == 'tea_bad':
                            df.loc[
                                (df['Date'] >= start_date) & (df['Date'] <= end_date), 'Tea_bad'] = a0 + (
                                    a1 * ((df['days']) - period_days)) + (a2 * ((df['days']) - period_days) ** 2)
                        elif kind == 'coffee_good':
                            df.loc[
                                (df['Date'] >= start_date) & (df['Date'] <= end_date), 'coffee_good'] = a0 + (
                                    a1 * ((df['days']) - period_days)) + (a2 * ((df['days']) - period_days) ** 2)
                        elif kind == 'coffee_bad':
                            df.loc[
                                (df['Date'] >= start_date) & (df['Date'] <= end_date), 'coffee_bad'] = a0 + (
                                    a1 * ((df['days']) - period_days)) + (a2 * ((df['days']) - period_days) ** 2)

                    # Polynomial
                    if label == 'polynomial':
                        if kind == 'tea_good':
                            df.loc[
                                (df['Date'] >= start_date) & (df['Date'] <= end_date), 'Tea_Good'] = a0 + (
                                    a1 * ((df['days']) - period_days)) + (a2 * (
                                    (df['days']) - period_days) ** 2) + (a3 * (
                                    (df['days']) - period_days) ** 3) + (a4 * (
                                    (df['days']) - period_days) ** 4) + (a5 * ((df['days']) - period_days) ** 5)
                        elif kind == 'tea_bad':
                            df.loc[
                                (df['Date'] >= start_date) & (df['Date'] <= end_date), 'Tea_bad'] = a0 + (
                                    a1 * ((df['days']) - period_days)) + (a2 * (
                                    (df['days']) - period_days) ** 2) + (a3 * (
                                    (df['days']) - period_days) ** 3) + (a4 * (
                                    (df['days']) - period_days) ** 4) + (a5 * ((df['days']) - period_days) ** 5)
                        elif kind == 'coffee_good':
                            df.loc[
                                (df['Date'] >= start_date) & (df['Date'] <= end_date), 'coffee_good'] = a0 + (
                                    a1 * ((df['days']) - period_days)) + (a2 * (
                                    (df['days']) - period_days) ** 2) + (a3 * (
                                    (df['days']) - period_days) ** 3) + (a4 * (
                                    (df['days']) - period_days) ** 4) + (a5 * ((df['days']) - period_days) ** 5)
                        elif kind == 'coffee_bad':
                            df.loc[(df['Date'] >= start_date) & (df['Date'] <= end_date), 'coffee_bad'] = a0 + (
                                a1 * ((df['days']) - period_days)) +  (a2 * (
                                (df['days']) - period_days) ** 2) + (a3 * (
                                (df['days']) - period_days) ** 3) + (a4 * (
                                (df['days']) - period_days) ** 4) + (a5 * ((df['days']) - period_days) ** 5)

                    # Exponential
                    if label == 'exponential':
                        if kind == 'tea_good':
                            df.loc[(df['Date'] >= start_date) & (df['Date'] <= end_date), 'Tea_Good'] = np.exp(a0)
                        elif kind == 'tea_bad':
                            df.loc[(df['Date'] >= start_date) & (df['Date'] <= end_date), 'Tea_bad'] = np.exp(a0)
                        elif kind == 'coffee_good':
                            df.loc[(df['Date'] >= start_date) & (df['Date'] <= end_date), 'coffee_good'] = np.exp(a0)
                        elif kind == 'coffee_bad':
                            df.loc[(df['Date'] >= start_date) & (df['Date'] <= end_date), 'coffee_bad'] = np.exp(a0)

                    # Calibration File
                    if label == 'calibration_file':
                        pass
                    #                     return df
                else:
                    raise Exception('Coefficients index do not match. All values of coefficients should be passed')

            else:
                return df
    return df

I asked the same question earlier in a different form, but I think I did not explain it well. That question is linked below.

Replace the column values based on the list of dictionary and specific date condition - use if and for loop - Pandas

[Answer 1]:

Use:

import numpy as np
import pandas as pd


def rf_user_input(df, req_obj):
    df = df.sort_values('Date')
    df['days'] = (df['Date'] - df.at[0, 'Date']).dt.days + 1

    cols, df.columns = df.columns, df.columns.str.lower()

    for category in ("tea", "coffee"):
        if category not in req_obj.keys():
            continue

        for params_obj in req_obj[category]:
            case = params_obj['case']
            kind = '{}_{}'.format(category, case)

            start_date = pd.to_datetime(params_obj['from'], format='%Y-%m-%dT%H:%M:%S.%fZ')
            end_date = pd.to_datetime(params_obj['to'], format='%Y-%m-%dT%H:%M:%S.%fZ')
            label, coef, n_days = params_obj['type'], params_obj['coef'], params_obj['days']

            # Additional n_days code - Start
            first_date = df['date'].min()
            period_days = (start_date - first_date).days
            # Additional n_days code - End

            # Checking 'start_date' , 'end_date' and 'n_days' conditions

            # If the start_date and end_date is null return the calibration df as it is
            if (start_date == 0) and (end_date == 0):
                return df.set_axis(cols, axis=1)

            if (start_date == 0) and (end_date != 0) and (n_days == 0):
                return df.set_axis(cols, axis=1)

            if (start_date != 0) and (end_date == 0) and (n_days == 0):
                return df.set_axis(cols, axis=1)

            # if start date, end date and n_days are non zero then consider start date and n_days
            if (start_date != 0) and (end_date != 0) and (n_days != 0):
                end_date = start_date + pd.Timedelta(days=n_days)

            if (start_date != 0) and (end_date != 0) and (n_days == 0):
                n_days = (end_date - start_date)

            if (start_date != 0) and (end_date == 0) and (n_days != 0):
                end_date = start_date + pd.Timedelta(days=n_days)

            if (start_date == 0) and (end_date != 0) and (n_days != 0):
                start_date = end_date - pd.Timedelta(days=n_days)

            if (n_days != 0) and (start_date != 0):
                end_date = start_date + pd.Timedelta(days=n_days)

            # If the start_date and end_date is null return the calibration df as it is

            if len(coef) == 6:
                a0, a1, a2, a3, a4, a5 = coef
                mask = df['date'].between(start_date, end_date)

                if label == 'constant':
                    if kind in ('tea_good', 'tea_bad', 'coffee_good', 'coffee_bad'):
                        df.loc[mask, kind] = a0 + df['days'] - period_days

                elif label == 'linear':
                    if kind in ('tea_good', 'tea_bad', 'coffee_good', 'coffee_bad'):
                        df.loc[mask, kind] = a0 + \
                            (a1 * ((df['days']) - period_days))

                # Quadratic
                elif label == 'quadratic':
                    if kind in ('tea_good', 'tea_bad', 'coffee_good', 'coffee_bad'):
                        df.loc[mask, kind] = a0 + (a1 * ((df['days']) - period_days)) + (
                            a2 * ((df['days']) - period_days) ** 2)

                # Polynomial
                elif label == 'polynomial':
                    if kind in ('tea_good', 'tea_bad', 'coffee_good', 'coffee_bad'):
                        df.loc[mask, kind] = a0 + (
                            a1 * ((df['days']) - period_days)) + (a2 * (
                                (df['days']) - period_days) ** 2) + (a3 * (
                                    (df['days']) - period_days) ** 3) + (a4 * (
                                        (df['days']) - period_days) ** 4) + (a5 * ((df['days']) - period_days) ** 5)

                # Exponential
                elif label == 'exponential':
                    if kind in ('tea_good', 'tea_bad', 'coffee_good', 'coffee_bad'):
                        df.loc[mask, kind] = np.exp(a0)

                # Calibration File
                elif label == 'calibration_file':
                    pass
            else:
                raise Exception(
                    'Coefficients index do not match. All values of coefficients should be passed')

    return df.set_axis(cols, axis=1)

Result:

# rf_user_input(df, rf)

         Date  Tea_Good  Tea_bad  coffee_good  coffee_bad  days
0  2020-02-01       3.0      1.0         10.0         7.0     1
1  2020-02-02       3.0      0.3          0.3         7.0     2
2  2020-02-03       3.0      0.4          0.4         7.0     3
3  2020-02-04       3.0      0.5          0.5         0.3     4
4  2020-02-05      12.0      1.0          3.1         0.4     5
5  2020-02-06      13.0      2.0          4.3         0.5     6
6  2020-02-07       6.0      2.0          5.7         0.6     7
7  2020-02-08       6.0      2.0          7.3        11.0     8
8  2020-02-09       6.3      2.0          9.1        11.0     9
9  2020-02-10      36.4      2.0         11.1        11.0    10
10 2020-02-11     136.5      2.0         13.3        11.0    11
11 2020-02-12       9.0      2.0          4.0        11.0    12
12 2020-02-13       9.0      2.0          4.0        11.0    13
13 2020-02-14       9.0      2.0          4.0        11.0    14
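
The chain of if checks on start_date, end_date and n_days in the function above boils down to a few cases, which can be isolated in a small helper. The sketch below is an illustration, not part of the answer: the name resolve_window is hypothetical, and it assumes a missing value is passed as None or 0 and that the inputs are timezone-naive.

```python
import pandas as pd

def resolve_window(start, end, n_days):
    """Return (start, end) as Timestamps, or None if no window can be built.

    Assumption: a missing value is passed as None or 0.
    """
    start = pd.Timestamp(start) if start else None
    end = pd.Timestamp(end) if end else None

    if start is not None and n_days:
        # A start plus a day count wins, even when 'to' is also given.
        return start, start + pd.Timedelta(days=n_days)
    if end is not None and n_days:
        return end - pd.Timedelta(days=n_days), end
    if start is not None and end is not None:
        return start, end
    return None  # not enough information: leave the frame unchanged

window = resolve_window("2020-02-01 20:00", "2020-02-03 20:00", 3)
```

With all three values present, the end of the window is start plus days rather than the "to" value, which is why Tea_bad in the result above is also overwritten on 2020-02-04.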

[Answer 2]:

One solution is to iterate over the dictionary and use apply:

    df.Date = pd.to_datetime(df.Date)
    df = df.set_index('Date', drop=True)
    df['Period'] = [(date - df.index[0]).days for date in df.index]

    for key, val in rf.items():
        for elem in val:
            type_method = elem.get('type')
            col_name = f'{key.capitalize()}_{elem.get("case")}'
            date_from = pd.to_datetime(elem.get('from'))
            date_to = pd.to_datetime(elem.get('to'))
            a0, a1, a2, a3, a4, a5 = elem.get('coef')
            mask_dates = (df.index >= date_from) & (df.index <= date_to)

            func_dict = {
                'linear': lambda x: a0 + a1 * x['Period'],
                'constant': lambda x: a0 + x['Period'],
                'quadratic': lambda x: a0 + a1 * (x['Period']) + a2 * (x['Period'] ** 2),
                'exponential': lambda x: np.exp(a0),
                'polynomial': lambda x: a0 +
                                        a1 * (x['Period']) +
                                        a2 * (x['Period'] ** 2) +
                                        a3 * (x['Period'] ** 3) +
                                        a4 * (x['Period'] ** 4) +
                                        a5 * (x['Period'] ** 5),
            }

            df.loc[mask_dates, col_name] = df[mask_dates].apply(func_dict[type_method], axis=1)

Output:

            Tea_good  Tea_bad  Coffee_good  Coffee_bad  Period
Date                                                          
2020-02-01       3.0      1.0         10.0         7.0       0
2020-02-02       3.0      0.2          0.3         7.0       1
2020-02-03       3.0      0.3          0.7         7.0       2
2020-02-04       3.0      1.0          1.3         7.0       3
2020-02-05       6.0      1.0          2.1         7.0       4
2020-02-06       6.0      2.0          3.1        11.0       5
2020-02-07       6.0      2.0          4.3        11.0       6
2020-02-08       6.0      2.0          5.7        11.0       7
2020-02-09    3744.9      2.0          7.3        11.0       8
2020-02-10    6643.0      2.0          9.1        11.0       9
2020-02-11       9.0      2.0          4.0        11.0      10
2020-02-12       9.0      2.0          4.0        11.1      11
2020-02-13       9.0      2.0          4.0        12.1      12
2020-02-14       9.0      2.0          4.0        11.0      13

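Note that I had to change the column names so that tea/coffee are capitalized. Also, using lambda functions like this is lazy; they should be refactored into regular functions.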
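
One possible shape for that refactoring is sketched below. It is not the answerer's code: the names DEGREE and evaluate are hypothetical, and it assumes the linear, quadratic and polynomial rules differ only in how many leading coefficients they use. It works on the whole Period column at once, so no row-wise apply is needed.

```python
import numpy as np
import pandas as pd

# Highest power of Period used by each polynomial-style rule
# (read off the lambda formulas above).
DEGREE = {"linear": 1, "quadratic": 2, "polynomial": 5}

def evaluate(kind, coef, period):
    """Return the new values for a Series of day offsets `period`."""
    period = period.astype(float)
    if kind == "constant":
        # Mirrors the lambda above (a0 + Period), although the name
        # "constant" might suggest a0 alone.
        return coef[0] + period
    if kind == "exponential":
        return pd.Series(np.exp(coef[0]), index=period.index)
    if kind in DEGREE:
        # a0 + a1 * p + ... + an * p**n, with n taken from DEGREE
        return sum(coef[k] * period ** k for k in range(DEGREE[kind] + 1))
    raise ValueError(f"unknown rule type: {kind!r}")

period = pd.Series([0, 1, 2])
linear = evaluate("linear", [0.1] * 6, period)
quadratic = evaluate("quadratic", [0.1] * 6, period)
```

Inside the loop of the answer, the assignment would then become something like df.loc[mask_dates, col_name] = evaluate(type_method, elem.get('coef'), df.loc[mask_dates, 'Period']).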
