从一个函数在 Pandas Dataframe 中创建多个列

Posted

技术标签:

【中文标题】从一个函数在 Pandas Dataframe 中创建多个列【英文标题】:Create multiple columns in Pandas Dataframe from one function 【发布时间】:2016-01-07 18:15:42 【问题描述】:

我是python新手,所以希望我的两个问题清晰完整。我在下面发布了实际代码和一个csv格式的测试数据集。

我已经能够构建以下代码(主要是在 *** 贡献者的帮助下)来使用 Newton-Raphson 方法计算期权合约的隐含波动率。该过程在确定隐含波动率时计算 Vega。虽然我可以使用 Pandas DataFrame 应用方法为隐含波动率创建一个新的 DataFrame 列,但我无法为 Vega 创建第二个列。当函数一起返回 IV 和 Vega 时,有没有办法创建两个单独的 DataFrame 列?

我试过了:

return iv, vega 来自函数 df[['myIV', 'Vega']] = df.apply(newtonRap, axis=1) 得到ValueError: Shape of passed values is (56, 2), indices imply (56, 13)

也试过了:

return iv, vega 来自函数 df['myIV'], df['Vega'] = df.apply(newtonRap, axis=1) 得到ValueError: Shape of passed values is (56, 2), indices imply (56, 13)

此外,计算过程很慢。我导入了 numba 并实现了 @jit(nogil=True) 装饰器,但我只看到了 25% 的性能提升。测试数据集是性能测试,有近 90 万条记录。运行时间为 2 小时 9 分钟,没有 numba 或使用 numba,但没有 nogil=True。使用 numba 和 @jit(nogil=True) 的运行时间为 1 小时 32 分钟。我能做得更好吗?

from datetime import datetime
from math import sqrt, pi, log, exp, isnan
from scipy.stats import norm
from numba import jit


# dff = Daily Fed Funds (Posted rate is usually one day behind)
dff = pd.read_csv('https://research.stlouisfed.org/fred2/data/DFF.csv', parse_dates=[0], index_col='DATE')
rf = float('%.4f' % (dff['VALUE'][-1:][0] / 100))
# rf = .0015                        # Get Fed Funds Rate https://research.stlouisfed.org/fred2/data/DFF.csv
tradingMinutesDay = 450             # 7.5 hours per day * 60 minutes per hour
tradingMinutesAnnum = 113400        # trading minutes per day * 252 trading days per year
cal = USFederalHolidayCalendar()    # Load US Federal holiday calendar


@jit(nogil=True)                                # nogil=True arg improves performance by 25%
def newtonRap(row):
    """Estimate Implied Volatility (IV) using Newton-Raphson method

    :param row (dataframe):  Options contract params for function
        TimeStamp (datetime): Close date
        Expiry (datetime): Option contract expiration date
        Strike (float): Option strike
        OptType (object): 'C' for call; 'P' for put
        RootPrice (float): Underlying close price
        Bid (float): Option contact closing bid
        Ask (float): Option contact closing ask

    :return:
        float: Estimated implied volatility
    """
    if row['Bid'] == 0.0 or row['Ask'] == 0.0 or row['RootPrice'] == 0.0 or row['Strike'] == 0.0 or \
       row['TimeStamp'] == row['Expiry']:
        iv, vega = 0.0, 0.0         # Set iv and vega to zero if option contract is invalid or expired
    else:
        # dte (Days to expiration) uses pandas bdate_range method to determine the number of business days to expiration
        #   minus USFederalHolidays minus constant of 1 for the TimeStamp date
        dte = float(len(pd.bdate_range(row['TimeStamp'], row['Expiry'])) -
                    len(cal.holidays(row['TimeStamp'], row['Expiry']).to_pydatetime()) - 1)
        mark = (row['Bid'] + row['Ask']) / 2
        cp = 1 if row['OptType'] == 'C' else -1
        S = row['RootPrice']
        K = row['Strike']
        # T = the number of trading minutes to expiration divided by the number of trading minutes in year
        T = (dte * tradingMinutesDay) / tradingMinutesAnnum
        # TODO get dividend value
        d = 0.00
        iv = sqrt(2 * pi / T) * mark / S        # Closed form estimate of IV Brenner and Subrahmanyam (1988)
        vega = 0.0
        for i in range(1, 100):
            d1 = (log(S / K) + T * (rf - d + iv ** 2 / 2)) / (iv * sqrt(T))
            d2 = d1 - iv * sqrt(T)
            vega = S * norm.pdf(d1) * sqrt(T)
            model = cp * S * norm.cdf(cp * d1) - cp * K * exp(-rf * T) * norm.cdf(cp * d2)
            iv -= (model - mark) / vega
            if abs(model - mark) < 1.0e-9:
                break
        if isnan(iv) or isnan(vega):
            iv, vega = 0.0, 0.0
    # TODO Return vega with iv if add'l pandas column possible
    # return iv, vega
    return iv


if __name__ == "__main__":
    # test function from baseline data
    get_csv = True

    if get_csv:
        csvHeaderList = ['TimeStamp', 'OpraSymbol', 'RootSymbol', 'Expiry', 'Strike', 'OptType', 'RootPrice', 'Last',
                         'Bid', 'Ask', 'Volume', 'OpenInt', 'IV']
        fileName = 'C:/tmp/test-20150930-56records.csv'
        df = pd.read_csv(fileName, parse_dates=[0, 3], names=csvHeaderList)
    else:
        pass

    start = datetime.now()
    # TODO Create add'l pandas dataframe column, if possible, for vega
    # df[['myIV', 'Vega']] = df.apply(newtonRap, axis=1)
    # df['myIV'], df['Vega'] = df.apply(newtonRap, axis=1)
    df['myIV'] = df.apply(newtonRap, axis=1)
    end = datetime.now()
    print end - start

测试数据:C:/tmp/test-20150930-56records.csv

2015-09-30 16:00:00,AAPL151016C00109000,AAPL,2015-10-16 16:00:00,109,C,109.95,3.46,3.6,3.7,1565,1290,0.3497 2015-09-30 16:00:00,AAPL151016P00109000,AAPL,2015-10-16 16:00:00,109,P,109.95,2.4,2.34,2.42,3790,3087,0.3146 2015-09-30 16:00:00,AAPL151016C00110000,AAPL,2015-10-16 16:00:00,110,C,109.95,3,2.86,3,10217,28850,0.3288 2015-09-30 16:00:00,AAPL151016P00110000,AAPL,2015-10-16 16:00:00,110,P,109.95,2.81,2.74,2.8,12113,44427,0.3029 2015-09-30 16:00:00,AAPL151016C00111000,AAPL,2015-10-16 16:00:00,111,C,109.95,2.35,2.44,2.45,6674,2318,0.3187 2015-09-30 16:00:00,AAPL151016P00111000,AAPL,2015-10-16 16:00:00,111,P,109.95,3.2,3.1,3.25,2031,3773,0.2926 2015-09-30 16:00:00,AAPL151120C00110000,AAPL,2015-11-20 16:00:00,110,C,109.95,5.9,5.7,5.95,5330,17112,0.3635 2015-09-30 16:00:00,AAPL151120P00110000,AAPL,2015-11-20 16:00:00,110,P,109.95,6.15,6.1,6.3,3724,15704,0.3842

【问题讨论】:

【参考方案1】:

如果我理解正确,你应该做的是从你的函数中返回一个系列。比如:

return pandas.Series("IV": iv, "Vega": vega)

如果要将结果放入同一输入 DataFrame 的新列中,则只需:

df[["IV", "Vega"]] = df.apply(newtonRap, axis=1)

【讨论】:

【参考方案2】:

就 numba 的性能而言,numba 对 pandas 数据帧一无所知,也无法将对它们的操作编译为快速机器代码。您最好的选择是分析您的方法的哪一部分是慢的(例如使用line_profiler),然后将该部分卸载到另一个方法,您使用数据框列的.values 属性构造输入,这使您可以访问到底层的 numpy 数组。否则 numba 将主要在“对象模式”下运行(参见numba glossary)并且不会显着提高性能

【讨论】:

【参考方案3】:

向量化代码的技巧是不要考虑行,而是考虑列。

我几乎可以完成这项工作(我稍后会尝试完成它),但您想要按照以下方式做一些事情:

from datetime import datetime
from math import sqrt, pi, log, exp, isnan
from numpy import inf, nan
from scipy.stats import norm
import pandas as pd
from pandas import Timestamp
from pandas.tseries.holiday import USFederalHolidayCalendar

# Initial parameters
rf = .0015                          # Get Fed Funds Rate https://research.stlouisfed.org/fred2/data/DFF.csv
tradingMinutesDay = 450             # 7.5 hours per day * 60 minutes per hour
tradingMinutesAnnum = 113400        # trading minutes per day * 252 trading days per year
cal = USFederalHolidayCalendar()    # Load US Federal holiday calendar
two_pi = 2 * pi                     # 2 * Pi (to reduce computations)
threshold = 1.0e-9                  # convergence threshold.

# Create sample data:
col_order = ['TimeStamp', 'OpraSymbol', 'RootSymbol', 'Expiry', 'Strike', 'OptType', 'RootPrice', 'Last', 'Bid', 'Ask', 'Volume', 'OpenInt', 'IV']
df = pd.DataFrame('Ask': 0: 3.7000000000000002, 1: 2.4199999999999999, 2: 3.0, 3: 2.7999999999999998, 4: 2.4500000000000002, 5: 3.25, 6: 5.9500000000000002, 7: 6.2999999999999998,
                   'Bid': 0: 3.6000000000000001, 1: 2.3399999999999999, 2: 2.8599999999999999, 3: 2.7400000000000002, 4: 2.4399999999999999, 5: 3.1000000000000001, 6: 5.7000000000000002, 7: 6.0999999999999996,
                   'Expiry': 0: Timestamp('2015-10-16 16:00:00'), 1: Timestamp('2015-10-16 16:00:00'), 2: Timestamp('2015-10-16 16:00:00'), 3: Timestamp('2015-10-16 16:00:00'), 4: Timestamp('2015-10-16 16:00:00'), 5: Timestamp('2015-10-16 16:00:00'), 6: Timestamp('2015-11-20 16:00:00'), 7: Timestamp('2015-11-20 16:00:00'),
                   'IV': 0: 0.3497, 1: 0.3146, 2: 0.3288, 3: 0.3029, 4: 0.3187, 5: 0.2926, 6: 0.3635, 7: 0.3842,
                   'Last': 0: 3.46, 1: 2.34, 2: 3.0, 3: 2.81, 4: 2.35, 5: 3.20, 6: 5.90, 7: 6.15,
                   'OpenInt': 0: 1290.0, 1: 3087.0, 2: 28850.0, 3: 44427.0, 4: 2318.0, 5: 3773.0, 6: 17112.0, 7: 15704.0,
                   'OpraSymbol': 0: 'AAPL151016C00109000', 1: 'AAPL151016P00109000', 2: 'AAPL151016C00110000', 3: 'AAPL151016P00110000', 4: 'AAPL151016C00111000', 5: 'AAPL151016P00111000', 6: 'AAPL151120C00110000', 7: 'AAPL151120P00110000',
                   'OptType': 0: 'C', 1: 'P', 2: 'C', 3: 'P', 4: 'C', 5: 'P', 6: 'C', 7: 'P',
                   'RootPrice': 0: 109.95, 1: 109.95, 2: 109.95, 3: 109.95, 4: 109.95, 5: 109.95, 6: 109.95, 7: 109.95,
                   'RootSymbol': 0: 'AAPL', 1: 'AAPL', 2: 'AAPL', 3: 'AAPL', 4: 'AAPL', 5: 'AAPL', 6: 'AAPL', 7: 'AAPL',
                   'Strike': 0: 109.0, 1: 109.0, 2: 110.0, 3: 110.0, 4: 111.0, 5: 111.0, 6: 110.0, 7: 110.0,
                   'TimeStamp': 0: Timestamp('2015-09-30 16:00:00'), 1: Timestamp('2015-09-30 16:00:00'), 2: Timestamp('2015-09-30 16:00:00'), 3: Timestamp('2015-09-30 16:00:00'), 4: Timestamp('2015-09-30 16:00:00'), 5: Timestamp('2015-09-30 16:00:00'), 6: Timestamp('2015-09-30 16:00:00'), 7: Timestamp('2015-09-30 16:00:00'),
                   'Volume': 0: 1565.0, 1: 3790.0, 2: 10217.0, 3: 12113.0, 4: 6674.0, 5: 2031.0, 6: 5330.0, 7: 3724.0)
df = df[col_order]

# Vectorize columns
df['mark'] = (df.Bid + df.Ask) / 2
df['cp'] = df.OptType.map('C': 1, 'P': -1)
df['Log_S_K'] = (df.RootPrice / df.Strike).apply(log)
df['divs'] = 0  # TODO: Get dividend value.
df['vega'] = 0.
df['converged'] = False

# Vectorized datetime calculations
date_pairs = set(zip(df.TimeStamp, df.Expiry))
total_days = (t1, t2): len(pd.bdate_range(t1, t2)) 
                        for t1, t2 in date_pairs
hols = (t1, t2): len(cal.holidays(t1, t2).to_pydatetime()) 
                  for t1, t2 in date_pairs
del date_pairs

df['total_days'] = [total_days.get((t1, t2))
                    for t1, t2 in zip(df.TimeStamp, df.Expiry)]
df['hols'] = [hols.get((t1, t2))
              for t1, t2 in zip(df.TimeStamp, df.Expiry)]
df['days_to_exp'] = df.total_days - df.hols - 1
df.loc[df.days_to_exp < 0, 'days_to_exp'] = 0  # Min zero.
df.drop(['total_days', 'hols'], axis='columns', inplace=True)
df['years_to_expiry'] = (df.days_to_exp * tradingMinutesDay / tradingMinutesAnnum)

# Initial implied vol 'guess'
df['implied_vol'] = (two_pi / df.years_to_expiry) ** 0.5 * df.mark / df.RootPrice  

for i in xrange(100):  # range(100) in Python 3.x
    # Create mask of options where the vol has not converged.
    mask = [not c for c in df.converged.values]
    if df.converged.all():
        break

    # Aliases.
    data = df.loc[mask, :]
    cp = data.cp
    mark = data.mark
    S = data.RootPrice
    K = data.Strike
    d = data.divs
    T = data.years_to_expiry
    log_S_K = data.Log_S_K
    iv = data.implied_vol

    # Calcs.
    d1 = (log_S_K + T * (rf - d + .5 * iv ** 2)) / (iv * T ** 0.5)
    d2 = d1 - iv * T ** 0.5
    df.loc[mask, 'vega'] = vega = S * d1.apply(norm.pdf) * T ** 0.5
    model = cp * (S * (cp * d1).apply(norm.cdf)
                  - K * (-rf * T).apply(exp) * (cp * d2).apply(norm.cdf))
    iv_delta = (model - mark) / vega
    df.loc[mask, 'implied_vol'] = iv - iv_delta

    # Clean-up and check for convergence.
    df.loc[df.implied_vol < 0, 'implied_vol'] = 0
    idx = model[(model - mark).abs() < threshold].index
    df.ix[idx, 'converged'] = True
    df.loc[:, 'implied_vol'].fillna(0, inplace=True)
    df.loc[:, 'implied_vol'].replace([inf, -inf], nan, inplace=True)
    df.loc[:, 'vega'].fillna(0, inplace=True)
    df.loc[:, 'vega'].replace([inf, -inf], nan, inplace=True)

【讨论】:

Alexander,我知道你打算怎么处理这个了。我很期待最终的结果。在代码的第 38 行,它显示“df['implied_vol'] = df.IV # Initial 'guess'” 我提供的数据集中的 IV 是基于到期的日历天数和每次 365 天的实际 IV年。我的代码中实际的 IV 初始猜测是“iv = sqrt(2 * pi / T) * mark / S # IV Brenner 和 Subrahmanyam (1988) 的封闭形式估计”。 S是股价基本等于马克,所以马克/S大约是1.0,对吧? Alexander, S 是标的股票价格。标记是期权合约的买价和卖价之间的中点。以 AMZN 为例,AMZN 于 10 月 10 日星期五收于 539.80。然而,11 月 20 日的 540 看涨期权以 32.30 的出价收盘。卖出价收于 32.60。因此,马克是 32.45。 Alexander,我遇到了代码中“iv = data.implied_vol”行的问题。您是否打算定义“df['implied_vol'] = sqrt(two_pi / T) * df['Log_S_K'] / S”或类似的东西? 原始方法似乎快了大约 20%+。 pdfcdf 计算占了大部分计算时间。您可以从 for 循环中提取一些常量来获得一些额外的性能。 root_T = sqrt(T) k1 = S * root_T k2 = K * exp(-rf * T) log_S_K = log(S / K) for i in range(1, 100): d1 = (log_S_K + T * (rf - d + iv ** 2 / 2)) / (iv * root_T) d2 = d1 - iv * root_T vega = k1 * norm.pdf(d1) 模型 = cp * (S * norm.cdf(cp * d1) - k2 * norm.cdf(cp * d2))

以上是关于从一个函数在 Pandas Dataframe 中创建多个列的主要内容,如果未能解决你的问题,请参考以下文章

从 pandas DataFrame 在 pyomo 中定义参数

pandas使用dt.isocalendar().week函数从dataframe(Series)中的日期数据列中抽取当年第多少周信息生成新的数据列

从大型元组/行列表中有效地构建 Pandas DataFrame

从大型元组/行列表中有效地构建 Pandas DataFrame

pandas使用extract函数根据正则表达式从dataframe指定数据列的字符串中抽取出数字(设置expand=false之后返回的为series)将series转化为dataframe

尝试从 Pandas DataFrame 中的字典访问第一个值时出现浮点错误