Quickest way to combine 2 strings, interleaving string from second column into the first by row throughout the dataframe

Posted 2023-02-16

【Asked】: 2019-11-14 14:22:21 【Question】:

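I wrote a function (with bits and pieces gleaned from Stack Overflow) that moves through an entire dataframe row by row, interleaving the string from col-x with the string from col-y, for every two-column pair x, y in every row.

I have a working solution. The problem is that it runs slowly on large dataframes.

Is there a faster way?

I tried the following setup: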

# Import modules
import pandas as pd
from itertools import chain, zip_longest

def interleave_strings(string1, string2):
    tuples = zip_longest(string1, string2, fillvalue='')
    string_list = [''.join(item) for item in tuples]
    return ''.join(string_list)

# Create the pandas DataFrame 
data = [['timy', 'toma', 'tama', 'tima', 'tomy', 'tome'], ['nicka', 'nacka', 'nucka', 'necka', 'nomy', 'nome'], ['julia', 'Julia', 'jalia', 'jilia', 'jomy', 'jome']] 
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D', 'E', 'F']) 

df

This gives us...

    timy    toma    tama    tima    tomy    tome
    nicka   nacka   nucka   necka   nomy    nome
    julia   Julia   jalia   jilia   jomy    jome
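
To see what the helper does on a single pair, including strings of unequal length, here is a minimal sketch. It re-declares the function so it runs on its own; the inputs 'abc' and 'xy' are made up for illustration and do not come from the question.

```python
from itertools import zip_longest

def interleave_strings(string1, string2):
    # Pair characters position by position; pad the shorter string with ''
    tuples = zip_longest(string1, string2, fillvalue='')
    return ''.join(''.join(item) for item in tuples)

same_length = interleave_strings('timy', 'toma')
uneven = interleave_strings('abc', 'xy')  # hypothetical input, not from the question

print(same_length)
print(uneven)
```

Because of fillvalue='', the leftover characters of the longer string are simply appended at the end.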

This works, but it is slow...

# new_df

# Collect one list of interleaved strings per column pair
rows = []
for i in range(len(df.columns) // 2):
    selection = df.iloc[:, 2*i:2*i+2]
    L = []
    for j in range(len(df.index)):
        res = interleave_strings(selection.iloc[j, 0], selection.iloc[j, 1])
        L.append(res)
    rows.append(L)
# DataFrame.append was removed in pandas 2.0, so build the frame in one step
il_df = pd.DataFrame(rows)

followed by

il_df.transpose()

The correct output is:

    0           1           2
0   ttiommya    ttaimmaa    ttoommye
1   nniacckkaa  nnuecckkaa  nnoommye
2   jJuulliiaa  jjailliiaa  jjoommye

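【Comments】:

Are the columns shown in the "correct output" only a partial set? I was expecting to see 6P2 columns. Are the words in the combined columns always the same length? For example: timy, toma and nicka, nacka?

【Answer 1】: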

We can use groupby on axis=1 (read: the column axis) over each pair of two columns.

Just like your own solution, we use interleaving:

from toolz import interleave

m = [x//2 for x in range(len(df.columns))]

df = df.groupby(m, axis=1).apply(lambda x: [''.join(interleave(t)) for t in zip(x.iloc[:, 0], x.iloc[:, 1])])

df = pd.DataFrame(df.to_numpy().tolist(), columns = df.index).T

Output

            0           1         2
0    ttiommya    ttaimmaa  ttoommye
1  nniacckkaa  nnuecckkaa  nnoommye
2  jJuulliiaa  jjailliiaa  jjoommye

Note: if your pandas version is older than 0.24, use .values instead of .to_numpy():

df = pd.DataFrame(df.values.tolist(), columns = df.index).T
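
In recent pandas releases (2.1 and later, to my knowledge), groupby(..., axis=1) is deprecated, so the snippet above may emit a warning or fail depending on the installed version. Below is a minimal sketch of the same pairwise idea without groupby. It assumes an even number of columns and uses only the standard library for the interleaving. The name interleave_pairs is made up for this illustration, and no timing is claimed for it.

```python
import pandas as pd
from itertools import zip_longest

def interleave_pairs(frame):
    """Interleave columns (0, 1), (2, 3), ... of `frame` row by row."""
    out = {}
    for k in range(frame.shape[1] // 2):
        left = frame.iloc[:, 2 * k]
        right = frame.iloc[:, 2 * k + 1]
        # Plain Python loop over the two columns; '' pads the shorter string
        out[k] = [
            ''.join(a + b for a, b in zip_longest(s1, s2, fillvalue=''))
            for s1, s2 in zip(left, right)
        ]
    return pd.DataFrame(out, index=frame.index)

data = [['timy', 'toma', 'tama', 'tima', 'tomy', 'tome'],
        ['nicka', 'nacka', 'nucka', 'necka', 'nomy', 'nome'],
        ['julia', 'Julia', 'jalia', 'jilia', 'jomy', 'jome']]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E', 'F'])

result = interleave_pairs(df)
print(result)
```

The result already has one column per pair and one row per original row, so no transpose is needed.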

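【Comments】:

【Answer 2】:

We can do it in two steps. First create a new frame containing all permutations of (x, y) within each row, then apply a function that interleaves the pair of strings in each cell of the new frame.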

  >>> import pandas as pd
  >>> import itertools
  >>> df
  Out[61]: 
         A      B      C      D     E     F
  0   timy   toma   tama   tima  tomy  tome
  1  nicka  nacka  nucka  necka  nomy  nome
  2  julia  Julia  jalia  jilia  jomy  jome

  >>>df_permute = df.apply(lambda x: pd.Series(list(itertools.permutations(x, 2))), axis=1)
  >>>df_permute
  Out[66]: 
                 0               1       ...                  28            29
  0    (timy, toma)    (timy, tama)      ...        (tome, tima)  (tome, tomy)
  1  (nicka, nacka)  (nicka, nucka)      ...       (nome, necka)  (nome, nomy)
  2  (julia, Julia)  (julia, jalia)      ...       (jome, jilia)  (jome, jomy)
  [3 rows x 30 columns]

  >>>def foo(x, y):
  ...  """Interleave string x, and y"""
  ...  return ''.join(p for p in itertools.chain(*itertools.zip_longest(x, y)) if p)
  ...

  >>> df_permute.applymap(lambda x: foo(*x))
  Out[68]: 
             0           1           2     ...            27         28        29
  0    ttiommya    ttiammya    ttiimmya    ...      ttoammea   ttoimmea  ttoommey
  1  nniacckkaa  nniucckkaa  nniecckkaa    ...     nnoumceka  nnoemceka  nnoommey
  2  jJuulliiaa  jjualliiaa  jjuilliiaa    ...     jjoamleia  jjoimleia  jjoommey
  [3 rows x 30 columns]
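
The permutation frame has 6P2 = 30 columns, while the question only needs the three adjacent pairs (A, B), (C, D) and (E, F). Here is a small sketch, using only itertools, of how those pairs could be located among the permutation columns. The variable names are illustrative.

```python
import itertools

n_cols = 6  # columns A..F in the example frame
perms = list(itertools.permutations(range(n_cols), 2))

# Positions of the adjacent pairs (0, 1), (2, 3), (4, 5) in permutation order
wanted = [perms.index((2 * k, 2 * k + 1)) for k in range(n_cols // 2)]

print(len(perms))
print(wanted)
```

Under that assumption, something like df_permute.iloc[:, wanted] would keep only the pairs the question asks for before the interleaving function is applied, which avoids interleaving the other 27 columns.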

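【Comments】:

【Answer 3】:

Thanks for the replies! They are appreciated. I originally asked, "Is there a faster way to do this?" So, in case you are interested: Erfan's method takes roughly half the time of mine, while Karthik's method is somewhat slower than mine.

Below are the %%timeit results for the actual interleaving, run in JupyterLab. With a larger dataframe, these milliseconds add up.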

Erfan   - 3.46 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
greg    - 6.81 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Karthik - 10.6 ms ± 98.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
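
Outside Jupyter, a comparable measurement can be taken with the standard timeit module. This is a minimal sketch that times only the pure-Python helper on one pair of strings. The repeat and loop counts are arbitrary choices that mirror the %%timeit settings, and the numbers it prints will not match those listed above.

```python
import timeit
from itertools import zip_longest

def interleave_strings(string1, string2):
    tuples = zip_longest(string1, string2, fillvalue='')
    return ''.join(''.join(item) for item in tuples)

# 7 runs of 100 loops each
runs = timeit.repeat(
    lambda: interleave_strings('nicka', 'nacka'),
    repeat=7,
    number=100,
)

# Each entry in `runs` is the total time for 100 calls, in seconds
best_per_loop = min(runs) / 100
print(f"best of 7: {best_per_loop * 1e6:.2f} µs per loop")
```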

Cheers!

【Comments】:
