Dataframe: extract multiple parents from single ID and count occurrences

Posted 2023-04-15

【Asked】: 2018-04-13 23:04:56 【Question】:

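I'm not sure the title is good enough. Feel free to adjust it!

Situation: I have a dataframe that is basically a product catalog. Two columns matter: the product ID and a 12-digit category. Here is some sample data. The original data of course contains many more products, more columns, and many different categories.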

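import pandas as pd

products = [
    {'category': 110401010601, 'product': 1000023},
    {'category': 110401020601, 'product': 1000024},
    {'category': 110401030601, 'product': 1000025},
    {'category': 110401040601, 'product': 1000026},
    {'category': 110401050601, 'product': 1000027}]

df = pd.DataFrame.from_records(products)
# the snippets below slice the category as a string, so convert it once here
df['category'] = df['category'].astype(str)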

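The task is to build parent categories from the 12-digit category number and to count how many products match each parent. Parents are formed in steps of 2 digits. The per-parent counts are later used to find, for each product, a parent that has a minimum number of records (say, 12 children). Naturally, the shorter the number, the more products match it. Here is an example parent structure: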

110401050601 # product category
1104010506 # 1st parent
11040105 # 2nd parent
110401 # 3rd parent
1104 # 4th parent
11 # 5th super-parent

As you can see, more products may match, for example, 1104 than just 110401050601.
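The parent structure above can be sketched as a small helper. The function name `parent_categories` and its `step` parameter are illustrative assumptions, not part of the original question.

```python
def parent_categories(category, step=2):
    """Return the category itself followed by its parents, longest prefix first.

    Hypothetical helper: assumes `category` is a string whose length is a
    multiple of `step` (12 digits and 2-digit steps in the question).
    """
    return [category[:i] for i in range(len(category), 0, -step)]


print(parent_categories("110401050601"))
# ['110401050601', '1104010506', '11040105', '110401', '1104', '11']
```

Every product therefore contributes to six counters: its own category plus five parents.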

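Idea 1 for small data: As long as small or medium-sized data fits completely into a Pandas dataframe, this is an easy task. I solved it with the code below. The drawback is that the code assumes all data is in memory, and every loop iteration is another selection on the full dataframe, which is bad for performance. Example: with 100,000 rows and 6 parent groups (from the 12 digits), you may end up with up to 600,000 selections via DataFrame.loc[...] in the worst case, so the work grows with the data. To prevent this, I break out of the loop if a parent has been seen before. Note: df.shape[0] is equivalent to len(df).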

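MIN_COUNT = 12  # example threshold; choose the minimum number of records you need

df = df.drop_duplicates()
categories = df['category'].unique()

counts = dict()
for cat in categories:
    # number of rows with exactly this 12-digit category
    counts[cat] = df.loc[df['category'] == cat].shape[0]

    # walk up the hierarchy: 10-, 8-, 6-, 4- and 2-digit prefixes
    for i in range(10, 1, -2):
        parent = cat[:i]

        if parent not in counts:
            counts[parent] = df.loc[df['category'].str.startswith(parent)].shape[0]
        else:
            # a known parent implies that all shorter prefixes are known too
            break

# keep only the categories with enough records
counts = {key: value for key, value in counts.items() if value >= MIN_COUNT}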

This results in something like the following (using part of my original data):

{'11': 100,
 '1103': 7,
 '110302': 7,
 '11030202': 7,
 '1103020203': 7,
 '110302020301': 7,
 '1104': 44,
 '110401': 15,
 '11040101': 15,
 '1104010106': 15,
 '110401010601': 15}
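The question mentions that the counts are later used to find, for each product, a parent with a minimum number of records. A minimal sketch of that lookup follows; it assumes that "the parent" means the most specific (longest) prefix whose count reaches the threshold, which the question does not state explicitly. The name `most_specific_parent` is hypothetical.

```python
def most_specific_parent(category, counts, min_count, step=2):
    """Return the longest prefix of `category` with at least `min_count` records.

    Assumption: the longest qualifying prefix is the one wanted.
    Returns None when no prefix reaches the threshold.
    """
    for i in range(len(category), 0, -step):
        prefix = category[:i]
        if counts.get(prefix, 0) >= min_count:
            return prefix
    return None


# counts taken from the sample output above
counts = {'11': 100, '1103': 7, '110302': 7, '11030202': 7, '1103020203': 7,
          '110302020301': 7, '1104': 44, '110401': 15, '11040101': 15,
          '1104010106': 15, '110401010601': 15}

print(most_specific_parent('110401010601', counts, min_count=12))  # 110401010601
print(most_specific_parent('110302020301', counts, min_count=12))  # 11
```

The second product falls back all the way to 11 because every 1103 prefix has only 7 records.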

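Idea 2 for big data, using flatmap-reduce: Now assume you have much more data that is loaded row by row, and you want to achieve the same as above. I was thinking of using flatmap to split the category number into its parents (one to many) with a counter of 1 for each parent, and then applying a group-by-key to get the counts for all possible parents. The advantage of this version is that it does not need all the data at once and does not perform any selections on the dataframe. However, in the flatmap step the number of rows grows by a factor of 6 (because the 12-digit category number is split into 6 groups). Since Pandas has no flatten/flatmap method, I had to apply a workaround using unstack (for an explanation, see this post).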

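df = df.drop_duplicates()
# flatmap step: emit one (prefix, 1) pair per level; the range starts at 12
# so that the full category is included, giving the 6 groups described above
counts_stacked = df['category'].apply(lambda cat: [(cat[:i], 1) for i in range(12, 1, -2)])
# workaround for the missing flatmap: spread the lists into columns, then unstack into one long Series
counts = counts_stacked.apply(pd.Series).unstack().reset_index(drop=True)

# reduce step: group by prefix and count the pairs
df_counts = pd.DataFrame.from_records(list(counts), columns=['category', 'count'])
counts = df_counts.groupby('category').count().to_dict()['count']
counts = {key: value for key, value in counts.items() if value >= MIN_COUNT}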
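Newer pandas versions offer Series.explode (available since pandas 0.25), which plays the role of the missing flatmap. The following is a sketch of the same flatmap-then-count idea under that assumption; the sample frame and the threshold of 2 are illustrative only.

```python
import pandas as pd

MIN_COUNT = 2  # illustrative threshold, chosen small to fit the sample data

df = pd.DataFrame({
    "product": [1000023, 1000024, 1000025, 1000026, 1000027],
    "category": ["110401010601", "110401020601", "110401030601",
                 "110401040601", "110401050601"],
})

# flatmap: one list of 6 prefixes per row, then one row per prefix
prefixes = (
    df.drop_duplicates()["category"]
      .apply(lambda cat: [cat[:i] for i in range(12, 1, -2)])
      .explode()
)

# reduce: count each prefix and keep those that reach the threshold
counts = prefixes.value_counts()
counts = counts[counts >= MIN_COUNT].to_dict()
print(counts)
```

With the five sample products, only the three shared prefixes (11, 1104 and 110401) survive the filter, each with a count of 5. The row count still grows sixfold in the explode step, so this tidies the workaround rather than removing the blow-up.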

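Question: Both solutions work fine, but I wonder whether there is a more elegant way to achieve the same result. I feel like I'm missing something.

【Comments】:

What do you want to optimize? Speed, memory usage, readability? "Elegant" is a bit vague.

Basically, I'm looking for a version that can be easily transferred to PySpark or Apache Beam, both of which are based on the map-reduce concept. My second snippet is similar to that, but it greatly increases the number of rows. If there is another way to do this, I'm interested.

【Answer 1】: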

You can use cumsum here:

df.category.astype(str).str.split('(..)').apply(pd.Series).replace('',np.nan).dropna(axis=1).cumsum(axis=1).stack().value_counts()
Out[287]: 
11              5
1104            5
110401          5
11040102        1
110401050601    1
1104010206      1
110401040601    1
11040101        1
1104010106      1
110401010601    1
110401020601    1
11040104        1
110401030601    1
11040103        1
1104010406      1
1104010306      1
11040105        1
1104010506      1
dtype: int64
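The chained one-liner can be hard to follow. As a sketch of an equivalent formulation (not the answerer's method), the same prefix counts can be obtained by slicing the string at each level with the .str accessor and concatenating the results:

```python
import pandas as pd

categories = pd.Series([110401010601, 110401020601, 110401030601,
                        110401040601, 110401050601]).astype(str)

# one Series of prefixes per level (2, 4, ..., 12 digits), stacked and counted
prefix_counts = pd.concat(
    [categories.str[:i] for i in range(2, 13, 2)]
).value_counts()

print(prefix_counts)
```

For the sample data this yields 18 distinct prefixes: 11, 1104 and 110401 with a count of 5 each, and the remaining 15 with a count of 1, matching the output above.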

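【Comments】:

Nice. I didn't know you could use split('(..)'). This solution relies heavily on pandas, though, and I guess it won't work in PySpark or Apache Beam.

【Answer 2】:

Here is another solution, using the Apache Beam SDK for Python. It is compatible with big data because it uses the map-reduce paradigm. The sample file should contain the product ID as the first column and the 12-digit category as the second column, with ; as the separator. The elegance of this code is that you can see each transformation nicely, one per line.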

# Python 2.7

import apache_beam as beam
FILE_IN = 'my_sample.csv'
SEPARATOR = ';'
MIN_COUNT = 12  # example threshold; choose the minimum number of records you need

# the collector target must be created outside the Do-Function to be globally available
results = dict()

# a custom Do-Function that collects the results
class Collector(beam.DoFn):    
    def process(self, element):
        category, count = element
        results[category] = count
        return {category: count}


# This runs the pipeline locally.
with beam.Pipeline() as p:
    counts = (p
     | 'read file row-wise' >> beam.io.ReadFromText(FILE_IN, skip_header_lines=True)
     | 'split row' >> beam.Map(lambda line: line.split(SEPARATOR))
     | 'remove useless columns' >> beam.Map(lambda words: words[0:2])
     | 'remove quotes' >> beam.Map(lambda words: [word.strip('\"') for word in words])
     | 'convert from unicode' >> beam.Map(lambda words: [str(word) for word in words])
     | 'convert to tuple' >> beam.Map(lambda words: tuple(words))
     | 'remove duplicates' >> beam.RemoveDuplicates()
     | 'extract category' >> beam.Map(lambda (product, category): category)
     | 'create parent categories' >> beam.FlatMap(lambda cat: [cat[:i] for i in range(12,1,-2)])
     | 'group and count by category' >> beam.combiners.Count.PerElement()
     | 'filter by minimum count' >> beam.Filter(lambda count: count[1] >= MIN_COUNT)
     | 'collect results' >> beam.ParDo(Collector())
    )

# Leaving the with-block runs the pipeline and waits until it finishes,
# so no separate p.run() / wait_until_finish() call is needed.

# investigate the result;
# expected is a dict that maps each category to its count
print(results)

The code is written in Python 2.7 because, at the time of writing, the Apache Beam SDK for Python did not yet support Python 3.
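To follow the pipeline's logic without installing Beam, its stages can be mirrored in plain Python 3. This is only a sketch of the same map-reduce steps; the function name `count_parents` and the sample lines are assumptions made for illustration.

```python
from collections import Counter


def count_parents(lines, separator=";", min_count=1):
    """Mirror the pipeline stages: split, dedupe, flatmap to prefixes, count, filter."""
    # split each row, keep the first two columns, strip quotes, remove duplicates
    rows = {
        tuple(field.strip('"') for field in line.rstrip("\n").split(separator)[:2])
        for line in lines
    }
    # flatmap: every category yields its 12-, 10-, ..., 2-digit prefixes
    prefixes = (category[:i] for _product, category in rows for i in range(12, 1, -2))
    # reduce: count per prefix, then filter by the minimum count
    counts = Counter(prefixes)
    return {cat: n for cat, n in counts.items() if n >= min_count}


sample = [
    '"1000023";"110401010601"',
    '"1000024";"110401020601"',
    '"1000023";"110401010601"',  # duplicate row, removed before counting
]
print(count_parents(sample, min_count=2))
```

With these three lines, two unique rows remain, so only the shared prefixes 11, 1104 and 110401 reach a count of 2.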
