Merge commits without file-change data do not show up in the DataFrame
Posted: 2021-09-25 02:59:16
Question: I extracted the git log with this command:
git log --all --numstat --pretty=format:'--%h--%ad--%aN--%s' > ../react_git.logs
I would also like a column for merge commits, so that I can analyze, for example, which author merged the most commits. This is the code I wrote to build a DataFrame from the generated log:
import os
import pandas as pd

# Path to the log file written by the git log command above
COMMIT_LOG = os.path.join(os.path.abspath(''), 'react_git.logs')
# "\u0012" is not expected to occur in the log, so each line lands in one "raw" column
raw_df = pd.read_csv(COMMIT_LOG, sep="\u0012", header=None, names=["raw"])
# Commit header lines start with "--"
commit_marker = raw_df[raw_df["raw"].str.startswith("--", na=False)]
commit_info = commit_marker['raw'].str.extract(r"^--(?P<sha>.*?)--(?P<timestamp>.*?)--(?P<author>.*?)--(?P<message>.*?)$", expand=True)
commit_info.insert(loc=2, column='date', value=pd.to_datetime(commit_info['timestamp'], utc=True))
commit_info_copy = commit_info.loc[:]
commit_info_copy['today'] = pd.to_datetime('today', utc=True)
# date - today, so the age is a negative timedelta
commit_info['age'] = commit_info_copy['date'] - commit_info_copy['today']
# Every line that is not a commit header is treated as a numstat line
file_stats_marker = raw_df[~raw_df.index.isin(commit_info.index)]
file_stats = file_stats_marker['raw'].str.split("\t", expand=True)
file_stats = file_stats.rename(columns={0: "insertion", 1: "deletion", 2: "filepath"})
file_stats['insertion'] = pd.to_numeric(file_stats['insertion'], errors="coerce")
file_stats['deletion'] = pd.to_numeric(file_stats['deletion'], errors="coerce")
file_stats['churn'] = file_stats['insertion'] - file_stats['deletion']
# Forward-fill the commit header onto the numstat lines that follow it
commit_data = commit_info.reindex(raw_df.index).ffill()
# Dropping the header rows also drops every commit that has no numstat lines
commit_data = commit_data[~commit_data.index.isin(commit_info.index)]
df = commit_data.join(file_stats)
print(df.size)
print(df.head())
If I have a log like this:
--cae635054--Sat Jun 26 14:51:23 2021 -0400--Andrew Clark--`act`: Resolve to return value of scope function (#21759)
31 0 packages/react-reconciler/src/__tests__/ReactIsomorphicAct-test.js
1 1 packages/react-test-renderer/src/ReactTestRenderer.js
24 14 packages/react/src/ReactAct.js
--e2453e200--Fri Jun 25 15:39:46 2021 -0400--Andrew Clark--act: Add test for bypassing queueMicrotask (#21743)
50 0 packages/react-reconciler/src/__tests__/ReactIsomorphicAct-test.js
--8f03109cd--Wed Sep 11 09:51:32 2019 -0700--Brian Vaughn--Moved backend injection to the content script (#16752)
--efa780d0a--Wed Sep 11 09:51:24 2019 -0700--Brian Vaughn--Removed DT inject() script since it's no longer being used
0 24 packages/react-devtools-extensions/src/inject.js
--4290967d4--Wed Sep 11 09:34:31 2019 -0700--Brian Vaughn--Merge branch 'tt-compat' of https://github.com/onionymous/react into onionymous-tt-compat
--f09854a9e--Wed Sep 11 09:30:57 2019 -0700--Brian Vaughn--Moved inline comment.
3 5 packages/react-devtools-extensions/src/injectGlobalHook.js
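then 8f03109cd and 4290967d4 do not appear in the DataFrame, even though they matter for counting merge commits as described above.
How can I get these commits into the DataFrame with insertion: 0, deletion: 0 and filepath: 0, plus a column (or any other means) that distinguishes them from the rest of the data, so that it is easy to tell that a row belongs to a merge commit?
I also have this on repl, together with the log file: https://replit.com/@milanregmi/metricsLogs#main.py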
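Answer 1:
I read your question as having two parts:
- How to get information about merge commits out of git log
- How to count the merges per user and write the result into the DataFrame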
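First, you can add the %b placeholder to the --pretty format of git log. It gives you the commit body, which for merge commits like the one shown below has a line at the top stating that the commit is a merge. A format string like this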
git log --all --numstat --pretty=format:'--%h--%ad--%aN--%s--%b'
then produces output like this:
--364d23ac--Wed Jul 14 13:44:55 2021 +0200--Doe, John--Pull request #114: Bugfix/foo-branch--Merge in <repository> from bugfix/foo-branch to dev
* commit '64ee15b12345670d1ec214cb83468cf0a55a341':
Bugfix: lorem ipsum
Fix: dolor sit amet
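You can look for Merge in the commit_marker DataFrame to identify merge commits. Second, you can count the merges per author by giving each author's merge commits a reverse cumulative count, so that the author's newest merge carries the total.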
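Matching on the word Merge depends on how the commit message happens to be worded. As an alternative that is not part of this answer, git can print the abbreviated parent hashes with the %p placeholder, and a commit with two or more parents is a merge. The sketch below assumes a changed format string, '--%h--%p--%ad--%aN--%s', and uses only the standard library; the helper name parse_header and the parent hashes in the sample are made up for illustration.

```python
import re

# Assumed log format (differs from the one used in this thread):
#   git log --all --numstat --pretty=format:'--%h--%p--%ad--%aN--%s'
# %p prints the abbreviated parent hashes, separated by spaces.
HEADER = re.compile(
    r"^--(?P<sha>[0-9a-f]+)--(?P<parents>[0-9a-f ]*)"
    r"--(?P<timestamp>.*?)--(?P<author>.*?)--(?P<message>.*)$"
)


def parse_header(line):
    """Return a dict for a commit header line, or None for any other line."""
    match = HEADER.match(line)
    if match is None:
        return None
    commit = match.groupdict()
    commit["parents"] = commit["parents"].split()
    # A merge commit has two or more parents.
    commit["is_merge"] = len(commit["parents"]) >= 2
    return commit


# Hypothetical sample lines; the parent hashes are invented for this sketch.
sample = [
    "--4290967d4--aaaaaaa bbbbbbb--Wed Sep 11 09:34:31 2019 -0700--Brian Vaughn--Merge branch 'tt-compat'",
    "--f09854a9e--ccccccc--Wed Sep 11 09:30:57 2019 -0700--Brian Vaughn--Moved inline comment.",
    "3\t5\tpackages/react-devtools-extensions/src/injectGlobalHook.js",
]
parsed = [parse_header(line) for line in sample]
```

With this format the is_merge flag does not depend on the subject or body text, which helps for commits whose subject does not mention a merge at all.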
Putting it all together:
import os
import pandas as pd

COMMIT_LOG = os.path.join(os.path.abspath(''), 'react_git.logs')
raw_df = pd.read_csv(COMMIT_LOG, sep="\u0012", header=None, names=["raw"])
commit_marker = raw_df[raw_df["raw"].str.startswith("--", na=False)]
# Add extraction for the new body part
commit_info = commit_marker['raw'].str.extract(r"^--(?P<sha>.*?)--(?P<timestamp>.*?)--(?P<author>.*?)--(?P<message>.*?)--(?P<body>.*?)$", expand=True)
commit_info.insert(loc=2, column='date', value=pd.to_datetime(commit_info['timestamp'], utc=True))
commit_info_copy = commit_info.loc[:]
commit_info_copy['today'] = pd.to_datetime('today', utc=True)
commit_info['age'] = commit_info_copy['date'] - commit_info_copy['today']
file_stats_marker = raw_df[~raw_df.index.isin(commit_info.index)]
file_stats = file_stats_marker['raw'].str.split("\t", expand=True)
file_stats = file_stats.rename(columns={0: "insertion", 1: "deletion", 2: "filepath"})
file_stats['insertion'] = pd.to_numeric(file_stats['insertion'], errors="coerce")
file_stats['deletion'] = pd.to_numeric(file_stats['deletion'], errors="coerce")
file_stats['churn'] = file_stats['insertion'] - file_stats['deletion']
commit_data = commit_info.reindex(raw_df.index).ffill()
commit_data = commit_data[~commit_data.index.isin(commit_info.index)]
df = commit_data.join(file_stats)
# Remove additional lines coming from the git commit body.
df.drop_duplicates(inplace=True)
# Count merges per author: the newest merge row gets the total, the oldest gets 1
for author in df['author'].unique():
    is_merge = (df['author'] == author) & df['body'].str.contains('Merge', na=False)
    idx = df[is_merge].index
    if len(idx):
        df.loc[idx, 'merges'] = list(range(len(idx), 0, -1))
print(df.size)
print(df.head())
What you end up with is the DataFrame you had before, plus a merges column that holds a running count of each author's merges:
sha | timestamp | date | author | message | body | age | insertion | deletion | filepath | churn | merges
b3e5eb7 | Fri Jul 16 11:36:43 2021 +0200 | 2021-07-16 09:36:43+00:00 | Doe, John | test deploy | | -4 days +00:51:00.878903000 | 31.0 | 3.0 | file/path/file1.py | 28.0 |
4fc0c34 | Thu Jul 15 11:12:10 2021 +0200 | 2021-07-15 09:12:10+00:00 | Cow, Jane | Pull request #116: Dev | Merge in repo from dev to master | -5 days +00:26:27.878903000 | | | | | 14.0
8188751 | Thu Jul 15 07:42:40 2021 +0200 | 2021-07-15 05:42:40+00:00 | Doe, John | Pull request #115: Feature/foo-bar | Merge in repo from feature/foo-bar to dev | -6 days +20:56:57.878903000 | | | | | 7.0
6fa89c3 | Wed Jul 14 16:02:38 2021 +0200 | 2021-07-14 14:02:38+00:00 | Cow, Jane | Added: foo bar | | -6 days +05:16:55.878903000 | 4056.0 | 0.0 | file/path/file2.py | 4056.0 |
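To make the meaning of the merges column concrete, here is a small standard-library sketch of the same counting idea. The (author, is_merge) pairs are hypothetical and listed newest first, the way git log prints them; the names totals and countdown are introduced only for this illustration.

```python
from collections import Counter

# Hypothetical (author, is_merge) pairs in git log order, newest first.
commits = [
    ("Cow, Jane", True),
    ("Doe, John", True),
    ("Doe, John", False),
    ("Cow, Jane", True),
    ("Doe, John", True),
]

# Total number of merges per author.
totals = Counter(author for author, is_merge in commits if is_merge)

# Countdown per merge row: an author's newest merge carries the total and
# the oldest carries 1, which is what a reversed 1..N range produces.
remaining = Counter(totals)
countdown = []
for author, is_merge in commits:
    if is_merge:
        countdown.append(remaining[author])
        remaining[author] -= 1
    else:
        countdown.append(None)  # non-merge rows get no count
```

If only the totals are needed, the per-author Counter is enough and the countdown column can be skipped.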
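Comments:
Thank you very much for your solution and your time. I looked at the DataFrame and I see commits where insertion, deletion, filepath and merges are all empty, so is it OK to drop those rows?
@milan I added an example of how the DataFrame looks for the repository I used to test this. I don't see any commits where insertion, deletion, filepath and merges are all empty. If you do see such entries and want to drop those rows, go ahead.
Sure, I will. Thanks again for your help. Just to confirm: "plus a merges column that holds a running count of each author's merges" means it is the total number of merge commits per author? As in your case, Cow, Jane has 14 merge commits in total and Doe, John has 7.
@milan Yes, the number goes up with every merge an author made. As you said, Cow, Jane made 14 merges and Doe, John made 7.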
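Coming back to the original request (rows with insertion 0 and deletion 0 for commits that have no file statistics, plus a column that marks them), one possible approach is to parse the log line by line and emit a placeholder row whenever a commit header is not followed by any numstat line. This is a minimal sketch under assumptions, not the method used in the answer: it expects the original '--%h--%ad--%aN--%s' format with tab-separated numstat lines, and parse_log and has_file_stats are names introduced here. Note that has_file_stats only says a commit had no numstat output; it does not prove the commit is a merge.

```python
import re

HEADER = re.compile(
    r"^--(?P<sha>.*?)--(?P<timestamp>.*?)--(?P<author>.*?)--(?P<message>.*)$"
)


def parse_log(lines):
    """Return one row per changed file, or one zero row per commit without file stats."""
    rows = []
    current = None   # header fields of the commit being read
    file_count = 0   # numstat lines seen for that commit

    def placeholder(commit):
        return {**commit, "insertion": 0, "deletion": 0,
                "filepath": "", "has_file_stats": False}

    for line in lines:
        line = line.rstrip("\n")
        header = HEADER.match(line)
        if header:
            # The previous commit had no numstat lines: keep it as a zero row.
            if current is not None and file_count == 0:
                rows.append(placeholder(current))
            current = header.groupdict()
            file_count = 0
        elif line.strip() and current is not None:
            parts = line.split("\t")
            if len(parts) != 3:
                continue  # not a numstat line
            insertion, deletion, filepath = parts
            rows.append({
                **current,
                # git prints "-" for binary files; this sketch counts them as 0
                "insertion": int(insertion) if insertion.isdigit() else 0,
                "deletion": int(deletion) if deletion.isdigit() else 0,
                "filepath": filepath,
                "has_file_stats": True,
            })
            file_count += 1
    if current is not None and file_count == 0:
        rows.append(placeholder(current))
    return rows


# Lines taken from the log excerpt in the question (tabs assumed in numstat lines).
sample = [
    "--8f03109cd--Wed Sep 11 09:51:32 2019 -0700--Brian Vaughn--Moved backend injection to the content script (#16752)",
    "--efa780d0a--Wed Sep 11 09:51:24 2019 -0700--Brian Vaughn--Removed DT inject() script since it's no longer being used",
    "0\t24\tpackages/react-devtools-extensions/src/inject.js",
    "--4290967d4--Wed Sep 11 09:34:31 2019 -0700--Brian Vaughn--Merge branch 'tt-compat' of https://github.com/onionymous/react into onionymous-tt-compat",
]
rows = parse_log(sample)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) to continue the analysis in pandas, for example by filtering on has_file_stats.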