GitHub API: Repositories Contributed To

Posted 2023-03-28


【Title】GitHub API: Repositories Contributed To 【Posted】2014-01-09 23:07:12 【Question】:

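Is there a way to get access to the data in the "Repositories contributed to" module on GitHub profile pages via the GitHub API? Ideally the entire list, not just the top five, which is apparently all you can get on the web.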

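【Comments on the question】:

I believe there is no simple way to do this. Mining the data available in the (unofficial) GitHub Archive project would help (but only for public projects): githubarchive.org
Interested to know how exactly to do this in JavaScript. The repos should include not only those a person has committed to, but also those where they have opened issues, commented, and so on.
I don't have a clear way in mind. You would need to make a lot of queries to arrive at a result.
The rules GitHub uses to determine whether something counts as a contribution are here: help.github.com/articles/…

【Answer 1】: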

With the GraphQL API v4, you can now get these contributed repos using:


{
  viewer {
    repositoriesContributedTo(first: 100, contributionTypes: [COMMIT, ISSUE, PULL_REQUEST, REPOSITORY]) {
      totalCount
      nodes {
        nameWithOwner
      }
      pageInfo {
        endCursor
        hasNextPage
      }
    }
  }
}

Try it in the explorer

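If you have more than 100 contributed repos (including your own), you will have to paginate: pass after: "END_CURSOR_VALUE" to repositoriesContributedTo in the next request, using the endCursor value from the previous response.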
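
The pagination described above can be sketched in Python roughly as follows. This is a minimal sketch, not an official client: the helper names `github_graphql` and `all_contributed_repos` are made up for illustration, and the query is the one from this answer with a `$cursor` variable added for `after:`.

```python
import json
import urllib.request

# Same query as above, with a $cursor variable feeding the `after:` argument.
QUERY = """
query($cursor: String) {
  viewer {
    repositoriesContributedTo(
      first: 100, after: $cursor,
      contributionTypes: [COMMIT, ISSUE, PULL_REQUEST, REPOSITORY]
    ) {
      nodes { nameWithOwner }
      pageInfo { endCursor hasNextPage }
    }
  }
}
"""


def github_graphql(token):
    """Return a run_query(query, variables) callable that posts to the GraphQL endpoint."""
    def run_query(query, variables):
        body = json.dumps({"query": query, "variables": variables}).encode()
        request = urllib.request.Request(
            "https://api.github.com/graphql",
            data=body,
            headers={"Authorization": "bearer " + token},
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return run_query


def all_contributed_repos(run_query):
    """Follow endCursor until hasNextPage is false and collect every nameWithOwner."""
    names, cursor = [], None
    while True:
        result = run_query(QUERY, {"cursor": cursor})
        page = result["data"]["viewer"]["repositoriesContributedTo"]
        names.extend(node["nameWithOwner"] for node in page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            return names
        cursor = page["pageInfo"]["endCursor"]
```

With a personal access token in hand, `all_contributed_repos(github_graphql(token))` would return the full list. Passing the request function in as a parameter keeps the paging loop easy to exercise without network access.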

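【Discussion】:

Now that we're in the future (2017), the best solution to this problem is to use GitHub's new GraphQL API, rather than the 2014-era solution that relies on githubarchive and Google BigQuery.
Oddly, it doesn't show my own projects... but cool solution!
This looks good, but the documentation says "A list of repositories that the user recently contributed to" (emphasis mine). Also, my own projects are missing.
You can include your own projects with `includeUserRepositories: true`.
@Mythaar you can use a personal access token; see this for a Python script example

【Answer 2】: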

Using Google BigQuery with the GitHub Archive, I pulled all the repositories I made pull requests to using:

SELECT repository_url 
FROM [githubarchive:github.timeline]
WHERE payload_pull_request_user_login ='rgbkrk'
GROUP BY repository_url;

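You can use similar semantics to pull out just the number of repositories you contributed to, as well as the number of languages they were written in: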

SELECT COUNT(DISTINCT repository_url) AS count_repositories_contributed_to,
       COUNT(DISTINCT repository_language) AS count_languages_in
FROM [githubarchive:github.timeline]
WHERE payload_pull_request_user_login ='rgbkrk';

If you're looking for overall contributions, which includes issues reported, use:

SELECT COUNT(DISTINCT repository_url) AS count_repositories_contributed_to,
       COUNT(DISTINCT repository_language) AS count_languages_in
FROM [githubarchive:github.timeline]
WHERE actor_attributes_login = 'rgbkrk';

The difference here is that actor_attributes_login comes from the Issue Events API.

You may also want to capture your own repos, which may not have issues or PRs filed by you.

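【Discussion】:

As of January 2015, the githubarchive:github.timeline table has been deprecated.
In addition to the table deprecation pointed out by @sulaiman, the schema of the replacement tables has completely changed (e.g. table githubarchive:year.2017), so a current query looks like: SELECT repo.name FROM [githubarchive:year.2017] WHERE actor.login ='rgbkrk' GROUP BY repo.name;
In addition to the comments from @sulaimansudirman and @gene_wood: the syntax has changed slightly, so a current query would look like this: SELECT repo.name FROM `githubarchive.year.2019` WHERE actor.login ='rgbkrk' GROUP BY repo.name; As a side note: you can use * instead of the year.

【Answer 3】: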

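I tried implementing something like this a while ago for a GitHub summarizer... My steps to get the repositories a user has contributed to but does not own were as follows (using my own user as an example):

Search for the last 100 closed pull requests the user has submitted. Of course, if the first page is full, you can request the second page to get older PRs.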

https://api.github.com/search/issues?q=type:pr+state:closed+author:megawac&per_page=100&page=1

Next, I request the contributors of each of those repos. If the user in question is in the contributors list, we add the repo to the list. For example:

https://api.github.com/repos/jashkenas/underscore/contributors

We can also try checking all the repos the user is watching. Again, we check repos/:owner/:repo/contributors for each repo.

https://api.github.com/users/megawac/subscriptions

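In addition, I iterate over all the repos of the organizations the user belongs to:

https://api.github.com/users/megawac/orgs
https://api.github.com/orgs/jsdelivr/repos

If the user is listed as a contributor to any of those repos, we add the repo to the list (same as the step above).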
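
The steps above could be sketched in Python along these lines. This is only a rough sketch under several assumptions: the function names are invented for illustration, only the first page of each listing is read, and unauthenticated requests are heavily rate-limited, so a real run would need a token and paging.

```python
import json
import urllib.request

API = "https://api.github.com"


def fetch_json(url):
    """GET a GitHub API URL and decode the JSON body (unauthenticated)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def candidate_repos(user, fetch=fetch_json):
    """Collect 'owner/name' candidates from closed PRs, watched repos and org repos."""
    found = set()
    search = fetch(
        f"{API}/search/issues?q=type:pr+state:closed+author:{user}&per_page=100&page=1"
    )
    for item in search["items"]:
        # repository_url looks like https://api.github.com/repos/<owner>/<name>
        found.add(item["repository_url"].split("/repos/", 1)[1])
    for repo in fetch(f"{API}/users/{user}/subscriptions"):
        found.add(repo["full_name"])
    for org in fetch(f"{API}/users/{user}/orgs"):
        for repo in fetch(f"{API}/orgs/{org['login']}/repos"):
            found.add(repo["full_name"])
    return found


def repos_with_user_as_contributor(user, fetch=fetch_json):
    """Keep candidates the user does not own and whose contributors list names the user."""
    result = []
    for full_name in sorted(candidate_repos(user, fetch)):
        owner = full_name.split("/")[0]
        if owner.lower() == user.lower():
            continue  # skip the user's own repositories
        contributors = fetch(f"{API}/repos/{full_name}/contributors")
        if any(c["login"].lower() == user.lower() for c in contributors):
            result.append(full_name)
    return result
```

Calling `repos_with_user_as_contributor("megawac")` would issue one request per candidate repo, which is the cost this answer complains about.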

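This will miss repos where the user has not submitted any pull requests but has been added as a contributor. We can increase the odds of finding those repos by also searching for:

1) any issue the user has opened (not just closed pull requests) 2) repos the user has starred

Clearly, this requires many more requests than we would like, but what can you do when they make you fudge features \o/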

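【Discussion】:

It would be ideal if you could get your JavaScript to also search for repos where the user has opened or commented on issues. The rules GitHub uses to generate its list of contributed repos are here, but we don't need to match them too closely: help.github.com/articles/…

【Answer 4】: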

You can use the Search provided by the GitHub API. Your query should look like this:

https://api.github.com/search/repositories?q=%20+fork:true+user:username

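Setting the fork parameter to true ensures that you query all of the user's repos, including forks.

However, if you want to make sure the user has not only forked a repository but also contributed to it, you should iterate over every repo returned by the search request and check whether the user is among its contributors. This is awkward, because GitHub returns only 100 contributors, and there is no workaround for that...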

【Discussion】:

This only yields the current list of the user's repos, not the list of repos they have ever contributed to.

【Answer 5】:

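You can probably get the last year or so via GitHub's GraphQL API, as shown in Bertrand Martel's answer.

Everything that has happened since 2011 can be found in the GitHub Archive, as described in Kyle Kelley's answer. However, BigQuery's syntax and GitHub's API seem to have changed, and the examples shown there no longer work as of 08/2020.

So this is how I found all the repos I contributed to: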

SELECT distinct repo.name
FROM (
  SELECT * FROM `githubarchive.year.2011` UNION ALL
  SELECT * FROM `githubarchive.year.2012` UNION ALL
  SELECT * FROM `githubarchive.year.2013` UNION ALL
  SELECT * FROM `githubarchive.year.2014` UNION ALL
  SELECT * FROM `githubarchive.year.2015` UNION ALL
  SELECT * FROM `githubarchive.year.2016` UNION ALL
  SELECT * FROM `githubarchive.year.2017` UNION ALL
  SELECT * FROM `githubarchive.year.2018`
)
WHERE (type = 'PushEvent' 
  OR type = 'PullRequestEvent')
  AND actor.login = 'YOUR_USER'

Some of the returned repos have only a name, with no user or organization. But I had to process the results manually anyway.
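
Rather than typing one line per year, the query text could be generated. The following is a small sketch that only builds the SQL string shown in this answer; the function name `build_contributions_query` is made up here, and running the query still requires BigQuery access, which is not shown.

```python
def build_contributions_query(login, first_year, last_year):
    """Build the standard-SQL query over the yearly githubarchive tables."""
    # GitHub logins consist of alphanumerics and hyphens; reject anything else
    # so the login can be embedded in the SQL text without quoting problems.
    if not login.replace("-", "").isalnum():
        raise ValueError("unexpected characters in login")
    tables = " UNION ALL\n  ".join(
        f"SELECT * FROM `githubarchive.year.{year}`"
        for year in range(first_year, last_year + 1)
    )
    return (
        "SELECT DISTINCT repo.name\n"
        f"FROM (\n  {tables}\n)\n"
        "WHERE type IN ('PushEvent', 'PullRequestEvent')\n"
        f"  AND actor.login = '{login}'"
    )
```

For example, `print(build_contributions_query("rgbkrk", 2011, 2018))` produces a query equivalent to the one above, with `type IN (...)` standing in for the two `OR`-ed comparisons.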

【Discussion】:

【Answer 6】:

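I ran into the same problem. (GithubAPI: Get repositories a user has ever committed in)

One practical hack I found is a project called http://www.githubarchive.org/. They log all public events starting from 2011. Not ideal, but it can be helpful.

So, for example, in your case: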

SELECT  payload_pull_request_head_repo_clone_url 
FROM [githubarchive:github.timeline]
WHERE payload_pull_request_base_user_login='outoftime'
GROUP BY payload_pull_request_head_repo_clone_url;

gives, if I'm not mistaken, the list of repos you have sent pull requests to:

https://github.com/jreidthompson/noaa.git
https://github.com/kkrol89/sunspot.git
https://github.com/rterbush/sunspot.git
https://github.com/ottbot/cassandra-cql.git
https://github.com/insoul/cequel.git
https://github.com/mcordell/noaa.git
https://github.com/hackhands/sunspot_rails.git
https://github.com/lgierth/eager_record.git
https://github.com/jnicklas/sunspot.git
https://github.com/klclee/sunspot.git
https://github.com/outoftime/cequel.git

You can use BigQuery here: bigquery.cloud.google.com, and the data schema can be found here: https://github.com/igrigorik/githubarchive.org/blob/master/bigquery/schema.js

【Discussion】:

【Answer 7】:

I wrote a Selenium Python script to do this:

"""
Get all your repos contributed to for the past year.

This uses Selenium and Chrome to login to github as your user, go through 
your contributions page, and grab the repo from each day's contribution page.

Requires python3, selenium, and Chrome with chromedriver installed.

Change the username variable, and run like this:

GITHUB_PASS="mypassword" python3 github_contributions.py
"""

import os
import time
from pprint import pprint as pp
from urllib.parse import urlsplit
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

username = 'jessejoe'
password = os.environ['GITHUB_PASS']

repos = []
driver = webdriver.Chrome()
driver.get('https://github.com/login')

# Fill in and submit the login form
driver.find_element(By.ID, 'login_field').send_keys(username)
password_elem = driver.find_element(By.ID, 'password')
password_elem.send_keys(password)
password_elem.submit()

# Wait indefinitely for 2-factor code
if 'two-factor' in driver.current_url:
    print('2-factor code required, go enter it')
while 'two-factor' in driver.current_url:
    time.sleep(1)

# Open the user's profile page, which shows the contributions calendar
driver.get('https://github.com/{}'.format(username))

# Get all days that aren't colored gray (no contributions)
contrib_days = driver.find_elements(
    By.XPATH, "//*[@class='day' and @fill!='#eeeeee']")

for day in contrib_days:
    day.click()
    # Wait until done loading
    WebDriverWait(driver, 10).until(
        lambda d: 'loading' not in d.find_element(
            By.CSS_SELECTOR, '.contribution-activity').get_attribute('class'))

    # Get all contribution URLs
    contribs = driver.find_elements(
        By.CSS_SELECTOR, '.contribution-activity a')
    for contrib in contribs:
        url = contrib.get_attribute('href')
        # Only care about repo owner and name from URL
        repo_path = urlsplit(url).path
        repo = '/'.join(repo_path.split('/')[0:3])
        if repo not in repos:
            repos.append(repo)
    # Have to click something else to remove pop-up on current day
    driver.find_element(By.CSS_SELECTOR, '.vcard-fullname').click()

driver.quit()
pp(repos)

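It uses Python and Selenium to automate a Chrome browser: it logs in to GitHub, goes to your contributions page, clicks each day, and grabs the repo name from any contributions. Since this page only shows one year's worth of activity, that is all you can get with this script.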

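【Discussion】:

【Answer 8】:

There is a new project that claims to list all contributions:

https://github.com/AurelienLourot/github-contribs

It also backs a service that generates more detailed user profiles: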

https://ghuser.io/

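【Discussion】:

【Answer 9】:

I didn't see any way to do this in the API. The closest thing I could find is to fetch a user's latest 300 public events (unfortunately, 300 is the limit) and then sort through those events for contributions to other repositories.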

https://developer.github.com/v3/activity/events/#list-public-events-performed-by-a-user
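
Sorting the events could look like the sketch below. It assumes the events have already been fetched and decoded into a list of dicts with the `type` and `repo.name` fields the events API returns; which event types count as a "contribution" is a guess made here, not GitHub's rule, and the function name is illustrative.

```python
# Event types treated as contributions in this sketch (an assumption, adjust as needed).
CONTRIBUTION_EVENTS = {
    "PushEvent",
    "PullRequestEvent",
    "IssuesEvent",
    "IssueCommentEvent",
}


def repos_from_events(events, user):
    """Return repos not owned by `user` that appear in contribution-like events."""
    seen = []
    for event in events:
        name = event["repo"]["name"]  # formatted as "owner/repo"
        owner = name.split("/")[0]
        if event["type"] not in CONTRIBUTION_EVENTS:
            continue
        if owner.lower() == user.lower():
            continue  # skip the user's own repositories
        if name not in seen:
            seen.append(name)  # keep first-seen order
    return seen
```

Because only the latest 300 events are available, the result covers recent activity only.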

We need to get GitHub to implement this in their API.

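【Discussion】:

The problem is that "repositories contributed to" on GitHub includes not only repos you have committed to, but also ones where you have opened issues.
@Cupcake opening an issue counts as a contribution on the GitHub user page

【Answer 10】:

You can take a look at https://github.com/casperdcl/cdcl/tree/master/ghstat, which automatically counts the lines of code written across all visible repositories. Extracting the relevant code and tidying it up:

Requires gh from https://github.com/cli/cli; requires jq; requires bash; requires the $GH_USER environment variable to be set; defines a "contributor" as a "committer".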

#!/bin/bash
ghjq() {  # <endpoint> <filter>
  # filter all pages of authenticated requests to https://api.github.com
  gh api --paginate "$1" | jq -r "$2"
}

repos="$(
  ghjq users/$GH_USER/repos .[].full_name
  ghjq "search/issues?q=is:pr+author:$GH_USER+is:merged" \
    '.items[].repository_url | sub(".*github.com/repos/"; "")'
  ghjq users/$GH_USER/subscriptions .[].full_name
  for org in $(ghjq users/$GH_USER/orgs .[].login); do
    ghjq orgs/$org/repos .[].full_name
  done
)"
repos="$(echo "$repos" | sort -u)"
# print repo if user is a contributor
for repo in $repos; do
  if [[ $(ghjq repos/$repo/contributors "[.[].login | test(\"$GH_USER\")] | any") == "true" ]]; then
    echo $repo
  fi
done

【Discussion】:

【Answer 11】:

I'm using Python:

from github import Github  # PyGithub

token='..........................'
# GitHub returns at most 100 results per page
g=Github(token,per_page=100)
repos=g.search_repositories(query="q:example")

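【Discussion】:

A good answer will always include an explanation of why it solves the problem, so that the OP and any future readers can learn from it.

【Answer 12】:

As of now, GitHub API v3 does not provide a way to get a user's current streak.

You can use this to calculate the current streak: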

https://github.com/users/<username>/contributions.json
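
A streak calculation could be sketched as below. The data shape is an assumption, since this answer does not show the response format: the sketch takes a chronologically ordered list of (date, count) pairs, and the function name `current_streak` is made up for illustration.

```python
def current_streak(days):
    """Count the consecutive most recent days with at least one contribution.

    `days` is assumed to be a chronologically ordered list of (date, count)
    pairs, oldest first.
    """
    streak = 0
    for _date, count in reversed(days):
        if count == 0:
            break  # the streak ends at the first day without contributions
        streak += 1
    return streak
```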

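【Discussion】:

This may be a collection of true statements, but it does not answer the question, which has nothing to do with the current streak.
I get a 406 error at that URL. Is it meant to be used for API calls?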
