Stream Process Scraped Audio in GCP

Posted 2023-02-16

【Title】: Stream Process Scraped Audio in GCP 【Posted】: 2019-12-27 18:50:13 【Question】:

I want to scrape multiple audio channels from a website. I want to do the following simultaneously and in real time:

1. Save the audio to GCP Storage. 
2. Apply speech-to-text ML and send transcripts to an app.

I want to focus on (1) in this post. What is the best way to do this in GCP? Is it Pub/Sub? If not, what is the best way to architect it?

I have a working Python script.

Set up the recording function.

import os
import re
import time
import urllib.request

def record(url):
  # Open the stream. urlopen needs a full URL, including the scheme (e.g. 'http://').
  response = urllib.request.urlopen(url)
  block_size = 1024

  # Make a folder named after the station.
  # Example: 'www.music.com/station_1' gets the folder '/station_1/'.
  channel = re.search(r'([^/]+$)', url)[0]
  folder = '/' + channel + '/'
  os.makedirs(folder, exist_ok=True)

  # Run until the stream ends.
  while True:
    # Name each recording after the current date and time.
    filename = folder + time.strftime("%m-%d-%Y--%H-%M-%S") + '.mp3'
    with open(filename, 'wb') as f:
      start = time.time()
      # Start a new file every 60 seconds.
      while time.time() - start < 60:
        buffer = response.read(block_size)
        if not buffer:
          # The server closed the stream; stop instead of spinning on empty reads.
          return
        f.write(buffer)
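
For goal (1), one possible approach is to upload each finished 60-second file to Cloud Storage as soon as it is closed. The sketch below assumes the `google-cloud-storage` client library and default credentials; the helper names `make_bucket` and `upload_segment` are hypothetical and not part of the original script.

```python
import os


def make_bucket(bucket_name):
    """Return a Cloud Storage bucket handle (needs google-cloud-storage and credentials)."""
    # Imported lazily so upload_segment can be exercised without the library installed.
    from google.cloud import storage
    return storage.Client().bucket(bucket_name)


def upload_segment(bucket, local_path, channel):
    """Upload one finished recording as <channel>/<file name> and return the object name."""
    object_name = f"{channel}/{os.path.basename(local_path)}"
    blob = bucket.blob(object_name)
    blob.upload_from_filename(local_path, content_type="audio/mpeg")
    return object_name
```

Inside `record`, a call such as `upload_segment(bucket, filename, channel)` would go right after the file is closed, so every uploaded object is a complete segment. Cloud Storage objects are immutable once written, which is one reason to upload whole segments rather than append to a growing object.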

Declare the URLs to record.

urls = ['www.music.com/station_1',...,'www.music.com/station_n']

Use threads to record from multiple URLs at once.

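# Assumed import (the post does not show it); a thread pool suits this I/O-bound work.
from multiprocessing.pool import ThreadPool as Pool

# One worker per URL. map() blocks for as long as the recorders run.
p = Pool(len(urls))
p.map(record, urls)
p.terminate()
p.join()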
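
Because `record` loops forever, `map()` never returns and the `terminate()` call is not reached. A minimal sketch of an alternative, assuming a hypothetical recorder variant `record_fn(url, stop_event)` that returns once a `threading.Event` is set (the original `record` does not take such an argument):

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def record_all(urls, record_fn, stop_event):
    """Run record_fn(url, stop_event) in one thread per URL; return results in URL order."""
    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as executor:
        futures = [executor.submit(record_fn, url, stop_event) for url in urls]
        # result() re-raises any exception from a recorder instead of hiding it.
        return [future.result() for future in futures]


if __name__ == "__main__":
    stop = threading.Event()

    def fake_record(url, stop_event):
        # Stand-in for a real recorder: wait until asked to stop.
        stop_event.wait()
        return url

    # Ask all recorders to stop after a short delay.
    threading.Timer(0.1, stop.set).start()
    print(record_all(["station_1", "station_2"], fake_record, stop))
```

Setting the event from a signal handler or a timer lets every recorder finish its current file before the program exits.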

【Comments】:

【Answer 1】:

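Beam is not a good fit for this use case.

Explanation:

Assume the channel name is the element.

Your example requires processing a single element indefinitely, which is not something Beam does well.

Even if we define each element as (channel name, timestamp), the problem is not solved, because we cannot pull data from a station for an arbitrary time window.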
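
One way to read this point: elements of the form (channel name, timestamp) only exist after a long-running reader has cut the live stream into pieces, because a pipeline cannot go back and ask the station for an arbitrary window. A minimal sketch of such a reader in plain Python (not Beam); the function name `segments` and its parameters are hypothetical.

```python
import io
import time


def segments(channel, stream, segment_seconds=60, block_size=1024, clock=time.time):
    """Yield (channel, start_time, audio_bytes) tuples until the stream ends."""
    while True:
        start = clock()
        chunks = []
        while clock() - start < segment_seconds:
            block = stream.read(block_size)
            if not block:
                # Stream closed: emit any partial segment, then stop.
                if chunks:
                    yield channel, start, b"".join(chunks)
                return
            chunks.append(block)
        yield channel, start, b"".join(chunks)


if __name__ == "__main__":
    # An in-memory stream stands in for a live HTTP response.
    for name, started, audio in segments("station_1", io.BytesIO(b"\x00" * 4096)):
        print(name, started, len(audio))
```

The reader itself has to stay alive for as long as the station broadcasts, which is the part that does not map onto a per-element transform.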

【Comments】:
