Sheets to BigQuery - set max allowed errors

Posted: 2021-05-06 15:47:33

Question:

I use the script below to upload data from Google Sheets to BigQuery. It works as-is, but how do I set the maximum number of allowed errors? I want to ignore all errors and upload the data no matter how many there are. I have many large sheets, each in a different format.

I can load this data correctly by hand (I just set 100 or 1000 allowed errors). But this script runs with autodetect: true and does not allow any errors. Thanks.

/**
 * Function to run from the UI menu.
 *
 * Uploads the sheets defined in the active sheet into BigQuery.
 */
function runFromUI() {
  // Column indices.
  const SHEET_URL = 1;
  const PROJECT_ID = 2;
  const DATASET_ID = 3;
  const TABLE_ID = 4;
  const APPEND = 5;
  const STATUS = 6;

  // Get the data range rows, skipping the header (first) row.
  let sheet = SpreadsheetApp.getActiveSheet();
  let rows = sheet.getDataRange().getValues().slice(1);

  // Run the sheetToBigQuery function for every row and write the status.
  rows.forEach((row, i) => {
    let status = sheetToBigQuery(
      row[SHEET_URL],
      row[PROJECT_ID],
      row[DATASET_ID],
      row[TABLE_ID],
      row[APPEND],
    );
    sheet.getRange(i + 2, STATUS + 1).setValue(status);
  });
}

/**
 * Uploads a single sheet to BigQuery.
 *
 * @param {string} sheetUrl - The Google Sheet Url containing the data to upload.
 * @param {string} projectId - Google Cloud Project ID.
 * @param {string} datasetId - BigQuery Dataset ID.
 * @param {string} tableId - BigQuery Table ID.
 * @param {bool} append - Appends to BigQuery table if true, otherwise replaces the content.
 *
 * @return {string} status - Returns the status of the job.
 */
function sheetToBigQuery(sheetUrl, projectId, datasetId, tableId, append) {
  try {
    createDatasetIfDoesntExist(projectId, datasetId);
  } catch (e) {
    return `${e}: Please verify your "Project ID" exists and you have permission to edit BigQuery`;
  }

  let sheet;
  try {
    sheet = openSheetByUrl(sheetUrl);
  } catch (e) {
    return `${e}: Please verify the "Sheet URL" is pasted correctly`;
  }

  // Get the values from the sheet's data range as a matrix of values.
  let rows = sheet.getDataRange().getValues();

  // Normalize the headers (first row) to valid BigQuery column names.
  // https://cloud.google.com/bigquery/docs/schemas#column_names
  rows[0] = rows[0].map((header) => {
    header = header.toLowerCase().replace(/[^\w]+/g, '_');
    if (header.match(/^\d/))
      header = '_' + header;
    return header;
  });

  // Create the BigQuery load job config. For more information, see:
  // https://developers.google.com/apps-script/advanced/bigquery
  let loadJob = {
    configuration: {
      load: {
        destinationTable: {
          projectId: projectId,
          datasetId: datasetId,
          tableId: tableId
        },
        autodetect: true,  // Infer schema from contents.
        writeDisposition: append ? 'WRITE_APPEND' : 'WRITE_TRUNCATE',
      }
    }
  };

  // BigQuery load jobs can only load files, so we need to transform our
  // rows (matrix of values) into a blob (file contents as string).
  // For convenience, we convert the rows into a CSV data string.
  // https://cloud.google.com/bigquery/docs/loading-data-local
  let csvRows = rows.map(values =>
      // We use JSON.stringify() to add "quotes to strings",
      // but leave numbers and booleans without quotes.
      // If a string itself contains quotes ("), JSON escapes them with
      // a backslash as \" but the CSV format expects them to be
      // escaped as "", so we replace all the \" with "".
      values.map(value => JSON.stringify(value).replace(/\\"/g, '""'))
  );
  let csvData = csvRows.map(values => values.join(',')).join('\n');
  let blob = Utilities.newBlob(csvData, 'application/octet-stream');

  // Run the BigQuery load job.
  try {
    BigQuery.Jobs.insert(loadJob, projectId, blob);
  } catch (e) {
    return e;
  }

  Logger.log(
    'Load job started. Click here to check your jobs: ' +
    `https://console.cloud.google.com/bigquery?project=${projectId}&page=jobs`
  );

  // The status of a successful run contains the timestamp.
  return `last run: ${Utilities.formatDate(new Date(), SpreadsheetApp.getActive().getSpreadsheetTimeZone(), "yyyy-MM-dd HH:mm")}`;
}

/**
 * Creates a dataset if it doesn't exist, otherwise does nothing.
 *
 * @param {string} projectId - Google Cloud Project ID.
 * @param {string} datasetId - BigQuery Dataset ID.
 */
function createDatasetIfDoesntExist(projectId, datasetId) {
  try {
    BigQuery.Datasets.get(projectId, datasetId);
  } catch (err) {
    let dataset = {
      datasetReference: {
        projectId: projectId,
        datasetId: datasetId,
      },
    };
    BigQuery.Datasets.insert(dataset, projectId);
    Logger.log(`Created dataset: ${projectId}:${datasetId}`);
  }
}

/**
 * Opens the spreadsheet sheet (tab) with the given URL.
 *
 * @param {string} sheetUrl - Google Sheet Url.
 *
 * @returns {Sheet} - The sheet corresponding to the URL.
 *
 * @throws Throws an error if the sheet doesn't exist.
 */
function openSheetByUrl(sheetUrl) {
  // Extract the sheet (tab) ID from the Url.
  let sheetIdMatch = sheetUrl.match(/gid=(\d+)/);
  let sheetId = sheetIdMatch ? sheetIdMatch[1] : null;

  // From the open spreadsheet, get the sheet (tab) that matches the sheetId.
  let spreadsheet = SpreadsheetApp.openByUrl(sheetUrl);
  let sheet = spreadsheet.getSheets().filter(sheet => sheet.getSheetId() == sheetId)[0];
  if (!sheet)
    throw 'Sheet tab ID does not exist';

  return sheet;
}
Comments:

Answer 1:

If you want to set a maximum number of errors, you can use the maxBadRecords parameter in the load configuration. If you want to ignore values that do not match the detected schema, you can set ignoreUnknownValues to true:

  let loadJob = {
    configuration: {
      load: {
        destinationTable: {
          projectId: projectId,
          datasetId: datasetId,
          tableId: tableId
        },
        autodetect: true,  // Infer schema from contents.
        // maxBadRecords: 1000,
        ignoreUnknownValues: true, // use one or the other
        writeDisposition: append ? 'WRITE_APPEND' : 'WRITE_TRUNCATE',
      }
    }
  };
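As a sketch, the configuration above could be produced by a small builder function so the error budget is passed per call. makeLoadJob and its parameter names are illustrative, not part of the BigQuery API or the original script; the nested field names mirror the load configuration shown above:

```javascript
// Hypothetical helper that builds the load-job resource with a
// configurable error budget (maxBadRecords).
function makeLoadJob(projectId, datasetId, tableId, append, maxBadRecords) {
  return {
    configuration: {
      load: {
        destinationTable: { projectId, datasetId, tableId },
        autodetect: true,              // Infer schema from contents.
        maxBadRecords: maxBadRecords,  // Tolerate up to this many bad rows.
        writeDisposition: append ? 'WRITE_APPEND' : 'WRITE_TRUNCATE',
      },
    },
  };
}

const job = makeLoadJob('my-project', 'my_dataset', 'my_table', false, 1000);
console.log(job.configuration.load.writeDisposition); // → WRITE_TRUNCATE
```

The resulting object can be passed to BigQuery.Jobs.insert() in place of the inline literal.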

References:

BigQuery v2 | Job Configuration Load

Comments:

ignoreUnknownValues didn't help, but maxBadRecords did! Thanks so much!
