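Joining two tables when the file name carries an extra string: use a regex to strip the string from the file name, then join
【Original title】: Join on two table, file_name having extra string, regex to remove string from filename and do the join
【Posted】: 2020-10-21 03:52:54
【Question】: I have two tables that I need to join on table_name and file_name respectively. The problem is that table_name contains extra strings compared with file_name in table 2.
Using a regular expression, how can I remove the extra strings from table_name so that it can be joined with file_name in table 2?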
TABLE 1:
table_name audit_record_count
Immunology_COVID-19_Treatment_202006221630_01.csv 1260124
Immunology_COVID-19_Trial_Design_202006221630_01.csv 2173762
Immunology_COVID-19_Planned_Treatment_202006221630_01.csv 1350135
Immunology_COVID-19_Patient_Characteristic_202006221630_01.csv 2173762
Immunology_COVID-19_Intervention_Type_202006221630_01.csv 2173762
Immunology_COVID-19_Arm_202006221630_01.csv 4
Immunology_COVID-19_Actual_Treatment_202006221630_01.csv 2173762
Immunology_COVID-19_Publication_202006221630_01.csv 2173762
Immunology_COVID-19_Outcome_202006221630_01.csv 2173762
Immunology_COVID-19_Intervention_Type_Factor_202006221630_01.csv 2173762
Immunology_COVID-19_Inclusion_Criteria_202006221630_01.csv 2173762
Immunology_COVID-19_Curation_202006221630_01.csv 2173762
TABLE 2:
file_name csv_record_count
Treatment 1260124
Trial_Design 2173762
Planned_Treatment 1350135
Patient_Characteristic 2173762
Intervention_Type 2173762
Arm 4
Actual_Treatment 2173762
Publication 2173762
Outcome 2173762
Intervention_Type_Factor 2173762
Inclusion_Criteria 2173762
Curation 2173762
What I tried:
from pyspark.sql.functions import col

# spark, config and schema_validation_df are defined elsewhere in the job
audit_file_df = spark.read.csv(
    f"s3://{config['raw_bucket']}/{config['landing_directory']}/{config['audit_file']}/watermark_timestamp*.csv",
    header=False, inferSchema=True) \
    .withColumnRenamed("_c0", "table_name").withColumnRenamed("_c1", "audit_record_count") \
    .selectExpr("regexp_extract(table_name, '^(.(?!(\\\\d{12}_\\\\d{2,4}.csv|\\\\d{12}.csv)))*', 0) AS table_name", 'audit_record_count')
print("audit_file_df :", audit_file_df)
audit_file_df.show()
# Inner join on the cleaned table_name, then flag rows whose counts agree
validation_df = audit_file_df.join(
    schema_validation_df, how='inner',
    on=audit_file_df['table_name'] == schema_validation_df['file_name']) \
    .withColumn("count_match", col('audit_record_count') == col('csv_record_count'))
print("Record validation result")
validation_df.show()
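I was able to remove the timestamp from table_name, but I could not extract the file_name part that the join condition needs.
Addendum
The Immunology_COVID-19 prefix is not fixed and may change for other files. table_name has the format: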
TA_Indication_data_timestamp_nn.csv
【Question comments】:
【Answer 1】: Create an additional column in table 1 that contains the data part:
from pyspark.sql import functions as F
df = df.withColumn('data', F.regexp_extract(F.col('table_name'), r'.*?_.*?_(.*)_\d{12}_\d{2}\.csv', 1))
This gives:
+----------------------------------------------------------------+---------+------------------------+
|table_name |audit_rec|data |
+----------------------------------------------------------------+---------+------------------------+
|Immunology_COVID-19_Treatment_202006221630_01.csv |1260124 |Treatment |
|Immunology_COVID-19_Trial_Design_202006221630_01.csv |2173762 |Trial_Design |
|Immunology_COVID-19_Planned_Treatment_202006221630_01.csv |1350135 |Planned_Treatment |
|Immunology_COVID-19_Patient_Characteristic_202006221630_01.csv |2173762 |Patient_Characteristic |
|Immunology_COVID-19_Intervention_Type_202006221630_01.csv |2173762 |Intervention_Type |
|Immunology_COVID-19_Arm_202006221630_01.csv |4 |Arm |
|Immunology_COVID-19_Actual_Treatment_202006221630_01.csv |2173762 |Actual_Treatment |
|Immunology_COVID-19_Publication_202006221630_01.csv |2173762 |Publication |
|Immunology_COVID-19_Outcome_202006221630_01.csv |2173762 |Outcome |
|Immunology_COVID-19_Intervention_Type_Factor_202006221630_01.csv|2173762 |Intervention_Type_Factor|
|Immunology_COVID-19_Inclusion_Criteria_202006221630_01.csv |2173762 |Inclusion_Criteria |
|Immunology_COVID-19_Curation_202006221630_01.csv |2173762 |Curation |
+----------------------------------------------------------------+---------+------------------------+
You can then join the tables on table1.data and table2.file_name and continue with the audit check already given in your question.
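The join and the count comparison described above could be sketched as follows. This is a minimal sketch, not the answerer's code: the function name build_validation_df and the DATA_PATTERN constant are hypothetical, and it assumes the two DataFrames carry the column names used in the question (table_name, audit_record_count, file_name, csv_record_count).

```python
import re

# Pattern from the answer: two non-greedy groups skip the TA and
# Indication parts, the capture group keeps the data part.
DATA_PATTERN = r'.*?_.*?_(.*)_\d{12}_\d{2}\.csv'


def build_validation_df(audit_df, schema_df):
    """Join the audit table to the schema table on the extracted data part.

    Hypothetical helper: audit_df is table 1, schema_df is table 2.
    """
    # Imported lazily so DATA_PATTERN can be checked without Spark installed.
    from pyspark.sql import functions as F

    # Add the join key extracted from table_name.
    audit_with_key = audit_df.withColumn(
        'data', F.regexp_extract(F.col('table_name'), DATA_PATTERN, 1))

    # Inner join on the extracted key, then flag rows whose counts agree.
    joined = audit_with_key.join(
        schema_df,
        on=audit_with_key['data'] == schema_df['file_name'],
        how='inner')
    return joined.withColumn(
        'count_match',
        F.col('audit_record_count') == F.col('csv_record_count'))


if __name__ == '__main__':
    # Quick check of the pattern with Python's re module, no Spark required.
    sample = 'Immunology_COVID-19_Trial_Design_202006221630_01.csv'
    print(re.match(DATA_PATTERN, sample).group(1))
```

With the question's variables, the call would be build_validation_df(audit_file_df, schema_validation_df), provided audit_file_df still holds the unmodified table_name values.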
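The tricky part of the regex is the use of non-greedy quantifiers for the first two parts, because the data part can itself contain underscore characters.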
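To see why the non-greedy quantifiers matter, the pattern can be compared with a greedy variant using Python's re module, which treats this particular pattern the same way as the Java regex engine behind Spark. This is an illustrative sketch: the greedy variant and the helper name extract are not from the answer, and both patterns assume that the TA and Indication parts contain no underscores.

```python
import re

# Pattern from the answer: the first two groups are non-greedy.
NON_GREEDY = re.compile(r'.*?_.*?_(.*)_\d{12}_\d{2}\.csv')
# Hypothetical greedy variant, shown only for comparison.
GREEDY = re.compile(r'.*_.*_(.*)_\d{12}_\d{2}\.csv')


def extract(pattern, table_name):
    """Return capture group 1, or '' when there is no match
    (mirroring how Spark's regexp_extract reports a non-match)."""
    match = pattern.match(table_name)
    return match.group(1) if match else ''


names = [
    'Immunology_COVID-19_Arm_202006221630_01.csv',
    'Immunology_COVID-19_Trial_Design_202006221630_01.csv',
    'Immunology_COVID-19_Intervention_Type_Factor_202006221630_01.csv',
]

for name in names:
    print(extract(NON_GREEDY, name), '|', extract(GREEDY, name))
# Arm | Arm
# Trial_Design | Design
# Intervention_Type_Factor | Factor
```

The greedy groups swallow as many underscore-separated pieces as they can, so only the last word of a multi-word data part survives in the capture group. The non-greedy groups stop at the first underscore each, leaving the whole data part for the capture group.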
【Comments】:
Can you help me with the following question: ***.com/questions/63550188/…