CSV to columns, joined with row-based data, analyzed and output: can this be done efficiently?

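I have a complex SQL Server problem that I've been trying to solve, but I'm stuck and I'm hoping to get some help!

I have two tables of data, stored in different formats, that I need to combine into a specified output. To make matters worse, one of the tables stores some key data as comma-separated values (I know this is not how data should be stored. Have mercy, I didn't design these tables!).

Students table: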

| id |              oldSkill |                             newSkill |
+----+-----------------------+--------------------------------------+
|  1 |                  Word |                Excel,PowerPoint,Word |
|  2 | Excel,PowerPoint,Word |        Excel,Outlook,PowerPoint,Word |
|  3 |       PowerPoint,Word |                Excel,PowerPoint,Word |
|  4 |          Access,Excel | Access,Excel,Outlook,PowerPoint,Word |
|  5 |          Outlook,Word |        Excel,Outlook,PowerPoint,Word |

Skills table:

| id |      skill | assignment |
+----+------------+------------+
|  1 |       Word |          B |
|  1 |       Word |          P |
|  2 |      Excel |          P |
|  2 | PowerPoint |          B |
|  2 | PowerPoint |          P |
|  2 |       Word |          P |
|  3 | PowerPoint |          P |
|  3 |       Word |          P |
|  4 |     Access |          B |
|  4 |      Excel |          B |
|  4 |     Access |          P |
|  4 |      Excel |          P |
|  5 |    Outlook |          P |
|  5 |       Word |          B |

Here is the output I've been asked to produce:

| id | skill_1 | skill_1_primary | skill_1_backup |    skill_2 | skill_2_primary | skill_2_backup |    skill_3 | skill_3_primary | skill_3_backup |    skill_4 | skill_4_primary | skill_4_backup | skill_5 | skill_5_primary | skill_5_backup |
|----|---------|-----------------|----------------|------------|-----------------|----------------|------------|-----------------|----------------|------------|-----------------|----------------|---------|-----------------|----------------|
|  1 |   Excel |               Y |         (null) | PowerPoint |               Y |         (null) |       Word |               Y |              Y |     (null) |          (null) |         (null) |  (null) |          (null) |         (null) |
|  2 |   Excel |               Y |         (null) |    Outlook |               Y |         (null) | PowerPoint |               Y |              Y |       Word |               Y |         (null) |  (null) |          (null) |         (null) |
|  3 |   Excel |               Y |         (null) | PowerPoint |               Y |         (null) |       Word |               Y |         (null) |     (null) |          (null) |         (null) |  (null) |          (null) |         (null) |
|  4 |  Access |               Y |              Y |      Excel |               Y |              Y |    Outlook |               Y |         (null) | PowerPoint |               Y |         (null) |    Word |               Y |         (null) |
|  5 |   Excel |               Y |         (null) |    Outlook |               Y |         (null) | PowerPoint |               Y |         (null) |       Word |          (null) |              Y |  (null) |          (null) |         (null) |

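To break it down, I need to:

  • Output all of the items in the newSkill column of the Students table. These values need to be split into separate columns, each with corresponding flags indicating whether the skill is a primary or a backup skill. Note that the newSkill column contains the oldSkill values
  • If the skill is old, get the flag values from the Skills table, where P is primary and B is backup
  • If the skill is new, just flag Primary with a 'Y' value

I've been looking at this from different angles (CTEs, pivots, cursors, etc.), and I've had success using a UDF to split the CSV column values apart, but getting the data from the rows of the Skills table and combining it with the Students data in the format they want is eluding me.

I've also set up a SQL Fiddle with my test data for this post: http://sqlfiddle.com/#!6/e8d5a/1/0

Thanks in advance for any help or guidance... SQL is not one of my strongest skills. I could do this much more easily in another language, but I've been asked to build it as a stored procedure. =P

Update: Based on suggestions posted in the comments, I've gotten a lot of the work done. I just need help with the final output. I think it can be done using a pivot with dynamic SQL, but how to pivot and aggregate the three skill-related columns and number them in the specified way is eluding me.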

-- this pivots the skills table into a single row for each skill
select *
into #skillPiv
from 
(
  select id, skill, assignment,
    'assignment_'+cast(row_number() over(partition by id, skill order by skill) as varchar(10)) rn
  from skills
) d
pivot
(
  max(assignment)
  for rn in ([assignment_1], [assignment_2])
) piv
order by id;


-- this converts the student's oldSkills from CSV into rows and looks up the corresponding skill assignments in the #skillPiv table
with st(id, skill, oldSkill) as (
select id, LEFT(CAST(oldSkill as varchar(max)), CHARINDEX(',',oldSkill+',')-1),
    STUFF(CAST(oldSkill as varchar(max)), 1, CHARINDEX(',',oldSkill+','), '')
from students
union all
select id, LEFT(CAST(oldSkill as varchar(max)), CHARINDEX(',',oldSkill+',')-1),
    STUFF(CAST(oldSkill as varchar(max)), 1, CHARINDEX(',',oldSkill+','), '')
from st
where oldSkill > ''
)
select st.id
    ,st.skill
    ,CASE WHEN sp.assignment_1 = 'P' OR sp.assignment_2 = 'P'
        THEN 'Y'
        ELSE ''
        END AS [primary]
    ,CASE WHEN sp.assignment_1 = 'B' OR sp.assignment_2 = 'B'
        THEN 'Y'
        ELSE ''
        END AS [backup]
into #oldSkills
from st
inner join #skillPiv sp on st.id = sp.id and st.skill = sp.skill
order by id;


-- convert the newSkills column from CSV to rows and insert our default skill assignment values
with tmp(id, skill, newSkill) as (
select id, LEFT(CAST(newSkill as varchar(max)), CHARINDEX(',',newSkill+',')-1),
    STUFF(CAST(newSkill as varchar(max)), 1, CHARINDEX(',',newSkill+','), '')
from students
union all
select id, LEFT(CAST(newSkill as varchar(max)), CHARINDEX(',',newSkill+',')-1),
    STUFF(CAST(newSkill as varchar(max)), 1, CHARINDEX(',',newSkill+','), '')
from tmp
where newSkill > ''
)
select id
    ,skill
    ,'Y' as [primary]
    ,'' as [backup]
into #newSkills
from tmp
where skill NOT IN (
    select skill from #oldSkills where id = tmp.id
    )
order by id;


-- now combine #oldSkills and #newSkills into one table that has all the values we need
select *
into #studentSkills
from (
    select * from #newSkills
    UNION
    select * from #oldSkills
) as ss;

select * from #studentSkills;

Example on RexTester

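I had trouble getting temp tables to work on SQL Fiddle, so I moved my test code to RexTester.

In my actual code, I use DelimitedSplit8K to parse the CSV values in the Students table.

The temp-table script above produces this final table: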
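
As an aside on the splitting step: where a newer engine is available, the built-in STRING_SPLIT function may be able to stand in for a custom splitter. The sketch below is only an illustration under that assumption. Its third argument (enable_ordinal) requires SQL Server 2022 or Azure SQL Database, so it would not run on the SQL Server version behind the fiddle linked above. The table variable and the column aliases skillOrder and skill are made up for the example.

```sql
-- Assumption: SQL Server 2022+ or Azure SQL Database (enable_ordinal is not available earlier).
DECLARE @students TABLE(id INT, oldSkill VARCHAR(100), newSkill VARCHAR(100));
INSERT INTO @students VALUES
 (1,'Word','Excel,PowerPoint,Word')
,(5,'Outlook,Word','Excel,Outlook,PowerPoint,Word');

-- One output row per CSV item; ordinal is the 1-based position of the item in the list.
SELECT s.id
      ,sp.ordinal AS skillOrder
      ,sp.value   AS skill
FROM @students AS s
CROSS APPLY STRING_SPLIT(s.newSkill, ',', 1) AS sp
ORDER BY s.id, sp.ordinal;
```

The ordinal column gives a documented position for each item, which is what the skill_1 ... skill_5 numbering needs.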

| id |      skill | primary | backup |
|----|------------|---------|--------|
|  1 |      Excel |       Y | (null) |
|  1 | PowerPoint |       Y | (null) |
|  1 |       Word |       Y |      Y |
|  2 |      Excel |       Y | (null) |
|  2 |    Outlook |       Y | (null) |
|  2 | PowerPoint |       Y |      Y |
|  2 |       Word |       Y | (null) |
|  3 |      Excel |       Y | (null) |
|  3 | PowerPoint |       Y | (null) |
|  3 |       Word |       Y | (null) |
|  4 |     Access |       Y |      Y |
|  4 |      Excel |       Y |      Y |
|  4 |    Outlook |       Y | (null) |
|  4 | PowerPoint |       Y | (null) |
|  4 |       Word |       Y | (null) |
|  5 |      Excel |       Y | (null) |
|  5 |    Outlook |       Y | (null) |
|  5 | PowerPoint |       Y | (null) |
|  5 |       Word |  (null) |      Y |

Now I just need to pivot it into the desired output:

| id | skill_1 | skill_1_primary | skill_1_backup |    skill_2 | skill_2_primary | skill_2_backup |    skill_3 | skill_3_primary | skill_3_backup |    skill_4 | skill_4_primary | skill_4_backup | skill_5 | skill_5_primary | skill_5_backup |
|----|---------|-----------------|----------------|------------|-----------------|----------------|------------|-----------------|----------------|------------|-----------------|----------------|---------|-----------------|----------------|
|  1 |   Excel |               Y |         (null) | PowerPoint |               Y |         (null) |       Word |               Y |              Y |     (null) |          (null) |         (null) |  (null) |          (null) |         (null) |
|  2 |   Excel |               Y |         (null) |    Outlook |               Y |         (null) | PowerPoint |               Y |              Y |       Word |               Y |         (null) |  (null) |          (null) |         (null) |
|  3 |   Excel |               Y |         (null) | PowerPoint |               Y |         (null) |       Word |               Y |         (null) |     (null) |          (null) |         (null) |  (null) |          (null) |         (null) |
|  4 |  Access |               Y |              Y |      Excel |               Y |              Y |    Outlook |               Y |         (null) | PowerPoint |               Y |         (null) |    Word |               Y |         (null) |
|  5 |   Excel |               Y |         (null) |    Outlook |               Y |         (null) | PowerPoint |               Y |         (null) |       Word |          (null) |              Y |  (null) |          (null) |         (null) |

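I appreciate any help. Thanks!

Answer

This design is really, really bad :-D

However, if you have to stick with it, you can try this:

Note: I rely on your statement

    "Note that the newSkill column contains the oldSkill values"

I take this to mean "there is no old skill that is not included in the new skills!"

The solution is fully inline and set-based: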

DECLARE @students TABLE(id INT,oldSkill VARCHAR(100),newSkill VARCHAR(100));
INSERT INTO @students VALUES
 (1,'Word','Excel,PowerPoint,Word')
,(2,'Excel,PowerPoint,Word','Excel,Outlook,PowerPoint,Word')
,(3,'PowerPoint,Word','Excel,PowerPoint,Word')
,(4,'Access,Excel','Access,Excel,Outlook,PowerPoint,Word')
,(5,'Outlook,Word','Excel,Outlook,PowerPoint,Word');

DECLARE @skills TABLE(id INT, skill VARCHAR(100),assignment VARCHAR(1));
INSERT INTO @skills VALUES
 (1,'Word','B')
,(1,'Word','P')
,(2,'Excel','P')
,(2,'PowerPoint','B')
,(2,'PowerPoint','P')
,(2,'Word','P')
,(3,'PowerPoint','P')
,(3,'Word','P')
,(4,'Access','B')
,(4,'Excel','B')
,(4,'Access','P')
,(4,'Excel','P')
,(5,'Outlook','P')
,(5,'Word','B');

-- The first CTE uses an XML trick to split the comma-separated values

WITH Step1 AS
(
    SELECT id
          ,A.*     
    FROM @students AS s
    OUTER APPLY(
                 SELECT CAST('<x>' + REPLACE(s.oldSkill,',','</x><x>') + '</x>' AS XML) AS OldSkillXml
                       ,CAST('<x>' + REPLACE(s.newSkill,',','</x><x>') + '</x>' AS XML) AS NewSkillXml
                ) AS A
)

-- The second CTE gets the list of old skills together with their flags

,OldSkills AS
(
    SELECT ROW_NUMBER() OVER(PARTITION BY Step1.id ORDER BY (SELECT NULL)) AS OldSkillOrder
          ,Step1.id
          ,os.value('text()[1]','varchar(100)') AS Skill
          ,CASE WHEN (SELECT assignment FROM @skills AS s WHERE s.id=Step1.id AND s.skill=os.value('text()[1]','varchar(100)') AND s.assignment='P') IS NOT NULL THEN 'Y' END AS IsPrimary
          ,CASE WHEN (SELECT assignment FROM @skills AS s WHERE s.id=Step1.id AND s.skill=os.value('text()[1]','varchar(100)') AND s.assignment='B') IS NOT NULL THEN 'Y' END AS IsBackup
    FROM Step1 
    OUTER APPLY Step1.OldSkillXml.nodes('x') AS A(os)
)

-- This CTE gets the list of new skills, all flagged with "IsPrimary = 'Y'"

,NewSkills AS
(
    SELECT ROW_NUMBER() OVER(PARTITION BY Step1.id ORDER BY (SELECT NULL)) AS NewSkillOrder
          ,Step1.id
          ,ns.value('text()[1]','varchar(100)') AS Skill
          ,'Y' AS IsPrimary
          ,NULL AS IsBackup
    FROM Step1 
    OUTER APPLY Step1.NewSkillXml.nodes('x') AS A(ns)
)

-- The intermediate list is your result before the pivot

,IntermediateList AS
(
    SELECT ns.id
          ,ns.Skill
          ,ns.IsPrimary
          ,os.IsBackup
          ,ns.NewSkillOrder
    FROM NewSkills AS ns
    FULL OUTER JOIN OldSkills AS os ON os.id=ns.id AND os.Skill=ns.Skill 
)

-- Here I use "conditional aggregation" (an old-fashioned pivot), which is a great way to do a PIVOT over more than one column

SELECT id

      ,MAX(CASE WHEN NewSkillOrder = 1 THEN Skill END) AS skill_1
      ,MAX(CASE WHEN NewSkillOrder = 1 THEN IsPrimary END) AS skill_1_primary
      ,MAX(CASE WHEN NewSkillOrder = 1 THEN IsBackup END) AS skill_1_backup

      ,MAX(CASE WHEN NewSkillOrder = 2 THEN Skill END) AS skill_2
      ,MAX(CASE WHEN NewSkillOrder = 2 THEN IsPrimary END) AS skill_2_primary
      ,MAX(CASE WHEN NewSkillOrder = 2 THEN IsBackup END) AS skill_2_backup

      ,MAX(CASE WHEN NewSkillOrder = 3 THEN Skill END) AS skill_3
      ,MAX(CASE WHEN NewSkillOrder = 3 THEN IsPrimary END) AS skill_3_primary
      ,MAX(CASE WHEN NewSkillOrder = 3 THEN IsBackup END) AS skill_3_backup

      ,MAX(CASE WHEN NewSkillOrder = 4 THEN Skill END) AS skill_4
      ,MAX(CASE WHEN NewSkillOrder = 4 THEN IsPrimary END) AS skill_4_primary
      ,MAX(CASE WHEN NewSkillOrder = 4 THEN IsBackup END) AS skill_4_backup

      ,MAX(CASE WHEN NewSkillOrder = 5 THEN Skill END) AS skill_5
      ,MAX(CASE WHEN NewSkillOrder = 5 THEN IsPrimary END) AS skill_5_primary
      ,MAX(CASE WHEN NewSkillOrder = 5 THEN IsBackup END) AS skill_5_backup
FROM IntermediateList AS il
GROUP BY id; 

Result

+----+---------+-----------------+----------------+------------+-----------------+----------------+------------+-----------------+----------------+------------+-----------------+----------------+---------+-----------------+----------------+
| id | skill_1 | skill_1_primary | skill_1_backup | skill_2    | skill_2_primary | skill_2_backup | skill_3    | skill_3_primary | skill_3_backup | skill_4    | skill_4_primary | skill_4_backup | skill_5 | skill_5_primary | skill_5_backup |
+----+---------+-----------------+----------------+------------+-----------------+----------------+------------+-----------------+----------------+------------+-----------------+----------------+---------+-----------------+----------------+
| 1  | Excel   | Y               | NULL           | PowerPoint | Y               | NULL           | Word       | Y               | Y              | NULL       | NULL            | NULL           | NULL    | NULL            | NULL           |
+----+---------+-----------------+----------------+------------+-----------------+----------------+------------+-----------------+----------------+------------+-----------------+----------------+---------+-----------------+----------------+
| 2  | Excel   | Y               | NULL           | Outlook    | Y               | NULL           | PowerPoint | Y               | Y              | Word       | Y               | NULL           | NULL    | NULL            | NULL           |
+----+---------+-----------------+----------------+------------+-----------------+----------------+------------+-----------------+----------------+------------+-----------------+----------------+---------+-----------------+----------------+
| 3  | Excel   | Y               | NULL           | PowerPoint | Y               | NULL           | Word       | Y               | NULL           | NULL       | NULL            | NULL           | NULL    | NULL            | NULL           |
+----+---------+-----------------+----------------+------------+-----------------+----------------+------------+-----------------+----------------+------------+-----------------+----------------+---------+-----------------+----------------+
| 4  | Access  | Y               | Y              | Excel      | Y               | Y              | Outlook    | Y               | NULL           | PowerPoint | Y               | NULL           | Word    | Y               | NULL           |
+----+---------+-----------------+----------------+------------+-----------------+----------------+------------+-----------------+----------------+------------+-----------------+----------------+---------+-----------------+----------------+
| 5  | Excel   | Y               | NULL           | Outlook    | Y               | NULL           | PowerPoint | Y               | NULL           | Word       | Y               | Y              | NULL    | NULL            | NULL           |
+----+---------+-----------------+----------------+------------+-----------------+----------------+------------+-----------------+----------------+------------+-----------------+----------------+---------+-----------------+----------------+
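
The question also asks about dynamic SQL. If the maximum number of skills per student is not fixed at five, the same conditional-aggregation columns could be generated in a loop instead of being typed out. The following is a minimal sketch under assumptions, not a drop-in procedure: #studentSkillsDemo is a hypothetical stand-in for the one-row-per-skill table from the question, filled with a few sample rows, and the skills are numbered alphabetically, which matches the expected output only because the CSV lists there are already sorted.

```sql
-- Hypothetical stand-in for the one-row-per-skill table built in the question.
CREATE TABLE #studentSkillsDemo(id INT, skill VARCHAR(100), [primary] VARCHAR(1), [backup] VARCHAR(1));
INSERT INTO #studentSkillsDemo VALUES
 (1,'Excel','Y',NULL),(1,'PowerPoint','Y',NULL),(1,'Word','Y','Y')
,(5,'Excel','Y',NULL),(5,'Outlook','Y',NULL),(5,'PowerPoint','Y',NULL),(5,'Word',NULL,'Y');

-- Largest number of skills any one student has = number of column groups to generate.
DECLARE @maxSkills INT =
    (SELECT MAX(cnt) FROM (SELECT COUNT(*) AS cnt FROM #studentSkillsDemo GROUP BY id) AS c);
DECLARE @cols NVARCHAR(MAX) = N'';
DECLARE @i INT = 1;
DECLARE @n NVARCHAR(10);

-- Build one group of three conditional-aggregation columns per skill position.
-- Only integers are concatenated into the statement, never user-supplied text.
WHILE @i <= @maxSkills
BEGIN
    SET @n = CAST(@i AS NVARCHAR(10));
    SET @cols += N',MAX(CASE WHEN rn = ' + @n + N' THEN skill END) AS skill_' + @n
              +  N',MAX(CASE WHEN rn = ' + @n + N' THEN [primary] END) AS skill_' + @n + N'_primary'
              +  N',MAX(CASE WHEN rn = ' + @n + N' THEN [backup] END) AS skill_' + @n + N'_backup';
    SET @i += 1;
END;

DECLARE @sql NVARCHAR(MAX) =
    N'SELECT id' + @cols + N'
      FROM (SELECT *, ROW_NUMBER() OVER(PARTITION BY id ORDER BY skill) AS rn
            FROM #studentSkillsDemo) AS s
      GROUP BY id
      ORDER BY id;';

-- The temp table created above is visible inside the dynamic batch.
EXEC sys.sp_executesql @sql;

DROP TABLE #studentSkillsDemo;
```

A temp table is used here instead of a table variable because a table variable declared in the outer batch would not be visible inside the dynamically executed statement.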

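Note: there is one difference between this result and your expected output. Your student 5 has NULL / Y for the skill "Word". I don't understand why this skill, given that it is included in the "new skills", shouldn't be "primary".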
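
If the rule from the question is read literally (an old skill takes both flags from the Skills table, and only a genuinely new skill defaults to primary), the primary flag would have to depend on whether the skill appears in oldSkill. The sketch below shows one possible way to express that reading. It is a variation for illustration, not part of the solution above, and it is reduced to student 5; the CTE name Parsed and the alias k are made up for the example.

```sql
DECLARE @students TABLE(id INT, oldSkill VARCHAR(100), newSkill VARCHAR(100));
INSERT INTO @students VALUES (5,'Outlook,Word','Excel,Outlook,PowerPoint,Word');

DECLARE @skills TABLE(id INT, skill VARCHAR(100), assignment VARCHAR(1));
INSERT INTO @skills VALUES (5,'Outlook','P'),(5,'Word','B');

WITH Parsed AS
(
    -- Same XML trick as above: turn each CSV list into <x>item</x> elements.
    SELECT s.id
          ,CAST('<x>' + REPLACE(s.oldSkill,',','</x><x>') + '</x>' AS XML) AS OldSkillXml
          ,CAST('<x>' + REPLACE(s.newSkill,',','</x><x>') + '</x>' AS XML) AS NewSkillXml
    FROM @students AS s
)
,OldSkills AS
(
    SELECT Parsed.id
          ,A.os.value('text()[1]','varchar(100)') AS Skill
    FROM Parsed
    CROSS APPLY Parsed.OldSkillXml.nodes('x') AS A(os)
)
,NewSkills AS
(
    SELECT Parsed.id
          ,A.ns.value('text()[1]','varchar(100)') AS Skill
    FROM Parsed
    CROSS APPLY Parsed.NewSkillXml.nodes('x') AS A(ns)
)
SELECT ns.id
      ,ns.Skill
      -- New skill (no match in oldSkill): primary by default.
      -- Old skill: primary only if the Skills table has a 'P' row for it.
      ,CASE WHEN os.Skill IS NULL THEN 'Y'
            WHEN EXISTS(SELECT 1 FROM @skills AS k
                        WHERE k.id=ns.id AND k.skill=ns.Skill AND k.assignment='P') THEN 'Y'
       END AS IsPrimary
      -- Backup only applies to old skills that have a 'B' row.
      ,CASE WHEN os.Skill IS NOT NULL
             AND EXISTS(SELECT 1 FROM @skills AS k
                        WHERE k.id=ns.id AND k.skill=ns.Skill AND k.assignment='B') THEN 'Y'
       END AS IsBackup
FROM NewSkills AS ns
LEFT JOIN OldSkills AS os ON os.id=ns.id AND os.Skill=ns.Skill
ORDER BY ns.id, ns.Skill;
```

With this reading, "Word" for student 5 should come out as NULL / Y, in line with the expected output in the question, because the Skills table holds only a 'B' row for it.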
