How can Spark join MySQL tables according to field priority?


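Reference answer A: to join MySQL tables in Spark according to field priority, you can try the following.
(sc)sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@6cd1ee
scala> val url="jdbc:mysql://slave02:3306/testdb?use..
Database work often involves joining two or more tables and filtering the data by some conditions. Following a few principles when joining tables makes the queries much more efficient.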

MySQL single-table queries

I. Single-table query syntax and keyword execution order

1.1 Single-table query syntax

SELECT DISTINCT field1,field2... FROM table_name
                              WHERE condition
                              GROUP BY field
                              HAVING filter
                              ORDER BY field
                              LIMIT row_count

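1.2 Keyword execution order

  1. from: locate the table
  2. where: apply the where conditions and fetch the matching records one by one from the file/table
  3. group by: group the fetched records; without a group by, all records form a single group
  4. having: filter the groups produced by group by
  5. select: evaluate the select list
  6. distinct: remove duplicates
  7. order by: sort the result by the given conditions
  8. limit: cap the number of rows shown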
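The execution order can be observed in a small runnable sketch. It assumes Python's built-in sqlite3 module as a stand-in for MySQL (so no server is needed) and a cut-down, three-column version of the employee table; both are illustrative choices, not part of the original setup.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_name TEXT, post TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        ("sean", "teacher", 8300),
        ("tank", "teacher", 3500),
        ("mac", "teacher", 9000),
        ("oscar", "teacher", 2100),
        ("张野", "operation", 10000.13),
        ("程咬金", "operation", 20000),
        ("丫丫", "sale", 2000.35),
        ("格格", "sale", 4000.33),
    ],
)

query = """
    SELECT post, COUNT(*) AS headcount   -- builds the output columns from each group
    FROM employee                        -- locates the table first
    WHERE salary > 3000                  -- drops individual rows before grouping
    GROUP BY post                        -- groups the rows that survived WHERE
    HAVING COUNT(*) >= 2                 -- drops whole groups after grouping
    ORDER BY headcount DESC              -- sorts the remaining groups
    LIMIT 1                              -- keeps only the first sorted row
"""
rows = conn.execute(query).fetchall()
print(rows)
```

oscar and 丫丫 are removed by WHERE before any counting happens, the sale group (one remaining row) is removed by HAVING, and LIMIT keeps only the largest group that is left.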

II. Simple queries

2.1 Creating the table and preparing the data

company.employee
    Employee id       id                  int
    Name              emp_name            varchar
    Sex               sex                 enum
    Age               age                 int
    Hire date         hire_date           date
    Post              post                varchar
    Post description  post_comment        varchar
    Salary            salary              double
    Office            office              int
    Department id     depart_id           int



# Create the table
create table employee(
id int not null unique auto_increment,
emp_name varchar(20) not null,
sex enum('male','female') not null default 'male',  #  most employees are male
age int(3) unsigned not null default 28,
hire_date date not null,
post varchar(50),
post_comment varchar(100),
salary double(15,2),
office int,  #  one office per department
depart_id int
);


# Inspect the table structure
mysql> desc employee;
+--------------+-----------------------+------+-----+---------+----------------+
| Field        | Type                  | Null | Key | Default | Extra          |
+--------------+-----------------------+------+-----+---------+----------------+
| id           | int(11)               | NO   | PRI | NULL    | auto_increment |
| emp_name     | varchar(20)           | NO   |     | NULL    |                |
| sex          | enum('male','female') | NO   |     | male    |                |
| age          | int(3) unsigned       | NO   |     | 28      |                |
| hire_date    | date                  | NO   |     | NULL    |                |
| post         | varchar(50)           | YES  |     | NULL    |                |
| post_comment | varchar(100)          | YES  |     | NULL    |                |
| salary       | double(15,2)          | YES  |     | NULL    |                |
| office       | int(11)               | YES  |     | NULL    |                |
| depart_id    | int(11)               | YES  |     | NULL    |                |
+--------------+-----------------------+------+-----+---------+----------------+

# Insert records
# Three departments: teaching, sales, operations
insert into employee(emp_name,sex,age,hire_date,post,salary,office,depart_id) values
('nick','male',18,'20170301','老男孩驻上海虹桥最帅',7300.33,401,1),  #  teaching department below
('jason','male',78,'20150302','teacher',1000000.31,401,1),
('sean','male',81,'20130305','teacher',8300,401,1),
('tank','male',73,'20140701','teacher',3500,401,1),
('oscar','male',28,'20121101','teacher',2100,401,1),
('mac','female',18,'20110211','teacher',9000,401,1),
('rocky','male',18,'19000301','teacher',30000,401,1),
('成龙','male',48,'20101111','teacher',10000,401,1),

('歪歪','female',48,'20150311','sale',3000.13,402,2),   #  sales department below
('丫丫','female',38,'20101101','sale',2000.35,402,2),
('丁丁','female',18,'20110312','sale',1000.37,402,2),
('星星','female',18,'20160513','sale',3000.29,402,2),
('格格','female',28,'20170127','sale',4000.33,402,2),

('张野','male',28,'20160311','operation',10000.13,403,3),  #  operations department below
('程咬金','male',18,'19970312','operation',20000,403,3),
('程咬银','female',18,'20130311','operation',19000,403,3),
('程咬铜','male',18,'20150411','operation',18000,403,3),
('程咬铁','female',18,'20140512','operation',17000,403,3)
;

#  Note: on Windows, if inserted Chinese characters come back blank in select results, set every character encoding consistently to gbk

# Simple queries
    SELECT id,emp_name,sex,age,hire_date,post,post_comment,salary,office,depart_id 
    FROM employee;

    SELECT * FROM employee;

    SELECT emp_name,salary FROM employee;

# Remove duplicates with DISTINCT
    SELECT DISTINCT post FROM employee;    

# Queries with arithmetic expressions
    SELECT emp_name, salary*12 FROM employee;
    SELECT emp_name, salary*12 AS Annual_salary FROM employee;
    SELECT emp_name, salary*12 Annual_salary FROM employee;

# Formatting the output
   CONCAT() concatenates strings
   SELECT CONCAT('Name: ',emp_name,'  Annual salary: ', salary*12)  AS Annual_salary 
   FROM employee;
   
   CONCAT_WS() takes the separator as its first argument
   SELECT CONCAT_WS(':',emp_name,salary*12)  AS Annual_salary 
   FROM employee;

   Combined with a CASE expression:
   SELECT
       (
           CASE
           WHEN emp_name = 'mac' THEN
               emp_name
           WHEN emp_name = 'jason' THEN
               CONCAT(emp_name,'_BIGSB')
           ELSE
               concat(emp_name, 'SB')
           END
       ) as new_name
   FROM
       employee;

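III. Filter conditions (where)

The where clause can use:

  1. Comparison operators: > < >= <= <> !=
  2. between 80 and 100: the value lies between 80 and 100, both ends included
  3. in(80,90,100): the value is 80, 90, or 100
  4. like 'n%'
    • The wildcard can be % or _:
      • % matches any number of characters, including none
      • _ matches exactly one character
  5. Logical operators: multiple conditions can be combined with and, or, not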
Examples:

1. Single-condition query
    SELECT emp_name FROM employee
        WHERE post='sale';
        
2. Multiple-condition query
    SELECT emp_name,salary FROM employee
        WHERE post='teacher' AND salary>10000;

3. BETWEEN AND keyword
    SELECT emp_name,salary FROM employee 
        WHERE salary BETWEEN 10000 AND 20000;

    SELECT emp_name,salary FROM employee 
        WHERE salary NOT BETWEEN 10000 AND 20000;
    
4. IS NULL keyword (a field cannot be tested for NULL with the equals sign; use IS)
    SELECT emp_name,post_comment FROM employee 
        WHERE post_comment IS NULL;

    SELECT emp_name,post_comment FROM employee 
        WHERE post_comment IS NOT NULL;
        
    SELECT emp_name,post_comment FROM employee 
        WHERE post_comment='';  # note: '' is an empty string, not NULL
    ps:
        Run
        update employee set post_comment='' where id=2;
        and then rerun the previous query; it now returns a row

5. IN keyword for set membership
    SELECT emp_name,salary FROM employee 
        WHERE salary=3000 OR salary=3500 OR salary=4000 OR salary=9000 ;
    
    SELECT emp_name,salary FROM employee 
        WHERE salary IN (3000,3500,4000,9000) ;

    SELECT emp_name,salary FROM employee 
        WHERE salary NOT IN (3000,3500,4000,9000) ;

6. LIKE keyword for fuzzy matching
    Wildcard '%'
    SELECT * FROM employee 
            WHERE emp_name LIKE 'ni%';

    Wildcard '_'
    SELECT * FROM employee 
            WHERE emp_name LIKE 'ja__';
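The filters in this section can be tried side by side with a short sketch. It assumes Python's built-in sqlite3 module in place of MySQL and a reduced three-column employee table; the helper `names` is hypothetical and exists only for this illustration. The LIKE pattern here uses three underscores so that it matches the five-letter name jason.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; the table is a reduced employee.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_name TEXT, post_comment TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        ("nick", None, 7300.33),
        ("jason", "", 1000000.31),  # empty string, not NULL
        ("sean", None, 8300),
        ("tank", None, 3500),
        ("mac", None, 9000),
        ("rocky", None, 30000),
    ],
)

def names(where_clause):
    """Hypothetical helper: emp_name values matching a fixed, trusted WHERE clause."""
    sql = f"SELECT emp_name FROM employee WHERE {where_clause} ORDER BY rowid"
    return [row[0] for row in conn.execute(sql)]

between = names("salary BETWEEN 8300 AND 30000")  # both bounds are inclusive
in_set = names("salary IN (3500, 9000)")
like_pct = names("emp_name LIKE 'ni%'")           # 'ni' followed by anything
like_us = names("emp_name LIKE 'ja___'")          # 'ja' plus exactly three characters
is_null = names("post_comment IS NULL")
eq_null = names("post_comment = NULL")            # = never matches NULL
empty = names("post_comment = ''")

print(between, in_set, like_pct, like_us, is_null, eq_null, empty)
```

The last three results show why IS NULL is needed: comparing with `= NULL` returns nothing, and the empty string is found only by `= ''`.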

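IV. Grouping (group by)

Using the GROUP BY keyword on its own
    SELECT post FROM employee GROUP BY post;
    Note: since we group by the post field, the select list can only contain post; to get other information about the rows in each group, use functions

Using GROUP BY together with the GROUP_CONCAT() function
    SELECT post,GROUP_CONCAT(emp_name) FROM employee GROUP BY post;  # group by post and list the member names of each group
    SELECT post,GROUP_CONCAT(emp_name) as emp_members FROM employee GROUP BY post;

Using GROUP BY together with aggregate functions
    select post,count(id) as count from employee group by post;  #  group by post and count the people in each group

Note: if a unique field is used as the grouping key, every record forms its own group, which makes the grouping meaningless; a field whose value is shared by multiple records is the usual choice for a grouping key.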
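A minimal sketch of grouping with GROUP_CONCAT and COUNT follows. It assumes Python's built-in sqlite3 module as a stand-in for MySQL (SQLite also provides a `group_concat` function) and a reduced employee table; the alias `headcount` is an illustrative name.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, emp_name TEXT, post TEXT)"
)
conn.executemany(
    "INSERT INTO employee (emp_name, post) VALUES (?, ?)",
    [
        ("jason", "teacher"),
        ("sean", "teacher"),
        ("tank", "teacher"),
        ("歪歪", "sale"),
        ("丫丫", "sale"),
        ("张野", "operation"),
    ],
)

rows = conn.execute(
    "SELECT post, GROUP_CONCAT(emp_name) AS emp_members, COUNT(id) AS headcount "
    "FROM employee GROUP BY post ORDER BY post"
).fetchall()

# The order of names inside GROUP_CONCAT is not guaranteed, so compare as sets.
summary = {post: (set(members.split(",")), n) for post, members, n in rows}
print(summary)

# Grouping by a unique column yields one group per row, which is rarely useful.
groups_by_id = conn.execute(
    "SELECT COUNT(*) FROM (SELECT id FROM employee GROUP BY id)"
).fetchone()[0]
print(groups_by_id)
```

Grouping by post gives three groups, while grouping by the unique id gives as many groups as there are rows.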

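V. Aggregate functions

Key point: aggregate functions aggregate the contents of a group; when there is no grouping, the whole result is treated as one group.

Examples:

    SELECT COUNT(*) FROM employee;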
    SELECT COUNT(*) FROM employee WHERE depart_id=1;
    SELECT MAX(salary) FROM employee;
    SELECT MIN(salary) FROM employee;
    SELECT AVG(salary) FROM employee;
    SELECT SUM(salary) FROM employee;
    SELECT SUM(salary) FROM employee WHERE depart_id=3;
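The "one default group" behaviour can be checked with a small sketch. It assumes Python's built-in sqlite3 module in place of MySQL, a reduced employee table, and a department id of 99 chosen only because no row uses it.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_name TEXT, salary REAL, depart_id INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("sean", 8300, 1), ("tank", 3500, 1), ("丫丫", 2000, 2), ("程咬金", 20000, 3)],
)

# Without GROUP BY, the whole table is a single group: exactly one result row.
total = conn.execute(
    "SELECT COUNT(*), MAX(salary), MIN(salary), AVG(salary), SUM(salary) FROM employee"
).fetchone()

# WHERE narrows the rows first; the aggregate then covers what is left.
dept1 = conn.execute(
    "SELECT COUNT(*), SUM(salary) FROM employee WHERE depart_id = 1"
).fetchone()

# Even when no row matches, the single group still produces one row.
nobody = conn.execute(
    "SELECT COUNT(*), SUM(salary) FROM employee WHERE depart_id = 99"
).fetchone()

print(total, dept1, nobody)
```

With no matching rows, COUNT(*) is 0 and SUM is NULL (None in Python), but a result row is still returned.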

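VI. Filtering (having)

6.1 Differences between where and having

Execution order, earliest first: where > group by > having

  1. Where runs before group by, so it can reference any field, but it can never use aggregate functions.
  2. Having runs after group by, so it can use the grouping fields and aggregate functions, but it cannot directly access other fields.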
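The two rules can be sketched in a few lines. The example assumes Python's built-in sqlite3 module rather than MySQL, so the exact error text differs from MySQL's; only the fact that an aggregate inside WHERE is rejected carries over.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_name TEXT, post TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        ("sean", "teacher", 8300),
        ("rocky", "teacher", 30000),
        ("丫丫", "sale", 2000.35),
        ("格格", "sale", 4000.33),
        ("程咬金", "operation", 20000),
        ("程咬银", "operation", 19000),
    ],
)

# WHERE runs before grouping, so an aggregate there is rejected.
try:
    conn.execute("SELECT post FROM employee WHERE AVG(salary) > 10000 GROUP BY post")
    where_error = None
except sqlite3.OperationalError as exc:
    where_error = str(exc)

# HAVING runs after grouping, so the same aggregate is accepted.
high_paid_posts = [
    row[0]
    for row in conn.execute(
        "SELECT post FROM employee GROUP BY post "
        "HAVING AVG(salary) > 10000 ORDER BY post"
    )
]

print(where_error)
print(high_paid_posts)
```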

6.1.1 Verification

mysql> select @@sql_mode;
+--------------------+
| @@sql_mode         |
+--------------------+
| ONLY_FULL_GROUP_BY |
+--------------------+
1 row in set (0.00 sec)

mysql> select * from employee where salary > 100000;
+----+----------+------+-----+------------+---------+--------------+------------+--------+-----------+
| id | emp_name | sex  | age | hire_date  | post    | post_comment | salary     | office | depart_id |
+----+----------+------+-----+------------+---------+--------------+------------+--------+-----------+
|  2 | jason    | male |  78 | 2015-03-02 | teacher | NULL         | 1000000.31 |    401 |         1 |
+----+----------+------+-----+------------+---------+--------------+------------+--------+-----------+
1 row in set (0.00 sec)

mysql> select post,group_concat(emp_name) from employee group by post having salary > 10000;  # error: the salary field cannot be accessed directly after grouping
ERROR 1054 (42S22): Unknown column 'salary' in 'having clause'
mysql> select post,group_concat(emp_name) from employee group by post having avg(salary) > 10000;
+-----------+--------------------------------------+
| post      | group_concat(emp_name)               |
+-----------+--------------------------------------+
| operation | 程咬铁,程咬铜,程咬银,程咬金,张野     |
| teacher   | 成龙,rocky,mac,oscar,tank,sean,jason |
+-----------+--------------------------------------+
2 rows in set (0.00 sec)

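VII. Sorting query results (order by)

Sort by a single column
    SELECT * FROM employee ORDER BY salary;
    SELECT * FROM employee ORDER BY salary ASC;  # ascending
    SELECT * FROM employee ORDER BY salary DESC; # descending

Sort by multiple columns: sort by age first; rows with the same age are then sorted by salary, descending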
    SELECT * from employee
        ORDER BY age,
        salary DESC;
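A short sketch of the multi-column sort, assuming Python's built-in sqlite3 module as a stand-in for MySQL and a reduced employee table:

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_name TEXT, age INTEGER, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        ("nick", 18, 7300.33),
        ("jason", 78, 1000000.31),
        ("oscar", 28, 2100),
        ("mac", 18, 9000),
        ("rocky", 18, 30000),
        ("格格", 28, 4000.33),
    ],
)

# age has no keyword, so it sorts ascending (the default);
# DESC applies only to salary, which breaks ties within the same age.
ordered = [
    row[0]
    for row in conn.execute(
        "SELECT emp_name FROM employee ORDER BY age, salary DESC"
    )
]
print(ordered)
```

Each column in ORDER BY carries its own direction: the 18-year-olds come first, highest salary first, then the 28-year-olds, then jason.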

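VIII. Limiting the number of rows returned (limit)

Examples:

SELECT * FROM employee ORDER BY salary DESC 
    LIMIT 3;                    # the starting offset defaults to 0

SELECT * FROM employee ORDER BY salary DESC
    LIMIT 0,5; # start at offset 0, i.e. at the 1st row, and return 5 rows including it

SELECT * FROM employee ORDER BY salary DESC
    LIMIT 5,5; # start at offset 5, i.e. at the 6th row, and return 5 rows including it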
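The offset/count form of LIMIT is what pagination is usually built on. The sketch below assumes Python's built-in sqlite3 module in place of MySQL (SQLite accepts the same `LIMIT offset, count` form); the `page` helper is a hypothetical name used only for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [
        ("nick", 7300.33),
        ("jason", 1000000.31),
        ("sean", 8300),
        ("mac", 9000),
        ("rocky", 30000),
        ("成龙", 10000),
        ("程咬金", 20000),
    ],
)

def page(offset, count):
    """Hypothetical helper: `count` names starting at zero-based `offset`."""
    sql = "SELECT emp_name FROM employee ORDER BY salary DESC LIMIT ?, ?"
    return [row[0] for row in conn.execute(sql, (offset, count))]

first = page(0, 3)   # rows 1 to 3
second = page(3, 3)  # rows 4 to 6
third = page(6, 3)   # only one row is left
print(first, second, third)
```

Page n (counting from 0) of size k starts at offset n*k. The ORDER BY matters: without it, the rows that fall on each page are not guaranteed to be stable.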

IX. Querying with regular expressions

SELECT * FROM employee WHERE emp_name REGEXP '^jas';

SELECT * FROM employee WHERE emp_name REGEXP 'on$';

SELECT * FROM employee WHERE emp_name REGEXP 'm2';


Summary: ways to match strings
WHERE emp_name = 'nick';
WHERE emp_name LIKE 'sea%';
WHERE emp_name REGEXP 'on$';
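The three matching styles can be compared in one sketch. It assumes Python's built-in sqlite3 module instead of MySQL. SQLite ships no REGEXP implementation, so the sketch registers one backed by Python's `re` module; the case-insensitive flag is an assumption meant to resemble MySQL's usual case-insensitive collations, and Python's regex dialect is not identical to MySQL's.

```python
import re
import sqlite3

def regexp(pattern, value):
    """Backs `value REGEXP pattern`; SQLite passes the pattern as the first argument."""
    if value is None:
        return 0
    return int(re.search(pattern, value, re.IGNORECASE) is not None)

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE employee (emp_name TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?)",
    [("nick",), ("jason",), ("sean",), ("tank",), ("mac",)],
)

def names(where_clause):
    """Hypothetical helper: emp_name values matching a fixed, trusted WHERE clause."""
    sql = f"SELECT emp_name FROM employee WHERE {where_clause} ORDER BY rowid"
    return [row[0] for row in conn.execute(sql)]

exact = names("emp_name = 'nick'")         # whole value must be equal
prefix = names("emp_name LIKE 'sea%'")     # wildcard match over the whole value
starts = names("emp_name REGEXP '^jas'")   # anchored at the start
ends = names("emp_name REGEXP 'on$'")      # anchored at the end
contains = names("emp_name REGEXP 'a'")    # unanchored: matches anywhere
print(exact, prefix, starts, ends, contains)
```

Unlike LIKE, an unanchored regular expression matches anywhere in the value, so `'a'` finds every name containing that letter without needing `%` on both sides.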
