Create a SQL expression in R for a specific condition
Posted: 2018-05-10 14:44:24
Question: I pull data from SQL Server to run a regression analysis, and then write the regression results back to another SQL table.
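library(RODBC)
library(sqldf)
library(dplyr)   # %>%, group_by, mutate, select
library(tidyr)   # nest, unnest
library(purrr)   # map
library(broom)   # tidy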
dbHandle <- odbcDriverConnect("driver=SQL Server;server=MYSERVER;database=MYBASE;trusted_connection=true")
sql <-
"select
Dt
,CustomerName
,ItemRelation
,SaleCount
,DocumentNum
,DocumentYear
,IsPromo
from dbo.mytable"
df <- sqlQuery(dbHandle, sql)
After this query I need to run the regression separately for each group:
# Fit one linear model per group
my_lm <- function(df) {
  lm(SaleCount ~ IsPromo, data = df)
}
reg <- df %>%
  group_by(CustomerName, ItemRelation, DocumentNum, DocumentYear) %>%
  nest() %>%
  mutate(fit = map(data, my_lm),
         tidy = map(fit, tidy)) %>%
  select(-fit, -data) %>%
  unnest(tidy)
View(reg)
#save to sql table
sqlSave(dbHandle, as.data.frame(reg), "dbo.mytableforecast", verbose = TRUE) # use "append = TRUE" to add rows to an existing table
odbcClose(dbHandle)
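Problem:
The script runs automatically: a scheduler task launches it at a set time. For example, today 100 observations were loaded,
from 01.01.2017 to 10.04.2017.
The script runs the regression and writes the results back to the SQL table. Tomorrow a new batch of 100 observations will be loaded,
from 11.04.2017 to 20.07.2017.
That is, when tomorrow's data is loaded and the script starts at 10 p.m., it must work only with the data from 11.04.2017-20.07.2017, and not with 01.01.2017-20.07.2017.
The situation is complicated by the fact that the Dt column is dropped after the regression, so the solution I was given in "Automatic transfer data from the sql to R" does not work, because Dt no longer exists.
How can I add a condition to the scheduled query, something like:
select Dt, CustomerName, ItemRelation, SaleCount, DocumentNum, DocumentYear, IsPromo from dbo.mytable where Dt > "the last date when the script was launched"
Is it possible to build such an expression?
Sample data from SQL: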
df=structure(list(Dt = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
8L, 9L, 9L, 10L, 10L, 11L, 11L, 12L, 12L, 13L, 13L, 14L, 14L,
15L, 15L, 16L, 16L, 16L, 16L, 17L, 17L, 17L, 17L, 18L, 18L, 18L,
18L, 19L), .Label = c("2017-10-12 00:00:00.000", "2017-10-13 00:00:00.000",
"2017-10-14 00:00:00.000", "2017-10-15 00:00:00.000", "2017-10-16 00:00:00.000",
"2017-10-17 00:00:00.000", "2017-10-18 00:00:00.000", "2017-10-19 00:00:00.000",
"2017-10-20 00:00:00.000", "2017-10-21 00:00:00.000", "2017-10-22 00:00:00.000",
"2017-10-23 00:00:00.000", "2017-10-24 00:00:00.000", "2017-10-25 00:00:00.000",
"2017-10-26 00:00:00.000", "2017-10-27 00:00:00.000", "2017-10-28 00:00:00.000",
"2017-10-29 00:00:00.000", "2017-10-30 00:00:00.000"), class = "factor"),
CustomerName = structure(c(1L, 11L, 12L, 13L, 14L, 15L, 16L,
17L, 18L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 11L, 12L,
13L, 14L, 15L, 16L, 17L, 18L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L), .Label = c("x1", "x10", "x11", "x12", "x13", "x14",
"x15", "x16", "x17", "x18", "x2", "x3", "x4", "x5", "x6",
"x7", "x8", "x9"), class = "factor"), ItemRelation = c(13322L,
13322L, 13322L, 13322L, 13322L, 13322L, 13322L, 11706L, 13322L,
11706L, 13322L, 11706L, 13322L, 11706L, 13322L, 11706L, 13322L,
11706L, 13322L, 11706L, 13322L, 11706L, 13322L, 11706L, 13163L,
13322L, 158010L, 11706L, 13163L, 13322L, 158010L, 11706L,
13163L, 13322L, 158010L, 11706L), SaleCount = c(10L, 3L,
1L, 0L, 9L, 5L, 5L, 11L, 7L, 0L, 5L, 11L, 1L, 0L, 0L, 19L,
10L, 0L, 1L, 12L, 1L, 11L, 6L, 0L, 167L, 7L, 0L, 16L, 165L,
1L, 0L, 0L, 29L, 0L, 0L, 11L), DocumentNum = c(36L, 36L,
36L, 36L, 36L, 36L, 36L, 51L, 36L, 51L, 36L, 51L, 36L, 51L,
36L, 51L, 36L, 51L, 36L, 51L, 36L, 51L, 36L, 51L, 131L, 36L,
89L, 51L, 131L, 36L, 89L, 51L, 131L, 36L, 89L, 51L), DocumentYear = c(2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L),
IsPromo = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("Dt", "CustomerName",
"ItemRelation", "SaleCount", "DocumentNum", "DocumentYear", "IsPromo"
), class = "data.frame", row.names = c(NA, -36L))
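Answer 1: Consider saving the maximum Dt (retrieved before the regression drops that field) to a log file at the end of the scheduled script. Then, at the start of the script, read the log and use the last recorded date in the WHERE clause: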
# READ DATE FROM LOG FILE
log_dt <- readLines("/path/to/SQL_MaxDate.txt", warn=FALSE)
# QUERY WITH WHERE CLAUSE
sql <- paste0("SELECT Dt, CustomerName, ItemRelation, SaleCount,
DocumentNum, DocumentYear, IsPromo
FROM dbo.mytable WHERE Dt > '", log_dt, "'")
df <- sqlQuery(dbHandle, sql)
# RETRIEVE MAX DATE VALUE
max_DT <- as.character(max(df$Dt))
# ... regression
# WRITE DATE TO LOG FILE
cat(max_DT, file="/path/to/SQL_MaxDate.txt")
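The log-file approach leaves two cases open: the first run, when no log file exists yet, and a run where the query returns no new rows, so that max() on an empty column does not yield a usable date. Below is a minimal sketch of how both might be handled. It assumes Dt arrives as ISO-style text, a factor, or POSIXct. The helper names read_watermark and write_watermark and the default date 1900-01-01 are hypothetical and not part of the answer above.

```r
# Hypothetical helpers; names and the default date are assumptions for illustration.
read_watermark <- function(path, default = "1900-01-01") {
  # First run: no log file yet, so fall back to a date older than any data.
  if (!file.exists(path)) return(default)
  value <- trimws(readLines(path, n = 1, warn = FALSE))
  if (length(value) == 0 || !nzchar(value)) default else value
}

write_watermark <- function(df, path) {
  # No new rows were loaded: leave the previous watermark untouched.
  if (nrow(df) == 0) return(invisible(FALSE))
  dt <- df$Dt
  latest <- if (inherits(dt, "POSIXt")) {
    format(max(dt), "%Y-%m-%d %H:%M:%S")
  } else {
    # Factors have no max(); ISO-style strings sort chronologically as text.
    max(as.character(dt))
  }
  cat(latest, file = path)
  invisible(TRUE)
}

# Small demonstration with a temporary file instead of a real log path.
log_path <- tempfile(fileext = ".txt")
first <- read_watermark(log_path)                  # no file yet, default is used
sample_df <- data.frame(Dt = factor(c("2017-10-12 00:00:00.000",
                                      "2017-10-30 00:00:00.000",
                                      "2017-10-13 00:00:00.000")))
write_watermark(sample_df, log_path)
second <- read_watermark(log_path)                 # latest Dt from sample_df
write_watermark(sample_df[0, , drop = FALSE], log_path)
third <- read_watermark(log_path)                  # unchanged after an empty load
```

In the scheduled script, write_watermark would probably be called only after sqlSave has succeeded, so that a failed run is retried with the same date range the next time.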
Better yet, use parameterization with RODBCext to avoid string concatenation and quoting:
library(RODBC)
library(RODBCext)
# READ DATE FROM LOG FILE
log_dt <- readLines("/path/to/SQL_MaxDate.txt", warn=FALSE)
dbHandle <- odbcDriverConnect(...)
# PREPARED STATEMENT WITH PLACEHOLDER
sql <- "SELECT Dt, CustomerName, ItemRelation, SaleCount,
DocumentNum, DocumentYear, IsPromo
FROM dbo.mytable WHERE Dt > ?"
# EXECUTE QUERY BINDING PARAM VALUE
df <- sqlExecute(dbHandle, sql, log_dt, fetch=TRUE)
# RETRIEVE MAX DATE VALUE
max_DT <- as.character(max(df$Dt))
# ... regression
# WRITE DATE TO LOG FILE
cat(max_DT, file="/path/to/SQL_MaxDate.txt")
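In both variants the value read from the log file ends up in the query, so it may be worth checking that it looks like a date before using it. This matters most for the string-concatenation version. A minimal sketch in base R follows. The function name is_valid_watermark is hypothetical, and the accepted formats (YYYY-MM-DD with an optional hh:mm:ss and up to three fractional digits) are an assumption based on the sample data.

```r
# Hypothetical validator; the accepted formats are an assumption, adjust as needed.
is_valid_watermark <- function(x) {
  pattern <- "^[0-9]{4}-[0-9]{2}-[0-9]{2}( [0-9]{2}:[0-9]{2}:[0-9]{2}(\\.[0-9]{1,3})?)?$"
  is.character(x) && length(x) == 1 && !is.na(x) &&
    grepl(pattern, x) &&
    # Reject impossible calendar dates such as month 13.
    !is.na(as.Date(substr(x, 1, 10), format = "%Y-%m-%d"))
}

log_dt <- "2017-10-30 00:00:00.000"
if (!is_valid_watermark(log_dt)) {
  stop("Unexpected value in the date log: ", log_dt)
}
```

A check like this would also catch a log file that was overwritten with a non-date value after a run that loaded no rows.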