在 R 中为特定条件创建 sql 表达式

Posted 2023-05-07

技术标签:

【中文标题】在 R 中为特定条件创建 sql 表达式【英文标题】：create sql expression in R for certain condition 【发布时间】：2018-05-10 14:44:24 【问题描述】：

我从sql server获取数据进行回归分析，然后将回归结果返回到另一个sql表中。

library("RODBC")
library(sqldf)

dbHandle <- odbcDriverConnect("driver=SQL Server;server=MYSERVER;database=MYBASE;trusted_connection=true")

sql <- 
  "select
Dt
,CustomerName
,ItemRelation
,SaleCount
,DocumentNum
,DocumentYear
,IsPromo

from dbo.mytable"

df <- sqlQuery(dbHandle, sql)

After this query i must perform regression analysis separately for groups

my_lm <- function(df) 
  lm(SaleCount~IsPromo, data = df)


reg=df %>% 
  group_by(CustomerName,ItemRelation,DocumentNum,DocumentYear) %>% 
  nest() %>%
  mutate(fit = map(data, my_lm),
         tidy = map(fit, tidy)) %>%
  select(-fit, - data) %>%
  unnest()
View(reg)

#save to sql table
sqlSave(dbHandle, as.data.frame(reg), "dbo.mytableforecast", verbose = TRUE)  # use "append = TRUE" to add rows to an existing table

odbcClose(dbHandle)

问题：

脚本自动运行，即在调度程序中有任务在特定时间启动脚本。例如，今天加载了 100 个观测值。

From 01.01.2017-10.04.2017

脚本执行回归并将数据返回到 sql 表。明天将加载新的 100 个观测值。

11.04.2017-20.07.2017

I.E.明天何时加载数据，脚本将在晚上 10 点开始，it must work only with data from 11.04.2017-20.07.2017, and not from 01.01.2017-20.07.2017

由于回归后 Dt 列被删除，情况变得复杂，所以这里给我的解决方案不起作用 Automatic transfer data from the sql to R 因为 Dt 不存在。

如何设置日程select Dt ,CustomerName ,ItemRelation ,SaleCount ,DocumentNum ,DocumentYear ,IsPromo from dbo.mytable "where Dt>the last date when the script was launched"的条件

是否可以创建这个表达式？

来自 sql 的数据示例

df=structure(list(Dt = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 
8L, 9L, 9L, 10L, 10L, 11L, 11L, 12L, 12L, 13L, 13L, 14L, 14L, 
15L, 15L, 16L, 16L, 16L, 16L, 17L, 17L, 17L, 17L, 18L, 18L, 18L, 
18L, 19L), .Label = c("2017-10-12 00:00:00.000", "2017-10-13 00:00:00.000", 
"2017-10-14 00:00:00.000", "2017-10-15 00:00:00.000", "2017-10-16 00:00:00.000", 
"2017-10-17 00:00:00.000", "2017-10-18 00:00:00.000", "2017-10-19 00:00:00.000", 
"2017-10-20 00:00:00.000", "2017-10-21 00:00:00.000", "2017-10-22 00:00:00.000", 
"2017-10-23 00:00:00.000", "2017-10-24 00:00:00.000", "2017-10-25 00:00:00.000", 
"2017-10-26 00:00:00.000", "2017-10-27 00:00:00.000", "2017-10-28 00:00:00.000", 
"2017-10-29 00:00:00.000", "2017-10-30 00:00:00.000"), class = "factor"), 
    CustomerName = structure(c(1L, 11L, 12L, 13L, 14L, 15L, 16L, 
    17L, 18L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 11L, 12L, 
    13L, 14L, 15L, 16L, 17L, 18L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 
    9L, 10L), .Label = c("x1", "x10", "x11", "x12", "x13", "x14", 
    "x15", "x16", "x17", "x18", "x2", "x3", "x4", "x5", "x6", 
    "x7", "x8", "x9"), class = "factor"), ItemRelation = c(13322L, 
    13322L, 13322L, 13322L, 13322L, 13322L, 13322L, 11706L, 13322L, 
    11706L, 13322L, 11706L, 13322L, 11706L, 13322L, 11706L, 13322L, 
    11706L, 13322L, 11706L, 13322L, 11706L, 13322L, 11706L, 13163L, 
    13322L, 158010L, 11706L, 13163L, 13322L, 158010L, 11706L, 
    13163L, 13322L, 158010L, 11706L), SaleCount = c(10L, 3L, 
    1L, 0L, 9L, 5L, 5L, 11L, 7L, 0L, 5L, 11L, 1L, 0L, 0L, 19L, 
    10L, 0L, 1L, 12L, 1L, 11L, 6L, 0L, 167L, 7L, 0L, 16L, 165L, 
    1L, 0L, 0L, 29L, 0L, 0L, 11L), DocumentNum = c(36L, 36L, 
    36L, 36L, 36L, 36L, 36L, 51L, 36L, 51L, 36L, 51L, 36L, 51L, 
    36L, 51L, 36L, 51L, 36L, 51L, 36L, 51L, 36L, 51L, 131L, 36L, 
    89L, 51L, 131L, 36L, 89L, 51L, 131L, 36L, 89L, 51L), DocumentYear = c(2017L, 
    2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 
    2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 
    2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 
    2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L), 
    IsPromo = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("Dt", "CustomerName", 
"ItemRelation", "SaleCount", "DocumentNum", "DocumentYear", "IsPromo"
), class = "data.frame", row.names = c(NA, -36L))

【问题讨论】：

【参考方案1】：

考虑将最大 DT（在删除字段的回归之前检索）保存在计划脚本末尾的日志文件中，然后在脚本开头添加日志读取以用于最后记录包含在WHERE 子句中的日期：

# READ DATE FROM LOG FILE
log_dt <- readLines("/path/to/SQL_MaxDate.txt", warn=FALSE)

# QUERY WITH WHERE CLAUSE
sql <- paste0("SELECT Dt, CustomerName, ItemRelation, SaleCount, 
                      DocumentNum, DocumentYear, IsPromo
               FROM dbo.mytable WHERE Dt > '", log_dt, "'")

df <- sqlQuery(dbHandle, sql)

# RETRIEVE MAX DATE VALUE
max_DT <- as.character(max(df$Dt))

# ... regression

# WRITE DATE TO LOG FILE
cat(max_DT, file="/path/to/SQL_MaxDate.txt")

更好的是，使用带有 RODBCext 的参数化来避免字符串连接和引用：

library(RODBC)
library(RODBCext)

# READ DATE FROM LOG FILE
log_dt <- readLines("/path/to/SQL_MaxDate.txt", warn=FALSE)

dbHandle <- odbcDriverConnect(...)

# PREPARED STATEMENT WITH PLACEHOLDER
sql <- "SELECT Dt, CustomerName, ItemRelation, SaleCount, 
               DocumentNum, DocumentYear, IsPromo
        FROM dbo.mytable WHERE Dt > ?")

# EXECUTE QUERY BINDING PARAM VALUE
df <- sqlExecute(dbHandle, sql, log_dt, fetch=TRUE)

# RETRIEVE MAX DATE VALUE
max_DT <- as.character(max(df$Dt))

# ... regression

# WRITE DATE TO LOG FILE
cat(max_DT, file="/path/to/SQL_MaxDate.txt")

【讨论】：

以上是关于在 R 中为特定条件创建 sql 表达式的主要内容，如果未能解决你的问题，请参考以下文章