Split large CSV file into multiple files based on the size with the parent file header
【Posted】: 2019-06-04 18:39:38
【Question】: I have a scenario where I need to split a large CSV file into multiple CSV files. Each child file should be 100 MB or smaller and should include the header row.
I tried the VB.NET code below in my SSIS package, but the child files do not get the header row.
Please help.
Public Sub Main()
    Dim FileSize As Integer = 100000 'Specify In KB. Can Be Modified.
    Dim MasterPath As String = CStr(Dts.Variables("Filepath").Value) & "\Master.Csv"
    Dim ChildPath As String = CStr(Dts.Variables("Filepath").Value) & "\Child.Csv"
    Dim LogPath As String = CStr(Dts.Variables("Filepath").Value) & "\Log.Txt"
    Try
        Call SplitFile(MasterPath, ChildPath, LogPath, FileSize)
    Catch Ex As Exception
        MsgBox(Ex.Message)
    End Try
    Dts.TaskResult = ScriptResults.Success
End Sub
Sub SplitFile(ByVal MasterPath As String, ByVal ChildPath As String, ByVal Logpath As String, ByVal SizeKB As Integer)
    Dim FilesizeCounter As Integer
    Dim FileCounter As Integer = 0
    Dim RowCount As Integer = 0
    'Open The Stream And Read It Back.
    Dim Parentsr As StreamReader = File.OpenText(MasterPath)
    Dim Childfs As FileStream
    Dim Logfs As FileStream
    Call CreateFile(Logpath, Logfs) 'Create Log File
    Do While Parentsr.Peek() >= 0 'Looping Master File Stream
        If FilesizeCounter = 0 Then
            FileCounter = FileCounter + 1
            Call CreateFile(Replace(ChildPath, ".Csv", "_" & FileCounter & ".Csv"), Childfs)
        End If
        If FilesizeCounter < (SizeKB * 1024) Then
            Call WriteFile(Childfs, Parentsr.ReadLine() & vbNewLine, FilesizeCounter)
            If Parentsr.EndOfStream Then
                Childfs.Close()
                Call WriteFile(Logfs, "---------", 0)
                Call WriteFile(Logfs, "File Name:" & Replace(ChildPath, ".Csv", "_" & FileCounter & ".Csv") & vbNewLine & "Row Count:" & RowCount & vbNewLine & "Size(Bytes):" & FilesizeCounter & vbNewLine & "Extract End:" & Now().ToString, 0)
            End If
            RowCount = RowCount + 1
        Else
            Call WriteFile(Childfs, Parentsr.ReadLine() & vbNewLine, FilesizeCounter)
            Childfs.Close() ' Close Child File
            Call WriteFile(Logfs, "---------", 0)
            Call WriteFile(Logfs, "File Name:" & Replace(ChildPath, ".Csv", "_" & FileCounter & ".Csv") & vbNewLine & "Row Count:" & RowCount & vbNewLine & "Size(Bytes):" & FilesizeCounter & vbNewLine & "Extract End:" & Now().ToString, 0)
            RowCount = RowCount + 1
            FilesizeCounter = 0 ' Reset File Size Counter
        End If
    Loop
    Parentsr.Close() ' Close Master File
    Logfs.Close() ' Close Log File
End Sub
Sub CreateFile(ByVal Path As String, ByRef Fs As FileStream)
    If File.Exists(Path) Then File.Delete(Path) 'Delete the file if it already exists.
    Fs = File.Create(Path)
End Sub
Sub WriteFile(ByRef Fs As FileStream, ByVal LineInfo As String, ByRef FilesizeCounter As Integer)
    'Note: a line break is appended here, and callers that also append vbNewLine end up with a blank line after each row.
    Dim Info As Byte() = New Text.UTF8Encoding(True).GetBytes(LineInfo & vbNewLine)
    Fs.Write(Info, 0, Info.Length) 'Write the encoded line to the file.
    FilesizeCounter = FilesizeCounter + Info.Length 'Track bytes written to the current file.
End Sub
#Region "ScriptResults declaration"
    'This enum provides a convenient shorthand within the scope of this class for setting the
    'result of the script.
    'This code was generated automatically.
    Enum ScriptResults
        Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success
        Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
    End Enum
#End Region
End Class
I need the child files to include the master file's header.
【Comments】:
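I don't see anywhere that you are reading the first line of Parentsr. That's what you need to do. Read the first line, and then each time you create a new file, write it before writing anything else.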
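Thanks Scott, but I have to skip the first child file and write it to all the others.
Are you working with a single file format? That is, are the columns and data types always the same?
Thanks, Scott Hannen. I've figured it out. @aaron: Yes, it is a single file format; all columns and data types are always the same.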
【Answer 1】:
I read the header from the parent file once, before the loop:
Dim header As String = Parentsr.ReadLine()
Then I write it back each time a child file is created:
Call WriteFile(Childfs, header & vbNewLine, FilesizeCounter)
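For readers who want to see the whole flow in one place, here is a minimal standalone sketch of the same idea: read the header once, then write it as the first line of every child file. It is not the original poster's final code. The module name `CsvSplitter`, the function `SplitCsvWithHeader`, its parameters, the `_N.csv` naming scheme, and UTF-8 output without a BOM are all assumptions made for illustration. Unlike the code in the question, it checks the size before writing a row, so a child file only exceeds the limit when a single row is itself larger than the limit.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module CsvSplitter
    ' Hypothetical helper (not from the original post). Splits masterPath into
    ' childPrefix_1.csv, childPrefix_2.csv, ... and returns the number of files written.
    Function SplitCsvWithHeader(masterPath As String, childPrefix As String, maxBytes As Long) As Integer
        Dim utf8 As New UTF8Encoding(False) ' Assumption: UTF-8 without a BOM.
        Dim fileCounter As Integer = 0

        Using reader As New StreamReader(masterPath, utf8)
            ' Read the header once, before any data rows.
            Dim header As String = reader.ReadLine()
            If header Is Nothing Then Return 0 ' Empty master file: nothing to split.
            Dim headerBytes As Long = utf8.GetByteCount(header & Environment.NewLine)

            Dim writer As StreamWriter = Nothing
            Dim currentBytes As Long = 0
            Try
                Dim line As String = reader.ReadLine()
                While line IsNot Nothing
                    Dim lineBytes As Long = utf8.GetByteCount(line & Environment.NewLine)

                    ' Start a new child file when none is open yet, or when the current
                    ' file already holds data and this row would push it past the limit.
                    If writer Is Nothing OrElse
                       (currentBytes > headerBytes AndAlso currentBytes + lineBytes > maxBytes) Then
                        If writer IsNot Nothing Then writer.Dispose()
                        fileCounter += 1
                        writer = New StreamWriter(childPrefix & "_" & fileCounter & ".csv", False, utf8)
                        writer.WriteLine(header) ' Header goes first in every child file.
                        currentBytes = headerBytes
                    End If

                    writer.WriteLine(line)
                    currentBytes += lineBytes
                    line = reader.ReadLine()
                End While
            Finally
                If writer IsNot Nothing Then writer.Dispose()
            End Try
        End Using

        Return fileCounter
    End Function
End Module
```

Because the header is consumed from the reader before the loop starts, it is never written as a data row, so the first child file gets it exactly once, like every other child file. In the SSIS script from the question, a call might look like `SplitCsvWithHeader(MasterPath, Replace(ChildPath, ".Csv", ""), 100L * 1024 * 1024)`. This sketch also splits on line breaks only, so it assumes no quoted field contains an embedded newline.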