Split large CSV file into multiple files based on the size with the parent file header

Posted: 2019-06-04 18:39:38

Question:

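I have a scenario where I need to split a large CSV file into multiple CSV files. Each file should be 100 MB or smaller and should include the header row.

I tried the VB.NET code below in my SSIS package, but the child files do not get the header row.

Please help.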


Public Sub Main()


    Dim FileSize As Integer = 100000   'Specify In KB. Can Be Modified.

    Dim MasterPath As String = CStr(Dts.Variables("Filepath").Value) & "\Master.Csv"

    Dim ChildPath As String = CStr(Dts.Variables("Filepath").Value) & "\Child.Csv"

    Dim LogPath As String = CStr(Dts.Variables("Filepath").Value) & "\Log.Txt"

    Try

        Call SplitFile(MasterPath, ChildPath, LogPath, FileSize)

    Catch Ex As Exception

        MsgBox(Ex.Message)

    End Try

    Dts.TaskResult = ScriptResults.Success

End Sub
Sub SplitFile(ByVal MasterPath As String, ByVal ChildPath As String, ByVal Logpath As String, ByVal SizeKB As Integer)

    Dim FilesizeCounter As Integer

    Dim FileCounter As Integer = 0

    Dim RowCount As Integer = 0

    'Open The Stream And Read It Back.

    Dim Parentsr As StreamReader = File.OpenText(MasterPath)

    Dim Childfs As FileStream

    Dim Logfs As FileStream

    Call CreateFile(Logpath, Logfs)  'Create Log File

    Do While Parentsr.Peek() >= 0    'Looping Master File Stream

        If FilesizeCounter = 0 Then

            FileCounter = FileCounter + 1

            Call CreateFile(Replace(ChildPath, ".Csv", "_" & FileCounter & ".Csv"), Childfs)

        End If

        If FilesizeCounter < (SizeKB * 1024) Then

            Call WriteFile(Childfs, Parentsr.ReadLine() & vbNewLine, FilesizeCounter)

            If Parentsr.EndOfStream Then

                Childfs.Close()

                Call WriteFile(Logfs, "---------", 0)

                Call WriteFile(Logfs, "File Name:" & Replace(ChildPath, ".Csv", "_" & FileCounter & ".Csv") & vbNewLine & "Row Count:" & RowCount & vbNewLine & "Size(Bytes):" & FilesizeCounter & vbNewLine & "Extract End:" & Now().ToString, 0)

            End If

            RowCount = RowCount + 1

        Else

            Call WriteFile(Childfs, Parentsr.ReadLine() & vbNewLine, FilesizeCounter)

            Childfs.Close()   ' Close Child File

            Call WriteFile(Logfs, "---------", 0)

            Call WriteFile(Logfs, "File Name:" & Replace(ChildPath, ".Csv", "_" & FileCounter & ".Csv") & vbNewLine & "Row Count:" & RowCount & vbNewLine & "Size(Bytes):" & FilesizeCounter & vbNewLine & "Extract End:" & Now().ToString, 0)

            RowCount = RowCount + 1

            FilesizeCounter = 0    ' Reset File Size Counter

        End If

    Loop

    Parentsr.Close()  ' Close Master File

    Logfs.Close()     ' Close Log File

End Sub

Sub CreateFile(ByVal Path As String, ByRef Fs As FileStream)

    If File.Exists(Path) Then File.Delete(Path)  'Delete the file if it already exists.

    Fs = File.Create(Path)

End Sub
Sub WriteFile(ByRef Fs As FileStream, ByVal LineInfo As String, ByRef FilesizeCounter As Integer)

    Dim Info As Byte() = New Text.UTF8Encoding(True).GetBytes(LineInfo & vbNewLine)

    Fs.Write(Info, 0, Info.Length)    ' Add Some Information To The File.

    FilesizeCounter = FilesizeCounter + Info.Length

End Sub


#Region "ScriptResults declaration"

'This enum provides a convenient shorthand within the scope of this class for setting the
'result of the script.

'This code was generated automatically.
Enum ScriptResults
    Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success
    Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
End Enum

#End Region

End Class

I need the child files to include the master file's header.

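Comments:

I don't see anywhere that you are reading the first line of Parentsr. That's what you need to do. Read the first line, then each time you create a new file, write it before writing anything else.

Thanks Scott, but I have to skip the first child file and write it to all the others.

Are you working with a single file format? That is, are the columns and data types always the same?

Thanks, Scott Hannen. I've figured it out.

@aaron: Yes, it is a single file format; all columns and data types are always the same.

Answer 1: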

I read the header from the parent file:

Dim header As String = Parentsr.ReadLine()

Then I write it back each time a child file is created:

Call WriteFile(Childfs, header & vbNewLine, FilesizeCounter)
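
For reference, here is a minimal, self-contained sketch of the same idea: read the header once, then write it as the first line of every child file and count its bytes toward the size limit. The names CsvSplitter, SplitWithHeader, childPrefix and maxBytes are hypothetical and not from the original post. The sketch also assumes UTF-8 text, CRLF line endings, and no quoted fields containing embedded line breaks. Note that WriteFile in the question appends vbNewLine on top of the one its callers already add, which appears to produce a blank line after each row; the sketch writes a single line terminator per row.

```vbnet
Imports System.IO
Imports System.Text

Module CsvSplitter

    ' Splits masterPath into <childPrefix>_1.csv, <childPrefix>_2.csv, ...
    ' Every child starts with the master's header row and stays at or under
    ' maxBytes, unless the header plus one single row is already larger.
    ' Returns the number of child files written.
    Function SplitWithHeader(ByVal masterPath As String,
                             ByVal childPrefix As String,
                             ByVal maxBytes As Long) As Integer
        Dim utf8 As New UTF8Encoding(False)   ' assumption: UTF-8, no BOM in the children
        Dim fileCount As Integer = 0
        Dim bytesWritten As Long = 0
        Dim writer As StreamWriter = Nothing

        Using reader As New StreamReader(masterPath, utf8)
            ' Read the header once, before the main loop.
            Dim header As String = reader.ReadLine()
            If header Is Nothing Then Return 0   ' empty master file
            Dim headerBytes As Long = utf8.GetByteCount(header & vbCrLf)

            Try
                Dim row As String = reader.ReadLine()
                Do While row IsNot Nothing
                    Dim rowBytes As Long = utf8.GetByteCount(row & vbCrLf)

                    ' Open a new child if none is open yet, or if this row would
                    ' push the current child (which already holds data) past the limit.
                    If writer Is Nothing OrElse
                       (bytesWritten + rowBytes > maxBytes AndAlso bytesWritten > headerBytes) Then
                        If writer IsNot Nothing Then writer.Dispose()
                        fileCount += 1
                        writer = New StreamWriter(childPrefix & "_" & fileCount.ToString() & ".csv", False, utf8)
                        writer.NewLine = vbCrLf
                        writer.WriteLine(header)      ' header first in every child
                        bytesWritten = headerBytes
                    End If

                    writer.WriteLine(row)
                    bytesWritten += rowBytes
                    row = reader.ReadLine()
                Loop
            Finally
                If writer IsNot Nothing Then writer.Dispose()
            End Try
        End Using

        Return fileCount
    End Function

End Module
```

In the SSIS script task this could be called from Main, for example as SplitWithHeader(MasterPath, Replace(ChildPath, ".Csv", ""), 100L * 1024 * 1024) for a 100 MB limit. A master file that contains only a header row produces no child files in this sketch.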
