Parsing a JSON file with .NET Core 3.0/System.Text.Json

Posted: 2019-03-04 12:42:57

Question:

I am trying to read and parse a large JSON file that does not fit in memory, using System.Text.Json, the new JSON reader in .NET Core 3.0.

Microsoft's sample code takes a ReadOnlySpan<byte> as input:

    public static void Utf8JsonReaderLoop(ReadOnlySpan<byte> dataUtf8)
    {
        var json = new Utf8JsonReader(dataUtf8, isFinalBlock: true, state: default);

        while (json.Read())
        {
            JsonTokenType tokenType = json.TokenType;
            ReadOnlySpan<byte> valueSpan = json.ValueSpan;
            switch (tokenType)
            {
                case JsonTokenType.StartObject:
                case JsonTokenType.EndObject:
                    break;
                case JsonTokenType.StartArray:
                case JsonTokenType.EndArray:
                    break;
                case JsonTokenType.PropertyName:
                    break;
                case JsonTokenType.String:
                    string valueString = json.GetString();
                    break;
                case JsonTokenType.Number:
                    if (!json.TryGetInt32(out int valueInteger))
                    {
                        throw new FormatException();
                    }
                    break;
                case JsonTokenType.True:
                case JsonTokenType.False:
                    bool valueBool = json.GetBoolean();
                    break;
                case JsonTokenType.Null:
                    break;
                default:
                    throw new ArgumentException();
            }
        }

        dataUtf8 = dataUtf8.Slice((int)json.BytesConsumed);
        JsonReaderState state = json.CurrentState;
    }

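What I am struggling to find out is how to actually use this code with a FileStream, that is, how to get from a FileStream to a ReadOnlySpan<byte>.

I tried to read the file with the following code, calling ReadAndProcessLargeFile("latest-all.json"):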

    const int megabyte = 1024 * 1024;
    public static void ReadAndProcessLargeFile(string theFilename, long whereToStartReading = 0)
    {
        FileStream fileStram = new FileStream(theFilename, FileMode.Open, FileAccess.Read);
        using (fileStram)
        {
            byte[] buffer = new byte[megabyte];
            fileStram.Seek(whereToStartReading, SeekOrigin.Begin);
            int bytesRead = fileStram.Read(buffer, 0, megabyte);
            while (bytesRead > 0)
            {
                ProcessChunk(buffer, bytesRead);
                bytesRead = fileStram.Read(buffer, 0, megabyte);
            }

        }
    }

    private static void ProcessChunk(byte[] buffer, int bytesRead)
    {
        var span = new ReadOnlySpan<byte>(buffer);
        Utf8JsonReaderLoop(span);
    }

It crashes with the following error message:

System.Text.Json.JsonReaderException: 'Expected end of string, but instead reached end of data. LineNumber: 8 | BytePositionInLine: 123335.'

For reference, here is my working code using Newtonsoft.Json:

        dynamic o;
        var serializer = new Newtonsoft.Json.JsonSerializer();
        using (FileStream s = File.Open("latest-all.json", FileMode.Open))
        using (StreamReader sr = new StreamReader(s))
        using (JsonReader reader = new JsonTextReader(sr))
        {
            while (reader.Read())
            {
                if (reader.TokenType == JsonToken.StartObject)
                {
                    o = serializer.Deserialize(reader);
                }
            }
        }

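Comments:

ProcessChunk does not use bytesRead. I think you also need to pass the state from the previous Utf8JsonReader into the next Utf8JsonReader constructor, and correctly indicate whether you are handing it the final block.

Also, Stream.Read can take a Span<byte> instead of a byte[].

So... why don't you use Utf8JsonReader.Parse(Stream, JsonReaderOptions)? I guess that, however you supply the data, the question is whether the resulting object fits in your memory. If it does, a stream parser should work as well.

The JSON file is a WikiData dump, about 800 GB in size. Each entity I want to parse is small, as described at mediawiki.org/wiki/Wikibase/DataModel/JSON. I can't seem to find Utf8JsonReader.Parse?

Answer 1: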

Update 2019-10-13: rewrote Utf8JsonStreamReader to use ReadOnlySequences internally and added a wrapper for the JsonSerializer.Deserialize method.

I have created a wrapper around Utf8JsonReader for exactly this purpose:

#nullable enable
using System;
using System.Buffers;
using System.IO;
using System.Text.Json;

public ref struct Utf8JsonStreamReader
{
    private readonly Stream _stream;
    private readonly int _bufferSize;

    private SequenceSegment? _firstSegment;
    private int _firstSegmentStartIndex;
    private SequenceSegment? _lastSegment;
    private int _lastSegmentEndIndex;

    private Utf8JsonReader _jsonReader;
    private bool _keepBuffers;
    private bool _isFinalBlock;

    public Utf8JsonStreamReader(Stream stream, int bufferSize)
    {
        _stream = stream;
        _bufferSize = bufferSize;

        _firstSegment = null;
        _firstSegmentStartIndex = 0;
        _lastSegment = null;
        _lastSegmentEndIndex = -1;

        _jsonReader = default;
        _keepBuffers = false;
        _isFinalBlock = false;
    }

    public bool Read()
    {
        // read could be unsuccessful due to insufficient buffer size, retrying in loop with additional buffer segments
        while (!_jsonReader.Read())
        {
            if (_isFinalBlock)
                return false;

            MoveNext();
        }

        return true;
    }

    private void MoveNext()
    {
        var firstSegment = _firstSegment;
        _firstSegmentStartIndex += (int)_jsonReader.BytesConsumed;

        // release previous segments if possible
        if (!_keepBuffers)
        {
            while (firstSegment?.Memory.Length <= _firstSegmentStartIndex)
            {
                _firstSegmentStartIndex -= firstSegment.Memory.Length;
                firstSegment.Dispose();
                firstSegment = (SequenceSegment?)firstSegment.Next;
            }
        }

        // create new segment
        var newSegment = new SequenceSegment(_bufferSize, _lastSegment);

        if (firstSegment != null)
        {
            _firstSegment = firstSegment;
            newSegment.Previous = _lastSegment;
            _lastSegment?.SetNext(newSegment);
            _lastSegment = newSegment;
        }
        else
        {
            _firstSegment = _lastSegment = newSegment;
            _firstSegmentStartIndex = 0;
        }

        // read data from stream
        // (this treats a read that does not fill the buffer as the end of the stream)
        _lastSegmentEndIndex = _stream.Read(newSegment.Buffer.Memory.Span);
        _isFinalBlock = _lastSegmentEndIndex < newSegment.Buffer.Memory.Length;
        _jsonReader = new Utf8JsonReader(new ReadOnlySequence<byte>(_firstSegment, _firstSegmentStartIndex, _lastSegment, _lastSegmentEndIndex), _isFinalBlock, _jsonReader.CurrentState);
    }

    public T Deserialize<T>(JsonSerializerOptions? options = null)
    {
        // JsonSerializer.Deserialize can read only a single object. We have to extract the
        // object to be deserialized into a separate Utf8JsonReader. This incurs one additional
        // pass through the data (but the data is only passed, not parsed).
        var tokenStartIndex = _jsonReader.TokenStartIndex;
        var firstSegment = _firstSegment;
        var firstSegmentStartIndex = _firstSegmentStartIndex;

        // loop through data until end of object is found
        _keepBuffers = true;
        int depth = 0;

        if (TokenType == JsonTokenType.StartObject || TokenType == JsonTokenType.StartArray)
            depth++;

        while (depth > 0 && Read())
        {
            if (TokenType == JsonTokenType.StartObject || TokenType == JsonTokenType.StartArray)
                depth++;
            else if (TokenType == JsonTokenType.EndObject || TokenType == JsonTokenType.EndArray)
                depth--;
        }

        _keepBuffers = false;

        // end of object found, extract json reader for deserializer
        var newJsonReader = new Utf8JsonReader(new ReadOnlySequence<byte>(firstSegment!, firstSegmentStartIndex, _lastSegment!, _lastSegmentEndIndex).Slice(tokenStartIndex, _jsonReader.Position), true, default);

        // deserialize value
        var result = JsonSerializer.Deserialize<T>(ref newJsonReader, options);

        // release memory if possible
        firstSegmentStartIndex = _firstSegmentStartIndex + (int)_jsonReader.BytesConsumed;

        while (firstSegment?.Memory.Length < firstSegmentStartIndex)
        {
            firstSegmentStartIndex -= firstSegment.Memory.Length;
            firstSegment.Dispose();
            firstSegment = (SequenceSegment?)firstSegment.Next;
        }

        if (firstSegment != _firstSegment)
        {
            _firstSegment = firstSegment;
            _firstSegmentStartIndex = firstSegmentStartIndex;
            _jsonReader = new Utf8JsonReader(new ReadOnlySequence<byte>(_firstSegment!, _firstSegmentStartIndex, _lastSegment!, _lastSegmentEndIndex), _isFinalBlock, _jsonReader.CurrentState);
        }

        return result;
    }

    public void Dispose() => _lastSegment?.Dispose();

    // state of the wrapped Utf8JsonReader
    public int CurrentDepth => _jsonReader.CurrentDepth;
    public bool HasValueSequence => _jsonReader.HasValueSequence;
    public long TokenStartIndex => _jsonReader.TokenStartIndex;
    public JsonTokenType TokenType => _jsonReader.TokenType;
    public ReadOnlySequence<byte> ValueSequence => _jsonReader.ValueSequence;
    public ReadOnlySpan<byte> ValueSpan => _jsonReader.ValueSpan;

    // value accessors, passed through to the wrapped Utf8JsonReader
    public bool GetBoolean() => _jsonReader.GetBoolean();
    public byte GetByte() => _jsonReader.GetByte();
    public byte[] GetBytesFromBase64() => _jsonReader.GetBytesFromBase64();
    public string GetComment() => _jsonReader.GetComment();
    public DateTime GetDateTime() => _jsonReader.GetDateTime();
    public DateTimeOffset GetDateTimeOffset() => _jsonReader.GetDateTimeOffset();
    public decimal GetDecimal() => _jsonReader.GetDecimal();
    public double GetDouble() => _jsonReader.GetDouble();
    public Guid GetGuid() => _jsonReader.GetGuid();
    public short GetInt16() => _jsonReader.GetInt16();
    public int GetInt32() => _jsonReader.GetInt32();
    public long GetInt64() => _jsonReader.GetInt64();
    public sbyte GetSByte() => _jsonReader.GetSByte();
    public float GetSingle() => _jsonReader.GetSingle();
    public string GetString() => _jsonReader.GetString();
    public uint GetUInt32() => _jsonReader.GetUInt32();
    public ulong GetUInt64() => _jsonReader.GetUInt64();
    public bool TryGetByte(out byte value) => _jsonReader.TryGetByte(out value);
    public bool TryGetBytesFromBase64(out byte[] value) => _jsonReader.TryGetBytesFromBase64(out value);
    public bool TryGetDateTime(out DateTime value) => _jsonReader.TryGetDateTime(out value);
    public bool TryGetDateTimeOffset(out DateTimeOffset value) => _jsonReader.TryGetDateTimeOffset(out value);
    public bool TryGetDecimal(out decimal value) => _jsonReader.TryGetDecimal(out value);
    public bool TryGetDouble(out double value) => _jsonReader.TryGetDouble(out value);
    public bool TryGetGuid(out Guid value) => _jsonReader.TryGetGuid(out value);
    public bool TryGetInt16(out short value) => _jsonReader.TryGetInt16(out value);
    public bool TryGetInt32(out int value) => _jsonReader.TryGetInt32(out value);
    public bool TryGetInt64(out long value) => _jsonReader.TryGetInt64(out value);
    public bool TryGetSByte(out sbyte value) => _jsonReader.TryGetSByte(out value);
    public bool TryGetSingle(out float value) => _jsonReader.TryGetSingle(out value);
    public bool TryGetUInt16(out ushort value) => _jsonReader.TryGetUInt16(out value);
    public bool TryGetUInt32(out uint value) => _jsonReader.TryGetUInt32(out value);
    public bool TryGetUInt64(out ulong value) => _jsonReader.TryGetUInt64(out value);

    // one pooled buffer in the linked list that backs the ReadOnlySequence
    private sealed class SequenceSegment : ReadOnlySequenceSegment<byte>, IDisposable
    {
        internal IMemoryOwner<byte> Buffer { get; }
        internal SequenceSegment? Previous { get; set; }
        private bool _disposed;

        public SequenceSegment(int size, SequenceSegment? previous)
        {
            Buffer = MemoryPool<byte>.Shared.Rent(size);
            Previous = previous;

            Memory = Buffer.Memory;
            RunningIndex = previous?.RunningIndex + previous?.Memory.Length ?? 0;
        }

        public void SetNext(SequenceSegment next) => Next = next;

        public void Dispose()
        {
            if (!_disposed)
            {
                _disposed = true;
                Buffer.Dispose();
                Previous?.Dispose();
            }
        }
    }
}

You can use it as a replacement for Utf8JsonReader, or to deserialize JSON into typed objects (as a wrapper around System.Text.Json.JsonSerializer.Deserialize).

Usage example for deserializing objects from a huge JSON array:

using var stream = new FileStream("LargeData.json", FileMode.Open, FileAccess.Read);
using var jsonStreamReader = new Utf8JsonStreamReader(stream, 32 * 1024);

jsonStreamReader.Read(); // move to array start
jsonStreamReader.Read(); // move to start of the object

while (jsonStreamReader.TokenType != JsonTokenType.EndArray)
{
    // deserialize object
    var obj = jsonStreamReader.Deserialize<TestData>();

    // JsonSerializer.Deserialize ends on last token of the object parsed,
    // move to the first token of next object
    jsonStreamReader.Read();
}

The Deserialize method reads data from the stream until it finds the end of the current object. It then constructs a new Utf8JsonReader over the data that was read and calls JsonSerializer.Deserialize.

The other methods are passed through to Utf8JsonReader.

And, as always, don't forget to dispose of your objects at the end.
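For plain token-level reading, the suggestion from the comments on the question (carry the JsonReaderState from one Utf8JsonReader to the next, and set isFinalBlock only for the last chunk) can be sketched roughly as follows. This is a minimal sketch and not the wrapper above: the ChunkedJsonReader class, the CountTokens method and the buffer-doubling policy are illustrative assumptions, and it only counts tokens instead of processing them.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.Json;

// Hypothetical helper, named for illustration only.
public static class ChunkedJsonReader
{
    // Counts the JSON tokens in a stream while holding only one buffer in memory.
    public static long CountTokens(Stream stream, int bufferSize = 4096)
    {
        byte[] buffer = new byte[bufferSize];
        int leftover = 0;                 // unconsumed bytes kept at the start of the buffer
        long tokens = 0;
        JsonReaderState state = default;  // carried from one reader to the next
        bool isFinalBlock = false;

        while (!isFinalBlock)
        {
            int bytesRead = stream.Read(buffer, leftover, buffer.Length - leftover);
            isFinalBlock = bytesRead == 0; // Read returns 0 only at the end of the stream
            int available = leftover + bytesRead;

            // Pass only the valid part of the buffer, together with the previous state.
            var reader = new Utf8JsonReader(buffer.AsSpan(0, available), isFinalBlock, state);
            while (reader.Read())
            {
                tokens++;
            }

            state = reader.CurrentState;
            int consumed = (int)reader.BytesConsumed;
            leftover = available - consumed;

            if (leftover == buffer.Length)
            {
                // A single token is larger than the buffer: nothing was consumed, so grow it.
                Array.Resize(ref buffer, buffer.Length * 2);
            }
            else if (leftover > 0)
            {
                // Move the incomplete token to the front so the next read appends to it.
                Buffer.BlockCopy(buffer, consumed, buffer, 0, leftover);
            }
        }

        return tokens;
    }
}

public static class ChunkedJsonReaderDemo
{
    public static void Main()
    {
        byte[] json = Encoding.UTF8.GetBytes(
            "[{\"id\":1,\"name\":\"a fairly long string value\"},{\"id\":2}]");
        using var stream = new MemoryStream(json);

        // The tiny buffer forces tokens to straddle chunk boundaries.
        Console.WriteLine(ChunkedJsonReader.CountTokens(stream, bufferSize: 8));
    }
}
```

The code in the question fails because every chunk is parsed with isFinalBlock: true and a default state, so a string that is cut off at a chunk boundary looks like truncated JSON. Keeping the unconsumed tail of each chunk and passing the saved state avoids that.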

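Comments:

Disposing a ref struct is a C# 8 feature. In C# 8 the struct above can be used in a using statement, while in C# 7.2 it cannot. In C# 7.2 you have to dispose of it manually.

How can this be used together with JsonSerializer.Deserialize<T>? I am trying to deserialize complete complex types, one by one, from an array.

@mtosh I am talking about System.Text.Json.JsonSerializer.Deserialize<T>(...). I want to replace Json.Net with System.Text.Json completely, but it seems I would need to rewrite the type deserializer myself.

@Kalten I have added a wrapper around JsonSerializer.Deserialize, along with a usage example, to the answer. Note that JsonSerializer.Deserialize can only deserialize one object at a time, so we have to read the data (but not parse it) until the end of the current object is found, and then construct a Utf8JsonReader over that segment of data.

@mtosh Thanks for this example, it is a great starting point. I have been using it with some large JSON samples and found a few bugs. I posted a fixed version at github.com/evil-dr-nick/utf8jsonstreamreader/blob/master/…

Answer 2: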

If you are using async, there is a method that accepts a stream (plus a generic version):

DeserializeAsync(Stream utf8Json, Type returnType, JsonSerializerOptions options = null, CancellationToken cancellationToken = default);
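A minimal usage sketch of the generic overload, assuming a hypothetical TestData type with Id and Name properties and an in-memory stream standing in for a FileStream. Note that the stream is read incrementally, but the deserialized result is still materialized in memory as a whole, so this only helps when the resulting object graph fits in memory.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical payload type, for illustration only.
public class TestData
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class DeserializeAsyncDemo
{
    // Deserializes a JSON array from any readable stream (for example a FileStream).
    public static async Task<TestData[]> LoadAsync(Stream utf8Json)
    {
        return await JsonSerializer.DeserializeAsync<TestData[]>(utf8Json);
    }

    public static async Task Main()
    {
        // Property names match the C# names exactly, because matching is
        // case-sensitive with the default options.
        byte[] json = Encoding.UTF8.GetBytes(
            "[{\"Id\":1,\"Name\":\"first\"},{\"Id\":2,\"Name\":\"second\"}]");
        using var stream = new MemoryStream(json);

        TestData[] items = await LoadAsync(stream);
        foreach (TestData item in items)
        {
            Console.WriteLine(item.Id + ": " + item.Name);
        }
    }
}
```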
