Parsing a JSON file with .NET Core 3.0/System.Text.Json
【Title】: Parsing a JSON file with .NET Core 3.0/System.Text.Json
【Posted】: 2019-03-04 12:42:57
【Question】:
I'm trying to read and parse a large JSON file that won't fit in memory, using System.Text.Json, the new JSON reader in .NET Core 3.0.
Microsoft's sample code takes a ReadOnlySpan<byte> as input:
public static void Utf8JsonReaderLoop(ReadOnlySpan<byte> dataUtf8)
{
    var json = new Utf8JsonReader(dataUtf8, isFinalBlock: true, state: default);

    while (json.Read())
    {
        JsonTokenType tokenType = json.TokenType;
        ReadOnlySpan<byte> valueSpan = json.ValueSpan;
        switch (tokenType)
        {
            case JsonTokenType.StartObject:
            case JsonTokenType.EndObject:
                break;
            case JsonTokenType.StartArray:
            case JsonTokenType.EndArray:
                break;
            case JsonTokenType.PropertyName:
                break;
            case JsonTokenType.String:
                string valueString = json.GetString();
                break;
            case JsonTokenType.Number:
                if (!json.TryGetInt32(out int valueInteger))
                {
                    throw new FormatException();
                }
                break;
            case JsonTokenType.True:
            case JsonTokenType.False:
                bool valueBool = json.GetBoolean();
                break;
            case JsonTokenType.Null:
                break;
            default:
                throw new ArgumentException();
        }
    }

    dataUtf8 = dataUtf8.Slice((int)json.BytesConsumed);
    JsonReaderState state = json.CurrentState;
}
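What I'm struggling to work out is how to actually use this code with a FileStream, that is, how to get from a FileStream to a ReadOnlySpan<byte>.
I tried reading the file with the following code, called as ReadAndProcessLargeFile("latest-all.json");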
const int megabyte = 1024 * 1024;

public static void ReadAndProcessLargeFile(string theFilename, long whereToStartReading = 0)
{
    FileStream fileStram = new FileStream(theFilename, FileMode.Open, FileAccess.Read);
    using (fileStram)
    {
        byte[] buffer = new byte[megabyte];
        fileStram.Seek(whereToStartReading, SeekOrigin.Begin);
        int bytesRead = fileStram.Read(buffer, 0, megabyte);
        while (bytesRead > 0)
        {
            ProcessChunk(buffer, bytesRead);
            bytesRead = fileStram.Read(buffer, 0, megabyte);
        }
    }
}

private static void ProcessChunk(byte[] buffer, int bytesRead)
{
    var span = new ReadOnlySpan<byte>(buffer);
    Utf8JsonReaderLoop(span);
}
It crashes with the error message
System.Text.Json.JsonReaderException: 'Expected end of string, but instead reached end of data. LineNumber: 8 | BytePositionInLine: 123335.'
For reference, here is my working code using Newtonsoft.Json:
dynamic o;
var serializer = new Newtonsoft.Json.JsonSerializer();
using (FileStream s = File.Open("latest-all.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (reader.Read())
    {
        if (reader.TokenType == JsonToken.StartObject)
        {
            o = serializer.Deserialize(reader);
        }
    }
}
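【Comments】:
ProcessChunk doesn't use bytesRead. I think you also need to pass the state from the previous Utf8JsonReader into the next Utf8JsonReader ctor, and correctly indicate whether you are giving it the final block.
Also, Stream.Read can take a Span<byte> as well as a byte[].
So... why aren't you using Utf8JsonReader.Parse(Stream, JsonReaderOptions)? I guess that, no matter how you feed in the data, the question is whether the resulting object fits in memory. If it does, the stream parser should work too.
The JSON file is a WikiData dump of roughly 800GB. Each entity I want to parse is small, as described at mediawiki.org/wiki/Wikibase/DataModel/JSON. I can't seem to find Utf8JsonReader.Parse?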
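The first comment's advice (honor the number of bytes actually read, carry the JsonReaderState from one Utf8JsonReader to the next, and set isFinalBlock only at end of stream) might look roughly like the sketch below. The ChunkedJsonReader class, the CountTokens method, and the buffer-doubling policy are assumptions made for illustration and are not part of the original thread; the sketch only counts tokens where real code would process them.

```csharp
using System;
using System.IO;
using System.Text.Json;

// Hypothetical helper, not from the original thread.
public static class ChunkedJsonReader
{
    // Counts the JSON tokens in a stream without loading the whole stream into memory.
    public static long CountTokens(Stream stream, int bufferSize = 4096)
    {
        byte[] buffer = new byte[bufferSize];
        int bytesInBuffer = 0;            // number of valid bytes at the start of buffer
        bool isFinalBlock = false;
        JsonReaderState state = default;  // carried over between reader instances
        long tokens = 0;

        while (!isFinalBlock)
        {
            // Fill the free tail of the buffer; a return value of 0 means end of stream.
            int read = stream.Read(buffer.AsSpan(bytesInBuffer));
            if (read == 0)
            {
                isFinalBlock = true;
            }
            bytesInBuffer += read;

            // Only the valid part of the buffer is handed to the reader.
            var reader = new Utf8JsonReader(buffer.AsSpan(0, bytesInBuffer), isFinalBlock, state);
            while (reader.Read())
            {
                tokens++;
            }

            // Remember where parsing stopped and keep the unconsumed bytes.
            state = reader.CurrentState;
            int consumed = (int)reader.BytesConsumed;
            int leftover = bytesInBuffer - consumed;

            if (leftover == buffer.Length)
            {
                // A single token does not fit into the buffer: grow it (assumed policy).
                Array.Resize(ref buffer, buffer.Length * 2);
            }
            else if (leftover > 0)
            {
                // Move the partial token to the front so the next read appends to it.
                buffer.AsSpan(consumed, leftover).CopyTo(buffer);
            }
            bytesInBuffer = leftover;
        }

        return tokens;
    }
}
```

With isFinalBlock set to false, the reader returns false instead of throwing when a token is cut off at the end of the buffer, which is what avoids the "Expected end of string" exception from the question.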
【Answer 1】:
Update 2019-10-13: rewrote Utf8JsonStreamReader to use ReadOnlySequences internally and added a wrapper for the JsonSerializer.Deserialize method.
I have created a wrapper around Utf8JsonReader for exactly this purpose:
using System;
using System.Buffers;
using System.IO;
using System.Text.Json;

// Requires C# 8 with nullable reference types enabled.
public ref struct Utf8JsonStreamReader
{
    private readonly Stream _stream;
    private readonly int _bufferSize;

    private SequenceSegment? _firstSegment;
    private int _firstSegmentStartIndex;
    private SequenceSegment? _lastSegment;
    private int _lastSegmentEndIndex;

    private Utf8JsonReader _jsonReader;
    private bool _keepBuffers;
    private bool _isFinalBlock;

    public Utf8JsonStreamReader(Stream stream, int bufferSize)
    {
        _stream = stream;
        _bufferSize = bufferSize;

        _firstSegment = null;
        _firstSegmentStartIndex = 0;
        _lastSegment = null;
        _lastSegmentEndIndex = -1;

        _jsonReader = default;
        _keepBuffers = false;
        _isFinalBlock = false;
    }

    public bool Read()
    {
        // read could be unsuccessful due to insufficient buffer size, retrying in loop with additional buffer segments
        while (!_jsonReader.Read())
        {
            if (_isFinalBlock)
                return false;

            MoveNext();
        }

        return true;
    }

    private void MoveNext()
    {
        var firstSegment = _firstSegment;
        _firstSegmentStartIndex += (int)_jsonReader.BytesConsumed;

        // release previous segments if possible
        if (!_keepBuffers)
        {
            while (firstSegment?.Memory.Length <= _firstSegmentStartIndex)
            {
                _firstSegmentStartIndex -= firstSegment.Memory.Length;
                firstSegment.Dispose();
                firstSegment = (SequenceSegment?)firstSegment.Next;
            }
        }

        // create new segment
        var newSegment = new SequenceSegment(_bufferSize, _lastSegment);

        if (firstSegment != null)
        {
            _firstSegment = firstSegment;
            newSegment.Previous = _lastSegment;
            _lastSegment?.SetNext(newSegment);
            _lastSegment = newSegment;
        }
        else
        {
            _firstSegment = _lastSegment = newSegment;
            _firstSegmentStartIndex = 0;
        }

        // read data from stream
        _lastSegmentEndIndex = _stream.Read(newSegment.Buffer.Memory.Span);
        _isFinalBlock = _lastSegmentEndIndex < newSegment.Buffer.Memory.Length;
        _jsonReader = new Utf8JsonReader(new ReadOnlySequence<byte>(_firstSegment, _firstSegmentStartIndex, _lastSegment, _lastSegmentEndIndex), _isFinalBlock, _jsonReader.CurrentState);
    }

    public T Deserialize<T>(JsonSerializerOptions? options = null)
    {
        // JsonSerializer.Deserialize can read only a single object. We have to extract
        // object to be deserialized into separate Utf8JsonReader. This incurs one additional
        // pass through data (but data is only passed, not parsed).
        var tokenStartIndex = _jsonReader.TokenStartIndex;
        var firstSegment = _firstSegment;
        var firstSegmentStartIndex = _firstSegmentStartIndex;

        // loop through data until end of object is found
        _keepBuffers = true;
        int depth = 0;

        if (TokenType == JsonTokenType.StartObject || TokenType == JsonTokenType.StartArray)
            depth++;

        while (depth > 0 && Read())
        {
            if (TokenType == JsonTokenType.StartObject || TokenType == JsonTokenType.StartArray)
                depth++;
            else if (TokenType == JsonTokenType.EndObject || TokenType == JsonTokenType.EndArray)
                depth--;
        }

        _keepBuffers = false;

        // end of object found, extract json reader for deserializer
        var newJsonReader = new Utf8JsonReader(new ReadOnlySequence<byte>(firstSegment!, firstSegmentStartIndex, _lastSegment!, _lastSegmentEndIndex).Slice(tokenStartIndex, _jsonReader.Position), true, default);

        // deserialize value
        var result = JsonSerializer.Deserialize<T>(ref newJsonReader, options);

        // release memory if possible
        firstSegmentStartIndex = _firstSegmentStartIndex + (int)_jsonReader.BytesConsumed;

        while (firstSegment?.Memory.Length < firstSegmentStartIndex)
        {
            firstSegmentStartIndex -= firstSegment.Memory.Length;
            firstSegment.Dispose();
            firstSegment = (SequenceSegment?)firstSegment.Next;
        }

        if (firstSegment != _firstSegment)
        {
            _firstSegment = firstSegment;
            _firstSegmentStartIndex = firstSegmentStartIndex;
            _jsonReader = new Utf8JsonReader(new ReadOnlySequence<byte>(_firstSegment!, _firstSegmentStartIndex, _lastSegment!, _lastSegmentEndIndex), _isFinalBlock, _jsonReader.CurrentState);
        }

        return result;
    }

    public void Dispose() => _lastSegment?.Dispose();

    // pass-through members of the underlying Utf8JsonReader
    public int CurrentDepth => _jsonReader.CurrentDepth;
    public bool HasValueSequence => _jsonReader.HasValueSequence;
    public long TokenStartIndex => _jsonReader.TokenStartIndex;
    public JsonTokenType TokenType => _jsonReader.TokenType;
    public ReadOnlySequence<byte> ValueSequence => _jsonReader.ValueSequence;
    public ReadOnlySpan<byte> ValueSpan => _jsonReader.ValueSpan;

    public bool GetBoolean() => _jsonReader.GetBoolean();
    public byte GetByte() => _jsonReader.GetByte();
    public byte[] GetBytesFromBase64() => _jsonReader.GetBytesFromBase64();
    public string GetComment() => _jsonReader.GetComment();
    public DateTime GetDateTime() => _jsonReader.GetDateTime();
    public DateTimeOffset GetDateTimeOffset() => _jsonReader.GetDateTimeOffset();
    public decimal GetDecimal() => _jsonReader.GetDecimal();
    public double GetDouble() => _jsonReader.GetDouble();
    public Guid GetGuid() => _jsonReader.GetGuid();
    public short GetInt16() => _jsonReader.GetInt16();
    public int GetInt32() => _jsonReader.GetInt32();
    public long GetInt64() => _jsonReader.GetInt64();
    public sbyte GetSByte() => _jsonReader.GetSByte();
    public float GetSingle() => _jsonReader.GetSingle();
    public string GetString() => _jsonReader.GetString();
    public uint GetUInt32() => _jsonReader.GetUInt32();
    public ulong GetUInt64() => _jsonReader.GetUInt64();

    // originally misnamed TryGetDecimal(out byte value); renamed to match the call it forwards to
    public bool TryGetByte(out byte value) => _jsonReader.TryGetByte(out value);
    public bool TryGetBytesFromBase64(out byte[] value) => _jsonReader.TryGetBytesFromBase64(out value);
    public bool TryGetDateTime(out DateTime value) => _jsonReader.TryGetDateTime(out value);
    public bool TryGetDateTimeOffset(out DateTimeOffset value) => _jsonReader.TryGetDateTimeOffset(out value);
    public bool TryGetDecimal(out decimal value) => _jsonReader.TryGetDecimal(out value);
    public bool TryGetDouble(out double value) => _jsonReader.TryGetDouble(out value);
    public bool TryGetGuid(out Guid value) => _jsonReader.TryGetGuid(out value);
    public bool TryGetInt16(out short value) => _jsonReader.TryGetInt16(out value);
    public bool TryGetInt32(out int value) => _jsonReader.TryGetInt32(out value);
    public bool TryGetInt64(out long value) => _jsonReader.TryGetInt64(out value);
    public bool TryGetSByte(out sbyte value) => _jsonReader.TryGetSByte(out value);
    public bool TryGetSingle(out float value) => _jsonReader.TryGetSingle(out value);
    public bool TryGetUInt16(out ushort value) => _jsonReader.TryGetUInt16(out value);
    public bool TryGetUInt32(out uint value) => _jsonReader.TryGetUInt32(out value);
    public bool TryGetUInt64(out ulong value) => _jsonReader.TryGetUInt64(out value);

    private sealed class SequenceSegment : ReadOnlySequenceSegment<byte>, IDisposable
    {
        internal IMemoryOwner<byte> Buffer { get; }
        internal SequenceSegment? Previous { get; set; }
        private bool _disposed;

        public SequenceSegment(int size, SequenceSegment? previous)
        {
            Buffer = MemoryPool<byte>.Shared.Rent(size);
            Previous = previous;

            Memory = Buffer.Memory;
            RunningIndex = previous?.RunningIndex + previous?.Memory.Length ?? 0;
        }

        public void SetNext(SequenceSegment next) => Next = next;

        public void Dispose()
        {
            if (!_disposed)
            {
                _disposed = true;
                Buffer.Dispose();
                Previous?.Dispose();
            }
        }
    }
}
You can use it as a replacement for Utf8JsonReader, or to deserialize JSON into typed objects (as a wrapper around System.Text.Json.JsonSerializer.Deserialize).
Usage example for deserializing objects from a huge JSON array:
using var stream = new FileStream("LargeData.json", FileMode.Open, FileAccess.Read);
using var jsonStreamReader = new Utf8JsonStreamReader(stream, 32 * 1024);

jsonStreamReader.Read(); // move to array start
jsonStreamReader.Read(); // move to start of the object

while (jsonStreamReader.TokenType != JsonTokenType.EndArray)
{
    // deserialize object
    var obj = jsonStreamReader.Deserialize<TestData>();

    // JsonSerializer.Deserialize ends on last token of the object parsed,
    // move to the first token of next object
    jsonStreamReader.Read();
}
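The Deserialize method reads data from the stream until it finds the end of the current object. It then constructs a new Utf8JsonReader over the data it has read and calls JsonSerializer.Deserialize.
All other methods are passed through to Utf8JsonReader.
And, as always, don't forget to dispose your objects when you are done.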
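【Comments】:
Ref struct dispose is a C# 8 feature. In C# 8 the struct above can be used in a using statement; in C# 7.2 it cannot, so there you have to dispose it manually.
How do I use this together with JsonSerializer.Deserialize<T>? I'm trying to deserialize complete complex types from an array, one at a time.
@mtosh I'm talking about System.Text.Json.JsonSerializer.Deserialize<T>(...). I want to replace Json.Net entirely with System.Text.Json, but it seems I would have to rewrite the type deserializer myself.
@Kalten I've added a wrapper around JsonSerializer.Deserialize, plus a usage example, to the answer. Note that JsonSerializer.Deserialize can only deserialize one object at a time, so we have to read through the data (without parsing it) until the end of the current object is found, and then construct a Utf8JsonReader over that segment of data.
@mtosh Thanks for this sample, it's a great starting point. I've been using it with some large JSON samples and found a few bugs. I've posted a fixed version at github.com/evil-dr-nick/utf8jsonstreamreader/blob/master/…
【Answer 2】: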
If you're using async, there is a method that accepts a stream (plus a generic version):
DeserializeAsync(Stream utf8Json, Type returnType, JsonSerializerOptions options = null, CancellationToken cancellationToken = default);
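As a minimal sketch of how that overload's generic counterpart might be called: the AsyncJsonLoader class and LoadAsync method below are hypothetical names chosen for illustration. Note that although DeserializeAsync reads the stream incrementally, it still materializes the complete result object in memory, so on its own it does not address a file that is too large to fit.

```csharp
using System.IO;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical helper, not from the original thread.
public static class AsyncJsonLoader
{
    // Deserializes the whole file into a single T using the stream-based async API.
    public static async Task<T> LoadAsync<T>(string path)
    {
        // useAsync: true opens the file for asynchronous I/O.
        await using var stream = new FileStream(
            path, FileMode.Open, FileAccess.Read, FileShare.Read,
            bufferSize: 4096, useAsync: true);

        return await JsonSerializer.DeserializeAsync<T>(stream);
    }
}
```

A call such as `int[] values = await AsyncJsonLoader.LoadAsync<int[]>("numbers.json");` would read the file (the file name is made up here) and return the array once the entire document has been parsed.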