Failure to Detect Encoding in JSON

Problem

A Spark job fails with an exception containing a message similar to the following:

Invalid UTF-32 character 0x1414141(above 10ffff)  at char #1, byte #7)
    at org.apache.spark.sql.catalyst.json.JacksonParser.parse

Cause

The JSON data source reader can automatically detect the encoding of input JSON files from the byte order mark (BOM) at the beginning of each file. However, a BOM is not mandatory according to the Unicode standard, and RFC 7159 explicitly prohibits it; section 8.1 states:

"...Implementations MUST NOT add a byte order mark to the beginning of a JSON text."

As a consequence, in some cases Spark cannot detect the charset correctly and fails to read the JSON file.
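
To confirm whether a problematic file actually starts with a BOM, you can inspect its first few bytes. The following is a minimal diagnostic sketch in Scala (the file path is hypothetical); it compares the file's leading bytes against the well-known BOM signatures:

import java.io.FileInputStream

object BomCheck {
  // Well-known BOM signatures, longest first so UTF-32LE wins over UTF-16LE.
  val boms: Seq[(String, Array[Byte])] = Seq(
    "UTF-32LE" -> Array[Byte](0xFF.toByte, 0xFE.toByte, 0x00, 0x00),
    "UTF-32BE" -> Array[Byte](0x00, 0x00, 0xFE.toByte, 0xFF.toByte),
    "UTF-8"    -> Array[Byte](0xEF.toByte, 0xBB.toByte, 0xBF.toByte),
    "UTF-16LE" -> Array[Byte](0xFF.toByte, 0xFE.toByte),
    "UTF-16BE" -> Array[Byte](0xFE.toByte, 0xFF.toByte)
  )

  def detectBom(path: String): Option[String] = {
    val in = new FileInputStream(path)
    val head = new Array[Byte](4)
    val n = try in.read(head) finally in.close()
    val prefix = head.take(math.max(n, 0)) // only the bytes actually read
    boms.collectFirst { case (name, sig) if prefix.startsWith(sig) => name }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical local path; point it at the file you are debugging.
    println(detectBom("/tmp/events.json").getOrElse("no BOM found"))
  }
}

If no BOM is present, the reader falls back to guessing the encoding from the content itself, which is where the detection can go wrong.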

Solution

To solve the issue, disable the charset auto-detection mechanism and set the charset explicitly via the encoding option:

.option("encoding", "UTF-16LE")