System.Console Unexpectedly Uses A UTF-8 Encoding *with BOM ...

Note: #27258 (since fixed) addressed the same issue, but for Unix-like platforms only, because at the time I didn't realize that it should be fixed on Windows also.

On Windows too the preamble (BOM) should be removed from the UTF-8 encoding that the Console class uses when code page 65001 (chcp 65001) is in effect, given that cmd.exe - rightfully - has always operated without BOM in that case:

After all, any programs relying on the active code page should be able to blindly assume that stdin input / stdout output is encoded accordingly and shouldn't have to deal with a BOM (neither to redundantly imply the same code page nor to potentially signal a different encoding).

The presence of a BOM, in fact, breaks PowerShell's background-jobs feature (via Start-Job), as detailed here:

  • https://windowsserver.uservoice.com/forums/301869-powershell/suggestions/14915283-job-cmdlets-fail-with-utf-8-codepage

Follow-on bugs:

  • PowerShell task - remove BOM from [Console]::InputEncoding to prevent issues with Start-Job microsoft/azure-pipelines-tasks#9766
  • Powershell scripts that use Start-Job fail on Ansible 2.0+ ansible/ansible#15770

It's fair to assume that other applications/languages/frameworks may be affected as well, which can range from outright failure, as in PowerShell's case, to mistakenly considering the BOM part of the data.

Demonstration of cmd's current behavior:

cmd's stdout output has always created BOM-less output with chcp 65001 in effect; e.g. the following creates t.txt as a BOM-less UTF-8 file.

C:\> chcp 65001 C:\> (echo hü)> t.txt C:\> powershell -noprofile -c Format-Hex t.txt Path: C:\Users\jdoe\t.txt 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 00000000 68 C3 BC 0D 0A hü..

The above shows that the file contains no BOM (and that char. ü was correctly encoded as UTF-8 as 2-byte sequence 0xC3 0xBC).

Từ khóa » Chcp 65001 Bom