
I have a flat-text data file containing records whose fields are separated by the non-printable character "File Separator" (0x1C). I am trying to use SQL Server's bcp utility to load this data into my database. Yet, when I use the hex-encoded value of the File Separator as the TERMINATOR, I get a syntax error.

I have tried using

  • the hex encoded value: "0x1c"
  • the XML encoded value as hex: "&#x1c;"
  • the XML encoded value as decimal: "&#28;"

None of these work, yet when using the same encodings for a printable character like tab, they do work: "&#x9;", "&#9;" ("0x9" doesn't, which is not surprising, since this is an XML file).

The conclusion seems to be that non-printable characters are not supported. Is this the case? That would be ironic, since the non-printable separator characters were created for exactly this purpose...

Below you can find all code to reproduce this issue:

XML format file: test.xml

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
  <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="&#x1c;" MAX_LENGTH="10" COLLATION="Latin1_General_CS_AS_WS"/>
  <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="41"/>
 </RECORD>
 <ROW>
  <COLUMN SOURCE="1" NAME="COL1" xsi:type="SQLVARYCHAR"/>
  <COLUMN SOURCE="2" NAME="COL2" xsi:type="SQLNUMERIC" PRECISION="4" SCALE="0"/>
 </ROW>
</BCPFORMAT>

Data file: test.txt
This is just one row, as a test case. StackExchange doesn't show the separator in the row below, yet when you click "Edit" for this post, the separator is included, and you should be able to copy-paste this.

1111111112008

Command line

bcp TEST_DB.dbo.UL_TEST in "test.txt" -T -f "test.xml"
Wouter
    Weird, because [the documentation](https://docs.microsoft.com/en-us/sql/relational-databases/import-export/xml-format-files-sql-server?view=sql-server-2017#AttrOfFieldElement) states `This attribute specifies the terminator of a data field. The terminator can be any character.` – TT. Jun 07 '18 at 13:10
    Maybe this could work? https://dba.stackexchange.com/a/64413/65699. The answer is for non-XML format files though. – TT. Jun 07 '18 at 13:16
  • @TT : Thanks for the suggestion. I've tested this, and it doesn't work for XML format files. I've also tested with normal printable characters, and with 2 characters as per the example. – Wouter Jun 08 '18 at 07:44

2 Answers


It seems, from the documentation about BCP - Specify Field and Row Terminators (SQL Server), that non-printable characters are not supported:

Characters Supported As Terminators

The bcp command, BULK INSERT statement, and the OPENROWSET bulk rowset provider support a variety of characters as field or row terminators and always look for the first instance of each terminator. The following table lists the supported characters for terminators.

(Image in the original answer: the documentation's table of characters supported as terminators.)

As an option, you could use PowerShell to read in the source file, search/replace the 0x1C characters with some other character such as a tilde (~), and write the result to a different file. Then use that character as your terminator.

(Get-Content c:\test\test.txt) | ForEach-Object {$_ -replace [char]0x1C,'~'} | Set-Content c:\test\testout.txt
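
If you take that route, the terminator in the format file has to be changed to match the replacement character. A minimal sketch of the follow-up, assuming the rewritten file from the command above (c:\test\testout.txt) and the tilde as the new separator:

# Assumes FIELD 1 in test.xml was changed to TERMINATOR="~" (instead of "&#x1c;").
# testout.txt is the file produced by the replacement step above.
bcp TEST_DB.dbo.UL_TEST in "c:\test\testout.txt" -T -f "test.xml"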
Scott Hodgin - Retired
  • How arbitrary... Printable characters only, for something that has nothing to do with printing? – TT. Jun 07 '18 at 16:15
  • @Scott Hodgin : The documentation indeed implies that non-printable characters can't be used. It's weird, it sucks, and it should be fixed, but still this is a valid answer to the question :) Thanks – Wouter Jun 08 '18 at 07:49
  • Thx for the powershell. In the meantime I had already worked out something similar, yet slightly more elaborate. Posted it as a separate answer. – Wouter Jun 08 '18 at 08:02
  • After some testing, I've figured out that only printable ASCII characters are supported, so within hex 0x00–0x7F. Any other character in the cp1252 set, like €, is not supported; neither is any other UTF-8 character. – Wouter Jun 08 '18 at 08:46
  • @Wouter - yeah, bummer! Glad you solved it another way :) – Scott Hodgin - Retired Jun 08 '18 at 08:47

As @ScottHodgin pointed out, non-printable characters are not supported, so replacing them in the source file is the only alternative.

Below you can find an example PowerShell script that makes this replacement on a file encoded in UTF-8 (with or without a BOM header, it doesn't matter) and writes an output file encoded in UTF-8 without a BOM header.

Additionally, it uses AppendAllLines to turn this into a streaming operation. That way it also works for huge files, since they don't have to be loaded entirely into memory before the replacements are made. -ReadCount 1000 speeds up the process drastically.

$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False   # UTF-8 without a BOM, for the output file
# Read the input in batches of 1000 lines and append the converted lines to a separate
# output file (test_out.txt is just an example name; appending back to the file being read would not work).
# The replacement character '~' is also an example: pick one that does not occur in your data.
Get-Content -Encoding UTF8 -ReadCount 1000 test.txt |
     Foreach-Object { 
        [System.IO.File]::AppendAllLines(
              [string]'test_out.txt'
            , [string[]]($_ -replace '\x1C', '~')
            , $Utf8NoBomEncoding
        ) 
    }

Make sure you save the script above in a file encoded as UTF-8 with a BOM header, otherwise PowerShell doesn't process the script's characters as UTF-8.
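
For example, assuming the script was saved as replace-fs.ps1 (a file name used here purely for illustration), one way to re-save it with a UTF-8 BOM is:

# Re-save an existing script file with a UTF-8 BOM (the file name is only an example)
$path = (Resolve-Path .\replace-fs.ps1).Path
$text = Get-Content -Raw $path
[System.IO.File]::WriteAllText($path, $text, (New-Object System.Text.UTF8Encoding $True))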

Wouter