Fix PHP cURL: parser error: Document labelled UTF-16 but has UTF-8 content
What?
This article is a set of notes for myself on converting received XML encoded in UTF-16 into JSON encoded in UTF-8. If the XML were entirely in UTF-8, I would simply load it with SimpleXML and pass it to PHP's built-in json_encode function. Instead, I ran into the following errors:
Warning: SimpleXMLElement::__construct() [<a href='simplexmlelement.--construct'>simplexmlelement.--construct</a>]: Entity: line 1: parser error : Document labelled UTF-16 but has UTF-8 content in /public_html/.../.../my_script.php on line ###
Warning: simplexml_load_string() [<a href='function.simplexml-load-string'>function.simplexml-load-string</a>]: Entity: line 1: parser error : Document labelled UTF-16 but has UTF-8 content in /public_html/.../.../my_script.php on line ###
Why?
I've googled, binged and yahoo'd for this. There are some solutions that deal with loading UTF-16 content into SimpleXMLElement or simplexml_load_string, but none of them solved my problem. I'm receiving XML data in a cURL result, and I get the above error with both "SimpleXMLElement" and "simplexml_load_string". Returning the XML with cURL isn't a problem, but I want to convert it to JSON, and I usually do that by loading the data into an XML object and passing it to the built-in PHP function "json_encode".
How?
So here's what I tried and ended up with:
If your XML is UTF-8
This is the basic code and will work to fetch some XML and return it in JSON formatting as long as the XML is encoded in UTF-8.
// set headers for JSON file
// header('Content-Type: application/json'); // seems to cause 500 Internal Server Error
header('Content-Type: text/javascript');
header('Access-Control-Allow-Origin: http://api.joellipman.com');       // an origin is scheme + host only, without a trailing slash
header('Access-Control-Max-Age: 3628800');
header('Access-Control-Allow-Methods: GET, POST, PUT, DELETE');
// open connection
$ch = curl_init();
// set the cURL options
curl_setopt($ch, CURLOPT_URL, $api_url);                                // where to send the variables to
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));  // specify content type of what we're sending
curl_setopt($ch, CURLOPT_HEADER, 0);                                    // hide header info !!SECURITY WARNING!!
curl_setopt($ch, CURLOPT_POST, TRUE);                                   // TRUE to do a regular HTTP POST.
curl_setopt($ch, CURLOPT_POSTFIELDS, $api_message_xml);                 // In my case, the XML form that will be submitted
curl_setopt($ch, CURLOPT_TIMEOUT, 15);                                  // Target API has a 15 second timeout
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);                         // TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.
// store the response
$ch_result = curl_exec($ch);
// close connection
curl_close($ch);
// convert the response to xml
$xml_result = simplexml_load_string($ch_result) or die("Error: Cannot create object");
// convert the xml to json
$json_result = json_encode($xml_result);
// print the json
echo $json_result;
// [OPTIONAL] convert it to an array
// $array = json_decode($json_result,TRUE);
// yields <?xml version="1.0" encoding="utf-8"?> ... ... ...
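The "or die" check above only reports that parsing failed. It can also trigger on success, because the PHP manual warns that simplexml_load_string may return a value that evaluates to false. A minimal sketch of a stricter check, assuming PHP 7 or later; the helper name xml_string_to_json is hypothetical and the sample document is made up for illustration:

```php
<?php
// Hypothetical helper: convert an XML string to JSON and report libxml's
// own error messages instead of emitting PHP warnings.
function xml_string_to_json(string $xml): string
{
    // Collect libxml errors internally rather than printing warnings
    $previous = libxml_use_internal_errors(true);
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        $messages = [];
        foreach (libxml_get_errors() as $error) {
            $messages[] = trim($error->message);
        }
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
        throw new RuntimeException('XML parse failed: ' . implode('; ', $messages));
    }
    libxml_use_internal_errors($previous);
    return json_encode($doc);
}

// Illustrative input, not the real API response
echo xml_string_to_json('<?xml version="1.0" encoding="utf-8"?><Response><Status>OK</Status></Response>');
```

Comparing against false with === keeps a valid but empty document from being treated as a failure.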
Without cURL
You'll have seen this all over the Internet as the accepted solution... Doesn't work for me because I'm using cURL but it's a first point of reference. This will work if the received XML is a string.
// setting XML value
$string = '<?xml version="1.0" encoding="utf-16"?>
  <Response Version="1.0">
    <DateTime>2/13/2013 10:37:24 PM
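A minimal sketch of this string-based approach, assuming a short made-up document in place of the full response. The element names are illustrative only. A PHP string literal is stored as single-byte text, so only the label in the XML declaration needs to change:

```php
<?php
// Illustrative XML held in a PHP string: labelled utf-16, but the bytes are plain ASCII
$string = '<?xml version="1.0" encoding="utf-16"?><Response Version="1.0"><Status>OK</Status></Response>';

// Rewrite the encoding label in the XML declaration from utf-16 to utf-8
$xml_utf8 = preg_replace('/(<\?xml[^?]+?)utf-16/i', '$1utf-8', $string);

// The label now matches the content, so SimpleXML can load it
$xml_result = simplexml_load_string($xml_utf8);

// Convert the XML to JSON and print it
echo json_encode($xml_result);
```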
With cURL: Other things I tried
ERROR: Using the above preg_replace function
/* Replace UTF-16 with UTF-8 */
$xml_utf8 = preg_replace('/(<\?xml[^?]+?)utf-16/i', '$1utf-8', $ch_result);
$xml_result = simplexml_load_string($xml_utf8);
// yields error 'Entity: line 1: parser error : Document labelled UTF-16 but has UTF-8 content in /public_html/.../.../my_script.php on line ###'
// to catch error use: $xml_result = simplexml_load_string($ch_result) or die("Error: Cannot create object");
ERROR: Using built-in function mb_convert_encoding
/* Convert the UTF-16 to UTF-8: Using function mb_convert_encoding */
$xml_utf8 = mb_convert_encoding($ch_result, 'UTF-8', 'UTF-16');
// yields error 'parser error : Start tag expected, '<' not found in /public_html/.../.../my_script.php on line ###'
ERROR: Using built-in function utf8_encode
/* Convert the UTF-16 to UTF-8 using a function */
$xml_utf8 = utf8_encode($ch_result);
// yields error 'Entity: line 1: parser error : Document labelled UTF-16 but has UTF-8 content in /public_html/.../.../my_script.php on line ###'
ERROR: A potential function to re-encode it from Craig Lotter
/* Convert the UTF-16 to UTF-8 using a function */
$xml_utf8 = utf16_to_utf8($ch_result);
// yields error 'parser error : Start tag expected, '<' not found in /public_html/.../.../my_script.php on line ###'
// also yields: ??? 呭㤳䥆汶摓䉄套㑧唲噬䥅ㅬ䥑㴽
ERRORS: A 2-Hour play around
/* Encode received cURL result in a JSON feed */
$json_encoded_str = json_encode($ch_result);
/* Convert the UTF-16 to UTF-8 using a function */
$json_encoded_str_8 = (string) utf8_encode($json_encoded_str);
/* In the XML, replace the UTF-16 with UTF-8 */
$json_encoded_str = preg_replace('/(<\?xml[^?]+?)utf-16/i', '$1utf-8', $json_encoded_str_8);  
/* Encode the result as JSON a second time */
$json_encoded = json_encode($json_encoded_str);
// yields escaped JSON: "<?xml version=\"1.0\" encoding=\"utf-16\"?><soap:Envelope
ERROR: Using built-in function iconv. Another 4-hour saga
/* Convert the UTF-16 to UTF-8 using a function */
$xml_utf8 = iconv('UTF-16', 'UTF-8', $ch_result);
// $xml_utf8 = iconv('UTF-16BE', 'UTF-8', $ch_result); // same result specifying Big-Endian
// yields error 'error on line 1 at column 1: Document is empty'
// but view the source: 㼼浸敶獲潩㵮ㄢ〮•湥潣楤杮∽瑵ⵦ㘱㼢㰾潳灡䔺癮汥灯浸湬㩳獸㵩栢瑴㩰⼯睷㍷漮杲㈯
// OTHER ERRORS:
// error on line 1 at column 1: Document is empty
// error on line 2 at column 1: Extra content at the end of the document
// error on line 2 at column 1: Encoding error
// error on line 1 at column 491: xmlParseEntityRef: no name
// this is because you need to escape the 5 characters (", ', <, >, &) in XML
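The characters in the page source above look like plain "<?xml ..." text whose bytes have been read two at a time as UTF-16. That would suggest the response body was already single-byte text and only the label said UTF-16, though I haven't confirmed this against the raw response. A small sketch that reproduces the effect on a four-byte sample, assuming the mbstring extension is loaded:

```php
<?php
// Four ASCII bytes from the start of an XML declaration
$ascii = '<?xm';

// Misread those bytes as two UTF-16 little-endian code units
$misread = mb_convert_encoding($ascii, 'UTF-8', 'UTF-16LE');

echo bin2hex($ascii), PHP_EOL;      // 3c3f786d
echo mb_strlen($misread, 'UTF-8'), PHP_EOL; // 2 characters instead of 4
echo $misread, PHP_EOL;             // the code points U+3F3C and U+6D78
```

The bytes 3c 3f pair up as U+3F3C and 78 6d as U+6D78, which matches the first two characters in the garbled source.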
NOT-QUITE-RIGHT: Use a Parser and re-Output the XML
// Create an XML parser
$parser = xml_parser_create();
// Stop returning elements in UPPERCASE
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
// Parse XML data into an array structure
xml_parse_into_struct($parser, str_replace(array("\n", "\r", "\t"), '', $ch_result), $structure);
// Free the XML parser
xml_parser_free($parser);
// create XML string from parsed XML
$xml_string = '';
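// '&' must come first so the entities added by the later replacements are not escaped again
$xml_escaped_chars = array('&', '"', '\'', '<', '>');
$xml_escaped_chars_rep = array('&amp;', '&quot;', '&apos;', '&lt;', '&gt;');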
foreach($structure as $xml_element){
        $this_value = (isset($xml_element['value'])) ? str_replace($xml_escaped_chars, $xml_escaped_chars_rep, trim($xml_element['value'])) : '';
        $this_attr = (isset($xml_element['attributes'])) ? $xml_element['attributes'] : array();
        $this_attr_str = '';
        if (count($this_attr)>0){
                foreach($this_attr as $attr_key => $attr_value){
                        $this_attr_str.= ' '.$attr_key.'="'.$attr_value.'"';
                }
        }
        if ($xml_element['type']=='open'){
                $xml_string.='<'.$xml_element['tag'].$this_attr_str.'>';
        } else if ($xml_element['type']=='complete'){
                $xml_string.='<'.$xml_element['tag'].$this_attr_str.'>'.$this_value.'</'.$xml_element['tag'].'>';
        } else if ($xml_element['type']=='close'){
                $xml_string.='</'.$xml_element['tag'].'>';
        }
}
// $simple_xml = simplexml_load_string($xml_string);  // still fails (not UTF-8)
echo '<?xml version="1.0" encoding="utf-8"?>'.utf8_encode($xml_string);
// yields <?xml version="1.0" encoding="utf-8"?> ... ... ... (corrupted?)
So...
With cURL - a solution with a compromise
After many more hours, I found a way to take UTF-16 XML from a cURL source and convert it to JSON. The output isn't necessarily in UTF-8, so I'll update this article if the mobile app has problems reading the JSON feed. While writing the loop for the "not-quite-right" solution above, I found the following function in a discussion thread titled "Integrating symphony website with external api [whmcs]":
// set headers for JSON file
header('Content-Type: text/javascript; charset=utf-8');
header('Access-Control-Allow-Origin: http://api.joellipman.com');
header('Access-Control-Max-Age: 3628800');
header('Access-Control-Allow-Methods: GET, POST, PUT, DELETE');
// the function that will convert our XML to an array
function whmcsapi_xml_parser($rawxml) {
    $xml_parser = xml_parser_create();
    xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 0);     // stop elements being converted to UPPERCASE
    xml_parse_into_struct($xml_parser, $rawxml, $vals, $index);
    xml_parser_free($xml_parser);
    $params = array();
    $level = array();
    $alreadyused = array();
    $x=0;
    foreach ($vals as $xml_elem) {
      if ($xml_elem['type'] == 'open') {
         if (in_array($xml_elem['tag'],$alreadyused)) {
            $x++;
            $xml_elem['tag'] = $xml_elem['tag'].$x;
         }
         $level[$xml_elem['level']] = $xml_elem['tag'];
         $alreadyused[] = $xml_elem['tag'];
      }
      if ($xml_elem['type'] == 'complete') {
       $start_level = 1;
       $php_stmt = '$params';
       while($start_level < $xml_elem['level']) {
         $php_stmt .= '[$level['.$start_level.']]';
         $start_level++;
       }
       $php_stmt .= '[$xml_elem[\'tag\']] = $xml_elem[\'value\'];';
       @eval($php_stmt);
      }
    }
    return($params);
}
// open connection
$ch = curl_init();
// set the cURL options
curl_setopt($ch, CURLOPT_URL, $api_url);                                // where to send the variables to
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));  // specify content type of what we're sending
curl_setopt($ch, CURLOPT_HEADER, 0);                                    // hide header info !!SECURITY WARNING!!
curl_setopt($ch, CURLOPT_POST, TRUE);                                   // TRUE to do a regular HTTP POST.
curl_setopt($ch, CURLOPT_POSTFIELDS, $api_message_xml);                 // In my case, the XML form that will be submitted
curl_setopt($ch, CURLOPT_TIMEOUT, 15);                                  // Target API has a 15 second timeout
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);                         // TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.
// store the response
$ch_result = curl_exec($ch);
// close connection
curl_close($ch);
// parse XML with the whmcsapi_xml_parser function
$whmcsapi_arr = whmcsapi_xml_parser($ch_result); 
// Output returned value as Array
// print_r($whmcsapi_arr); 
// Encode in JSON
$json_whmcsapi = json_encode((array) $whmcsapi_arr);
echo $json_whmcsapi;
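The function above builds a PHP statement as a string and runs it through eval. A possible eval-free variant is sketched here, assuming PHP 7 or later and the xml extension; the name xml_struct_to_array is hypothetical, and this is my own reworking rather than code from the discussion thread. It walks the same structure and fills the nested array through a reference:

```php
<?php
// Hypothetical eval-free variant: turn an XML string into a nested array.
function xml_struct_to_array(string $rawxml): array
{
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0); // keep tag names as written
    xml_parse_into_struct($parser, $rawxml, $vals);
    unset($parser);

    $params = [];
    $path = [];   // tag name of the open element at each nesting level
    $seen = [];   // open tags already used, to rename repeats as the original does
    $counter = 0;
    foreach ($vals as $el) {
        if ($el['type'] === 'open') {
            $tag = $el['tag'];
            if (in_array($tag, $seen, true)) {
                $tag .= ++$counter;
            }
            $path[$el['level']] = $tag;
            $seen[] = $tag;
        } elseif ($el['type'] === 'complete') {
            // Walk down from the root to the parent of this element
            $node = &$params;
            for ($depth = 1; $depth < $el['level']; $depth++) {
                $key = $path[$depth];
                if (!isset($node[$key]) || !is_array($node[$key])) {
                    $node[$key] = [];
                }
                $node = &$node[$key];
            }
            $node[$el['tag']] = $el['value'] ?? '';
            unset($node); // drop the reference before the next element
        }
    }
    return $params;
}

// Illustrative input
echo json_encode(xml_struct_to_array('<a><b>1</b><c><d>2</d></c></a>'));
```

Like the original, this keeps only the last value when sibling leaf elements share a tag name.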
Off-Topic
But good snippet for cURL by David Walsh
// set POST variables
$url = 'http://domain.com/get-post.php';
$fields = array(
        'lname' => urlencode($last_name),
        'fname' => urlencode($first_name),
        'title' => urlencode($title),
        'company' => urlencode($institution),
        'age' => urlencode($age),
        'email' => urlencode($email),
        'phone' => urlencode($phone)
);
// url-ify the data for the POST
$fields_string = '';
foreach($fields as $key=>$value) { $fields_string .= $key.'='.$value.'&'; }
// rtrim() returns the trimmed string, so assign it back
$fields_string = rtrim($fields_string, '&');
// open connection
$ch = curl_init();
// set the url, number of POST vars, POST data
curl_setopt($ch,CURLOPT_URL, $url);
curl_setopt($ch,CURLOPT_POST, count($fields));
curl_setopt($ch,CURLOPT_POSTFIELDS, $fields_string);
// execute post
$result = curl_exec($ch);
// close connection
curl_close($ch);
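As an aside, PHP's built-in http_build_query can do the url-ify step in one call. A short sketch with made-up values; note that the values are passed in raw, because http_build_query URL-encodes them itself:

```php
<?php
// Illustrative values only; pass them unencoded
$fields = array(
    'lname' => "O'Brien",
    'fname' => 'Mary Ann',
);

// URL-encodes each value and joins the pairs with '&'
$fields_string = http_build_query($fields);

echo $fields_string; // lname=O%27Brien&fname=Mary+Ann
```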
Things I stumbled upon regarding SSL and cURL
Third-party APIs often require data to be posted over SSL, so these options may come in handy:
// Disable certificate checks: for testing only, as this removes protection against man-in-the-middle attacks
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
// TRUE to output SSL certification information to STDERR on secure transfers.
curl_setopt($ch, CURLOPT_CERTINFO, TRUE);
// The SSL version is set through CURLOPT_SSLVERSION; SSLv3 is obsolete and many servers refuse it
curl_setopt($ch, CURLOPT_SSLVERSION, CURL_SSLVERSION_SSLv3);
Future Considerations
The data still hasn't been properly decoded from UTF-16 and encoded to UTF-8
- Test writing to a file, re-encoding the file then reading from it.
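One direction I could try is to check what the bytes actually are before converting anything, and then make the XML declaration agree with them. A rough sketch, assuming PHP 7 or later with the mbstring and SimpleXML extensions; the helper name xml_to_utf8 is hypothetical, the sample documents are made up, and I have not confirmed that this resolves the response discussed in this article:

```php
<?php
// Hypothetical helper: return the XML as UTF-8 with a matching declaration.
function xml_to_utf8(string $raw): string
{
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        // UTF-16 little-endian byte order mark: strip it and convert
        $utf8 = mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16LE');
    } elseif (strncmp($raw, "\xFE\xFF", 2) === 0) {
        // UTF-16 big-endian byte order mark: strip it and convert
        $utf8 = mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16BE');
    } elseif (strncmp($raw, "<\0", 2) === 0) {
        // No BOM, but '<' followed by a NUL byte looks like UTF-16LE
        $utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');
    } elseif (strncmp($raw, "\0<", 2) === 0) {
        // No BOM, but a NUL byte followed by '<' looks like UTF-16BE
        $utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-16BE');
    } else {
        // Assumption: anything else is already single-byte or UTF-8 text
        $utf8 = $raw;
    }
    // Rewrite the encoding label in the declaration so it matches the bytes
    $fixed = preg_replace('/^(<\?xml[^>]*?encoding=["\'])[^"\']+/i', '$1UTF-8', $utf8, 1);
    return $fixed ?? $utf8;
}

// Illustrative inputs: one mislabelled document and one that really is UTF-16
$mislabelled = '<?xml version="1.0" encoding="utf-16"?><Response><Status>OK</Status></Response>';
$real_utf16 = "\xFF\xFE" . mb_convert_encoding($mislabelled, 'UTF-16LE', 'UTF-8');

foreach (array($mislabelled, $real_utf16) as $raw) {
    $xml = simplexml_load_string(xml_to_utf8($raw));
    echo json_encode($xml), PHP_EOL; // {"Status":"OK"} in both cases
}
```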
Well, this is my stop. It's been several hours on something that could have taken others a few minutes if they knew where to look. My aim was to convert received UTF-16 XML to UTF-8 in order to convert the XML to JSON, and that has been achieved in part. It's 6am and I'm off to bed.
Category: Personal Home Page :: Article: 605