mb_detect_encoding

(PHP 4 >= 4.0.6, PHP 5, PHP 7, PHP 8)

mb_detect_encoding - Detect character encoding

Description

mb_detect_encoding(string $string, array|string|null $encodings = null, bool $strict = false): string|false

Detects the most likely character encoding for the string string from an ordered list of candidates.

Automatic detection of the intended character encoding can never be entirely reliable; without some additional information, it is similar to decoding an encrypted string without the key. It is always preferable to use an indication of the character encoding stored or transmitted with the data, such as a "Content-Type" HTTP header.
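As a rough sketch of preferring a declared charset over guessing, the helper below extracts the charset parameter from a Content-Type header value and falls back to mb_detect_encoding() only when none is declared. The function name charsetFromContentType() and the sample header and body are hypothetical, chosen for illustration; they are not part of mbstring.

```php
<?php
// Hypothetical helper (not part of mbstring): extract the charset parameter
// from a Content-Type header value, or return null if none is declared.
function charsetFromContentType(string $header): ?string
{
    if (preg_match('/;\s*charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $header, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Assumed sample input: a header value and the body received with it.
$header = 'text/html; charset=ISO-8859-1';
$body   = "\xE1\xE9\xF3\xFA"; // 'áéóú' encoded in ISO-8859-1

// Prefer the declared charset; guess only when nothing was declared.
$charset = charsetFromContentType($header)
    ?? mb_detect_encoding($body, ['UTF-8', 'ISO-8859-1'], true);

var_dump($charset);
```

Detection is kept as a last resort here, so a wrong guess can only happen for data that arrived without any label.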

This function is most useful with multibyte encodings, where not all sequences of bytes form a valid string. If the input string contains such a sequence, that encoding will be rejected, and the next encoding checked.
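To illustrate the rejection rule, the sketch below uses mb_check_encoding() to test each candidate in turn. The sample bytes are 'áéóú' encoded in ISO-8859-1, which do not form a valid UTF-8 sequence. This only mirrors the idea described above; it is not a description of how mbstring works internally.

```php
<?php
// 'áéóú' encoded in ISO-8859-1; these bytes are not valid ASCII or UTF-8.
$str = "\xE1\xE9\xF3\xFA";

// Walk the candidates in order and report which ones accept the bytes.
foreach (['ASCII', 'UTF-8', 'ISO-8859-1'] as $candidate) {
    $valid = mb_check_encoding($str, $candidate);
    echo $candidate, ': ', $valid ? 'valid' : 'rejected', "\n";
}

// In strict mode, only an encoding in which the string is valid is returned.
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], true));
```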

Parameters

string

The string being inspected.

encodings

A list of character encodings to try, in order. The list may be specified as an array of strings, or as a single string separated by commas.

If encodings is omitted or null, the current detect_order (set with the mbstring.detect_order configuration option, or the mb_detect_order() function) will be used.
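A short sketch of the two ways to supply candidates: passing them on each call, or setting the detection order once with mb_detect_order(). The candidate list and the sample string are arbitrary choices for illustration.

```php
<?php
// 'áéóú' encoded in ISO-8859-1 (sample input for this sketch).
$str = "\xE1\xE9\xF3\xFA";

// 1. Candidates passed per call, as an array or a comma-separated string.
$fromArray  = mb_detect_encoding($str, ['UTF-8', 'ISO-8859-1'], true);
$fromString = mb_detect_encoding($str, 'UTF-8, ISO-8859-1', true);

// 2. Candidates set once for the script; a null $encodings then uses them.
mb_detect_order(['UTF-8', 'ISO-8859-1']);
$fromOrder = mb_detect_encoding($str, null, true);

var_dump($fromArray, $fromString, $fromOrder);
```

Passing the list explicitly keeps each call independent of global configuration, which may be easier to reason about than relying on the detection order.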

strict

Controls the behaviour when string is not valid in any of the listed encodings. If strict is set to false, the closest matching encoding will be returned; if strict is set to true, false will be returned.

The default value for strict can be set with the mbstring.strict_detection configuration option.

Return Values

The detected character encoding, or false if the string is not valid in any of the listed encodings.
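Because the function can return false, callers usually need an explicit `=== false` check before using the result. The following is a minimal sketch, assuming a hypothetical helper toUtf8OrNull() that detects strictly and then converts with mb_convert_encoding(); the helper name and the candidate lists are illustrative only.

```php
<?php
// Hypothetical helper: convert $str to UTF-8 if its encoding can be
// detected among $candidates, or return null when detection fails.
function toUtf8OrNull(string $str, array $candidates): ?string
{
    $encoding = mb_detect_encoding($str, $candidates, true);
    if ($encoding === false) {
        // Not valid in any candidate; let the caller decide what to do.
        return null;
    }
    return mb_convert_encoding($str, 'UTF-8', $encoding);
}

// 'áéóú' in ISO-8859-1 is converted; with only ASCII and UTF-8 as
// candidates the same bytes cannot be identified.
var_dump(bin2hex((string) toUtf8OrNull("\xE1\xE9\xF3\xFA", ['UTF-8', 'ISO-8859-1'])));
var_dump(toUtf8OrNull("\xE1\xE9\xF3\xFA", ['ASCII', 'UTF-8']));
```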

Examples

Example #1 mb_detect_encoding() example


<?php
// Detect character encoding with the current detect_order
echo mb_detect_encoding($str);

// "auto" is expanded according to mbstring.language
echo mb_detect_encoding($str, "auto");

// Specify the "encodings" parameter as a comma-separated list
echo mb_detect_encoding($str, "JIS, eucjp-win, sjis-win");

// Use an array to specify the "encodings" parameter
$encodings = [
  "ASCII",
  "JIS",
  "EUC-JP"
];
echo mb_detect_encoding($str, $encodings);
?>

Example #2 Effect of the strict parameter


<?php
// 'áéóú' encoded in ISO-8859-1
$str = "\xE1\xE9\xF3\xFA";

// The string is not valid ASCII or UTF-8, but UTF-8 is considered a closer match
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], false));
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], true));

// If a valid encoding is found, the strict parameter does not change the result
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], false));
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], true));
?>

The above example will output:

string(5) "UTF-8"
bool(false)
string(10) "ISO-8859-1"
string(10) "ISO-8859-1"

In some cases, the same sequence of bytes may form a valid string in multiple character encodings, and it is impossible to know which interpretation was intended. For instance, among many others, the byte sequence "\xC4\xA2" could be:

"Ä¢" (U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS followed by U+00A2 CENT SIGN) encoded in any of ISO-8859-1, ISO-8859-15, or Windows-1252
"ФЂ" (U+0424 CYRILLIC CAPITAL LETTER EF followed by U+0402 CYRILLIC CAPITAL LETTER DJE) encoded in ISO-8859-5
"Ģ" (U+0122 LATIN CAPITAL LETTER G WITH CEDILLA) encoded in UTF-8
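The three readings listed above can be reproduced with mb_convert_encoding(), by interpreting the same two bytes under each encoding and converting the result to UTF-8. This is only an illustrative sketch; the choice of encodings follows the list above.

```php
<?php
$bytes = "\xC4\xA2";

// Interpret the same two bytes under each encoding and convert the result
// to UTF-8 so that the differences become visible.
foreach (['ISO-8859-1', 'ISO-8859-5', 'UTF-8'] as $encoding) {
    $utf8 = mb_convert_encoding($bytes, 'UTF-8', $encoding);
    printf("%-10s => %s (%d character(s))\n", $encoding, $utf8, mb_strlen($utf8, 'UTF-8'));
}
```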

Example #3 Effect of order when multiple encodings match


<?php
$str = "\xC4\xA2";

// The string is valid in all three encodings, so the first one listed will be returned
var_dump(mb_detect_encoding($str, ['UTF-8', 'ISO-8859-1', 'ISO-8859-5']));
var_dump(mb_detect_encoding($str, ['ISO-8859-1', 'ISO-8859-5', 'UTF-8']));
var_dump(mb_detect_encoding($str, ['ISO-8859-5', 'UTF-8', 'ISO-8859-1']));
?>

The above example will output:

string(5) "UTF-8"
string(10) "ISO-8859-1"
string(10) "ISO-8859-5"

See Also

mb_detect_order() - Set/Get character encoding detection order

User Contributed Notes 23 notes

Gerg Tisza

13 years ago


To check whether a string is valid UTF-8 with mb_detect_encoding(), use strict mode; without it the result is of little use.

<?php
    $str = "\xE1\xE9\xF3\xFA"; // 'áéóú' encoded in ISO-8859-1
    mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'
    mb_detect_encoding($str, 'UTF-8', true); // false
?>

Chrigu

19 years ago


If you need to distinguish between UTF-8 and ISO-8859-1, list UTF-8 first in your list of encodings:
mb_detect_encoding($string, 'UTF-8, ISO-8859-1');

If you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1, because every byte sequence is valid ISO-8859-1.
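The same idea can be sketched with mb_check_encoding(): test the stricter encoding first and fall back to the permissive one. The helper name utf8OrLatin1() is hypothetical, and treating every non-UTF-8 input as ISO-8859-1 is an assumption that only holds if those are the two encodings you expect.

```php
<?php
// Hypothetical helper: treat the input as UTF-8 when it validates as such,
// otherwise assume ISO-8859-1 (an assumption, since many other single-byte
// encodings would accept the same bytes).
function utf8OrLatin1(string $str): string
{
    return mb_check_encoding($str, 'UTF-8') ? 'UTF-8' : 'ISO-8859-1';
}

// Arbitrary bytes are still accepted as ISO-8859-1, which is why it has to
// come last.
var_dump(mb_check_encoding("\x80\xFF\xC3\x28", 'ISO-8859-1'));

var_dump(utf8OrLatin1("\xC3\xA9")); // 'é' encoded in UTF-8
var_dump(utf8OrLatin1("\xE9"));     // 'é' encoded in ISO-8859-1
```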

rl at itfigures dot nl

16 years ago


I used Chris's function "detectUTF8" to detect the need for conversion from UTF-8 to ISO-8859-1, which works fine. I did have a problem with the subsequent iconv conversion.

The problem is that the iconv conversion to ISO-8859-1 (with //TRANSLIT) replaces the euro sign with EUR, although it is common practice to use \x80 as the euro sign in text labelled as 8859-1 (this is the Windows-1252 code point; ISO-8859-1 itself has no euro sign).

I could not use ISO-8859-15 since that mangled some other characters, so I added two str_replace() calls:

if(detectUTF8($str)){
  $str=str_replace("\xE2\x82\xAC","&euro;",$str); 
  $str=iconv("UTF-8","ISO-8859-1//TRANSLIT",$str);
  $str=str_replace("&euro;","\x80",$str); 
}

If HTML output is needed, the last line is not necessary (and even unwanted).

chris AT w3style.co DOT uk

17 years ago


Based on the preg_match() snippet below, I needed something faster and less specific. That function works and is brilliant, but it scans the entire string and checks that it conforms to UTF-8. I wanted something purely to check whether a string contains UTF-8 characters, so that I could switch character encoding from ISO-8859-1 to UTF-8.

I modified the pattern to look only for non-ASCII multibyte sequences in the UTF-8 range, and to stop once it finds at least one multibyte sequence. This is quite a lot faster.

<?php

function detectUTF8($string)
{
        return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]                # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]           # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]           # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}        # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}            # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
        )+%xs', $string);
}

?>

dennis at nikolaenko dot ru

15 years ago


Beware of a bug when detecting Russian encodings:
http://bugs.php.net/bug.php?id=38138

nat3738 at gmail dot com

14 years ago


A simple way to detect the UTF-8/16/32 encoding of a file by its BOM (this does not work for strings or files without a BOM):

<?php
// The Unicode BOM is U+FEFF; once encoded, it looks like this.
define ('UTF32_BIG_ENDIAN_BOM'   , chr(0x00) . chr(0x00) . chr(0xFE) . chr(0xFF));
define ('UTF32_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE) . chr(0x00) . chr(0x00));
define ('UTF16_BIG_ENDIAN_BOM'   , chr(0xFE) . chr(0xFF));
define ('UTF16_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE));
define ('UTF8_BOM'               , chr(0xEF) . chr(0xBB) . chr(0xBF));

function detect_utf_encoding($filename) {

    $text = file_get_contents($filename);
    $first2 = substr($text, 0, 2);
    $first3 = substr($text, 0, 3);
    $first4 = substr($text, 0, 4); // four bytes are needed to match the UTF-32 BOMs
    
    if ($first3 == UTF8_BOM) return 'UTF-8';
    elseif ($first4 == UTF32_BIG_ENDIAN_BOM) return 'UTF-32BE';
    elseif ($first4 == UTF32_LITTLE_ENDIAN_BOM) return 'UTF-32LE';
    elseif ($first2 == UTF16_BIG_ENDIAN_BOM) return 'UTF-16BE';
    elseif ($first2 == UTF16_LITTLE_ENDIAN_BOM) return 'UTF-16LE';
}
?>
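A variant of the same idea that works on a string rather than a file name is sketched below. The helper name detectBomEncoding() is hypothetical; the BOM byte sequences are the same ones defined in the note above.

```php
<?php
// Hypothetical helper: return the encoding implied by a leading BOM in
// $bytes, or null when no known BOM is present.
function detectBomEncoding(string $bytes): ?string
{
    // Longer BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM.
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}

var_dump(detectBomEncoding("\xEF\xBB\xBFhello"));
var_dump(detectBomEncoding("hello"));
```

For a file, only the first four bytes are needed, so something like file_get_contents($filename, false, null, 0, 4) avoids loading the whole file before calling the helper.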

recentUser at example dot com

6 years ago


In my environment (PHP 7.1.12), mb_detect_encoding() doesn't work when the detection order is not set appropriately.

To make mb_detect_encoding() work in such a case, simply call mb_detect_order('...') before mb_detect_encoding() in your script.

Neither ini_set('mbstring.language', '...'); nor ini_set('mbstring.detect_order', '...'); works in scripts for this purpose, whereas setting them in php.ini may work.

hmdker at gmail dot com

15 years ago


A function to detect UTF-8; it may be useful when mb_detect_encoding() is not available.

<?php
function is_utf8($str) {
    $c=0; $b=0;
    $bits=0;
    $len=strlen($str);
    for($i=0; $i<$len; $i++){
        $c=ord($str[$i]);
        if($c >= 128){ // non-ASCII byte: must be a lead byte (a lone 0x80 is invalid)
            if(($c >= 254)) return false;
            elseif($c >= 252) $bits=6;
            elseif($c >= 248) $bits=5;
            elseif($c >= 240) $bits=4;
            elseif($c >= 224) $bits=3;
            elseif($c >= 192) $bits=2;
            else return false;
            if(($i+$bits) > $len) return false;
            while($bits > 1){
                $i++;
                $b=ord($str[$i]);
                if($b < 128 || $b > 191) return false;
                $bits--;
            }
        }
    }
    return true;
}
?>

php-note-2005 at ryandesign dot com

19 years ago


Much simpler UTF-8-ness checker using a regular expression created by the W3C:

<?php

// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {
    
    // From http://w3.org/International/questions/qa-forms-utf-8.html
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$%xs', $string) === 1;
    
} // function is_utf8

?>

eyecatchup at gmail dot com

10 years ago


Just a note: Instead of using the often recommended (rather complex) regular expression by W3C (http://www.w3.org/International/questions/qa-forms-utf-8.en.php), you can simply use the 'u' modifier to test a string for UTF-8 validity:

<?php
  if (preg_match("//u", $string)) {
      // $string is valid UTF-8
  }
?>

garbage at iglou dot eu

7 years ago


To detect UTF-8, you can use:

if (preg_match('!!u', $str)) { echo 'utf-8'; }

- Norihiori
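The 'u' modifier check from the two notes above can be wrapped in a small function. The name isValidUtf8() is hypothetical; the comparison with mb_check_encoding() is included only as a cross-check and assumes mbstring is loaded.

```php
<?php
// Hypothetical wrapper around the 'u' modifier check from the notes above.
function isValidUtf8(string $str): bool
{
    // With the 'u' modifier, preg_match() returns false when the subject is
    // not valid UTF-8; an empty pattern matches any valid subject.
    return preg_match('//u', $str) === 1;
}

var_dump(isValidUtf8("\xC3\xA1")); // 'á' encoded in UTF-8
var_dump(isValidUtf8("\xE1"));     // 'á' encoded in ISO-8859-1

// mb_check_encoding() answers the same question when mbstring is loaded.
var_dump(mb_check_encoding("\xE1", 'UTF-8'));
```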

telemach

18 years ago


Beware: even if you need to distinguish between UTF-8 and ISO-8859-1, and you use the following detection order (as Chrigu suggests),

mb_detect_encoding("accentu\xE9e", 'UTF-8, ISO-8859-1') // 'accentuée' encoded in ISO-8859-1

returns ISO-8859-1, while

mb_detect_encoding("accentu\xE9", 'UTF-8, ISO-8859-1') // 'accentué' encoded in ISO-8859-1

returns UTF-8.

Bottom line: a trailing 'é' (and probably other accented characters) misleads mb_detect_encoding().

emoebel at web dot de

10 years ago


If the function mb_detect_encoding() does not exist, try:

<?php 
// ---------------------------------------------------- 
if ( !function_exists('mb_detect_encoding') ) { 

// ---------------------------------------------------------------- 
function mb_detect_encoding ($string, $enc=null, $ret=null) { 
       
        static $enclist = array( 
            'UTF-8', 'ASCII', 
            'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4', 'ISO-8859-5', 
            'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10', 
            'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16', 
            'Windows-1251', 'Windows-1252', 'Windows-1254', 
            );
        
        $result = false; 
        
        foreach ($enclist as $item) { 
            $sample = iconv($item, $item, $string); 
            if (md5($sample) == md5($string)) { 
                if ($ret === NULL) { $result = $item; } else { $result = true; } 
                break; 
            }
        }
        
    return $result; 
} 
// ---------------------------------------------------------------- 

} 
// ---------------------------------------------------- 
?>

Example usage of mb_detect_encoding():

<?php 
// ------------------------------------------------------ 
function str_to_utf8 ($str) { 
    
    if (mb_detect_encoding($str, 'UTF-8', true) === false) { 
    $str = utf8_encode($str); 
    }

    return $str;
}
// ------------------------------------------------------ 
?>

$txtstr = str_to_utf8($txtstr);

bmrkbyet at web dot de

11 years ago


a) if the FUNCTION mb_detect_encoding is not available: 

### mb_detect_encoding ... iconv ###

<?php
// -------------------------------------------

if(!function_exists('mb_detect_encoding')) { 
function mb_detect_encoding($string, $enc=null) { 
    
    static $list = array('utf-8', 'iso-8859-1', 'windows-1251');
    
    foreach ($list as $item) {
        $sample = iconv($item, $item, $string);
        if (md5($sample) == md5($string)) { 
            if ($enc == $item) { return true; }    else { return $item; } 
        }
    }
    return null;
}
}

// -------------------------------------------
?>

b) if the FUNCTION mb_convert_encoding is not available: 

### mb_convert_encoding ... iconv ###

<?php
// -------------------------------------------

if(!function_exists('mb_convert_encoding')) { 
function mb_convert_encoding($string, $target_encoding, $source_encoding) { 
    $string = iconv($source_encoding, $target_encoding, $string); 
    return $string; 
}
}

// -------------------------------------------
?>

maarten

19 years ago


Sometimes mb_detect_encoding is not what you need. When using pdflib, for example, you want to VERIFY the correctness of UTF-8. mb_detect_encoding reports some ISO-8859-1 encoded text as UTF-8.
To verify UTF-8, use the following:

//
//    utf8 encoding validation developed based on Wikipedia entry at:
//    http://en.wikipedia.org/wiki/UTF-8
//
//    Implemented as a recursive descent parser based on a simple state machine
//    copyright 2005 Maarten Meijer
//
//    This cries out for a C-implementation to be included in PHP core
//
    function valid_1byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0x80) == 0x00;
    }
    
    function valid_2byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xE0) == 0xC0;
    }

    function valid_3byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xF0) == 0xE0;
    }

    function valid_4byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xF8) == 0xF0;
    }
    
    function valid_nextbyte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xC0) == 0x80;
    }
    
    function valid_utf8($string) {
        $len = strlen($string);
        $i = 0;    
        while( $i < $len ) {
            $char = ord(substr($string, $i++, 1));
            if(valid_1byte($char)) {    // continue
                continue;
            } else if(valid_2byte($char)) { // check 1 byte
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } else if(valid_3byte($char)) { // check 2 bytes
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } else if(valid_4byte($char)) { // check 3 bytes
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } // goto next char
        }
        return true; // done
    }

For a drawing of the state machine, see: http://www.xs4all.nl/~mjmeijer/unicode.png and http://www.xs4all.nl/~mjmeijer/unicode2.png

Anonymous

10 years ago


// ----------------------------------------------------------- 

if(!function_exists('mb_detect_encoding')) {

function mb_detect_encoding($string, $enc=null, $ret=true) {
    $out=$enc; 
    static $list = array('utf-8', 'iso-8859-1', 'iso-8859-15', 'windows-1251');
        foreach ($list as $item) {
            $sample = iconv($item, $item, $string);
            if (md5($sample) == md5($string)) { $out = ($ret !== false) ? true : $item; } 
        } 
    return $out;
}

}

// -----------------------------------------------------------

lotushzy at gmail dot com

6 years ago


Regarding mb_detect_encoding(): the manual page at http://php.net/manual/zh/function.mb-detect-encoding.php shows this:
mb_detect_encoding('áéóú', 'UTF-8', true); // false
but now the result is not false. Can anyone explain why? Thanks!

lexonight at yahoo dot com

7 years ago


<?php
$file = file_get_contents("somefile.txt");
$encodings = implode(',', mb_list_encodings());
echo mb_detect_encoding($file, $encodings, true);
?>
seems to work

yaqy at qq dot com

15 years ago


<?php
/*
 * QQ: 290359552
 * Convert to UTF-8 if $str is not detected as 'UTF-8'
 */
function convToUtf8($str)
{
    if (mb_detect_encoding($str, "UTF-8, ISO-8859-1, GBK") != "UTF-8") {
        return iconv("gbk", "utf-8", $str);
    } else {
        return $str;
    }
}
?>

matthijs at ischen dot nl

15 years ago


I seriously underestimated the importance of setlocale...
<?php
$strings = array(
    "mais coisas a pensar sobre diário ou dois!",
    "plus de choses à penser à journalier ou à deux !",
    "¡más cosas a pensar en diario o dos!",
    "più cose da pensare circa giornaliere o due!",
    "flere ting å tenke på hver dag eller to!",
    "Další věcí, přemýšlet o každý den nebo dva!",
    "mehr über Spaß spät schönen",
    "më vonë gjatë fun bukur",
    "több mint szórakozás késő csodálatos kenyér"
);

$convert = array();
setlocale(LC_CTYPE, 'de_DE.UTF-8');
foreach( $strings as $string )
        $convert[] = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
?>

Produces the following: 

Array
(
    [0] => mais coisas a pensar sobre diario ou dois!
    [1] => plus de choses a penser a journalier ou a deux !
    [2] => ?mas cosas a pensar en diario o dos!
    [3] => piu cose da pensare circa giornaliere o due!
    [4] => flere ting aa tenke paa hver dag eller to!
    [5] => Dalsi veci, premyslet o kazdy den nebo dva!
    [6] => mehr ueber Spass spaet schoenen
    [7] => me vone gjate fun bukur
    [8] => toebb mint szorakozas keso csodalatos kenyer
)

whereas 

<?php
$convert = array();
setlocale(LC_CTYPE, 'nl_NL.UTF-8');
foreach( $strings as $string )
        $convert[] = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
?>

produces:
Array
(
    [0] => mais coisas a pensar sobre di?rio ou dois!
    [1] => plus de choses ? penser ? journalier ou ? deux !
    [2] => ?m?s cosas a pensar en diario o dos!
    [3] => pi? cose da pensare circa giornaliere o due!
    [4] => flere ting ? tenke p? hver dag eller to!
    [5] => Dal?? v?c?, p?em??let o ka?d? den nebo dva!
    [6] => mehr ?ber Spass sp?t sch?nen
    [7] => m? von? gjat? fun bukur
    [8] => t?bb mint sz?rakoz?s k?s? csod?latos keny?r
)

This might be of interest when trying to convert UTF-8 strings into ASCII suitable for URLs and such. This was never obvious to me, since I had only used the US and NL locales.

jaaks at playtech dot com

19 years ago


The last example for verifying UTF-8 has one little bug: if a 10xxxxxx byte occurs alone, i.e. not within a multibyte character, it is accepted although that is against the UTF-8 rules. Make the following replacement to repair it.

Replace
         } // goto next char
with
         } else {
           return false; // 10xxxxxx occurring alone
         } // goto next char

sunggsun

17 years ago


from PHPDIG

    function isUTF8($str) {
        if ($str === mb_convert_encoding(mb_convert_encoding($str, "UTF-32", "UTF-8"), "UTF-8", "UTF-32")) {
            return true;
        } else {
            return false;
        }
    }

prgss at bk dot ru

15 years ago


Another light way to detect character encoding:
<?php
function detect_encoding($string) {  
  static $list = array('utf-8', 'windows-1251');
  
  foreach ($list as $item) {
    $sample = iconv($item, $item, $string);
    if (md5($sample) == md5($string))
      return $item;
  }
  return null;
}
?>
