3 `9Y0@sdZddlZddlZddlZddlmZddlmZmZm Z ddl m Z ddl m Z ddlmZdd lmZGd d d eZdS) a Module containing the UniversalDetector detector class, which is the primary class a user of ``chardet`` should use. :author: Mark Pilgrim (initial port to Python) :author: Shy Shalom (original C code) :author: Dan Blanchard (major refactoring for 3.0) :author: Ian Cordasco N)CharSetGroupProber) InputStateLanguageFilter ProbingState)EscCharSetProber) Latin1Prober)MBCSGroupProber)SBCSGroupProberc @sneZdZdZdZejdZejdZejdZ dddd d d d d dZ e j fddZ ddZddZddZdS)UniversalDetectoraq The ``UniversalDetector`` class underlies the ``chardet.detect`` function and coordinates all of the different charset probers. To get a ``dict`` containing an encoding and its confidence, you can simply run: .. code:: u = UniversalDetector() u.feed(some_bytes) u.close() detected = u.result g?s[-]s(|~{)s[-]z Windows-1252z Windows-1250z Windows-1251z Windows-1256z Windows-1253z Windows-1255z Windows-1254z Windows-1257)z iso-8859-1z iso-8859-2z iso-8859-5z iso-8859-6z iso-8859-7z iso-8859-8z iso-8859-9z iso-8859-13cCsNd|_g|_d|_d|_d|_d|_d|_||_tj t |_ d|_ |j dS)N)_esc_charset_prober_charset_probersresultdone _got_data _input_state _last_char lang_filterloggingZ getLogger__name__logger_has_win_bytesreset)selfrr'/usr/lib/python3.6/universaldetector.py__init__Qs zUniversalDetector.__init__cCsZdddd|_d|_d|_d|_tj|_d|_|jr>|jj x|j D] }|j qFWdS)z Reset the UniversalDetector and all of its probers back to their initial states. This is called by ``__init__``, so you only need to call this directly in between analyses of different documents. Ng)encoding confidencelanguageF) rrrrr PURE_ASCIIrrr rr )rproberrrrr^s  zUniversalDetector.resetcCs>|jr dSt|sdSt|ts(t|}|js|jtjrJdddd|_nv|jtj tj frldddd|_nT|jdrdddd|_n:|jd rd ddd|_n |jtj tj frd ddd|_d |_|jd dk rd |_dS|j tjkr.|jj|rtj|_ n*|j tjkr.|jj|j|r.tj|_ |dd|_|j tjkr|js^t|j|_|jj|tjkr:|jj|jj|jjd|_d |_n|j tjkr:|jst |jg|_|jt!j"@r|jj#t$|jj#t%x@|jD]6}|j|tjkr|j|j|jd|_d |_PqW|j&j|r:d |_'dS)a Takes a chunk of a document and feeds it through all of the relevant charset probers. After calling ``feed``, you can check the value of the ``done`` attribute to see if you need to continue feeding the ``UniversalDetector`` more data, or if it has made a prediction (in the ``result`` attribute). .. note:: You should always call ``close`` when you're done feeding in your document if ``done`` is not already ``True``. Nz UTF-8-SIGg?)rrrzUTF-32szX-ISO-10646-UCS-4-3412szX-ISO-10646-UCS-4-2143zUTF-16Trr)(rlen isinstance bytearrayr startswithcodecsBOM_UTF8r BOM_UTF32_LE BOM_UTF32_BEBOM_LEBOM_BErrr!HIGH_BYTE_DETECTORsearch HIGH_BYTE ESC_DETECTORrZ ESC_ASCIIr rrfeedrZFOUND_IT charset_nameget_confidencerr r rZNON_CJKappendr rWIN_BYTE_DETECTORr)rZbyte_strr"rrrr3os|              zUniversalDetector.feedc Cs|jr |jSd|_|js&|jjdn|jtjkrBdddd|_n|jtjkrd}d}d}x,|j D]"}|slqb|j }||krb|}|}qbW|r||j kr|j }|j j }|j }|jd r|jr|jj||}|||jd|_|jjtjkrz|jd dkrz|jjd xn|j D]d}|s qt|trZxF|jD] }|jjd |j |j|j q4Wn|jjd |j |j|j qW|jS) z Stop analyzing the current document and come up with a final prediction. :returns: The ``result`` attribute, a ``dict`` with the keys `encoding`, `confidence`, and `language`. Tzno data received!asciig?r#)rrrNgziso-8859rz no probers hit minimum thresholdz%s %s confidence = %s)rrrrdebugrrr!r1r r5MINIMUM_THRESHOLDr4lowerr(r ISO_WIN_MAPgetrZgetEffectiveLevelrDEBUGr&rZprobers) rZprober_confidenceZmax_prober_confidenceZ max_proberr"r4Zlower_charset_namerZ group_proberrrrcloses`            zUniversalDetector.closeN)r __module__ __qualname____doc__r:recompiler/r2r7r<rZALLrrr3r?rrrrr 3s"    mr )rBr)rrCZcharsetgroupproberrZenumsrrrZ escproberrZ latin1proberrZmbcsgroupproberr Zsbcsgroupproberr objectr rrrr$s