U e1@sddlZddlmZddlmZddlmZddlmZm Z m Z m Z m Z ddl mZmZmZmZmZddlmZdd lmZdd lmZmZmZmZmZee ed d d Zee edddZeee ed ddZ eee ed ddZ!eedee e"e"fdddZ#d-e ee"e edddZ$ee ee%dddZ&ee ed d!d"Z'e eed#d$d%Z(eed#d&d'Z)ed(dd.ee%e eed*d+d,Z*dS)/N)IncrementalDecoder)Counter) lru_cache)rDictListOptionalTuple) FREQUENCIESKO_NAMESLANGUAGE_SUPPORTED_COUNTTOO_SMALL_SEQUENCEZH_NAMES) is_suspiciously_successive_range)CoherenceMatches)is_accentuatedis_latinis_multi_byte_encodingis_unicode_range_secondary unicode_range) iana_namereturncst|rtdtd|j}|dd}idtddD]^}|t|g}|r>t |}|dkrhq>t |d kr>|krd|<|d 7<d 7q>t fd d DS) zF Return associated unicode ranges in a single byte code page. z.Function not supported on multi-byte code pagez encodings.{}ignore)errorsr@NFr cs g|]}|dkr|qS)g333333?).0character_rangecharacter_countZ seen_rangesrE/opt/hc_python/lib64/python3.8/site-packages/charset_normalizer/cd.py 8sz*encoding_unicode_range..) rIOError importlib import_moduleformatrrangedecodebytesrrsorted)rdecoderpichunkrrrr!encoding_unicode_ranges0    r/) primary_rangercCs>g}tD],\}}|D]}t||kr||q qq |S)z> Return inferred languages used with a unicode range. )r itemsrappend)r0 languageslanguage characters characterrrr!unicode_range_languages@s  r7cCs<t|}d}|D]}d|kr|}q&q|dkr4dgSt|S)z Single-byte encoding language association. Some code page are heavily linked to particular language(s). This function does the correspondence. NZLatin Latin Based)r/r7)rZunicode_rangesr0Zspecified_rangerrr!encoding_languagesOsr9cCs`|ds&|ds&|ds&|dkr,dgS|ds>|tkrDdgS|dsV|tkr\d gSgS) z Multi-byte encoding language association. Some code page are heavily linked to particular language(s). This function does the correspondence. Zshift_ iso2022_jpZeuc_jcp932JapanesegbChinese iso2022_krKorean) startswithrr )rrrr!mb_encoding_languagescsrB)maxsize)r4rcCsBd}d}t|D](}|s$t|r$d}|rt|dkrd}q||fS)zg Determine main aspects from a supported language if it contains accents and if is pure Latin. FT)r rr)r4target_have_accentstarget_pure_latinr6rrr!get_target_featuresxs  rFF)r5ignore_non_latinrc sg}tddD}tD]l\}}t|\}}|r@|dkr@q|dkrN|rNqt|}tfdd|D} | |} | dkr||| fqt|ddd d }d d|DS) zE Return associated languages associated to given characters. css|]}t|VqdSN)r)rr6rrr! sz%alphabet_languages..Fcsg|]}|kr|qSrr)rcr5rr!r"sz&alphabet_languages..g?cSs|dSNr rxrrr!z$alphabet_languages..TkeyreversecSsg|] }|dqS)rr)rZcompatible_languagerrr!r"s)anyr r1rFlenr2r*) r5rGr3Zsource_have_accentsr4Zlanguage_charactersrDrEr Zcharacter_match_countratiorrKr!alphabet_languagess"   rW)r4ordered_charactersrcCs|tkrtd|d}tt|}t|}tt|}|dk}t|td|D]D\}}||krfqRt||} ||} t|| } |dkrt | | dkrqR|dkrt | | |dkr|d7}qRt|d| } t|| d } |d|}||d }tt|t| @}tt|t| @}t| dkrJ|dkrJ|d7}qRt| dkrl|dkrl|d7}qR|t| d ks|t| d krR|d7}qRqR|t|S) aN Determine if a ordered characters list (by occurrence from most appearance to rarest) match a particular language. The result is a ratio between 0. (absolutely no correspondence) and 1. (near perfect fit). Beware that is function is not strict on the match in order to ease the detection. (Meaning close match is 1.) z{} not availablerFTr Ng?) r ValueErrorr&setrUzipr'indexintabs)r4rXZcharacter_approved_countZFREQUENCIES_language_setZordered_characters_countZ target_language_characters_countZlarge_alphabetr6Zcharacter_rankZcharacter_rank_in_languageZexpected_projection_ratioZcharacter_rank_projectionZcharacters_before_sourceZcharacters_after_sourceZcharacters_beforeZcharacters_afterZbefore_match_countZafter_match_countrrr!characters_popularity_comparest      rb)decoded_sequencercCsi}|D]~}|dkrqt|}|dkr,qd}|D]}t||dkr4|}qPq4|dkr\|}||krr|||<q|||7<qt|S)a Given a decoded text sequence, return a list of str. Unicode range / alphabet separation. Ex. a text containing English/Latin with a bit a Hebrew will return two items in the resulting list; One containing the latin letters and the other hebrew. FN)isalpharrlowerlistvalues)rcZlayersr6rZlayer_target_rangeZdiscovered_rangerrr!alpha_unicode_splits,  rh)resultsrcsfi|D]8}|D].}|\}}|kr0|g|<q||qqfddD}t|ddddS)z This function merge results previously given by the function coherence_ratio. The return type is the same as coherence_ratio. cs.g|]&}|tt|t|dfqS)rZ)roundsumrU)rr4Zper_language_ratiosrr!r"1sz*merge_coherence_ratios..cSs|dSrLrrMrrr!rO<rPz(merge_coherence_ratios..TrQ)r2r*)riresultZ sub_resultr4rVmergerrlr!merge_coherence_ratios#s   rocst|D]6}|\}}|dd}|kr2g|<||q tfddDrg}D]}||t|fq`|S|S)u We shall NOT return "English—" in CoherenceMatches because it is an alternative of "English". This function only keeps the best match and remove the em-dash in it. u—c3s|]}t|dkVqdS)r N)rU)reZ index_resultsrr!rIOsz/filter_alt_coherence_matches..)dictreplacer2rTmax)rirmr4rVZ no_em_nameZfiltered_resultsrrrr!filter_alt_coherence_matches?s rvi皙?)rc threshold lg_inclusionrcCsg}d}d}|dk r|dng}d|kr8d}|dt|D]}t|}|} tdd| D} | tkrpq@d d | D} |pt| |D]J} t| | } | |krqn| d kr|d 7}| | t | d f|dkrq@qq@t t |ddddS)z Detect ANY language that can be identified in given sequence. The sequence will be analysed by layers. A layer = Character extraction by alphabets/ranges. FrN,r8Tcss|]\}}|VqdSrHrrrJorrr!rIqsz"coherence_ratio..cSsg|] \}}|qSrrr{rrr!r"vsz#coherence_ratio..g?r rZr[cSs|dSrLrrMrrr!rOrPz!coherence_ratio..rQ) splitremoverhr most_commonrkr rWrbr2rjr*rv)rcrxryrirGZsufficient_match_countZlg_inclusion_listlayerZsequence_frequenciesrr Zpopular_character_orderedr4rVrrr!coherence_ratioZsD   r)F)rwN)+r$codecsr collectionsr functoolsrtypingZ TypeCounterrrrrZconstantr r r r rmdrmodelsrutilsrrrrrstrr/r7r9rBboolrFrWfloatrbrhrorvrrrrr!sL      ' $ P'