jfD+ nddlZddlmZddlmZmZddlmZddlm Z m Z m Z m Z ddl mZddlmZmZmZmZdd lmZdd lmZdd lmZmZmZmZmZd ed e efdZded e efdZ ed ed e efdZ!ed ed e efdZ"eeded e e#e#ffdZ$ d%de ede#d e efdZ%dede ed e&fdZ'ded e efdZ(de ed efdZ)ed  d&ded"e&d#e ed efd$Z*dS)'N)IncrementalDecoder)Counter OrderedDict) lru_cache)DictListOptionalTuple) FREQUENCIES)KO_NAMESLANGUAGE_SUPPORTED_COUNTTOO_SMALL_SEQUENCEZH_NAMES) is_suspiciously_successive_range)CoherenceMatches)is_accentuatedis_latinis_multi_byte_encodingis_unicode_range_secondary unicode_range iana_namereturnct|rtdtjd|j}|d}idt ddD]h}|t|g}|rAt|}|9t|d ur|vrd|<|xxd z cc<d z itfd DS) zF Return associated unicode ranges in a single byte code page. z.Function not supported on multi-byte code pagez encodings.{}ignore)errorsr@NFr c2g|]}|z dk|S)g333333?).0character_rangecharacter_count seen_rangess u/builddir/build/BUILD/imunify360-venv-2.3.5/opt/imunify360/venv/lib/python3.11/site-packages/charset_normalizer/cd.py z*encoding_unicode_range..1s8   ?+o=EE EEE) rIOError importlib import_moduleformatrrangedecodebytesrrsorted)rdecoderpichunkr"r#r$s @@r%encoding_unicode_ranger4s?i((HFGGG%n&;&;I&F&FGGZGx   AKO 4   % %s$$  %+E22O&)/::eCC"+5534K0O,,,1,,,1$      #.     r' primary_rangecg}tjD]4\}}|D],}t||kr||n-5|S)z> Return inferred languages used with a unicode range. )r itemsrappend)r5 languageslanguage characters characters r%unicode_range_languagesr=9soI + 1 3 3*#  IY''=88  ***9 r'cft|}d}|D] }d|vr|}n |dgSt|S)z Single-byte encoding language association. Some code page are heavily linked to particular language(s). This function does the correspondence. NLatin Latin Based)r4r=)runicode_rangesr5specified_ranges r%encoding_languagesrCHs] ,I66NM) / ) )+M E * "= 1 11r'c|ds0|ds|ds|dkrdgS|ds |tvrddgS|d s |tvrd gSgS) z Multi-byte encoding language association. Some code page are heavily linked to particular language(s). This function does the correspondence. shift_ iso2022_jpeuc_jcp932JapanesegbChinesezClassical Chinese iso2022_krKorean) startswithrr )rs r%mb_encoding_languagesrO\s X&&    - -    ( (   |D!!0Y(%:%:.//L))Y(-B-Bz Ir')maxsizer:cd}d}t|D]*}|st|rd}|rt|durd}+||fS)zg Determine main aspects from a supported language if it contains accents and if is pure Latin. FT)r rr)r:target_have_accentstarget_pure_latinr<s r%get_target_featuresrTqsn   *&& " '~i'@'@ '"&   &)!4!4!=!= %   1 11r'Fr;ignore_non_latincg}tdD}tjD]q\}}t|\}}|r|dur|dur|r%t |}t fd|D} | |z } | dkr||| frt |dd}d|DS) zE Return associated languages associated to given characters. c34K|]}t|VdSN)r)r!r<s r% z%alphabet_languages..s*TTInY77TTTTTTr'Fcg|]}|v| Sr r )r!cr;s r%r&z&alphabet_languages..s ? ? ?1qJQr'g?c|dSNr r xs r%z$alphabet_languages..s !r'Tkeyreversecg|] }|d S)rr )r!compatible_languages r%r&z&alphabet_languages..s H H H':  " H H Hr')anyr r7rTlenr8r/) r;rUr9source_have_accentsr:language_charactersrRrSr#character_match_countratios ` r%alphabet_languagesrls ITTTTTTT)4):)<)<00%%1DX1N1N..   1U : :  % ' ',? ' 122 # ? ? ? ?+ ? ? ?! ! &7 C<<   h. / / /ynndCCCI H Hi H H HHr'ordered_charactersc\ |tvr"td|d}|D]h}|t|vrt|dt||}t|t||d}|d|| |||d fd|Dd}fd|Dd}t |dkr |dkr|dz }t |dkr |dkr|dz }5|t |z d ks|t |z d kr|dz }hj|t |z S) aN Determine if a ordered characters list (by occurrence from most appearance to rarest) match a particular language. The result is a ratio between 0. (absolutely no correspondence) and 1. (near perfect fit). Beware that is function is not strict on the match in order to ease the detection. (Meaning close match is 1.) z{} not availablerNcg|]}|vSr r )r!echaracters_befores r%r&z1characters_popularity_compare..s,   '(A" "   r'Tcg|]}|vSr r )r!rpcharacters_afters r%r&z1characters_popularity_compare..s,   &'A! !   r'r g?)r ValueErrorr+indexcountrg) r:rmcharacter_approved_countr<characters_before_sourcecharacters_after_sourcebefore_match_countafter_match_countrsrqs @@r%characters_popularity_comparer}s+{""+228<<=== '** K1 1 1 #.x#8 H%++I66 6$  #.h"7  ! ' ' 2 2 4 4# / "((33 3 .  $ $Y / / 1 1     ,D   %         +B   %     ' ( (A - -2D2I2I $ ) $  & ' '1 , ,1Ba1G1G $ ) $  %=!>!> ># E E 3'>#?#??3FF $ ) $ G $c*<&=&= ==r'decoded_sequenceczt}|D]}|durt|}|+d}|D]}t||dur|}n||}||vr|||<h||xx|z cc<t |S)a Given a decoded text sequence, return a list of str. Unicode range / alphabet separation. Ex. a text containing English/Latin with a bit a Hebrew will return two items in the resulting list; One containing the latin letters and the other hebrew. FN)risalpharrlowerlistvalues)r~layersr<r"layer_target_rangediscovered_ranges r%alpha_unicode_splitrs ]]F%88     % ' ' ' 22  " ! &   01A?SS&6"   %!0  V + +)2):):F% & !"""ioo&7&77""""    r'resultsct|D]2}|D]-}|\}}|vr|g|<||.3fdD}t|ddS)z This function merge results previously given by the function coherence_ratio. The return type is the same as coherence_ratio. c g|]=}|tt|t|z df>S)rt)roundsumrg)r!r:per_language_ratioss r%r&z*merge_coherence_ratios..sd      '122S9LX9V5W5WW      r'c|dSr]r r^s r%r`z(merge_coherence_ratios.."s qtr'Tra)rr8r/)rresult sub_resultr:rkmergers @r%merge_coherence_ratiosr s &--88  8 8J(OHe22216#H-  ) 0 0 7 7 7 7  8    ,   E %^^T : : ::r'i皙? threshold lg_inclusionc 0g}d}d}||dng}d|vrd}|dt|D]}t|}|} t d| D} | t krJd| D} |pt| |D]Q} t| | } | |kr| d kr|d z }| | t| d f|d krnRt|d dS)z Detect ANY language that can be identified in given sequence. The sequence will be analysed by layers. A layer = Character extraction by alphabets/ranges. FrN,r@Tc3 K|] \}}|V dSrXr r!r[os r%rYz"coherence_ratio..<s&88DAqa888888r'cg|]\}}|Sr r rs r%r&z#coherence_ratio..As$?$?$?41aQ$?$?$?r'g?r rtc|dSr]r r^s r%r`z!coherence_ratio..Ts 1r'ra) splitremoverr most_commonrrrlr}r8rr/)r~rrrrUsufficient_match_countlg_inclusion_listlayersequence_frequenciesrr#popular_character_orderedr:rks r%coherence_ratior%s~G3?3K **3///QS)))  ///$%566&u~~*6688 88K88888 0 0 0 $?$?;$?$?$?!) -? %'7. .   H23Ey  #&!+& NNHeE1oo6 7 7 7%**+ '~~t < < <rs@%%%%%%,,,,,,,,............VVVVVVVVVVVV000000$$$$$$"c"d3i""""J 3 49     2#2$s)222 2& ST#Y ( +,,, 2# 2%d *; 2 2 2-, 2"5:!I!IS !I-1!I #Y!I!I!I!IH9>9>'+Cy9> 9>9>9>9>x$!#$!$s)$!$!$!$!N;D)9$:;?O;;;;8 4QU.=.=.=&+.=AI#.=.=.=.=.=.=.=r'