3
G�KdOO � @ s� d dl Z d dlmZmZ d dlmZmZmZmZ yd dl m
Z
W n ek
rX eZ
Y nX ddl
mZmZmZmZ ddlmZmZmZmZ ddlmZ dd lmZmZ dd
lmZmZmZm Z m!Z!m"Z" e j#d�Z$e j%� Z&e&j'e j(d�� de)e*e*e+ee ee e,e,ed� dd�Z-dee*e*e+ee ee e,e,ed� dd�Z.d e
e*e*e+ee ee e,e,ed� dd�Z/d!e
e*e*e+ee ee e,ed�dd�Z0dS )"� N)�basename�splitext)�BinaryIO�List�Optional�Set)�PathLike� )�coherence_ratio�encoding_languages�mb_encoding_languages�merge_coherence_ratios)�IANA_SUPPORTED�TOO_BIG_SEQUENCE�TOO_SMALL_SEQUENCE�TRACE)�
mess_ratio)�CharsetMatch�CharsetMatches)�any_specified_encoding� iana_name�identify_sig_or_bom�
is_cp_similar�is_multi_byte_encoding�should_strip_sig_or_bomZcharset_normalizerz)%(asctime)s | %(levelname)s | %(message)s� � 皙�����?TF) � sequences�steps�
chunk_size� threshold�cp_isolation�cp_exclusion�preemptive_behaviour�explain�returnc 1 . C s� t | ttf�s tdjt| ����|r>tj}tjt � tj
t� t| �} | dkr�tj
d� |rvtjt � tj
|prtj� tt| dddg d�g�S |dk r�tjtd d
j|�� dd� |D �}ng }|dk r�tjtd
d
j|�� dd� |D �}ng }| || k�rtjtd||| � d}| }|dk�r:| | |k �r:t| | �}t| �tk }
t| �tk}|
�rltjtdj| �� n|�r�tjtdj| �� g }|�r�t| �nd}
|
dk �r�|j|
� tjtd|
� t� }g }g }d}d}d}t� }t| �\}}|dk �r|j|� tjtdt|�|� |jd� d|k�r.|jd� �xv|t D �]h}|�rT||k�rT�q:|�rh||k�rh�q:||k�rv�q:|j|� d}||k}|�o�t|�}|d<k�r�| �r�tjtd|� �q:yt|�}W n, t t!fk
�r� tjtd|� �w:Y nX yr|�rB|dk�rBt"|dk�r&| dtd�� n| t|�td�� |d� n&t"|dk�rR| n| t|�d� |d�}W nV t#t$fk
�r� } z4t |t$��s�tjtd|t"|�� |j|� �w:W Y dd}~X nX d}x |D ]}t%||��r�d}P �q�W |�rtjtd||� �q:t&|�sdnt|�| t| | ��}|�o>|dk �o>t|�| k } | �rTtjtd|� tt|�d �}!t'|!d!�}!d}"d}#g }$g }%�x�|D �]�}&|&| | d" k�r��q�| |&|&| � }'|�r�|dk�r�||' }'y|'j(||�r�d#nd$d%�}(W nB t#k
�r( } z$tjtd&|t"|�� |!}"d}#P W Y dd}~X nX |�r�|&dk�r�| |& d'k�r�t)|d(�})|�r�|(d|)� |k�r�xdt&|&|&d d=�D ]P}*| |*|&| � }'|�r�|dk�r�||' }'|'j(|d#d%�}(|(d|)� |k�r|P �q|W |$j|(� |%jt*|(|�� |%d> |k�r |"d7 }"|"|!k�s|�r�|dk�r�P �q�W |# �r�|�r�| �r�y| td)�d� j(|d$d%� W nF t#k
�r� } z(tjtd*|t"|�� |j|� �w:W Y dd}~X nX |%�r�t+|%�t|%� nd}+|+|k�s�|"|!k�rF|j|� tjtd+||"t,|+d, d-d.�� |dd|
gk�r:|# �r:t| ||dg |�},||
k�r.|,}n|dk�r>|,}n|,}�q:tjtd/|t,|+d, d-d.�� |�srt-|�}-nt.|�}-|-�r�tjtd0j|t"|-��� g }.|dk�r�x4|$D ],}(t/|(d1|-�r�d2j|-�nd�}/|.j|/� �q�W t0|.�}0|0�r�tjtd3j|0|�� |jt| ||+||0|�� ||
ddgk�rd|+d1k �rdtj
d4|� |�rVtjt � tj
|� t|| g�S ||k�r:tj
d5|� |�r�tjt � tj
|� t|| g�S �q:W t|�dk� rX|�s�|�s�|�r�tjtd6� |�r�tj
d7|j1� |j|� nd|� r|dk� s(|� r|� r|j2|j2k� s(|dk � r>tj
d8� |j|� n|� rXtj
d9� |j|� |� r|tj
d:|j3� j1t|�d � n
tj
d;� |� r�tjt � tj
|� |S )?ae
    Given a raw bytes sequence, return the best possible charsets usable to render str objects.
    If there are no results, it is a strong indicator that the source is binary/not text.
    By default, the process will extract 5 blocks of 512 bytes each to assess the mess and coherence of a given sequence.
    It gives up on a particular code page after 20% of measured mess. These criteria are customizable at will.
    The preemptive behavior DOES NOT replace the traditional detection workflow; it prioritizes a particular code page
    but never takes it for granted. It can improve performance.
    You may want to focus your attention on some code pages and/or exclude others; use cp_isolation and cp_exclusion for that
    purpose.
    This function will strip the SIG in the payload/sequence every time except on UTF-16, UTF-32.
    By default the library does not set up any handler other than the NullHandler; if you choose to set the 'explain'
    toggle to True, it will alter the logger configuration to add a StreamHandler that is suitable for debugging.
    Custom logging format and handler can be set manually.
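
    The default sampling described above (5 blocks of 512 bytes each) can be pictured with a small standalone sketch. This is only an illustration under stated assumptions: `sample_chunks` is a hypothetical helper and not part of charset_normalizer, and the library's own logic also handles BOM/SIG marks, lazy decoding and multi-byte boundaries, which are left out here. The parameter names `steps` and `chunk_size` are borrowed from the function signature.

```python
from typing import List


def sample_chunks(payload: bytes, steps: int = 5, chunk_size: int = 512) -> List[bytes]:
    """Return up to `steps` evenly spaced chunks of `chunk_size` bytes.

    Hypothetical helper for illustration only; not the library's code.
    """
    length = len(payload)
    if length == 0:
        return []

    # Assumption: a payload too small for the requested sampling
    # is inspected as a single chunk.
    if length <= steps * chunk_size:
        return [payload]

    # Evenly spaced starting offsets: 0, stride, 2 * stride, ...
    stride = length // steps
    return [
        payload[offset:offset + chunk_size]
        for offset in range(0, stride * steps, stride)
    ]


if __name__ == "__main__":
    chunks = sample_chunks(bytes(10_000))
    print(len(chunks), [len(chunk) for chunk in chunks])
```

    Sampling a few evenly spaced blocks rather than decoding the whole payload keeps the cost of testing each candidate code page roughly constant, whatever the payload size.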
z4Expected object of type bytes or bytearray, got: {0}r z<Encoding detection on empty bytes, assuming utf_8 intention.�utf_8g F� Nz`cp_isolation is set. use this flag for debugging purpose. limited list of encoding allowed : %s.z, c S s g | ]}t |d ��qS )F)r )�.0�cp� r+ �D/tmp/pip-build-8nxjc3nm/charset-normalizer/charset_normalizer/api.py�
<listcomp>] s zfrom_bytes.<locals>.<listcomp>zacp_exclusion is set. use this flag for debugging purpose. limited list of encoding excluded : %s.c S s g | ]}t |d ��qS )F)r )r) r* r+ r+ r, r- h s z^override steps (%i) and chunk_size (%i) as content does not fit (%i byte(s) given) parameters.r z>Trying to detect encoding from a tiny portion of ({}) byte(s).zIUsing lazy str decoding because the payload is quite large, ({}) byte(s).z@Detected declarative mark in sequence. Priority +1 given for %s.zIDetected a SIG or BOM mark on first %i byte(s). Priority +1 given for %s.�ascii�utf_16�utf_32z[Encoding %s wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.z2Encoding %s does not provide an IncrementalDecoderg ��A)�encodingz9Code page %s does not fit given bytes sequence at ALL. %sTzW%s is deemed too similar to code page %s and was consider unsuited already. Continuing!zpCode page %s is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.� � � �ignore�strict)�errorszaLazyStr Loading: After MD chunk decode, code page %s does not fit given bytes sequence at ALL. %s� � g j�@z^LazyStr Loading: After final lookup, code page %s does not fit given bytes sequence at ALL. %szc%s was excluded because of initial chaos probing. Gave up %i time(s). Computed mean chaos is %f %%.�d � )�ndigitsz=%s passed initial chaos probing. Mean measured chaos is %f %%z&{} should target any language(s) of {}g�������?�,z We detected language {} using {}z.Encoding detection: %s is most likely the one.zoEncoding detection: %s is most likely the one as we detected a BOM or SIG within the beginning of the sequence.zONothing got out of the detection process. 
Using ASCII/UTF-8/Specified fallback.z7Encoding detection: %s will be used as a fallback matchz:Encoding detection: utf_8 will be used as a fallback matchz:Encoding detection: ascii will be used as a fallback matchz]Encoding detection: Found %s as plausible (best-candidate) for content. With %i alternatives.z=Encoding detection: Unable to determine any suitable charset.> r0 r/ ���r>