|
Metadata-Version: 2.1 |
|
Name: charset-normalizer |
|
Version: 3.3.2 |
|
Summary: The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet. |
|
Home-page: https://github.com/Ousret/charset_normalizer |
|
Author: Ahmed TAHRI |
|
Author-email: [email protected] |
|
License: MIT |
|
Project-URL: Bug Reports, https://github.com/Ousret/charset_normalizer/issues |
|
Project-URL: Documentation, https://charset-normalizer.readthedocs.io/en/latest |
|
Keywords: encoding,charset,charset-detector,detector,normalization,unicode,chardet,detect |
|
Classifier: Development Status :: 5 - Production/Stable |
|
Classifier: License :: OSI Approved :: MIT License |
|
Classifier: Intended Audience :: Developers |
|
Classifier: Topic :: Software Development :: Libraries :: Python Modules |
|
Classifier: Operating System :: OS Independent |
|
Classifier: Programming Language :: Python |
|
Classifier: Programming Language :: Python :: 3 |
|
Classifier: Programming Language :: Python :: 3.7 |
|
Classifier: Programming Language :: Python :: 3.8 |
|
Classifier: Programming Language :: Python :: 3.9 |
|
Classifier: Programming Language :: Python :: 3.10 |
|
Classifier: Programming Language :: Python :: 3.11 |
|
Classifier: Programming Language :: Python :: 3.12 |
|
Classifier: Programming Language :: Python :: Implementation :: PyPy |
|
Classifier: Topic :: Text Processing :: Linguistic |
|
Classifier: Topic :: Utilities |
|
Classifier: Typing :: Typed |
|
Requires-Python: >=3.7.0 |
|
Description-Content-Type: text/markdown |
|
License-File: LICENSE |
|
Provides-Extra: unicode_backport |
|
|
|
<h1 align="center">Charset Detection, for Everyone 👋</h1>
|
|
|
<p align="center"> |
|
<sup>The Real First Universal Charset Detector</sup><br> |
|
<a href="https://pypi.org/project/charset-normalizer"> |
|
<img src="https://img.shields.io/pypi/pyversions/charset_normalizer.svg?orange=blue" /> |
|
</a> |
|
<a href="https://pepy.tech/project/charset-normalizer/"> |
|
<img alt="Download Count Total" src="https://static.pepy.tech/badge/charset-normalizer/month" /> |
|
</a> |
|
<a href="https://bestpractices.coreinfrastructure.org/projects/7297"> |
|
<img src="https://bestpractices.coreinfrastructure.org/projects/7297/badge"> |
|
</a> |
|
</p> |
|
<p align="center"> |
|
<sup><i>Featured Packages</i></sup><br> |
|
<a href="https://github.com/jawah/niquests"> |
|
<img alt="Static Badge" src="https://img.shields.io/badge/Niquests-HTTP_1.1%2C%202%2C_and_3_Client-cyan"> |
|
</a> |
|
<a href="https://github.com/jawah/wassima"> |
|
<img alt="Static Badge" src="https://img.shields.io/badge/Wassima-Certifi_Killer-cyan"> |
|
</a> |
|
</p> |
|
<p align="center"> |
|
<sup><i>In other languages (unofficial ports, by the community)</i></sup><br>
|
<a href="https://github.com/nickspring/charset-normalizer-rs"> |
|
<img alt="Static Badge" src="https://img.shields.io/badge/Rust-red"> |
|
</a> |
|
</p> |
|
|
|
> A library that helps you read text from an unknown charset encoding.<br /> Motivated by `chardet`, |
|
> I'm trying to resolve the issue by taking a new approach. |
|
> All IANA character set names for which the Python core library provides codecs are supported. |
|
|
|
<p align="center"> |
|
>>>>> <a href="https://charsetnormalizerweb.ousret.now.sh" target="_blank">👉 Try Me Online Now, Then Adopt Me 👈 </a> <<<<<
|
</p> |
|
|
|
This project offers you an alternative to **Universal Charset Encoding Detector**, also known as **Chardet**. |
|
|
|
| Feature                                           | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer                                                             | [cChardet](https://github.com/PyYoshi/cChardet) |
|----------------------------------------------------|:---------------------------------------------:|:-------------------------------------------------------------------------------:|:-----------------------------------------------:|
| `Fast`                                             | ❌                                            | ✅                                                                              | ✅                                              |
| `Universal**`                                      | ❌                                            | ✅                                                                              | ❌                                              |
| `Reliable` **without** distinguishable standards   | ❌                                            | ✅                                                                              | ✅                                              |
| `Reliable` **with** distinguishable standards      | ✅                                            | ✅                                                                              | ✅                                              |
| `License`                                          | LGPL-2.1<br>_restrictive_                      | MIT                                                                             | MPL-1.1<br>_restrictive_                        |
| `Native Python`                                    | ✅                                            | ✅                                                                              | ❌                                              |
| `Detect spoken language`                           | ❌                                            | ✅                                                                              | N/A                                             |
| `UnicodeDecodeError Safety`                        | ❌                                            | ✅                                                                              | ❌                                              |
| `Whl Size (min)`                                   | 193.6 kB                                       | 42 kB                                                                           | ~200 kB                                         |
| `Supported Encoding`                               | 33                                             | 🎉 [99](https://charset-normalizer.readthedocs.io/en/latest/user/support.html)  | 40                                              |
|
|
|
<p align="center"> |
|
<img src="https://i.imgflip.com/373iay.gif" alt="Reading Normalized Text" width="226"/><img src="https://media.tenor.com/images/c0180f70732a18b4965448d33adba3d0/tenor.gif" alt="Cat Reading Text" width="200"/> |
|
</p> |
|
|
|
*\*\* : They are clearly using specific code for a specific encoding even if covering most of the encodings in use.*<br>

Did you get here because of the logs? See [https://charset-normalizer.readthedocs.io/en/latest/user/miscellaneous.html](https://charset-normalizer.readthedocs.io/en/latest/user/miscellaneous.html)
|
|
|
|
|
|
|
This package offers better performance than its counterpart, Chardet. Here are some numbers.
|
|
|
| Package                                       | Accuracy | Mean per file (ms) | File per sec (est) |
|-----------------------------------------------|:--------:|:------------------:|:------------------:|
| [chardet](https://github.com/chardet/chardet) | 86 %     | 200 ms             | 5 file/sec         |
| charset-normalizer                            | **98 %** | **10 ms**          | 100 file/sec       |
|
|
|
| Package                                       | 99th percentile | 95th percentile | 50th percentile |
|-----------------------------------------------|:---------------:|:---------------:|:---------------:|
| [chardet](https://github.com/chardet/chardet) | 1200 ms         | 287 ms          | 23 ms           |
| charset-normalizer                            | 100 ms          | 50 ms           | 5 ms            |
|
|
|
Chardet's performance on larger files (1MB+) is very poor. Expect a huge difference on large payloads.
|
|
|
> Stats are generated using 400+ files using default parameters. For more details on the files used, see the GHA workflows.

> And yes, these results might change at any time. The dataset can be updated to include more files.

> The actual delays depend heavily on your CPU capabilities. The factors should remain the same.

> Keep in mind that the stats are generous and that Chardet's accuracy vs. ours is measured using Chardet's initial capability

> (e.g. supported encodings). Challenge them if you want.
|
|
|
|
|
|
|
Using pip: |
|
|
|
```sh |
|
pip install charset-normalizer -U |
|
``` |
|
|
|
|
|
|
|
|
|
This package comes with a CLI. |
|
|
|
``` |
|
usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD] |
|
file [file ...] |
|
|
|
The Real First Universal Charset Detector. Discover originating encoding used |
|
on text file. Normalize text to unicode. |
|
|
|
positional arguments: |
|
files File(s) to be analysed |
|
|
|
optional arguments: |
|
-h, --help show this help message and exit |
|
-v, --verbose Display complementary information about file if any. |
|
Stdout will contain logs about the detection process. |
|
-a, --with-alternative |
|
Output complementary possibilities if any. Top-level |
|
JSON WILL be a list. |
|
-n, --normalize Permit to normalize input file. If not set, program |
|
does not write anything. |
|
-m, --minimal Only output the charset detected to STDOUT. Disabling |
|
JSON output. |
|
-r, --replace Replace file when trying to normalize it instead of |
|
creating a new one. |
|
-f, --force Replace file without asking if you are sure, use this |
|
flag with caution. |
|
-t THRESHOLD, --threshold THRESHOLD |
|
Define a custom maximum amount of chaos allowed in |
|
decoded content. 0. <= chaos <= 1. |
|
--version Show version information and exit. |
|
``` |
|
|
|
```bash |
|
normalizer ./data/sample.1.fr.srt |
|
``` |
|
|
|
or |
|
|
|
```bash |
|
python -m charset_normalizer ./data/sample.1.fr.srt |
|
``` |
|
|
|
🎉 Since version 1.4.0, the CLI produces an easily usable stdout result in JSON format.
|
|
|
```json |
|
{ |
|
"path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt", |
|
"encoding": "cp1252", |
|
"encoding_aliases": [ |
|
"1252", |
|
"windows_1252" |
|
], |
|
"alternative_encodings": [ |
|
"cp1254", |
|
"cp1256", |
|
"cp1258", |
|
"iso8859_14", |
|
"iso8859_15", |
|
"iso8859_16", |
|
"iso8859_3", |
|
"iso8859_9", |
|
"latin_1", |
|
"mbcs" |
|
], |
|
"language": "French", |
|
"alphabets": [ |
|
"Basic Latin", |
|
"Latin-1 Supplement" |
|
], |
|
"has_sig_or_bom": false, |
|
"chaos": 0.149, |
|
"coherence": 97.152, |
|
"unicode_path": null, |
|
"is_preferred": true |
|
} |
|
``` |
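If you only need the detected charset name (e.g. for scripting), the `-m`/`--minimal` flag documented above prints just that:

```bash
normalizer -m ./data/sample.1.fr.srt
# prints only the charset, e.g. "cp1252" for the sample above
```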
|
|
|
|
|
*Just print out normalized text* |
|
```python |
|
from charset_normalizer import from_path |
|
|
|
results = from_path('./my_subtitle.srt') |
|
|
|
print(str(results.best())) |
|
``` |
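If you need more than the decoded text, the returned matches expose the same kind of information as the CLI JSON output shown earlier. A minimal sketch (check the documentation for the authoritative attribute list):

```python
from charset_normalizer import from_path

results = from_path("./my_subtitle.srt")
best_guess = results.best()  # may be None if no encoding fits

if best_guess is None:
    print("No suitable encoding found")
else:
    # Roughly the same fields as the CLI JSON output above.
    print(best_guess.encoding)  # e.g. 'cp1252'
    print(best_guess.language)  # e.g. 'French'
    print(str(best_guess))      # the decoded, readable text
```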
|
|
|
*Upgrade your code without effort* |
|
```python |
|
from charset_normalizer import detect |
|
``` |
|
|
|
The above code will behave the same as **chardet**. We ensure that we offer the best (reasonable) BC result possible. |
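For instance, with a byte payload in hand, the drop-in `detect` call returns a chardet-style dictionary (a minimal sketch; the exact values depend on your input):

```python
from charset_normalizer import detect

payload = "Bonjour, ceci est un texte accentué.".encode("cp1252")

result = detect(payload)
# chardet-style result: a dict with 'encoding', 'language' and 'confidence' keys.
print(result["encoding"])
```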
|
|
|
See the docs for advanced usage : [readthedocs.io](https://charset-normalizer.readthedocs.io/en/latest/) |
|
|
|
|
|
|
|
When I started using Chardet, I noticed that it did not meet my expectations, and I wanted to propose a

reliable alternative using a completely different method. Also! I never back down on a good challenge!
|
|
|
I **don't care** about the **originating charset** encoding, because **two different tables** can

produce **two identical rendered strings.**

What I want is to get readable text, the best I can.
|
|
|
In a way, **I'm brute-forcing text decoding.** How cool is that? 😎
|
|
|
Don't confuse the **ftfy** package with charset-normalizer or chardet. ftfy's goal is to repair broken Unicode strings, whereas charset-normalizer converts a raw file in an unknown encoding to Unicode.
|
|
|
|
|
|
|
- Discard all charset encoding tables that could not fit the binary content.

- Measure noise, or the mess, once opened (by chunks) with a corresponding charset encoding.

- Extract matches with the lowest mess detected.

- Additionally, we measure coherence / probe for a language (a simplified sketch of this flow follows below).
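As a rough illustration of those steps (this is not the library's actual internals; `measure_mess` below is a deliberately naive stand-in for the real mess-detection plugins):

```python
from typing import Iterable, Optional


def measure_mess(text: str) -> float:
    """Naive noise measure: the share of unprintable characters.
    The real detector uses far more elaborate heuristics."""
    suspicious = sum(1 for ch in text if not ch.isprintable() and ch not in "\r\n\t")
    return suspicious / max(len(text), 1)


def naive_best_guess(payload: bytes, candidates: Iterable[str]) -> Optional[str]:
    """Try every candidate encoding, discard those that cannot decode the
    payload, and keep the one producing the least mess."""
    scored = []
    for encoding in candidates:
        try:
            text = payload.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # this table cannot fit the binary content
        scored.append((measure_mess(text), encoding))
    return min(scored)[1] if scored else None


# 'utf_8' fails to decode this payload and is discarded; cp1252 and latin_1 tie
# on the naive mess score, so the tuple ordering breaks the tie.
print(naive_best_guess("écoute".encode("cp1252"), ["utf_8", "cp1252", "latin_1"]))
```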
|
|
|
**Wait a minute**, what are noise/mess and coherence according to **YOU?**
|
|
|
*Noise :* I opened hundreds of text files, **written by humans**, with the wrong encoding table. **I observed**, then

**I established** some ground rules about **what is obvious** when **it seems like** a mess.

I know that my interpretation of what is noise is probably incomplete; feel free to contribute in order to

improve or rewrite it.
|
|
|
*Coherence :* For each language on earth, we have computed ranked letter-appearance occurrences (as best we can). I figured

that intel is worth something here, so I use those records against decoded text to check whether I can detect intelligent design.
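As a toy illustration of that idea (this is not the library's actual scoring, and the letter ranking below is only an approximate, illustrative one):

```python
from collections import Counter
from typing import Sequence

# Approximate top letters for French, for illustration only (not the project's real data).
FRENCH_TOP_LETTERS = ["e", "a", "i", "s", "n", "r", "t", "u", "l", "o"]


def toy_coherence(text: str, ranked_letters: Sequence[str]) -> float:
    """Share of the text's most frequent letters that also belong to the
    language's known top letters. The real measure is more nuanced."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    most_common = [ch for ch, _ in Counter(letters).most_common(len(ranked_letters))]
    hits = sum(1 for ch in most_common if ch in ranked_letters)
    return hits / len(ranked_letters)


print(toy_coherence("Ceci est un petit texte écrit en français.", FRENCH_TOP_LETTERS))
```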
|
|
|
|
|
|
|
- Language detection is unreliable when the text contains two or more languages sharing identical letters (e.g. HTML (English tags) + Turkish content (sharing Latin characters)).

- Every charset detector heavily depends on sufficient content. In common cases, do not bother running detection on very tiny content.
|
|
|
|
|
|
|
**If you are running:** |
|
|
|
- Python >=2.7,<3.5: Unsupported |
|
- Python 3.5: charset-normalizer < 2.1 |
|
- Python 3.6: charset-normalizer < 3.1 |
|
- Python 3.7: charset-normalizer < 4.0 |
|
|
|
Upgrade your Python interpreter as soon as possible. |
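For instance, if you are stuck on an older interpreter, you can pin the matching upper bound explicitly (bounds taken from the list above):

```sh
pip install "charset-normalizer<3.1"  # Python 3.6
pip install "charset-normalizer<4.0"  # Python 3.7
```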
|
|
|
|
|
|
|
Contributions, issues and feature requests are very much welcome.<br /> |
|
Feel free to check the [issues page](https://github.com/ousret/charset_normalizer/issues) if you want to contribute.
|
|
|
|
|
|
|
Copyright © [Ahmed TAHRI @Ousret](https://github.com/Ousret).<br />
|
This project is [MIT](https://github.com/Ousret/charset_normalizer/blob/master/LICENSE) licensed. |
|
|
|
Character frequencies used in this project © 2012 [Denny Vrandečić](http://simia.net/letters/)
|
|
|
|
|
|
|
Professional support for charset-normalizer is available as part of the [Tidelift |
|
Subscription][1]. Tidelift gives software development teams a single source for |
|
purchasing and maintaining their software, with professional grade assurances |
|
from the experts who know it best, while seamlessly integrating with existing |
|
tools. |
|
|
|
[1]: https://tidelift.com/subscription/pkg/pypi-charset-normalizer?utm_source=pypi-charset-normalizer&utm_medium=readme |
|
|
|
|
|
All notable changes to charset-normalizer will be documented in this file. This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). |
|
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/). |
|
|
|
|
|
|
|
|
|
- Unintentional memory usage regression when using a large payload that matches several encodings (

- Regression on some detection cases showcased in the documentation (
|
|
|
|
|
- Noise (md) probe that identifies malformed Arabic representations due to the presence of letters in isolated form (credit to my wife)
|
|
|
|
|
|
|
|
|
- Optional mypyc compilation upgraded to version 1.6.1 for Python >= 3.8 |
|
- Improved the general detection reliability based on reports from the community |
|
|
|
|
|
|
|
|
|
- Allow to execute the CLI (e.g. normalizer) through `python -m charset_normalizer.cli` or `python -m charset_normalizer` |
|
- Support for 9 forgotten encodings that are supported by Python but unlisted in `encodings.aliases` as they have no alias (
|
|
|
|
|
- (internal) Redundant utils.is_ascii function and unused function is_private_use_only |
|
- (internal) charset_normalizer.assets is moved inside charset_normalizer.constant |
|
|
|
|
|
- (internal) Unicode code blocks in constants are updated using the latest v15.0.0 definition to improve detection |
|
- Optional mypyc compilation upgraded to version 1.5.1 for Python >= 3.8 |
|
|
|
|
|
- Unable to properly sort CharsetMatch when both chaos/noise and coherence were close due to an unreachable condition in \_\_lt\_\_ ( |
|
|
|
|
|
|
|
|
|
- Typehint for function `from_path` no longer enforces `PathLike` as its first argument
|
- Minor improvement over the global detection reliability |
|
|
|
|
|
- Introduce function `is_binary` that relies on the main capabilities and is optimized to detect binaries

- Propagate the `enable_fallback` argument throughout `from_bytes`, `from_path`, and `from_fp`, allowing deeper control over the detection (default True)
|
- Explicit support for Python 3.12 |
|
|
|
|
|
- Edge case detection failure where a file would contain a 'very-long' camel-cased word (Issue
|
|
|
|
|
|
|
|
|
- Argument `should_rename_legacy` for the legacy function `detect`; any new arguments are disregarded without errors (PR
|
|
|
|
|
- Support for Python 3.6 (PR |
|
|
|
|
|
- Optional speedup provided by mypy/c 1.0.1 |
|
|
|
|
|
|
|
|
|
- Multi-bytes cutter/chunk generator did not always cut correctly (PR |
|
|
|
|
|
- Speedup provided by mypy/c 0.990 on Python >= 3.7 |
|
|
|
|
|
|
|
|
|
- Extend the capability of explain=True when cp_isolation contains at most two entries (min one); will log the Mess-detector results in detail
|
- Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES |
|
- Add parameter `language_threshold` in `from_bytes`, `from_path` and `from_fp` to adjust the minimum expected coherence ratio |
|
- `normalizer --version` now specifies if the current version provides an extra speedup (meaning a mypyc-compiled wheel)
|
|
|
|
|
- Build with static metadata using 'build' frontend |
|
- Make the language detection stricter |
|
- Optional: Module `md.py` can be compiled using Mypyc to provide an extra speedup up to 4x faster than v2.1 |
|
|
|
|
|
- CLI with opt --normalize fails when using a full path for files

- TooManyAccentuatedPlugin induces false positives on the mess detection when too few alpha characters have been fed to it
|
- Sphinx warnings when generating the documentation |
|
|
|
|
|
- Coherence detector no longer returns 'Simple English'; it returns 'English' instead

- Coherence detector no longer returns 'Classical Chinese'; it returns 'Chinese' instead
|
- Breaking: Method `first()` and `best()` from CharsetMatch |
|
- UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflict with ASCII) |
|
- Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches |
|
- Breaking: Top-level function `normalize` |
|
- Breaking: Properties `chaos_secondary_pass`, `coherence_non_latin` and `w_counter` from CharsetMatch |
|
- Support for the backport `unicodedata2` |
|
|
|
|
|
|
|
|
|
- Extend the capability of explain=True when cp_isolation contains at most two entries (min one); will log the Mess-detector results in detail
|
- Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES |
|
- Add parameter `language_threshold` in `from_bytes`, `from_path` and `from_fp` to adjust the minimum expected coherence ratio |
|
|
|
|
|
- Build with static metadata using 'build' frontend |
|
- Make the language detection stricter |
|
|
|
|
|
- CLI with opt --normalize fails when using a full path for files

- TooManyAccentuatedPlugin induces false positives on the mess detection when too few alpha characters have been fed to it
|
|
|
|
|
- Coherence detector no longer returns 'Simple English'; it returns 'English' instead

- Coherence detector no longer returns 'Classical Chinese'; it returns 'Chinese' instead
|
|
|
|
|
|
|
|
|
- `normalizer --version` now specifies if the current version provides an extra speedup (meaning a mypyc-compiled wheel)
|
|
|
|
|
- Breaking: Method `first()` and `best()` from CharsetMatch |
|
- UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflict with ASCII) |
|
|
|
|
|
- Sphinx warnings when generating the documentation |
|
|
|
|
|
|
|
|
|
- Optional: Module `md.py` can be compiled using Mypyc to provide an extra speedup up to 4x faster than v2.1 |
|
|
|
|
|
- Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches |
|
- Breaking: Top-level function `normalize` |
|
- Breaking: Properties `chaos_secondary_pass`, `coherence_non_latin` and `w_counter` from CharsetMatch |
|
- Support for the backport `unicodedata2` |
|
|
|
|
|
|
|
|
|
- Function `normalize` scheduled for removal in 3.0 |
|
|
|
|
|
- Removed useless call to decode in fn is_unprintable ( |
|
|
|
|
|
- Third-party library (i18n xgettext) crashing not recognizing utf_8 (PEP 263) with underscore from [@aleksandernovikov](https://github.com/aleksandernovikov) ( |
|
|
|
|
|
|
|
|
|
- Output the Unicode table version when running the CLI with `--version` (PR |
|
|
|
|
|
- Re-use decoded buffer for single byte character sets from [@nijel](https://github.com/nijel) (PR |
|
- Fixing some performance bottlenecks from [@deedy5](https://github.com/deedy5) (PR |
|
|
|
|
|
- Workaround potential bug in cpython with Zero Width No-Break Space located in Arabic Presentation Forms-B, Unicode 1.1 not acknowledged as space (PR |
|
- CLI default threshold aligned with the API threshold from [@oleksandr-kuzmenko](https://github.com/oleksandr-kuzmenko) (PR |
|
|
|
|
|
- Support for Python 3.5 (PR |
|
|
|
|
|
- Use of backport unicodedata from `unicodedata2` as Python is quickly catching up, scheduled for removal in 3.0 (PR |
|
|
|
|
|
|
|
|
|
- ASCII mis-detection in rare cases (PR
|
|
|
|
|
|
|
|
|
- Explicit support for Python 3.11 (PR |
|
|
|
|
|
- The logging behavior has been completely reviewed, now using only TRACE and DEBUG levels (PR
|
|
|
|
|
|
|
|
|
- Fallback match entries might lead to UnicodeDecodeError for large bytes sequence (PR |
|
|
|
|
|
- Skipping the language-detection (CD) on ASCII (PR |
|
|
|
|
|
|
|
|
|
- Moderating the logging impact (since 2.0.8) for specific environments (PR |
|
|
|
|
|
- Wrong logging level applied when setting kwarg `explain` to True (PR |
|
|
|
|
|
|
|
- Improvement over Vietnamese detection (PR |
|
- MD improvement on trailing data and long foreign (non-pure latin) data (PR |
|
- Efficiency improvements in cd/alphabet_languages from [@adbar](https://github.com/adbar) (PR |
|
- call sum() without an intermediary list following PEP 289 recommendations from [@adbar](https://github.com/adbar) (PR |
|
- Code style as refactored by Sourcery-AI (PR |
|
- Minor adjustment on the MD around european words (PR |
|
- Remove and replace SRTs from assets / tests (PR |
|
- Initialize the library logger with a `NullHandler` by default from [@nmaynes](https://github.com/nmaynes) (PR |
|
- Setting kwarg `explain` to True will add provisionally (bounded to function lifespan) a specific stream handler (PR |
|
|
|
|
|
- Fix large (misleading) sequence giving UnicodeDecodeError (PR |
|
- Avoid using too insignificant chunk (PR |
|
|
|
|
|
- Add and expose function `set_logging_handler` to configure a specific StreamHandler from [@nmaynes](https://github.com/nmaynes) (PR |
|
- Add `CHANGELOG.md` entries, format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) (PR |
|
|
|
|
|
|
|
- Add support for Kazakh (Cyrillic) language detection (PR |
|
|
|
|
|
- Further, improve inferring the language from a given single-byte code page (PR |
|
- Vainly trying to leverage PEP263 when PEP3120 is not supported (PR |
|
- Refactoring for potential performance improvements in loops from [@adbar](https://github.com/adbar) (PR |
|
- Various detection improvement (MD+CD) (PR |
|
|
|
|
|
- Remove redundant logging entry about detected language(s) (PR |
|
|
|
|
|
- Fix a minor inconsistency between Python 3.5 and other versions regarding language detection (PR |
|
|
|
|
|
|
|
- Unforeseen regression with the loss of the backward-compatibility with some older minor of Python 3.5.x (PR |
|
- Fix CLI crash when using --minimal output in certain cases (PR |
|
|
|
|
|
- Minor improvement to the detection efficiency (less than 1%) (PR |
|
|
|
|
|
|
|
- The project now complies with: flake8, mypy, isort and black to ensure a better overall quality (PR
|
- The BC-support with v1.x was improved, the old staticmethods are restored (PR |
|
- The Unicode detection is slightly improved (PR |
|
- Add syntax sugar \_\_bool\_\_ for results CharsetMatches list-container (PR |
|
|
|
|
|
- The project no longer raises a warning on tiny content given for detection; it will simply be logged as a warning instead (PR
|
|
|
|
|
- In some rare case, the chunks extractor could cut in the middle of a multi-byte character and could mislead the mess detection (PR |
|
- Some rare 'space' characters could trip up the UnprintablePlugin/Mess detection (PR |
|
- The MANIFEST.in was not exhaustive (PR |
|
|
|
|
|
|
|
- The CLI no longer raises an unexpected exception when no encoding has been found (PR
|
- Fix accessing the 'alphabets' property when the payload contains surrogate characters (PR |
|
- The logger could mislead (explain=True) on detected languages and the impact of one MBCS match (PR |
|
- Submatch factoring could be wrong in rare edge cases (PR |
|
- Multiple files given to the CLI were ignored when publishing results to STDOUT. (After the first path) (PR |
|
- Fix line endings from CRLF to LF for certain project files (PR |
|
|
|
|
|
- Adjust the MD to lower the sensitivity, thus improving the global detection reliability (PR |
|
- Allow fallback on specified encoding if any (PR |
|
|
|
|
|
|
|
- Part of the detection mechanism has been improved to be less sensitive, resulting in more accurate detection results. Especially ASCII. (PR |
|
- According to the community wishes, the detection will fall back on ASCII or UTF-8 in a last-resort case. (PR |
|
|
|
|
|
|
|
- Empty/Too small JSON payload mis-detection fixed. Report from [@tseaver](https://github.com/tseaver) (PR
|
|
|
|
|
- Don't inject unicodedata2 into sys.modules from [@akx](https://github.com/akx) (PR |
|
|
|
|
|
|
|
- Make it work where there isn't a filesystem available, dropping assets frequencies.json. Report from [@sethmlarson](https://github.com/sethmlarson). (PR |
|
- Using explain=False permanently disables the verbose output in the current runtime (PR

- One log entry (language target preemptive) was not shown in logs when using explain=True (PR
|
- Fix undesired exception (ValueError) on getitem of instance CharsetMatches (PR |
|
|
|
|
|
- Public function normalize default argument values were not aligned with from_bytes (PR
|
|
|
|
|
- You may now use charset aliases in cp_isolation and cp_exclusion arguments (PR |
|
|
|
|
|
|
|
- 4x to 5 times faster than the previous 1.4.0 release. At least 2x faster than Chardet. |
|
- Emphasis has been put on UTF-8 detection; it should perform nearly instantaneously.
|
- The backward compatibility with Chardet has been greatly improved. The legacy detect function returns an identical charset name whenever possible. |
|
- The detection mechanism has been slightly improved, now Turkish content is detected correctly (most of the time) |
|
- The program has been rewritten to ease the readability and maintainability. (+ Using static typing)
|
- utf_7 detection has been reinstated. |
|
|
|
|
|
- This package no longer requires anything when used with Python 3.5 (Dropped cached_property)

- Removed support for these languages: Catalan, Esperanto, Kazakh, Basque, Volapük, Azeri, Galician, Nynorsk, Macedonian, and Serbocroatian.
|
- The exception hook on UnicodeDecodeError has been removed. |
|
|
|
|
|
- Methods coherence_non_latin, w_counter, chaos_secondary_pass of the class CharsetMatch are now deprecated and scheduled for removal in v3.0 |
|
|
|
|
|
- The CLI output used the relative path of the file(s). Should be absolute. |
|
|
|
|
|
|
|
- Logger configuration/usage no longer conflict with others (PR |
|
|
|
|
|
|
|
- Using standard logging instead of using the package loguru. |
|
- Dropping nose test framework in favor of the maintained pytest. |
|
- Choose to not use dragonmapper package to help with gibberish Chinese/CJK text. |
|
- Require cached_property only for Python 3.5 due to constraint. Dropping for every other interpreter version. |
|
- Stop support for UTF-7 that does not contain a SIG. |
|
- Dropping PrettyTable, replaced with pure JSON output in CLI. |
|
|
|
|
|
- BOM marker in a CharsetNormalizerMatch instance could be False in rare cases even if obviously present. Due to the sub-match factoring process. |
|
- Not searching properly for the BOM when trying utf32/16 parent codec. |
|
|
|
|
|
- Improving the package final size by compressing frequencies.json. |
|
- Huge improvement over the largest payloads.
|
|
|
|
|
- CLI now produces JSON consumable output. |
|
- Return ASCII if given sequences fit. Given reasonable confidence. |
|
|
|
|
|
|
|
|
|
- In some very rare cases, you may end up getting encode/decode errors due to a bad bytes payload (PR |
|
|
|
|
|
|
|
|
|
- Empty given payload for detection may cause an exception if trying to access the `alphabets` property. (PR |
|
|
|
|
|
|
|
|
|
- The legacy detect function should return UTF-8-SIG if sig is present in the payload. (PR |
|
|
|
|
|
|
|
|
|
- Amend the previous release to allow prettytable 2.0 (PR |
|
|
|
|
|
|
|
|
|
- Fix error while using the package with a python pre-release interpreter (PR |
|
|
|
|
|
- Dependencies refactoring, constraints revised. |
|
|
|
|
|
- Add python 3.9 and 3.10 to the supported interpreters |
|
|
|
MIT License |
|
|
|
Copyright (c) 2019 TAHRI Ahmed R. |
|
|
|
Permission is hereby granted, free of charge, to any person obtaining a copy |
|
of this software and associated documentation files (the "Software"), to deal |
|
in the Software without restriction, including without limitation the rights |
|
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell |
|
copies of the Software, and to permit persons to whom the Software is |
|
furnished to do so, subject to the following conditions: |
|
|
|
The above copyright notice and this permission notice shall be included in all |
|
copies or substantial portions of the Software. |
|
|
|
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR |
|
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, |
|
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE |
|
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER |
|
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, |
|
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE |
|
SOFTWARE. |
|
|