author | Jehan <Jehan@web> | 2022-12-08 21:25:10 +0000 |
---|---|---|
committer | IkiWiki <ikiwiki.info> | 2022-12-08 21:25:10 +0000 |
commit | 3a0af4a4de3a6c18676803c8abbdc89369768d7e (patch) | |
tree | 25a302f0de12a86d7587254fc41d0715656ed2e4 | |
parent | 245c8ae7967220f4be4e96b21f47400790802fb1 (diff) |
-rw-r--r-- | Software/uchardet.mdwn | 68 |
1 files changed, 62 insertions, 6 deletions
diff --git a/Software/uchardet.mdwn b/Software/uchardet.mdwn
index 57b6986c..cbd41450 100644
--- a/Software/uchardet.mdwn
+++ b/Software/uchardet.mdwn
@@ -4,13 +4,9 @@
 
 uchardet started as a C language binding of the original C++ implementation of the universal charset detection library by Mozilla. It can now detect more charsets, and more reliably than the original implementation.
 
-The original code of universalchardet is available at <http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/>
-
-Techniques used by universalchardet are described at <http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html>
-
 Report bugs and contribute patches at (check opened bugs first): <https://gitlab.freedesktop.org/uchardet/uchardet/-/issues>
 
-Last release: [uchardet version 0.0.7](https://www.freedesktop.org/software/uchardet/releases/) ([release note](https://gitlab.freedesktop.org/uchardet/uchardet/-/releases/v0.0.7), [git repository for dev code](https://gitlab.freedesktop.org/uchardet/uchardet.git))
+Last release: [uchardet version 0.0.8](https://www.freedesktop.org/software/uchardet/releases/) ([release note](https://gitlab.freedesktop.org/uchardet/uchardet/-/releases/v0.0.8), [git repository for dev code](https://gitlab.freedesktop.org/uchardet/uchardet.git))
 
 ## Supported Languages/Encodings
 
@@ -43,6 +39,7 @@ Last release: [uchardet version 0.0.7](https://www.freedesktop.org/software/ucha
     * IBM852
     * MAC-CENTRALEUROPE
   * Danish
+    * IBM865
     * ISO-8859-1
     * ISO-8859-15
     * WINDOWS-1252
@@ -107,6 +104,11 @@ Last release: [uchardet version 0.0.7](https://www.freedesktop.org/software/ucha
     * ISO-8859-13
   * Maltese
     * ISO-8859-3
+  * Norwegian
+    * IBM865
+    * ISO-8859-1
+    * ISO-8859-15
+    * WINDOWS-1252
   * Polish:
     * ISO-8859-2
     * ISO-8859-13
@@ -186,6 +188,10 @@ Last release: [uchardet version 0.0.7](https://www.freedesktop.org/software/ucha
 
     brew install uchardet
 
+or
+
+    port install uchardet
+
 ### Windows
 
 Binary packages are provided in Fedora and Msys2 repositories. There may
@@ -198,7 +204,8 @@ to use MinGW-w64 instead of MinGW, in particular to build both 32 and 64-bit DLL
 libraries).
 
 Note also that it is very easily cross-buildable (for instance from a
-GNU/Linux machine).
+GNU/Linux machine); [crossroad](https://pypi.org/project/crossroad/) may
++help, this is what we use in our CI).
 
 ### Build from source
 
@@ -233,6 +240,29 @@ Here is a working "module" section to include in your Flatpak's json manifest:
         }
     ]
 
+### Build with CMake exported targets
+
+uchardet installs a standard pkg-config file which will make it easily
+discoverable by any modern build system. Nevertheless if your project also uses
+CMake and you want to discover uchardet installation using CMake exported
+targets, you may find and link uchardet with:
+
+    project(sample LANGUAGES C)
+    find_package ( uchardet )
+    if (uchardet_FOUND)
+      add_executable( sample sample.c )
+      target_link_libraries ( sample PRIVATE uchardet::libuchardet )
+    endif ()
+
+Note though that we recommend the library discovery with `pkg-config` because it
+is standard and generic. Therefore it will always work, even if we decided to
+change our own build system (which is not planned right now, but may always
+happen). This is why we advise to use standard `pkg-config` discovery.
+
+Some more CMake specificities may be found in the [commit
+message](https://gitlab.freedesktop.org/uchardet/uchardet/-/commit/d7dad549bd5a3442b92e861bcd2c5cda2adeea27)
+which implemented such support.
+
 ## Usage
 
 ### Command Line
@@ -251,6 +281,32 @@ Here is a working "module" section to include in your Flatpak's json manifest:
 
 See [[uchardet.h|https://cgit.freedesktop.org/uchardet/uchardet/tree/src/uchardet.h]]
 
+## History
+
+As said in introduction, this was initially a project of Mozilla to
+allow better detection of page encodings, and it used to be part of
+Firefox. If not mistaken, this is not the case anymore (probably because
+nowadays most websites better announce their encoding, and also UTF-8 is
+much more widely spread).
+
+Techniques used by universalchardet are described at https://www-archive.mozilla.org/projects/intl/universalcharsetdetection
+
+It is to be noted that a lot has changed since the original code, yet
+the base concept is still around, basing detection not just on encoding
+rules, but importantly on analysis of character statistics in languages.
+
+Original code of `universalchardet` by Mozilla can still be found from the
+[Wayback machine](https://web.archive.org/web/20150730144356/http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/).
+
+Mozilla code was extracted and packaged into a standalone library under
+the name `uchardet` by BYVoid in 2011, in a personal repository.
+Starting 2015, I (i.e. Jehan) started contributing, "standardized"
+the output to be iconv-compatible, added various encoding/language
+support and streamlined generation of sources for new support of
+encoding/languages by using texts from Wikipedia as statistics source on
+languages through Python scripts. Then I soon became co-maintainer.
+In 2016, `uchardet` became a freedesktop project.
+
 ## Related Projects
 
   * [[python-chardet|https://github.com/chardet/chardet]] Python port
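
The "Build with CMake exported targets" text added by this commit recommends `pkg-config` discovery over the exported target but only shows the `find_package` route. For comparison, here is a minimal sketch of the `pkg-config` route from a CMake project. It is not taken from the commit: it assumes the installed pkg-config module is named `uchardet`, and it reuses the `sample` / `sample.c` names from the snippet in the diff as placeholders.

```cmake
# Sketch only: discover uchardet through pkg-config instead of its
# CMake exported target. The IMPORTED_TARGET option needs CMake >= 3.6.
cmake_minimum_required(VERSION 3.6)
project(sample LANGUAGES C)

# FindPkgConfig ships with CMake and provides pkg_check_modules().
find_package(PkgConfig REQUIRED)

# Assumption: the .pc file installed by uchardet is named "uchardet".
pkg_check_modules(UCHARDET REQUIRED IMPORTED_TARGET uchardet)

# "sample" and "sample.c" are placeholder names for your own target.
add_executable(sample sample.c)
target_link_libraries(sample PRIVATE PkgConfig::UCHARDET)
```

Because this relies only on the pkg-config file, it should keep working if the library's own build system changes, which is the reason the commit text gives for preferring it.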