author Jehan <Jehan@web> 2022-12-08 21:25:10 +0000
committer IkiWiki <ikiwiki.info> 2022-12-08 21:25:10 +0000
commit 3a0af4a4de3a6c18676803c8abbdc89369768d7e
tree 25a302f0de12a86d7587254fc41d0715656ed2e4
parent 245c8ae7967220f4be4e96b21f47400790802fb1
-rw-r--r-- Software/uchardet.mdwn 68
1 files changed, 62 insertions, 6 deletions
diff --git a/Software/uchardet.mdwn b/Software/uchardet.mdwn
index 57b6986c..cbd41450 100644
--- a/Software/uchardet.mdwn
+++ b/Software/uchardet.mdwn
@@ -4,13 +4,9 @@
uchardet started as a C language binding of the original C++ implementation of the universal charset detection library by Mozilla. It can now detect more charsets, and more reliably than the original implementation.
-The original code of universalchardet is available at <http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/>
-
-Techniques used by universalchardet are described at <http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html>
-
Report bugs and contribute patches at (check open bugs first): <https://gitlab.freedesktop.org/uchardet/uchardet/-/issues>
-Last release: [uchardet version 0.0.7](https://www.freedesktop.org/software/uchardet/releases/) ([release note](https://gitlab.freedesktop.org/uchardet/uchardet/-/releases/v0.0.7), [git repository for dev code](https://gitlab.freedesktop.org/uchardet/uchardet.git))
+Last release: [uchardet version 0.0.8](https://www.freedesktop.org/software/uchardet/releases/) ([release note](https://gitlab.freedesktop.org/uchardet/uchardet/-/releases/v0.0.8), [git repository for dev code](https://gitlab.freedesktop.org/uchardet/uchardet.git))
## Supported Languages/Encodings
@@ -43,6 +39,7 @@ Last release: [uchardet version 0.0.7](https://www.freedesktop.org/software/ucha
* IBM852
* MAC-CENTRALEUROPE
* Danish
+ * IBM865
* ISO-8859-1
* ISO-8859-15
* WINDOWS-1252
@@ -107,6 +104,11 @@ Last release: [uchardet version 0.0.7](https://www.freedesktop.org/software/ucha
* ISO-8859-13
* Maltese
* ISO-8859-3
+ * Norwegian
+ * IBM865
+ * ISO-8859-1
+ * ISO-8859-15
+ * WINDOWS-1252
* Polish:
* ISO-8859-2
* ISO-8859-13
@@ -186,6 +188,10 @@ Last release: [uchardet version 0.0.7](https://www.freedesktop.org/software/ucha
    brew install uchardet
+or
+
+    port install uchardet
+
### Windows
Binary packages are provided in Fedora and Msys2 repositories. There may
@@ -198,7 +204,8 @@ to use MinGW-w64 instead of MinGW, in particular to build both 32 and
64-bit DLL libraries).
Note also that it is very easily cross-buildable (for instance from a
-GNU/Linux machine).
+GNU/Linux machine); [crossroad](https://pypi.org/project/crossroad/) may
+help, as this is what we use in our CI.
### Build from source
@@ -233,6 +240,29 @@ Here is a working "module" section to include in your Flatpak's json manifest:
}
]
+### Build with CMake exported targets
+
+uchardet installs a standard pkg-config file, which makes it easily
+discoverable by any modern build system. Nevertheless, if your project also
+uses CMake and you want to discover the uchardet installation through CMake
+exported targets, you may find and link uchardet with:
+
+    project(sample LANGUAGES C)
+    find_package ( uchardet )
+    if (uchardet_FOUND)
+        add_executable( sample sample.c )
+        target_link_libraries ( sample PRIVATE uchardet::libuchardet )
+    endif ()
+
+Note though that we recommend library discovery with `pkg-config` because
+it is standard and generic: it will keep working even if we decide to
+change our own build system (which is not planned right now, but may
+always happen).
+
+More CMake specifics may be found in the [commit
+message](https://gitlab.freedesktop.org/uchardet/uchardet/-/commit/d7dad549bd5a3442b92e861bcd2c5cda2adeea27)
+which implemented this support.
+
## Usage
### Command Line
@@ -251,6 +281,32 @@ Here is a working "module" section to include in your Flatpak's json manifest:
See [[uchardet.h|https://cgit.freedesktop.org/uchardet/uchardet/tree/src/uchardet.h]]
+## History
+
+As said in the introduction, this was initially a Mozilla project to
+allow better detection of page encodings, and it used to be part of
+Firefox. If I am not mistaken, this is no longer the case (probably because
+most websites now announce their encoding properly, and UTF-8 is
+much more widespread).
+
+Techniques used by universalchardet are described at <https://www-archive.mozilla.org/projects/intl/universalcharsetdetection>
+
+Note that a lot has changed since the original code, yet the base
+concept remains: detection relies not just on encoding rules but,
+importantly, on the analysis of character statistics in languages.
+
+The original code of `universalchardet` by Mozilla can still be found on the
+[Wayback Machine](https://web.archive.org/web/20150730144356/http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/).
+
+The Mozilla code was extracted and packaged into a standalone library under
+the name `uchardet` by BYVoid in 2011, in a personal repository.
+In 2015, I (i.e. Jehan) started contributing: I "standardized"
+the output to be iconv-compatible, added support for various
+encodings and languages, and streamlined the generation of sources for new
+encoding/language support with Python scripts that use texts from Wikipedia
+as the source of language statistics. I soon became co-maintainer.
+In 2016, `uchardet` became a freedesktop project.
+
## Related Projects
* [[python-chardet|https://github.com/chardet/chardet]] Python port