FreeRDP
unicode.c File Reference
#include <errno.h>
#include <wctype.h>
#include <winpr/crt.h>
#include <winpr/error.h>
#include <winpr/print.h>
#include "utf.h"
#include "../log.h"

Macros

#define TAG   WINPR_TAG("unicode")
 

Functions

int MultiByteToWideChar (UINT CodePage, DWORD dwFlags, LPCSTR lpMultiByteStr, int cbMultiByte, LPWSTR lpWideCharStr, int cchWideChar)
 
int WideCharToMultiByte (UINT CodePage, DWORD dwFlags, LPCWSTR lpWideCharStr, int cchWideChar, LPSTR lpMultiByteStr, int cbMultiByte, LPCSTR lpDefaultChar, LPBOOL lpUsedDefaultChar)
 
int ConvertToUnicode (UINT CodePage, DWORD dwFlags, LPCSTR lpMultiByteStr, int cbMultiByte, LPWSTR *lpWideCharStr, int cchWideChar)
 
int ConvertFromUnicode (UINT CodePage, DWORD dwFlags, LPCWSTR lpWideCharStr, int cchWideChar, LPSTR *lpMultiByteStr, int cbMultiByte, LPCSTR lpDefaultChar, LPBOOL lpUsedDefaultChar)
 
void ByteSwapUnicode (WCHAR *wstr, int length)
 

Macro Definition Documentation

#define TAG   WINPR_TAG("unicode")

WinPR: Windows Portable Runtime Unicode Conversion (CRT)

Copyright 2012 Marc-Andre Moreau <marcandre.moreau@gmail.com>

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Function Documentation

void ByteSwapUnicode ( WCHAR *  wstr,
int  length 
)

Swap Unicode byte order (UTF16LE <-> UTF16BE)
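
The byte-order swap can be illustrated with a minimal sketch. The helper name swap_utf16_byte_order and the use of uint16_t in place of WCHAR are assumptions made for this example; this is not the WinPR implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not WinPR code): swaps the two bytes of every
 * 16-bit code unit in place, turning UTF-16LE data into UTF-16BE or back. */
static void swap_utf16_byte_order(uint16_t* units, size_t count)
{
    assert(units != NULL || count == 0);

    for (size_t i = 0; i < count; i++)
        units[i] = (uint16_t)((units[i] << 8) | (units[i] >> 8));
}
```

Applying the swap twice restores the original buffer, which is why one routine can serve both directions.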

int ConvertFromUnicode ( UINT  CodePage,
DWORD  dwFlags,
LPCWSTR  lpWideCharStr,
int  cchWideChar,
LPSTR *  lpMultiByteStr,
int  cbMultiByte,
LPCSTR  lpDefaultChar,
LPBOOL  lpUsedDefaultChar 
)

ConvertFromUnicode is a convenience wrapper for WideCharToMultiByte:

If the lpMultiByteStr parameter for the converted string points to NULL, or if the cbMultiByte parameter is set to 0, this function automatically allocates the required memory, which is guaranteed to be null-terminated after the conversion even if the source Unicode string is not.

If the cchWideChar parameter is set to -1, the passed lpWideCharStr must be null-terminated, and the required length for the converted string is calculated accordingly.
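
The allocate-on-demand behavior described here can be sketched without WinPR. Everything in the block is hypothetical: narrow_units is a stand-in converter that only handles ASCII, and convert_alloc is an illustrative wrapper, not the ConvertFromUnicode source. The return value convention (bytes written, excluding the terminator) is also an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in converter (assumption): narrows 16-bit units to bytes and
 * replaces anything outside ASCII with '?'. With dst_size == 0 it only
 * reports how many bytes are needed, similar to a size query. */
static int narrow_units(const uint16_t* src, int src_len, char* dst, int dst_size)
{
    if (dst_size == 0)
        return src_len;
    if (dst_size < src_len)
        return 0; /* caller-supplied buffer is too small */

    for (int i = 0; i < src_len; i++)
        dst[i] = (src[i] < 0x80) ? (char)src[i] : '?';
    return src_len;
}

/* Hypothetical wrapper: allocates the output when *dst is NULL or
 * dst_size is 0, and null-terminates any buffer it allocated. */
static int convert_alloc(const uint16_t* src, int src_len, char** dst, int dst_size)
{
    assert(src != NULL && dst != NULL);

    if (src_len == -1)
    {
        /* -1 means the source is null-terminated, so measure it. */
        src_len = 0;
        while (src[src_len] != 0)
            src_len++;
    }

    int allocated = 0;
    if (*dst == NULL || dst_size == 0)
    {
        dst_size = narrow_units(src, src_len, NULL, 0);
        *dst = calloc((size_t)dst_size + 1, 1); /* +1 for the terminator */
        if (*dst == NULL)
            return 0;
        allocated = 1;
    }

    int written = narrow_units(src, src_len, *dst, dst_size);
    if (written == 0 && src_len != 0)
    {
        if (allocated)
        {
            free(*dst);
            *dst = NULL;
        }
        return 0;
    }

    if (allocated)
        (*dst)[written] = '\0';
    return written;
}
```

In this sketch the caller passes a pointer initialized to NULL and a size of 0, then releases the result with free() once it is no longer needed.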

int ConvertToUnicode ( UINT  CodePage,
DWORD  dwFlags,
LPCSTR  lpMultiByteStr,
int  cbMultiByte,
LPWSTR *  lpWideCharStr,
int  cchWideChar 
)

ConvertToUnicode is a convenience wrapper for MultiByteToWideChar:

If the lpWideCharStr parameter for the converted string points to NULL, or if the cchWideChar parameter is set to 0, this function automatically allocates the required memory, which is guaranteed to be null-terminated after the conversion even if the source C string is not.

If the cbMultiByte parameter is set to -1, the passed lpMultiByteStr must be null-terminated, and the required length for the converted string is calculated accordingly.

int MultiByteToWideChar ( UINT  CodePage,
DWORD  dwFlags,
LPCSTR  lpMultiByteStr,
int  cbMultiByte,
LPWSTR  lpWideCharStr,
int  cchWideChar 
)

Notes on cross-platform Unicode portability:

Unicode has several Unicode Transformation Format (UTF) encodings, of which the most commonly used are UTF-8, UTF-16 and, less often, UTF-32.

The number in the UTF encoding name (8, 16, 32) refers to the number of bits per code unit. A code unit is the minimal bit combination that can represent a unit of encoded text in the given encoding. For instance, UTF-8 encodes the English alphabet using 8 bits (or one byte) each, just like in ASCII.

However, the Unicode codespace runs from U+0000 to U+10FFFF, so a code point needs up to 21 bits, and of the three code unit sizes only 32 bits can hold every code point in a single unit. This means that for UTF-8 and UTF-16, more than one code unit may be required to encode a given code point. UTF-8 and UTF-16 are therefore variable-width encodings, while UTF-32 is fixed-width.

UTF-8 has the advantage of being backwards compatible with ASCII, and is one of the most commonly used Unicode encodings.
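
To make the variable-width point concrete, the following sketch counts how many code units a single code point occupies in UTF-8 and UTF-16. The helper names are hypothetical and the functions assume a valid code point no greater than U+10FFFF.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 8-bit code units UTF-8 needs for one code point. */
static int utf8_code_units(uint32_t cp)
{
    assert(cp <= 0x10FFFF);

    if (cp < 0x80)
        return 1; /* ASCII range */
    if (cp < 0x800)
        return 2;
    if (cp < 0x10000)
        return 3;
    return 4;
}

/* Number of 16-bit code units UTF-16 needs for one code point:
 * one inside the Basic Multilingual Plane, a surrogate pair above it. */
static int utf16_code_units(uint32_t cp)
{
    assert(cp <= 0x10FFFF);

    return (cp < 0x10000) ? 1 : 2;
}
```

For example, 'A' (U+0041) takes one unit in both encodings, while U+1F600 takes four UTF-8 units and two UTF-16 units. UTF-32 always uses exactly one unit.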

UTF-16 is used everywhere in the Windows API. The strategy employed by Microsoft to provide backwards compatibility in their API was to create an ANSI and a Unicode version of the same function, ending with A (ANSI) and W (Wide character, or UTF-16 Unicode). In headers, the original function name is replaced by a macro that expands to either the ANSI or the Unicode version, depending on whether the UNICODE macro is defined (the related _UNICODE macro controls the C runtime's generic-text routines).

UTF-32 has the advantage of being fixed-width, but wastes a lot of space for English text (4x the size of UTF-8, 2x the size of UTF-16).

In C, wide character strings are often defined with the wchar_t type. Many functions are provided to deal with those wide character strings, such as wcslen (strlen equivalent) or wprintf (printf equivalent).

This may lead to some confusion, since many of these functions exist on both Windows and Linux, but they are not the same!

This sample hello world is a good example:

#include <wchar.h>

wchar_t hello[] = L"Hello, World!\n";

int main(int argc, char** argv)
{
    wprintf(hello);
    /* sizeof yields a size_t, so cast it to match the %d conversion */
    wprintf(L"sizeof(wchar_t): %d\n", (int) sizeof(wchar_t));
    return 0;
}

There is a reason why the sample prints the size of the wchar_t type: on Windows, wchar_t is 2 bytes (UTF-16), while on most other systems it is 4 bytes (UTF-32). This means that if you write code on Windows and use L"" to define a string that is meant to be UTF-16, you will get a UTF-32 string instead when you port your code to Linux.

Since the Windows API uses UTF-16, not UTF-32, WinPR defines the WCHAR type to always be 2 bytes long and uses it instead of wchar_t. Do not ever use wchar_t with WinPR unless you know what you are doing.

As for L"", it is unfortunately unusable in a portable way unless a special option (-fshort-wchar) is passed to GCC to define wchar_t as being two bytes. For string constants that must be UTF-16, it is a pain, but they can be defined in a portable way like this:

WCHAR hello[] = { 'H','e','l','l','o','\0' };

Such strings cannot be passed to native functions like wcslen(), which may expect a different wchar_t size. For this reason, WinPR provides _wcslen, which expects UTF-16 WCHAR strings on all platforms.
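
A length routine that is independent of the platform's wchar_t size can be sketched as below. The name utf16_strlen and the WCHAR16 typedef are assumptions for this example; this is not the source of WinPR's _wcslen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for WinPR's WCHAR (assumption): always exactly 16 bits. */
typedef uint16_t WCHAR16;

/* Counts 16-bit code units up to, but not including, the terminating 0. */
static size_t utf16_strlen(const WCHAR16* s)
{
    assert(s != NULL);

    const WCHAR16* p = s;
    while (*p != 0)
        p++;
    return (size_t)(p - s);
}
```

The result is a count of code units, not code points: a character encoded as a surrogate pair contributes 2 to the length.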

int WideCharToMultiByte ( UINT  CodePage,
DWORD  dwFlags,
LPCWSTR  lpWideCharStr,
int  cchWideChar,
LPSTR  lpMultiByteStr,
int  cbMultiByte,
LPCSTR  lpDefaultChar,
LPBOOL  lpUsedDefaultChar 
)
