Cross Platform Conversion Between string and wstring
This article describes a cross platform method of converting between an STL string and an STL wstring. The technique described in this article does not make use of any external libraries, nor does it make use of any operating system specific APIs. It uses only features that are part of the Standard Template Library.
Problem Description
The Standard Template Library provides the basic_string template class to represent a sequence of characters. It supports all the usual operations of a sequence as well as standard string operations such as search and concatenation.
There are two common specializations of the basic_string template class. They are string, which is a typedef for basic_string<char>, and wstring, which is a typedef for basic_string<wchar_t>.
In addition, there are two specializations of the basic_string template class that are new to C++11. They are u16string, which is a typedef for basic_string<char16_t>, and u32string, which is a typedef for basic_string<char32_t>.
A question that is often asked on the Internet is how to convert between string and wstring. This may be necessary when calling an API function that is designed for one type of string from code that uses a different type of string. For example, your code might use wstring objects and you may need to call an API function that uses string objects. In this case you will need to convert the wstring to a string before calling the API function.
Unfortunately the Standard Template Library does not provide a simple means of doing this conversion, which is why the question is asked so often. A Google search for convert between string and wstring returns about 76,300 results. Unfortunately many of the answers that are provided are incorrect.
The Incorrect Answer
The most common incorrect answer found on the Internet can be found in the article Convert std::string to std::wstring on Mijalko.com. The solution provided in that article is as follows.
// std::string -> std::wstring
std::string s("string");
std::wstring ws;
ws.assign(s.begin(), s.end());

// std::wstring -> std::string
std::wstring ws(L"wstring");
std::string s;
s.assign(ws.begin(), ws.end());
The problem is that this solution compiles, runs, and, for the simple string constants used in this example code, appears to yield the correct results. Many substandard computer programmers will assume that since it works for these very simple strings, they have found the correct solution. Then they translate their application into languages other than English and are extremely surprised to discover that it is riddled with bugs that only affect their non-English-speaking customers!
The problem is that this solution only works correctly for ASCII characters in the range of 0 to 127. If your strings contain even a single character with a numerical value greater than 127, this simple solution will yield incorrect results. In other words, this simple solution will yield incorrect results if your strings contain any characters other than a through z, A through Z, the numerals 0 through 9, and a few punctuation marks.
This means that any application that uses this technique will not support Chinese. It will not even properly support Spanish, since Spanish uses several characters that are outside of the ASCII character set, such as ñ (eñe) and the accented vowels á, é, í, ó, and ú.
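To make the failure concrete, the following is a minimal sketch of what the assign() approach does to a single non-ASCII character. It assumes the narrow string holds UTF-8 (an assumption of this example, not something the incorrect answer states), in which ñ is the two bytes 0xC3 0xB1. The helper names naive_widen and naive_copy_preserves_enye are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: the element-by-element copy from the incorrect answer.
inline std::wstring naive_widen(const std::string& s)
{
    std::wstring ws;
    ws.assign(s.begin(), s.end());  // each char becomes one wchar_t, no decoding
    return ws;
}

// Returns true only if the naive copy yields the single character U+00F1.
inline bool naive_copy_preserves_enye()
{
    const std::string utf8("\xC3\xB1");  // assumed UTF-8 bytes for U+00F1
    const std::wstring expected(1, static_cast<wchar_t>(0x00F1));
    return naive_widen(utf8) == expected;
}
```

Under that assumption the copy yields two wide characters, one per byte, and neither is U+00F1, so naive_copy_preserves_enye() returns false. Plain ASCII input such as "abc" still round-trips, which is exactly why the bug goes unnoticed.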
The Deprecated Solution
In C++11 several classes were added to the Localization library to allow for easy string and stream conversions. These classes are as follows.
- std::wstring_convert
- std::codecvt_utf8 (for UTF-8/UCS2 and UTF-8/UCS4 conversions)
- std::codecvt_utf8_utf16 (for UTF-8/UTF-16 conversions)
These classes make it possible to do the necessary conversions as follows.
#include <codecvt>
#include <locale>
#include <string>

using convert_type = std::codecvt_utf8<wchar_t>;
using converter_type = std::wstring_convert<convert_type, wchar_t>;
converter_type converter;

// std::string -> std::wstring
std::string s("string");
std::wstring wsout = converter.from_bytes(s);

// std::wstring -> std::string
std::wstring ws(L"wstring");
std::string sout = converter.to_bytes(ws);
Unfortunately, these classes were deprecated in C++17.
The Correct Solution
This article describes a cross platform solution for this problem that has the following characteristics.
- Uses template functions and explicit template specialization.
- Uses a syntax very similar to that of the static_cast operator.
- Uses no external libraries or operating system specific APIs.
- Uses only the Standard Template Library.
The Functions
namespace yekneb
{
    // Primary template: used when Target and Source are the same string type.
    template<typename Target, typename Source>
    inline Target string_cast(const Source& source)
    {
        return source;
    }

    // std::string -> std::wstring, using the current global locale.
    template<>
    inline std::wstring string_cast(const std::string& source)
    {
        std::locale loc;
        return ::yekneb::detail::string_cast::s2w(source, loc);
    }

    // std::wstring -> std::string, using the current global locale.
    template<>
    inline std::string string_cast(const std::wstring& source)
    {
        std::locale loc;
        return ::yekneb::detail::string_cast::w2s(source, loc);
    }
}
Three additional functions are provided that allow the user to specify the locale parameter.
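The article does not list those locale-taking functions, so the following is only a sketch of what such overloads might look like. The namespace sketch is hypothetical, and the bodies use the std::ctype facet directly as a simplified stand-in for the s2w and w2s functions described later; the real header may differ.

```cpp
#include <cassert>
#include <locale>
#include <string>

namespace sketch  // hypothetical namespace, not the article's yekneb
{
    // Primary template: same-type "conversions" simply copy the source.
    template <typename Target, typename Source>
    inline Target string_cast(const Source& source, const std::locale&)
    {
        return source;
    }

    // std::string -> std::wstring using the ctype facet of the given locale.
    template <>
    inline std::wstring string_cast<std::wstring, std::string>(
        const std::string& source, const std::locale& loc)
    {
        std::wstring result(source.size(), L'\0');
        if (!source.empty())
        {
            std::use_facet<std::ctype<wchar_t> >(loc).widen(
                source.data(), source.data() + source.size(), &result[0]);
        }
        return result;
    }

    // std::wstring -> std::string; characters that cannot be narrowed become '?'.
    template <>
    inline std::string string_cast<std::string, std::wstring>(
        const std::wstring& source, const std::locale& loc)
    {
        std::string result(source.size(), '\0');
        if (!source.empty())
        {
            std::use_facet<std::ctype<wchar_t> >(loc).narrow(
                source.data(), source.data() + source.size(), '?', &result[0]);
        }
        return result;
    }
}
```

A call would then look like sketch::string_cast<std::wstring>(std::string("abc"), std::locale::classic()). Note that the source argument must already be a std::string or std::wstring object; a bare string literal would deduce an array type and select the primary template.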
The functions are used as follows.
std::wstring wIn(L"Hello World! It is truly a wonderful day to be alive.");
std::string sIn("Hello World! It is truly a wonderful day to be alive.");
std::string sOut = yekneb::string_cast<std::string>(wIn);
std::wstring wOut = yekneb::string_cast<std::wstring>(sIn);
The Details
The functions s2w and w2s are implemented using the following features of the STL.
- std::codecvt, specifically the out function.
- std::ctype, specifically the narrow function and the widen function.
- std::has_facet
- std::use_facet
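As a small, isolated illustration of how has_facet, use_facet, and codecvt::out fit together, here is a minimal sketch. The helper name narrow_with_codecvt is hypothetical, and the buffer sizing is an assumption of this example rather than the sizing used in the article's code. The conversion state is value-initialized because out() expects a valid initial state.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: converts with the locale's codecvt facet only.
// Returns an empty string if the facet is missing or the conversion fails.
inline std::string narrow_with_codecvt(const std::wstring& ws, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> converter_type;
    if (ws.empty() || !std::has_facet<converter_type>(loc))
    {
        return std::string();
    }
    const converter_type& converter = std::use_facet<converter_type>(loc);

    // max_length() bounds how many chars a single wchar_t can expand to.
    std::size_t maxLength = static_cast<std::size_t>(converter.max_length());
    if (maxLength == 0)
    {
        maxLength = 1;
    }
    std::vector<char> to(ws.size() * maxLength + 1, 0);

    // Start from a well-defined initial conversion state.
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* from = ws.data();
    const wchar_t* from_next = nullptr;
    char* to_next = nullptr;
    const converter_type::result result = converter.out(
        state, from, from + ws.size(), from_next,
        to.data(), to.data() + to.size(), to_next);

    if (result == converter_type::ok)
    {
        return std::string(to.data(), to_next);
    }
    return std::string();  // error, partial, or noconv: the caller must fall back
}
```

For example, narrow_with_codecvt(L"abc", std::locale::classic()) converts a plain ASCII string; what happens to non-ASCII input depends on the locale passed in.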
The implementation of these functions is as follows.
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

namespace yekneb
{
namespace detail
{
namespace string_cast
{

inline std::string w2s(const std::wstring& ws, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> converter_type;
    typedef std::ctype<wchar_t> wchar_facet;
    std::string return_value;
    if (ws.empty())
    {
        return "";
    }
    const wchar_t* from = ws.c_str();
    size_t len = ws.length();
    size_t converterMaxLength = 6;
    size_t vectorSize = ((len + 6) * converterMaxLength);
    // First choice: the codecvt facet, which can produce multibyte output.
    if (std::has_facet<converter_type>(loc))
    {
        const converter_type& converter = std::use_facet<converter_type>(loc);
        // out() only does useful work when the facet actually converts.
        if (!converter.always_noconv())
        {
            converterMaxLength = converter.max_length();
            if (converterMaxLength != 6)
            {
                vectorSize = ((len + 6) * converterMaxLength);
            }
            // Value-initialize the conversion state; an uninitialized
            // std::mbstate_t is not a valid initial state for out().
            std::mbstate_t state = std::mbstate_t();
            const wchar_t* from_next = nullptr;
            std::vector<char> to(vectorSize, 0);
            std::vector<char>::pointer toPtr = to.data();
            std::vector<char>::pointer to_next = nullptr;
            const converter_type::result result = converter.out(
                state, from, from + len, from_next,
                toPtr, toPtr + vectorSize, to_next);
            if (
                (converter_type::ok == result || converter_type::noconv == result)
                && 0 != toPtr[0]
                )
            {
                return_value.assign(toPtr, to_next);
            }
        }
    }
    // Fallback: narrow one character at a time, substituting '?' for
    // characters that have no single-char representation.
    if (return_value.empty() && std::has_facet<wchar_facet>(loc))
    {
        std::vector<char> to(vectorSize, 0);
        std::vector<char>::pointer toPtr = to.data();
        const wchar_facet& facet = std::use_facet<wchar_facet>(loc);
        if (facet.narrow(from, from + len, '?', toPtr) != 0)
        {
            return_value = toPtr;
        }
    }
    return return_value;
}

inline std::wstring s2w(const std::string& s, const std::locale& loc)
{
    typedef std::ctype<wchar_t> wchar_facet;
    std::wstring return_value;
    if (s.empty())
    {
        return L"";
    }
    if (std::has_facet<wchar_facet>(loc))
    {
        // The buffer is zero-filled and two elements larger than the input,
        // so the widened text is always null-terminated.
        std::vector<wchar_t> to(s.size() + 2, 0);
        std::vector<wchar_t>::pointer toPtr = to.data();
        const wchar_facet& facet = std::use_facet<wchar_facet>(loc);
        if (facet.widen(s.c_str(), s.c_str() + s.size(), toPtr) != 0)
        {
            return_value = to.data();
        }
    }
    return return_value;
}

}
}
}
A GNU/Linux Specific Bug
The above functions work well on Microsoft Windows and Mac OS X. However, they do not work as well on GNU/Linux. Specifically, the w2s function will fail with one of two errors, depending on whether debugging is enabled.
If debugging is enabled, the error will be a double free or corruption error that occurs when the output buffer is being deallocated. The error occurs because the call to converter.out corrupts the output buffer: far more bytes are written to the buffer than were allocated for it, even if you over-allocate the output buffer by a power of 2.
If debugging is not enabled the error will be as follows.
../iconv/loop.c:448: internal_utf8_loop_single: Assertion `inptr - bytebuf > (state->__count & 7)' failed.
These bugs only occur when the active locale uses UTF-8.
I have been unable to resolve this issue using just the STL. However, the Boost C++ Libraries found at www.boost.org provide an acceptable solution, the boost::locale::conv::utf_to_utf function. If using Boost is an option for you this problem can be resolved as follows.
First, add the IsUTF8 function to the ::yekneb::detail::string_cast namespace. This function may be used to determine whether or not a given locale uses UTF-8 and thus whether or not the w2s function should use the boost::locale::conv::utf_to_utf function.
The source code for the IsUTF8 function is as follows.
inline bool IsUTF8(const std::locale &loc)
{
    // A UTF-8 locale is recognized by its name, for example "en_US.UTF-8".
    std::string locName = loc.name();
    if (!locName.empty() && locName.find("UTF-8") != std::string::npos)
    {
        return true;
    }
    return false;
}
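One thing to watch for is that the check above matches the exact spelling "UTF-8", while some systems report locale names with a different spelling, such as "en_US.utf8". The following is a sketch of a more tolerant check. The function name LocaleNameIsUTF8 is hypothetical, and it takes the locale name as a string so that it can be exercised without any particular locale being installed.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical variant of IsUTF8 that works on a locale name and ignores
// letter case and hyphens, so "UTF-8", "utf-8", and "utf8" all match.
inline bool LocaleNameIsUTF8(const std::string& locName)
{
    std::string normalized;
    for (std::string::size_type i = 0; i < locName.size(); ++i)
    {
        const unsigned char ch = static_cast<unsigned char>(locName[i]);
        if (ch != '-')
        {
            normalized += static_cast<char>(std::toupper(ch));
        }
    }
    return normalized.find("UTF8") != std::string::npos;
}
```

It would be called as LocaleNameIsUTF8(loc.name()) in place of IsUTF8(loc). Whether the extra tolerance is needed depends on the locale names your target platforms actually report.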
Then simply add the following code to the w2s function just after the if (ws.empty()) code block.
if (IsUTF8(loc))
{
    return_value = boost::locale::conv::utf_to_utf<char>(ws);
    if (!return_value.empty())
    {
        return return_value;
    }
}
For consistency, a similar code block should be added to the s2w function as well.
Article Source Code
The complete source code for this article can be found on SullivanAndKey.com in the header file StringCast.h. You can also see the code in action on ideone.com.