Tuesday, December 25, 2012

Strings and String Manipulation in C++

C++ provides convenient and powerful tools to manipulate strings. This tutorial shows some of the basic string manipulation facilities, with examples to illustrate their use. It also shows some extensions the C++'s string capabilities by making use of some of the Boost Library facilities.

Strings and Basic String Operations

Putting aside any string-related facilities inherited from C, in C++, strings are not a built-in data type, but rather a Standard Library facility. Thus, whenever we want to use strings or string manipulation tools, we must provide the appropriate #include directive, as shown below:

#include
using namespace std; // Or using std::string;

We now use string in a similar way as built-in data types, as shown in the example below, declaring a variable name:

string name;

Unlike built-in data types (int, double, etc.), when we declare a string variable without initialization (as in the example above), we do have the guarantee that the variable will be initialized to an empty string — a string containing zero characters.
C++ strings allow you to directly initialize, assign, compare, and reassign with the intuitive operators, as well as printing and reading (e.g., from the user), as shown in the example below:

string name;
cout << "Enter your name: " << flush;
cin >> name;
    // read string until the next separator
    // (space, newline, tab)

    // Or, alternatively:
getline (cin, name);
    // read a whole line into the string name

if (name == "")
{
    cout << "You entered an empty string, "
         << "assigning default\n";
    name = "John";
}
else
{
    cout << "Thank you, " << name
         << "for running this simple program!"
         << endl;
}

C++ strings also provide many string manipulation facilities. The simplest string manipulation that we commonly use is concatenation, or addition of strings. In C++, we can use the + operator to concatenate (or “add”) two strings, as shown below:

string result;
string s1 = "hello ";
string s2 = "world";
result = s1 + s2;
// result now contains "hello world"

Notice that both s1 and s2 remain unchanged! The operation reads the values and produces a result corresponding to the concatenated strings, but doesn't modify any of the two original strings.
The += operator can also be used. In that case, one string is appended to another one:

string result;
string s1 = "hello";
// without the extra space at the end
string s2 = "world";
result = s1;
result += ' ';
// append a space at the end
result += s2;

After execution of the above fragment, result contains the string "hello world".
You can also use two or more + operators to concatenate several (more than 2) strings. The example below shows how to create a string that contains the full name from first name and last name (e.g., firstname = "John", lastname = "Smith", fullname = "Smith, John").

string firstname, lastname, fullname;

cout << "First name: ";
getline (cin, firstname);
cout << "Last name: ";
getline (cin, lastname);

fullname = lastname + ", " + firstname;
cout << "Fullname: " << fullname << endl;

Of course, we didn't need to do that; we could have printed it with several << operators to concatenate to the output. The example intends to illustrate the use of strings concatenation in situations where you need to store the result, as opposed to simply print it.
Now, let's review this example to have the full name in format "SMITH, John". Since we can only convert characters to upper case, and not strings, we have to handle the string one character at a time. To do that, we use the square brackets, as if we were dealing with an array of characters, or a vector of characters.
For example, we could convert the first character of a string to upper case with the following code:

str[0] = toupper (str[0]);

The function toupper is a Standard Library facility related to character processing; this means that when using it, we have to include the library header:

#include

If we want to change all of them, we would need to know the length of the string. To this end, strings have a method length, that tells us the length of the string (how many characters the string has).
Thus, we could use that method to control a loop that allows us to convert all the characters to upper case:

for (string::size_type i = 0; i < str.length(); i++)
{
str[i] = toupper (str[i]);
}

Notice that the subscripts for the individual characters of a string start at zero, and go from 0 to length−1.
Notice also the data type for the subscript, string::size_type; it is recommended that you always use this data type, provided by the string class, and adapted to the particular platform. All string facilities use this data type to represent positions and lengths when dealing with strings.
The example of the full name is slightly different from the one shown above, since we only want to change the first portion, corresponding to the last name, and we don't want to change the string that holds the last name — only the portion of the full name corresponding to the last name. Thus, we could do the following:

fullname = lastname + ", " + firstname;

for (string::size_type i = 0; i < lastname.length(); i++)
{
fullname[i] = toupper (fullname[i]);
}

Search Facilities

Another useful tool when working with strings is the find method. This can be used to find the position of a character in a string, or the position of a substring. For example, we could find the position of the first space in a string as follows:

position = str.find (' ');

If the string does not contain any space characters, the result of the find method will be the value string::npos. The example below illustrates the use of string::npos combined with the find method:

if (str.find (' ') != string::npos)
{
cout << "Contains at least one space!" << endl;
}
else
{
cout << "Does not contain any spaces!" << endl;
}

The find methods returns the position of the first occurence of the given character (or string::npos). We also have the related rfind method — the r stands for reverse search; in other words, rfind returns the position of the last occurence of the given character, or string::npos. You could also look at it as the first occurence while starting the search at the end of the string and moving backwards.
The find and rfind methods can also be used to find a substring; the following fragment of code can be used to determine if the word "the" is contained in a given string:

string text;
getline (cin, text);

if (text.find ("the") != string::npos)
{
// ...

For both cases (searching for a single character or searching for a substring), you can specify a starting position for the search; in that case, the find method will tell the position of the first occurrence of the search string or character after the position indicated; the rfind method will return the position of the last occurence before the position inticated — you could look at it as the first occurence while moving backwards starting at the specified position. The requirement for this optional parameter is that it must indicate a valid position within the string, which means that the value must be between 0
Brainteaser: there is an additional condition related to the optional starting position when using find or rfind; can you see it?
The following fragment of code shows how to test if a string contains at least two spaces. It performs one search for a space, and then it does a second search, starting at the position where the first one was found (actually, one element after — can you see why?):

string text;
getline (cin, text);

string::size_type position = text.find (' ');
if (position != string::npos)
{
    if  (text.find (' ', position+1) != string::npos)
    {
        cout << "Contains at least two spaces!" << endl;
    }
    else
    {
        cout << "Contains less than two spaces!" << endl;
    }
}
else
{
    cout << "Contains no spaces!" << endl;
}

Brainteaser: there is a bug in the fragment of code above — can you see it? even more fun: can you fix it?
There are several other string facilities that are related to find. Two of them are find_first_of, and find_first_not_of. Instead of finding the first occurrence of an exact string (as find does), find_first_of finds the first occurrence of any of the characters included in a specified string, and find_first_not_of finds the first occurrence of a character that is not any of the characters included in the specified string. An example of use is shown below:

string text;
getline (cin, text);

if (text.find_first_of ("aeiouAEIOU") == string::npos)
{
cout << "The text entered does not contain vowels!"
<< endl;
}

In this case the expression text.find_first_of ("aeiouAEIOU") returns the position of the first occurrence of any of the characters included in the string passed as parameter — a string containing all the vowels. find_first_of, like find, returns the position of the found character, or string::npos if no character matching the specified requirement was found. In this example, if find_first_of returns string::npos, that means that the string did not contain any vowel (lowercase or capital).
find_first_not_of works in a similar way, except that it finds the first character that is not one of the characters specified in the string, as shown in the example shown below:

string card_number;
cout << "Enter Credit Card Number: ";
getline (cin, card_number);

if (card_number.find_first_not_of ("1234567890– ") != string::npos)
{
cout << "The card number entered contains invalid characters"
<< endl;
}

Like find, both find_first_of and find_first_not_of accept also an extra parameter indicating the position where to start the search (naturally, if the parameter is omitted, the search starts at position 0).
Guess what? We also have the methods find_last_of and find_last_not_of. I'm certain that you will easily figure out how they work, and with your vivid imaginations will come up with examples of use.

Regular Expressions with the Boost Library

Notes:

This section assumes that you are familiar with regular expressions (at least at an intermediate level); I provide very brief explanations of the regular expressions used, mostly as a reminder; if you are unfamiliar with these, I could recommend this site, or Wikipedia's page on regular expressions.
Regular expressions are going to be part of the C++ language (part of the Standard Library), in the upcoming revision of the language, the so-called “C++0x”; however, this is still not the case (compilers provide experimental support for this and other new features in C++0x), so in this tutorial, I stick to the solid Boost library facilities. Still, C++ native regular expression facilities are very similar to (since they're based on) Boost's facilities, so what you learn from this tutorial will be useful when C++0x becomes official and compilers start supporting native regular expression facilities.

Regular expressions provide a mechanism for pattern matching between two strings; a regular expression is a fragment of text (a string) defining a pattern that describes a family of fragments of text. This pattern may be used to determine if another string matches it (i.e., if the other string belongs to the family of strings described by the regular expression).
This is a much more flexible and powerful tool than the basic search facilities that I presented in the previous section. A pattern with several components (or several conditions) can be specified in a single operation, providing extra flexibility.
A simple example of pattern matching is the following: suppose that a program asks the user to enter their fullname. The user could enter it in the form of first name followed by last name (e.g., "Carlos Moreno"), or in the form of last name, followed by comma, followed by first name (e.g., "Moreno, Carlos").
You want your program to determine which form the user entered the full name in, under the assumption that if there is no comma, then it is first name followed by last name. This is very simple to do with find, but I will use it for this example, since it illustrates quite nicely the use of regular expressions.
Regular expressions imply pattern matching, so we express the problem in terms of a pattern for which we want to determine if the given string (the name that the user entered) matches it.
The pattern in question is very simple: ".*, .*". In a regular expression, the character * means the preceding item any number of times (including possibly zero times). The preceding item in this case is a dot, which means any character. So, the pattern above is matched by any string that contains a sequence of any number of characters, followed by comma, followed by a space, followed by a sequence of any number of characters. In the case of my name, entered as "Moreno, Carlos", the string clearly matches the pattern.
This simple pattern matching based on the regular expression ".*, .*" is not very good for this particular purpose. We notice that there are many other examples of strings that match the pattern, even though they are not really intended to pass the validation: ", xxx" — a sequence of zero characters, comma-space, a sequence of three x characters; ", " — a sequence of zero characters, comma-space, a sequence of zero characters; ",,,,, " — a sequence of four commas, comma-space, a sequence of three space characters; "Aa, Bb, Cc, " — this one matches in three possible ways (right?).
The approach of a simple name.find (", ") != string::npos exhibits the same problem — all those strings pass the validation even though they're not supposed to. Making a sophisticated validation with the basic C++ string facilities would be quite tough. With regular expressions, however, it is as simple as coming up with a more precise regular expression describing more specifically the pattern that we want: [A–Z]+[a–z]*, [A–Z][a–z]*.
The program below uses the Boost Library regular expression facilities to validate a full name with the regular expression given above. The color highlight indicates the lines that are directly related to the Boost Library regular expressions facilities.

#include
#include
using namespace std;

#include

int main ()
{
    boost::regex fullname_regex ("[A–Z]+[a–z]*, [A–Z][a–z]*");

    string name;
    cout << "Enter you full name: " << flush;

    getline (cin, name);
    if (! regex_match (name, fullname_regex))
    {
        cout << "Error: name not entered correctly" << endl;
    }

    return 0;
}

In the above program, the first statement declares an object (a variable) that represents the regular expression and all the related functionality — we initialize it with the string containing the regular expression between the round brackets. In this particular example, the variable is called fullname_regex, indicating that it is the regular expression for fullname validation, and its data type is boost::regex.
The validation itself is done with the function regex_match — this function receives two parameters: the string to validate, and the regular expression object. It returns a boolean that can be used directly as a condition — true if the given string matches the regular expression, false otherwise.
When compiling this program, you have to instruct the compiler to link the Boost library together with the executable; this is due to the fact that we are using facilities that are not part of the C++ language, so the compiler needs to be specifically instructed on how to handle it.
On a Unix/Linux system, many of which come with the Boost library included as part of the distribution, you would do something like this:

c++ –o regex_test regex_test.c++ –lboost_regex

Depending on your system and the way the Boost library was installed, the details may vary. Before using it, you should check the documentation corresponding to the Boost library for your particular system configuration.

Substrings

We can extract one portion of a string with the method substr. This does not remove the portion from the original string; instead, it creates a new string that contains the specified portion of the original string. The required substring is specified by the starting position and the number of characters, taking into account that the position of the first character in the string is 0. The following example assigns the variable fragment with the sequence of 5 characters starting at position 6 of the variable text:

string text = "hello world, this is a test";
string fragment = text.substr (6, 5);
// start at 6, take 5 characters

The variable fragment will contain the string "world", and the variable text remains unchanged.
If we omit the second parameter, then substr will return a substring starting at the specified position and taking all of the characters after that one. For example:

string text = "hello world";
string subs = text.substr (3);

assigns the string "lo world" into the variable subs.
The first parameter must indicate a valid position within the string (similar to the case of the starting position for find, discussed earlier), or the behaviour is undefined. The second parameter, combined with the first one, can be such that the length of the string is exceeded. In such case, the returned substring will contain as many characters as possible (it will take all the characters from the specified starting position to the end of the string), as shown in the following fragment:

string text = "012345";
string fragment = text.substr (2,3);    // ok

fragment = text.substr (6,2);           // Not ok (1)

fragment = text.substr (3,10);          // ok (2)

(1) This statement invokes undefined behaviour, since the starting position falls outside the string.
(2) Returns all the available characters starting at 3. In this particular case, the statement has the same effect as fragment = text.substr (3,3).

Erasing and Replacing Substrings

In addition to extracting substrings, as we saw in the previous section, we can also modify a given string by manipulating a fragment of it; in particular, we can replace a given substring with another string, or erase the substring, causing the original string to “shrink”.
The erase method receives two parameters, specifying a substring (that is, starting position and number of characters) to be removed from the string, as shown below:

string text = "This is a test";
text.erase (5,5);

After executing the statement text.erase(5,5), the string contains "This test" — the substring starting at position 5 (the i in is) and taking 5 characters (up to and including the space after the a) is removed from the string, causing it to shrink (you may find a good analogy with using the Delete key in a text editor or word processor).
Instead of removing the substring, we could replace it with another string, as shown in the example below:

string text = "This is a test";
text.replace (5,2,"was");

After executing the statement text.replace(5,2,"was"), the string text contains "This was a test". Notice that we specified a substring of 2 characters and replaced it with a string of 3 characters. This is perfectly valid; the substring can be replaced with one of any length — if shorter, the original string will shrink; if longer, the original string will “expand” in length (similar to typing while in “insert mode” in a text editor or word processor); and if equal, the length of the string remains the same, and the substring is replaced letter-by-letter (similar to typing while in “overwrite mode” in a text editor or word processor)

Assembling Strings with String Streams

In the previous sections, I presented some of the basic facilities, including operators to concatenate strings. These facilities can be used to “assemble” a string piece by piece. However, they do not allow you to combine pieces of different data types. You can only concatenate strings with other strings, or with characters.
String streams provide an additional level of flexibility — they provide you with all the functionality of streams (e.g., cout) to assemble a string the same way you would “assemble” the output that you send to the console using cout.
Their use is quite straightforward: just keep in mind what you do with cout, plus a couple minor details. The example below builds a string that contains the following:

Name: Lastname, Firstname. Birthdate: YYYY–MM–DD, Age: XX

Assuming that we have the various pieces in variables firstname, lastname (strings), year, month, day, and current_year (integers).

#include
    // ...

ostringstream person_info;

person_info << "Name: "
            << lastname << ", " << firstname << ". ";
person_info << "Birthdate: "
            << year << '–' << month << '–' << day << ", "
            << "Age: " << current_year – year;

    //  At this point, person_info.str() provides the result.
    //  You could do, for instance:

cout << "Person info:\n" << person_info.str() << endl;

Notice that the use of string streams requires that you provide an #include directive.
The output produced by the above fragment of code would be something like:

Person info:
Name: Lastname, Firstname. Birthdate: YYYY–MM–DD, Age: XX

Of course, if we only had to output this information to the console, we could have done it all directly with cout. We use an ostringstream when we need to store the concatenated result in a string variable — in the example above, the string is given by person_info.str(), and I simply used that string to print it with cout to keep the example as simple as possible.
In cases where we need to reuse an ostringstream, we can “reset” it (clear its internal contents) to start over by passing an empty string to the str method, as shown below:

#include
// ...

ostringstream out;
out << ··· ;
// Done --- do something with out.str() ...

out.str(""); // Reset it
out << ··· ;

Parsing Strings with String Streams

As much as we can “assemble” strings from several pieces using ostringstream, we can do the opposite process, breaking up a string into several pieces, using istringstream (the "o" stands for output — as in, we output some data to a string; the "i", naturally, stands for input — as in, we input information from a string).
And not surprisingly, as much as ostringstream is used in a way similar to the way we use cout, we use istringstream in a way that is very similar to the way we use cin. The example below shows this by taking a string with a date in ISO format (YYYY-MM-DD) and extracting the values of year, month, day into three integer variables:

#include
// ...

string iso_date;
int year, month, day;
char dash1, dash2; // Dummy variables to read delimiters

cout << "Enter date (yyyy-mm-dd): " << flush;
getline (cin, iso_date);

istringstream (iso_date); // load value into string stream
iso_date >> year >> dash1 >> month >> dash2 >> day;

Again, similarly to the case of ostringstream, the example may look silly, since we may have done the last line directly with cin. The use of istringstream would be more reasonable when the given string does not come from the console or a file directly, but perhaps as an argument in a function, or from a database or graphical component, through a framework that passes the string as a whole.
However, unlike with the case of ostringstream, it can still be useful to resort to the seemingly silly technique above; keep in mind that if the given input does not exactly match what we request, cin reports an error, but it can also jam and could make the code to handle it a bit more involved. A commonly used trick when dealing directly with console input is to read with getline into a string variable (such that it reads anything), and then use an istringstream object to parse the line into the required pieces; if we get an error, we know that we can continue to read through cin without any issues.

Case Study: URL Encoding and Decoding

In this section, I will present a concrete application of the string manipulation facilities. The application uses both the C++ facilities, and the Boost Library regular expressions facilities.
I will present and discuss two programs, one that creates a URL-encoded version of a set of named parameters, and one that receives a URL-encoded string and decodes it (i.e., breaks it into individual pieces). In both cases, I use a simplified version of the URL-encoding mechanism, to keep the example simple.
URL-encoding is the mechanism used by web browsers to send form data to web servers for processing. The data consists of a set of values with given names. A simple example is an HTML login form, where we have two fields that could be named, for example, username and password. If, for instance, my username is carlos and my password is moreno, the browser would send the form data to the web server as the following URL-encoded string:

username=carlos&password=moreno

Spaces should be avoided (I will not discuss the reasons here, which are related to data communications principles), which means that if the values that the user types in contain spaces, these should be replaced by a non-whitespace character. In the case of URL-encoding, the character is the + sign. In the above example, if my password was moreno carlos, the URL-encoded string would be:

username=carlos&password=moreno+carlos

The simplified version of the URL-encoding mechanism that I will use for this case-study assumes that all input given by the user contains only alphanumeric characters in the US-English alphabet or spaces. This avoids the issue of special characters (for instance, what happens if the data entered by the user contains the character +? Or the character &, or =?). These issues are handled by the real URL-encoding technique, but I will avoid them in this case-study, to keep the example as simple as possible.

URL-Encoding Data Entered by the User

The first program prompts the user for pairs of data, indicating name of the field and value. The process stops when the user enters an empty name. At that point, the program outputs the URL-encoded data.
The basic structure of the program is a do-while loop, with stop condition given by an empty string read from the user. At each pass of the loop, the program reads the two pieces from the user and appends the corresponding item, in the form name=value, with proper encoding and separation with the character &. The program's structure for the URL-encoding is shown below:

string param_name, param_value, url_encoded;

do
{
    getline (cin, param_name);
    if (param_name != "")
    {
        getline (cin, param_value);
        //  ...
    }
}
while (param_name != "");

cout << "URL-encoded: " << url_encoded << endl;

Each individual encoding is done by iterating over the characters of the string and replacing it by its encoded version if necessary (replace spaces with +), as shown in the fragment below. We could also use the method find in a loop (while we keep finding spaces) and use the method replace at the positions where the space was found. However, this would be much more complicated than a simple for loop visiting each character of the string (and not much more efficient — surely, find will have to iterate over each character whilte searching for spaces).

for (string::size_type i = 0; i < text.length(); i++)
{
    if (text[i] == ' ')
    {
        text[i] = '+';
    }
}

The fragment of code below combines the two previous fragments (pseudocode and token encoding), completing the URL-encoding part of this case-study. It is assumed that the necessary variable declarations have been done (they are omitted below to save space). After the loop has completed, the variable url_encoded contains the result.

do
{
    getline (cin, param_name);
    if (param_name != "")
    {
        getline (cin, param_value);
        for (string::size_type i = 0; i < param_name.length(); i++)
        {
            if (param_name[i] == ' ')
            {
                param_name[i] = '+';
            }
        }
        for (string::size_type i = 0; i < param_value.length(); i++)
        {
            if (param_value[i] == ' ')
            {
                param_value = '+';
            }
        }
        if (url_encoded != "")    // Why this if?
        {
            url_encoded += '&';
        }
        url_encoded += param_name;
        url_encoded += '=';
        url_encoded += param_value;
    }
}
while (param_name != "");

Brainteaser One: What happens if we omit the line if (url_encoded != "") and concatenate '&' directly? (this is an extension of the question in the comment for that line)
Brainteaser Two: Why two for loops? Can you re-arrange the code such that there is only one for loop? (and still works correctly, naturally!)

URL-Decoding Previously-encoded Data

The second program receives a string containing URL-encoded data and breaks it into pieces, displaying the pairs (parameter name and parameter value).
The basic idea is that we loop through the string while we keep finding occurences of the character &, which means that we keep finding additional parameter pairs. We have to keep track of consecutive occurences of this character, as this will allow us to extract the appropriate portion of the string, containing the parameter pair in the form name=value. The following fragment shows the basic structure of the program:

Brainteaser: there is a bug in the code above, related to the last parameter pair in the string (and I bet at this point you know what I'm going to ask you to do).
We then break each parameter pair into individual components, name and value. To do this, we simply find the position of the equal sign, and then use it to determine the appropriate arguments for the two substrs, as shown below:

const string::size_type pos_eq = param.find ('=');
string name, value;
if (pos_eq  != string::npos)
{
    name = param.substr (0, pos_eq);
    value = param.substr (pos_eq + 1);
}
else
{
    // Error -- invalid parameter pair
}

We also have to replace the + characters back to spaces. We do this the same way as we did for token encoding, except that in this case we test if it is a + and replace it with a space.
Putting together all the pieces, we obtain the following program:

string encoded;

cout << "Enter a URL-encoded string: " << flush;
getline (cin, encoded);

string::size_type pos_start = 0, pos_end;
do
{
    pos_end = encoded.find ('&', pos_start);
    string param;
    if (pos_end != string::npos)
    {
        param = encoded.substr (pos_start, pos_end − pos_start);
        pos_start = pos_end + 1;
    }
    else
    {
        param = encoded.substr (pos_start);
    }

    for (string::size_type i = 0; i < param.length(); i++)
    {
        if (param[i] == '+')
        {
            param[i]  = ' ';
        }
    }

    const string::size_type pos_eq = param.find ('=');
    if (pos_eq  != string::npos)
    {
        const string name = param.substr (0, pos_eq);
        const string value = param.substr (pos_eq + 1);
        cout << name << " = " << value << endl;
    }
    else
    {
        cerr << "Invalid parameter found -- ignoring" << endl;
    }
}
while (pos_end != string::npos && pos_end != encoded.length() − 1);

The C-Style Character String

C++ provides following two types of string representations:

The C-style character string.
The string class type introduced with Standard C++.

The C-Style Character String:

The C-style character string originated within the C language and continues to be supported within C++. This string is actually a one-dimensional array of characters which is terminated by a null character '\0'. Thus a null-terminated string contains the characters that comprise the string followed by a null.
The following declaration and initialization create a string consisting of the word "Hello". To hold the null character at the end of the array, the size of the character array containing the string is one more than the number of characters in the word "Hello."

char greeting[6] = {'H', 'e', 'l', 'l', 'o', '\0'};

If you follow the rule of array initialization then you can write the above statement as follows:

char greeting[] = "Hello";

Following is the memory presentation of above defined string in C/C++:
String Presentation in C/C++

Actually, you do not place the null character at the end of a string constant. The C++ compiler automatically places the '\0' at the end of the string when it initializes the array. Let us try to print above mentioned string:

#include 

using namespace std;

int main ()
{
   char greeting[6] = {'H', 'e', 'l', 'l', 'o', '\0'};

   cout << "Greeting message: ";
   cout << greeting << endl;

   return 0;
}

When the above code is compiled and executed, it produces result something as follows:

Greeting message: Hello

C++ supports a wide range of functions that manipulate null-terminated strings:

S.N.	Function & Purpose
1	strcpy(s1, s2); Copies string s2 into string s1.
2	strcat(s1, s2); Concatenates string s2 onto the end of string s1.
3	strlen(s1); Returns the length of string s1.
4	strcmp(s1, s2); Returns 0 if s1 and s2 are the same; less than 0 if s1s2.
5	strchr(s1, ch); Returns a pointer to the first occurrence of character ch in string s1.
6	strstr(s1, s2); Returns a pointer to the first occurrence of string s2 in string s1.

Following example makes use of few of the above mentioned functions:

#include 
#include 

using namespace std;

int main ()
{
   char str1[10] = "Hello";
   char str2[10] = "World";
   char str3[10];
   int  len ;

   // copy str1 into str3
   strcpy( str3, str1);
   cout << "strcpy( str3, str1) : " << str3 << endl;

   // concatenates str1 and str2
   strcat( str1, str2);
   cout << "strcat( str1, str2): " << str1 << endl;

   // total lenghth of str1 after concatenation
   len = strlen(str1);
   cout << "strlen(str1) : " << len << endl;

   return 0;
}

When the above code is compiled and executed, it produces result something as follows:

strcpy( str3, str1) : Hello
strcat( str1, str2): HelloWorld
strlen(str1) : 10

The String Class in C++:

The standard C++ library provides a string class type that supports all the operations mentioned above, additionally much more functionality. We will study this class in C++ Standard Library but for now let us check following example:
At this point you may not understand this example because so far we have not discussed Classes and Objects. So can have a look and proceed until you have understanding on Object Oriented Concepts.

#include 
#include 

using namespace std;

int main ()
{
   string str1 = "Hello";
   string str2 = "World";
   string str3;
   int  len ;

   // copy str1 into str3
   str3 = str1;
   cout << "str3 : " << str3 << endl;

   // concatenates str1 and str2
   str3 = str1 + str2;
   cout << "str1 + str2 : " << str3 << endl;

   // total lenghth of str3 after concatenation
   len = str3.size();
   cout << "str3.size() :  " << len << endl;

   return 0;
}

When the above code is compiled and executed, it produces result something as follows:

str3 : Hello
str1 + str2 : HelloWorld
str3.size() :  10

Strings

Strings in C are represented by arrays of characters. The end of the string is marked with a special character, the null character, which is simply the character with the value 0. (The null character has no relation except in name to the null pointer. In the ASCII character set, the null character is named NUL.) The null or string-terminating character is represented by another character escape sequence, \0. (We've seen it once already, in the getline function of chapter 6.)
Because C has no built-in facilities for manipulating entire arrays (copying them, comparing them, etc.), it also has very few built-in facilities for manipulating strings.
In fact, C's only truly built-in string-handling is that it allows us to use string constants (also called string literals) in our code. Whenever we write a string, enclosed in double quotes, C automatically creates an array of characters for us, containing that string, terminated by the \0 character. For example, we can declare and define an array of characters, and initialize it with a string constant:

 char string[] = "Hello, world!";

In this case, we can leave out the dimension of the array, since the compiler can compute it for us based on the size of the initializer (14, including the terminating \0). This is the only case where the compiler sizes a string array for us, however; in other cases, it will be necessary that we decide how big the arrays and other data structures we use to hold strings are. To do anything else with strings, we must typically call functions. The C library contains a few basic string manipulation functions, and to learn more about strings, we'll be looking at how these functions might be implemented.
Since C never lets us assign entire arrays, we use the strcpy function to copy one string to another:

 #include 

 char string1[] = "Hello, world!";
 char string2[20];

 strcpy(string2, string1);

The destination string is strcpy's first argument, so that a call to strcpy mimics an assignment expression (with the destination on the left-hand side). Notice that we had to allocate string2 big enough to hold the string that would be copied to it. Also, at the top of any source file where we're using the standard library's string-handling functions (such as strcpy) we must include the line

 #include

which contains external declarations for these functions. Since C won't let us compare entire arrays, either, we must call a function to do that, too. The standard library's strcmp function compares two strings, and returns 0 if they are identical, or a negative number if the first string is alphabetically ``less than'' the second string, or a positive number if the first string is ``greater.'' (Roughly speaking, what it means for one string to be ``less than'' another is that it would come first in a dictionary or telephone book, although there are a few anomalies.) Here is an example:

 char string3[] = "this is";
 char string4[] = "a test";

 if(strcmp(string3, string4) == 0)
  printf("strings are equal\n");
 else printf("strings are different\n");

This code fragment will print ``strings are different''. Notice that strcmp does not return a Boolean, true/false, zero/nonzero answer, so it's not a good idea to write something like

 if(strcmp(string3, string4))
  ...

because it will behave backwards from what you might reasonably expect. (Nevertheless, if you start reading other people's code, you're likely to come across conditionals like if(strcmp(a, b)) or even if(!strcmp(a, b)). The first does something if the strings are unequal; the second does something if they're equal. You can read these more easily if you pretend for a moment that strcmp's name were strdiff, instead.) Another standard library function is strcat, which concatenates strings. It does not concatenate two strings together and give you a third, new string; what it really does is append one string onto the end of another. (If it gave you a new string, it would have to allocate memory for it somewhere, and the standard library string functions generally never do that for you automatically.) Here's an example:

 char string5[20] = "Hello, ";
 char string6[] = "world!";

 printf("%s\n", string5);

 strcat(string5, string6);

 printf("%s\n", string5);

The first call to printf prints ``Hello, '', and the second one prints ``Hello, world!'', indicating that the contents of string6 have been tacked on to the end of string5. Notice that we declared string5 with extra space, to make room for the appended characters. If you have a string and you want to know its length (perhaps so that you can check whether it will fit in some other array you've allocated for it), you can call strlen, which returns the length of the string (i.e. the number of characters in it), not including the \0:

 char string7[] = "abc";
 int len = strlen(string7);
 printf("%d\n", len);

Finally, you can print strings out with printf using the %s format specifier, as we've been doing in these examples already (e.g. printf("%s\n", string5);).
Since a string is just an array of characters, all of the string-handling functions we've just seen can be written quite simply, using no techniques more complicated than the ones we already know. In fact, it's quite instructive to look at how these functions might be implemented. Here is a version of strcpy:

mystrcpy(char dest[], char src[])
{
int i = 0;

while(src[i] != '\0')
 {
 dest[i] = src[i];
 i++;
 }

dest[i] = '\0';
}

We've called it mystrcpy instead of strcpy so that it won't clash with the version that's already in the standard library. Its operation is simple: it looks at characters in the src string one at a time, and as long as they're not \0, assigns them, one by one, to the corresponding positions in the dest string. When it's done, it terminates the dest string by appending a \0. (After exiting the while loop, i is guaranteed to have a value one greater than the subscript of the last character in src.) For comparison, here's a way of writing the same code, using a for loop:

for(i = 0; src[i] != '\0'; i++)
 dest[i] = src[i];

dest[i] = '\0';

Yet a third possibility is to move the test for the terminating \0 character out of the for loop header and into the body of the loop, using an explicit if and break statement, so that we can perform the test after the assignment and therefore use the assignment inside the loop to copy the \0 to dest, too:

for(i = 0; ; i++)
 {
 dest[i] = src[i];
 if(src[i] == '\0')
  break;
 }

(There are in fact many, many ways to write strcpy. Many programmers like to combine the assignment and test, using an expression like (dest[i] = src[i]) != '\0'. This is actually the same sort of combined operation as we used in our getchar loop in chapter 6.)
Here is a version of strcmp:

mystrcmp(char str1[], char str2[])
{
int i = 0;

while(1)
 {
 if(str1[i] != str2[i])
  return str1[i] - str2[i];
 if(str1[i] == '\0' || str2[i] == '\0')
  return 0;
 i++;
 }
}

Characters are compared one at a time. If two characters in one position differ, the strings are different, and we are supposed to return a value less than zero if the first string (str1) is alphabetically less than the second string. Since characters in C are represented by their numeric character set values, and since most reasonable character sets assign values to characters in alphabetical order, we can simply subtract the two differing characters from each other: the expression str1[i] - str2[i] will yield a negative result if the i'th character of str1 is less than the corresponding character in str2. (As it turns out, this will behave a bit strangely when comparing upper- and lower-case letters, but it's the traditional approach, which the standard versions of strcmp tend to use.) If the characters are the same, we continue around the loop, unless the characters we just compared were (both) \0, in which case we've reached the end of both strings, and they were both equal. Notice that we used what may at first appear to be an infinite loop--the controlling expression is the constant 1, which is always true. What actually happens is that the loop runs until one of the two return statements breaks out of it (and the entire function). Note also that when one string is longer than the other, the first test will notice this (because one string will contain a real character at the [i] location, while the other will contain \0, and these are not equal) and the return value will be computed by subtracting the real character's value from 0, or vice versa. (Thus the shorter string will be treated as ``less than'' the longer.) Finally, here is a version of strlen:

int mystrlen(char str[])
{
int i;

for(i = 0; str[i] != '\0'; i++)
 {}

return i;
}

In this case, all we have to do is find the \0 that terminates the string, and it turns out that the three control expressions of the for loop do all the work; there's nothing left to do in the body. Therefore, we use an empty pair of braces {} as the loop body. Equivalently, we could use a null statement, which is simply a semicolon:

for(i = 0; str[i] != '\0'; i++)
 ;

Empty loop bodies can be a bit startling at first, but they're not unheard of. Everything we've looked at so far has come out of C's standard libraries. As one last example, let's write a substr function, for extracting a substring out of a larger string. We might call it like this:

 char string8[] = "this is a test";
 char string9[10];
 substr(string9, string8, 5, 4);
 printf("%s\n", string9);

The idea is that we'll extract a substring of length 4, starting at character 5 (0-based) of string8, and copy the substring to string9. Just as with strcpy, it's our responsibility to declare the destination string (string9) big enough. Here is an implementation of substr. Not surprisingly, it's quite similar to strcpy:

substr(char dest[], char src[], int offset, int len)
{
int i;
for(i = 0; i < len && src[offset + i] != '\0'; i++)
 dest[i] = src[i + offset];
dest[i] = '\0';
}

If you compare this code to the code for mystrcpy, you'll see that the only differences are that characters are fetched from src[offset + i] instead of src[i], and that the loop stops when len characters have been copied (or when the src string runs out of characters, whichever comes first). In this chapter, we've been careless about declaring the return types of the string functions, and (with the exception of mystrlen) they haven't returned values. The real string functions do return values, but they're of type ``pointer to character,'' which we haven't discussed yet.
When working with strings, it's important to keep firmly in mind the differences between characters and strings. We must also occasionally remember the way characters are represented, and about the relation between character values and integers.
As we have had several occasions to mention, a character is represented internally as a small integer, with a value depending on the character set in use. For example, we might find that 'A' had the value 65, that 'a' had the value 97, and that '+' had the value 43. (These are, in fact, the values in the ASCII character set, which most computers use. However, you don't need to learn these values, because the vast majority of the time, you use character constants to refer to characters, and the compiler worries about the values for you. Using character constants in preference to raw numeric values also makes your programs more portable.)
As we may also have mentioned, there is a big difference between a character and a string, even a string which contains only one character (other than the \0). For example, 'A' is not the same as "A". To drive home this point, let's illustrate it with a few examples.
If you have a string:

 char string[] = "hello, world!";

you can modify its first character by saying

 string[0] = 'H';

(Of course, there's nothing magic about the first character; you can modify any character in the string in this way. Be aware, though, that it is not always safe to modify strings in-place like this; we'll say more about the modifiability of strings in a later chapter on pointers.) Since you're replacing a character, you want a character constant, 'H'. It would not be right to write

 string[0] = "H";  /* WRONG */

because "H" is a string (an array of characters), not a single character. (The destination of the assignment, string[0], is a char, but the right-hand side is a string; these types don't match.) On the other hand, when you need a string, you must use a string. To print a single newline, you could call

 printf("\n");

It would not be correct to call

 printf('\n');   /* WRONG */

printf always wants a string as its first argument. (As one final example, putchar wants a single character, so putchar('\n') would be correct, and putchar("\n") would be incorrect.) We must also remember the difference between strings and integers. If we treat the character '1' as an integer, perhaps by saying

 int i = '1';

we will probably not get the value 1 in i; we'll get the value of the character '1' in the machine's character set. (In ASCII, it's 49.) When we do need to find the numeric value of a digit character (or to go the other way, to get the digit character with a particular value) we can make use of the fact that, in any character set used by C, the values for the digit characters, whatever they are, are contiguous. In other words, no matter what values '0' and '1' have, '1' - '0' will be 1 (and, obviously, '0' - '0' will be 0). So, for a variable c holding some digit character, the expression

 c - '0'

gives us its value. (Similarly, for an integer value i, i + '0' gives us the corresponding digit character, as long as 0 <= i <= 9.) Just as the character '1' is not the integer 1, the string "123" is not the integer 123. When we have a string of digits, we can convert it to the corresponding integer by calling the standard function atoi:

 char string[] = "123";
 int i = atoi(string);
 int j = atoi("456");

Later we'll learn how to go in the other direction, to convert an integer into a string. (One way, as long as what you want to do is print the number out, is to call printf, using %d in the format string.)

A Data Scientist's blog

Search This Blog

Tuesday, December 25, 2012

Strings and String Manipulation in C++

Strings and String Manipulation in C++

Strings and Basic String Operations

Search Facilities

Regular Expressions with the Boost Library

Substrings

Erasing and Replacing Substrings

Assembling Strings with String Streams

Parsing Strings with String Streams

Case Study: URL Encoding and Decoding

URL-Encoding Data Entered by the User

URL-Decoding Previously-encoded Data

The C-Style Character String

The C-Style Character String:

The String Class in C++:

Strings

Strings