Strings and String Manipulation in C++

C++ provides convenient and powerful tools to manipulate strings. This tutorial shows some of the basic string manipulation facilities, with examples to illustrate their use. It also shows some extensions to C++'s string capabilities that make use of the Boost Library facilities.

Strings and Basic String Operations

Putting aside any string-related facilities inherited from C, strings in C++ are not a built-in data type, but rather a Standard Library facility. Thus, whenever we want to use strings or string manipulation tools, we must provide the appropriate #include directive, as shown below:
#include <string>
using namespace std; // Or using std::string;
string name;
C++ strings allow you to directly initialize, assign, compare, and reassign with the intuitive operators, as well as printing and reading (e.g., from the user), as shown in the example below:
string name;
cout << "Enter your name: " << flush;
cin >> name;    // read string until the next separator
                // (space, newline, tab)

// Or, alternatively:
getline (cin, name);    // read a whole line into the string name

if (name == "")
{
    cout << "You entered an empty string, "
         << "assigning default\n";
    name = "John";
}
else
{
    cout << "Thank you, " << name
         << " for running this simple program!" << endl;
}
Two strings can be concatenated with the + operator:
string result;
string s1 = "hello ";
string s2 = "world";
result = s1 + s2;    // result now contains "hello world"

The += operator can also be used. In that case, one string is appended to another one:
string result;
string s1 = "hello";    // without the extra space at the end
string s2 = "world";
result = s1;
result += ' ';    // append a space at the end
result += s2;

You can also use two or more + operators to concatenate several (more than 2) strings. The example below shows how to create a string that contains the full name from first name and last name (e.g., firstname = "John", lastname = "Smith", fullname = "Smith, John").
string firstname, lastname, fullname;
cout << "First name: ";
getline (cin, firstname);
cout << "Last name: ";
getline (cin, lastname);
fullname = lastname + ", " + firstname;
cout << "Fullname: " << fullname << endl;

Now, let's revise this example to produce the full name in the format "SMITH, John". Since we can only convert characters to upper case, and not strings, we have to handle the string one character at a time. To do that, we use square brackets, as if we were dealing with an array of characters, or a vector of characters. For example, we could convert the first character of a string to upper case with the following code:
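str[0] = toupper (str[0]);
The toupper function requires the following #include directive:
#include <cctype>
The length method returns the number of characters in a string, so we could use it to control a loop that converts all the characters to upper case: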
for (string::size_type i = 0; i < str.length(); i++)
{
    str[i] = toupper (str[i]);
}

Notice also the data type for the subscript, string::size_type; it is recommended that you always use this data type, provided by the string class, and adapted to the particular platform. All string facilities use this data type to represent positions and lengths when dealing with strings.

The example of the full name is slightly different from the one shown above, since we only want to change the first portion, corresponding to the last name, and we don't want to change the string that holds the last name, only the portion of the full name corresponding to the last name. Thus, we could do the following:
fullname = lastname + ", " + firstname;
for (string::size_type i = 0; i < lastname.length(); i++)
{
    fullname[i] = toupper (fullname[i]);
}

Search Facilities

Another useful tool when working with strings is the find method. This can be used to find the position of a character in a string, or the position of a substring. For example, we could find the position of the first space in a string as follows:
string::size_type position = str.find (' ');
If the character is not found, find returns the special value string::npos, so we can test whether a string contains a space as follows:
if (str.find (' ') != string::npos)
{
    cout << "Contains at least one space!" << endl;
}
else
{
    cout << "Does not contain any spaces!" << endl;
}

The find and rfind methods can also be used to find a substring; the following fragment of code can be used to determine if the word "the" is contained in a given string:
string text;
getline (cin, text);
if (text.find ("the") != string::npos)
{
    // ...
}

There is an additional condition related to the optional starting position when using find or rfind; can you see it?

The following fragment of code shows how to test if a string contains at least two spaces. It performs one search for a space, and then it does a second search, starting at the position where the first one was found (actually, one element after; can you see why?):
string text;
getline (cin, text);
string::size_type position = text.find (' ');
if (position != string::npos)
{
    if (text.find (' ', position+1) != string::npos)
    {
        cout << "Contains at least two spaces!" << endl;
    }
    else
    {
        cout << "Contains less than two spaces!" << endl;
    }
}
else
{
    cout << "Contains no spaces!" << endl;
}

There are several other string facilities that are related to find. Two of them are find_first_of and find_first_not_of. Instead of finding the first occurrence of an exact string (as find does), find_first_of finds the first occurrence of any of the characters included in a specified string, and find_first_not_of finds the first occurrence of a character that is not any of the characters included in the specified string. An example of use is shown below:
string text;
getline (cin, text);
if (text.find_first_of ("aeiouAEIOU") == string::npos)
{
    cout << "The text entered does not contain vowels!" << endl;
}

find_first_not_of works in a similar way, except that it finds the first character that is not one of the characters specified in the string, as shown in the example below:
string card_number;
cout << "Enter Credit Card Number: ";
getline (cin, card_number);
if (card_number.find_first_not_of ("1234567890- ") != string::npos)
{
    cout << "The card number entered contains invalid characters" << endl;
}

Guess what? We also have the methods find_last_of and find_last_not_of. I'm certain that you will easily figure out how they work, and with your vivid imaginations will come up with examples of use.
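As one possible illustration of find_last_of and find_last_not_of, here is a minimal sketch. The helper names extension_of and trim_right are hypothetical, chosen only for this example, and the sketch relies on substr, which is covered in the Substrings section further down.

```cpp
#include <string>

// Hypothetical helper: returns the text after the last '.' in a file name,
// or an empty string when the name contains no dot.
std::string extension_of(const std::string& filename)
{
    const std::string::size_type dot = filename.find_last_of('.');
    if (dot == std::string::npos)
    {
        return "";
    }
    return filename.substr(dot + 1);
}

// Hypothetical helper: removes trailing spaces and tabs.
std::string trim_right(const std::string& text)
{
    // Position of the last character that is NOT a space or a tab
    const std::string::size_type last = text.find_last_not_of(" \t");
    if (last == std::string::npos)
    {
        return "";    // empty, or nothing but whitespace
    }
    return text.substr(0, last + 1);
}
```

Both searches work from the end of the string toward the beginning, and both report "nothing found" with string::npos, exactly like find.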
Regular Expressions with the Boost Library
Regular expressions are a much more flexible and powerful tool than the basic search facilities that I presented in the previous section. A pattern with several components (or several conditions) can be specified in a single operation, providing extra flexibility.

A simple example of pattern matching is the following: suppose that a program asks the user to enter their full name. The user could enter it in the form of first name followed by last name (e.g., "Carlos Moreno"), or in the form of last name, followed by comma, followed by first name (e.g., "Moreno, Carlos"). You want your program to determine which form the user entered the full name in, under the assumption that if there is no comma, then it is first name followed by last name. This is very simple to do with find, but I will use it for this example, since it illustrates quite nicely the use of regular expressions.

Regular expressions imply pattern matching, so we express the problem in terms of a pattern for which we want to determine if the given string (the name that the user entered) matches it. The pattern in question is very simple: ".*, .*". In a regular expression, the character * means the preceding item any number of times (including possibly zero times). The preceding item in this case is a dot, which means any character. So, the pattern above is matched by any string that contains a sequence of any number of characters, followed by comma, followed by a space, followed by a sequence of any number of characters. In the case of my name, entered as "Moreno, Carlos", the string clearly matches the pattern.

This simple pattern matching based on the regular expression ".*, .*" is not very good for this particular purpose.
We notice that there are many other examples of strings that match the pattern, even though they are not really intended to pass the validation:
- ", xxx": a sequence of zero characters, comma-space, a sequence of three x characters;
- ", ": a sequence of zero characters, comma-space, a sequence of zero characters;
- ",,,,,    ": a sequence of four commas, comma-space, a sequence of three space characters;
- "Aa, Bb, Cc, ": this one matches in three possible ways (right?).

The approach of a simple name.find (", ") != string::npos exhibits the same problem: all those strings pass the validation even though they're not supposed to. Making a sophisticated validation with the basic C++ string facilities would be quite tough. With regular expressions, however, it is as simple as coming up with a more precise regular expression describing more specifically the pattern that we want: [A-Z]+[a-z]*, [A-Z][a-z]*. The program below uses the Boost Library regular expression facilities to validate a full name with the regular expression given above.
#include <iostream>
#include <string>
using namespace std;

#include <boost/regex.hpp>

int main ()
{
    boost::regex fullname_regex ("[A-Z]+[a-z]*, [A-Z][a-z]*");

    string name;
    cout << "Enter your full name: " << flush;
    getline (cin, name);
    if (! regex_match (name, fullname_regex))
    {
        cout << "Error: name not entered correctly" << endl;
    }

    return 0;
}

The validation itself is done with the function regex_match; this function receives two parameters: the string to validate, and the regular expression object. It returns a boolean that can be used directly as a condition: true if the given string matches the regular expression, false otherwise.

When compiling this program, you have to instruct the compiler to link the Boost library together with the executable; this is due to the fact that we are using facilities that are not part of the C++ language, so the compiler needs to be specifically instructed on how to handle it. On a Unix/Linux system, many of which come with the Boost library included as part of the distribution, you would do something like this:
c++ -o regex_test regex_test.c++ -lboost_regex
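For readers without Boost at hand, the same check can be sketched with the standard library instead. This is a swapped-in alternative, not the approach used in the program above: it assumes a compiler with C++11 support, where the <regex> header provides std::regex and std::regex_match. The function name is_valid_fullname is hypothetical.

```cpp
#include <regex>
#include <string>

// Same pattern as in the Boost example, checked with std::regex (C++11).
// regex_match succeeds only if the WHOLE string matches the pattern.
bool is_valid_fullname(const std::string& name)
{
    static const std::regex fullname_regex("[A-Z]+[a-z]*, [A-Z][a-z]*");
    return std::regex_match(name, fullname_regex);
}
```

With this pattern, "Moreno, Carlos" is accepted, while the problem strings listed earlier (", xxx", ", ", and "Aa, Bb, Cc, ") are rejected.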
Substrings

We can extract one portion of a string with the method substr. This does not remove the portion from the original string; instead, it creates a new string that contains the specified portion of the original string. The required substring is specified by the starting position and the number of characters, taking into account that the position of the first character in the string is 0. The following example assigns the variable fragment with the sequence of 5 characters starting at position 6 of the variable text:
string text = "hello world, this is a test";
string fragment = text.substr (6, 5);    // start at 6, take 5 characters

If we omit the second parameter, then substr will return a substring starting at the specified position and taking all of the characters after that one. For example:
string text = "hello world";
string subs = text.substr (3);    // subs contains "lo world"

The first parameter must indicate a valid position within the string (compare the starting position for find, discussed earlier); if it is greater than the length of the string, substr throws an out_of_range exception. The second parameter, combined with the first one, can be such that the length of the string is exceeded. In such case, the returned substring will contain as many characters as possible (it will take all the characters from the specified starting position to the end of the string), as shown in the following fragment:
string text = "012345";
string fragment = text.substr (2,3);    // ok
fragment = text.substr (6,2);    // ok, but empty (1)
fragment = text.substr (3,10);    // ok (2)

(1) Position 6 is equal to the length of the string, so substr returns an empty string. A position greater than the length, such as text.substr (7,2), would throw an out_of_range exception.
(2) Returns all the available characters starting at 3. In this particular case, the statement has the same effect as fragment = text.substr (3,3).

Erasing and Replacing Substrings

In addition to extracting substrings, as we saw in the previous section, we can also modify a given string by manipulating a fragment of it; in particular, we can replace a given substring with another string, or erase the substring, causing the original string to “shrink”.

The erase method receives two parameters, specifying a substring (that is, starting position and number of characters) to be removed from the string, as shown below:
string text = "This is a test";
text.erase (5,5);    // text now contains "This test"

Instead of removing the substring, we could replace it with another string, as shown in the example below:
string text = "This is a test";
text.replace (5,2,"was");    // text now contains "This was a test"

Assembling Strings with String Streams

In the previous sections, I presented some of the basic facilities, including operators to concatenate strings. These facilities can be used to “assemble” a string piece by piece. However, they do not allow you to combine pieces of different data types. You can only concatenate strings with other strings, or with characters.

String streams provide an additional level of flexibility: they provide you with all the functionality of streams (e.g., cout) to assemble a string the same way you would “assemble” the output that you send to the console using cout. Their use is quite straightforward: just keep in mind what you do with cout, plus a couple of minor details. The example below builds a string that contains the following:
Name: Lastname, Firstname. Birthdate: YYYY-MM-DD, Age: XX
#include <sstream>
// ...
ostringstream person_info;
person_info << "Name: " << lastname << ", " << firstname << ". ";
person_info << "Birthdate: " << year << '-' << month << '-' << day << ", "
            << "Age: " << current_year - year;

// At this point, person_info.str() provides the result.
// You could do, for instance:
cout << "Person info:\n" << person_info.str() << endl;

The output produced by the above fragment of code would be something like:
Person info:
Name: Lastname, Firstname. Birthdate: YYYY-MM-DD, Age: XX

In cases where we need to reuse an ostringstream, we can “reset” it (clear its internal contents) to start over by passing an empty string to the str method, as shown below:
#include <sstream>
// ...
ostringstream out;
out << ··· ;
// Done --- do something with out.str() ...
out.str("");    // Reset it
out << ··· ;

Parsing Strings with String Streams

As much as we can “assemble” strings from several pieces using ostringstream, we can do the opposite process, breaking up a string into several pieces, using istringstream (the "o" stands for output, as in, we output some data to a string; the "i", naturally, stands for input, as in, we input information from a string).

And not surprisingly, as much as ostringstream is used in a way similar to the way we use cout, we use istringstream in a way that is very similar to the way we use cin. The example below shows this by taking a string with a date in ISO format (YYYY-MM-DD) and extracting the values of year, month, day into three integer variables:
#include <sstream>
// ...
string iso_date;
int year, month, day;
char dash1, dash2;    // Dummy variables to read delimiters

cout << "Enter date (yyyy-mm-dd): " << flush;
getline (cin, iso_date);
istringstream date_stream (iso_date);    // load value into string stream
date_stream >> year >> dash1 >> month >> dash2 >> day;

However, unlike with the case of ostringstream, it can still be useful to resort to the seemingly silly technique above; keep in mind that if the given input does not exactly match what we request, cin reports an error, but it can also "jam" (stay in a failed state), which makes the code to handle it a bit more involved. A commonly used trick when dealing directly with console input is to read with getline into a string variable (such that it reads anything), and then use an istringstream object to parse the line into the required pieces; if we get an error, we know that we can continue to read through cin without any issues.
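The read-a-line-then-parse trick can be wrapped in a small function that also reports whether the line had the expected shape. This is only a sketch: the name parse_iso_date and its reference parameters are assumptions made for the example, and it checks the delimiters but not the ranges of month and day.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: parses "YYYY-MM-DD" into three integers.
// Returns false if a field cannot be read or a delimiter is not '-'.
bool parse_iso_date(const std::string& line, int& year, int& month, int& day)
{
    std::istringstream input(line);
    char dash1 = 0, dash2 = 0;
    if (!(input >> year >> dash1 >> month >> dash2 >> day))
    {
        return false;    // the string stream failed; cin is not affected
    }
    return dash1 == '-' && dash2 == '-';
}
```

Because any failure happens inside the local istringstream, the caller can simply print a message and call getline on cin again.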
Case Study: URL Encoding and Decoding

In this section, I will present a concrete application of the string manipulation facilities. The application uses both the C++ facilities, and the Boost Library regular expressions facilities.

I will present and discuss two programs, one that creates a URL-encoded version of a set of named parameters, and one that receives a URL-encoded string and decodes it (i.e., breaks it into individual pieces). In both cases, I use a simplified version of the URL-encoding mechanism, to keep the example simple.

URL-encoding is the mechanism used by web browsers to send form data to web servers for processing. The data consists of a set of values with given names. A simple example is an HTML login form, where we have two fields that could be named, for example, username and password. If, for instance, my username is carlos and my password is moreno, the browser would send the form data to the web server as the following URL-encoded string:
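username=carlos&password=moreno
Spaces are encoded as the character +. If, for instance, the password were "moreno carlos", the string would be:
username=carlos&password=moreno+carlos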

URL-Encoding Data Entered by the User

The first program prompts the user for pairs of data, indicating name of the field and value. The process stops when the user enters an empty name. At that point, the program outputs the URL-encoded data.

The basic structure of the program is a do-while loop, with stop condition given by an empty string read from the user. At each pass of the loop, the program reads the two pieces from the user and appends the corresponding item, in the form name=value, with proper encoding and separation with the character &. The program's structure for the URL-encoding is shown below:
string param_name, param_value, url_encoded;
do
{
    getline (cin, param_name);
    if (param_name != "")
    {
        getline (cin, param_value);
        // ...
    }
}
while (param_name != "");

cout << "URL-encoded: " << url_encoded << endl;

In this simplified version, encoding a string called text consists of replacing each space with the character +:
for (string::size_type i = 0; i < text.length(); i++)
{
    if (text[i] == ' ')
    {
        text[i] = '+';
    }
}
Combining these pieces, the loop becomes:
do
{
    getline (cin, param_name);
    if (param_name != "")
    {
        getline (cin, param_value);

        for (string::size_type i = 0; i < param_name.length(); i++)
        {
            if (param_name[i] == ' ')
            {
                param_name[i] = '+';
            }
        }

        for (string::size_type i = 0; i < param_value.length(); i++)
        {
            if (param_value[i] == ' ')
            {
                param_value[i] = '+';
            }
        }

        if (url_encoded != "")    // Why this if?
        {
            url_encoded += '&';
        }

        url_encoded += param_name;
        url_encoded += '=';
        url_encoded += param_value;
    }
}
while (param_name != "");

Why two for loops? Can you re-arrange the code such that there is only one for loop? (and still works correctly, naturally!)

URL-Decoding Previously-encoded Data

The second program receives a string containing URL-encoded data and breaks it into pieces, displaying the pairs (parameter name and parameter value).

The basic idea is that we loop through the string while we keep finding occurrences of the character &, which means that we keep finding additional parameter pairs. We have to keep track of consecutive occurrences of this character, as this will allow us to extract the appropriate portion of the string, containing the parameter pair in the form name=value. The following fragment shows the basic structure of the program:
string encoded;
cout << "Enter a URL-encoded string: " << flush;
getline (cin, encoded);

string::size_type pos_start = 0, pos_end;
do
{
    pos_end = encoded.find ('&', pos_start);
    string param;
    if (pos_end != string::npos)
    {
        param = encoded.substr (pos_start, pos_end - pos_start);
        pos_start = pos_end + 1;
    }
    else
    {
        param = encoded.substr (pos_start);
    }

    // Break param into individual pieces
}
while (pos_end != string::npos);

We then break each parameter pair into individual components, name and value. To do this, we simply find the position of the equal sign, and then use it to determine the appropriate arguments for the two calls to substr, as shown below:
const string::size_type pos_eq = param.find ('=');
string name, value;
if (pos_eq != string::npos)
{
    name = param.substr (0, pos_eq);
    value = param.substr (pos_eq + 1);
}
else
{
    // Error -- invalid parameter pair
}

Putting together all the pieces, we obtain the following program:
string encoded;
cout << "Enter a URL-encoded string: " << flush;
getline (cin, encoded);

string::size_type pos_start = 0, pos_end;
do
{
    pos_end = encoded.find ('&', pos_start);
    string param;
    if (pos_end != string::npos)
    {
        param = encoded.substr (pos_start, pos_end - pos_start);
        pos_start = pos_end + 1;
    }
    else
    {
        param = encoded.substr (pos_start);
    }

    for (string::size_type i = 0; i < param.length(); i++)
    {
        if (param[i] == '+')
        {
            param[i] = ' ';
        }
    }

    const string::size_type pos_eq = param.find ('=');
    if (pos_eq != string::npos)
    {
        const string name = param.substr (0, pos_eq);
        const string value = param.substr (pos_eq + 1);

        cout << name << " = " << value << endl;
    }
    else
    {
        cerr << "Invalid parameter found -- ignoring" << endl;
    }
}
while (pos_end != string::npos && pos_end != encoded.length() - 1);
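The same decoding logic can also be packaged as a function that returns the pairs instead of printing them, which makes it easier to reuse and to check. The sketch below is one possible arrangement under the same simplified rules ('&' between pairs, '=' between name and value, '+' for a space); the names Param and decode_params are hypothetical, and pieces without an equal sign are silently skipped rather than reported.

```cpp
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::string> Param;    // (name, value)

// Hypothetical helper: splits a URL-encoded string into (name, value) pairs.
std::vector<Param> decode_params(const std::string& encoded)
{
    std::vector<Param> result;
    std::string::size_type pos_start = 0;
    while (pos_start <= encoded.length())
    {
        // The current piece ends at the next '&', or at the end of the string
        std::string::size_type pos_end = encoded.find('&', pos_start);
        if (pos_end == std::string::npos)
        {
            pos_end = encoded.length();
        }
        std::string param = encoded.substr(pos_start, pos_end - pos_start);

        // Undo the encoding of spaces
        for (std::string::size_type i = 0; i < param.length(); i++)
        {
            if (param[i] == '+')
            {
                param[i] = ' ';
            }
        }

        const std::string::size_type pos_eq = param.find('=');
        if (pos_eq != std::string::npos)
        {
            result.push_back(Param(param.substr(0, pos_eq),
                                   param.substr(pos_eq + 1)));
        }
        pos_start = pos_end + 1;
    }
    return result;
}
```

For the string username=carlos&password=moreno+carlos, this returns two pairs: ("username", "carlos") and ("password", "moreno carlos").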
Tuesday, December 25, 2012
Strings and String Manipulation in C++
C++ provides the following two types of string representations:
- The C-style character string.
- The string class type introduced with Standard C++.

The C-Style Character String:
The C-style character string originated within the C language and continues to be supported within C++. This string is actually a one-dimensional array of characters which is terminated by a null character '\0'. Thus a null-terminated string contains the characters that comprise the string followed by a null.

The following declaration and initialization create a string consisting of the word "Hello". To hold the null character at the end of the array, the size of the character array containing the string is one more than the number of characters in the word "Hello".
char greeting[6] = {'H', 'e', 'l', 'l', 'o', '\0'};

If you follow the rule of array initialization, then you can write the above statement as follows:
char greeting[] = "Hello";

[Figure: memory representation of the string defined above in C/C++]

Actually, you do not place the null character at the end of a string constant. The C++ compiler automatically places the '\0' at the end of the string when it initializes the array. Let us try to print the above-mentioned string:
#include

When the above code is compiled and executed, it produces a result like the following:
Greeting message: Hello

C++ supports a wide range of functions that manipulate null-terminated strings:
S.N. | Function & Purpose |
---|---|
1 | strcpy(s1, s2); Copies string s2 into string s1. |
2 | strcat(s1, s2); Concatenates string s2 onto the end of string s1. |
3 | strlen(s1); Returns the length of string s1. |
4 | strcmp(s1, s2); Returns 0 if s1 and s2 are the same; less than 0 if s1 < s2; greater than 0 if s1 > s2. |
5 | strchr(s1, ch); Returns a pointer to the first occurrence of character ch in string s1. |
6 | strstr(s1, s2); Returns a pointer to the first occurrence of string s2 in string s1. |

The following example makes use of a few of the above-mentioned functions:
#include

When the above code is compiled and executed, it produces a result like the following:
strcpy( str3, str1) : Hello
strcat( str1, str2): HelloWorld
strlen(str1) : 10
The String Class in C++:
The standard C++ library provides a string class type that supports all the operations mentioned above, and additionally much more functionality. We will study this class in C++ Standard Library, but for now let us check the following example:

At this point you may not understand this example because so far we have not discussed Classes and Objects. So have a look, and proceed until you have an understanding of Object Oriented Concepts.
#include

When the above code is compiled and executed, it produces a result like the following:
str3 : Hello
str1 + str2 : HelloWorld
str3.size() : 10
Strings
Strings in C are represented by arrays of characters. The end of the string is marked with a special character, the null character, which is simply the character with the value 0. (The null character has no relation except in name to the null pointer. In the ASCII character set, the null character is named NUL.) The null or string-terminating character is represented by another character escape sequence, \0. (We've seen it once already, in the getline function of chapter 6.)

Because C has no built-in facilities for manipulating entire arrays (copying them, comparing them, etc.), it also has very few built-in facilities for manipulating strings.
In fact, C's only truly built-in string-handling is that it allows us to use string constants (also called string literals) in our code. Whenever we write a string, enclosed in double quotes, C automatically creates an array of characters for us, containing that string, terminated by the \0 character. For example, we can declare and define an array of characters, and initialize it with a string constant:
char string[] = "Hello, world!";
In this case, we can leave out the dimension of the array, since the compiler can compute it for us based on the size of the initializer (14, including the terminating \0). This is the only case where the compiler sizes a string array for us, however; in other cases, it will be necessary that we decide how big the arrays and other data structures we use to hold strings are.

To do anything else with strings, we must typically call functions. The C library contains a few basic string manipulation functions, and to learn more about strings, we'll be looking at how these functions might be implemented.
Since C never lets us assign entire arrays, we use the strcpy function to copy one string to another:
#include <string.h>

char string1[] = "Hello, world!";
char string2[20];

strcpy(string2, string1);
The destination string is strcpy's first argument, so that a call to strcpy mimics an assignment expression (with the destination on the left-hand side). Notice that we had to allocate string2 big enough to hold the string that would be copied to it. Also, at the top of any source file where we're using the standard library's string-handling functions (such as strcpy) we must include the line
#include <string.h>
which contains external declarations for these functions.

Since C won't let us compare entire arrays, either, we must call a function to do that, too. The standard library's strcmp function compares two strings, and returns 0 if they are identical, or a negative number if the first string is alphabetically ``less than'' the second string, or a positive number if the first string is ``greater.'' (Roughly speaking, what it means for one string to be ``less than'' another is that it would come first in a dictionary or telephone book, although there are a few anomalies.) Here is an example:
char string3[] = "this is";
char string4[] = "a test";

if(strcmp(string3, string4) == 0)
    printf("strings are equal\n");
else
    printf("strings are different\n");
This code fragment will print ``strings are different''. Notice that strcmp does not return a Boolean, true/false, zero/nonzero answer, so it's not a good idea to write something like
if(strcmp(string3, string4))
    ...
because it will behave backwards from what you might reasonably expect. (Nevertheless, if you start reading other people's code, you're likely to come across conditionals like if(strcmp(a, b)) or even if(!strcmp(a, b)). The first does something if the strings are unequal; the second does something if they're equal. You can read these more easily if you pretend for a moment that strcmp's name were strdiff, instead.)

Another standard library function is strcat, which concatenates strings. It does not concatenate two strings together and give you a third, new string; what it really does is append one string onto the end of another. (If it gave you a new string, it would have to allocate memory for it somewhere, and the standard library string functions generally never do that for you automatically.) Here's an example:
char string5[20] = "Hello, ";
char string6[] = "world!";

printf("%s\n", string5);

strcat(string5, string6);

printf("%s\n", string5);
The first call to printf prints ``Hello, '', and the second one prints ``Hello, world!'', indicating that the contents of string6 have been tacked on to the end of string5. Notice that we declared string5 with extra space, to make room for the appended characters.

If you have a string and you want to know its length (perhaps so that you can check whether it will fit in some other array you've allocated for it), you can call strlen, which returns the length of the string (i.e. the number of characters in it), not including the \0:
char string7[] = "abc";
int len = strlen(string7);
printf("%d\n", len);
Finally, you can print strings out with printf using the %s format specifier, as we've been doing in these examples already (e.g. printf("%s\n", string5);).
Since a string is just an array of characters, all of the string-handling functions we've just seen can be written quite simply, using no techniques more complicated than the ones we already know. In fact, it's quite instructive to look at how these functions might be implemented. Here is a version of strcpy:
mystrcpy(char dest[], char src[])
{
    int i = 0;

    while(src[i] != '\0')
    {
        dest[i] = src[i];
        i++;
    }

    dest[i] = '\0';
}
We've called it mystrcpy instead of strcpy so that it won't clash with the version that's already in the standard library. Its operation is simple: it looks at characters in the src string one at a time, and as long as they're not \0, assigns them, one by one, to the corresponding positions in the dest string. When it's done, it terminates the dest string by appending a \0. (After exiting the while loop, i is guaranteed to have a value one greater than the subscript of the last character in src.) For comparison, here's a way of writing the same code, using a for loop:
for(i = 0; src[i] != '\0'; i++)
    dest[i] = src[i];

dest[i] = '\0';
Yet a third possibility is to move the test for the terminating \0 character out of the for loop header and into the body of the loop, using an explicit if and break statement, so that we can perform the test after the assignment and therefore use the assignment inside the loop to copy the \0 to dest, too:
for(i = 0; ; i++)
{
    dest[i] = src[i];
    if(src[i] == '\0')
        break;
}
(There are in fact many, many ways to write strcpy. Many programmers like to combine the assignment and test, using an expression like (dest[i] = src[i]) != '\0'. This is actually the same sort of combined operation as we used in our getchar loop in chapter 6.)
Here is a version of strcmp:
mystrcmp(char str1[], char str2[]) { int i = 0; while(1) { if(str1[i] != str2[i]) return str1[i] - str2[i]; if(str1[i] == '\0' || str2[i] == '\0') return 0; i++; } }Characters are compared one at a time. If two characters in one position differ, the strings are different, and we are supposed to return a value less than zero if the first string (str1) is alphabetically less than the second string. Since characters in C are represented by their numeric character set values, and since most reasonable character sets assign values to characters in alphabetical order, we can simply subtract the two differing characters from each other: the expression str1[i] - str2[i] will yield a negative result if the i'th character of str1 is less than the corresponding character in str2. (As it turns out, this will behave a bit strangely when comparing upper- and lower-case letters, but it's the traditional approach, which the standard versions of strcmp tend to use.) If the characters are the same, we continue around the loop, unless the characters we just compared were (both) \0, in which case we've reached the end of both strings, and they were both equal. Notice that we used what may at first appear to be an infinite loop--the controlling expression is the constant 1, which is always true. What actually happens is that the loop runs until one of the two return statements breaks out of it (and the entire function). Note also that when one string is longer than the other, the first test will notice this (because one string will contain a real character at the [i] location, while the other will contain \0, and these are not equal) and the return value will be computed by subtracting the real character's value from 0, or vice versa. (Thus the shorter string will be treated as ``less than'' the longer.) Finally, here is a version of strlen:
int mystrlen(char str[])
{
    int i;
    for(i = 0; str[i] != '\0'; i++) {}
    return i;
}
In this case, all we have to do is find the \0 that terminates the string, and it turns out that the three control expressions of the for loop do all the work; there's nothing left to do in the body. Therefore, we use an empty pair of braces {} as the loop body. Equivalently, we could use a null statement, which is simply a semicolon:
for(i = 0; str[i] != '\0'; i++) ;
Empty loop bodies can be a bit startling at first, but they're not unheard of.
Everything we've looked at so far has come out of C's standard libraries. As one last example, let's write a substr function, for extracting a substring out of a larger string. We might call it like this:
char string8[] = "this is a test";
char string9[10];
substr(string9, string8, 5, 4);
printf("%s\n", string9);
The idea is that we'll extract a substring of length 4, starting at character 5 (0-based) of string8, and copy the substring to string9. Just as with strcpy, it's our responsibility to declare the destination string (string9) big enough. Here is an implementation of substr. Not surprisingly, it's quite similar to strcpy:
substr(char dest[], char src[], int offset, int len)
{
    int i;
    for(i = 0; i < len && src[offset + i] != '\0'; i++)
        dest[i] = src[i + offset];
    dest[i] = '\0';
}
If you compare this code to the code for mystrcpy, you'll see that the only differences are that characters are fetched from src[offset + i] instead of src[i], and that the loop stops when len characters have been copied (or when the src string runs out of characters, whichever comes first).
In this chapter, we've been careless about declaring the return types of the string functions, and (with the exception of mystrcmp and mystrlen) they haven't returned values. The real string functions do return values, but they're of type "pointer to character," which we haven't discussed yet.
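Compilers that follow C99 or later no longer accept a function defined without a return type, so to try these functions today the types need to be spelled out. The following is a sketch of substr and mystrcmp under that assumption: void and int are the assumed return types, and the const qualifiers are an addition not found in the listings above.

```c
/* substr with an explicit return type; it returns nothing, so void.
   Like the original, it assumes offset lies within src. */
void substr(char dest[], const char src[], int offset, int len)
{
    int i;

    /* Stop after len characters or at the end of src, whichever comes first. */
    for (i = 0; i < len && src[offset + i] != '\0'; i++)
        dest[i] = src[offset + i];
    dest[i] = '\0';
}

/* mystrcmp returns an int: negative, zero, or positive. */
int mystrcmp(const char str1[], const char str2[])
{
    int i = 0;

    while (1) {
        if (str1[i] != str2[i])
            return str1[i] - str2[i];
        if (str1[i] == '\0')    /* both are '\0' here, so the strings match */
            return 0;
        i++;
    }
}
```

With the call shown earlier, substr(string9, string8, 5, 4) leaves the four characters "is a" in string9.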
When working with strings, it's important to keep firmly in mind the differences between characters and strings. We must also occasionally remember how characters are represented, and how character values relate to integers.
As we have had several occasions to mention, a character is represented internally as a small integer, with a value depending on the character set in use. For example, we might find that 'A' had the value 65, that 'a' had the value 97, and that '+' had the value 43. (These are, in fact, the values in the ASCII character set, which most computers use. However, you don't need to learn these values, because the vast majority of the time, you use character constants to refer to characters, and the compiler worries about the values for you. Using character constants in preference to raw numeric values also makes your programs more portable.)
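To see these values for yourself, a character can be printed with %d as well as with %c. The helper below uses a hypothetical name introduced only for illustration; the numbers it reports depend on the character set, and 65 for 'A' is what an ASCII system would show.

```c
#include <stdio.h>

/* Print a character next to its integer value in the machine's
   character set, and return that value. */
int show_char_value(char c)
{
    printf("'%c' has the value %d\n", c, c);
    return c;
}
```

Note that the argument is still written as a character constant such as 'A'; the numeric value only appears in the output.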
As we may also have mentioned, there is a big difference between a character and a string, even a string which contains only one character (other than the \0). For example, 'A' is not the same as "A". To drive home this point, let's illustrate it with a few examples.
If you have a string:
char string[] = "hello, world!";
you can modify its first character by saying
string[0] = 'H';
(Of course, there's nothing magic about the first character; you can modify any character in the string in this way. Be aware, though, that it is not always safe to modify strings in-place like this; we'll say more about the modifiability of strings in a later chapter on pointers.) Since you're replacing a character, you want a character constant, 'H'. It would not be right to write
string[0] = "H"; /* WRONG */
because "H" is a string (an array of characters), not a single character. (The destination of the assignment, string[0], is a char, but the right-hand side is a string; these types don't match.)
On the other hand, when you need a string, you must use a string. To print a single newline, you could call
printf("\n");
It would not be correct to call
printf('\n'); /* WRONG */
printf always wants a string as its first argument. (As one final example, putchar wants a single character, so putchar('\n') would be correct, and putchar("\n") would be incorrect.)
We must also remember the difference between strings and integers. If we treat the character '1' as an integer, perhaps by saying
int i = '1';
we will probably not get the value 1 in i; we'll get the value of the character '1' in the machine's character set. (In ASCII, it's 49.)
When we do need to find the numeric value of a digit character (or to go the other way, to get the digit character with a particular value) we can make use of the fact that, in any character set used by C, the values for the digit characters, whatever they are, are contiguous. In other words, no matter what values '0' and '1' have, '1' - '0' will be 1 (and, obviously, '0' - '0' will be 0). So, for a variable c holding some digit character, the expression
c - '0'
gives us its value. (Similarly, for an integer value i, i + '0' gives us the corresponding digit character, as long as 0 <= i <= 9.)
Just as the character '1' is not the integer 1, the string "123" is not the integer 123. When we have a string of digits, we can convert it to the corresponding integer by calling the standard function atoi:
char string[] = "123";
int i = atoi(string);
int j = atoi("456");
Later we'll learn how to go in the other direction, to convert an integer into a string. (One way, as long as what you want to do is print the number out, is to call printf, using %d in the format string.)
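The c - '0' idiom is enough to build a simple string-to-integer conversion by hand, and snprintf with %d is one way to go in the other direction. The sketch below uses made-up names (digits_to_int, int_to_digits), handles only strings of plain digits with no sign, whitespace, or overflow checking, and is not meant as a description of how atoi is implemented.

```c
#include <stdio.h>

/* Convert a string of digit characters to an int, one digit at a time. */
int digits_to_int(const char str[])
{
    int i;
    int value = 0;

    /* The loop ends at the first non-digit, which includes the final '\0'. */
    for (i = 0; str[i] >= '0' && str[i] <= '9'; i++)
        value = value * 10 + (str[i] - '0');    /* digit character -> 0..9 */
    return value;
}

/* Write the decimal form of n into dest, which has room for size characters. */
void int_to_digits(int n, char dest[], size_t size)
{
    snprintf(dest, size, "%d", n);
}
```

For the string "123", digits_to_int returns 123, the same result atoi gives. As with the other string functions, the caller of int_to_digits is responsible for making dest big enough.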