Closed
Bug 28952
Opened 25 years ago
Closed 23 years ago
nsCRT::strtok broken for characters > 0x7f
Categories
(Core :: XPCOM, defect, P3)
Core
XPCOM
Tracking
()
RESOLVED
FIXED
People
(Reporter: Bienvenu, Assigned: scc)
References
Details
Attachments
(1 file)
(deleted),
patch
|
Details | Diff | Splinter Review |
nsCRT::strtok gives false positives when the string being searched contains
characters > 0x7f - for example, when searching for '/', it thinks that ç is
'/'.
It's also very inefficient for the simple case of looking for a single
delimiter, since it initializes two 32 byte arrays every time.
I've stopped using it in the place that was causing me trouble, but I think we
should fix this since it can cause hard to reproduce bugs. I would try to fix it
myself, but it's not clear to me why it works (or doesn't work).
Comment 1•25 years ago
|
||
I went looking for where I had copied this strtok routine from and couldn't
find it. But I did find this one in the 4.x codebase in lib/xp/xp_str.c:
1.31 (kmcentee 15-Jan-97):
/*************************************************
1.31 (kmcentee 15-Jan-97): The following functions are used to
implement
1.31 (kmcentee 15-Jan-97): a thread safe strtok
1.31 (kmcentee 15-Jan-97):
*************************************************/
1.31 (kmcentee 15-Jan-97): /*
1.31 (kmcentee 15-Jan-97): * Get next token from string *stringp,
where tokens are (possibly empty)
1.31 (kmcentee 15-Jan-97): * strings separated by characters from
delim. Tokens are separated
1.31 (kmcentee 15-Jan-97): * by exactly one delimiter iff the skip
parameter is false; otherwise
1.31 (kmcentee 15-Jan-97): * they are separated by runs of characters
from delim, because we
1.31 (kmcentee 15-Jan-97): * skip over any initial `delim' characters.
1.31 (kmcentee 15-Jan-97): *
1.31 (kmcentee 15-Jan-97): * Writes NULs into the string at *stringp
to end tokens.
1.31 (kmcentee 15-Jan-97): * delim will usually, but need not, remain
CONSTant from call to call.
1.31 (kmcentee 15-Jan-97): * On return, *stringp points past the last
NUL written (if there might
1.31 (kmcentee 15-Jan-97): * be further tokens), or is NULL (if there
are definitely no more tokens).
1.31 (kmcentee 15-Jan-97): *
1.31 (kmcentee 15-Jan-97): * If *stringp is NULL, strtoken returns
NULL.
1.31 (kmcentee 15-Jan-97): */
1.31 (kmcentee 15-Jan-97): static
1.31 (kmcentee 15-Jan-97): char *strtoken_r(char ** stringp, const char
*delim, int skip)
1.31 (kmcentee 15-Jan-97): {
1.31 (kmcentee 15-Jan-97): char *s;
1.31 (kmcentee 15-Jan-97): const char *spanp;
1.31 (kmcentee 15-Jan-97): int c, sc;
1.31 (kmcentee 15-Jan-97): char *tok;
1.31 (kmcentee 15-Jan-97):
1.31 (kmcentee 15-Jan-97): if ((s = *stringp) == NULL)
1.31 (kmcentee 15-Jan-97): return (NULL);
1.31 (kmcentee 15-Jan-97):
1.31 (kmcentee 15-Jan-97): if (skip) {
1.31 (kmcentee 15-Jan-97): /*
1.31 (kmcentee 15-Jan-97): * Skip (span) leading
delimiters (s += strspn(s, delim)).
1.31 (kmcentee 15-Jan-97): */
1.31 (kmcentee 15-Jan-97): cont:
1.31 (kmcentee 15-Jan-97): c = *s;
1.31 (kmcentee 15-Jan-97): for (spanp = delim; (sc =
*spanp++) != 0;) {
1.31 (kmcentee 15-Jan-97): if (c == sc) {
1.31 (kmcentee 15-Jan-97): s++;
1.31 (kmcentee 15-Jan-97): goto cont;
1.31 (kmcentee 15-Jan-97): }
1.31 (kmcentee 15-Jan-97): }
1.31 (kmcentee 15-Jan-97): if (c == 0) { /* no
token found */
1.31 (kmcentee 15-Jan-97): *stringp = NULL;
1.31 (kmcentee 15-Jan-97): return (NULL);
1.31 (kmcentee 15-Jan-97): }
1.31 (kmcentee 15-Jan-97): }
1.31 (kmcentee 15-Jan-97):
1.31 (kmcentee 15-Jan-97): /*
1.31 (kmcentee 15-Jan-97): * Scan token (scan for delimiters: s
+= strcspn(s, delim), sort of).
1.31 (kmcentee 15-Jan-97): * Note that delim must have one NUL;
we stop if we see that, too.
1.31 (kmcentee 15-Jan-97): */
1.31 (kmcentee 15-Jan-97): for (tok = s;;) {
1.31 (kmcentee 15-Jan-97): c = *s++;
1.31 (kmcentee 15-Jan-97): spanp = delim;
1.31 (kmcentee 15-Jan-97): do {
1.31 (kmcentee 15-Jan-97): if ((sc = *spanp++) ==
c) {
1.31 (kmcentee 15-Jan-97): if (c == 0)
1.31 (kmcentee 15-Jan-97): s =
NULL;
1.31 (kmcentee 15-Jan-97): else
1.31 (kmcentee 15-Jan-97): s[-1] =
0;
1.31 (kmcentee 15-Jan-97): *stringp = s;
1.31 (kmcentee 15-Jan-97): return( (char
*) tok );
1.31 (kmcentee 15-Jan-97): }
1.31 (kmcentee 15-Jan-97): } while (sc != 0);
1.31 (kmcentee 15-Jan-97): }
1.31 (kmcentee 15-Jan-97): /* NOTREACHED */
1.31 (kmcentee 15-Jan-97): }
1.31 (kmcentee 15-Jan-97):
1.31 (kmcentee 15-Jan-97):
1.31 (kmcentee 15-Jan-97): char *XP_STRTOK_R(char *s1, const char *s2,
char **lasts)
1.31 (kmcentee 15-Jan-97): {
1.31 (kmcentee 15-Jan-97): if (s1)
1.31 (kmcentee 15-Jan-97): *lasts = s1;
1.31 (kmcentee 15-Jan-97): return (strtoken_r(lasts, s2, 1));
1.31 (kmcentee 15-Jan-97): }
1.31 (kmcentee 15-Jan-97):
1.31 (kmcentee 15-Jan-97):
Updated•25 years ago
|
Target Milestone: M15
Comment 2•25 years ago
|
||
Scott: Can you own this problem along with all the other string/intl cleanup?
Maybe this bug should just become "eliminate nsCRT::strtok". Thanks,
Assignee: warren → scc
Updated•25 years ago
|
Status: NEW → ASSIGNED
Updated•25 years ago
|
Target Milestone: M15 → M16
Updated•25 years ago
|
Target Milestone: M16 → M18
Comment 3•24 years ago
|
||
mass re-assigning to my new bugzilla account
Assignee: scc → scc
Status: ASSIGNED → NEW
Assignee | ||
Updated•24 years ago
|
Status: NEW → ASSIGNED
Assignee | ||
Updated•24 years ago
|
Component: XP Miscellany → String
OS: Windows NT → All
Hardware: PC → All
Target Milestone: M18 → ---
Assignee | ||
Updated•24 years ago
|
Target Milestone: --- → mozilla0.9.1
Assignee | ||
Updated•24 years ago
|
Target Milestone: mozilla0.9.1 → mozilla0.9
Assignee | ||
Updated•24 years ago
|
Target Milestone: mozilla0.9 → mozilla0.9.1
Assignee | ||
Comment 4•24 years ago
|
||
re-targeting milestones, starting from a clean slate
Target Milestone: mozilla0.9.1 → ---
Comment 5•23 years ago
|
||
The function was indeed very broken. Not only the 8-bit cleanliness (resulting
from using a character from a string as an index, think about the sign of
'char'!) there was also some bit ro tin the parsing of the delims string. Why
the length of delims should have anything to do with DELIM_TABLE_SIZE is beyond me.
Anyhow, I'll attach a patch which I've tested (I'm writing this in the resulting
browser).
Comment 6•23 years ago
|
||
Comment 7•23 years ago
|
||
We need to make sure this is well tested for side effects. Not so much for
broken functionality but because of places that might depend on this broken
functionality. This is used throughout the code.
Also, why is the table size even in there in the first place? I noticed that
Ulrich has removed it in his patch.
Comment 8•23 years ago
|
||
What changed are two things:
- on systems with signed char type the array access does not leave the bounds
of the array. If code relies on the old behavior it is broken since the
compiler is free to place the array variable where it wants and therefore the
memory outside the array is completely undefined
- the removal of the i < DELIM_TABLE_SIZE test can potentially have consequences
if some call to nsCRT::strtok is using more then 32 delimiter characters.
This is highly unlikely, though.
If you want to be safe place an assert after the loop checking that i is
< DELIM_TABLE_SIZE. I've looked through a great deal of the strtok uses
and haven't found a place where this can be a problem. There are in total
no more than two or three dozen uses of this function so an extensive review
is possible.
Comment 9•23 years ago
|
||
Actually, thinking more about my last comment, there cannot be any problem at
all. If there would be a delim string > 32 chars this would already have been
caught.
Which leaves only people relying on uninitialized memory which is highly
unlikely to work (to say the least) in a cross-platform product like Mozilla.
Assignee | ||
Comment 10•23 years ago
|
||
Ulrich: jst and I have reviewed your patch and |nsCRT::strtok| in detail. We
think there are other good things that can be done to make |nsCRT::strtok|
better, but your patch is a good start, so I'll check it in immediately.
r=jst, sr=scc.
Things we think are bad about |nsCRT::strtok|: initialization of the delimiter
table could be done with a static string, e.g.,
char delim[ DELIM_TABLE_SIZE ] =
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0";
We think more comments are needed. It took us a bit to figure out that the
table was a bit map of all of ASCII. The macros need commenting to explain
this, as do the magic numbers |DELIM_TABLE_SIZE|, 3, and 7. The assertion is
nonsense, since the routine actually supports having up to 256 different
delimiters (though that wouldn't be useful). The existing code uses
post-increment in places where it shouldn't. The code is tricky, and that's not
a good thing, especially when the tricks aren't commented.
Assignee | ||
Comment 11•23 years ago
|
||
Thanks for your help, Ulrich. Your fix is checked in. I'm closing this bug.
If anyone thinks the other comments about |nsCRT::strtok| in this bug are worth
fixing, we can file a new bug.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Comment 12•23 years ago
|
||
> Things we think are bad about |nsCRT::strtok|: initialization of the delimiter
> table could be done with a static string, e.g.,
> char delim[ DELIM_TABLE_SIZE ] =
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0";
Be careful. You'll have to look at the generated code on several platforms.
Initializers like the above are generally a problem on some of them. gcc on
SPARC had/has? problem, means, generates bad code. If the compiler is not
generating an inlined memset() call for a construct like that above it will
implement it as single register moves to fill the stack slots which is worse.
If you take a look at the implementation I have in glibc you'd see that you'll
see that I haven't taken the effort to make a lot of optimizations. I have a
special version coded in asm for some platforms but it generally seems not to
be necessary. Unless very nicely tuned the startup costs associated with
creating the bitmap are higher than the gains. You can perhaps try to keep
track of the calls to this function and see how much work is actually done in
each calls (how much the input pointer is advanced).
So my advice would be to not do anything except for using the system's strtok_r
implementation if it exists. The LDAP c-sdk code already does this (and it
works after applying the patch I filed).
Assignee | ||
Comment 13•23 years ago
|
||
If you're concerned about initialization costs, here's a data-structure that has
a lower initialization cost than the bitmap, and I think an equivalent lookup
cost (though I'd have to look at the disassembly to know for sure)
PRUint8 set_size = 0; // number of delimiters in |set|
unsigned char set[256]; // set[0..set_size-1] contains the delimiters
// for which we are searching
PRUint8 map[256]; // array of indexes into |set|
// test if a delimiter |d| is in the set
pos_in_set = map[PRUint8(d)]
pos_in_set < set_size && set[pos_in_set] == d
// add a delimiter |d| to the set (if it's not already there)
set[map[PRUint8(d)] = set_size++] = d
The initiliazation cost in this scheme for zero delimiters is only setting
|set_size|. Neither array requires any initialization. Removal from the set is
easy as well, but no need to show it here since it we have no use for it in this
case.
Updated•4 years ago
|
Component: String → XPCOM
You need to log in
before you can comment on or make changes to this bug.
Description
•