The String structure provides a typical collection of string operations. The documentation is straightforward.
Note that all subscripts are range checked. This can be a significant performance cost if you use, for example, the sub function to search through a string. Where possible, use one of the other functions instead. Alternatively, you may find that explode followed by list operations is faster, at the cost of more memory. If you really need a lot of fast indexing into a string, the Unsafe.CharVector.sub function is a subscript function without range checking. It is defined in the boot/Unsafe directory of the compiler source.
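As a rough illustration of the two styles mentioned above, the following sketch counts occurrences of a character first with range-checked String.sub and then with explode followed by list operations. The function names countWithSub and countWithList are made up for this example, and no claim is made here about which one is faster on a given compiler.

```sml
(* Count occurrences of c in s by indexing. Every call to
   String.sub is range checked. *)
fun countWithSub (c : char, s : string) : int =
let
    fun loop (i, n) =
        if i >= size s then n
        else loop (i + 1, if String.sub (s, i) = c then n + 1 else n)
in
    loop (0, 0)
end

(* The same count using explode and list operations. This builds
   a list of all the characters, so it uses more memory. *)
fun countWithList (c : char, s : string) : int =
    length (List.filter (fn x => x = c) (explode s))
```

Both functions return the same count for the same arguments, so either can be swapped in when measuring.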
The current version of SML/NJ (110.0.7) does not implement wide strings.
The Char structure provides a typical collection of character operations, again straightforward. The character classification functions like isSpace and case conversion are here.
Here are some examples.
    fun uppercase s = String.map Char.toUpper s

    (* Apply HTML quoting. *)
    fun quoteHTML v =
    let
        fun quote #"\"" = "&quot;"
        |   quote #"&"  = "&amp;"
        |   quote #"<"  = "&lt;"
        |   quote c     = str c
    in
        String.translate quote v
    end

    (* Break a string into words at white space. *)
    fun split s = String.tokens Char.isSpace s
For more elaborate string parsing on large strings the functions in the Substring structure will be more efficient. A substring is represented as a pointer to a range of characters in an underlying string. So you can work on pieces of the string without any copying. Here are some examples.
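    (* Drop leading white space from a string. *)
    fun skipWhiteSpace s =
        Substring.string(
            Substring.dropl Char.isSpace (Substring.all s))

    (* The count starts at 1, so the result is the number of
       newline characters plus one. *)
    fun countLines s =
        Substring.foldl
            (fn (ch,n) => if ch = #"\n" then n+1 else n)
            1
            (Substring.all s)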
The StringCvt structure provides the infrastructure for reading values out of text. The infrastructure is based around the idea of a reader function that can split a value off of the beginning of a stream. Then there are transformers that build up more complex readers from simpler readers.
You have complete freedom in how you represent and implement streams just as long as they have a functional style. This means that if you have a stream strm and a get function that gets the first value of the stream then the expression (get strm) can be evaluated as many times as you like and it will always return the same value since it has no side-effects on the stream. The type of a reader function is
    type ('a, 'b) reader = 'b -> ('a * 'b) option
In this, type 'b represents the stream and 'a is a value from the stream. The reader function returns one value from the front of the stream along with the remainder of the stream. An option type is used with NONE representing failure to find a value in the stream. So for example if strm contains the characters "abc" and get is a character reader then (get strm) returns the first character from the stream along with the rest of the stream as the pair SOME(#"a", "bc").
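To make the reader idea concrete, here is a minimal sketch of a character reader. It assumes the stream is represented as a list of characters, which is only one possible choice, and the names getc, strm, first and again are hypothetical.

```sml
(* A character reader over a stream represented as a char list. *)
val getc : (char, char list) StringCvt.reader =
    fn []          => NONE
     | (c :: rest) => SOME (c, rest)

val strm  = explode "abc"

(* Reading has no side-effects on the stream, so reading twice
   from the same stream value gives the same answer. *)
val first = getc strm
val again = getc strm
```

Here first is SOME paired with the character a and the remaining list of b and c, and again is equal to first.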
Transformer functions convert a stream of some type T to a stream of another type U. In the infrastructure they are applied on the fly. At some point you have to arrive at a stream of characters for the text scanning. Figure 3-1 shows an example of transformations for reading a stream of integers from character buffers. You write your own transformer that delivers characters one by one from the buffers. This involves maintaining an offset into the buffer that records where the next character will come from. Then you can use the Int.scan function to transform this stream of characters into a stream of integers.
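The buffer arrangement described above can be sketched as follows. This is a simplification that assumes a single buffer held in a string, with the stream represented as a pair of the buffer and an offset. The names bufGetc, intReader and readAll are invented for the illustration.

```sml
(* The stream is (buffer, offset). The reader returns the character
   at the offset together with a stream advanced by one. *)
fun bufGetc (buf : string, i : int) =
    if i < size buf
    then SOME (String.sub (buf, i), (buf, i + 1))
    else NONE

(* Int.scan transforms the character reader into an integer reader
   over the same kind of stream. It skips leading white space. *)
val intReader = Int.scan StringCvt.DEC bufGetc

(* Apply the integer reader repeatedly until it fails. *)
fun readAll strm =
    case intReader strm of
      NONE           => []
    | SOME (n, rest) => n :: readAll rest

val numbers = readAll ("12 34 56", 0)
```

With the input shown, numbers is the list of 12, 34 and 56. The reading stops at the first position where no integer can be scanned.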
Some possible sources of character streams are:
a string. The StringCvt.scanString function will transform a string into a character stream, deliver it to a transformer, and return the first value from the transformed stream. It is designed for one-off use to implement fromString functions.
an input stream from a file. The TextIO.scanStream function will deliver a stream of characters from a file to a transformer and then return the first value from the transformed stream. Since I/O streams are imperative, the stream will be updated as a side-effect, but it will behave functionally for long enough to complete the reading of the value.
a list of characters. The List.getItem function delivers list elements in the form required of a reader.
a substring of a string using Substring.getc.
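As an illustration of three of these sources, the sketch below scans a decimal integer from a string, from a list of characters, and from a substring. It assumes a Basis Library in which the conversion from string to substring is named Substring.full. With the Substring structure used elsewhere in this chapter the same function is Substring.all. The names fromStr, fromList and fromSub are hypothetical.

```sml
(* From a string: scanString returns only the scanned value. *)
val fromStr = StringCvt.scanString (Int.scan StringCvt.DEC) "42 apples"

(* From a list of characters: List.getItem is the character reader. *)
val fromList = Int.scan StringCvt.DEC List.getItem (explode "17 x")

(* From a substring: Substring.getc is the character reader. The
   remaining substring is converted to a string for inspection.
   Assumption: Substring.full (Substring.all in older releases). *)
val fromSub =
    Option.map
        (fn (n, rest) => (n, Substring.string rest))
        (Int.scan StringCvt.DEC Substring.getc (Substring.full "8 left"))
```

Note that the list and substring versions return the rest of the stream alongside the value, whereas scanString discards it.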
Here is an example of a function that transforms a character stream by splitting the stream into lines and extracting a triple of an integer, boolean and a real from each line.
    fun ibr rdr the_cstrm =
    let
        (* Read all characters to the end of the string or a newline.
           This returns the line and the rest of the stream. *)
        fun get_line cstrm rev_line =
        (
            case rdr cstrm of
              NONE => (cstrm, implode(rev rev_line))  (* ran out of chars *)

            | SOME (c, rest) =>
            (
                if c = #"\n"
                then
                    (rest, implode(rev rev_line))
                else
                    get_line rest (c::rev_line)
            )
        )

        val (strm_out, line) = get_line the_cstrm []

        val l1 = Substring.all line

        val (i, l2) = valOf(Int.scan StringCvt.DEC Substring.getc l1)
        val (b, l3) = valOf(Bool.scan Substring.getc l2)
        val (r, l4) = valOf(Real.scan Substring.getc l3)
    in
        SOME((i, b, r), strm_out)
    end
    handle Option => NONE
The transformer must take a character reader as an argument so I want the expression (ibr rdr) to be a reader that reads triples when rdr is a character reader. But a reader is a function taking a stream as an argument. So if I define the function as fun ibr rdr the_cstrm where the_cstrm is the character stream to read then, using currying, the expression (ibr rdr) will be of the correct type, i.e. a function taking a character stream.
The first thing the transformer does is read characters from the stream until it has a complete line. The characters are accumulated into a list in reverse. At the end they are joined into a string. This is the fastest way to accumulate a string from characters.
Next I want to get the line into a stream that can be scanned. The Substring.getc function satisfies the requirements for a reader function if the stream is a substring. Then I can call the scan functions for each type to get the values on the line. Note that I get back an updated substring in the l2, l3 and l4 variables. I use valOf to get the result out from under the SOME. If the scan fails then it will return NONE which will cause valOf to raise the Option exception. A reader function indicates failure by returning NONE so that's what the exception handler does.
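The valOf and handle pattern can be seen in isolation in a smaller transformer. The following is only a sketch: scanPair is a hypothetical function that reads two decimal integers from any character stream and fails if either scan fails.

```sml
(* Read two integers. If either Int.scan returns NONE then valOf
   raises Option and the handler turns that into a NONE result. *)
fun scanPair getc strm =
let
    val (a, s1) = valOf (Int.scan StringCvt.DEC getc strm)
    val (b, s2) = valOf (Int.scan StringCvt.DEC getc s1)
in
    SOME ((a, b), s2)
end
handle Option => NONE

val good = StringCvt.scanString scanPair "3 4"
val bad  = StringCvt.scanString scanPair "3 x"
```

The first call succeeds with the pair of 3 and 4. The second call returns NONE because the second scan finds no integer.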
The main program shows how StringCvt.scanString applies the transformer to a string returning exactly one result.
    fun main(arg0, argv) =
    let
        val text = "\
                \ 123 true 23.4      \n\
                \ -1 false -1.3e3    \n\
                \"
    in
        case StringCvt.scanString ibr text of
          NONE => print "ibr failed\n"

        | SOME (i, b, r) => print(concat[
                            Int.toString i, " ",
                            Bool.toString b, " ",
                            Real.toString r, "\n"
                            ]);

        OS.Process.success
    end
See the documentation for StringCvt for more details.
Bytes are represented by the type Word8.word. The Byte structure provides conversions between strings of characters and sequences of bytes. This will be especially useful for the web server project, since reading from and writing to TCP/IP sockets is done with bytes, which we will want to convert to strings.
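A minimal sketch of the conversions follows. The request line used as sample text and the names request, bytes, count, first and text are chosen only for illustration.

```sml
val request = "GET / HTTP/1.0\r\n"

(* Convert the string to a vector of bytes. *)
val bytes : Word8Vector.vector = Byte.stringToBytes request

(* The byte vector has one element per character. *)
val count = Word8Vector.length bytes
val first : Word8.word = Word8Vector.sub (bytes, 0)

(* Convert the bytes back to a string. *)
val text = Byte.bytesToString bytes
```

The round trip returns the original string, and the first byte is the byte for the character G.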