It’s interesting that even though it’s now decades old we’re still running into this fundamentally rather trivial problem. It is especially pronounced in heavily cross-platform environments like VisualWorks. As soon as you have multiple environments, e.g. Windows and Linux and you move text files between the two, you’re pretty much guaranteed to run into files with doubled or tripled line-ends in them sooner or later. Often you can see both line-end conventions mixed up in the same file. The purpose of this article is to provide a reasonably complete picture of what’s going on in VisualWorks in this regard and how are we supposed to deal with that. Even seasoned smalltalkers sometimes miss some part of the whole picture, making it more difficult to deal with the consequences.

What’s going on?

Here’s a quick recapitulation. There are 3 line end conventions in common use: CR (MacOS), LF (Unix) and CRLF (Windows). CR and LF here refer to characters with ASCII code 13 and 10 respectively. This historical fact can have a profound effect on an environment with the ambition to have fully binary portable images across platforms. Let’s say we try to emulate the platform specific line end convention, i.e. use characters CR and LF in String instances on Windows, just LF on Unix, etc. This would make binary portability pretty difficult. Whenever we move an image from Windows to Unix and start it up we would need to convert all the existing String instances from the CRLF convention to LF convention. What’s worse it will effectively change the character size of many Strings. While that might be somewhat manageable (we already do something similar for platforms with different endianness) there are further implications. If strings get shortened, the positions of streams set up on top of them will suddenly be off as well. In fact any kind of “pointer” into the String will be potentially broken. That will be much harder to fix. And if it’s not fixed, things will start breaking. Binary portability wouldn’t really work.

So the only alternative is to pick one convention and stick with it everywhere. Historically the choice has been CR. I don’t know why, maybe there was certain affinity of the original Smalltalk-80 developers towards Macintosh platform back then. Ironically, we may eventually end up being the only environment keeping the CR convention as MacOS/X is moving Macs away from the CR convention towards the LF convention of it’s Unix based core. Nevertheless, the important point is that in Smalltalk line ends in String instances should *always* be marked with character CR only, no matter what platform it’s running on. While character LF is a perfectly valid character and it’s pretty easy to construct a String with that character in it, a String with LFs is usually a sign of some kind of anomaly and you can be sure that some components of the vast Smalltalk library won’t be able to cope with it. This is very important, so remember, only CRs in your Strings ! This simplification has other advantages too, things like reading a line of text from a stream becomes simply “upTo: Character cr” anywhere.

Tools of the trade

OK, now that’s settled, what about strings that come from outside ? It’s not a Smalltalk only world out there (not yet anyway) and the text files on your harddrive will usually use whatever is the native convention for a given platform. The key point here is that these strings are coming from “outside”. They should be converted to the Smalltalk convention (CR) as they are brought in. That’s why Strings with LF characters in them are often a sign of a failure to convert properly.

The most common method to bring a String in from outside is to use an ExternalStream. And indeed if you take a closer look at those, you’ll see that they have a configurable lineEndConvention. Setting an ExternalStream to #lineEndCRLF means that if you are using it to read e.g. a file, it will automatically convert byte sequence #[13 10] into a single character CR. When writing into such stream, it will automatically produce byte sequence #[13 10] whenever it is given character CR. Note that it will write #[10] if you give it character LF instead and similarly it will give you an LF when reading if it encounters a standalone byte 10 (one that is not preceded by 13). This is all assuming the stream is in text mode which is the default mode. If the stream is set to binary mode, it will pass the byte Integers through as they are, without any conversions. But you’ll be getting ByteArrays out of the stream, not Strings in that case. The last thing that the streams will do for you automatically is set a default line end convention based on the platform that the image is running on. So if you create an external stream on Linux, it will be set to #lineEndLF. On Windows it will be set to #lineEndCRLF, etc.

There’s another kind of streams that provide these capabilities, the EncodedStreams. Primary function of these is to convert bytes to characters according to a specified ‘character encoding’. Character encoding is an even bigger can of worms, but for the purpose of this discussion let’s just focus on the fact that EncodedStreams provide line end conversion the same way as external streams. In fact you can set an external stream to binary mode, wrap an EncodedStream around it and you should get the same results as with the external stream in text mode. This kind of setup will be handy in cases where you need to deal with character encodings that aren’t supported by the external streams. There are in fact only very few of those that are supported by external streams: ISO8859_1, MSCP1252 for older Windows versions, and few more obscure encodings. For anything else (e.g. UTF-8) you will need to use an EncodedStream. So it is actually quite important that the EncodedStream is polymorphic with ExternalStreams.

When tools need a hand

So far so good. We now know the tools that are available in the base image for dealing with line-end conversions. The automatic configuration for native platform convention usually works out just fine when you’re dealing with a single platform. But what if you’re accessing a file system from Windows that is shared with Linux ? These are the cases when you may need to intervene and set the line end convention appropriately yourself.

There’s also a “pseudo” convention #lineEndAuto which makes the stream do a quick scan of an initial portion of the stream to determine which lineEndConvention to use based on the actual contents. This however may not be feasible on some types of external streams, e.g. socket streams, because it requires reliable positioning capability.

Yet another alternative is the following. Instead of insisting on a specific convention for the entire stream, we could simply convert any convention to CRs as the stream is being read. This is especially nice when you want to cleanup a text with mixed conventions. Moreover this is actually doable without any pre-scanning and therefore could work on any kind of stream, even stream that cannot be peeked and positioned. This functionality isn’t available in the Base image though, so I’m going to plug here my AutoLineEndStream which does exactly that. It can be found in the ComputingStreams package in the public repository. With this the following will evaluate to true:

	stream := #[10 13 10 10 13 13] asString readStream.
	stream := AutoLineEndStream wrap: stream.
	stream contents = '\\\\\' withCRs

Note that AutoLineEndStream wraps a character stream (a stream that provides and consumes characters), e.g. an internal character stream or an external or encoded stream in text mode. The reason for this is complexity of character encodings. Even characters CR and LF need to be encoded when they are converted to/from bytes, and the encoding isn’t always just 13 or 10. A CR can be #[13 00] or #[00 13] in UTF-16 encoding depending on the endianness and it can get even worse with other encodings. In general case it is impossible to pick out CR and LF out of an encoded text without decoding the other characters as well. So making AutoLineEndStream work with bytes would force it deep into the character encoding issues. It is much simpler to leave this domain to EncodedStreams and operate on characters instead, suddenly its task becomes trivial. So if you need to read an external file simply keep the stream (external or encoded) in text mode, set it to #lineEndTransparent which means don’t do any conversion (all CRs and LFs will be preserved), and wrap it in AutoLineEndStream. Here’s the corresponding code:

	stream := 'messed up file.txt' asFilename readStream
	stream lineEndTransparent.
	stream := AutoLineEndStream wrap: stream.

Note that #lineEndTransparent is actually equivalent to #lineEndCR, just named differently to emphasize the effective transparency of this mode, and in this case definitely expresses the intent better.

I should also mention that AutoLineEndStream will perform the same conversion when writing. If you write a CR and then an LF into the stream, it will write only CR into the underlying stream. So if you happen to have a String with mixed up line ends in memory, simply writing it through an AutoLineEndStream should straighten that ou as well. The following should evaluate to true as well.

	stream := String new writeStream.
	stream := AutoLineEndStream wrap: stream.
	stream nextPutAll: #[10 13 10 10 13 13] asString.
	stream contents = '\\\\\' withCRs

That’s all I can think of that is relevant to this topic. I believe the problem isn’t particularly difficult to deal with once you have a good understanding of what is going on and what should be happening. I hope this article will help with that. Thanks for reading this far and if you have any corrections, suggestions, ideas, I’ll be watching the comments eagerly.

– Posted by Martin Kobetic