Commit 2d9e195

Merge pull request #13 from konsumlamm/bytestring
Update Bytestrings section
2 parents 8b93e1b + 272d311 commit 2d9e195

1 file changed

+4
-4
lines changed

docs/input-and-output.html

+4-4
@@ -1095,14 +1095,14 @@ <h1>Input and Output</h1>
1095 1095
<p>However, processing files as strings has one drawback: it tends to be slow. As you know, <span class="fixed">String</span> is a type synonym for <span class="fixed">[Char]</span>. <span class="fixed">Char</span>s don't have a fixed size, because it takes several bytes to represent a character from, say, Unicode. Furthemore, lists are really lazy. If you have a list like <span class="fixed">[1,2,3,4]</span>, it will be evaluated only when completely necessary. So the whole list is sort of a promise of a list. Remember that <span class="fixed">[1,2,3,4]</span> is syntactic sugar for <span class="fixed">1:2:3:4:[]</span>. When the first element of the list is forcibly evaluated (say by printing it), the rest of the list <span class="fixed">2:3:4:[]</span> is still just a promise of a list, and so on. So you can think of lists as promises that the next element will be delivered once it really has to and along with it, the promise of the element after it. It doesn't take a big mental leap to conclude that processing a simple list of numbers as a series of promises might not be the most efficient thing in the world.</p>
1096 1096
<p>That overhead doesn't bother us so much most of the time, but it turns out to be a liability when reading big files and manipulating them. That's why Haskell has <em>bytestrings</em>. Bytestrings are sort of like lists, only each element is one byte (or 8 bits) in size. The way they handle laziness is also different. </p>
1097 1097
<p>Bytestrings come in two flavors: strict and lazy ones. Strict bytestrings reside in <span class="fixed">Data.ByteString</span> and they do away with the laziness completely. There are no promises involved; a strict bytestring represents a series of bytes in an array. You can't have things like infinite strict bytestrings. If you evaluate the first byte of a strict bytestring, you have to evaluate it whole. The upside is that there's less overhead because there are no thunks (the technical term for <i>promise</i>) involved. The downside is that they're likely to fill your memory up faster because they're read into memory at once.</p>
1098-
<p>The other variety of bytestrings resides in <span class="fixed">Data.ByteString.Lazy</span>. They're lazy, but not quite as lazy as lists. Like we said before, there are as many thunks in a list as there are elements. That's what makes them kind of slow for some purposes. Lazy bytestrings take a different approach &mdash; they are stored in chunks (not to be confused with thunks!), each chunk has a size of 64K. So if you evaluate a byte in a lazy bytestring (by printing it or something), the first 64K will be evaluated. After that, it's just a promise for the rest of the chunks. Lazy bytestrings are kind of like lists of strict bytestrings with a size of 64K. When you process a file with lazy bytestrings, it will be read chunk by chunk. This is cool because it won't cause the memory usage to skyrocket and the 64K probably fits neatly into your CPU's L2 cache.</p>
1099-
<p>If you look through the <a href="http://www.haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Lazy.html">documentation</a> for <span class="fixed">Data.ByteString.Lazy</span>, you'll see that it has a lot of functions that have the same names as the ones from <span class="fixed">Data.List</span>, only the type signatures have <span class="fixed">ByteString</span> instead of <span class="fixed">[a]</span> and <span class="fixed">Word8</span> instead of <span class="fixed">a</span> in them. The functions with the same names mostly act the same as the ones that work on lists. Because the names are the same, we're going to do a qualified import in a script and then load that script into GHCI to play with bytestrings. </p>
1098+
<p>The other variety of bytestrings resides in <span class="fixed">Data.ByteString.Lazy</span>. They're lazy, but not quite as lazy as lists. Like we said before, there are as many thunks in a list as there are elements. That's what makes them kind of slow for some purposes. Lazy bytestrings take a different approach &mdash; they are stored in chunks (not to be confused with thunks!), each chunk has a size of 32 KiB. So if you evaluate a byte in a lazy bytestring (by printing it or something), the first 32 KiB will be evaluated. After that, it's just a promise for the rest of the chunks. Lazy bytestrings are kind of like lists of strict bytestrings with a size of 32 KiB. When you process a file with lazy bytestrings, it will be read chunk by chunk. This is cool because it won't cause the memory usage to skyrocket and the 32 KiB probably fits neatly into your CPU's L2 cache.</p>
1099+
<p>If you look through the <a href="https://hackage.haskell.org/package/bytestring/docs/Data-ByteString-Lazy.html">documentation</a> for <span class="fixed">Data.ByteString.Lazy</span>, you'll see that it has a lot of functions that have the same names as the ones from <span class="fixed">Data.List</span>, only the type signatures have <span class="fixed">ByteString</span> instead of <span class="fixed">[a]</span> and <span class="fixed">Word8</span> instead of <span class="fixed">a</span> in them. The functions with the same names mostly act the same as the ones that work on lists. Because the names are the same, we're going to do a qualified import in a script and then load that script into GHCI to play with bytestrings. </p>
1100 1100
<pre name="code" class="haskell:hs">
1101 1101
import qualified Data.ByteString.Lazy as B
1102 1102
import qualified Data.ByteString as S
1103 1103
</pre>
1104 1104
<p><span class="fixed">B</span> has lazy bytestring types and functions, whereas <span class="fixed">S</span> has strict ones. We'll mostly be using the lazy version.</p>
1105-
<p>The function <span class="function label">pack</span> has the type signature <span class="fixed">pack :: [Word8] -&gt; ByteString</span>. What that means is that it takes a list of bytes of type <span class="fixed">Word8</span> and returns a <span class="fixed">ByteString</span>. You can think of it as taking a list, which is lazy, and making it less lazy, so that it's lazy only at 64K intervals.</p>
1105+
<p>The function <span class="function label">pack</span> has the type signature <span class="fixed">pack :: [Word8] -&gt; ByteString</span>. What that means is that it takes a list of bytes of type <span class="fixed">Word8</span> and returns a <span class="fixed">ByteString</span>. You can think of it as taking a list, which is lazy, and making it less lazy, so that it's lazy only at 32 KiB intervals.</p>
1106 1106
<p>What's the deal with that <span class="fixed">Word8</span> type? Well, it's like <span class="fixed">Int</span>, only that it has a much smaller range, namely 0-255. It represents an 8-bit number. And just like <span class="fixed">Int</span>, it's in the <span class="fixed">Num</span> typeclass. For instance, we know that the value <span class="fixed">5</span> is polymorphic in that it can act like any numeral type. Well, it can also take the type of <span class="fixed">Word8</span>.</p>
1107 1107
<pre name="code" class="haskell:hs">
1108 1108
ghci&gt; B.pack [99,97,110]
@@ -1131,7 +1131,7 @@ <h1>Input and Output</h1>
1131 1131
ghci&gt; foldr B.cons' B.empty [50..60]
1132 1132
Chunk "23456789:;&lt;" Empty
1133 1133
</pre>
1134-
<p>As you can see <span class="label function">empty</span> makes an empty bytestring. See the difference between <span class="fixed">cons</span> and <span class="fixed">cons'</span>? With the <span class="fixed">foldr</span>, we started with an empty bytestring and then went over the list of numbers from the right, adding each number to the beginning of the bytestring. When we used <span class="fixed">cons</span>, we ended up with one chunk for every byte, which kind of defeats the purpose.</p>
1134+
<p>As you can see, <span class="label function">empty</span> makes an empty bytestring. See the difference between <span class="fixed">cons</span> and <span class="fixed">cons'</span>? With the <span class="fixed">foldr</span>, we started with an empty bytestring and then went over the list of numbers from the right, adding each number to the beginning of the bytestring. When we used <span class="fixed">cons</span>, we ended up with one chunk for every byte, which kind of defeats the purpose.</p>
1135 1135
<p>Otherwise, the bytestring modules have a load of functions that are analogous to those in <span class="fixed">Data.List</span>, including, but not limited to, <span class="fixed">head</span>, <span class="fixed">tail</span>, <span class="fixed">init</span>, <span class="fixed">null</span>, <span class="fixed">length</span>, <span class="fixed">map</span>, <span class="fixed">reverse</span>, <span class="fixed">foldl</span>, <span class="fixed">foldr</span>, <span class="fixed">concat</span>, <span class="fixed">takeWhile</span>, <span class="fixed">filter</span>, etc. </p>
1136 1136
<p>It also has functions that have the same name and behave the same as some functions found in <span class="fixed">System.IO</span>, only <span class="fixed">String</span>s are replaced with <span class="fixed">ByteString</span>s. For instance, the <span class="fixed">readFile</span> function in <span class="fixed">System.IO</span> has a type of <span class="fixed">readFile :: FilePath -&gt; IO String</span>, while the <span class="label function">readFile</span> from the bytestring modules has a type of <span class="fixed">readFile :: FilePath -&gt; IO ByteString</span>. Watch out, if you're using strict bytestrings and you attempt to read a file, it will read it into memory at once! With lazy bytestrings, it will read it into neat chunks.</p>
1137 1137
<p>Let's make a simple program that takes two filenames as command-line arguments and copies the first file into the second file. Note that <span class="fixed">System.Directory</span> already has a function called <span class="fixed">copyFile</span>, but we're going to implement our own file copying function and program anyway.</p>
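
Editor's note, not part of the diff: the last context line describes a program that copies one file to another using bytestrings. A minimal sketch of how such a program might look follows. The helper name `copyLazy` and the usage message are hypothetical and do not come from the diffed page; only `B.readFile`, `B.writeFile`, and `getArgs` are real library functions.

```haskell
import qualified Data.ByteString.Lazy as B
import System.Environment (getArgs)

-- Hypothetical helper (name is an assumption, not from the diffed page):
-- copy a file by reading it as a lazy bytestring and writing it back out.
copyLazy :: FilePath -> FilePath -> IO ()
copyLazy source dest = do
  contents <- B.readFile source   -- lazy read: chunks are fetched on demand
  B.writeFile dest contents       -- writing consumes the chunks in order

main :: IO ()
main = do
  args <- getArgs
  case args of
    [source, dest] -> copyLazy source dest
    _              -> putStrLn "usage: copy SOURCE DEST"
```

Because the lazy variant is used, the source file is processed chunk by chunk rather than loaded whole. Swapping the import for `Data.ByteString` would read the entire file into memory at once, as the paragraph above the diff's final context line warns.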
