Will you surf the web ... FOR SCIENCE?

My name is Peter A. H. Peterson, and I am a doctoral candidate in Computer Science at the University of California, Los Angeles.
I'm doing research into improving the efficiency of data transfer methods, which could ultimately improve the performance and battery life of computers. I need some samples of "real-world" web data collected by people other than myself, so that I can evaluate how my prototype would perform if it were running on your computer.
Do you have a Mac or a Linux computer with the Firefox browser (or would you be willing to install it)? Would you be willing to surf the web for about an hour, and let me have the data your browser downloaded? I don't want any personal information, so I don't want you to go shopping, look at Facebook, access your bank, or do anything you think is private. You won't send me any data until you're finished, so you can change your mind right up until the end.
All I care about is how well my prototype processes your data. What I will do is run your data through some experimental code on my private network, which tries to find more efficient ways to transmit that information. You won't be identifiable in any way in my results, and I promise not to use this data for any other purpose or future projects -- I'll delete it when I'm finished.
Are you willing to help? If so, read on.
If you are concerned about any aspect of this, please read the FAQ below.

Overview

To capture the data in a format useful to me, you'll need to download and install a program I wrote and make a few temporary changes to your browser. I'll walk you through each simple step. Then, you'll just surf the web for about an hour doing non-private stuff. Afterwards, you'll follow my instructions to reverse the changes we made, and you'll send me the file of data my program created. Finally, you can delete the program you downloaded to capture the data.

Requirements

  1. A computer running OS X or Linux, and the tool tcpdump. Tcpdump is included in OS X and is widely available for Linux.
  2. The Firefox web browser (which is freely available online).
  3. This script: http://tastytronic.net/~pedro/for_science/for_science.sh
  4. An hour or two to kill surfing the web.
  5. A willingness to send me your unencrypted data.
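
If you aren't sure whether tcpdump is already installed, a quick check in the terminal can tell you. This is only a sketch: the helper name have_tool is made up for illustration and is not part of for_science.sh.

```shell
# Report whether a command-line tool can be found on the PATH.
# (have_tool is a hypothetical helper, not part of for_science.sh.)
have_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

have_tool tcpdump
```

On some Linux systems tcpdump may be installed in a directory such as /usr/sbin that is not on a regular user's PATH, so "missing" here does not always mean it is absent.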

Instructions

1. Download the capture tool and make it executable.

Download the capture tool...

I created a tool to make it easy for you to capture the data. Download it here: http://tastytronic.net/~pedro/for_science/for_science.sh
Save it to your desktop with the name for_science.sh -- do not append ".txt" to the file if given the choice.
If you click on the link, you will probably be presented with the script's source code, which you can save with File | Save As in your browser. Otherwise, right-click or control-click and choose Save Link As... and save the file to your desktop.

... and make it executable

Then, open a terminal. On OS X, you should be able to do this by searching for "Terminal" in Spotlight search at the upper-right corner of your screen (the magnifying glass) and clicking on it. When the terminal is open, it should look something like this:
Fig. 1.1: A typical terminal window.
In the terminal, type the following case-sensitive commands:
cd Desktop
chmod u+x for_science.sh
Before you run the script, you need to change some settings in Firefox, described next.
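If you'd like to confirm that the chmod command worked, you can test the file's executable bit. The sketch below assumes the script was saved to your Desktop as described above; check_executable is a name invented here for illustration.

```shell
# Say whether a file exists and is executable by the current user.
# (check_executable is a hypothetical helper, not part of for_science.sh.)
check_executable() {
    if [ -x "$1" ]; then
        echo "ready: $1"
    elif [ -e "$1" ]; then
        echo "not executable: $1"
    else
        echo "not found: $1"
    fi
}

check_executable "$HOME/Desktop/for_science.sh"
```

If it reports "not executable", repeat the chmod u+x step; if it reports "not found", the file was probably saved somewhere else or under a different name (for example with ".txt" appended).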

2. Disable encryption and compression in Firefox

I need you to disable encryption and compression so that I can analyze the data. In part 6 (below), I'll show you how to re-enable these facilities. (If you have questions about why I need you to do this, see the FAQ.)
We'll make all the changes using the Firefox configuration mechanism called about:config. I've broken it down into a number of extremely simple steps.
Step 0: Close all Firefox windows and reopen Firefox.
Step 1: Type about:config in the Location Bar -- where you would normally type in a website's address. Press enter/return. Firefox will dutifully display a warning that you could break your browser, like this:
Fig. 2.1: Firefox will warn you about using the config tool, but don't worry -- what we're doing is safe.
Step 2: In the Search window, enter enable_ssl3. This will limit the options to those including the string enable_ssl3.
Fig. 2.2: SSL is currently enabled.
Step 4: Double-click on security.enable_ssl3 and the Value field will change from true to false:
Fig. 2.3: SSL is now disabled.
Step 5: Now, we'll disable TLS in the same way. In the Search window, enter enable_tls. You will see something like this:
Fig. 2.4: TLS is currently enabled.
Step 6: Double-click on security.enable_tls and the Value field will change from true to false:
Fig. 2.5: TLS is now disabled.
Step 7: Next, enter accept-encoding into the search window, and you'll see something like this:
Fig. 2.6: Gzip compression is currently enabled.
Please note: if your entry doesn't read "gzip,deflate" as shown above, please contact me before going any further.
Step 8: Double-click on the entry for network.http.accept-encoding and a dialog box will appear like this:
Fig. 2.7: This setting takes a string of text rather than a true/false value.
Step 9: Erase the text and click OK. Now the Value field will be empty, like this:
Fig. 2.8: The accept-encoding value has been deleted.
Almost done! We just need to enable Private Browsing mode to help protect your privacy.

3. Enable Private Browsing

Firefox includes a feature called Private Browsing that treats the current session as a new, temporary "blank slate." This will make sure that I don't accidentally get any personal information from your browser (like cached information or cookies).
To enable Private Browsing: In Firefox, click the Tools menu and then click on Start Private Browsing.
Fig. 3.1: Firefox's "Private Browsing" mode will keep your information cached in your browser out of the captured data.
Private browsing ends when you close the browser, so you won't need to do anything special to turn it off when you are finished.

4. Start the capture program!

In your terminal, run the command ./for_science.sh.
(If you closed the terminal from before, reopen it, and type cd Desktop and press enter.)
When you run for_science.sh (and again when the script stops), it might ask for your password -- don't worry, it will not be recorded by the script. It is only needed to start and stop the capturing software (which requires privileges). If you are concerned about this, you can inspect the source code of for_science.sh, which will show that sudo is only used to execute and stop tcpdump (the capture software).
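One way to do that inspection quickly is to list every line of the script that mentions sudo. This is a minimal sketch, assuming the script is on your Desktop; show_sudo_lines is a hypothetical helper name.

```shell
# Print every line of a script that mentions sudo, prefixed with its line number.
# (show_sudo_lines is a hypothetical helper, not part of for_science.sh.)
show_sudo_lines() {
    grep -n 'sudo' "$1" || echo "no sudo lines found in $1"
}

script="$HOME/Desktop/for_science.sh"
if [ -f "$script" ]; then
    show_sudo_lines "$script"
fi
```

Each line printed should be one you can read and account for; reading the whole script in a text editor is still the more thorough check.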
When you start the script, you'll see something like this:
Fig. 4.1: Your password is necessary to record network data; it will not be included in the capture file.
Once you enter your password and things are up and running, you'll see a screen like this:
Fig. 4.2: Did you enable Private Browsing and disable SSL, TLS, and compression? If so, surf's up!
Ok... the program is running. Now what?

5. The Fun Part

You've now reached the fun part! Just go surf the web using Firefox. Don't surf the web with other browsers while the capture script is running. If you need to do something personal or sensitive, you can stop the capture script at any time by pressing ^C (control-c).
The program will quit automatically after capturing 25 megabytes or after 8 hours have passed, whichever comes first (this is a precaution so that it doesn't run forever). Depending on how long it takes to capture the data, the script may request your password again. This is simply to stop the capture tool (again, your password is not saved or even seen by the script). If you need to quit sooner (for example, if you need to do something sensitive online), hit control-C -- and then make sure that you restart Firefox and re-enable compression and encryption as described in Section 6 below.
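The stopping rule can be pictured as a simple check on two numbers. The sketch below is not taken from for_science.sh; it assumes "25 megabytes" means 25,000,000 bytes, and the names MAX_BYTES, MAX_SECONDS, and capture_status are invented for illustration.

```shell
# Illustration only: the real script may implement its limits differently.
MAX_BYTES=$((25 * 1000 * 1000))   # assumption: 25 megabytes = 25,000,000 bytes
MAX_SECONDS=$((8 * 60 * 60))      # 8 hours, in seconds

# Given bytes captured so far and seconds elapsed, say whether to keep going.
capture_status() {
    bytes="$1"
    elapsed="$2"
    if [ "$bytes" -ge "$MAX_BYTES" ] || [ "$elapsed" -ge "$MAX_SECONDS" ]; then
        echo "stop"
    else
        echo "continue"
    fi
}

capture_status 1048576 600      # about 1 MB after 10 minutes
capture_status 25000000 600     # size limit reached
capture_status 1048576 28800    # time limit reached
```

Either limit alone is enough to end the capture, which is why a slow browsing session still ends after 8 hours.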
Problem accessing your favorite websites?
If you can't access your favorite sites, either the site requires encryption, or your browser may be attempting to make an encrypted connection for you. (Often, the browser will auto-complete an address that requires encryption even though the site can be accessed without it.) Double-check the Location Bar to make sure that the address starts with "http://" and not "https://" -- if the "s" is there, just delete it and reload the page.
If entering the address manually starting with "http://" doesn't work, just browse a different site.
If every site has this problem, try closing Firefox, reopening it, and re-enabling Private Browsing (Tools | Start Private Browsing).
If that still doesn't work, please email me at pedro@tastytronic.net.
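The address fix described above amounts to swapping the "https://" prefix for "http://". As a small sketch (to_plain_http is a made-up name, and example.com is a placeholder address):

```shell
# Rewrite an https:// address as http://; leave any other address unchanged.
# (to_plain_http is a hypothetical helper used only for illustration.)
to_plain_http() {
    case "$1" in
        https://*) echo "http://${1#https://}" ;;
        *)         echo "$1" ;;
    esac
}

to_plain_http "https://example.com/news"
to_plain_http "http://example.com/news"
```

Whether the rewritten address actually loads depends on the site; some sites redirect straight back to "https://", in which case just browse somewhere else.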
Once the script quits, you'll see a message like this:
Fig. 5.1: Your data has been saved! Send the data to pedro and remember to re-enable SSL, TLS, and compression.
At this point, there will be a file on your Desktop named (in this example) FOR_SCIENCE-1360426541.tar.gz.
Email me at pedro@tastytronic.net and I'll provide you with instructions for sending it to me.
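Before sending anything, you may want to see what the archive contains. Since the file name ends in .tar.gz, the standard tar tool should be able to list its contents without unpacking it. This is a sketch; list_capture is a hypothetical helper name, and the loop assumes the archive is on your Desktop.

```shell
# List the files inside a gzip-compressed tar archive without extracting them.
# (list_capture is a hypothetical helper, not part of for_science.sh.)
list_capture() {
    tar -tzf "$1"
}

# Look for capture archives on the Desktop and list each one found.
for archive in "$HOME"/Desktop/FOR_SCIENCE-*.tar.gz; do
    if [ -f "$archive" ]; then
        echo "Contents of $archive:"
        list_capture "$archive"
    fi
done
```

If nothing is printed, no archive matching that name was found on your Desktop.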
If you want to capture more data, just go back to Step 4 and re-run ./for_science.sh in the terminal.
If you're done helping science for now, make sure that you re-enable compression and encryption, described next.

6. Re-enabling Encryption and Compression

Re-enabling encryption and compression is almost exactly the same as disabling them, only in reverse. It's very important that you remember to do this. In fact, if you volunteer, I will personally remind you to re-enable these features after you're done.
Here are the steps:
Step 1: Just like before, type about:config in the Location Bar -- where you would normally type in a website's address. Press enter/return. Firefox may warn you that this could be dangerous, but we (still) know what we're doing.
Step 2: In the Search window, enter enable_ssl3. This will limit the options to those including the string enable_ssl3. If you disabled encryption earlier, the Value field should say false.
Fig. 6.1: SSL is disabled.
Step 4: Double-click on security.enable_ssl3 and the word false should change to true:
Fig. 6.2: SSL encryption has been re-enabled!
Step 5: Now, we'll enable TLS in the same way. In the Search window, enter enable_tls. The enable_tls field should read false if you disabled it previously.
Fig. 6.3: TLS is disabled.
Step 6: Double-click on security.enable_tls and the Value field will change from false to true:
Fig. 6.4: TLS encryption has been re-enabled!
Step 7: Next, enter accept-encoding into the search window, and if you disabled compression previously, the Value field will be blank:
Fig. 6.5: Compression is disabled.
Step 8: Instead of double-clicking on the line, right-click (or control-click on a Mac) and a popup-menu will appear. Click the option Reset:
Fig. 6.6: Right-clicking (or control-clicking) on the line gives you an option to Reset the property.
Once you click Reset, the original options should appear like so:
Fig. 6.7: Compression has been re-enabled!
Step 9: Make sure you close your browser windows so that Private Browsing mode is cancelled. When you restart Firefox, everything will be back to normal.
Step 10: If you're done capturing data permanently, you can delete the program for_science.sh from your Desktop using the mouse, or by typing
rm ~/Desktop/for_science.sh
... in the terminal window.
If you want to send me more data, just repeat everything from Step 2 ("Disable encryption and compression in Firefox") onward.
That's it, and thanks again! You've made a significant contribution towards my graduating, and perhaps to improving computer efficiency in the long run!

Frequently Asked Questions

Q: Why do you need my web data?

A: For science! No really, I need to show that my research applies to the real world (not just to me).

I'm working on a research project involving the compressibility of data in the real world. It's pretty neat, and has the potential to someday improve the efficiency of your computers and smartphones so that they will run faster and have longer battery life. I have collected my own web data, and when I test my tool on that data, it works (yay!). However, "it worked for me" is not a scientific claim. It's possible that I don't surf the web the same way as most people, and so there's something about what I view that is artificially easy for my research. In order to make the argument that my tool does work on real world data, and could someday help the general public, I need volunteers to send me copies of their data.

Q: Why do I have to disable encryption and compression?

A: Because encryption and compression make it hard to analyze the data properly.

Encryption essentially transforms data so that it appears to be random -- without any patterns. Unfortunately, patterns are what compressors find and use to save space, so random and random-looking data don't compress very well. For the same reason, once data is compressed, the patterns and structure that made it compressible are removed. As a result, it typically can't be compressed further. For these reasons, I need the data to be unencrypted and uncompressed for my tool to work properly.
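You can see this effect yourself with gzip. The sketch below compresses 112,000 bytes of highly repetitive text and 112,000 bytes of random data. The random bytes from /dev/urandom are only a stand-in for encrypted traffic (an assumption made for illustration), and gzip_size is a made-up helper name.

```shell
# Size in bytes of whatever arrives on standard input, after gzip compression.
# (gzip_size is a hypothetical helper used only for this demonstration.)
gzip_size() {
    gzip -c | wc -c | tr -d ' '
}

# 4000 copies of the same 28-byte line: 112,000 bytes full of patterns.
patterned=$(awk 'BEGIN { for (i = 0; i < 4000; i++) print "the same line over and over" }' | gzip_size)

# 112,000 random bytes, standing in for encrypted data.
random=$(head -c 112000 /dev/urandom | gzip_size)

echo "patterned text: 112000 bytes -> $patterned bytes after gzip"
echo "random data:    112000 bytes -> $random bytes after gzip"
```

The repetitive text should shrink to a small fraction of its original size, while the random data should come out no smaller (and typically slightly larger) than it went in.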

Q: Isn't it a privacy risk to send you my data?

A: Yes, but it's not that bad.

It is a slight risk, but I wouldn't ask if I thought it was significant. First, it is temporary, and you will be in control at all times. The for_science.sh script will only capture data when you run it. It will only capture traffic on port 80 (web data only), it will stop automatically, and you will disable and enable encryption and compression yourself. I won't get the data until you actually send it to me yourself, so you can change your mind at any time. If you send me data, and then change your mind, I promise that I will delete it without using it.
Second, I don't need or want you to do anything sensitive, such as on-line banking, email, Facebook, buying things, or researching that mole on your arm. Many "sensitive" sites will not let you access them without encryption in any case. The truth is that much of what you do day-to-day on the web isn't encrypted anyway, so -- assuming that you trust me -- sending it to me is not significantly worse for your privacy.
That said, if you don't feel comfortable with it for any reason, or you are in a country where your privacy is especially threatened, please don't do this. And in any case, if you have questions about the privacy ramifications of contributing, please contact me at pedro@tastytronic.net.

Q: Will you capture anything other than web traffic?

A: No.

The capture tool is configured to capture only traffic on port 80, the standard port for unencrypted web (HTTP) traffic. This means that I will not see any traffic generated by your use of a network printer, file server, email client, or any other application that uses a different port.
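For the curious, a port-80-only capture with tcpdump generally looks something like the command printed below. This is a sketch, not the exact command in for_science.sh (read the script to see that). The helper capture_command and the file name web_capture.pcap are invented here, and the function only prints the command rather than running it.

```shell
# Build (but do not run) a tcpdump command line that records only TCP port 80.
#   -s 0   capture whole packets rather than truncating them
#   -w     write the raw packets to the named file
# (capture_command is a hypothetical helper; the real script may differ.)
capture_command() {
    printf "sudo tcpdump -s 0 -w %s 'tcp port 80'\n" "$1"
}

capture_command web_capture.pcap
```

The quoted part, 'tcp port 80', is the capture filter: packets to or from any other port never reach the output file.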

Q: What are you going to do with my data?

A: I'm going to use your data as input in various compression-related operations in a laboratory environment.

My research involves finding better ways to compress data to improve efficiency. So, I'm going to take your data and run it through a compression tool, testing how long it takes to compress, how much smaller the compressors make it, and how much time, space, and energy I could save by using my tool instead of Gzip -- the compressor that is built into the web standard.

Q: Will you, or anyone else, look at the data that I send you?

A: Almost certainly not.

I will be running your data through my research tool and collecting statistics, not opening the individual files in the web trace. In 25 megabytes of web data, there will be somewhere around 1,000 individual files, mostly little bits of text or images of buttons and so on, so there's not much reason to "look at it."
The only time I would conceivably do anything other than run the data through the tool is if there were something strange about your data that the tool couldn't handle properly. In that case, it would be important for me to try to determine why the software failed. I would first use a utility that would tell me what kind of data it was, like "a JPEG image" or "ASCII text" or "JavaScript" or something like that, which should be enough for me to identify what was happening. In only extremely rare cases would there be any reason for me to actually open any single file in a viewer or editor.
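Utilities of this kind work largely by checking the first few bytes of a file (its "magic number") rather than displaying its contents. The sketch below illustrates the idea for a handful of common formats; it is not the utility I would actually use, and sniff_type is a made-up name.

```shell
# Guess a file's type from its first four bytes, without showing its contents.
# (sniff_type is a hypothetical helper; real utilities know far more formats.)
sniff_type() {
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    case "$magic" in
        ffd8ff*)  echo "JPEG image" ;;
        89504e47) echo "PNG image" ;;
        47494638) echo "GIF image" ;;
        1f8b*)    echo "gzip compressed data" ;;
        *)        echo "unknown (possibly text)" ;;
    esac
}
```

For example, running sniff_type on a photo saved from the web would normally report "JPEG image" without ever drawing the picture on screen.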

Q: What should I do while I'm surfing?

A: Just be yourself!

The important thing is that the things that you browse are typical of your normal activity (not including personally sensitive websites).
So go visit all your favorite websites that don't include Facebook, e-commerce, banking, or anything medical or otherwise especially personal. It's OK if you have to think of something to go view -- but you shouldn't just go download something big specifically to make the test end more quickly... unless that's how you enjoy browsing the web.

Q: Will I be identified in your research? Could I be identified by someone reading your papers?

A: No.

My research will describe general statistics about user data and compression performance, and will not discuss particulars about any one person's data. So, for example, I might say things like "I received 250 megabytes of data traces from 25 volunteers," or "30% of all web traffic seen consisted of JPEG images" or "Data identified as 'hard-to-compress' could not be compressed more than 10%". I would never identify individual users, individual data contents, addresses, websites visited, or anything personally identifiable.

Q: How long will you keep the data?

A: Probably about a year.

I will finish the research for my degree, and then write some papers based on my research. I'll need to keep the data on my personal computers during that time, in case I need to run more experiments. When I'm not using it, I'll keep it encrypted, and when I am finished with the current round of work, I will delete the data rather than leaving it lying around forever.

Any other questions? Send them to pedro@tastytronic.net.