6.033 2010 Design Project 1 : Time-Travel Filesystem

Version History

Feb 9, 2010 -- Initial release
Feb 16, 2010 (This version) -- Added link to online submission site

I. Due Dates and Deliverables

There are three deliverables for Design Project 1:

A design proposal not exceeding 800 words, due on February 23th, 2010 (before recitation).
A design report not exceeding 2,500 words, due on March 18th, 2010 (before recitation).
An (optional) revision of a portion of your design report to improve your grade; you must set up a meeting with your writing instructor within one week of receiving your graded DP1 to take advantage of this.

All deliverables should be submitted via the online submission site.

As with real life system designs, 6.033 design projects are under-specified, and it is your job to complete the specification in a sensible way given the overall requirements of the project. As with designs in practice, the specifications often need some adjustment as the design is fleshed out. We recommend that you start early so that you can evolve your design over time. A good design is likely to take more than just a few days to put together.

II. The Problem

Your task is to design a filesystem with support for time travel. Time travel means that the file system allows the user to view the contents of a file or directory as of some time in the past. Time travel is useful in many different applications: for example, it provides a form of backup, allowing users to recover from accidental deletions; it also allows users to compare different versions of a file over time, for example, as a series of edits are made to a document.

Your design must provide time travel support in Unix, without requiring changes to the Unix File System API. In other words, it should be possible to use applications compiled for Unix in your modified Unix with time-travel support. You will need a way for applications to access historical versions of files. We suggest that you create a special directory at the root of the file system that allows you to refer to historical versions of files, as illustrated in this example:

% pwd
/home/madden/files
% ls
file.txt        file2.txt
% cat file.txt
hello
% date
Tue Feb  9 09:15:50 EST 2010
% echo goodbye > file.txt
% echo file3 > file3.txt
% rm file2.txt
% ls
file.txt        file3.txt
% cd /time-2010-02-09-09:00:00.000/home/madden/files
% ls
file.txt        file2.txt
% cat file.txt
hello
% diff /home/madden/files/file.txt /time-2010-02-09-09:00:00.000/home/madden/files
1c1
< goodbye
---
> hello

Your design will primarily consist of modifications to the implementations of every file system operation (read, write, open, rename, link, unlink, symlink, mkdir, and possibly chdir and stat) -- see Section 2.5 (page 91) of the textbook for a detailed description of the Unix File System API. Your implementation must make it possible to travel to any point of time and see the state of a file or directory as it existed at the specified time, which is why calls like read() and stat() must be modified.

You may find that it's hard to reconstruct the contents of a file as of a time when some application was writing that file. It's acceptable, if a user travels to some time T and reads file F, to produce the contents of F as of the close() call that occurred most recently before T. The user should see any files that existed at time T, and not see any files that did not exist at time T.

III. Requirements

Your design must meet the follow requirements:

It must preserve compatibility with applications programmed to the Unix file system API, with the exception of the special root directory that contains historical files, as in the example above (see the next requirement.) If necessary, you may add calls to the API as long as you maintain backwards compatibility.
It should return an error to applications that try to perform operations that update a historical version of a directory or file. Also, it should not allow the user to create a file or directory that could be interpreted as a historical directory. Finally, historical directories should be "virtual", in the sense that they do not appear when examining the contents of the inode representing the root directory or when performing a command like "ls -al /", and are only accessible through a direct chdir() ("cd") operation to a historical pathname, as shown in the example above.
It must allow observation of the state of any file or directory as it existed at the time the user has traveled to.
It must provide reasonable performance in terms of number of bytes or blocks read or written from disk. In particular, its performance shouldn't be much worse than the standard Unix file system when accessing either historical or current files.

Here are a few additional assumptions you can make to simplify your design:

You can assume that the operating system provides a method time() that returns the current system a floating point number with millisecond precision that specifies the number of seconds since January 1, 1970.
You do not need to worry about the case that the disk runs out of space.
You do not need to worry about concurrent modifications from different users -- e.g., you can assume that each operation happens at a distinct millisecond, and that only one user edits a file at a time.
You do not need to worry about what happens in the event of a crash in the middle of an operation (i.e., how to recover your data structures from various types of failures.)
You can assume that you have about 1 GB of RAM available to buffer write operations in memory.

IV. Design Proposal

The design proposal should be a concise summary (800 words) of your overall system design.

The format of your proposal should be roughly as follows: it should begin with an overview, describing the overall approach. It should then summarize your proposed design and explain how it solves the problem and meets the requirements in this document.

The core of the proposal should be the design description. This should explain the basic on-disk data structures used to represent your time travel file system and a sketch of how read and write operations will work. You may make use of bullets and tables, but must introduce them with at least 1-2 sentences of context. This section must include at least one graphic, correctly formatted with a caption and brief description.

The proposal should end with a summary of how the design meets the requirements (1-2 sentences). It should also include problems which remain to be resolved.

You do not have to present a detailed rationale or analysis in your proposal. However, if any of your design decisions are unusual (particularly creative, experimental, or risky) or if you deviate from the requirements, you should explain and justify those decisions in your proposal.

You will receive feedback on your proposal from your TA in time to adjust your final report.

V. Design Report

Your report should explain your design. It should discuss the major design decisions and tradeoffs you made, and justify your choices. It should discuss any limitations of which you are aware. You should assume that your report is being read by someone who has read this assignment and is familiar with the Unix file system and API, but has not thought carefully about this particular design problem. Give enough detail that your project can be turned over successfully to an implementation team. Your report should convince the reader that your design satisfies the requirements in Section III. Make sure your report includes the following:

An explanation of the behavior visible to users.
A description of how files and directories are stored on disk, and how changes to files and directories are recorded. You can assume that your reader is familiar with the Unix File System on-disk representation.
A description of how a historical version of a file or directory is located and how its bytes are retrieved from disk.
A specific discussion of how open, read, write, unlink, and chdir operations are implemented. For the other file system operations you do not need to provide a detailed discussion of them but you should briefly mention how their implementation would change relative to the standard Unix file system in your design.
An analysis of the number of reads and writes and space overhead required for each of the workloads given in Section V.A. below.
A (brief) discussion of what happens when a user tries to modify a historical file, and how errors are reported.

V.A. Workloads

Specifically consider how your system would perform on the following three workloads:

The user creates a website consisting of many small HTML files (<10 KB) at time T1, updates each of the files several times, and then travels back to time T1 and rereads each of the files in their entirety. For this workload compare the number of I/O operations (reads or writes of blocks from disk) required to perform these historical reads to the number of I/O that would be required if these were not historical reads. Also estimate the space required to store these historical versions, relative to the size of the original collection of files.
The user downloads a large movie (~10 GB) from his or her camcorder at time T2, and then performs edits to several small scenes in the movie, comprising about 20% of the total disk blocks occupied by the file (the movie size stays about the same after these edits.) Compare the number of I/O operations required to perform these writes to the number that would be required in the standard Unix file system (as specified in Section 2.5 of the course textbook.) Also estimate the space required to store the historical versions of the movie, relatively to size of the original movie.
Now suppose the user travels to time T2 and plays back the movie. How many I/O operations are required by your system? How many I/O operations would be required to play it back at the current time? How does that compare to the number of operations that would be required for playback of the current version in the standard Unix file system?

Assume that the cost of a sequential I/O (where the disk head is already positioned over the sector being read) is approximately the same as a random I/O (where the disk must seek to the sector being read.) Though mechanical hard drives don't actually behave this way, newer stable-storage devices, such solid-state disks based on Flash memory, do. It may not be possible to design a system with good performance in all possible scenarios: you may need to make tradeoffs, supporting some kinds of use well, and others not so well. Your should explicitly mention any such tradeoffs in your workload analysis.

V.B. Report organization

Use this organization for your report:

Title page: Give your report a title that reflects the subject and scope of your project. Include your name, email address, recitation instructor, section time(s), and the date on the title page.
No table of contents is needed
Introduction: Summarize what your design is intended to achieve, outline the design, explain the major trade-offs and design decisions you have made, and justify those trade-offs and decisions.
Design: Explain your design. Identify your design's main components, state, and algorithms for implementing the Unix file system operations. You should sub-divide the design, with corresponding subsections in the text, so that the reader can focus on and understand one piece at a time. Explain why your design makes sense as well as explaining how it works. Use diagrams, pseudo-code, and worked examples as appropriate.
Analysis: Explain how you expect your design to behave in different scenarios. What scenarios might pose problems for throughput, latency, or even correctness? What do you expect to be the scalability limits of your design?
Conclusion: Briefly summarize your design and provide recommendations for further actions and a list of any problems that must be resolved before the design can be implemented.
Acknowledgments and references: Give credit to individuals whom you consulted in developing your design. Provide a list of references at the end using the IEEE citation-sequence system ("IEEE style") described in the Mayfield Handbook.
Word count. Please indicate the word count of your report at the end of the document. Captions of figures should be included in the total word count.
Footnotes. Please do not use footnotes in your report.

Here are a few tips:

Use ideas and terms from the textbook and papers when appropriate; this will save you space (you can refer the reader to the relevant section of the textbook) and will save the reader some effort.
Before you explain the solution to any given problem, say what the problem is.
Before presenting the details of any given design component, ensure that the purpose and requirements of that component are well described.
It's often valuable to illustrate an idea using an example, but an example is no substitute for a full explanation of the idea.
You may want to separate the explanation of a component's data structures from its algorithms to access or use those data structures.
Explain all figures, tables, and pseudo-code; explain what is being presented, and what conclusions the reader should draw.

The format and appearance of your report should conform to the DP1 style guide.

VI. Revising Your Report

You may revise a portion of your design project if you received a grade of 90 or less on the original report (grades are out of 100). The maximum grade you can receive through revision is 90. Before you submit your revision, you must set up an appointment with your writing instructor to agree on a plan for your revision within one week of receiving your grade on DP1. You writing instructor will work out a time when your revision is due, typically one week after your meeting.

VII. How we evaluate your work

Your recitation and writing instructors will assign your report a grade that reflects both the design itself and how well your report presents the design. These are the main grading criteria.

The most important aspect of your design is that we can understand how it works and that you have clearly addressed the requirements and provided the elements listed in Sections III and IV. Complicated designs that we cannot understand will not be graded favorably. Some overall content considerations:

Does your solution address the stated problem?
How complex is your solution? Simple is better, yet sometimes simple will not do the job. On the other hand, unnecessary complexity is bad.
Is your analysis correct and clear?
Are your assumptions and decisions reasonable?

Some writing considerations:

Is the report well-organized? Does it follow standard organizational conventions for technical reports? Are the grammar and language sound?
Do you use diagrams and/or figures appropriately? Are diagrams or figures appropriately labeled, referenced, and discussed in the text?
Does the report use the concepts, models, and terminology introduced in 6.033? If not, does it have a good reason for using different vocabulary?
Does the report address the intended audience?
Are references cited and used correctly?

VIII. Collaboration

This project is an individual effort. You are welcome to discuss the problem and ideas for solutions with your friends, but if you include any of their ideas in your solution you should explicitly give them credit. You must be the sole author of your report.