This is the second half of my submission. This is the article that describes: ¥ The name and version of existing carving programs that were used. ¥ A description of algorithms you developed for the challenge. ¥ Working source code of tools you developed for the challenge. ¥ How the number of false positives was reduced. Working source code is attached. Semantic Carving with S2 Semantic carving represents an improvement over traditional carving. Semantic carving is a two-step process in which trial sequence of blocks are generated and then individually tested with a validator. Sequences of blocks that validate are stored as a completd file. Sequences that do not validate are discarded. Combinatorial explosion limits the general usefulness of semantic carving. In theory, a 50MB file system with 100,000 sectors has 100,000! different sequencies of sectors to evaluate. S2 is a semantic carver written in C++ that uses a number of strategies to reduce the number of sequences that need to be tested.Ê The following strategies are employed: * S2 keeps a map of the disk image. Sectors that are assigned to Ê allocated files or that have been carved out of the file system are Ê marked as "carved" are not considered for future carving. * S2 supports pluggable carvers. The following carvers are Ê implemented: Ê Sequential Carver - Look for a start block, extend block-by-block Ê until either a validating sequence of sectors is found, or else Ê there is some reason that the run cannot be extended. Ê Split carve - Try all possible sequences of sectors that start at Ê R1, continue to the possible end of R1, appended by all possible Ê sequences of sectors starting at R2 and ending at R2_END. Ê Split Carve Search - Given a START sector, repeatedly call Split Ê Carve with all possible R2 (where R2_END is the maximum length of Ê the run) Ê Split carve hole - Given START, END, and LENGTH OF THE FILE, cut Ê holes between START and END equal to the number of sectors that must Ê be removed, and try to validate each resulting run. * S2 supports pluggable validators. Each validator can validate Ê information for a particular file type. The validators consist of code Ê that is used to validate a region of blocks and to mark various Ê blocks in the sector map as possibly belonging to a file. Each Ê validator must present the following information to S2: ÊÊ - A validator "name" ÊÊ - A base name for files that a carver would create ÊÊ - An extension for files that the carver creates ÊÊ - A code that will be placed in the sector map for any sector Ê ÊÊ identified as being a "start" of a file. ÊÊ - A code that would be placed in the sector map for any sector that Ê ÊÊ could be identified as being the end of a file. ÊÊ - An optional binary sequence that is commonly used at the start of files Ê ÊÊ recognized by this validator. ÊÊ - An optional binary sequence that is commonly used at the end of a Ê ÊÊ file recognized by this validator. ÊÊ - A C++ function that can read a string of bytes and return either Ê ÊÊ C_OK (it validates), C_ERR (it doesn't validate), or C_EOF (it Ê ÊÊ starts to validate, but then we ran out of data) ÊÊ - A Unix command-line utility that can also be run to validate Ê ÊÊ whether or not a file is a valid file for a particular data type. ÊÊ - The minimum number of sectors that can be present in a file of this Ê ÊÊ type. ÊÊ - The maximum number of sectors that can be present in a file of Ê ÊÊ this type. ÊÊ - Optional flags: Ê ÊÊ TRIM - The file type does not require complete sectors, so try to Ê ÊÊ remove individual bytes at the end of the file after a validated Ê ÊÊ carved file is identified. ÊÊ Ê ÊÊ NO_EOF - The validator doesn't implement EOF Ê ÊÊ ADD_CRLF - If the file doesn't have a CR/LF at the end, add it. Ê ÊÊ START_CODE_TERMINATES - The START CODE is never seen inside the Ê ÊÊ file, so if a START_CODE is found inside a proposed sequence of Ê ÊÊ sectors, it's wrong. Ê ÊÊ END_CODE_TERMINATES - The END CODE, if found, definately Ê ÊÊ terminates the run. Ê ÊÊ NO_BINARY_TERMINATES - The file type stores compressed data, so Ê ÊÊ if a sector is found that only has text, it can't be part of a Ê ÊÊ run of sectors for this file. Ê ÊÊ IGNORE_START_WHITESPACE - Ignore any whitespace in a sector at Ê ÊÊ the beginning before the Start sequence of bytes Ê ÊÊ TEXT - The file must be text Ê ÊÊ NO_ZBLOCKS - No sectors in a carved file are filled with NULLs. Ê ÊÊ ALLOW_MULTIPLE_MATCHES - Even if one match is found, keep running Ê ÊÊ the carver. Ê * The following individual validators were written: Ê JPEG - Recognizes JPEG files. The JPEG validator understands the the Ê internal structure of the JPEG files but does not validate the Ê huffman codings. If a valid jpeg stream is found, the JPEG file as a Ê whole is handed off to an external program that attempts to Ê decompress the JPEG file using libjpeg version 6b. Ê The JPEG validator works well with all of the carvers. Ê ZIP - Recognizes ZIP files and understands their internal Ê structure. Zip files start with a known header. Zip files terminate Ê with a directory structure from which it is possible to learn both Ê the final sector and the number of bytes that are present in the ZIP Ê file. It should be possible to have this information automatically Ê provided to the split carver for carving documents that are split Ê into two pieces. For the DFRWS 2006 challenge, this information was Ê manually provided with explicit invocations to the various carvers Ê based on a scan of the raw media.Ê (See zipinfo below) Ê MSOLE - Recognizes Microsoft OLE 2.0 files, the file format used by Ê Word, Excel, and PowerPoint. The MSOLE validator can recognize the Ê beginning of a file and validate the Sector Allocation Table, the Ê Master Sector Allocation Table, the Short Sector Allocaiton Table, Ê and the OLE 2.0 directory stream. From the SAT it is possible to Ê determine the precise nubmer of blocks in a MSOLE 2.0 file. However, Ê the MSOLE validator cannot validate the actual information stored Ê inside the streams. Thus, it is possible to create MSOLE files that Ê validate but which are, nevertheless, invalid. Ê HTML - recognizes start and end of HTML files Ê TEXT - recognies sectors that appear to be text. Assigns the sector Ê to a corpus based on a language and vocabulary model. Ê BINARY - recognizes sectors as being "binary." In addiiton to s2, these other programs were written for the DFRWS 2006 challenge: zipinfo - scans a raw partition looking for zip directories. When a directory is found, zipinfo displays the contents of the directory and calculates the size that a zip file with that directory would have to be. msinfo - scans a raw partition looking for SATs and MSATs that point to the same sectors, which indicates that they are valid. Determines the sector offset for each of these SAT/MSAT pairs. Uses this information to determine the length of the file, the first sector of the file, and the last sector of the file. From this information the split carvers can be used. jinfo - a validator for jpeg files which decmpresses the huffman coding to verify if they are correct or not. zipinfo - a validator for zip files which verifies the huffman coding by calling unzip -t msdir - a program which prints the directory of an MSOLE 2.0 file. textinfo - a program which runs the language model in stand-alone mode for testing. ================ The files were found with the following strategy: * The sequential carver with each of the recognizers to identify all Ê of the contigious runs. Ê - S2's "file generator" produces a report in a format that was Ê Ê requested by the DFRWS staff. * The remaining "text" sectors were manually examined. Based on an Ê analysis of the content, it was determined that the text and HTML Ê files were derrived from the following sources: Ê - Shakespear's A Commedy of Errors. Ê - Alice in Wonderland Ê - Moby Dick (chapter 1 and chapter 84) Ê - A Sherlock Holmes story Ê - Dickens' A Christmas Carrol Ê - Something in french (later determined to be De la division du Ê - travail social) Ê A language model was built to distinguish english from french using Ê bigrams, and to identify each of the works using vocabulary (which Ê was manually created).Ê S2 was run in a different mode to produce a Ê map of every uncarved text sector. The runs were manually assembled Ê (in most cases, the sectors were congitious; in a few cases there Ê were breaks.) - At this point, I looked at all of the START blobs and determined Ê that there were still JPEGs, ZIP files, and MSOLE files. They had to Ê be fragmented into two or more pieces. - The split carver was developed and run successfully with the JPEG Ê validator to discover all of the JEPG files.Ê The split carver works Ê well with the JPEG carver because the JPEG validator implemented Ê C_EOF. This allowed me to determine when the first run was too long Ê (it was extended until the validator switched form returning C_ERR Ê to C_EOF.) However, multiple passes needed to be made because Ê several of the first runs included extraneous sectors at the end of Ê the first run which didn't cause decoding errors, but which made it Ê impossible to produce a complete image. Thus, a "delta" value was Ê added to the split-carver API. Delta is a negative number which Ê specifies the number of blocks to remove from the first run when Ê forming the second run. Ê The JPEGs were found before the split_hole_carver was written. Many Ê of the iamges, such as image11619 and image94846, had only a few Ê sectors inserted between the two runs. (this might simulate a long Ê file being allocated and having to "go around" a small file that had Ê been inserted.) Thus, a more effiicent way to find the sectors might Ê ahve been to run the split_hole_carver with a successively larger Ê and larger hole.Ê (This is what the split_carve_find_hole function Ê does, and it was used to find the image31533.jpg. It turns out that Ê the top half is 219 blocks, but C_ERR doesn't trip until block 355!) - The split_hole_carver was written for the zip files. the zip files Ê contain two directories - a directory at the end of the file, and a Ê directory that can be assembled from each file header. Thus, it was Ê pretty easy to tell that a set number of blocks had been "inserted" Ê into the sequence of runs that corresopnded to the zip files. I knew Ê the size of the hole, the start of the zip file, the end of the zip Ê file. I had a validator. The split_hole_carver tried all possible Ê combinations and found the files. - Of the 4 remaining Microsoft files, the news release from the Great Ê Sand Dunes National park was found using the validator Ê technique. Ê The validator failed with the FCC XLS file, the NIST file, and the Ê NCJRS files.Ê The FCC file could not be read with the Perl Excel Ê reader; the NIST file coudln't be found due to an error in the Word Ê WV library (I was using 1.0 instead of 1.2.1); NCJRS file was in Ê three pieces. Ê These files were found by assembling a file that had a validating Ê SAT (but not a validating excel file), opening it in Word or Excel Ê Excel, determining the source document, downloading the source Ê document from the web, and then comparing sector-by-sector hash Ê codes using hashset.py. I was able to manually figure out where the Ê runs were by comparing runs from the actual documents with the runs Ê on the disk image, manually ignoring a missing sector here or Ê there. I then carved from the disk image into a file and Ê opened them. They worked!