This document describes the two tools constructed and employed for extracting files from the 2007 challenge data: the blind extractor TNG, and the internet search and file subset verification utility supercollide. The blind extractor, TNG, uses three strategies or new algorithms for its file extraction purposes. These are: - Puzzle solving: For complex files, like ZIP, treat each part of the file that can be uniquely identified as a chunk of its own, and then use rules to reconstruct the file in the correct order. - Naive extraction: For files with large areas of hard-to-verify unstructured data, but where the header can be easily found, try to extract from the header onwards a certain length, and then run this through an external validator. - Data signature recognition: For files with no recognizable structure, use a statistical model (n-gram with n=3 in this case) to classify all parts of the file, assuming the inter-sector order is not inverted (i.e in no part of the challenge image does a sector of a file exist unless some prior sector of that file occurred first). Along this, TNG also uses a principle-of-exclusion array to redirect around known chunks, so that if one file splits another into two, the latter can be extracted after the former has been. However, due to time constraints, the array functions were only partly integrated into the extractor as a whole, and thus might have managed to secure more successful file extractions if it had been implemented more thoroughly. The actual signature detection part of TNG uses a Rabin-Karp rolling hash. Since the puzzle-solving algorithm in particular have to check against many signatures, the constant factor penalty we take for the R-K calculation is worth it compared to the O(nm) complexity of doing a separate search on each signature. Rabin-Karp hash searches work well when the signature is seldomly encountered, but not so well when it is often so. This can be seen with the BMP extractor, which slows down TNG (as the source given) considerably. The BMP extractor can be commented out in the main file, though. Details and results of the algorithms in TNG: - Puzzle solving. This is the most well developed part of TNG, and it is used in ZIP files, and to a limited extent for PNG files. The puzzle solving mechanism requires detailed knowledge of the file, and therefore would give best returns when dealing with files that have a complex file structure. ZIP files consist of data headers (FILE_HEADERs), directory entries (DIR_ENTRY) and central directory listings (ENDDIR_HEADERs). Some ZIP files also defer size information to a "footer" appended to the data header, after the data itself; therefore, it is a good format for the puzzle solving algorithm. The puzzle solving algorithm itself goes through two phases, in a loop, until nothing further can be extracted. These are coalescing and phantom generation. The coalescing phase checks multiple (unformed) chunks or chunk hierarchies against each other, growing the trees by joining them. The phantom generation phase constructs default chunks for those that have been lost or could not be joined in the coalescing phase. As the puzzle solving information is arranged in a hierarchy, the phantom generation phase does not construct default chunks unless there are no chunks that could have taken its place in the coalescing phase. For the challenge image, there was only one ZIP file, and that was, as far as could be ascertained, valid (no headers were lost), and therefore there was no need for the greater complexity of the phantom generation phase. It was not possible to ascertain the validity of the ZIP file completely, because its contents were encrypted (by WinZIP's AES encryption, compression method 99), but the structure as pieced together by the puzzle solver seems reasonable. The extractor for PNG files also found an uninterrupted PNG file. - Naive extraction: This part is simpler, but when it faces uninterrupted files where some conditions hold true, it is very quickly implemented. It was used to extract JPG files from the image, although it could not extract the interleaved files later on in the image. The naive extractor is triggered upon finding a header of the right file format. It then extracts a data block from the start of the header onwards to a certain reasonable maximum size, and runs that block through an external validation program - which program is used depends on the file format. If that doesn't succeed, then the naive extractor knows there is no way to extract an unbroken file from that block. If it does succeed, it employs a binary search (modified slightly to start close to the average size for this type of file) to narrow down on the correct size for the file. When done, it marks the block found in the principle of exclusion array, as the external validation tool is trusted, and continues. The constraints on the external validation tool are as follows: the tool must return true (no error) if it's given a valid file appended by garbage. It must always return false (error) if it's given a truncated valid file. It must also be noninteractive. The naive extractor was tested with the unix utilities djpeg (from libjpeg, for JPGs), unzip (from zip, for ZIP files), gs (from ghostscript, for PDF files), and antiword (for DOC files). Of these, only djpeg and antiword met the criteria; unzip would some times return false (error) for files appended by junk (if the junk looked like another zip file), and ghostscript would attempt to repair PDF files by itself, and thus return true (no error) for truncated files. I have left the PDF/ghostscript code in TNG as it may be of use in partial recovery (and it does not cause any false positives because all the PDF files are fragmented), but it can be excluded by removing the allocation clause in main.cc for pdf_ripper. Similarily, I've also left the antiword naive extractor in the codebase, although it did not succeed in finding any noninterleaved DOC files. For the challenge image, the naive extraction algorithm successfully found and extracted noninterleaved JPG files, one of which was (in all likelihood) a thumbnail from a larger picture. While this may be considered a false positive, it is a valid picture, and in cases where the larger picture would be corrupted, having a thumbnail picture is better than none at all. (This reasoning is similar to that extractors should try to extract files embedded in other files it can't extract, because the larger file might be unrecoverable; we don't know that in advance). - Data signature recognition: This part of the extractor uses a statistical model for grouping together similar sectors. As implemented, it uses an n- gram model (with n = 3) for two of the types, a byte counter for two others, and a substring match for one. The exact model is not important to the algorithm itself, and a better model (like prediction by partial matching, or dynamic markov compression) would probably lower the rate of false positives. The recognition system is based on a hierarchy of individual model recognition units. In this hierarchy, an unclassified data block of a sector in size starts by being classified into a broad category. When that is done, it is then classified within the subcategories of the broad category, so that any false positive behavior will be limited to errors within the same broad category. The categories in the implementation were DEFLATE (deflated data), EXT_PRINTABLE (printable text, including extended ascii for non-English text), EXTP_PDF (PDF markup), EXTP_MAIL (Mail text), EXTP_MAIL_HEADER (SMTP or RFC-822 mail headers) and EXTP_MAIL_BODY (everything which is EXTP_MAIL but not EXTP_MAIL_HEADER). The additional pseudo-category of TM_COMPLETE_FILE was used to mark blocks the naive extractor had found, so that they would be easily recognized during TNG's file dump phase. For the challenge image, the data signature recognition system was used to find mail text. Initial runs showed it prone to confusing PDF markup with mail text, and therefore another model for PDF markup was constructed. Although it did not completely distinguish all PDF markup from mail text (particularily because the n-gram model, as implemented, cannot give any clear confidence, just relative ratings above a random model), it did classify most of the text correctly. As a consequence, in order not to have the effort backfire as a false positive, the mail text has been classified as partly recovered. ------------------------------------------------------------------------------- As multiple PDF file signatures were found in the image when trying to make the ghostscript extractor, it seemed it would be a loss to not attempt to recover them. Since PDF files have a distinct title field, it appeared reasonable to try to search the web for them, automatically. Supercollide is a utility developed for part of this purpose. It hashes every sector in the image, as well as all the "needle files" -- files which could be included in the image, but where we don't know if they are or not. If the image contains a subset of sector hashes that match a given needle file, then we know (with high probability) that the needle file is included in the image. The actual implementation is more difficult because of the edge case where the needle file's last sector isn't a full sector; since its hash won't match any of the image sector hashes, we have to run another round on the image with the corresponding partial sector length. Supercollide itself is conservative, which means that it'll (almost never, subject to hash collisions) fit a needle file to the image if the needle file isn't present. Therefore, we're given considerable leeway in finding the needle files, as the wrong ones will get chaffed out during the supercollide phase. To find suitable needle files, a group of perl scripts were constructed. The first finds \Title tags in the image and extract the PDF title from them, and the second searches for these titles and downloads the first page of search results. While this may require some disk space to accomodate all the potential needle files, we do not incur any false positive penalty. All the PDF files that have been found were found in this way. The abstract method (using supercollide and a set of needle files) is not restricted to any file type, but the web search is more difficult and could not be implemented in time for other file types -- and could not even theoretically be used on those file types that have no easily searchable metadata. This suggests a search engine of hashes, but that would require considerable computational power and space. One could also consider a program that would extract "unusual word sequences" (phrases with low combined probability) to search for, no matter the data type; that would fix documents of any kind. However, again due to time constraints, that wasn't implemented. ------------------------------------------------------------------------------- - What were done to reduce false positives - The algorithms detailed above are designed to be conservative. In the case of TNG, the naive extractor, by its nature, only passes the files that the external validator accepts. The complex extractor did not produce any false positives in the first place. Part of this is due to that it only matched one file, but even when it matches multiple, it will only accept file chunks that do not have unreasonable values, as given by the various complex extractors' is_sane routines. As a consequence, there are few false positive chunks, and for these chunks to become a false positive file, they must be arranged in the correct order. The ZIP extractor, for instance, does not coalesce FILE_HEADER and DIR_ENTRY chunks with differing uncompressed size fields. The search part of TNG uses is_sane to check chunks at the time they are discovered. This errs on the side of false negatives, as corrupted headers may fail to be recognized or the corrupted fields considered unreasonable, rather than on the side of false positives. The algorithm or system most liable to false positive results is the data verifier, or statistical signature verifier. This is a particular problem because of the limits of the n-gram model. What was done here was to iterate and adapt; the initial mail text filter matched too much PDF markup, and thus a PDF markup n-gram classifier was constructed to remove those false positives. While being manual in a sense, the results can be used on other files because the n-gram models themselves are not fitted on the data that was to be extracted. The mail filter was fitted on the Enron mail corpus, the Spamassassin corpus, and the TREC05 corpus; and the PDF filter on a group of PDF files on the disk, including some game manuals. The hierarchy that constrains the data verifiers also help prevent false classification as it includes and excludes - for instance, a chunk found to be DEFLATE data and not text is not classified against either PDF markup or mail, since those are subordinates of text in the hierarchy. ------------------------------------------------------------------------------- - Compiling and running - TNG and supercollide were programmed in a Linux 2.6 environment. They have not been tested on other platforms, and may or may not work there. In particular, they will not work on platforms that require files to be explicitly opened as binary or text. It needs write access to the directory it is run from, to make directories and temporary files. TNG makes use of zlib (for verifying deflate data) and the external utilities djpeg (from libjpeg, for naively extracting JPG files), gs (from ghostscript, for PDF files), and antiword (from antiword, for DOC files). Technically, the extractors employing antiword and gs didn't find any files, so the program can be run without them, but djpeg is required to find the noninterleaved JPG files. To compile TNG, just run g++ -O9 -lz main.cc -o tng in the TNG directory. Compiling supercollide is equally simple: run g++ -O9 dir_intersect.cc -o superc in the supercollide directory. Running tng is self-explanatory - it takes one parameter, which is the image file. Running supercollide itself is also fairly easy, the first parameter is the image file, and the second parameter is the directory containing the needle files. However, running the entire supercollide chain can be a bit more difficult. First run the title extractor: perl extract_titles.pl image-data-file >titles.txt Then run the downloader: perl brute_download.pl titles.txt The downloader makes and populates a subdirectory called bruted_needles. Now run supercollide itself: superc image-data-file bruted_needles/ and when done, supercollide lists the needle files it found in the image, as well as their MD5 sums and the sectors they use in the image (in sorted order). Do note that TNG uses large amounts of memory. Each of the trigram models occupy about 70 megabytes, and the principle-of-exclusion array requires about equal to the size of the file being examined, even when limited to 127 chunks. (Given more time, the principle-of-exclusion array could probably be made much more efficient by using ranges instead of direct array values) In the worst case, TNG also uses four times the image file bytes for counting sectors. The memory usage of supercollide is determined by the size of the image file, but is not nearly so extreme. Each sector requires 32 bytes of hash and 4*2 bytes of information; for the DFRWS 2007 image this is about 23 MB.