Again, crazy idea. A certain site I won't mention now is down due to security concerns, and the usual discussion regarding the hashing of passwords arises. "Don't they use hash+salt?", "MD5 shouldn't be used because it's vulnerable to dictionary attacks", "If they steal the hashed passwords they'll have all the time in the world to crack them", etc. The general concern is that there's people out there who, given enough time, could obtain the original contents of a hashed string. How true is this?
Partly true. Dictionary attacks are based on the fact that people pick very poor passwords, using everyday words with (or even without) very slight modifications. Attackers basically go through a list of common terms and try different modifications of these terms, such as replacing 'o' with '0'. They hash these passwords and compare it with the stolen one. If equal, they rejoice with their newly cracked password. Of course, they knew what they were looking for, that is, a common variation of a common word.
This struck me as very interesting, because a hashes are small strings (128 characters, for example) that may represent very large data. So, in a way, hackers are reproducing the original data that was hashed. They are extracting the data, kinda. To be more precise, they are asking "Is this it?" a million times until they have it. And that's all that matters.
A major concern with hashing is the fact that certain inputs are bound to have the same hash value. There's a finite amount of distinct values you can represent with 128 characters (or n characters for that matter), but you have an infinite amount of possible documents and files and strings that you can hash. Collisions are inevitable.
Yet they are unlikely. If I hash "jorge" and I obtain "45j98dsjm899m...", it's very unlikely that some other coherent input will produce that same hash value. The inputs that will collide with my name are probably a bunch of garbage that won't translate into any type of file. That means that if I attack the string "45j98dsjm899m..." and obtain "jorge", it's almost certain this was the original value. All of these talk leads me to the main point. I'm no expert on compression or hashing algorithms, but I have a brain, so here it goes: I think bad hashing could serve as good compression.
The general idea would be this: to compress a file, (a) generate its hash (a more relaxed hash function should probably be needed) , (b) add important metadata for the file, such as its type, name and any other relevant info. To uncompress it, the algorithm will know the hashing function used, and the metadata included with the file. It will have a general idea of what it's looking for. Then it will proceed to attack the hash file using a sort of dictionary attack.
One important advantage of this approach is that low collision will warranty to a certain point that a corrupt file will not be produced. That is, if the algorithm produces a file that matches the metadata, it's almost impossible that that file is not the one looked for.
If you're still wondering where the crazy part is, well, you're damn crazier than I am. There's a lot of questionable points here. The hashed file, unlike a compressed file, doesn't contain all the information of the original, so theoretically it is very possible the file would be impossible to uncompress. Large files would need more than hash, which means they would have to be divided in some way that doesn't difficult the decompression process. Huge files, such as an ISO, probably would be impossible to compress using this. That's because the metadata applies to the file as a whole, and if it's partitioned into many pieces, the metadata will be a bit more obscure to create and analyze. What metadata I'm talking about, anyway? I guess the most important piece of metadata would be the file type. It might allow to get a general structure of the file, such as headers, footers, approximations of what might be in the middle, etc. So much metadata might be needed that the compression will be fruitless. Again, we could compress the metadata with a traditional compression algorithm, just to make things much more complicated :D.
To sum things up, I'm super compressing the file by not including a lot of info in the compressed file. This is compensated with processing on the part of the decompresser, which will reconstruct the file by using the "cheat" metadata provided and performing a dictionary attack on the hash. This might work wonders with internet downloads. People would be glad to have extra fast downloads in exchange for a slower first decompression.
How slow, you ask? No idea. I don't know how efficient can an assisted dictionary attack be. It could take days, for all I know, and that would be just stupid. I guess it would have to be fast enough that relatively big files can be downloaded and decompressed in significantly less time than it is currently. People with slow connections would be very happy about this.
This idea sounded so much better in my head. Still, I'm committed to throw them all out, because they might have some value. If I helped someone discard one theory, then I'm happy enough.