48

I have a 64 character SHA256 hash.

I'm hoping to train a model that can predict if the plaintext used to generate the hash begins with a 1 or not.

Regardless if this is "Possible", what algorithm would be the best approach?

My initial thoughts:

  • Generate a large sample of hashes that begin with a 1 and a large sample of hashes that do not begin with a 1
  • Set each of the 64 characters of a hash as a parameter for some sort of unsupervised logistic regression model.
  • Train the model by telling it when it is right/wrong.
  • Hopefully be able to create a model that can predict if the plaintext begins with a 1 or not with a high enough accuracy (and with a decent kappa)
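A minimal sketch of this plan (hypothetical setup: random 20-character plaintexts, a hand-rolled logistic regression on the 256 digest bits — note that once the labels are supplied this is supervised, not unsupervised, learning):

```python
# Sketch of the plan above (hypothetical demo).
# Label = does the plaintext start with "1"?
# Features = the 256 bits of the SHA-256 digest.
import hashlib
import random
import string

import numpy as np

random.seed(0)
rng = np.random.default_rng(0)

def digest_bits(text):
    """SHA-256 digest of `text` as a 256-element 0/1 feature vector."""
    d = hashlib.sha256(text.encode()).digest()
    return np.unpackbits(np.frombuffer(d, dtype=np.uint8)).astype(float)

def sample(n):
    """n labelled examples: roughly half start with '1', half with '0'."""
    y = (rng.random(n) < 0.5).astype(float)
    X = np.stack([
        digest_bits(("1" if yi else "0") +
                    "".join(random.choices(string.ascii_letters, k=20)))
        for yi in y])
    return X, y

Xtr, ytr = sample(4000)
Xte, yte = sample(2000)

# Plain logistic regression, fitted by gradient descent.
w, b = np.zeros(256), 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(Xtr @ w + b)))
    g = p - ytr
    w -= 0.1 * (Xtr.T @ g) / len(ytr)
    b -= 0.1 * g.mean()

acc = np.mean(((Xte @ w + b) > 0) == (yte > 0.5))
print(f"held-out accuracy: {acc:.3f}")  # stays around chance level
```

On held-out data the accuracy stays at chance, no matter how long the model trains.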
Tim
John
  • 1
    For this to be tractable, you’d essentially have to have the ML algorithm carry out the hashing function. Maybe this operation could be represented in some form or fashion, but you’ll have to do most of the work by engineering features. – Sycorax Sep 10 '18 at 22:41
  • 2
    @John, it's likely not possible, but if you're still determined to test it out, you should make sure that whatever algorithm you use is powerful enough to do the opposite. Given the plaintext it should determine the first digit of the hash, or in other words learn the hashing function from examples. – csiz Sep 11 '18 at 01:47
  • 24
    FYI: This is likely motivated by bitcoin mining. – ClojureMostly Sep 11 '18 at 08:12
  • 2
    It would take an astronomical leap in computer learning for an ML algorithm to break a hash that the best (human) cryptographic minds in the world have so far been unable to break. – BlueRaja - Danny Pflughoeft Sep 11 '18 at 12:54
  • 56
    “How would I train a model that allows me to time travel, regardless of whether this is ‘possible’?” – Konrad Rudolph Sep 11 '18 at 15:09
  • 1
    @KonradRudolph That's easy. Step1: Describe the space of all possible designs. Step2: Enumerate the space in Step1. Step3: Take the number n generated in Step4, and return "True" if it describes a working time machine. Step4: If Step3 returns "True", then return n. Otherwise, return n+1. The only stable time loop is one in which you find a working time machine design. – Acccumulation Sep 11 '18 at 16:18
  • @KonradRudolph: I'm not prepared to believe time travel isn't possible. As for John, if he finds a model that runs in fewer steps than sha256, it's publish time for great fame. – Joshua Sep 11 '18 at 18:57
  • 14
    @Joshua OP wants to *invert* SHA-256. I’ll let him publish even if it takes a thousand times as many steps as SHA-256. I’ll also cease existing since the solution almost certainly exploits a bug within the very fabric of reality to defeat logic. – Konrad Rudolph Sep 11 '18 at 19:00
  • 1
    @KonradRudolph: The halting problem requires finite time and finite energy (but the lower bound on the energy requirement is stupidly high). I wouldn't worry about a solution to inverting SHA-256 breaking logic. – Joshua Sep 11 '18 at 19:08
  • 4
    @Joshua This is completely unrelated to the halting problem. But, like the halting problem, OP wants to solve an uncomputable function: neither can be solved *at all*, regardless of time and energy. To clarify: it’s not simply inefficient or hard, it’s provably *impossible*. – Konrad Rudolph Sep 11 '18 at 19:11
  • 2
    Your first bullet point mis-states the problem: You say "hashes that begin with a 1", but you really mean "hashes *of inputs* that begin with a 1". This is maybe why some people are getting confused about whether you're trying to predict something about the hash given a string vs. predict something about "the string" given a hash. Is there some specific size of input that might make this problem not total nonsense (if the inputs are shorter than the hash, there might only be 1 input that produces any given hash). – Peter Cordes Sep 11 '18 at 23:45
  • You are asking if there exists any `x` such that `h = H(1 || x)` is true for a given `h`. This can be formulated as a SAT problem for each given length of `x` or each hash function input block multiple. See satcoin. – Dan D. Sep 12 '18 at 05:07
  • @DanD. No it can’t, because $\forall h \exists (x, y) \mid h=H(x) =H(y) \land (x \text{ starts with 1}) \land (y \text{ doesn’t start with 1})$. – Konrad Rudolph Sep 12 '18 at 13:00
  • @KonradRudolph Then the SAT problem is trivially true. There always exists an `x` of arbitrary length. But in that case `x` and `y` have neither the same length nor a given length. For `x` and `y` of the same length, with bit length less than or equal to the hash output length, that might not hold. – Dan D. Sep 12 '18 at 14:52
  • 2
    @DanD I’m not sure how you would formulate the SAT problem but you need to consider that in OP’s question the lengths are *not* necessarily the same. But even for the same length, if said length is sufficiently long, my condition holds (without proving anything about SAT). It’s simply a property of the design of SHA256: different sequences yield the same hash, and in particular the hash does not determine the character identity of any given input position. OP’s question is the same as asking “given $H(x) = x \mod 10$, I want to find from $H(x)$ whether $10 < x \leq 20$”. – Konrad Rudolph Sep 12 '18 at 15:03
  • 16
    All SHA256 hashes can be generated by a string that starts with a "1". – Reactgular Sep 13 '18 at 00:15
  • 3
    A better, related question might be: "Can machine learning be used to find flaws in cryptographic hash algorithms?" – R.. GitHub STOP HELPING ICE Sep 14 '18 at 14:08
  • 2
    Related post on Crypto.SE [Any practical uses of machine learning for cryptography?](https://crypto.stackexchange.com/questions/9751/any-practical-uses-of-machine-learning-for-cryptography) – mikeazo Sep 14 '18 at 17:28
  • 3
    @cgTag Can you give a proof or a reference? I don't think what you said is known. It is only believed to be true. – Pedro A Sep 15 '18 at 13:17
  • @PedroA there are an infinite number of strings that start with "1" but a finite number of SHA256 hashes. – Reactgular Sep 15 '18 at 13:20
  • 1
    @cgTag of course. But there's no proof that there is an infinite amount of plaintexts starting with 1 that yield, for example, 06289361528302715fafaffaffaafafa53892716 (or, if you know one, share with the world please!) (by the way I just smashed the keyboard there, maybe the digit count is wrong) – Pedro A Sep 15 '18 at 13:22
  • @PedroA only half the bits from the source string are used to generate SHA256 hashes, and the other half are random. The entire point of SHA256 is that the input does not control the output. Therefore you can not predict the input from the output, and thus any given hash has an infinite list of possible input strings. If the list is infinite, then all hashes have at least 1 string that starts with a "1" if not many strings. – Reactgular Sep 15 '18 at 13:34
  • 8
    @cgTag I'm sorry but that's just wrong. The input does control the output, otherwise it wouldn't be a function in the first place. Also, just because you have an infinite list of things, it doesn't mean one of them starts with 1. You're trying to prove a known cryptography conjecture in a SE comment. Note: I also believe it is true, but claiming it is true is misleading. If you're right, surely there will be a paper or some other reference. – Pedro A Sep 15 '18 at 13:40
  • @PedroA sounds like an interesting topic for a blog post or article by someone knowledgeable about cryptography. That is not me. :) – Reactgular Sep 15 '18 at 13:45
  • 3
    @cgTag I see :) The problem is that your comment is wrong and has a bunch of upvotes. Thanks for listening :) By the way I found a link supporting my claims: https://crypto.stackexchange.com/q/29781/43257 – Pedro A Sep 15 '18 at 13:48
  • 3
    @PedroA I think it's more a case of comment being unproven. Any SHA256 hash can be generated from a string starting with a 1 as first bit seems like a plausible hypothesis. – Taemyr Sep 17 '18 at 10:32

13 Answers

102

This isn't really a stats answer, but:

No, you can't determine the first character of the plaintext from the hash, because there's no such thing as "the plaintext" for a given hash.

SHA-256 is a hashing algorithm. No matter what your plaintext, you get out a 32-byte signature, often expressed as a 64-character hex string. There are far more possible plaintexts than there are possible 64-character hex strings, so the same hash can be generated from any number of different plaintexts. There's no reason to believe that the first character being/not being a '1' is uniform across all plaintexts producing a given hash.
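To make the non-injectivity concrete, here is a small sketch (hypothetical demo: SHA-256 truncated to 16 bits so that collisions are cheap to find; the full hash behaves the same way, just astronomically slower to search):

```python
import hashlib
from itertools import count

def h16(s):
    """First 4 hex characters of SHA-256 -- a toy stand-in for the full hash."""
    return hashlib.sha256(s.encode()).hexdigest()[:4]

target = h16("1hello")  # a digest whose "original" plaintext starts with 1
# Find another preimage of the same digest that does NOT start with "1":
other = next(f"x{i}" for i in count() if h16(f"x{i}") == target)
print(f"{target!r} is the hash of '1hello' and also of {other!r}")
```

Both plaintexts produce the same (truncated) hash, so the hash alone cannot tell you which one was "the" plaintext.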

Chris H
  • 22
    This is so far (in my opinion) the only correct answer. All the other answers seem to deal more with the problem of learning the hash function rather than learning to invert the hash (the actual question). They all seem to ignore that hashing is *not* an injective function. – Luca Citi Sep 11 '18 at 14:45
  • 7
    Could you not predict the *probability* that the first char is a one? Lack of one-to-one-ness is not an uncommon problem in statistical learning. – Matthew Drury Sep 11 '18 at 15:19
  • 17
    @MatthewDrury Given that SHA256 is designed to make all inputs equally likely for a given hash, there will hopefully be infinitely many inputs starting with 1, *for any given hash*. So if you want to estimate the probability, your best estimate is going to be roughly $\frac{1}{256}\pm\varepsilon$. – Konrad Rudolph Sep 11 '18 at 15:31
  • 12
    Yup, agree. I just wanted to note that the lack of injectivity is not really a structural problem with the application of machine learning. – Matthew Drury Sep 11 '18 at 15:35
  • 4
    Yes, of course. It's the combination of not being injective and the specific design that the co-image of a given hash is a set that has approximately half of the elements starting with a one and half starting with a zero. What I wanted to highlight is that while many answers bring up the smoothness of the feature space as an issue, this is not the real problem. – Luca Citi Sep 11 '18 at 17:22
  • 2
    The conclusion is correct, but the reasoning is flawed. Imagine another "hash function", which would simply return the first byte of its input. Sure, that's the poorest hash function imaginable (except "return 0 no matter what"), but you could technically use it wherever a hash function is required, and though you couldn't restore plaintext from "hash", you could easily predict the first character. – IMil Sep 12 '18 at 00:23
  • 6
    @IMil The reason I mentioned specifically the fact it's a hash function isn't to suggest *no* hash function could possibly ever reveal that information, but to motivate the statement that there's no such thing as "the plaintext". Sure, a (bad) hash function could be partially reversible and clearly tell us something about the whole set of plaintexts which would produce it, but there's no reason to believe that of SHA-256. – Chris H Sep 12 '18 at 05:05
  • 1
    In addition to what Chris H said, the people who made SHA-256 know a lot about cryptography and specifically designed and tested SHA-256 to prevent predictability. It isn't even logical to ask if it is possible, because they specifically DESIGNED it not to do what you're asking it to do. It's basically like asking "How would A do B if A is specifically designed and tested to **NOT** do B?" The answer is, you can't. – Nelson Sep 15 '18 at 18:32
  • The network could simply be trained to produce *a* hash, not *the* hash. This is not a fundamental issue. – usr Sep 16 '18 at 11:27
  • 1
    @usr I'll assume you mean it could be trained to produce *a* plaintext, because absolutely nothing in the question or the answer was about producing hashes. If you think training a network to produce a plaintext from an SHA hash is plausible then by all means feel free to give it a shot. Just don't say I didn't warn you when you realise that a) it's almost certainly a waste of time even if that's all you want to achieve, and b) even if you succeed, you've done nothing to achieve what OP wants. – Chris H Sep 17 '18 at 08:21
  • Yes, I mean preimage. I am very certain it is not possible to train a network to do this. But I wanted to address your point that non-uniqueness of the preimage would prevent training. I do not think it does. – usr Sep 17 '18 at 08:49
  • @usr I never said it would prevent training (you can train, it just won't help you) - I said it means you can't determine the first character of the plaintext from the hash. OP wants to determine whether or not a text begins with "1" by examining its hash. Even assuming we somehow have an NN or whatever that takes an SHA hash as input and produces *a* plaintext with that hash, OP's no closer to knowing what they want. They now know that *some* plaintext with that hash does (or doesn't) begin with "1", there's nothing to say that's the one that was actually used in producing it initially. – Chris H Sep 17 '18 at 09:04
  • That's a good point. – usr Sep 17 '18 at 09:13
  • Theoretically, you could create a lookup table. Column 1 would have every possible combination of characters and column 2 would have the matching hash value. Sheesh, why do you computer science types make everything so complicated. – Andrew Brēza Sep 17 '18 at 15:08
53

SHA256 is designed to be as random as possible, so it is unlikely you would be able to separate hashes that came from 1-prefixed plaintext from those that did not; there should simply be no feature of the hash string that gives that information away.
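A quick sanity check along these lines (hypothetical sketch): compare per-bit statistics of digests of 1-prefixed and 0-prefixed plaintexts. Both come out indistinguishable from fair coin flips.

```python
import hashlib
import random
import string

random.seed(0)

def bit_fraction(prefix, n=5000):
    """Average fraction of 1-bits in SHA-256 digests of n random plaintexts."""
    ones = 0
    for _ in range(n):
        s = prefix + "".join(random.choices(string.ascii_letters, k=12))
        ones += sum(bin(b).count("1") for b in hashlib.sha256(s.encode()).digest())
    return ones / (n * 256)

f1 = bit_fraction("1")
f0 = bit_fraction("0")
print(f"1-prefixed: {f1:.4f}   0-prefixed: {f0:.4f}")  # both very close to 0.5
```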

Pavel Komarov
  • 5
    "unlikely" and "should" - That's what the algorithm will tell me. At first glance it does appear to be impossible, but I would like to know what algorithm and approach to take to test this hypothesis. – John Sep 10 '18 at 21:48
  • 25
    +1 It is guaranteed that any kind of "unsupervised logistic regression model" will be unable to do better than guessing unless it can be supplied a truly astronomical number of cases. This problem is tilting at windmills. – whuber Sep 10 '18 at 21:56
  • 44
    You can try it, but the learner will be trying to find a statistical relationship that is intentionally designed not to exist. – Pavel Komarov Sep 10 '18 at 22:08
  • @John You can make this a supervised problem by hashing tons of strings that don't start with 1, saving the hashes as your class 0 set, and then hashing tons that do start with 1, saving the hashes as your class 1 set. Then use any scikit-learn model: `model = WhateverClassifier(); model.fit(X, Y) # where X is the hashes and Y are class labels; testX = hash(testStrings); testY = model.predict(testX) # see if testY is as you expect based on the strings you hashed.` There isn't really a best model for this, though SVMs are common with binary classification. – Pavel Komarov Sep 10 '18 at 22:22
  • 32
    "Designed to be as random as possible" is an understatement. Specifically, the design aims to have a maximally non-linear dependency, where each input bit influences approximately 50% of the output bits, and each output bit depends on approximately 50% of the input bits. This is known as _confusion and diffusion_. It makes the task here (recovering just the first bit) just as hard as recovering the whole message. – MSalters Sep 11 '18 at 08:41
  • 12
    I think you can strengthen "unlikely" in this answer. The OP has zero chance of applying statistics-based techniques to predict any part of a SHA256 hash from content, or vice-versa, with even a detectable improvement over random guessing. The practical solution is basically to pre-calculate the whole target population of original content. – Neil Slater Sep 11 '18 at 09:12
  • 5
    The crucial point is not that *it was designed* to be as random as possible. Designs fail all the time. The point is that experts have tried to find a flaw in that design for years and haven't found anything. In the current form, this answer sounds like *"You can't break x because it was designed to be unbreakable"*. – JiK Sep 12 '18 at 09:51
  • 4
    @John if you even knew what algorithm to use to *attempt* a proof you'd be ahead of the state of the art in research. – hobbs Sep 12 '18 at 19:07
  • This is like asking for an algorithm that tries to build a time machine. Since we have no idea what a time machine would be like, we also have no idea what an algorithm that looks for a time machine would be like. With no idea of what the result would be, we have no idea what the process of looking for the result would look like. Any algorithm we propose would have an expected chance of success of zero, so how can we pick a "best" one if they're all equally likely to work? – David Schwartz Sep 14 '18 at 21:41
43

Regardless if this is "Possible", what algorithm would be the best approach?

Sorry, but that's a nonsensical question. If something is impossible, then you can't search for the best approach to the problem.

In this case, this definitely should be impossible, because hashing is a one-way function: several inputs (infinite, in fact) can produce the same output. If the first bit of the input on its own somehow influenced the probability of a specific hash value, that would mean the hash algorithm is completely flawed.

You certainly can train a neural network, linear classifier, SVM or whatnot to attempt the prediction. And if you were able to reliably predict the input from the output for a certain hashing algorithm, this would prove that the algorithm is worthless. I'd say that for a widely used algorithm like SHA256 such a possibility is vanishingly low. However, it's a reasonable approach to quickly rule out new, unproven and untested hashing algorithms.

IMil
  • 7
    It’s (almost) irrelevant that a hash is a one-way function: Off the top of my head I can think of lots of one-way functions that nevertheless allow me to infer features of the original input ($\mathop{\rm sign}(x)$, famously a non-linear one-way function, tells me a lot about the input). – Konrad Rudolph Sep 11 '18 at 14:58
  • 12
    @KonradRudolph: "One-way function" has a [specific meaning](http://mathworld.wolfram.com/One-WayFunction.html) in this context that is not the meaning you're thinking of. `sign(x)` is not a one-way function in this sense, because finding preimages is trivial. – user2357112 supports Monica Sep 11 '18 at 18:04
  • 4
    That said, I don't think the answer is using "one-way function" correctly either. – user2357112 supports Monica Sep 11 '18 at 18:06
  • 1
    @user2357112 Thanks, I didn't know that. I only knew the meaning as a function that's surjective but not bijective. That's also the definition given in the answer, which is what I objected to. – Konrad Rudolph Sep 11 '18 at 18:08
  • 1
    Yeah, sorry, I'm a bit lax with definitions. However, I believe that 'one-way' is more understandable to novices than the more strict terms. – IMil Sep 12 '18 at 00:27
  • If something is impossible, you can search, but you're wasting your time. – user253751 Sep 12 '18 at 02:24
  • "several inputs (infinite, in fact) can produce the same output" - This is only conjectured, not proved. – Pedro A Sep 15 '18 at 13:20
  • @PedroA how exactly is it "not proved"? If the input size is unlimited, and the output size is fixed, it inevitably follows that there is an infinite number of inputs producing the same output. Whether this is true for a given output is indeed more complicated to prove, but I didn't make this claim. – IMil Sep 16 '18 at 22:45
  • 1
    You are right, I misread your claim as the one you didn't make (that the output is a specific one). – Pedro A Sep 16 '18 at 22:52
28

While one can't prove a negative with an example, I still feel an example would be suggestive, and perhaps useful. It does show how one would (attempt to) solve similar problems.

When I want to make binary predictions from features that are binary vectors, a random forest is a solid choice. I guess this kind of answers the second part of your question: what is a good algorithm.

We will want to preprocess the SHA256 strings into binary (Boolean) vectors: each bit is statistically independent, so each bit makes a good feature. Our inputs will thus be 256-element Boolean vectors.

Demo

Here is a demonstration of how the whole thing can be done using the Julia DecisionTree.jl library.

You can copy paste the below into the julia prompt.

using SHA
using DecisionTree
using Statistics: mean
using Random: randstring

const maxlen=10_000 # longest string (document) to be hashed.

gen_plaintext(x) = gen_plaintext(Val{x}())
gen_plaintext(::Val{true}) = "1" * randstring(rand(0:maxlen-1))
gen_plaintext(::Val{false}) = randstring(rand(1:maxlen))


bitvector(x) = BitVector(digits(x, base=2, pad=8sizeof(x)))
bitvector(x::AbstractVector) = reduce(vcat, bitvector.(x))

function gen_observation(class)
    plaintext = gen_plaintext(class)
    obs = bitvector(sha256(plaintext))
    obs
end

function feature_mat(obs)
    convert(Array, reduce(hcat, obs)')
end

########################################

const train_labels = rand(Bool, 100_000)
const train_obs = gen_observation.(train_labels)
const train_feature_mat = feature_mat(train_obs)

const test_labels = rand(Bool, 100_000)
const test_obs = gen_observation.(test_labels)
const test_feature_mat = feature_mat(test_obs)


# Train the model
const model = build_forest(train_labels, train_feature_mat)
@show model


#Training Set accuracy:
@show mean(apply_forest(model, train_feature_mat) .== train_labels)

#Test Set accuracy:
@show mean(apply_forest(model, test_feature_mat) .== test_labels)

Results

I trained this on 100,000 random ASCII strings of length up to 10,000. Here are the results I saw:

Train the model

julia> const model = build_forest(train_labels, train_feature_mat)
Ensemble of Decision Trees
Trees:      10
Avg Leaves: 16124.7
Avg Depth:  17.9

Training Set accuracy:

julia> mean(apply_forest(model, train_feature_mat) .== train_labels)
0.95162

Test Set accuracy:

julia> mean(apply_forest(model, test_feature_mat) .== test_labels)
0.5016

Discussion

So that is basically nothing. We went from 95% accuracy on the training set to barely over 50% on the test set. Someone could apply proper hypothesis tests to see if we can reject the null hypothesis, but I am pretty certain we can't. It is a tiny improvement over the guess rate.

That suggests that this can't be learned: the random forest went from well fitted to hitting just the guess rate. Random forests are quite capable of learning difficult inputs, so if there were something to learn, I would expect at least a few percent above chance.

You can play around with different hash functions by changing the code, which could be interesting. I got basically the same results when using Julia's built-in hash function (which is not a cryptographically secure hash, but is still a good hash, so it should indeed send similar strings apart). I also got basically the same results for CRC32c.

Lyndon White
17

Hash functions are (by design) extremely badly suited for doing any machine learning on them.

ML is essentially a family of methods for modelling / estimating locally continuous functions. I.e., you're trying to describe some physical system that, while it may have certain discontinuities, is over most of the parameter space smooth enough that a scattered sample of test data can be used to predict the result for other inputs. To do that, the AI algorithms need to somehow decompose the data into a clever basis representation, for which the training has suggested that e.g. if you see such and such shape (which appears to correlate with the result of such and such convolution) then there's a good chance that the output should have in the corresponding region such and such structure (which can again be described by a convolution or something).

(I know, many ML approaches aren't like convolution at all, but the general idea is always the same: you have some input space that's so high dimensional it's impossible to sample exhaustively, so you find a clever decomposition that allows you to extrapolate results from a comparatively sparse sample.)

The idea behind a cryptographic hash function however is that any change to the plaintext should result in a completely different digest. So no matter how you decompose the function, local estimators will not allow you to extrapolate how small fluctuations around that part influence the result. Unless of course you actually process all of the information of a limited set, but this wouldn't be called machine learning: you'd just be building a rainbow table.
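The avalanche property described above is easy to see empirically (a small sketch): flip one bit of the input and roughly half of the 256 output bits change, so there is no "local neighbourhood" for an estimator to exploit.

```python
import hashlib

def hamming(a, b):
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

msg = bytearray(b"a perfectly ordinary message")
base = hashlib.sha256(msg).digest()
msg[0] ^= 0x01                      # flip a single input bit
flipped = hashlib.sha256(msg).digest()

diff = hamming(base, flipped)
print(f"{diff} of 256 output bits changed")  # typically close to 128
```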

leftaroundabout
  • 4
    Reading this, it occurred to me that a) to build a rainbow table, you need to know what hash function to use, and b) it *might* be possible for a machine learning algorithm to identify which algorithm is in use, given a large enough sample of inputs and outputs (at least if the algorithm has identifiable flaws). So if the original problem were restated to be about an unknown hash function that needs to be identified, it might be more practically interesting. – senderle Sep 11 '18 at 11:53
7

This is an interesting question because it raises issues about what counts as "machine learning." There is certainly an algorithm that will eventually solve this problem if it can be solved. It goes like this:

  1. Pick your favorite programming language, and decide on an encoding that maps every string to a (potentially very large) integer.

  2. Pick a random number and convert it into a string. Check to see if it's a valid program in your language. If it's not, pick another number and try again. If it is, start it, immediately pause it, and add it to a list of paused programs.

  3. Run all the paused programs for a little while. If any of them halt without producing an adequate solution, remove them from the list. If one produces an adequate solution, you're done! Otherwise, return to 2 after letting them all run for a bit.

There's no question that if you have infinite storage and infinite time, the above algorithm will eventually find a good solution. But that's probably not what you mean by "machine learning."

Here's the rub: if you consider all possible problems, no machine learning algorithm can do better on average! This is known as the no free lunch theorem. It proves that among all the possible problems you could throw at any given machine learning algorithm, the number that it can solve quickly is vanishingly small.

It can solve those problems quickly only because they are governed by patterns that the algorithm can anticipate. For example, many successful algorithms assume the following:

  1. Solutions can be described by some complex series of matrix multiplications and nonlinear distortions, governed by a set of parameters.

  2. Good solutions will be clustered together in parameter space, so that all you have to do is pick a search neighborhood, find the best solution there, shift your search neighborhood so that the best solution is in the center, and repeat.

Obviously neither of these assumptions holds in general. The second is particularly suspect. And the no-free-lunch theorem tells us that these assumptions don't even hold most of the time. In fact they almost never hold! It's just our good fortune that they do hold for certain problems that actually matter.

The problem you've chosen is designed from the beginning to violate assumption 2. Hash functions are specifically designed so that similar inputs give completely different outputs.
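A toy illustration of assumption 2, and of how a hash-based objective breaks it (hypothetical sketch: hill climbing works on a smooth score, but a score derived from SHA-256 gives the search nothing to climb):

```python
import hashlib
import random

random.seed(1)

def smooth_score(x):
    return -abs(x - 12345)                 # neighbours have similar scores

def hash_score(x):
    h = hashlib.sha256(str(x).encode()).digest()
    return -int.from_bytes(h[:4], "big")   # effectively random in x

def hill_climb(score, start=0, steps=20000):
    """Accept a small random step whenever it strictly improves the score."""
    best = start
    for _ in range(steps):
        cand = best + random.choice([-1, 1]) * random.randint(1, 10)
        if score(cand) > score(best):
            best = cand
    return best

best_smooth = hill_climb(smooth_score)     # converges near 12345
best_hash = hill_climb(hash_score)         # wanders with no direction
print("smooth objective ->", best_smooth)
print("hash objective   ->", best_hash)
```

The smooth objective is solved almost immediately; on the hash objective the same search does no better than picking points at random.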

So your question—what is the best machine learning algorithm for solving this problem?—probably has a very straightforward answer: random search.

senderle
  • I wonder how quantum computing will affect the no-free-lunch theorem. Presumably, quantum computing is constrained by it, too. – Hannah Vernon Sep 11 '18 at 13:28
  • 1
    @MaxVernon oh, interesting. I would expect that all quantum algorithms have the same property when compared to *other quantum algorithms*. I do not know whether all quantum optimization algorithms have an average-case speedup over classical ones. They might! I have a [question and self-answer](https://cs.stackexchange.com/questions/57546/are-coevolutionary-free-lunches-really-free-lunches) that talks about a "free lunch" theorem that could be relevant. (tldr; the lunch is only free if you ignore some of the work done... but I wonder if that changes in the quantum case.) – senderle Sep 11 '18 at 13:39
5

It is next to impossible. However, people have observed some patterns in SHA256 which might suggest non-randomness: see "A distinguisher for SHA256 using Bitcoin (mining faster along the way)". Their tl;dr:

"To distinguish between an ideal random permutation hash and SHA256, hash a large amount (~2^80) of candidate 1024 bit blocks twice, as done in Bitcoin. Ensure that the bits of the candidate blocks are sparsely set (much fewer than the 512 mean expected), according to the Bitcoin protocol, discarding candidate blocks that do not meet the Bitcoin “difficulty” standard (where the resultant hashes start with a large number of 0’s). With the remaining set of valid input candidates (467369 when this analysis was done), observe a particular set of 32 bits in the input block (located where Bitcoin has the nonce, input bits 607-639). Note that the mean number of bits set in the nonce field is skewed to the left, i.e. fewer than the expected value of 16 bits set (estimated mean 15.428)."
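Taking the quoted numbers at face value, a back-of-envelope z-test (sketch; `n` and the observed mean are the figures from the quote above) shows how far 15.428 sits from the 16.0 expected of ideal random 32-bit nonces:

```python
import math

n = 467369              # valid candidate blocks in the quoted analysis
observed_mean = 15.428  # reported mean number of set nonce bits
expected_mean = 16.0    # 32 fair bits: mean 16, variance 32 * 0.25 = 8

se = math.sqrt(32 * 0.25 / n)  # standard error of the mean under H0
z = (observed_mean - expected_mean) / se
print(f"z = {z:.1f}")  # a huge deviation -- implausible for ideal nonces
```

Such a large |z| would be decisive evidence of non-randomness if the nonces were really uniform samples, which is why the provenance of the data matters so much here.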

See a discussion on lobste.rs. One possible explanation is a bias introduced by the miners.

IndieSolver
  • 2
    This is interesting. But the reply on lobste.rs is probably right. That's a huge bias, easily discoverable. The notion that it has gone unnoticed for this long is pretty far-fetched. – senderle Sep 11 '18 at 13:13
  • 1
    @senderle In order to exploit the bias (if any), one should come up with an algorithm (essentially a ML/optimization algorithm) which is computationally cheap so that its own overhead when implemented/measured on state-of-the-art hardware is compensated by the speedup it provides. My very rough guess would be that the factor in terms of #hashtrials should be greater than 10x to beat brute force and its superoptimized implementations. The implications might be very serious, especially for people betting on crypto and security protocols. – IndieSolver Sep 11 '18 at 13:25
4

I'll answer with a program. To reduce computational requirements I'll use a variant of sha256 I call sha16, which is just the first 16 bits of sha256.

#!/usr/bin/python3

import hashlib
from itertools import count

def sha16(plaintext):
    h = hashlib.sha256()
    h.update(plaintext)
    return h.hexdigest()[:4]

def has_plaintext_start_with_1(digest):
    """Return True if and only if the given digest can be generated from a
    plaintext whose first bit is 1."""
    return True  # always true: the search below finds such a plaintext for every digest

def plaintext_starting_with_1(digest):
    """Return a plaintext whose first bit is 1 that hashes to the given digest."""
    for c in count():
        plaintext = (b'\x80' + str(c).encode('ascii'))
        d = sha16(plaintext)
        if d == digest:
            return plaintext

for digest in range(0x10000):
    digest = "%04x" % (digest,)
    plain = plaintext_starting_with_1(digest)
    print("%s hashes to %s" % (plain, digest))

This produces the output:

b'\x8094207' hashes to 0000
b'\x8047770' hashes to 0001
b'\x8078597' hashes to 0002
b'\x8025129' hashes to 0003
b'\x8055307' hashes to 0004
b'\x80120019' hashes to 0005
b'\x8062700' hashes to 0006
b'\x8036411' hashes to 0007
b'\x80135953' hashes to 0008
b'\x8044091' hashes to 0009
b'\x808968' hashes to 000a
b'\x8039318' hashes to 000b
[...]

I'll leave the full proof as an exercise for the reader, but take my word for it: there's an input that starts with a "1" for each possible digest from 0000 to ffff.

There's also an input that doesn't start with "1". And there's one that starts with the complete works of Shakespeare, too.
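That complementary claim can be checked with the same machinery. A quick sketch (sha16 is redefined so the snippet stands alone; candidates now begin with a zero bit):

```python
import hashlib
from itertools import count

def sha16(plaintext):
    # Same 16-bit truncation of SHA-256 as in the program above
    return hashlib.sha256(plaintext).hexdigest()[:4]

def plaintext_starting_with_0(digest):
    """Return a plaintext whose first bit is 0 matching the given digest."""
    for c in count():
        # b'\x00' makes the first bit of every candidate a 0
        plaintext = b'\x00' + str(c).encode('ascii')
        if sha16(plaintext) == digest:
            return plaintext

print(plaintext_starting_with_0("0000"))
```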

This holds for any reasonably good hash function, although my brute force proof may become computationally infeasible.

Phil Frost
  • 141
  • 3
  • In mathematics, I don't like to _take your word for it_. Your program demonstrates that your sha16 function is surjective, but nothing more. You did not give a formal proof that this argument carries over to the actual SHA-256 function. By your style of conclusion, the [Collatz conjecture](https://en.wikipedia.org/wiki/Collatz_conjecture) would be solved because it has already been verified for 32 bits and the program can easily be run longer. – Roland Illig Sep 15 '18 at 16:13
4

What you describe is basically a pre-image attack. You're trying to find an input such that, when it is hashed, the output has some property like "a leading 1".*

It is an explicit goal of cryptographic hashes to prevent such pre-image attacks. If you can make such an attack, we tend to consider that algorithm to be insecure and stop using it.

So while that means it's not impossible, it means your machine learning algorithm would have to simultaneously outwit a large fraction of the mathematicians in the world, and their supercomputers. It is unlikely that you will do so.

However, if you did, you would become known as someone who broke a major cryptographic hash algorithm. That fame is worth something!

*Technically a "first preimage attack" tries to find a match for a specific hash. However, to show that a hash algorithm has first preimage attack resistence, they typically show that you can't find any meaningful information about the input from the hash.

Cort Ammon
  • 547
  • 2
  • 5
2

Almost all of the answers here are telling you why you can't do this, but here's a direct answer to:

Regardless if this is "Possible", what algorithm would be the best approach?

Assuming the input is sufficiently large:

  1. Take the count of the set of valid characters.
  2. Take the reciprocal of the number from step 1.

That's the probability that the input string starts with '1'. You don't even need to look at the input. If you can do better than that, it would mean the hash is very broken. You can save a lot of CPU cycles over trying to train an algorithm to pick random numbers.

If you trained an algorithm anyway, it might come up with a different answer because of overfitting. Unless there's something really wrong with the hash algorithm, though, that model will be wrong more often than one that simply picks a random value.
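The point can be demonstrated empirically. A sketch under an assumed input distribution (8-character strings over ten valid characters, so the recipe above gives $1/10$): a predictor that actually reads the hash lands at coin-flip accuracy, while the constant majority-class guess that ignores the hash does far better.

```python
import hashlib
import random

random.seed(0)

# Assumed input distribution (hypothetical): 8-character strings drawn
# uniformly from ten valid characters, so P(starts with "1") = 1/10.
samples = []
for _ in range(20000):
    plaintext = "".join(random.choice("0123456789") for _ in range(8))
    digest = hashlib.sha256(plaintext.encode()).hexdigest()
    samples.append((digest, plaintext.startswith("1")))

# A predictor that actually reads the hash: guess "starts with 1" whenever the
# first hex digit of the digest is odd. For a good hash this is uncorrelated
# with the label, so it scores about 0.5.
hash_hits = sum((int(d[0], 16) % 2 == 1) == label for d, label in samples)
hash_acc = hash_hits / len(samples)
print("hash-reading accuracy:   %.3f" % hash_acc)

# The recipe above, which never looks at the hash: always predict the majority
# class. With ten valid characters that scores about 1 - 1/10 = 0.9.
p_one = sum(label for _, label in samples) / len(samples)
print("constant-guess accuracy: %.3f" % max(p_one, 1 - p_one))
```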

JimmyJames
  • 119
  • 5
1

Hashing functions are purposefully designed to be difficult to model, so (as pointed out already) this is likely to be very difficult. Nevertheless, any weakness in the hashing function will reduce its entropy, making it more predictable.

Regardless if this is "Possible", what algorithm would be the best approach?

A useful example is the Physically Unclonable Function, or PUF - which is analogous to a hardware hashing function. Typically, manufacturing variations are purposefully used to give each PUF a slightly different response so that their 'hashed' output is different for a given input. Design weaknesses limit the entropy, however, and given enough challenge-response pairs it is often possible to build a black-box model of the PUF so that the response for a new, previously unseen challenge can be predicted.

Logistic regression is the most commonly used approach for these modelling attacks, such as in this paper by Rührmair.

Genetic algorithms (or more generally evolutionary strategies) may be an alternative approach, as they are applicable to problems that are not differentiable and/or linearly separable. They are also discussed in the above paper.
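As a toy illustration of such a modelling attack, here is a sketch on an idealized, noise-free arbiter PUF (simulated, following the standard additive-delay model rather than any real device): the response is a threshold on a linear function of the parity-transformed challenge, so plain logistic regression recovers it from observed challenge-response pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stages = 32

# Simulated device: under the additive-delay model, the response is
# sign(w . phi(challenge)) for a fixed, secret weight vector w_true.
w_true = rng.normal(size=n_stages + 1)

def parity_features(challenges):
    # phi_i = product over j >= i of (1 - 2*c_j), plus a constant bias term
    signs = 1 - 2 * challenges                          # map bits 0/1 to +1/-1
    phi = np.cumprod(signs[:, ::-1], axis=1)[:, ::-1]
    return np.hstack([phi, np.ones((len(challenges), 1))])

# Collect challenge-response pairs from the simulated device.
C_train = rng.integers(0, 2, size=(5000, n_stages))
X = parity_features(C_train)
y = (X @ w_true > 0).astype(float)

# Fit logistic regression by plain gradient ascent on the log-likelihood.
w = np.zeros(n_stages + 1)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w += 0.5 * X.T @ (y - p) / len(y)

# Predict responses for challenges the model has never seen.
C_test = rng.integers(0, 2, size=(2000, n_stages))
X_test = parity_features(C_test)
accuracy = np.mean(((X_test @ w) > 0) == ((X_test @ w_true) > 0))
print("accuracy on unseen challenges:", accuracy)
```

The parity transform is what makes the attack work: it turns the PUF's nested delay differences into a problem that is linearly separable in the transformed features.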

4Oh4
  • 891
  • 6
  • 6
1

Let's say that your plaintext/input is exactly one block long (512 bits = 1 block for SHA-256). The input space for it is $2^{512}$ and the hash space is $2^{256}$. For simplicity, let's take only the first $2^{256}$ inputs into consideration.

Now you train a machine learning algorithm (any algorithm of your choice) with a training set of size $2^{64}$, by hashing every number from $0$ to $2^{64}-1$ (this alone would take a lot of time and a gargantuan amount of space to store, but let's put that aside for a moment).

After training over such a massive training set, you might expect the model to work accurately, but no. The remaining $2^{256}-2^{64}$ input-hash pairs could be mapped in $(2^{256}-2^{64})!$ ways. Out of all those arrangements, only one is our SHA-256.

Let $S=2^{256}-2^{64}$ (the number of remaining input-hash pairs) and
$C=\frac{90}{100}*S$ (the number of correct mappings needed for 90% accuracy).
The probability of achieving even 90% accuracy with our model is (probability of $C$ correct mappings)*(probability of $S-C$ incorrect mappings) = $$(\frac{1}{S}*\frac{1}{S-1}*\frac{1}{S-2}...*\frac{1}{S-(C-1)} ) * (\frac{S-C-1}{S-C}*\frac{S-C-2}{S-C-1}*\frac{S-C-3}{S-C-2}...*\frac{1}{2}) = \frac{(S-C-1)!}{S!}$$

Plugging in the values, the probability that our model achieves 90% accuracy is $$= \frac{(\frac{1}{10}*(2^{256}-2^{64})-1)!}{(2^{256}-2^{64})!}$$ Taking logarithms and using Stirling's approximation for the factorials, the probability is $$\approx 2^{-(2^{263.9918466566}-2^{260.6509677217})}$$ $$\approx2^{-9.1322237391*2^{260.6509677217}}$$

Phew, that's a really small number. And this is an overestimate, since we have considered only the first $2^{256}$ inputs instead of the full $2^{512}$. The actual probability will be lower still.
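The Stirling step can be sanity-checked numerically by working entirely in log space, since $S \approx 2^{256} \approx 1.16\times10^{77}$ still fits in a double. A rough sketch:

```python
import math

# S and C from the derivation above, as floats (we only need their logarithms)
S = 2.0**256 - 2.0**64
C = 0.9 * S

def log2_factorial(n):
    # Stirling's approximation: ln n! ~ n*ln(n) - n
    return (n * math.log(n) - n) / math.log(2)

# log2 of (S - C - 1)! / S!, i.e. log2 of the probability derived above
log2_p = log2_factorial(S - C - 1) - log2_factorial(S)

# The probability is 2**log2_p; print the size of its (huge, negative) exponent
print("probability ~= 2**-(2**%.4f)" % math.log2(-log2_p))
```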

user96549
  • 111
  • 3
1

The problem is that "machine learning" isn't intelligent. It just tries to find patterns. In SHA-256, there are no patterns. There is nothing to find. Machine learning has no better chance than brute force.
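That absence of usable patterns can be illustrated with the avalanche effect: flipping a single input bit changes each of the 256 output bits with probability about one half, so nearby inputs give unrelated hashes. A minimal sketch:

```python
import hashlib

def sha256_int(data):
    # Digest as a 256-bit integer, for easy bitwise comparison
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

base = b"does the plaintext start with 1?"
diffs = []
for i in range(len(base)):
    flipped = bytearray(base)
    flipped[i] ^= 0x01                    # flip one bit of the input
    # Count output bits that changed (popcount of the XOR)
    diffs.append(bin(sha256_int(base) ^ sha256_int(bytes(flipped))).count("1"))

avg = sum(diffs) / len(diffs)
print("average output bits changed: %.1f of 256" % avg)
```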

If you want to crack SHA-256 by computer, the only possibility is to create real intelligence, and since lots of clever humans haven't found a way to crack SHA-256, you would need to create an artificial intelligence far beyond that of many clever humans. At that point, we don't know whether such a super-human intelligence would crack SHA-256, prove that it cannot be cracked, or decide that it is not clever enough to do either (just as humans have). The fourth possibility, of course, is that such a super-human artificial intelligence would not even bother, but would think about problems that are more important (to it).

gnasher729
  • 661
  • 4
  • 6