The judge doesn't really understand a hash well. They say things like "Google as...

chaps · on Oct 30, 2024

Erm, "Assigned" in this context is not new: https://law.justia.com/cases/federal/appellate-courts/ca5/17...

"More simply, a hash value is a string of characters obtained by processing the contents of a given computer file and assigning a sequence of numbers and letters that correspond to the file’s contents."

From 2018 in United States v. Reddick.

kwanbix · on Oct 30, 2024

The calculation is what assigns the value.

chaps · on Oct 30, 2024

No. The calculation is what determines what the assignation should be. It does not actually assign anything.

This FOIA litigation by ACLU v ICE goes into this topic quite a lot: https://caselaw.findlaw.com/court/us-2nd-circuit/2185910.htm...

toyg · on Oct 30, 2024

Yes, Google's calculation.

oefnak · on Oct 30, 2024

Did Google invent this hash?

chaps · on Oct 30, 2024

Why is that relevant? Google used a hashing function to persist a new record within a database. They created a record for this.

Like I said in a sib. comment, this FOIA lawsuit goes into questions of hashing pretty well: https://caselaw.findlaw.com/court/us-2nd-circuit/2185910.htm...

duxup · on Oct 30, 2024

Google at some point decided how to calculate that hash and that influences what the value is right? Assigned seems appropriate in that context?

Either way I think the judge's wording makes sense.

henryfjordan · on Oct 30, 2024

Google assigned the hashing algorithm (maybe, assuming it wasn't chosen in some law somewhere, I know this CSAM hashing is something the big tech work on together).

Once the hashing algorithm was assigned, individual values are computed or calculated.

I don't think the judge's wording is all that bad but the word "assigned" is making it sound like Google exercised some agency when really all it did was apply a pre-chosen algorithm.

uniformlyrandom · on Oct 30, 2024

Hash can be arbitrary, the only requirement is it is a deterministic one-way function.

Onavo · on Oct 30, 2024

And it should be mostly bijective under most conditions. (This is obviously impossible in practice but hashes with common collisions shouldn't be allowed as legal evidence). Also neural/visual hashes like those used by big tech makes things tricky.

bluGill · on Oct 30, 2024

The hash in question has many collisions. It it probably enough to get a warrant put it on a warrant, but it may not be enough to get a warrant without some other evidence. (it can be enough evidence to look for other public signs of evidence, or perhaps because there are a number of images that match different hashes)

axus · on Oct 30, 2024

There's a password on my Google account, I totally expect to have privacy for anything I didn't choose to share with other people.

The hash is kind of metadata recorded by Google, I feel like Google using it to keep child porn off their systems should be reasonable. Same ballpark as limiting my storage to 1GB based on file sizes. Sharing metadata without a warrant is a different question though.

cool_dude85 · on Oct 30, 2024

As should be expected from the lawyer world, it seems like whether you have an expectation of privacy using gmail comes down to very technical word choices in the ToS, which of course neither this guy nor anyone else has ever read. Specifically, it may be legally relevant to your expectation of privacy whether Google says they "may" or "will" scan for this stuff.

curiousllama · on Oct 30, 2024

Out of curiosity, what is false positive rate of a hash match?

If the FPR is comparable to asking a human "are these the same image?", then it would seem to be equivalent to a visual search. I wonder if (or why) human verification is actually necessary here.

EVa5I7bHFq9mnYK · on Oct 30, 2024

I doubt sha1 hashes are used for this. Those image hashes should match files regardless of orientation, cropping, resizing, re-compression, color correction etc. The collision could be far more frequent with these hashes.

bluGill · on Oct 30, 2024

The hash should ideally match even if you use photoshop to cut the one person out of the picture and put that person into a different photo. I'm not sure if that is possible, but that is what we want.

cool_dude85 · on Oct 30, 2024

The reason human verification is necessary is that the government is relying on something called the "private search" doctrine to conduct the search without a warrant. This doctrine allows them to repeat a search already conducted by a private party (i.e., Google) without getting a warrant. Since Google didn't actually look at the file, the government is not able to look at the file without a warrant, as that search exceeds the scope of the initial search Google performed.

ARandumGuy · on Oct 30, 2024

Naively, 1/(2^{hash_size_in_bits}). Which is about 1 in 4 billion odds for a 32 bit hash, and gets astronomically low at higher bit counts.

Of course, that's assuming a perfect, evenly distributed hash algorithm. And that's just the odds that any given pair of images has the same hash, not the odds that a hash conflict exists somewhere on the internet.

henryfjordan · on Oct 30, 2024

You need to know the input space as well as the output space (hash size).

If you have a 32bit hash but your input is only 16bit, you'll never have a collision (and you'll be wasting a ton of space on your hashes!).

Image files can get into the megabytes though, so unless the output hash is large the potential for collisions is probably not all that low.

Dylan16807 · on Oct 30, 2024

You do not need to know the input space.

Normal hash functions have pseudo-random outputs and they can collide even when the input space is much smaller than the output space.

In fact, I'll go run ten million values, encoded into 24 bits each, through a 40 bit hash and count the collisions. My hash of choice will be a truncated sha256.

... I got 49 collisions.

gorjusborg · on Oct 30, 2024

> Out of curiosity, what is false positive rate of a hash match?

No way to know without knowledge of the 'proprietary hashing technology'. Theoretically though, a hash can have infinitely many inputs that produce the same output.

Mismatching hash values from the same hashing algorithm can prove mismatching inputs, but matching hash values don't ensure matching inputs.

> I wonder if (or why) human verification is actually necessary here

It's not about frequency, it's about criticality of getting it right. If you are going to make a negatively life-altering report on someone, you'd better make sure the accusation is legitimate.

cool_dude85 · on Oct 30, 2024

I'd say the focus on hashing is a bit of a red herring.

Most anyone would agree that the hash matching should probably form probable cause for a warrant, allowing a judge to sign off on the police searching (i.e., viewing) the image. So, if it's a collision, the cops get a warrant and open up your linux ISO or cat meme, and it's all good. Probably the ideal case is that they get a warrant to search the specific image, and are only able to obtain a warrant to search your home and effects, etc. if the image does appear to be CSAM.

At issue here is the fact that no such warrant was obtained.

NoMoreNicksLeft · on Oct 30, 2024

> Most anyone would agree that the hash matching should probably form probable cause for a warrant

I disagree with this. Yes, if we were talking MD5, SHA, or some similar true hash algo, then the probability of a natural collision is small enough that I agree in principle.

But if the hash algo is of some other kind then I do not know enough about it to assert that it can justify probable cause. Anyone who agrees without knowing more about it is a fool.

cool_dude85 · on Oct 30, 2024

That's fair. I came away from reading the opinion that this was not a perceptual hash, but I don't think it is explicitly stated anywhere. I would have similar misgivings if indeed it is a perceptual hash.

thrwaway1985882 · on Oct 30, 2024

I think it'll prove far more likely that the government creates incentives to lead Google/other providers to fully do the search on their behalf.

The entire appeal seems to hinge on the fact that Google didn't actually view the image before passing it to NCMEC. Had Google policy been that all perceptual hash hits were reviewed by employees first, this would've likely been a one page denial.

ndriscoll · on Oct 30, 2024

If the hash algorithm were CRC8, then obviously it should not be probable cause for anything. If it were SHA-3, then it's basically proof beyond reasonable doubt of what the file is. It seems reasonable to question how collisions behave.

cool_dude85 · on Oct 30, 2024

I don't agree that it would be proof beyond reasonable doubt, especially because neither google nor law enforcement can produce the original image that got tagged.

Dylan16807 · on Oct 30, 2024

By original do you mean the one in the database or the one on the device?

If the device spit out the same SHA3, then either it had the exact same image, or the SHA3 was planted somehow. The idea that it's actually a different file is not a reasonable doubt. It's too unlikely.

cool_dude85 · on Oct 31, 2024

By the original, I mean the image that was used to produce the initial hash, which Google (rightly) claimed to be CSAM. Without some proof that an illicit image that has the same hash exists, I wouldn't accept a claim based on hash alone.

Dylan16807 · on Oct 31, 2024

Oh definitely you need someone to examine the image that was put in the database to show it's CSAM, if the legal argument depends on that. But that's an entirely different question from whether the image on the device is that image.

nokcha · on Oct 30, 2024

For non-broken cryptographic hashes (e.g., SHA-256), the false-positive rate is negligible. Indeed, cryptographic hashes were designed so that even nation-state adversaries do not have the resources to generate two inputs that hash to the same value.