The statistical improbability of a 5-segment match

From: johnvertical@...
Message: 67014
Date: 2010-12-31

At Torsten's request.

For a lower bound of possible 5-segment structures, suppose 4 possible vowels, 20 possible onsets, and 100 possible medials (clusters included). That makes 32000 possible bisyllabic CV(C/V)CV roots/stems. Let's approximate that one tenth of these, so 3200, are actually used.
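The count above can be checked with a few lines of Python. This is only a sketch of the arithmetic in the paragraph; the names `possible_shapes` and `USED_DIVISOR` are made up for illustration, and the inventory sizes are the rough guesses given above, not measured values.

```python
# Inventory sizes assumed in the paragraph above (rough guesses, not data).
ONSETS = 20
VOWELS = 4
MEDIALS = 100      # medial consonants, clusters included
USED_DIVISOR = 10  # assumption: one tenth of the possible shapes are in use


def possible_shapes(onsets: int, vowels: int, medials: int) -> int:
    """Count CV(C/V)CV shapes: onset * vowel * medial * vowel."""
    return onsets * vowels * medials * vowels


total = possible_shapes(ONSETS, VOWELS, MEDIALS)
used = total // USED_DIVISOR
print(total, used)
```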

Semantics-wise, the Gmc meaning we were discussing was clearly "loose"; the Uralic side has meanings like "mild, weak, slack, loose", so to be generous let's say there is a 10% chance that these derive from an original meaning of "loose". That is: our semantic leeway is that each word in language A may be compared with ten different words in language B.

Phonetic correspondence-wise, let's be generous as well and say a word in lang A will phonetically match "exactly", on average, ten different word-shapes of lang B.

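Taking one of the 3200 roots in lang A, first let's calculate how many of the ten corresponding word-shapes actually exist in lang B. Each shape is in use with probability 0.1 (one tenth of the possible shapes), so the count is binomially distributed: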
0 words: 0.9^10 ≈ 34.9%
1 word: C(10,1)*0.1*0.9^9 ≈ 38.7%
2 words: C(10,2)*0.1^2*0.9^8 ≈ 19.4%
3 words: C(10,3)*0.1^3*0.9^7 ≈ 5.7%
4 words: C(10,4)*0.1^4*0.9^6 ≈ 1.1%
5 words: C(10,5)*0.1^5*0.9^5 ≈ 0.15%
6 words or more: even smaller

The expectation value is 0.387 + 2*0.194 + 3*0.057 + 4*0.011 + 5*0.0015 (plus even smaller terms) ≈ 0.998, i.e. very close to 1. (The exact binomial mean is 10*0.1 = 1.)
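For anyone who wants to reproduce the table and the expectation value, here is a minimal Python sketch. It assumes, as the calculation above does, that each of the ten candidate word-shapes exists in lang B independently with probability 0.1; `binomial_pmf` is just an illustrative helper name.

```python
from math import comb

N_SHAPES = 10  # phonetically acceptable word-shapes per root (assumed above)
P_USED = 0.1   # chance that any given shape is actually in use


def binomial_pmf(k: int, n: int = N_SHAPES, p: float = P_USED) -> float:
    """Probability that exactly k of the n candidate shapes exist."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Full distribution over 0..10 existing shapes.
pmf = [binomial_pmf(k) for k in range(N_SHAPES + 1)]

# Expected number of existing shapes: sum of k * P(k).
expected = sum(k * pk for k, pk in enumerate(pmf))

for k, pk in enumerate(pmf[:6]):
    print(f"{k} words: {pk:.2%}")
print(f"expected matches: {expected:.3f}")
```

Summing over all eleven terms rather than only the first six gives the mean of 1 directly, without the small truncation error of the hand calculation.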

The chance of the one expected phonetically acceptable root (out of 3200) being one of the 10 semantically acceptable roots is thus 10/3200 = 1/320. So: not very likely.
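The final step is a single division, sketched here in Python with exact fractions. The constant names are illustrative only, and the inputs are the assumed figures from this post (3200 used roots, 10 semantically acceptable ones, 1 expected phonetic match), not measured values.

```python
from fractions import Fraction

USED_ROOTS = 3200              # roots assumed to be in use in lang B
SEMANTIC_CANDIDATES = 10       # roots with an acceptable meaning
EXPECTED_PHONETIC_MATCHES = 1  # expected number of phonetically matching roots

# Chance that the expected phonetic match lands among the semantic candidates.
chance = Fraction(EXPECTED_PHONETIC_MATCHES * SEMANTIC_CANDIDATES, USED_ROOTS)
print(chance, float(chance))
```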

John Vertical