Big-endian is better than little-endian

menotim

Big-endian is better than little-endian

post by Menotim · 2024-04-29T02:30:48.053Z · LW · GW · 17 comments

17 comments

This is a response to the post We Write Numbers Backward, in which lsusr argues that little-endian numerical notation is better than big-endian.^[1] I believe this is wrong, and big-endian has a significant advantage not considered by lsusr.

Lsusr describes reading the number "123" in little-endian, using the following algorithm:

Read the first digit, multiply it by its order of magnitude (one), and add it to the total. (Running total: ??? one.)
Read the second digit, multiply it by its order of magnitude (ten), and add it to the total. (Running total: ??? twenty one.)
Read the third digit, multiply it by its order of magnitude (one hundred), and add it to the total. (Arriving at three hundred and twenty one.)

He compares it with two algorithms for reading a big-endian number. One is using the same process as for a little-endian number, but from right to left. I agree with him that this is worse than the little-endian algorithm, because it is easier to read a number in the same direction as the text that surrounds it, which in English is from left to right.

The other big-endian algorithm is the one I observe myself as usually using. For "321", it is:

Count the digits (three), and convert that into an order of magnitude (one hundred). (Running total: ??? hundred ???.)
Read the first digit, multiply it by its order of magnitude (one hundred), and add it to the total. (Running total: three hundred ???.)
Read the second digit, multiply it by its order of magnitude (ten), and add it to the total. (Running total: three hundred and twenty ???.
Read the third digit, multiply it by its order of magnitude (one), and add it to the total. (Arriving at three hundred and twenty one.)

The point raised by lsusr against the big-endian algorithm is that we must count a number's digits before we can start reading them. He doesn't say explicitly why he dislikes this, but I can see three reasons:

It makes the algorithm more complex.
It makes the algorithm slower.
It means we cannot begin the algorithm if we only have access to the beginning of a number.

Making the algorithm more complex is bad, but not very bad, because it is still fairly simple. It being slow to count all the digits in a number is a real problem, but we can usually solve it by separating groups of digits using commas or by using exponential notation. Finally, only having access the beginning of a number is not a very common situation in day-to-day life.

So these problems might not be that important, but they are still problems, so, if they were the only consideration, little-endian would be better. Then, what other advantage does big-endian have over little-endian?

Though it is not common for us to not be able to process the entire representation of a number, we often have reason not to need to. Numbers represent quantities, and sometimes we only want to know an approximation, not an exact quantity.

For example, if I look up the population of India, Wikipedia will tell me it was estimated to be 1,428,627,663 people (in big-endian notation), but I will usually have no reason not to think of it as "about 1.4 billion". By running the big-endian algorithm only partially, this is exactly what we get: an estimate of a number to some order of magnitude.

By contrast, after running the little-endian algorithm partially, we find the number's value modulo a power of ten. In most situations, that is completely useless. Besides, since the data on the population of India is actually an estimate from 2023, in that example, we also can be pretty sure the least significant digits aren't even accurate.

What if you are not a person, but a computer, converting a string into an integer? In that case, having a simpler and faster algorithm is important, having to start with only the beginning of a string (what the user has typed so far) is plausible, and knowing the number's approximate value is useless. So in this case the little-endian algorithm is much better than the big-endian one.

But there is another algorithm that can be used by a computer for parsing big-endian numbers. Operating on "321":

Read the first digit and add it to the total. (Running total: three.)
Read the second digit. Multiply the total by ten, and add the digit to the total. (Running total: thirty two.)
Read the third digit. Multiply the total by ten, and add the digit to the total. (Arriving at three hundred and twenty one.)

This algorithm operates sequentially on the string, and it is even simpler and faster than the little-endian algorithm, because it doesn't have to keep track of the order of magnitude. So for computers, too, reading big-endian is easier.^[2]

So why do humans use the previous algorithm to read numbers, instead of this one? For the same reason we prefer big-endian to little-endian: successive approximations that narrow down on a number are more useful than operations for which the state in the middle is useless.

Lsusr's article ends by claiming the inventor of Arabic numerals knew little-endian numbers were better, and used them, because Arabic is written right-to-left. But positional decimal notation was not invented by the Arabs. It was invented by the Hindus, and brought to Europe by the Arabs. And the Hindus used the Brahmi script, which is written left-to-right. Therefore, the inventor of the Hindu-Arabic numeric system used big-endian notation.

^{^}
Lsusr's post has a good explanation of what little-endian and big-endian mean, so I won't repeat it here.
^{^}
Of course, there could be an even simpler little-endian algorithm I can't think of. If you know one, let me know.

17 comments

Comments sorted by top scores.

comment by det (o-zewe) · 2024-04-29T12:37:29.979Z · LW(p) · GW(p)

More evidence in favor of big-endian: In modern Hebrew and Arabic, numbers are written in the same direction as in English: e.g.

.שטחה של המדינה הוא 22,072 קמ"ר

As a native English speaker (and marginal Hebrew reader), I read each word in that Hebrew sentence right-to-left and then read the number left-to-right.

I never considered the possibility that native Hebrew speakers might read the number from right to left, in a little-endian way. But my guess is (contra lsusr) nobody does this: when my keyboard is in Hebrew-entry mode, it still writes numbers left-to-right.^[1]

This indicates that even when you give little-endian an advantage, in practice big-endian still wins out.

^{^}
I also tested in Arabic-entry mode, and it does the same even when using the Eastern Arabic numerals, e.g ١٢٣٤٥٦٧٨٩.

It's hard to Google for this, but this indicates that modern Arabic also treats numbers as left-to-right big-endian [I just verified with an Arabic speaker that this is indeed the case]. It's possible this was different historically, but I'm having a hard time Googling to find out either way.

Replies from: philh

↑ comment by philh · 2024-05-16T16:40:16.850Z · LW(p) · GW(p)

Isn't this showing that Hebrew and Arabic write numbers little-endian? Surely big-versus-little-endian isn't about left-to-right or right-to-left, it's about how numbers flow relative to word reading order.

Replies from: o-zewe

↑ comment by det (o-zewe) · 2024-05-30T11:09:12.960Z · LW(p) · GW(p)

But Hebrew and Arabic speakers read numbers left to right (even though they read everything else right to left). So they treat the numbers as big-endian.

Replies from: philh

↑ comment by philh · 2024-05-31T13:46:45.334Z · LW(p) · GW(p)

Ah, thanks, I see now. You're saying that even if it's written with the small end before the big end according to the way the words flow, the direction of eye scanning and of mentally parsing and of giving a name to the number is still big end before small end? Similarly I might write a single word sdrawkcab in English text but the reader would still read it first-letter-to-last-letter.

Curious, when handwriting, what order do you write in?

comment by Ben (ben-lang) · 2024-04-29T12:31:59.797Z · LW(p) · GW(p)

How does a little endian do a decimal point? Do they put the fractional part of the number at the beginning (before the decimal) and the integer part afterwards? Eg. 123.456 becomes 654.321? So just like all integers in big-endian notation can be imaged to have a trailing ".0" they can all be imagined to have a leading "0." in little-endian?

The way we do it currently has the nice feature that the powers of 10 keep going in the same direction (smaller) through a decimal point. To maintain this feature a little-endian requires that everything before the decimal point is the sub-integer component. Which has the feature lsusr doesn't like that if we are reading character by character the decimal forces us to re-interpret all previous characters.

[Edited to get the endians the right way around]

Replies from: Menotim, JBlack

↑ comment by Menotim · 2024-04-29T12:59:53.564Z · LW(p) · GW(p)

You're mixing up big-endian and little-endian. Big-endian is the notation used in English: twelve is 12 in big-endian and 21 in little-endian. But yes, 123.456 in big-endian would be 654.321 and with a decimal point, you couldn't parse little-endian numbers in the way described by lsusr.

Replies from: ben-lang

↑ comment by Ben (ben-lang) · 2024-04-29T13:14:32.438Z · LW(p) · GW(p)

You are right.

I thought the whole idea with the naming was that the convention whereby "twelve is written 12" the symbol at the end "2" is the one symbolising the littlest bit, so I thought it was called "little endian" for that reason.

I now I have a lot of questions about how the names were chosen (to wikipedia!). It seems really backwards.

Replies from: Menotim

↑ comment by Menotim · 2024-04-29T16:10:13.066Z · LW(p) · GW(p)

I had the same confusion when I first heard those names. It's called little-endian because "you start with the little end", and the term comes from an analogy to Gulliver

↑ comment by JBlack · 2024-04-29T13:20:10.244Z · LW(p) · GW(p)

Ordinary numerals in English are already big-endian: that is, the digits with largest ("big") positional value are first in reading order. The term (with this meaning) is most commonly applied to computer representation of numbers, having been borrowed from the book Gulliver's Travels in which part of the setting involves bitter societal conflict about which end of an egg one should break in order to start eating it.

comment by fitw · 2024-04-29T06:27:20.535Z · LW(p) · GW(p)

AFAIK, Indians used the little endian notation. Arabs reversed it since they were writing from left to right, but the Europeans did not reverse what the Arabs did. Today of course Indians follow the west in using big endian, but the little endian practice reflects in, for instance, the ka-ṭa-pa-yā-di system used in Indian musicology ( https://en.wikipedia.org/wiki/Katapayadi_system ).

Replies from: Menotim

↑ comment by Menotim · 2024-04-29T11:04:17.063Z · LW(p) · GW(p)

Katapayadi does seem to be little endian, but the examples I found on Wikipedia of old Indian numerals and their predecessor, Brahmi numerals, seem to be big-endian.

Replies from: fitw

↑ comment by fitw · 2024-04-29T14:31:46.766Z · LW(p) · GW(p)

Good point. I didn't know these examples, so my comment is at least partially wrong. I am puzzled about Brahmi numerals: that is supposedly Ashokan Brahmi, but India did not have numerals at the time of Ashoka.

comment by localdeity · 2024-04-30T00:03:48.768Z · LW(p) · GW(p)

One aspect neither of you have explicitly addressed is the speaking of numbers; speaking, after all, predates writing. We say "one billion, four hundred twenty-eight million, [...]".

Given that that's what we say, the first two pieces of information we need are "one" and "billion". More generally, we need to get the first 1-3 digits (the leftmost comma-separated group), then we need the magnitude, then we can proceed reading off all remaining digits.

Given that the magnitude is not explicitly written down, we get it by counting the digits. If the digits are comma-separated into groups of 3 (and "right-justified", so that if there are 3n+1 or 3n+2 digits, then the extra 1-2 are the leftmost group), then it's generally possible to get the magnitude from your "peripheral vision" (as opposed to counting them one by one) for numbers less than, say, 1 billion, which are what you'd most often encounter; like, "52" vs "52,193" vs "52,193,034", you don't need to count carefully to distinguish those. (It gets harder around 52,193,034,892 vs 52,193,034,892,110, but manually handling those numbers is rare.) So if getting the magnitude is a mostly free operation, then you might as well just present the digits left-to-right for people who read left-to-right.

Now, is it sensible that we speak "one billion, four hundred twenty-eight million, [...]"? Seems fine to me. It presents the magnitude and the most significant digits first (and essentially reminds you of the magnitude every 3 digits), and either the speaker or the listener can cut it off at any point and have an estimate accurate to as many digits as they care for. (That is essentially the use case of "partially running the algorithm" you describe.) I think I'd hate listening to "six hundred sixty three, six hundred twenty-seven thousand, four hundred twenty-eight million, and one billion", or even suffixes of it like "four hundred twenty eight million and one billion". Tell me the important part first!

comment by quiet_NaN · 2024-04-29T23:19:04.870Z · LW(p) · GW(p)

I think that it is obvious that Middle-Endianness is a satisfactory compromise between Big and Little Endian.

More seriously, it depends on what you want to do with the number. If you want to use it in a precise calculation, such as adding it to another number, you obviously want to process the least significant digits of the inputs first (which is what bit serial processors literally do).

If I want to know if a serially transmitted number is below or above a threshold, it would make sense to transmit it MSB first (with a fixed length).

Of course, using integers to count the number of people in India seems like using the wrong tool for the job to me altogether. Even if you were an omniscient ASI, this level of precision would require you to have clear standards at what time a human counts as born and at least provide a second-accurate timestamp or something. Few people care if the population in India was divisible by 17 at any fixed point in time, which is what we would mostly use integers for.

The natural type for the number of people in India (as opposed to the number of people in your bedroom) would be a floating point number.

And the correct way to specify a floating point number is to start with the exponent, which is the most important part. You will need to parse all of the bits of the exponent either way to get an idea of the magnitude of the number (unless we start encoding the exponent as a floating point number, again.)

The next most important thing is the sign bit. Then comes the mantissa, starting with the most significant bit.

So instead of writing

The electric charge of the electron is .

What we should write is:

The electric charge of the electron is $C \times 10^{- 19} \times - 1.602176634.$

Standardizing for a shorter form (1.6e-19 C --> ??) is left as an exercise to the reader, as are questions about the benefits we get from switching to base-2 exponentials (base-e exponentials do not seem particularly handy, I kind of like using the same system of digits for both my floats and my ints) and omitting the then-redundant one in front of the dot of the mantissa.

comment by jbash · 2024-04-29T15:23:09.733Z · LW(p) · GW(p)

Interestingly enough, the terms "big-endian" and "little-endian" were actually coined as a way of mocking people for debating this (in the context of computer byte order).

comment by Yair Halberstadt (yair-halberstadt) · 2024-04-30T05:34:22.654Z · LW(p) · GW(p)

What if you are not a person, but a computer, converting a string into an integer? In that case, having a simpler and faster algorithm is important, having to start with only the beginning of a string (what the user has typed so far) is plausible, and knowing the number's approximate value is useless. So in this case the little-endian algorithm is much better than the big-endian one.

For the most part this is irrelevant. If you are a computer with only partial input, recieving the rest of the input is so much slower than parsing it that it literally does not matter which algorithm you use. If you have all the input, either is equally fast. Also, most of the complexity is going to be validating the input and handling edge cases (e.g commas Vs decimal points), not keeping track of the total.

There could be some weird edge cases, where you're reading a huge number, for some reason stored as a string, in a known format, don't require validation, from an extremely fast storage mechanism, e.g. RAM or an SSD. In that case you might care about performance of keeping track of the running total, but weird edge cases shouldn't decide which format to use (and in practice I guess they'd both be essentially identically fast here too).

Computers also need to decide whether to use big Indian or little Indian for their native representation of numbers. But either way they'll be using specialized hardware to operate on them, and both will be equally fast (and for the most part invisible to the user).

comment by localdeity · 2024-04-29T23:06:39.937Z · LW(p) · GW(p)

The other big-endian algorithm is the one I observe myself as usually using. For "321", it is:
Count the digits (three), and convert that into an order of magnitude (one hundred). (Running total: ??? hundred ???.)
Read the first digit, multiply it by its order of magnitude (one hundred), and add it to the total. (Running total: three hundred ???.)
Read the second digit, multiply it by its order of magnitude (ten), and add it to the total. (Running total: three hundred and twenty ???.
Read the third digit, multiply it by its order of magnitude (one), and add it to the total. (Arriving at three hundred and twenty one.)

I generally agree, except I find words like "multiply" and "add" a bit misleading to use in this context. If I read a number like 3,749,328, then it's not like I take 3 million, and then take 7, multiply by 100,000, and get 700,000, and then perform a general-purpose addition operation and compute the subtotal of 3,700,000. First of all, "multiply by 100,000" is generally more like "Shift left by 5 (in our base-10 representation)"; but moreover, the whole operation is more like a "Set the nth digit of the number to be this". If this were a computer working in base 2, "set nth digit" would be implemented as "mask out the nth bit of the current number [though in this case we know it's already 0 and can skip this step], then take the input bit, shift left by n, and OR it with the current number".

(In this context I find it a bit misleading to say that "One hundred plus twenty yields one hundred and twenty" is performing an addition operation, any more than "x plus y yields x+y" counts as performing addition. Because 100, by place-value notation, means 1 * 100, and 20 means 2 * 10, and 120 means 1 * 100 + 2 * 10, so you really are just restating the input.)

Also, I might switch the order of the first two steps in practice. "Three ... [pauses to count digits] million, seven hundred forty-nine thousand, ...".

Big-endian is better than little-endian

Contents

17 comments