Revisit the English version (#1835)

* Review the English version using Claude-4.5.

* Update mkdocs.yml

* Align the section titles.

* Bug fixes
This commit is contained in:
Yudong Jin
2025-12-30 17:54:01 +08:00
committed by GitHub
parent 091afd38b4
commit 45e1295241
106 changed files with 4195 additions and 3398 deletions
@@ -1,66 +1,66 @@
# Basic data types
# Basic Data Types
When discussing data in computers, various forms like text, images, videos, voice and 3D models comes to mind. Despite their different organizational forms, they are all composed of various basic data types.
When we talk about data in computers, we think of various forms such as text, images, videos, audio, 3D models, and more. Although these data are organized in different ways, they are all composed of various basic data types.
**Basic data types are those that the CPU can directly operate on** and are directly used in algorithms, mainly including the following.
**Basic data types are types that the CPU can directly operate on**, and they are directly used in algorithms, mainly including the following.
- Integer types: `byte`, `short`, `int`, `long`.
- Floating-point types: `float`, `double`, used to represent decimals.
- Character type: `char`, used to represent letters, punctuation, and even emojis in various languages.
- Boolean type: `bool`, used to represent "yes" or "no" decisions.
- Integer types `byte`, `short`, `int`, `long`.
- Floating-point types `float`, `double`, used to represent decimal numbers.
- Character type `char`, used to represent letters, punctuation marks, and even emojis in various languages.
- Boolean type `bool`, used to represent "yes" and "no" judgments.
**Basic data types are stored in computers in binary form**. One binary digit is 1 bit. In most modern operating systems, 1 byte consists of 8 bits.
**Basic data types are stored in binary form in computers**. One binary bit is $1$ bit. In most modern operating systems, $1$ byte consists of $8$ bits.
The range of values for basic data types depends on the size of the space they occupy. Below, we take Java as an example.
The range of values for basic data types depends on the size of the space they occupy. Below is an example using Java.
- The integer type `byte` occupies 1 byte = 8 bits and can represent $2^8$ numbers.
- The integer type `int` occupies 4 bytes = 32 bits and can represent $2^{32}$ numbers.
- Integer type `byte` occupies $1$ byte = $8$ bits, and can represent $2^{8}$ numbers.
- Integer type `int` occupies $4$ bytes = $32$ bits, and can represent $2^{32}$ numbers.
The following table lists the space occupied, value range, and default values of various basic data types in Java. While memorizing this table isn't necessary, having a general understanding of it and referencing it when required is recommended.
The following table lists the space occupied, value ranges, and default values of various basic data types in Java. You don't need to memorize this table; a general understanding is sufficient, and you can refer to it when needed.
<p align="center"> Table <id> &nbsp; Space occupied and value range of basic data types </p>
<p align="center"> Table <id> &nbsp; Space occupied and value ranges of basic data types </p>
| Type | Symbol | Space Occupied | Minimum Value | Maximum Value | Default Value |
| ------- | -------- | -------------- | ------------------------ | ----------------------- | -------------- |
| Integer | `byte` | 1 byte | $-2^7$ ($-128$) | $2^7 - 1$ ($127$) | 0 |
| | `short` | 2 bytes | $-2^{15}$ | $2^{15} - 1$ | 0 |
| | `int` | 4 bytes | $-2^{31}$ | $2^{31} - 1$ | 0 |
| | `long` | 8 bytes | $-2^{63}$ | $2^{63} - 1$ | 0 |
| Float | `float` | 4 bytes | $1.175 \times 10^{-38}$ | $3.403 \times 10^{38}$ | $0.0\text{f}$ |
| | `double` | 8 bytes | $2.225 \times 10^{-308}$ | $1.798 \times 10^{308}$ | 0.0 |
| Char | `char` | 2 bytes | 0 | $2^{16} - 1$ | 0 |
| Boolean | `bool` | 1 byte | $\text{false}$ | $\text{true}$ | $\text{false}$ |
| Type | Symbol | Space Occupied | Minimum Value | Maximum Value | Default Value |
| ---------- | -------- | -------------- | ------------------------ | ----------------------- | -------------- |
| Integer | `byte` | 1 byte | $-2^7$ ($-128$) | $2^7 - 1$ ($127$) | $0$ |
| | `short` | 2 bytes | $-2^{15}$ | $2^{15} - 1$ | $0$ |
| | `int` | 4 bytes | $-2^{31}$ | $2^{31} - 1$ | $0$ |
| | `long` | 8 bytes | $-2^{63}$ | $2^{63} - 1$ | $0$ |
| Float | `float` | 4 bytes | $1.175 \times 10^{-38}$ | $3.403 \times 10^{38}$ | $0.0\text{f}$ |
| | `double` | 8 bytes | $2.225 \times 10^{-308}$ | $1.798 \times 10^{308}$ | $0.0$ |
| Character | `char` | 2 bytes | $0$ | $2^{16} - 1$ | $0$ |
| Boolean | `bool` | 1 byte | $\text{false}$ | $\text{true}$ | $\text{false}$ |
Please note that the above table is specific to Java's basic data types. Every programming language has its own data type definitions, which might differ in space occupied, value ranges, and default values.
Please note that the above table is specific to Java's basic data types. Each programming language has its own data type definitions, and their space occupied, value ranges, and default values may vary.
- In Python, the integer type `int` can be of any size, limited only by available memory; the floating-point `float` is double precision 64-bit; there is no `char` type, as a single character is actually a string `str` of length 1.
- C and C++ do not specify the size of basic data types, it varies with implementation and platform. The above table follows the LP64 [data model](https://en.cppreference.com/w/cpp/language/types#Properties), used for Unix 64-bit operating systems including Linux and macOS.
- The size of `char` in C and C++ is 1 byte, while in most programming languages, it depends on the specific character encoding method, as detailed in the "Character Encoding" chapter.
- Even though representing a boolean only requires 1 bit (0 or 1), it is usually stored in memory as 1 byte. This is because modern computer CPUs typically use 1 byte as the smallest addressable memory unit.
- In Python, the integer type `int` can be of any size, limited only by available memory; the floating-point type `float` is double-precision 64-bit; there is no `char` type, a single character is actually a string `str` of length 1.
- C and C++ do not explicitly specify the size of basic data types, which varies by implementation and platform. The above table follows the LP64 [data model](https://en.cppreference.com/w/cpp/language/types#Properties), which is used in Unix 64-bit operating systems including Linux and macOS.
- The size of character `char` is 1 byte in C and C++, and in most programming languages it depends on the specific character encoding method, as detailed in the "Character Encoding" section.
- Even though representing a boolean value requires only 1 bit ($0$ or $1$), it is usually stored as 1 byte in memory. This is because modern computer CPUs typically use 1 byte as the minimum addressable memory unit.
So, what is the connection between basic data types and data structures? We know that data structures are ways to organize and store data in computers. The focus here is on "structure" rather than "data".
So, what is the relationship between basic data types and data structures? We know that data structures are ways of organizing and storing data in computers. The subject of this statement is "structure", not "data".
If we want to represent "a row of numbers", we naturally think of using an array. This is because the linear structure of an array can represent the adjacency and the ordering of the numbers, but whether the stored content is an integer `int`, a decimal `float`, or a character `char`, is irrelevant to the "data structure".
If we want to represent "a row of numbers", we naturally think of using an array. This is because the linear structure of an array can represent the adjacency and order relationships of numbers, but the content stored—whether integer `int`, floating-point `float`, or character `char`is unrelated to the "data structure".
In other words, **basic data types provide the "content type" of data, while data structures provide the "way of organizing" data**. For example, in the following code, we use the same data structure (array) to store and represent different basic data types, including `int`, `float`, `char`, `bool`, etc.
In other words, **basic data types provide the "content type" of data, while data structures provide the "organization method" of data**. For example, in the following code, we use the same data structure (array) to store and represent different basic data types, including `int`, `float`, `char`, `bool`, etc.
=== "Python"
```python title=""
# Using various basic data types to initialize arrays
# Initialize arrays using various basic data types
numbers: list[int] = [0] * 5
decimals: list[float] = [0.0] * 5
# Python's characters are actually strings of length 1
# In Python, characters are actually strings of length 1
characters: list[str] = ['0'] * 5
bools: list[bool] = [False] * 5
# Python's lists can freely store various basic data types and object references
# Python lists can freely store various basic data types and object references
data = [0, 0.0, 'a', False, ListNode(0)]
```
=== "C++"
```cpp title=""
// Using various basic data types to initialize arrays
// Initialize arrays using various basic data types
int numbers[5];
float decimals[5];
char characters[5];
@@ -70,7 +70,7 @@ In other words, **basic data types provide the "content type" of data, while dat
=== "Java"
```java title=""
// Using various basic data types to initialize arrays
// Initialize arrays using various basic data types
int[] numbers = new int[5];
float[] decimals = new float[5];
char[] characters = new char[5];
@@ -80,7 +80,7 @@ In other words, **basic data types provide the "content type" of data, while dat
=== "C#"
```csharp title=""
// Using various basic data types to initialize arrays
// Initialize arrays using various basic data types
int[] numbers = new int[5];
float[] decimals = new float[5];
char[] characters = new char[5];
@@ -90,7 +90,7 @@ In other words, **basic data types provide the "content type" of data, while dat
=== "Go"
```go title=""
// Using various basic data types to initialize arrays
// Initialize arrays using various basic data types
var numbers = [5]int{}
var decimals = [5]float64{}
var characters = [5]byte{}
@@ -100,7 +100,7 @@ In other words, **basic data types provide the "content type" of data, while dat
=== "Swift"
```swift title=""
// Using various basic data types to initialize arrays
// Initialize arrays using various basic data types
let numbers = Array(repeating: 0, count: 5)
let decimals = Array(repeating: 0.0, count: 5)
let characters: [Character] = Array(repeating: "a", count: 5)
@@ -110,14 +110,14 @@ In other words, **basic data types provide the "content type" of data, while dat
=== "JS"
```javascript title=""
// JavaScript's arrays can freely store various basic data types and objects
// JavaScript arrays can freely store various basic data types and objects
const array = [0, 0.0, 'a', false];
```
=== "TS"
```typescript title=""
// Using various basic data types to initialize arrays
// Initialize arrays using various basic data types
const numbers: number[] = [];
const characters: string[] = [];
const bools: boolean[] = [];
@@ -126,7 +126,7 @@ In other words, **basic data types provide the "content type" of data, while dat
=== "Dart"
```dart title=""
// Using various basic data types to initialize arrays
// Initialize arrays using various basic data types
List<int> numbers = List.filled(5, 0);
List<double> decimals = List.filled(5, 0.0);
List<String> characters = List.filled(5, 'a');
@@ -136,9 +136,9 @@ In other words, **basic data types provide the "content type" of data, while dat
=== "Rust"
```rust title=""
// Using various basic data types to initialize arrays
// Initialize arrays using various basic data types
let numbers: Vec<i32> = vec![0; 5];
let decimals: Vec<f32> = vec![0.0, 5];
let decimals: Vec<f32> = vec![0.0; 5];
let characters: Vec<char> = vec!['0'; 5];
let bools: Vec<bool> = vec![false; 5];
```
@@ -146,7 +146,7 @@ In other words, **basic data types provide the "content type" of data, while dat
=== "C"
```c title=""
// Using various basic data types to initialize arrays
// Initialize arrays using various basic data types
int numbers[10];
float decimals[10];
char characters[10];
@@ -156,15 +156,38 @@ In other words, **basic data types provide the "content type" of data, while dat
=== "Kotlin"
```kotlin title=""
// Initialize arrays using various basic data types
val numbers = IntArray(5)
val decinals = FloatArray(5)
val characters = CharArray(5)
val bools = BooleanArray(5)
```
=== "Ruby"
```ruby title=""
# Ruby lists can freely store various basic data types and object references
data = [0, 0.0, 'a', false, ListNode(0)]
```
=== "Zig"
```zig title=""
// Using various basic data types to initialize arrays
var numbers: [5]i32 = undefined;
var decimals: [5]f32 = undefined;
var characters: [5]u8 = undefined;
var bools: [5]bool = undefined;
const hello = [5]u8{ 'h', 'e', 'l', 'l', 'o' };
// The above code demonstrates how to define a literal array, where you can choose to specify the array size or use _ instead. When using _, Zig will attempt to automatically calculate the array length
const matrix_4x4 = [4][4]f32{
[_]f32{ 1.0, 0.0, 0.0, 0.0 },
[_]f32{ 0.0, 1.0, 0.0, 1.0 },
[_]f32{ 0.0, 0.0, 1.0, 0.0 },
[_]f32{ 0.0, 0.0, 0.0, 1.0 },
};
// Multidimensional arrays (matrices) are actually nested arrays, and we can easily create a multidimensional array
const array = [_:0]u8{ 1, 2, 3, 4 };
// Define a sentinel-terminated array. Essentially, this is to be compatible with C's specified string terminator \0. We use the syntax [N:x]T to describe an array with elements of type T and length N, where the value at index N should be x
```
??? pythontutor "Visualized Execution"
https://pythontutor.com/render.html#code=class%20ListNode%3A%0A%20%20%20%20%22%22%22%E9%93%BE%E8%A1%A8%E8%8A%82%E7%82%B9%E7%B1%BB%22%22%22%0A%20%20%20%20def%20__init__%28self,%20val%3A%20int%29%3A%0A%20%20%20%20%20%20%20%20self.val%3A%20int%20%3D%20val%20%20%23%20%E8%8A%82%E7%82%B9%E5%80%BC%0A%20%20%20%20%20%20%20%20self.next%3A%20ListNode%20%7C%20None%20%3D%20None%20%20%23%20%E5%90%8E%E7%BB%A7%E8%8A%82%E7%82%B9%E5%BC%95%E7%94%A8%0A%0A%22%22%22Driver%20Code%22%22%22%0Aif%20__name__%20%3D%3D%20%22__main__%22%3A%0A%20%20%20%20%23%20%E4%BD%BF%E7%94%A8%E5%A4%9A%E7%A7%8D%E5%9F%BA%E6%9C%AC%E6%95%B0%E6%8D%AE%E7%B1%BB%E5%9E%8B%E6%9D%A5%E5%88%9D%E5%A7%8B%E5%8C%96%E6%95%B0%E7%BB%84%0A%20%20%20%20numbers%20%3D%20%5B0%5D%20*%205%0A%20%20%20%20decimals%20%3D%20%5B0.0%5D%20*%205%0A%20%20%20%20%23%20Python%20%E7%9A%84%E5%AD%97%E7%AC%A6%E5%AE%9E%E9%99%85%E4%B8%8A%E6%98%AF%E9%95%BF%E5%BA%A6%E4%B8%BA%201%20%E7%9A%84%E5%AD%97%E7%AC%A6%E4%B8%B2%0A%20%20%20%20characters%20%3D%20%5B'0'%5D%20*%205%0A%20%20%20%20bools%20%3D%20%5BFalse%5D%20*%205%0A%20%20%20%20%23%20Python%20%E7%9A%84%E5%88%97%E8%A1%A8%E5%8F%AF%E4%BB%A5%E8%87%AA%E7%94%B1%E5%AD%98%E5%82%A8%E5%90%84%E7%A7%8D%E5%9F%BA%E6%9C%AC%E6%95%B0%E6%8D%AE%E7%B1%BB%E5%9E%8B%E5%92%8C%E5%AF%B9%E8%B1%A1%E5%BC%95%E7%94%A8%0A%20%20%20%20data%20%3D%20%5B0,%200.0,%20'a',%20False,%20ListNode%280%29%5D&cumulative=false&curInstr=12&heapPrimitives=nevernest&mode=display&origin=opt-frontend.js&py=311&rawInputLstJSON=%5B%5D&textReferences=false
@@ -1,87 +1,87 @@
# Character encoding *
# Character Encoding *
In the computer system, all data is stored in binary form, and `char` is no exception. To represent characters, we need to develop a "character set" that defines a one-to-one mapping between each character and binary numbers. With the character set, computers can convert binary numbers to characters by looking up the table.
In computers, all data is stored in binary form, and character `char` is no exception. To represent characters, we need to establish a "character set" that defines a one-to-one correspondence between each character and binary numbers. With a character set, computers can convert binary numbers to characters by looking up the table.
## ASCII character set
## ASCII Character Set
The <u>ASCII code</u> is one of the earliest character sets, officially known as the American Standard Code for Information Interchange. It uses 7 binary digits (the lower 7 bits of a byte) to represent a character, allowing for a maximum of 128 different characters. As shown in the figure below, ASCII includes uppercase and lowercase English letters, numbers 0 ~ 9, various punctuation marks, and certain control characters (such as newline and tab).
<u>ASCII code</u> is the earliest character set, with the full name American Standard Code for Information Interchange. It uses 7 binary bits (the lower 7 bits of one byte) to represent a character, and can represent a maximum of 128 different characters. As shown in the figure below, ASCII code includes uppercase and lowercase English letters, numbers 0 ~ 9, some punctuation marks, and some control characters (such as newline and tab).
![ASCII code](character_encoding.assets/ascii_table.png)
However, **ASCII can only represent English characters**. With the globalization of computers, a character set called <u>EASCII</u> was developed to represent more languages. It expands from the 7-bit structure of ASCII to 8 bits, enabling the representation of 256 characters.
However, **ASCII code can only represent English**. With the globalization of computers, a character set called <u>EASCII</u> that can represent more languages emerged. It expands from the 7-bit basis of ASCII to 8 bits, and can represent 256 different characters.
Globally, various region-specific EASCII character sets have been introduced. The first 128 characters of these sets are consistent with the ASCII, while the remaining 128 characters are defined differently to accommodate the requirements of different languages.
Worldwide, a batch of EASCII character sets suitable for different regions have appeared successively. The first 128 characters of these character sets are unified as ASCII code, and the last 128 characters are defined differently to adapt to the needs of different languages.
## GBK character set
## GBK Character Set
Later, it was found that **EASCII still could not meet the character requirements of many languages**. For instance, there are nearly a hundred thousand Chinese characters, with several thousand used regularly. In 1980, the Standardization Administration of China released the <u>GB2312</u> character set, which included 6763 Chinese characters, essentially fulfilling the computer processing needs for the Chinese language.
Later, people found that **EASCII code still cannot meet the character quantity requirements of many languages**. For example, there are nearly one hundred thousand Chinese characters, and several thousand are used daily. In 1980, the China National Standardization Administration released the <u>GB2312</u> character set, which included 6,763 Chinese characters, basically meeting the needs for computer processing of Chinese characters.
However, GB2312 could not handle some rare and traditional characters. The <u>GBK</u> character set expands GB2312 and includes 21886 Chinese characters. In the GBK encoding scheme, ASCII characters are represented with one byte, while Chinese characters use two bytes.
However, GB2312 cannot handle some rare characters and traditional Chinese characters. The <u>GBK</u> character set is an extension based on GB2312, which includes a total of 21,886 Chinese characters. In the GBK encoding scheme, ASCII characters are represented using one byte, and Chinese characters are represented using two bytes.
## Unicode character set
## Unicode Character Set
With the rapid evolution of computer technology and a plethora of character sets and encoding standards, numerous problems arose. On the one hand, these character sets generally only defined characters for specific languages and could not function properly in multilingual environments. On the other hand, the existence of multiple character set standards for the same language caused garbled text when information was exchanged between computers using different encoding standards.
With the vigorous development of computer technology, character sets and encoding standards flourished, which brought many problems. On the one hand, these character sets generally only define characters for specific languages and cannot work normally in multilingual environments. On the other hand, multiple character set standards exist for the same language, and if two computers use different encoding standards, garbled characters will appear during information transmission.
Researchers of that era thought: **What if a comprehensive character set encompassing all global languages and symbols was developed? Wouldn't this resolve the issues associated with cross-linguistic environments and garbled text?** Inspired by this idea, the extensive character set, Unicode, was born.
Researchers of that era thought: **If a sufficiently complete character set is released that includes all languages and symbols in the world, wouldn't it be possible to solve cross-language environment and garbled character problems**? Driven by this idea, a large and comprehensive character set, Unicode, was born.
<u>Unicode</u> is referred to as "统一码" (Unified Code) in Chinese, theoretically capable of accommodating over a million characters. It aims to incorporate characters from all over the world into a single set, providing a universal character set for processing and displaying various languages and reducing the issues of garbled text due to different encoding standards.
<u>Unicode</u> is called "统一码" (Unified Code) in Chinese and can theoretically accommodate over one million characters. It is committed to including characters from around the world into a unified character set, providing a universal character set to handle and display various language texts, reducing garbled character problems caused by different encoding standards.
Since its release in 1991, Unicode has continually expanded to include new languages and characters. As of September 2022, Unicode contains 149,186 characters, including characters, symbols, and even emojis from various languages. In the vast Unicode character set, commonly used characters occupy 2 bytes, while some rare characters may occupy 3 or even 4 bytes.
Since its release in 1991, Unicode has continuously expanded to include new languages and characters. As of September 2022, Unicode has included 149,186 characters, including characters, symbols, and even emojis from various languages. In the vast Unicode character set, commonly used characters occupy 2 bytes, and some rare characters occupy 3 bytes or even 4 bytes.
Unicode is a universal character set that assigns a number (called a "code point") to each character, **but it does not specify how these character code points should be stored in a computer system**. One might ask: How does a system interpret Unicode code points of varying lengths within a text? For example, given a 2-byte code, how does the system determine if it represents a single 2-byte character or two 1-byte characters?
Unicode is a universal character set that essentially assigns a number (called a "code point") to each character, **but it does not specify how to store these character code points in computers**. We can't help but ask: when Unicode code points of multiple lengths appear simultaneously in a text, how does the system parse the characters? For example, given an encoding with a length of 2 bytes, how does the system determine whether it is one 2-byte character or two 1-byte characters?
**A straightforward solution to this problem is to store all characters as equal-length encodings**. As shown in the figure below, each character in "Hello" occupies 1 byte, while each character in "算法" (algorithm) occupies 2 bytes. We could encode all characters in "Hello 算法" as 2 bytes by padding the higher bits with zeros. This method would enable the system to interpret a character every 2 bytes, recovering the content of the phrase.
For the above problem, **a straightforward solution is to store all characters as equal-length encodings**. As shown in the figure below, each character in "Hello" occupies 1 byte, and each character in "算法" (algorithm) occupies 2 bytes. We can encode all characters in "Hello 算法" as 2 bytes in length by padding the high bits with 0. In this way, the system can parse one character every 2 bytes and restore the content of this phrase.
![Unicode encoding example](character_encoding.assets/unicode_hello_algo.png)
However, as ASCII has shown us, encoding English only requires 1 byte. Using the above approach would double the space occupied by English text compared to ASCII encoding, which is a waste of memory space. Therefore, a more efficient Unicode encoding method is needed.
However, ASCII code has already proven to us that encoding English only requires 1 byte. If the above scheme is adopted, the size of English text will be twice that under ASCII encoding, which is very wasteful of memory space. Therefore, we need a more efficient Unicode encoding method.
## UTF-8 encoding
## UTF-8 Encoding
Currently, UTF-8 has become the most widely used Unicode encoding method internationally. **It is a variable-length encoding**, using 1 to 4 bytes to represent a character, depending on the complexity of the character. ASCII characters need only 1 byte, Latin and Greek letters require 2 bytes, commonly used Chinese characters need 3 bytes, and some other rare characters need 4 bytes.
Currently, UTF-8 has become the most widely used Unicode encoding method internationally. **It is a variable-length encoding** that uses 1 to 4 bytes to represent a character, depending on the complexity of the character. ASCII characters only require 1 byte, Latin and Greek letters require 2 bytes, commonly used Chinese characters require 3 bytes, and some other rare characters require 4 bytes.
The encoding rules for UTF-8 are not complex and can be divided into two cases:
The encoding rules of UTF-8 are not complicated and can be divided into the following two cases.
- For 1-byte characters, set the highest bit to $0$, and the remaining 7 bits to the Unicode code point. Notably, ASCII characters occupy the first 128 code points in the Unicode set. This means that **UTF-8 encoding is backward compatible with ASCII**. This implies that UTF-8 can be used to parse ancient ASCII text.
- For characters of length $n$ bytes (where $n > 1$), set the highest $n$ bits of the first byte to $1$, and the $(n + 1)^{\text{th}}$ bit to $0$; starting from the second byte, set the highest 2 bits of each byte to $10$; the rest of the bits are used to fill the Unicode code point.
- For 1-byte characters, set the highest bit to $0$, and set the remaining 7 bits to the Unicode code point. It is worth noting that ASCII characters occupy the first 128 code points in the Unicode character set. That is to say, **UTF-8 encoding is backward compatible with ASCII code**. This means we can use UTF-8 to parse very old ASCII code text.
- For characters with a length of $n$ bytes (where $n > 1$), set the highest $n$ bits of the first byte to $1$, and set the $(n + 1)$-th bit to $0$; starting from the second byte, set the highest 2 bits of each byte to $10$; use all remaining bits to fill in the Unicode code point of the character.
The figure below shows the UTF-8 encoding for "Hello算法". It can be observed that since the highest $n$ bits are set to $1$, the system can determine the length of the character as $n$ by counting the number of highest bits set to $1$.
The figure below shows the UTF-8 encoding corresponding to "Hello算法". It can be observed that since the highest $n$ bits are all set to $1$, the system can parse the length of the character as $n$ by reading the number of highest bits that are $1$.
But why set the highest 2 bits of the remaining bytes to $10$? Actually, this $10$ serves as a kind of checksum. If the system starts parsing text from an incorrect byte, the $10$ at the beginning of the byte can help the system quickly detect anomalies.
But why set the highest 2 bits of all other bytes to $10$? In fact, this $10$ can serve as a check symbol. Assuming the system starts parsing text from an incorrect byte, the $10$ at the beginning of the byte can help the system quickly determine an anomaly.
The reason for using $10$ as a checksum is that, under UTF-8 encoding rules, it's impossible for the highest two bits of a character to be $10$. This can be proven by contradiction: If the highest two bits of a character are $10$, it indicates that the character's length is $1$, corresponding to ASCII. However, the highest bit of an ASCII character should be $0$, which contradicts the assumption.
The reason for using $10$ as a check symbol is that under UTF-8 encoding rules, it is impossible for a character's highest two bits to be $10$. This conclusion can be proven by contradiction: assuming the highest two bits of a character are $10$, it means the length of the character is $1$, corresponding to ASCII code. However, the highest bit of ASCII code should be $0$, which contradicts the assumption.
![UTF-8 encoding example](character_encoding.assets/utf-8_hello_algo.png)
Apart from UTF-8, other common encoding methods include:
In addition to UTF-8, common encoding methods also include the following two.
- **UTF-16 encoding**: Uses 2 or 4 bytes to represent a character. All ASCII characters and commonly used non-English characters are represented with 2 bytes; a few characters require 4 bytes. For 2-byte characters, the UTF-16 encoding equals the Unicode code point.
- **UTF-32 encoding**: Every character uses 4 bytes. This means UTF-32 occupies more space than UTF-8 and UTF-16, especially for texts with a high proportion of ASCII characters.
- **UTF-16 encoding**: Uses 2 or 4 bytes to represent a character. All ASCII characters and commonly used non-English characters are represented with 2 bytes; a few characters need to use 4 bytes. For 2-byte characters, UTF-16 encoding is equal to the Unicode code point.
- **UTF-32 encoding**: Every character uses 4 bytes. This means that UTF-32 takes up more space than UTF-8 and UTF-16, especially for text with a high proportion of ASCII characters.
From the perspective of storage space, using UTF-8 to represent English characters is very efficient because it only requires 1 byte; using UTF-16 to encode some non-English characters (such as Chinese) can be more efficient because it only requires 2 bytes, while UTF-8 might need 3 bytes.
From the perspective of storage space occupation, using UTF-8 to represent English characters is very efficient because it only requires 1 byte; using UTF-16 encoding for some non-English characters (such as Chinese) will be more efficient because it only requires 2 bytes, while UTF-8 may require 3 bytes.
From a compatibility perspective, UTF-8 is the most versatile, with many tools and libraries supporting UTF-8 as a priority.
From a compatibility perspective, UTF-8 has the best universality, and many tools and libraries support UTF-8 first.
## Character encoding in programming languages
## Character Encoding in Programming Languages
Historically, many programming languages utilized fixed-length encodings such as UTF-16 or UTF-32 for processing strings during program execution. This allows strings to be handled as arrays, offering several advantages:
For most past programming languages, strings during program execution use fixed-length encodings such as UTF-16 or UTF-32. Under fixed-length encoding, we can treat strings as arrays for processing, and this approach has the following advantages.
- **Random access**: Strings encoded in UTF-16 can be accessed randomly with ease. For UTF-8, which is a variable-length encoding, locating the $i^{th}$ character requires traversing the string from the start to the $i^{th}$ position, taking $O(n)$ time.
- **Character counting**: Similar to random access, counting the number of characters in a UTF-16 encoded string is an $O(1)$ operation. However, counting characters in a UTF-8 encoded string requires traversing the entire string.
- **String operations**: Many string operations like splitting, concatenating, inserting, and deleting are easier on UTF-16 encoded strings. These operations generally require additional computation on UTF-8 encoded strings to ensure the validity of the UTF-8 encoding.
- **Random access**: UTF-16 encoded strings can be easily accessed randomly. UTF-8 is a variable-length encoding. To find the $i$-th character, we need to traverse from the beginning of the string to the $i$-th character, which requires $O(n)$ time.
- **Character counting**: Similar to random access, calculating the length of a UTF-16 encoded string is also an $O(1)$ operation. However, calculating the length of a UTF-8 encoded string requires traversing the entire string.
- **String operations**: Many string operations (such as splitting, joining, inserting, deleting, etc.) on UTF-16 encoded strings are easier to perform. Performing these operations on UTF-8 encoded strings usually requires additional calculations to ensure that invalid UTF-8 encoding is not generated.
The design of character encoding schemes in programming languages is an interesting topic involving various factors:
In fact, the design of character encoding schemes for programming languages is a very interesting topic involving many factors.
- Javas `String` type uses UTF-16 encoding, with each character occupying 2 bytes. This was based on the initial belief that 16 bits were sufficient to represent all possible characters and proven incorrect later. As the Unicode standard expanded beyond 16 bits, characters in Java may now be represented by a pair of 16-bit values, known as “surrogate pairs.
- JavaScript and TypeScript use UTF-16 encoding for similar reasons as Java. When JavaScript was first introduced by Netscape in 1995, Unicode was still in its early stages, and 16-bit encoding was sufficient to represent all Unicode characters.
- C# uses UTF-16 encoding, largely because the .NET platform, designed by Microsoft, and many Microsoft technologies, including the Windows operating system, extensively use UTF-16 encoding.
- Java's `String` type uses UTF-16 encoding, with each character occupying 2 bytes. This is because at the beginning of Java language design, people believed that 16 bits were sufficient to represent all possible characters. However, this was an incorrect judgment. Later, the Unicode specification expanded beyond 16 bits, so characters in Java may now be represented by a pair of 16-bit values (called "surrogate pairs").
- The strings of JavaScript and TypeScript use UTF-16 encoding for reasons similar to Java. When Netscape first introduced the JavaScript language in 1995, Unicode was still in its early stages of development, and at that time, using 16-bit encoding was sufficient to represent all Unicode characters.
- C# uses UTF-16 encoding mainly because the .NET platform was designed by Microsoft, and many of Microsoft's technologies (including the Windows operating system) extensively use UTF-16 encoding.
Due to the underestimation of character counts, these languages had to use "surrogate pairs" to represent Unicode characters exceeding 16 bits. This approach has its drawbacks: strings containing surrogate pairs may have characters occupying 2 or 4 bytes, losing the advantage of fixed-length encoding. Additionally, handling surrogate pairs adds complexity and debugging difficulty to programming.
Due to the underestimation of character quantities by the above programming languages, they had to adopt the "surrogate pair" method to represent Unicode characters with lengths exceeding 16 bits. This is a reluctant compromise. On the one hand, in strings containing surrogate pairs, one character may occupy 2 bytes or 4 bytes, thus losing the advantage of fixed-length encoding. On the other hand, handling surrogate pairs requires additional code, which increases the complexity and difficulty of debugging in programming.
Addressing these challenges, some languages have adopted alternative encoding strategies:
For the above reasons, some programming languages have proposed different encoding schemes.
- Pythons `str` type uses Unicode encoding with a flexible representation where the storage length of characters depends on the largest Unicode code point in the string. If all characters are ASCII, each character occupies 1 byte, 2 bytes for characters within the Basic Multilingual Plane (BMP), and 4 bytes for characters beyond the BMP.
- Gos `string` type internally uses UTF-8 encoding. Go also provides the `rune` type for representing individual Unicode code points.
- Rusts `str` and `String` types use UTF-8 encoding internally. Rust also offers the `char` type for individual Unicode code points.
- Python's `str` uses Unicode encoding and adopts a flexible string representation where the stored character length depends on the largest Unicode code point in the string. If all characters in the string are ASCII characters, each character occupies 1 byte; if there are characters exceeding the ASCII range but all within the Basic Multilingual Plane (BMP), each character occupies 2 bytes; if there are characters exceeding the BMP, each character occupies 4 bytes.
- Go language's `string` type uses UTF-8 encoding internally. Go language also provides the `rune` type, which is used to represent a single Unicode code point.
- Rust language's `str` and `String` types use UTF-8 encoding internally. Rust also provides the `char` type for representing a single Unicode code point.
Its important to note that the above discussion pertains to how strings are stored in programming languages, **which is different from how strings are stored in files or transmitted over networks**. For file storage or network transmission, strings are usually encoded in UTF-8 format for optimal compatibility and space efficiency.
It should be noted that the above discussion is about how strings are stored in programming languages, **which is different from how strings are stored in files or transmitted over networks**. In file storage or network transmission, we usually encode strings into UTF-8 format to achieve optimal compatibility and space efficiency.
@@ -1,48 +1,48 @@
# Classification of data structures
# Classification of Data Structures
Common data structures include arrays, linked lists, stacks, queues, hash tables, trees, heaps, and graphs. They can be classified into "logical structure" and "physical structure".
Common data structures include arrays, linked lists, stacks, queues, hash tables, trees, heaps, and graphs. They can be classified from two dimensions: "logical structure" and "physical structure".
## Logical structure: linear and non-linear
## Logical Structure: Linear and Non-Linear
**The logical structures reveal the logical relationships between data elements**. In arrays and linked lists, data are arranged in a specific sequence, demonstrating the linear relationship between data; while in trees, data are arranged hierarchically from the top down, showing the derived relationship between "ancestors" and "descendants"; and graphs are composed of nodes and edges, reflecting the intricate network relationship.
**Logical structure reveals the logical relationships between data elements**. In arrays and linked lists, data is arranged in a certain order, embodying the linear relationship between data; while in trees, data is arranged hierarchically from top to bottom, showing the derived relationship between "ancestors" and "descendants"; graphs are composed of nodes and edges, reflecting complex network relationships.
As shown in the figure below, logical structures can be divided into two major categories: "linear" and "non-linear". Linear structures are more intuitive, indicating data is arranged linearly in logical relationships; non-linear structures, conversely, are arranged non-linearly.
As shown in the figure below, logical structures can be divided into two major categories: "linear" and "non-linear". Linear structures are more intuitive, indicating that data is linearly arranged in logical relationships; non-linear structures are the opposite, arranged non-linearly.
- **Linear data structures**: Arrays, Linked Lists, Stacks, Queues, Hash Tables, where elements have a one-to-one sequential relationship.
- **Non-linear data structures**: Trees, Heaps, Graphs, Hash Tables.
- **Linear data structures**: Arrays, linked lists, stacks, queues, hash tables, where elements have a one-to-one sequential relationship.
- **Non-linear data structures**: Trees, heaps, graphs, hash tables.
Non-linear data structures can be further divided into tree structures and network structures.
- **Tree structures**: Trees, Heaps, Hash Tables, where elements have a one-to-many relationship.
- **Network structures**: Graphs, where elements have a many-to-many relationships.
- **Tree structures**: Trees, heaps, hash tables, where elements have a one-to-many relationship.
- **Network structures**: Graphs, where elements have a many-to-many relationship.
![Linear and non-linear data structures](classification_of_data_structure.assets/classification_logic_structure.png)
## Physical structure: contiguous and dispersed
## Physical Structure: Contiguous and Dispersed
**During the execution of an algorithm, the data being processed is stored in memory**. The figure below shows a computer memory stick where each black square is a physical memory space. We can think of memory as a vast Excel spreadsheet, with each cell capable of storing a certain amount of data.
**When an algorithm program runs, the data being processed is mainly stored in memory**. The figure below shows a computer memory stick, where each black square contains a memory space. We can imagine memory as a huge Excel spreadsheet, where each cell can store a certain amount of data.
**The system accesses the data at the target location by means of a memory address**. As shown in the figure below, the computer assigns a unique identifier to each cell in the table according to specific rules, ensuring that each memory space has a unique memory address. With these addresses, the program can access the data stored in memory.
**The system accesses data at the target location through memory addresses**. As shown in the figure below, the computer assigns a number to each cell in the spreadsheet according to specific rules, ensuring that each memory space has a unique memory address. With these addresses, the program can access data in memory.
![Memory stick, memory spaces, memory addresses](classification_of_data_structure.assets/computer_memory_location.png)
![Memory stick, memory space, memory address](classification_of_data_structure.assets/computer_memory_location.png)
!!! tip
It's worth noting that comparing memory to an Excel spreadsheet is a simplified analogy. The actual working mechanism of memory is more complex, involving concepts like address space, memory management, cache mechanisms, virtual memory, and physical memory.
It is worth noting that comparing memory to an Excel spreadsheet is a simplified analogy. The actual working mechanism of memory is quite complex, involving concepts such as address space, memory management, cache mechanisms, virtual memory, and physical memory.
Memory is a shared resource for all programs. When a block of memory is occupied by one program, it cannot be simultaneously used by other programs. **Therefore, memory resources are an important consideration in the design of data structures and algorithms**. For instance, the algorithm's peak memory usage should not exceed the remaining free memory of the system; if there is a lack of contiguous memory blocks, then the data structure chosen must be able to be stored in non-contiguous memory blocks.
Memory is a shared resource for all programs. When a block of memory is occupied by a program, it usually cannot be used by other programs at the same time. **Therefore, in the design of data structures and algorithms, memory resources are an important consideration**. For example, the peak memory occupied by an algorithm should not exceed the remaining free memory of the system; if there is a lack of contiguous large memory blocks, then the data structure chosen must be able to be stored in dispersed memory spaces.
As illustrated in the figure below, **the physical structure reflects the way data is stored in computer memory** and it can be divided into contiguous space storage (arrays) and non-contiguous space storage (linked lists). The two types of physical structures exhibit complementary characteristics in terms of time efficiency and space efficiency.
As shown in the figure below, **physical structure reflects the way data is stored in computer memory**, and can be divided into contiguous space storage (arrays) and dispersed space storage (linked lists). The two physical structures exhibit complementary characteristics in terms of time efficiency and space efficiency.
![Contiguous space storage and dispersed space storage](classification_of_data_structure.assets/classification_phisical_structure.png)
**It is worth noting that all data structures are implemented based on arrays, linked lists, or a combination of both**. For example, stacks and queues can be implemented using either arrays or linked lists; while implementations of hash tables may involve both arrays and linked lists.
It is worth noting that **all data structures are implemented based on arrays, linked lists, or a combination of both**. For example, stacks and queues can be implemented using either arrays or linked lists; while the implementation of hash tables may include both arrays and linked lists.
- **Array-based implementations**: Stacks, Queues, Hash Tables, Trees, Heaps, Graphs, Matrices, Tensors (arrays with dimensions $\geq 3$).
- **Linked-list-based implementations**: Stacks, Queues, Hash Tables, Trees, Heaps, Graphs, etc.
- **Can be implemented based on arrays**: Stacks, queues, hash tables, trees, heaps, graphs, matrices, tensors (arrays with dimensions $\geq 3$), etc.
- **Can be implemented based on linked lists**: Stacks, queues, hash tables, trees, heaps, graphs, etc.
Data structures implemented based on arrays are also called “Static Data Structures,” meaning their length cannot be changed after initialization. Conversely, those based on linked lists are called “Dynamic Data Structures,” which can still adjust their size during program execution.
After initialization, linked lists can still adjust their length during program execution, so they are also called "dynamic data structures". After initialization, the length of arrays cannot be changed, so they are also called "static data structures". It is worth noting that arrays can achieve length changes by reallocating memory, thus possessing a certain degree of "dynamism".
!!! tip
If you find it challenging to comprehend the physical structure, it is recommended that you read the next chapter, "Arrays and Linked Lists," and revisit this section later.
If you find it difficult to understand physical structure, it is recommended to read the next chapter first, and then review this section.
+4 -4
View File
@@ -1,9 +1,9 @@
# Data structures
# Data Structure
![Data structures](../assets/covers/chapter_data_structure.jpg)
![Data structure](../assets/covers/chapter_data_structure.jpg)
!!! abstract
Data structures serve as a robust and diverse framework.
Data structure is like a sturdy and diverse framework.
They offer a blueprint for the orderly organization of data, upon which algorithms come to life.
It provides a blueprint for the orderly organization of data, upon which algorithms come to life.
@@ -1,24 +1,24 @@
# Number encoding *
# Number Encoding *
!!! tip
In this book, chapters marked with an asterisk '*' are optional readings. If you are short on time or find them challenging, you may skip these initially and return to them after completing the essential chapters.
In this book, chapters marked with an asterisk * are optional readings. If you are short on time or find them challenging, you may skip these initially and return to them after completing the essential chapters.
## Integer encoding
## Sign-Magnitude, 1's Complement, and 2's Complement
In the table from the previous section, we observed that all integer types can represent one more negative number than positive numbers, such as the `byte` range of $[-128, 127]$. This phenomenon seems counterintuitive, and its underlying reason involves knowledge of sign-magnitude, one's complement, and two's complement encoding.
In the table from the previous section, we found that all integer types can represent one more negative number than positive numbers. For example, the `byte` range is $[-128, 127]$. This phenomenon is counterintuitive, and its underlying reason involves knowledge of sign-magnitude, 1's complement, and 2's complement.
Firstly, it's important to note that **numbers are stored in computers using the two's complement form**. Before analyzing why this is the case, let's define these three encoding methods:
First, it should be noted that **numbers are stored in computers in the form of "2's complement"**. Before analyzing the reasons for this, let's first define these three concepts.
- **Sign-magnitude**: The highest bit of a binary representation of a number is considered the sign bit, where $0$ represents a positive number and $1$ represents a negative number. The remaining bits represent the value of the number.
- **One's complement**: The one's complement of a positive number is the same as its sign-magnitude. For negative numbers, it's obtained by inverting all bits except the sign bit.
- **Two's complement**: The two's complement of a positive number is the same as its sign-magnitude. For negative numbers, it's obtained by adding $1$ to their one's complement.
- **Sign-magnitude**: We treat the highest bit of the binary representation of a number as the sign bit, where $0$ represents a positive number and $1$ represents a negative number, and the remaining bits represent the value of the number.
- **1's complement**: The 1's complement of a positive number is the same as its sign-magnitude. For a negative number, the 1's complement is obtained by inverting all bits except the sign bit of its sign-magnitude.
- **2's complement**: The 2's complement of a positive number is the same as its sign-magnitude. For a negative number, the 2's complement is obtained by adding $1$ to its 1's complement.
The figure below illustrates the conversions among sign-magnitude, one's complement, and two's complement:
The figure below shows the conversion methods among sign-magnitude, 1's complement, and 2's complement.
![Conversions between sign-magnitude, one's complement, and two's complement](number_encoding.assets/1s_2s_complement.png)
![Conversions among sign-magnitude, 1's complement, and 2's complement](number_encoding.assets/1s_2s_complement.png)
Although <u>sign-magnitude</u> is the most intuitive, it has limitations. For one, **negative numbers in sign-magnitude cannot be directly used in calculations**. For example, in sign-magnitude, calculating $1 + (-2)$ results in $-3$, which is incorrect.
<u>Sign-magnitude</u>, although the most intuitive, has some limitations. On one hand, **the sign-magnitude of negative numbers cannot be directly used in operations**. For example, calculating $1 + (-2)$ in sign-magnitude yields $-3$, which is clearly incorrect.
$$
\begin{aligned}
@@ -29,20 +29,20 @@ $$
\end{aligned}
$$
To address this, computers introduced the <u>one's complement</u>. If we convert to one's complement and calculate $1 + (-2)$, then convert the result back to sign-magnitude, we get the correct result of $-1$.
To solve this problem, computers introduced <u>1's complement</u>. If we first convert sign-magnitude to 1's complement and calculate $1 + (-2)$ in 1's complement, then convert the result back to sign-magnitude, we can obtain the correct result of $-1$.
$$
\begin{aligned}
& 1 + (-2) \newline
& \rightarrow 0000 \; 0001 \; \text{(Sign-magnitude)} + 1000 \; 0010 \; \text{(Sign-magnitude)} \newline
& = 0000 \; 0001 \; \text{(One's complement)} + 1111 \; 1101 \; \text{(One's complement)} \newline
& = 1111 \; 1110 \; \text{(One's complement)} \newline
& = 0000 \; 0001 \; \text{(1's complement)} + 1111 \; 1101 \; \text{(1's complement)} \newline
& = 1111 \; 1110 \; \text{(1's complement)} \newline
& = 1000 \; 0001 \; \text{(Sign-magnitude)} \newline
& \rightarrow -1
\end{aligned}
$$
Additionally, **there are two representations of zero in sign-magnitude**: $+0$ and $-0$. This means two different binary encodings for zero, which could lead to ambiguity. For example, in conditional checks, not differentiating between positive and negative zero might result in incorrect outcomes. Addressing this ambiguity would require additional checks, potentially reducing computational efficiency.
On the other hand, **the sign-magnitude of the number zero has two representations, $+0$ and $-0$**. This means that the number zero corresponds to two different binary encodings, which may cause ambiguity. For example, in conditional judgments, if we don't distinguish between positive zero and negative zero, it may lead to incorrect judgment results. If we want to handle the ambiguity of positive and negative zero, we need to introduce additional judgment operations, which may reduce the computational efficiency of the computer.
$$
\begin{aligned}
@@ -51,67 +51,67 @@ $$
\end{aligned}
$$
Like sign-magnitude, one's complement also suffers from the positive and negative zero ambiguity. Therefore, computers further introduced the <u>two's complement</u>. Let's observe the conversion process for negative zero in sign-magnitude, one's complement, and two's complement:
Like sign-magnitude, 1's complement also has the problem of positive and negative zero ambiguity. Therefore, computers further introduced <u>2's complement</u>. Let's first observe the conversion process of negative zero from sign-magnitude to 1's complement to 2's complement:
$$
\begin{aligned}
-0 \rightarrow \; & 1000 \; 0000 \; \text{(Sign-magnitude)} \newline
= \; & 1111 \; 1111 \; \text{(One's complement)} \newline
= 1 \; & 0000 \; 0000 \; \text{(Two's complement)} \newline
= \; & 1111 \; 1111 \; \text{(1's complement)} \newline
= 1 \; & 0000 \; 0000 \; \text{(2's complement)} \newline
\end{aligned}
$$
Adding $1$ to the one's complement of negative zero produces a carry, but with `byte` length being only 8 bits, the carried-over $1$ to the 9th bit is discarded. Therefore, **the two's complement of negative zero is $0000 \; 0000$**, the same as positive zero, thus resolving the ambiguity.
Adding $1$ to the 1's complement of negative zero produces a carry, but since the `byte` type has a length of only 8 bits, the $1$ that overflows to the 9th bit is discarded. That is to say, **the 2's complement of negative zero is $0000 \; 0000$, which is the same as the 2's complement of positive zero**. This means that in 2's complement representation, there is only one zero, and the positive and negative zero ambiguity is thus resolved.
One last puzzle is the $[-128, 127]$ range for `byte`, with an additional negative number, $-128$. We observe that for the interval $[-127, +127]$, all integers have corresponding sign-magnitude, one's complement, and two's complement, allowing for mutual conversion between them.
One last question remains: the range of the `byte` type is $[-128, 127]$, and how is the extra negative number $-128$ obtained? We notice that all integers in the interval $[-127, +127]$ have corresponding sign-magnitude, 1's complement, and 2's complement, and sign-magnitude and 2's complement can be converted to each other.
However, **the two's complement $1000 \; 0000$ is an exception without a corresponding sign-magnitude**. According to the conversion method, its sign-magnitude would be $0000 \; 0000$, indicating zero. This presents a contradiction because its two's complement should represent itself. Computers designate this special two's complement $1000 \; 0000$ as representing $-128$. In fact, the calculation of $(-1) + (-127)$ in two's complement results in $-128$.
However, **the 2's complement $1000 \; 0000$ is an exception, and it does not have a corresponding sign-magnitude**. According to the conversion method, we get that the sign-magnitude of this 2's complement is $0000 \; 0000$. This is clearly contradictory because this sign-magnitude represents the number $0$, and its 2's complement should be itself. The computer specifies that this special 2's complement $1000 \; 0000$ represents $-128$. In fact, the result of calculating $(-1) + (-127)$ in 2's complement is $-128$.
$$
\begin{aligned}
& (-127) + (-1) \newline
& \rightarrow 1111 \; 1111 \; \text{(Sign-magnitude)} + 1000 \; 0001 \; \text{(Sign-magnitude)} \newline
& = 1000 \; 0000 \; \text{(One's complement)} + 1111 \; 1110 \; \text{(One's complement)} \newline
& = 1000 \; 0001 \; \text{(Two's complement)} + 1111 \; 1111 \; \text{(Two's complement)} \newline
& = 1000 \; 0000 \; \text{(Two's complement)} \newline
& = 1000 \; 0000 \; \text{(1's complement)} + 1111 \; 1110 \; \text{(1's complement)} \newline
& = 1000 \; 0001 \; \text{(2's complement)} + 1111 \; 1111 \; \text{(2's complement)} \newline
& = 1000 \; 0000 \; \text{(2's complement)} \newline
& \rightarrow -128
\end{aligned}
$$
As you might have noticed, all these calculations are additions, hinting at an important fact: **computers' internal hardware circuits are primarily designed around addition operations**. This is because addition is simpler to implement in hardware compared to other operations like multiplication, division, and subtraction, allowing for easier parallelization and faster computation.
You may have noticed that all the above calculations are addition operations. This hints at an important fact: **the hardware circuits inside computers are mainly designed based on addition operations**. This is because addition operations are simpler to implement in hardware compared to other operations (such as multiplication, division, and subtraction), easier to parallelize, and have faster operation speeds.
It's important to note that this doesn't mean computers can only perform addition. **By combining addition with basic logical operations, computers can execute a variety of other mathematical operations**. For example, the subtraction $a - b$ can be translated into $a + (-b)$; multiplication and division can be translated into multiple additions or subtractions.
Please note that this does not mean that computers can only perform addition. **By combining addition with some basic logical operations, computers can implement various other mathematical operations**. For example, calculating the subtraction $a - b$ can be converted to calculating the addition $a + (-b)$; calculating multiplication and division can be converted to calculating multiple additions or subtractions.
We can now summarize the reason for using two's complement in computers: with two's complement representation, computers can use the same circuits and operations to handle both positive and negative number addition, eliminating the need for special hardware circuits for subtraction and avoiding the ambiguity of positive and negative zero. This greatly simplifies hardware design and enhances computational efficiency.
Now we can summarize the reasons why computers use 2's complement: based on 2's complement representation, computers can use the same circuits and operations to handle the addition of positive and negative numbers, without the need to design special hardware circuits to handle subtraction, and without the need to specially handle the ambiguity problem of positive and negative zero. This greatly simplifies hardware design and improves operational efficiency.
The design of two's complement is quite ingenious, and due to space constraints, we'll stop here. Interested readers are encouraged to explore further.
The design of 2's complement is very ingenious. Due to space limitations, we will stop here. Interested readers are encouraged to explore further.
## Floating-point number encoding
## Floating-Point Number Encoding
You might have noticed something intriguing: despite having the same length of 4 bytes, why does a `float` have a much larger range of values compared to an `int`? This seems counterintuitive, as one would expect the range to shrink for `float` since it needs to represent fractions.
Careful readers may have noticed: `int` and `float` have the same length, both are 4 bytes, but why does `float` have a much larger range than `int`? This is very counterintuitive because it stands to reason that `float` needs to represent decimals, so the range should be smaller.
In fact, **this is due to the different representation method used by floating-point numbers (`float`)**. Let's consider a 32-bit binary number as:
In fact, **this is because floating-point number `float` uses a different representation method**. Let's denote a 32-bit binary number as:
$$
b_{31} b_{30} b_{29} \ldots b_2 b_1 b_0
$$
According to the IEEE 754 standard, a 32-bit `float` consists of the following three parts:
According to the IEEE 754 standard, a 32-bit `float` consists of the following three parts.
- Sign bit $\mathrm{S}$: Occupies 1 bit, corresponding to $b_{31}$.
- Exponent bit $\mathrm{E}$: Occupies 8 bits, corresponding to $b_{30} b_{29} \ldots b_{23}$.
- Fraction bit $\mathrm{N}$: Occupies 23 bits, corresponding to $b_{22} b_{21} \ldots b_0$.
- Sign bit $\mathrm{S}$: occupies 1 bit, corresponding to $b_{31}$.
- Exponent bit $\mathrm{E}$: occupies 8 bits, corresponding to $b_{30} b_{29} \ldots b_{23}$.
- Fraction bit $\mathrm{N}$: occupies 23 bits, corresponding to $b_{22} b_{21} \ldots b_0$.
The value of a binary `float` number is calculated as:
The calculation method for the value corresponding to the binary `float` is:
$$
\text{val} = (-1)^{b_{31}} \times 2^{\left(b_{30} b_{29} \ldots b_{23}\right)_2 - 127} \times \left(1 . b_{22} b_{21} \ldots b_0\right)_2
\text {val} = (-1)^{b_{31}} \times 2^{\left(b_{30} b_{29} \ldots b_{23}\right)_2-127} \times\left(1 . b_{22} b_{21} \ldots b_0\right)_2
$$
Converted to a decimal formula, this becomes:
Converted to decimal, the calculation formula is:
$$
\text{val} = (-1)^{\mathrm{S}} \times 2^{\mathrm{E} - 127} \times (1 + \mathrm{N})
\text {val}=(-1)^{\mathrm{S}} \times 2^{\mathrm{E} -127} \times (1 + \mathrm{N})
$$
The range of each component is:
@@ -119,21 +119,21 @@ The range of each component is:
$$
\begin{aligned}
\mathrm{S} \in & \{ 0, 1\}, \quad \mathrm{E} \in \{ 1, 2, \dots, 254 \} \newline
(1 + \mathrm{N}) = & (1 + \sum_{i=1}^{23} b_{23-i} \times 2^{-i}) \subset [1, 2 - 2^{-23}]
(1 + \mathrm{N}) = & (1 + \sum_{i=1}^{23} b_{23-i} 2^{-i}) \subset [1, 2 - 2^{-23}]
\end{aligned}
$$
![Example calculation of a float in IEEE 754 standard](number_encoding.assets/ieee_754_float.png)
![Calculation example of float under IEEE 754 standard](number_encoding.assets/ieee_754_float.png)
Observing the figure above, given an example data $\mathrm{S} = 0$, $\mathrm{E} = 124$, $\mathrm{N} = 2^{-2} + 2^{-3} = 0.375$, we have:
Observing the figure above, given example data $\mathrm{S} = 0$, $\mathrm{E} = 124$, $\mathrm{N} = 2^{-2} + 2^{-3} = 0.375$, we have:
$$
\text{val} = (-1)^0 \times 2^{124 - 127} \times (1 + 0.375) = 0.171875
\text { val } = (-1)^0 \times 2^{124 - 127} \times (1 + 0.375) = 0.171875
$$
Now we can answer the initial question: **The representation of `float` includes an exponent bit, leading to a much larger range than `int`**. Based on the above calculation, the maximum positive number representable by `float` is approximately $2^{254 - 127} \times (2 - 2^{-23}) \approx 3.4 \times 10^{38}$, and the minimum negative number is obtained by switching the sign bit.
Now we can answer the initial question: **the representation of `float` includes an exponent bit, resulting in a range far greater than `int`**. According to the above calculation, the maximum positive number that `float` can represent is $2^{254 - 127} \times (2 - 2^{-23}) \approx 3.4 \times 10^{38}$, and the minimum negative number can be obtained by switching the sign bit.
**However, the trade-off for `float`'s expanded range is a sacrifice in precision**. The integer type `int` uses all 32 bits to represent the number, with values evenly distributed; but due to the exponent bit, the larger the value of a `float`, the greater the difference between adjacent numbers.
**Although floating-point number `float` expands the range, its side effect is sacrificing precision**. The integer type `int` uses all 32 bits to represent numbers, and the numbers are evenly distributed; however, due to the existence of the exponent bit, the larger the value of floating-point number `float`, the larger the difference between two adjacent numbers tends to be.
As shown in the table below, exponent bits $\mathrm{E} = 0$ and $\mathrm{E} = 255$ have special meanings, **used to represent zero, infinity, $\mathrm{NaN}$, etc.**
@@ -141,10 +141,10 @@ As shown in the table below, exponent bits $\mathrm{E} = 0$ and $\mathrm{E} = 25
| Exponent Bit E | Fraction Bit $\mathrm{N} = 0$ | Fraction Bit $\mathrm{N} \ne 0$ | Calculation Formula |
| ------------------ | ----------------------------- | ------------------------------- | ---------------------------------------------------------------------- |
| $0$ | $\pm 0$ | Subnormal Numbers | $(-1)^{\mathrm{S}} \times 2^{-126} \times (0.\mathrm{N})$ |
| $1, 2, \dots, 254$ | Normal Numbers | Normal Numbers | $(-1)^{\mathrm{S}} \times 2^{(\mathrm{E} -127)} \times (1.\mathrm{N})$ |
| $0$ | $\pm 0$ | Subnormal Number | $(-1)^{\mathrm{S}} \times 2^{-126} \times (0.\mathrm{N})$ |
| $1, 2, \dots, 254$ | Normal Number | Normal Number | $(-1)^{\mathrm{S}} \times 2^{(\mathrm{E} -127)} \times (1.\mathrm{N})$ |
| $255$ | $\pm \infty$ | $\mathrm{NaN}$ | |
It's worth noting that subnormal numbers significantly improve the precision of floating-point numbers. The smallest positive normal number is $2^{-126}$, and the smallest positive subnormal number is $2^{-126} \times 2^{-23}$.
It is worth noting that subnormal numbers significantly improve the precision of floating-point numbers. The smallest positive normal number is $2^{-126}$, and the smallest positive subnormal number is $2^{-126} \times 2^{-23}$.
Double-precision `double` also uses a similar representation method to `float`, which is not elaborated here for brevity.
Double-precision `double` also uses a representation method similar to `float`, which will not be elaborated here.
+29 -29
View File
@@ -2,65 +2,65 @@
### Key review
- Data structures can be categorized from two perspectives: logical structure and physical structure. Logical structure describes the logical relationships between data, while physical structure describes how data is stored in memory.
- Frequently used logical structures include linear structures, trees, and networks. We usually divide data structures into linear (arrays, linked lists, stacks, queues) and non-linear (trees, graphs, heaps) based on their logical structure. The implementation of hash tables may involve both linear and non-linear data structures.
- When a program is running, data is stored in memory. Each memory space has a corresponding address, and the program accesses data through these addresses.
- Physical structures can be divided into continuous space storage (arrays) and discrete space storage (linked lists). All data structures are implemented using arrays, linked lists, or a combination of both.
- The basic data types in computers include integers (`byte`, `short`, `int`, `long`), floating-point numbers (`float`, `double`), characters (`char`), and booleans (`bool`). The value range of a data type depends on its size and representation.
- Sign-magnitude, 1's complement, 2's complement are three methods of encoding integers in computers, and they can be converted into each other. The most significant bit of the sign-magnitude is the sign bit, and the remaining bits represent the value of the number.
- Integers are encoded by 2's complement in computers. The benefits of this representation include (i) the computer can unify the addition of positive and negative integers, (ii) no need to design special hardware circuits for subtraction, and (iii) no ambiguity of positive and negative zero.
- The encoding of floating-point numbers consists of 1 sign bit, 8 exponent bits, and 23 fraction bits. Due to the exponent bit, the range of floating-point numbers is much greater than that of integers, but at the cost of precision.
- ASCII is the earliest English character set, with 1 byte in length and a total of 127 characters. GBK is a popular Chinese character set, which includes more than 20,000 Chinese characters. Unicode aims to provide a complete character set standard that includes characters from various languages in the world, thus solving the garbled character problem caused by inconsistent character encoding methods.
- UTF-8 is the most popular and general Unicode encoding method. It is a variable-length encoding method with good scalability and space efficiency. UTF-16 and UTF-32 are fixed-length encoding methods. When encoding Chinese characters, UTF-16 takes up less space than UTF-8. Programming languages like Java and C# use UTF-16 encoding by default.
- Data structures can be classified from two perspectives: logical structure and physical structure. Logical structure describes the logical relationships between data elements, while physical structure describes how data is stored in computer memory.
- Common logical structures include linear, tree, and network structures. We typically classify data structures as linear (arrays, linked lists, stacks, queues) and non-linear (trees, graphs, heaps) based on their logical structure. The implementation of hash tables may involve both linear and non-linear data structures.
- When a program runs, data is stored in computer memory. Each memory space has a corresponding memory address, and the program accesses data through these memory addresses.
- Physical structures are primarily divided into contiguous space storage (arrays) and dispersed space storage (linked lists). All data structures are implemented using arrays, linked lists, or a combination of both.
- Basic data types in computers include integers `byte`, `short`, `int`, `long`, floating-point numbers `float`, `double`, characters `char`, and booleans `bool`. Their value ranges depend on the size of space they occupy and their representation method.
- Sign-magnitude, 1's complement, and 2's complement are three methods for encoding numbers in computers, and they can be converted into each other. The most significant bit of sign-magnitude is the sign bit, and the remaining bits represent the value of the number.
- Integers are stored in computers in 2's complement form. Under 2's complement representation, computers can treat the addition of positive and negative numbers uniformly, without needing to design special hardware circuits for subtraction, and there is no ambiguity of positive and negative zero.
- The encoding of floating-point numbers consists of 1 sign bit, 8 exponent bits, and 23 fraction bits. Due to the exponent bits, the range of floating-point numbers is much larger than that of integers, at the cost of sacrificing precision.
- ASCII is the earliest English character set, with a length of 1 byte, containing a total of 127 characters. GBK is a commonly used Chinese character set, containing over 20,000 Chinese characters. Unicode is committed to providing a complete character set standard, collecting characters from various languages around the world, thereby solving the garbled text problem caused by inconsistent character encoding methods.
- UTF-8 is the most popular Unicode encoding method, with excellent universality. It is a variable-length encoding method with good scalability, effectively improving storage space efficiency. UTF-16 and UTF-32 are fixed-length encoding methods. When encoding Chinese characters, UTF-16 occupies less space than UTF-8. Programming languages such as Java and C# use UTF-16 encoding by default.
### Q & A
**Q**: Why does a hash table contain both linear and non-linear data structures?
**Q**: Why do hash tables contain both linear and non-linear data structures?
The underlying structure of a hash table is an array. To resolve hash collisions, we may use "chaining" (discussed in a later section, "Hash collision"): each bucket in the array points to a linked list, which may transform into a tree (usually a red-black tree) when its length is larger than a certain threshold.
From a storage perspective, the underlying structure of a hash table is an array, where each bucket might contain a value, a linked list, or a tree. Therefore, hash tables may contain both linear data structures (arrays, linked lists) and non-linear data structures (trees).
The underlying structure of a hash table is an array. To resolve hash collisions, we may use "chaining" (discussed in the subsequent "Hash Collision" section): each bucket in the array points to a linked list, which may be converted to a tree (usually a red-black tree) when the list length exceeds a certain threshold.
From a storage perspective, the underlying structure of a hash table is an array, where each bucket slot may contain a value, a linked list, or a tree. Therefore, hash tables may contain both linear data structures (arrays, linked lists) and non-linear data structures (trees).
**Q**: Is the length of the `char` type 1 byte?
The length of the `char` type is determined by the encoding method of the programming language. For example, Java, JavaScript, TypeScript, and C# all use UTF-16 encoding (to save Unicode code points), so the length of the `char` type is 2 bytes.
The length of the `char` type is determined by the encoding method used by the programming language. For example, Java, JavaScript, TypeScript, and C# all use UTF-16 encoding (to store Unicode code points), so the `char` type has a length of 2 bytes.
**Q**: Is there any ambiguity when we refer to array-based data structures as "static data structures"? The stack can also perform "dynamic" operations such as popping and pushing.
**Q**: Is there ambiguity in referring to array-based data structures as "static data structures"? Stacks can also perform "dynamic" operations such as push and pop.
The stack can implement dynamic data operations, but the data structure is still "static" (the length is fixed). Although array-based data structures can dynamically add or remove elements, their capacity is fixed. If the stack size exceeds the pre-allocated size, then the old array will be copied into a newly created and larger array.
Stacks can indeed implement dynamic data operations, but the data structure is still "static" (fixed length). Although array-based data structures can dynamically add or remove elements, their capacity is fixed. If the data volume exceeds the pre-allocated size, a new larger array needs to be created, and the contents of the old array must be copied to the new array.
**Q**: When building a stack (queue), its size is not specified, so why are they "static data structures"?
**Q**: When constructing a stack (queue), its size is not specified. Why are they "static data structures"?
In high-level programming languages, we do not need to manually specify the initial capacity of stacks (queues); this task is automatically completed within the class. For example, the initial capacity of Java's `ArrayList` is usually 10. Furthermore, the expansion operation is also completed automatically. See the subsequent "List" chapter for details.
In high-level programming languages, we do not need to manually specify the initial capacity of a stack (queue); this work is automatically completed within the class. For example, the initial capacity of Java's `ArrayList` is typically 10. Additionally, the expansion operation is also automatically implemented. See the subsequent "List" section for details.
**Q**The method of converting the sign-magnitude to the 2's complement is "first negate and then add 1", so converting the 2's complement to the sign-magnitude should be its inverse operation "first subtract 1 and then negate".
However, the 2's complement can also be converted to the sign-magnitude through "first negate and then add 1", why is this?
**Q**: The method of converting sign-magnitude to 2's complement is "first negate then add 1". So converting 2's complement to sign-magnitude should be the inverse operation "first subtract 1 then negate". However, 2's complement can also be converted to sign-magnitude through "first negate then add 1". Why is this?
**A**This is because the mutual conversion between the sign-magnitude and the 2's complement is equivalent to computing the "complement". We first define the complement: assuming $a + b = c$, then we say that $a$ is the complement of $b$ to $c$, and vice versa, $b$ is the complement of $a$ to $c$.
This is because the mutual conversion between sign-magnitude and 2's complement is actually the process of computing the "complement". Let us first define the complement: assuming $a + b = c$, then we say that $a$ is the complement of $b$ to $c$, and conversely, $b$ is the complement of $a$ to $c$.
Given a binary number $0010$ with length $n = 4$, if this number is the sign-magnitude (ignoring the sign bit), then its 2's complement can be obtained by "first negating and then adding 1":
Given an $n = 4$ bit binary number $0010$, if we treat this number as sign-magnitude (ignoring the sign bit), then its 2's complement can be obtained through "first negate then add 1":
$$
0010 \rightarrow 1101 \rightarrow 1110
$$
Observe that the sum of the sign-magnitude and the 2's complement is $0010 + 1110 = 10000$, i.e., the 2's complement $1110$ is the "complement" of the sign-magnitude $0010$ to $10000$. **This means that the above "first negate and then add 1" is equivalent to computing the complement to $10000$**.
We find that the sum of sign-magnitude and 2's complement is $0010 + 1110 = 10000$, which means the 2's complement $1110$ is the "complement" of sign-magnitude $0010$ to $10000$. **This means the above "first negate then add 1" is actually the process of computing the complement to $10000$**.
So, what is the "complement" of $1110$ to $10000$? We can still compute it by "negating first and then adding 1":
So, what is the "complement" of 2's complement $1110$ to $10000$? We can still use "first negate then add 1" to obtain it:
$$
1110 \rightarrow 0001 \rightarrow 0010
$$
In other words, the sign-magnitude and the 2's complement are each other's "complement" to $10000$, so "sign-magnitude to 2's complement" and "2's complement to sign-magnitude" can be implemented with the same operation (first negate and then add 1).
In other words, sign-magnitude and 2's complement are each other's "complement" to $10000$, so "sign-magnitude to 2's complement" and "2's complement to sign-magnitude" can be implemented using the same operation (first negate then add 1).
Of course, we can also use the inverse operation of "first negate and then add 1" to find the sign-magnitude of the 2's complement $1110$, that is, "first subtract 1 and then negate":
Of course, we can also use the inverse operation to find the sign-magnitude of 2's complement $1110$, that is, "first subtract 1 then negate":
$$
1110 \rightarrow 1101 \rightarrow 0010
$$
To sum up, "first negate and then add 1" and "first subtract 1 and then negate" are both computing the complement to $10000$, and they are equivalent.
In summary, both "first negate then add 1" and "first subtract 1 then negate" are computing the complement to $10000$, and they are equivalent.
Essentially, the "negate" operation is actually to find the complement to $1111$ (because `sign-magnitude + 1's complement = 1111` always holds); and the 1's complement plus 1 is equal to the 2's complement to $10000$.
Essentially, the "negate" operation is actually finding the complement to $1111$ (because `sign-magnitude + 1's complement = 1111` always holds); and adding 1 to the 1's complement yields the 2's complement, which is the complement to $10000$.
We take $n = 4$ as an example in the above, and it can be generalized to any binary number with any number of digits.
The above uses $n = 4$ as an example, and it can be generalized to binary numbers of any number of bits.