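Contents
1. URL
1.1. Key excerpts from the official specification (RFC 1738)
1.2. Key excerpts from the MDN documentation
2. URI
2.1. Key excerpts from the official specification (RFC 3986)
3. URI, URL, URN
3.1. Key excerpts from the official specification (RFC 3305)
4. IRI
4.1. Key excerpts from the official specification (RFC 3987)
1. URL
1.1. Key excerpts from the official specification (RFC 1738)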
1. Introduction
This document describes the syntax and semantics for a compact string representation for a resource available via the Internet. (Note: a URL is the string representation of a resource available on the Internet; the emphasis is on "the Internet".) These strings are called "Uniform Resource Locators" (URLs).
2. General URL Syntax
Just as there are many different methods of access to resources, there are several schemes for describing the location of such resources.
URLs are used to `locate' resources, by providing an abstract identification of the resource location.
2.1. The main parts of URLs
In general, URLs are written as follows:
<scheme>:<scheme-specific-part>
A URL contains the name of the scheme being used (<scheme>) followed by a colon and then a string (the <scheme-specific-part>) whose interpretation depends on the scheme.
Scheme names consist of a sequence of characters. The lower case letters "a"--"z", digits, and the characters plus ("+"), period ("."), and hyphen ("-") are allowed. For resiliency, programs interpreting URLs should treat upper case letters as equivalent to lower case in scheme names (e.g., allow "HTTP" as well as "http").
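The scheme rule above can be sketched in a few lines of Python. This is a minimal illustration rather than a full URL parser; the names `SCHEME_RE` and `split_url` are made up for this example.

```python
import re

# Scheme-name characters as listed in the RFC 1738 excerpt: letters, digits,
# "+", "." and "-". Upper case is accepted too, for resiliency.
SCHEME_RE = re.compile(r"^([A-Za-z0-9+.-]+):(.*)$", re.DOTALL)


def split_url(url):
    """Split a URL into (scheme, scheme-specific-part)."""
    match = SCHEME_RE.match(url)
    if match is None:
        raise ValueError("no valid scheme in {!r}".format(url))
    # Treat "HTTP" the same as "http" by normalizing to lower case.
    return match.group(1).lower(), match.group(2)


print(split_url("HTTP://www.example.com/index.html"))
print(split_url("mailto:someone@example.com"))
```

Everything after the first colon is returned untouched, because its interpretation depends on the scheme.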
2.2. URL Character Encoding Issues
In most URL schemes, the sequences of characters in different parts of a URL are used to represent sequences of octets used in Internet protocols. For example, in the ftp scheme, the host name, directory name and file names are such sequences of octets, represented by parts of the URL. Within those parts, an octet may be represented by the character which has that octet as its code within the US-ASCII [20] coded character set.
In addition, octets may be encoded by a character triplet consisting of the character "%" followed by the two hexadecimal digits (from "0123456789ABCDEF") which form the hexadecimal value of the octet. (The characters "abcdef" may also be used in hexadecimal encodings.)
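As a small sketch of the triplet rule, the two helpers below convert between an octet and its "%XX" form. The function names `encode_octet` and `decode_triplet` are hypothetical, chosen only for this illustration.

```python
import string


def encode_octet(octet):
    """Return the "%XX" triplet for one octet (0-255), using upper-case hex."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0-255")
    return "%{:02X}".format(octet)


def decode_triplet(triplet):
    """Return the octet for a "%XX" triplet; hex digits may be upper or lower case."""
    is_triplet = (
        len(triplet) == 3
        and triplet[0] == "%"
        and all(c in string.hexdigits for c in triplet[1:])
    )
    if not is_triplet:
        raise ValueError("not a percent-encoded triplet: {!r}".format(triplet))
    return int(triplet[1:], 16)


print(encode_octet(0x20))      # the space character
print(decode_triplet("%2f"))   # lower-case hex digits are accepted too
```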
Octets must be encoded if they have no corresponding graphic character within the US-ASCII coded character set, if the use of the corresponding character is unsafe, or if the corresponding character is reserved for some other interpretation within the particular URL scheme.
- No corresponding graphic US-ASCII:
- URLs are written only with the graphic printable characters of the US-ASCII coded character set. The octets 80-FF hexadecimal are not used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent control characters; these must be encoded.
- Unsafe:
- Characters can be unsafe for a number of reasons. The space character is unsafe because significant spaces may disappear and insignificant spaces may be introduced when URLs are transcribed or typeset or subjected to the treatment of word-processing programs. The characters "<" and ">" are unsafe because they are used as the delimiters around URLs in free text; the quote mark (") is used to delimit URLs in some systems. The character "#" is unsafe and should always be encoded because it is used in World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it. The character "%" is unsafe because it is used for encodings of other characters. Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".
- All unsafe characters must always be encoded within a URL. For example, the character "#" must be encoded within URLs even in systems that do not normally deal with fragment or anchor identifiers, so that if the URL is copied into another system that does use them, it will not be necessary to change the URL encoding.
- Reserved:
- Many URL schemes reserve certain characters for a special meaning: their appearance in the scheme-specific part of the URL has a designated semantics. If the character corresponding to an octet is reserved in a scheme, the octet must be encoded. The characters ";", "/", "?", ":", "@", "=" and "&" are the characters which may be reserved for special meaning within a scheme. No other characters may be reserved within a scheme.
- Usually a URL has the same interpretation when an octet is represented by a character and when it is encoded. However, this is not true for reserved characters: encoding a character reserved for a particular scheme may change the semantics of a URL.
- Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.
- On the other hand, characters that are not required to be encoded (including alphanumerics) may be encoded within the scheme-specific part of a URL, as long as they are not being used for a reserved purpose.
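The three rules above (no graphic US-ASCII character, unsafe, reserved) can be folded into one small encoder. This is a sketch under stated assumptions: `encode_component` and its `keep_reserved` parameter are hypothetical names, and encoding non-ASCII characters as UTF-8 octets is a choice made here, since RFC 1738 does not prescribe a character set.

```python
import string

# Characters RFC 1738 allows unencoded anywhere: alphanumerics plus "$-_.+!*'(),".
ALWAYS_OK = set(string.ascii_letters + string.digits + "$-_.+!*'(),")
# Characters a scheme may reserve for a special meaning.
RESERVED = set(";/?:@=&")


def encode_component(text, keep_reserved=""):
    """Percent-encode text, leaving only characters RFC 1738 allows unencoded.

    keep_reserved lists the reserved characters that are being used for
    their reserved purpose and therefore stay as they are.
    """
    allowed = ALWAYS_OK | (RESERVED & set(keep_reserved))
    out = []
    for ch in text:
        if ch in allowed:
            out.append(ch)
        else:
            # Assumption: characters are turned into octets via UTF-8.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


print(encode_component("/docs/my file#1.txt", keep_reserved="/"))
print(encode_component("a/b"))  # "/" is data here, so it gets encoded
```

Passing the reserved characters explicitly mirrors the RFC's point that the same character is either a delimiter or data depending on how the scheme uses it.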
1.2. Key excerpts from the MDN documentation
1. Summary
With Hypertext and HTTP, URL is one of the key concepts of the Web. It is the mechanism used by browsers to retrieve any published resource on the web.
URL stands for Uniform Resource Locator. A URL is nothing more than the address of a given unique resource on the Web. In theory, each valid URL points to a unique resource. Such resources can be an HTML page, a CSS document, an image, etc. In practice, there are some exceptions, the most common being a URL pointing to a resource that no longer exists or that has moved. As the resource represented by the URL and the URL itself are handled by the Web server, it is up to the owner of the web server to carefully manage that resource and its associated URL.
2. Basics: anatomy of a URL
Tip: You might think of a URL like a regular postal mail address: the scheme represents the postal service you want to use, the domain name is the city or town, and the port is like the zip code; the path represents the building where your mail should be delivered; the parameters represent extra information such as the number of the apartment in the building; and, finally, the anchor represents the actual person to whom you've addressed your mail.
3. Scheme
The first part of the URL is the scheme, which indicates the protocol that the browser must use to request the resource (a protocol is a set method for exchanging or transferring data around a computer network). Usually for websites the protocol is HTTPS or HTTP (its unsecured version). Addressing web pages requires one of these two, but browsers also know how to handle other schemes such as mailto: (to open a mail client), so don't be surprised if you see other protocols.
4. Authority
Next follows the authority, which is separated from the scheme by the character pattern ://. If present the authority includes both the domain (e.g. www.example.com) and the port (80), separated by a colon:
- The domain indicates which Web server is being requested. Usually this is a domain name, but an IP address may also be used (but this is rare as it is much less convenient).
- The port indicates the technical "gate" used to access the resources on the web server. It is usually omitted if the web server uses the standard ports of the HTTP protocol (80 for HTTP and 443 for HTTPS) to grant access to its resources. Otherwise it is mandatory.
Note: The separator between the scheme and authority is ://. The colon separates the scheme from the next part of the URL, while // indicates that the next part of the URL is the authority.
One example of a URL that doesn't use an authority is the mail client (mailto:foobar). It contains a scheme but doesn't use an authority component. Therefore, the colon is not followed by two slashes and only acts as a delimiter between the scheme and mail address. (Note in particular: what follows // is the authority component. A scheme such as mailto has no authority, so mailto: is not followed by //.)
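The difference between a URL with an authority and one without can be checked with `urlsplit` from Python's standard `urllib.parse` module. The URLs below are illustrative examples only.

```python
from urllib.parse import urlsplit

# A URL with an authority: "//" introduces host and (optional) port.
web = urlsplit("https://www.example.com:8080/path/to/myfile.html")
print(web.scheme)    # https
print(web.netloc)    # www.example.com:8080
print(web.hostname)  # www.example.com
print(web.port)      # 8080

# A URL without an authority: the colon only separates scheme and address.
mail = urlsplit("mailto:foobar")
print(mail.scheme)        # mailto
print(repr(mail.netloc))  # '' (no authority component)
print(mail.path)          # foobar

# When the port is omitted, urlsplit reports None rather than guessing 80 or 443.
print(urlsplit("http://www.example.com/").port)
```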
5. Path to resource
/path/to/myfile.html is the path to the resource on the Web server. In the early days of the Web, a path like this represented a physical file location on the Web server. Nowadays, it is mostly an abstraction handled by Web servers without any physical reality.
6. Parameters
?key1=value1&key2=value2 are extra parameters provided to the Web server. Those parameters are a list of key/value pairs separated with the & symbol. The Web server can use those parameters to do extra stuff before returning the resource. Each Web server has its own rules regarding parameters, and the only reliable way to know if a specific Web server is handling parameters is by asking the Web server owner.
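A short sketch of how the key/value pairs can be pulled out of the query string, using `parse_qsl` and `parse_qs` from Python's standard library. The URL is the illustrative one used in this section.

```python
from urllib.parse import urlsplit, parse_qs, parse_qsl

url = "https://www.example.com/path/to/myfile.html?key1=value1&key2=value2"
query = urlsplit(url).query

# parse_qsl keeps the pairs in order; parse_qs groups repeated keys into lists.
pairs = parse_qsl(query)
params = parse_qs(query)

print(query)
print(pairs)
print(params)
```

`parse_qs` maps every key to a list because the same key may legally appear more than once in a query string.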
7. Anchor
#SomewhereInTheDocument is an anchor to another part of the resource itself. An anchor represents a sort of "bookmark" inside the resource, giving the browser the directions to show the content located at that "bookmarked" spot. On an HTML document, for example, the browser will scroll to the point where the anchor is defined; on a video or audio document, the browser will try to go to the time the anchor represents. It is worth noting that the part after the #, also known as the fragment identifier, is never sent to the server with the request.
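Because the fragment never reaches the server, a client effectively splits it off before sending the request. A minimal sketch of that split with the standard-library `urldefrag` (the URL is illustrative):

```python
from urllib.parse import urldefrag

url = "https://www.example.com/path/to/myfile.html?key1=value1#SomewhereInTheDocument"

# urldefrag separates the part that is requested from the fragment
# that stays on the client side.
request_url, fragment = urldefrag(url)

print(request_url)
print(fragment)
```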
2. URI
2.1. Key excerpts from the official specification (RFC 3986)
1. Introduction
A Uniform Resource Identifier (URI) provides a simple and extensible means for identifying a resource.
ftp://ftp.is.co.za/rfc/rfc1808.txt
http://www.ietf.org/rfc/rfc2396.txt
ldap://[2001:db8::7]/c=GB?objectClass?one
mailto:John.Doe@example.com
news:comp.infosystems.www.servers.unix
tel:+1-816-555-1212
telnet://192.0.2.16:80/
urn:oasis:names:specification:docbook:dtd:xml:4.1.2
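To see how such different-looking identifiers share one generic syntax, the sketch below applies the regular expression given in Appendix B of RFC 3986 to a few of the examples. The helper name `split_uri` is hypothetical; the expression splits a string into components but does not validate it.

```python
import re

# Component-splitting expression from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri):
    """Split a URI reference into its five generic components."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


for uri in (
    "ldap://[2001:db8::7]/c=GB?objectClass?one",
    "mailto:John.Doe@example.com",
    "telnet://192.0.2.16:80/",
    "urn:oasis:names:specification:docbook:dtd:xml:4.1.2",
):
    print(split_uri(uri))
```

Components that are absent come back as `None`, which distinguishes a missing authority from an empty one.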
1.1 Overview of URIs
URIs are characterized as follows:
- Uniform
- Uniformity provides several benefits. It allows different types of resource identifiers to be used in the same context, even when the mechanisms used to access those resources may differ. It allows uniform semantic interpretation of common syntactic conventions across different types of resource identifiers. It allows introduction of new types of resource identifiers without interfering with the way that existing identifiers are used. It allows the identifiers to be reused in many different contexts, thus permitting new applications or protocols to leverage a pre-existing, large, and widely used set of resource identifiers.
- Resource
- This specification does not limit the scope of what might be a resource; rather, the term "resource" is used in a general sense for whatever might be identified by a URI. Familiar examples include an electronic document, an image, a source of information with a consistent purpose (e.g., "today's weather report for Los Angeles"), a service (e.g., an HTTP-to-SMS gateway), and a collection of other resources. A resource is not necessarily accessible via the Internet; e.g., human beings, corporations, and bound books in a library can also be resources. Likewise, abstract concepts can be resources, such as the operators and operands of a mathematical equation, the types of a relationship (e.g., "parent" or "employee"), or numeric values (e.g., zero, one, and infinity).
- Identifier
- An identifier embodies the information required to distinguish what is being identified from all other things within its scope of identification. Our use of the terms "identify" and "identifying" refer to this purpose of distinguishing one resource from all other resources, regardless of how that purpose is accomplished (e.g., by name, address, or context). These terms should not be mistaken as an assumption that an identifier defines or embodies the identity of what is referenced, though that may be the case for some identifiers. Nor should it be assumed that a system using URIs will access the resource identified: in many cases, URIs are used to denote resources without any intention that they be accessed. Likewise, the "one" resource identified might not be singular in nature (e.g., a resource might be a named set or a mapping that varies over time).
3. URI, URL, URN
3.1. Key excerpts from the official specification (RFC 3305)
2.1 Classical View
During the early years of discussion of web identifiers (early to mid 90s), people assumed that an identifier type would be cast into one of two (or possibly more) classes. An identifier might specify the location of a resource (a URL) or its name (a URN), independent of location. Thus a URI was either a URL or a URN. There was discussion about generalizing this by the addition of a discrete number of additional classes; for example, a URI might point to metadata rather than the resource itself, in which case the URI would be a URC (citation). URI space was thus viewed as partitioned into subspaces: URL, URN, and additional subspaces to be defined. The only such additional space ever proposed was Uniform Resource Characteristics (URC) and there never was any buy-in; so without loss of generality, it's reasonable to say that URI space was thought to be partitioned into two classes: URL and URN. Thus, for example, "http:" was a URL scheme, and "isbn:" would (someday) be a URN scheme. Any new scheme would be cast into one of these two classes.
2.2 Contemporary View
Over time, the importance of this additional level of hierarchy seemed to lessen; the view became that an individual scheme did not need to be cast into one of a discrete set of URI types, such as "URL", "URN", "URC", etc. Web-identifier schemes are, in general, URI schemes, as a given URI scheme may define subspaces. Thus "http:" is a URI scheme. "urn:" is also a URI scheme; it defines subspaces, called "namespaces". For example, the set of URNs, of the form "urn:isbn:n-nn-nnnnnn-n", is a URN namespace. ("isbn" is a URN namespace identifier. It is not a "URN scheme", nor is it a "URI scheme.")
Further, according to the contemporary view, the term "URL" does not refer to a formal partition of URI space; rather, URL is a useful but informal concept. A URL is a type of URI that identifies a resource via a representation of its primary access mechanism (e.g., its network "location"), rather than by some other attributes it may have. Thus, as we noted, "http:" is a URI scheme. An http URI is a URL. The phrase "URL scheme" is now used infrequently, usually to refer to some subclass of URI schemes which exclude URNs.
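The scheme/namespace distinction can be made concrete with a tiny parser. This is a simplified sketch, not a validator for the full URN syntax: `parse_urn` is a hypothetical helper, and the ISBN value is only a sample.

```python
def parse_urn(urn):
    """Split "urn:<NID>:<NSS>" into (namespace identifier, namespace-specific string)."""
    scheme, sep, rest = urn.partition(":")
    # "urn" is the URI scheme; it is matched case-insensitively.
    if not sep or scheme.lower() != "urn":
        raise ValueError("not a urn: URI: {!r}".format(urn))
    nid, sep, nss = rest.partition(":")
    if not sep or not nid or not nss:
        raise ValueError("missing namespace identifier or string: {!r}".format(urn))
    # The namespace identifier (e.g. "isbn") is not itself a scheme.
    return nid.lower(), nss


print(parse_urn("urn:isbn:0-395-36341-1"))
print(parse_urn("urn:oasis:names:specification:docbook:dtd:xml:4.1.2"))
```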
4. IRI
4.1. Key excerpts from the official specification (RFC 3987)
1. Introduction
1.1. Overview and Motivation
A Uniform Resource Identifier (URI) is defined in [RFC3986] as a sequence of characters chosen from a limited subset of the repertoire of US-ASCII [ASCII] characters. (Note: the characters that make up a URI are picked from the US-ASCII character set, which is a very small range.)
The characters in URIs are frequently used for representing words of natural languages. This usage has many advantages: Such URIs are easier to memorize, easier to interpret, easier to transcribe, easier to create, and easier to guess. For most languages other than English, however, the natural script uses characters other than A - Z. For many people, handling Latin characters is as difficult as handling the characters of other scripts is for those who use only the Latin alphabet. Many languages with non-Latin scripts are transcribed with Latin letters. These transcriptions are now often used in URIs, but they introduce additional ambiguities. (Summary: URIs usually carry natural-language words, but words from languages other than English have to be transcribed or encoded first, which adds complexity and ambiguity.)
The infrastructure for the appropriate handling of characters from local scripts is now widely deployed in local versions of operating system and application software. Software that can handle a wide variety of scripts and languages at the same time is increasingly common. Also, increasing numbers of protocols and formats can carry a wide range of characters. (Summary: operating systems and application software have become far better at handling characters and can generally cope with a very wide range of them.)
This document defines a new protocol element called Internationalized Resource Identifier (IRI) by extending the syntax of URIs to a much wider repertoire of characters. (Summary: the IRI, or Internationalized Resource Identifier, extends the URI syntax and has a much larger character repertoire.)
2. IRI Syntax
As with URIs, an IRI is defined as a sequence of characters, not as a sequence of octets. This definition accommodates the fact that IRIs may be written on paper or read over the radio as well as stored or transmitted digitally. (Summary: an IRI is a string of characters rather than a string of octets, because an IRI is not only transmitted digitally but may also be written on paper.)
2.1. Summary of IRI Syntax
IRIs are defined similarly to URIs in [RFC3986], but the class of unreserved characters is extended by adding the characters of the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject to the limitations given in the syntax rules below and in section 6.1. Otherwise, the syntax and use of components and reserved characters is the same as that in [RFC3986]. (Summary: the unreserved class is extended with the UCS characters beyond U+007F; the remaining rules follow the same pattern as for URIs.)
3. Relationship between IRIs and URIs
IRIs are meant to replace URIs in identifying resources for protocols, formats, and software components that use a UCS-based character repertoire. These protocols and components may never need to use URIs directly, especially when the resource identifier is used simply for identification purposes. However, when the resource identifier is used for resource retrieval, it is in many cases necessary to determine the associated URI, because currently most retrieval mechanisms are only defined for URIs. In this case, IRIs can serve as presentation elements for URI protocol elements. An example would be an address bar in a Web user agent. (Summary: many protocols that use IRIs only need them to identify resources, not to retrieve them, so no conversion to a URI is needed. To retrieve a resource through an IRI, however, it still has to be converted to a URI, because most of today's retrieval infrastructure is defined only for URIs.)
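The core of determining the associated URI is to percent-encode every character beyond U+007F as its UTF-8 octets. The sketch below shows only that step; it deliberately skips Unicode normalization and any special handling of host names, and `iri_to_uri` is a hypothetical helper name.

```python
def iri_to_uri(iri):
    """Percent-encode all non-ASCII characters of an IRI as UTF-8 octets."""
    out = []
    for ch in iri:
        if ord(ch) > 0x7F:
            # Characters beyond U+007F become one "%XX" triplet per UTF-8 octet.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
        else:
            # ASCII characters, including existing "%XX" triplets, are left alone.
            out.append(ch)
    return "".join(out)


print(iri_to_uri("http://www.example.com/r\u00e9sum\u00e9.html"))
```

An address bar can keep showing the readable IRI while a URI produced along these lines is what goes into the retrieval request.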
References:
- URL (RFC 1738): https://www.ietf.org/rfc/rfc1738.txt
- What is a URL? (MDN): https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL
- URN (MDN glossary): https://developer.mozilla.org/en-US/docs/Glossary/URN
- URI (MDN glossary): https://developer.mozilla.org/en-US/docs/Glossary/URI
- URI (RFC 3986): https://www.ietf.org/rfc/rfc3986.txt
- URI, URL, URN (RFC 3305): https://www.ietf.org/rfc/rfc3305.txt
- IRI (RFC 3987): https://www.ietf.org/rfc/rfc3987.txt