GitHub Flavored Markdown Spec
- 1 Introduction
- 2 Preliminaries
- 3 Blocks and inlines
- 4 Leaf blocks
- 5 Container blocks
- 6 Inlines
  - 6.1 Backslash escapes
  - 6.2 Entity and numeric character references
  - 6.3 Code spans
  - 6.4 Emphasis and strong emphasis
  - 6.5 Strikethrough (extension)
  - 6.6 Links
  - 6.7 Images
  - 6.8 Autolinks
  - 6.9 Autolinks (extension)
  - 6.10 Raw HTML
  - 6.11 Disallowed Raw HTML (extension)
  - 6.12 Hard line breaks
  - 6.13 Soft line breaks
  - 6.14 Textual content
- Appendix: A parsing strategy
1 Introduction
1.1 What is GitHub Flavored Markdown?
GitHub Flavored Markdown, often shortened as GFM, is the dialect of Markdown that is currently supported for user content on GitHub.com and GitHub Enterprise.
This formal specification, based on the CommonMark Spec, defines the syntax and semantics of this dialect.
GFM is a strict superset of CommonMark. All the features that are supported in GitHub user content but not specified in the original CommonMark Spec are hence known as extensions, and highlighted as such.
While GFM supports a wide range of inputs, it’s worth noting that GitHub.com and GitHub Enterprise perform additional post-processing and sanitization after GFM is converted to HTML to ensure security and consistency of the website.
1.2 What is Markdown?
Markdown is a plain text format for writing structured documents, based on conventions for indicating formatting in email and usenet posts. It was developed by John Gruber (with help from Aaron Swartz) and released in 2004 in the form of a syntax description and a Perl script (Markdown.pl) for converting Markdown to HTML. In the next decade, dozens of implementations were developed in many languages. Some extended the original Markdown syntax with conventions for footnotes, tables, and other document elements. Some allowed Markdown documents to be rendered in formats other than HTML. Websites like Reddit, StackOverflow, and GitHub had millions of people using Markdown. And Markdown started to be used beyond the web, to author books, articles, slide shows, letters, and lecture notes.
What distinguishes Markdown from many other lightweight markup syntaxes, which are often easier to write, is its readability. As Gruber writes:
The overriding design goal for Markdown’s formatting syntax is to make it as readable as possible. The idea is that a Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions. (http://daringfireball.net/projects/markdown/)
The point can be illustrated by comparing a sample of AsciiDoc with an equivalent sample of Markdown. Here is a sample of AsciiDoc from the AsciiDoc manual:
1. List item one.
+
List item one continued with a second paragraph followed by an
Indented block.
+
.................
$ ls *.sh
$ mv *.sh ~/tmp
.................
+
List item continued with a third paragraph.
2. List item two continued with an open block.
+
--
This paragraph is part of the preceding list item.
a. This list is nested and does not require explicit item
continuation.
+
This paragraph is part of the preceding list item.
b. List item b.
This paragraph belongs to item two of the outer list.
--
And here is the equivalent in Markdown:
1.  List item one.

    List item one continued with a second paragraph followed by an
    Indented block.

        $ ls *.sh
        $ mv *.sh ~/tmp

    List item continued with a third paragraph.

2.  List item two continued with an open block.

    This paragraph is part of the preceding list item.

    1. This list is nested and does not require explicit item continuation.

       This paragraph is part of the preceding list item.

    2. List item b.

    This paragraph belongs to item two of the outer list.
The AsciiDoc version is, arguably, easier to write. You don’t need to worry about indentation. But the Markdown version is much easier to read. The nesting of list items is apparent to the eye in the source, not just in the processed document.
1.3 Why is a spec needed?
John Gruber’s canonical description of Markdown’s syntax does not specify the syntax unambiguously. Here are some examples of questions it does not answer:
- How much indentation is needed for a sublist? The spec says that continuation paragraphs need to be indented four spaces, but is not fully explicit about sublists. It is natural to think that they, too, must be indented four spaces, but Markdown.pl does not require that. This is hardly a “corner case,” and divergences between implementations on this issue often lead to surprises for users in real documents. (See this comment by John Gruber.)

- Is a blank line needed before a block quote or heading? Most implementations do not require the blank line. However, this can lead to unexpected results in hard-wrapped text, and also to ambiguities in parsing (note that some implementations put the heading inside the blockquote, while others do not). (John Gruber has also spoken in favor of requiring the blank lines.)

- Is a blank line needed before an indented code block? (Markdown.pl requires it, but this is not mentioned in the documentation, and some implementations do not require it.)

      paragraph
          code?

- What is the exact rule for determining when list items get wrapped in <p> tags? Can a list be partially “loose” and partially “tight”? What should we do with a list like this?

      1. one

      2. two
      3. three

  Or this?

      1.  one
          - a

          - b
      2.  two

  (There are some relevant comments by John Gruber here.)

- Can list markers be indented? Can ordered list markers be right-aligned?

       8. item 1
       9. item 2
      10. item 2a

- Is this one list with a thematic break in its second item, or two lists separated by a thematic break?

      * a
      * * * * *
      * b

- When list markers change from numbers to bullets, do we have two lists or one? (The Markdown syntax description suggests two, but the Perl scripts and many other implementations produce one.)

      1. fee
      2. fie
      -  foe
      -  fum

- What are the precedence rules for the markers of inline structure? For example, is the following a valid link, or does the code span take precedence?

      [a backtick (`)](/url) and [another backtick (`)](/url).

- What are the precedence rules for markers of emphasis and strong emphasis? For example, how should the following be parsed?

      *foo *bar* baz*

- What are the precedence rules between block-level and inline-level structure? For example, how should the following be parsed?

      - `a long code span can contain a hyphen like this
        - and it can screw things up`

- Can list items include section headings? (Markdown.pl does not allow this, but does allow blockquotes to include headings.)

      - # Heading

- Can list items be empty?

      * a
      *
      * b

- Can link references be defined inside block quotes or list items?

      > Blockquote [foo].
      >
      > [foo]: /url

- If there are multiple definitions for the same reference, which takes precedence?

      [foo]: /url1
      [foo]: /url2

      [foo][]

In the absence of a spec, early implementers consulted Markdown.pl to resolve these ambiguities. But Markdown.pl was quite buggy, and gave manifestly bad results in many cases, so it was not a satisfactory replacement for a spec.
Because there is no unambiguous spec, implementations have diverged considerably. As a result, users are often surprised to find that a document that renders one way on one system (say, a GitHub wiki) renders differently on another (say, converting to docbook using pandoc). To make matters worse, because nothing in Markdown counts as a “syntax error,” the divergence often isn’t discovered right away.
1.4 About this document
This document attempts to specify Markdown syntax unambiguously. It contains many examples with side-by-side Markdown and HTML. These are intended to double as conformance tests. An accompanying script spec_tests.py can be used to run the tests against any Markdown program:
python test/spec_tests.py --spec spec.txt --program PROGRAM
Since this document describes how Markdown is to be parsed into an abstract syntax tree, it would have made sense to use an abstract representation of the syntax tree instead of HTML. But HTML is capable of representing the structural distinctions we need to make, and the choice of HTML for the tests makes it possible to run the tests against an implementation without writing an abstract syntax tree renderer.
This document is generated from a text file, spec.txt, written in Markdown with a small extension for the side-by-side tests. The script tools/makespec.py can be used to convert spec.txt into HTML or CommonMark (which can then be converted into other formats).
In the examples, the → character is used to represent tabs.
2 Preliminaries
2.1 Characters and lines
Any sequence of characters is a valid CommonMark document.
A character is a Unicode code point. Although some code points (for example, combining accents) do not correspond to characters in an intuitive sense, all code points count as characters for purposes of this spec.
This spec does not specify an encoding; it thinks of lines as composed of characters rather than bytes. A conforming parser may be limited to a certain encoding.
A line is a sequence of zero or more characters other than newline (U+000A) or carriage return (U+000D), followed by a line ending or by the end of file.
A line ending is a newline (U+000A), a carriage return (U+000D) not followed by a newline, or a carriage return and a following newline.
A line containing no characters, or a line containing only spaces (U+0020) or tabs (U+0009), is called a blank line.
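The definitions of line, line ending, and blank line above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names split_lines and is_blank are hypothetical and do not come from the spec or from any reference implementation.

```python
import re

# The three line endings defined above. CRLF is listed first so that a
# carriage return followed by a newline is consumed as one line ending.
_LINE_ENDING = re.compile(r"\r\n|\r|\n")


def split_lines(text):
    """Split text into lines, dropping the line endings (hypothetical helper)."""
    if text == "":
        return []
    lines = _LINE_ENDING.split(text)
    # A final line ending terminates the last line; it does not start a new one.
    if lines[-1] == "":
        lines.pop()
    return lines


def is_blank(line):
    """A blank line contains no characters, or only spaces and tabs."""
    return all(ch in " \t" for ch in line)
```

For instance, split_lines("a\r\nb\rc\nd") yields the four lines a, b, c, and d, whichever of the three line endings separates them.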
The following definitions of character classes will be used in this spec:
A whitespace character is a space (U+0020), tab (U+0009), newline (U+000A), line tabulation (U+000B), form feed (U+000C), or carriage return (U+000D).
Whitespace is a sequence of one or more whitespace characters.
A Unicode whitespace character is any code point in the Unicode Zs general category, or a tab (U+0009), carriage return (U+000D), newline (U+000A), or form feed (U+000C).
Unicode whitespace is a sequence of one or more Unicode whitespace characters.
A space is U+0020.
A non-whitespace character is any character that is not a whitespace character.
An ASCII punctuation character is !, ", #, $, %, &, ', (, ), *, +, ,, -, ., / (U+0021–2F), :, ;, <, =, >, ?, @ (U+003A–0040), [, \, ], ^, _, ` (U+005B–0060), {, |, }, or ~ (U+007B–007E).
A punctuation character is an ASCII punctuation character or anything in the general Unicode categories Pc, Pd, Pe, Pf, Pi, Po, or Ps.
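As a rough illustration, the character classes defined above map onto Python's standard unicodedata module as follows. The function and constant names are hypothetical, each function is assumed to receive a single character, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# Whitespace characters: space, tab, newline, line tabulation, form feed, CR.
WHITESPACE = {" ", "\t", "\n", "\x0b", "\x0c", "\r"}

# The 32 ASCII punctuation characters listed above.
ASCII_PUNCTUATION = set("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")

# General categories that count as punctuation.
_PUNCTUATION_CATEGORIES = {"Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps"}


def is_whitespace(ch):
    return ch in WHITESPACE


def is_unicode_whitespace(ch):
    # Zs general category, or tab, carriage return, newline, or form feed.
    return ch in {"\t", "\r", "\n", "\x0c"} or unicodedata.category(ch) == "Zs"


def is_punctuation(ch):
    return (ch in ASCII_PUNCTUATION
            or unicodedata.category(ch) in _PUNCTUATION_CATEGORIES)
```

Note that the two whitespace classes are not nested: line tabulation (U+000B) is a whitespace character but not a Unicode whitespace character, while a no-break space (U+00A0, category Zs) is the reverse.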
2.2 Tabs
Tabs in lines are not expanded to spaces. However, in contexts where whitespace helps to define block structure, tabs behave as if they were replaced by spaces with a tab stop of 4 characters.
Thus, for example, a tab can be used instead of four spaces in an indented code block. (Note, however, that internal tabs are passed through as literal tabs, not expanded to spaces.)
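The tab-stop rule can be sketched as a column computation: a space advances one column, and a tab advances to the next multiple of four. The helper below is a hypothetical illustration, not part of the spec; it measures indentation from the start of the line and does not account for columns already consumed by an enclosing container such as a block quote.

```python
TAB_STOP = 4


def indent_width(line, tab_stop=TAB_STOP):
    """Return the width, in columns, of the leading spaces and tabs of a line."""
    column = 0
    for ch in line:
        if ch == " ":
            column += 1
        elif ch == "\t":
            # Advance to the next tab stop rather than by a fixed amount.
            column += tab_stop - (column % tab_stop)
        else:
            break
    return column
```

Under this sketch, a line starting with a single tab and a line starting with two spaces followed by a tab both have an indentation width of 4, so either could open an indented code block at the top level.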
In the following example, a continuation paragraph of a list item is indented with a tab; this has exactly the same effect as indentation with four spaces would:
Normally the > that begins a block quote may be followed optionally by a space, which is not considered part of the content. In the following case > is followed by a tab, which is treated as if it were expanded into three spaces. Since one of these spaces is considered part of the delimiter, foo is considered to be indented six spaces inside the block quote context, so we get an indented code block starting with two spaces.
 - foo
   - bar
→ - baz
<ul>
<li>foo
<ul>
<li>bar
<ul>
<li>baz</li>
</ul>
</li>
</ul>
</li>
</ul>
2.3 Insecure characters
For security reasons, the Unicode character U+0000 must be replaced with the REPLACEMENT CHARACTER (U+FFFD).
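In Python this replacement is a one-line preprocessing step. The function name below is hypothetical.

```python
def replace_insecure_characters(text):
    """Replace every U+0000 with the REPLACEMENT CHARACTER (U+FFFD)."""
    return text.replace("\x00", "\ufffd")
```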
3 Blocks and inlines
We can think of a document as a sequence of blocks—structural elements like paragraphs, block quotations, lists, headings, rules, and code blocks. Some blocks (like block quotes and list items) contain other blocks; others (like headings and paragraphs) contain inline content—text, links, emphasized text, images, code spans, and so on.
3.1 Precedence
Indicators of block structure always take precedence over indicators of inline structure. So, for example, the following is a list with two items, not a list with one item containing a code span:
This means that parsing can proceed in two steps: first, the block structure of the document can be discerned; second, text lines inside paragraphs, headings, and other block constructs can be parsed for inline structure. The second step requires information about link reference definitions that will be available only at the end of the first step. Note that the first step requires processing lines in sequence, but the second can be parallelized, since the inline parsing of one block element does not affect the inline parsing of any other.
3.2 Container blocks and leaf blocks
We can divide blocks into two types: container blocks, which can contain other blocks, and leaf blocks, which cannot.
4 Leaf blocks
This section describes the different kinds of leaf block that make up a Markdown document.
4.1 Thematic breaks
A line consisting of 0-3 spaces of indentation, followed by a sequence of three or more matching -, _, or * characters, each followed optionally by any number of spaces or tabs, forms a thematic break.
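The definition above translates fairly directly into a regular expression. The following is a minimal sketch, not the reference implementation: the function name is hypothetical, the input is assumed to be a single line without its line ending, and the check ignores context, so it does not apply the precedence rules for setext headings and list items discussed later in this section.

```python
import re

# 0-3 spaces of indentation, then three or more of the same -, _, or *
# character, each optionally followed by spaces or tabs, and nothing else.
_THEMATIC_BREAK = re.compile(r" {0,3}([-_*])[ \t]*(?:\1[ \t]*){2,}")


def is_thematic_break(line):
    """Return True if the line, taken in isolation, forms a thematic break."""
    return _THEMATIC_BREAK.fullmatch(line) is not None
```

The backreference \1 enforces that all of the non-whitespace characters match, so a mixed line such as *-* is rejected.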
Wrong characters:
Not enough characters:
An indent of one to three spaces is allowed:
Four spaces is too many:
More than three characters may be used:
Spaces are allowed between the characters:
Spaces are allowed at the end:
However, no other characters may occur in the line:
It is required that all of the non-whitespace characters be the same. So, this is not a thematic break:
Thematic breaks do not need blank lines before or after:
Thematic breaks can interrupt a paragraph:
If a line of dashes that meets the above conditions for being a thematic break could also be interpreted as the underline of a setext heading, the interpretation as a setext heading takes precedence. Thus, for example, this is a setext heading, not a paragraph followed by a thematic break:
When both a thematic break and a list item are possible interpretations of a line, the thematic break takes precedence:
If you want a thematic break in a list item, use a different bullet:
4.2 ATX headings
An ATX heading consists of a string of characters, parsed as inline content, between an opening sequence of 1–6 unescaped # characters and an optional closing sequence of any number of unescaped # characters. The opening sequence of # characters must be followed by a space or by the end of line. The optional closing sequence of #s must be preceded by a space and may be followed by spaces only. The opening # character may be indented 0-3 spaces. The raw contents of the heading are stripped of leading and trailing spaces before being parsed as inline content. The heading level is equal to the number of # characters in the opening sequence.
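The rules above can be sketched as a small recognizer that returns the heading level and the raw contents. This is an illustration only: the function name is hypothetical, the input is assumed to be a single line without its line ending, and the raw contents are returned as-is rather than parsed as inline content (so backslash escapes are left in place).

```python
import re

# 0-3 spaces, 1-6 # characters, then either end of line or a space
# followed by the rest of the line.
_ATX_OPENING = re.compile(r" {0,3}(#{1,6})(?: (.*))?")

# An optional closing sequence: #s preceded by at least one space.
_ATX_CLOSING = re.compile(r" +#+$")


def parse_atx_heading(line):
    """Return (level, raw_contents) for an ATX heading, or None otherwise."""
    match = _ATX_OPENING.fullmatch(line)
    if match is None:
        return None
    level = len(match.group(1))
    raw = (match.group(2) or "").strip(" ")
    if re.fullmatch(r"#+", raw):
        # The contents consist only of a closing sequence: empty heading.
        raw = ""
    else:
        raw = _ATX_CLOSING.sub("", raw).strip(" ")
    return level, raw
```

Because the closing pattern requires a preceding space, a line like "# foo#" keeps the trailing # as part of its contents, and seven # characters in a row fail to match the opening sequence at all.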
Simple headings:
# foo
## foo
### foo
#### foo
##### foo
###### foo
<h1>foo</h1>
<h2>foo</h2>
<h3>foo</h3>
<h4>foo</h4>
<h5>foo</h5>
<h6>foo</h6>
More than six # characters is not a heading:
At least one space is required between the # characters and the heading’s contents, unless the heading is empty. Note that many implementations currently do not require the space. However, the space was required by the original ATX implementation, and it helps prevent things like the following from being parsed as headings:
This is not a heading, because the first # is escaped:
Contents are parsed as inlines:
Leading and trailing whitespace is ignored in parsing inline content:
An indentation of one to three spaces is allowed:
Four spaces are too much:
A closing sequence of # characters is optional:
It need not be the same length as the opening sequence:
Spaces are allowed after the closing sequence:
A sequence of # characters with anything but spaces following it is not a closing sequence, but counts as part of the contents of the heading:
The closing sequence must be preceded by a space:
Backslash-escaped # characters do not count as part of the closing sequence:
ATX headings need not be separated from surrounding content by blank lines, and they can interrupt paragraphs:
ATX headings can be empty:
4.3 Setext headings
A setext heading consists of one or more lines of text, each containing at least one non-whitespace character, with no more than 3 spaces indentation, followed by a setext heading underline. The lines of text must be such that, were they not followed by the setext heading underline, they would be interpreted as a paragraph: they cannot be interpretable as a code fence, ATX heading, block quote, thematic break, list item, or HTML block.
A setext heading underline is a sequence of = characters or a sequence of - characters, with no more than 3 spaces indentation and any number of trailing spaces. If a line containing a single - can be interpreted as an empty list item, it should be interpreted this way and not as a setext heading underline.
The heading is a level 1 heading if = characters are used in the setext heading underline, and a level 2 heading if - characters are used. The contents of the heading are the result of parsing the preceding lines of text as CommonMark inline content.
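A minimal sketch of recognizing a setext heading underline and deriving the heading level follows. The function name is hypothetical and the input is assumed to be a single line without its line ending. The check is purely lexical: whether a lone - should instead be read as an empty list item, and whether the preceding lines form a paragraph, are left to the caller.

```python
import re

# Up to 3 spaces of indentation, a run of = or a run of -, trailing spaces.
_SETEXT_UNDERLINE = re.compile(r" {0,3}(=+|-+) *")


def setext_heading_level(line):
    """Return 1 for an = underline, 2 for a - underline, None otherwise."""
    match = _SETEXT_UNDERLINE.fullmatch(line)
    if match is None:
        return None
    return 1 if match.group(1)[0] == "=" else 2
```

Because the run of = or - characters must be contiguous, an underline with internal spaces such as "= =" is not recognized.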
In general, a setext heading need not be preceded or followed by a blank line. However, it cannot interrupt a paragraph, so when a setext heading comes after a paragraph, a blank line is needed between them.
Simple examples:
Foo *bar*
=========
Foo *bar*
---------
<h1>Foo <em>bar</em></h1>
<h2>Foo <em>bar</em></h2>
The content of the header may span more than one line:
The contents are the result of parsing the heading’s raw content as inlines. The heading’s raw content is formed by concatenating the lines and removing initial and final whitespace.
The underlining can be any length:
The heading content can be indented up to three spaces, and need not line up with the underlining:
Four spaces indent is too much:
The setext heading underline can be indented up to three spaces, and may have trailing spaces:
Four spaces is too much:
The setext heading underline cannot contain internal spaces:
Trailing spaces in the content line do not cause a line break:
Nor does a backslash at the end:
Since indicators of block structure take precedence over indicators of inline structure, the following are setext headings:
`Foo
----
`
<a title="a lot
---
of dashes"/>
<h2>`Foo</h2>
<p>`</p>
<h2><a title="a lot</h2>
<p>of dashes"/></p>
The setext heading underline cannot be a lazy continuation line in a list item or block quote:
A blank line is needed between a paragraph and a following setext heading, since otherwise the paragraph becomes part of the heading’s content:
But in general a blank line is not required before or after setext headings:
Setext headings cannot be empty:
Setext heading text lines must not be interpretable as block constructs other than paragraphs. So, the line of dashes in these examples gets interpreted as a thematic break:
If you want a heading with > foo as its literal text, you can use backslash escapes:
Compatibility note: Most existing Markdown implementations do not allow the text of setext headings to span multiple lines. But there is no consensus about how to interpret
Foo
bar
---
baz
One can find four different interpretations:
1. paragraph “Foo”, heading “bar”, paragraph “baz”
2. paragraph “Foo bar”, thematic break, paragraph “baz”
3. paragraph “Foo bar --- baz”
4. heading “Foo bar”, paragraph “baz”
We find interpretation 4 most natural, and interpretation 4 increases the expressive power of CommonMark, by allowing multiline headings. Authors who want interpretation 1 can put a blank line after the first paragraph: