GitHub Flavored Markdown Spec

Posted Lywon的博客

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了GitHub Flavored Markdown Spec相关的知识,希望对你有一定的参考价值。

引用来源: https://github.github.com/gfm
GitHub Flavored Markdown Spec

GitHub Flavored Markdown Spec

Version 0.29-gfm (2019-04-06)
This formal specification is based on the CommonMark Spec by John MacFarlane and licensed under

1Introduction

1.1What is GitHub Flavored Markdown?

GitHub Flavored Markdown, often shortened as GFM, is the dialect of Markdown that is currently supported for user content on GitHub.com and GitHub Enterprise.

This formal specification, based on the CommonMark Spec, defines the syntax and semantics of this dialect.

GFM is a strict superset of CommonMark. All the features which are supported in GitHub user content and that are not specified on the original CommonMark Spec are hence known as extensions, and highlighted as such.

While GFM supports a wide range of inputs, it’s worth noting that GitHub.com and GitHub Enterprise perform additional post-processing and sanitization after GFM is converted to HTML to ensure security and consistency of the website.

1.2What is Markdown?

Markdown is a plain text format for writing structured documents, based on conventions for indicating formatting in email and usenet posts. It was developed by John Gruber (with help from Aaron Swartz) and released in 2004 in the form of a syntax description and a Perl script (Markdown.pl) for converting Markdown to HTML. In the next decade, dozens of implementations were developed in many languages. Some extended the original Markdown syntax with conventions for footnotes, tables, and other document elements. Some allowed Markdown documents to be rendered in formats other than HTML. Websites like Reddit, StackOverflow, and GitHub had millions of people using Markdown. And Markdown started to be used beyond the web, to author books, articles, slide shows, letters, and lecture notes.

What distinguishes Markdown from many other lightweight markup syntaxes, which are often easier to write, is its readability. As Gruber writes:

The overriding design goal for Markdown’s formatting syntax is to make it as readable as possible. The idea is that a Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions. (http://daringfireball.net/projects/markdown/)

The point can be illustrated by comparing a sample of AsciiDoc with an equivalent sample of Markdown. Here is a sample of AsciiDoc from the AsciiDoc manual:

1. List item one.
+
List item one continued with a second paragraph followed by an
Indented block.
+
.................
$ ls *.sh
$ mv *.sh ~/tmp
.................
+
List item continued with a third paragraph.

2. List item two continued with an open block.
+
--
This paragraph is part of the preceding list item.

a. This list is nested and does not require explicit item
continuation.
+
This paragraph is part of the preceding list item.

b. List item b.

This paragraph belongs to item two of the outer list.
--

And here is the equivalent in Markdown:

1.  List item one.

    List item one continued with a second paragraph followed by an
    Indented block.

        $ ls *.sh
        $ mv *.sh ~/tmp

    List item continued with a third paragraph.

2.  List item two continued with an open block.

    This paragraph is part of the preceding list item.

    1. This list is nested and does not require explicit item continuation.

       This paragraph is part of the preceding list item.

    2. List item b.

    This paragraph belongs to item two of the outer list.

The AsciiDoc version is, arguably, easier to write. You don’t need to worry about indentation. But the Markdown version is much easier to read. The nesting of list items is apparent to the eye in the source, not just in the processed document.

1.3Why is a spec needed?

John Gruber’s canonical description of Markdown’s syntax does not specify the syntax unambiguously. Here are some examples of questions it does not answer:

  1. How much indentation is needed for a sublist? The spec says that continuation paragraphs need to be indented four spaces, but is not fully explicit about sublists. It is natural to think that they, too, must be indented four spaces, but Markdown.pl does not require that. This is hardly a “corner case,” and divergences between implementations on this issue often lead to surprises for users in real documents. (See this comment by John Gruber.)

  2. Is a blank line needed before a block quote or heading? Most implementations do not require the blank line. However, this can lead to unexpected results in hard-wrapped text, and also to ambiguities in parsing (note that some implementations put the heading inside the blockquote, while others do not). (John Gruber has also spoken in favor of requiring the blank lines.)

  3. Is a blank line needed before an indented code block? (Markdown.pl requires it, but this is not mentioned in the documentation, and some implementations do not require it.)

    paragraph
        code?
    
  4. What is the exact rule for determining when list items get wrapped in <p> tags? Can a list be partially “loose” and partially “tight”? What should we do with a list like this?

    1. one
    
    2. two
    3. three
    

    Or this?

    1.  one
        - a
    
        - b
    2.  two
    

    (There are some relevant comments by John Gruber here.)

  5. Can list markers be indented? Can ordered list markers be right-aligned?

     8. item 1
     9. item 2
    10. item 2a
    
  6. Is this one list with a thematic break in its second item, or two lists separated by a thematic break?

    * a
    * * * * *
    * b
    
  7. When list markers change from numbers to bullets, do we have two lists or one? (The Markdown syntax description suggests two, but the perl scripts and many other implementations produce one.)

    1. fee
    2. fie
    -  foe
    -  fum
    
  8. What are the precedence rules for the markers of inline structure? For example, is the following a valid link, or does the code span take precedence ?

    [a backtick (`)](/url) and [another backtick (`)](/url).
    
  9. What are the precedence rules for markers of emphasis and strong emphasis? For example, how should the following be parsed?

    *foo *bar* baz*
    
  10. What are the precedence rules between block-level and inline-level structure? For example, how should the following be parsed?

    - `a long code span can contain a hyphen like this
      - and it can screw things up`
    
  11. Can list items include section headings? (Markdown.pl does not allow this, but does allow blockquotes to include headings.)

    - # Heading
    
  12. Can list items be empty?

    * a
    *
    * b
    
  13. Can link references be defined inside block quotes or list items?

    > Blockquote [foo].
    >
    > [foo]: /url
    
  14. If there are multiple definitions for the same reference, which takes precedence?

    [foo]: /url1
    [foo]: /url2
    
    [foo][]
    

In the absence of a spec, early implementers consulted Markdown.pl to resolve these ambiguities. But Markdown.pl was quite buggy, and gave manifestly bad results in many cases, so it was not a satisfactory replacement for a spec.

Because there is no unambiguous spec, implementations have diverged considerably. As a result, users are often surprised to find that a document that renders one way on one system (say, a GitHub wiki) renders differently on another (say, converting to docbook using pandoc). To make matters worse, because nothing in Markdown counts as a “syntax error,” the divergence often isn’t discovered right away.

1.4About this document

This document attempts to specify Markdown syntax unambiguously. It contains many examples with side-by-side Markdown and HTML. These are intended to double as conformance tests. An accompanying script spec_tests.py can be used to run the tests against any Markdown program:

python test/spec_tests.py --spec spec.txt --program PROGRAM

Since this document describes how Markdown is to be parsed into an abstract syntax tree, it would have made sense to use an abstract representation of the syntax tree instead of HTML. But HTML is capable of representing the structural distinctions we need to make, and the choice of HTML for the tests makes it possible to run the tests against an implementation without writing an abstract syntax tree renderer.

This document is generated from a text file, spec.txt, written in Markdown with a small extension for the side-by-side tests. The script tools/makespec.py can be used to convert spec.txt into HTML or CommonMark (which can then be converted into other formats).

In the examples, the character is used to represent tabs.

2Preliminaries

2.1Characters and lines

Any sequence of characters is a valid CommonMark document.

A character is a Unicode code point. Although some code points (for example, combining accents) do not correspond to characters in an intuitive sense, all code points count as characters for purposes of this spec.

This spec does not specify an encoding; it thinks of lines as composed of characters rather than bytes. A conforming parser may be limited to a certain encoding.

A line is a sequence of zero or more characters other than newline (U+000A) or carriage return (U+000D), followed by a line ending or by the end of file.

A line ending is a newline (U+000A), a carriage return (U+000D) not followed by a newline, or a carriage return and a following newline.

A line containing no characters, or a line containing only spaces (U+0020) or tabs (U+0009), is called a blank line.

The following definitions of character classes will be used in this spec:

A whitespace character is a space (U+0020), tab (U+0009), newline (U+000A), line tabulation (U+000B), form feed (U+000C), or carriage return (U+000D).

Whitespace is a sequence of one or more whitespace characters.

A Unicode whitespace character is any code point in the Unicode Zs general category, or a tab (U+0009), carriage return (U+000D), newline (U+000A), or form feed (U+000C).

Unicode whitespace is a sequence of one or more Unicode whitespace characters.

A space is U+0020.

A non-whitespace character is any character that is not a whitespace character.

An ASCII punctuation character is !, ", #, $, %, &, \', (, ), *, +, ,, -, ., / (U+0021–2F), :, ;, <, =, >, ?, @ (U+003A–0040), [, \\, ], ^, _, ` (U+005B–0060), , |, , or ~ (U+007B–007E).

A punctuation character is an ASCII punctuation character or anything in the general Unicode categories Pc, Pd, Pe, Pf, Pi, Po, or Ps.

2.2Tabs

Tabs in lines are not expanded to spaces. However, in contexts where whitespace helps to define block structure, tabs behave as if they were replaced by spaces with a tab stop of 4 characters.

Thus, for example, a tab can be used instead of four spaces in an indented code block. (Note, however, that internal tabs are passed through as literal tabs, not expanded to spaces.)

→foo→baz→→bim
<pre><code>foo→baz→→bim
</code></pre>
  →foo→baz→→bim
<pre><code>foo→baz→→bim
</code></pre>
    a→a
    ὐ→a
<pre><code>a→a
ὐ→a
</code></pre>

In the following example, a continuation paragraph of a list item is indented with a tab; this has exactly the same effect as indentation with four spaces would:

  - foo

→bar
<ul>
<li>
<p>foo</p>
<p>bar</p>
</li>
</ul>
- foo

→→bar
<ul>
<li>
<p>foo</p>
<pre><code>  bar
</code></pre>
</li>
</ul>

Normally the > that begins a block quote may be followed optionally by a space, which is not considered part of the content. In the following case > is followed by a tab, which is treated as if it were expanded into three spaces. Since one of these spaces is considered part of the delimiter, foo is considered to be indented six spaces inside the block quote context, so we get an indented code block starting with two spaces.

>→→foo
<blockquote>
<pre><code>  foo
</code></pre>
</blockquote>
-→→foo
<ul>
<li>
<pre><code>  foo
</code></pre>
</li>
</ul>
    foo
→bar
<pre><code>foo
bar
</code></pre>
 - foo
   - bar
→ - baz
<ul>
<li>foo
<ul>
<li>bar
<ul>
<li>baz</li>
</ul>
</li>
</ul>
</li>
</ul>
#→Foo
<h1>Foo</h1>
*→*→*→
<hr />

2.3Insecure characters

For security reasons, the Unicode character U+0000 must be replaced with the REPLACEMENT CHARACTER (U+FFFD).

3Blocks and inlines

We can think of a document as a sequence of blocks—structural elements like paragraphs, block quotations, lists, headings, rules, and code blocks. Some blocks (like block quotes and list items) contain other blocks; others (like headings and paragraphs) contain inline content—text, links, emphasized text, images, code spans, and so on.

3.1Precedence

Indicators of block structure always take precedence over indicators of inline structure. So, for example, the following is a list with two items, not a list with one item containing a code span:

- `one
- two`
<ul>
<li>`one</li>
<li>two`</li>
</ul>

This means that parsing can proceed in two steps: first, the block structure of the document can be discerned; second, text lines inside paragraphs, headings, and other block constructs can be parsed for inline structure. The second step requires information about link reference definitions that will be available only at the end of the first step. Note that the first step requires processing lines in sequence, but the second can be parallelized, since the inline parsing of one block element does not affect the inline parsing of any other.

3.2Container blocks and leaf blocks

We can divide blocks into two types: container blocks, which can contain other blocks, and leaf blocks, which cannot.

4Leaf blocks

This section describes the different kinds of leaf block that make up a Markdown document.

4.1Thematic breaks

A line consisting of 0-3 spaces of indentation, followed by a sequence of three or more matching -, _, or * characters, each followed optionally by any number of spaces or tabs, forms a thematic break.

***
---
___
<hr />
<hr />
<hr />

Wrong characters:

+++
<p>+++</p>
===
<p>===</p>

Not enough characters:

--
**
__
<p>--
**
__</p>

One to three spaces indent are allowed:

 ***
  ***
   ***
<hr />
<hr />
<hr />

Four spaces is too many:

    ***
<pre><code>***
</code></pre>
Foo
    ***
<p>Foo
***</p>

More than three characters may be used:

_____________________________________
<hr />

Spaces are allowed between the characters:

 - - -
<hr />
 **  * ** * ** * **
<hr />
-     -      -      -
<hr />

Spaces are allowed at the end:

- - - -    
<hr />

However, no other characters may occur in the line:

_ _ _ _ a

a------

---a---
<p>_ _ _ _ a</p>
<p>a------</p>
<p>---a---</p>

It is required that all of the non-whitespace characters be the same. So, this is not a thematic break:

 *-*
<p><em>-</em></p>

Thematic breaks do not need blank lines before or after:

- foo
***
- bar
<ul>
<li>foo</li>
</ul>
<hr />
<ul>
<li>bar</li>
</ul>

Thematic breaks can interrupt a paragraph:

Foo
***
bar
<p>Foo</p>
<hr />
<p>bar</p>

If a line of dashes that meets the above conditions for being a thematic break could also be interpreted as the underline of a setext heading, the interpretation as a setext heading takes precedence. Thus, for example, this is a setext heading, not a paragraph followed by a thematic break:

Foo
---
bar
<h2>Foo</h2>
<p>bar</p>

When both a thematic break and a list item are possible interpretations of a line, the thematic break takes precedence:

* Foo
* * *
* Bar
<ul>
<li>Foo</li>
</ul>
<hr />
<ul>
<li>Bar</li>
</ul>

If you want a thematic break in a list item, use a different bullet:

- Foo
- * * *
<ul>
<li>Foo</li>
<li>
<hr />
</li>
</ul>

4.2ATX headings

An ATX heading consists of a string of characters, parsed as inline content, between an opening sequence of 1–6 unescaped # characters and an optional closing sequence of any number of unescaped # characters. The opening sequence of # characters must be followed by a space or by the end of line. The optional closing sequence of #s must be preceded by a space and may be followed by spaces only. The opening # character may be indented 0-3 spaces. The raw contents of the heading are stripped of leading and trailing spaces before being parsed as inline content. The heading level is equal to the number of # characters in the opening sequence.

Simple headings:

# foo
## foo
### foo
#### foo
##### foo
###### foo
<h1>foo</h1>
<h2>foo</h2>
<h3>foo</h3>
<h4>foo</h4>
<h5>foo</h5>
<h6>foo</h6>

More than six # characters is not a heading:

####### foo
<p>####### foo</p>

At least one space is required between the # characters and the heading’s contents, unless the heading is empty. Note that many implementations currently do not require the space. However, the space was required by the original ATX implementation, and it helps prevent things like the following from being parsed as headings:

#5 bolt

#hashtag
<p>#5 bolt</p>
<p>#hashtag</p>

This is not a heading, because the first # is escaped:

\\## foo
<p>## foo</p>

Contents are parsed as inlines:

# foo *bar* \\*baz\\*
<h1>foo <em>bar</em> *baz*</h1>

Leading and trailing whitespace is ignored in parsing inline content:

#                  foo                     
<h1>foo</h1>

One to three spaces indentation are allowed:

 ### foo
  ## foo
   # foo
<h3>foo</h3>
<h2>foo</h2>
<h1>foo</h1>

Four spaces are too much:

    # foo
<pre><code># foo
</code></pre>
foo
    # bar
<p>foo
# bar</p>

A closing sequence of # characters is optional:

## foo ##
  ###   bar    ###
<h2>foo</h2>
<h3>bar</h3>

It need not be the same length as the opening sequence:

# foo ##################################
##### foo ##
<h1>foo</h1>
<h5>foo</h5>

Spaces are allowed after the closing sequence:

### foo ###     
<h3>foo</h3>

A sequence of # characters with anything but spaces following it is not a closing sequence, but counts as part of the contents of the heading:

### foo ### b
<h3>foo ### b</h3>

The closing sequence must be preceded by a space:

# foo#
<h1>foo#</h1>

Backslash-escaped # characters do not count as part of the closing sequence:

### foo \\###
## foo #\\##
# foo \\#
<h3>foo ###</h3>
<h2>foo ###</h2>
<h1>foo #</h1>

ATX headings need not be separated from surrounding content by blank lines, and they can interrupt paragraphs:

****
## foo
****
<hr />
<h2>foo</h2>
<hr />
Foo bar
# baz
Bar foo
<p>Foo bar</p>
<h1>baz</h1>
<p>Bar foo</p>

ATX headings can be empty:

## 
#
### ###
<h2></h2>
<h1></h1>
<h3></h3>

4.3Setext headings

A setext heading consists of one or more lines of text, each containing at least one non-whitespace character, with no more than 3 spaces indentation, followed by a setext heading underline. The lines of text must be such that, were they not followed by the setext heading underline, they would be interpreted as a paragraph: they cannot be interpretable as a code fence, ATX heading, block quote, thematic break, list item, or HTML block.

A setext heading underline is a sequence of = characters or a sequence of - characters, with no more than 3 spaces indentation and any number of trailing spaces. If a line containing a single - can be interpreted as an empty list items, it should be interpreted this way and not as a setext heading underline.

The heading is a level 1 heading if = characters are used in the setext heading underline, and a level 2 heading if - characters are used. The contents of the heading are the result of parsing the preceding lines of text as CommonMark inline content.

In general, a setext heading need not be preceded or followed by a blank line. However, it cannot interrupt a paragraph, so when a setext heading comes after a paragraph, a blank line is needed between them.

Simple examples:

Foo *bar*
=========

Foo *bar*
---------
<h1>Foo <em>bar</em></h1>
<h2>Foo <em>bar</em></h2>

The content of the header may span more than one line:

Foo *bar
baz*
====
<h1>Foo <em>bar
baz</em></h1>

The contents are the result of parsing the headings’s raw content as inlines. The heading’s raw content is formed by concatenating the lines and removing initial and final whitespace.

  Foo *bar
baz*→
====
<h1>Foo <em>bar
baz</em></h1>

The underlining can be any length:

Foo
-------------------------

Foo
=
<h2>Foo</h2>
<h1>Foo</h1>

The heading content can be indented up to three spaces, and need not line up with the underlining:

   Foo
---

  Foo
-----

  Foo
  ===
<h2>Foo</h2>
<h2>Foo</h2>
<h1>Foo</h1>

Four spaces indent is too much:

    Foo
    ---

    Foo
---
<pre><code>Foo
---

Foo
</code></pre>
<hr />

The setext heading underline can be indented up to three spaces, and may have trailing spaces:

Foo
   ----      
<h2>Foo</h2>

Four spaces is too much:

Foo
    ---
<p>Foo
---</p>

The setext heading underline cannot contain internal spaces:

Foo
= =

Foo
--- -
<p>Foo
= =</p>
<p>Foo</p>
<hr />

Trailing spaces in the content line do not cause a line break:

Foo  
-----
<h2>Foo</h2>

Nor does a backslash at the end:

Foo\\
----
<h2>Foo\\</h2>

Since indicators of block structure take precedence over indicators of inline structure, the following are setext headings:

`Foo
----
`

<a title="a lot
---
of dashes"/>
<h2>`Foo</h2>
<p>`</p>
<h2>&lt;a title=&quot;a lot</h2>
<p>of dashes&quot;/&gt;</p>

The setext heading underline cannot be a lazy continuation line in a list item or block quote:

> Foo
---
<blockquote>
<p>Foo</p>
</blockquote>
<hr />
> foo
bar
===
<blockquote>
<p>foo
bar
===</p>
</blockquote>
- Foo
---
<ul>
<li>Foo</li>
</ul>
<hr />

A blank line is needed between a paragraph and a following setext heading, since otherwise the paragraph becomes part of the heading’s content:

Foo
Bar
---
<h2>Foo
Bar</h2>

But in general a blank line is not required before or after setext headings:

---
Foo
---
Bar
---
Baz
<hr />
<h2>Foo</h2>
<h2>Bar</h2>
<p>Baz</p>

Setext headings cannot be empty:


====
<p>====</p>

Setext heading text lines must not be interpretable as block constructs other than paragraphs. So, the line of dashes in these examples gets interpreted as a thematic break:

---
---
<hr />
<hr />
- foo
-----
<ul>
<li>foo</li>
</ul>
<hr />
    foo
---
<pre><code>foo
</code></pre>
<hr />
> foo
-----
<blockquote>
<p>foo</p>
</blockquote>
<hr />

If you want a heading with > foo as its literal text, you can use backslash escapes:

\\> foo
------
<h2>&gt; foo</h2>

Compatibility note: Most existing Markdown implementations do not allow the text of setext headings to span multiple lines. But there is no consensus about how to interpret

Foo
bar
---
baz

One can find four different interpretations:

  1. paragraph “Foo”, heading “bar”, paragraph “baz”
  2. paragraph “Foo bar”, thematic break, paragraph “baz”
  3. paragraph “Foo bar — baz”
  4. heading “Foo bar”, paragraph “baz”

We find interpretation 4 most natural, and interpretation 4 increases the expressive power of CommonMark, by allowing multiline headings. Authors who want interpretation 1 can put a blank line after the first paragraph: