Add more high-level documentation: theory.md
parent
7775812cfd
commit
eac73c9a7b
|
@ -0,0 +1,313 @@
|
|||
# Introduction
|
||||
|
||||
[The readme](../../master/README.md) mentions a couple of goals for this
|
||||
project, including the following two:
|
||||
|
||||
- Be clever about using the available horizontal space while not overflowing
|
||||
it if it cannot be avoided;
|
||||
- Have linear complexity in the size of the input.
|
||||
|
||||
These two goals stand in conflict, and this project chooses a solution
|
||||
to this that distinguishes it from (all) other existing formatters. This
|
||||
idea was the motivation to write Brittany in the first place, and it might
|
||||
even be applicable to more general-purposes structured-document formatters
|
||||
(although we will not expand on that here). Before explaining the idea, we
|
||||
will first consider
|
||||
|
||||
## The problem
|
||||
|
||||
Every haskell module can be written in a single line - of course, in most
|
||||
cases, an unwieldly long one. We humans prefer our lines limitted to some
|
||||
laughingly small limit like 80 or 160 or whatever. Further, we generally
|
||||
prefer the indentation of our expressions(, statements etc.) line up with
|
||||
its syntactic structure. This preferences (and the layouting rule which
|
||||
already enforces it partially) still leaves a good amount of choice for
|
||||
where exactly to insert newlines. For example, these are all consistent
|
||||
in their newline/indentation usage:
|
||||
|
||||
~~~~.hs
|
||||
myRecord = MyRecord { abc = "abc", def = "def"}
|
||||
myRecord = MyRecord
|
||||
{ abc = "abc"
|
||||
, def = "def"
|
||||
}
|
||||
myList = [ "abc", "def"]
|
||||
myList =
|
||||
[ "abc"
|
||||
, "def"
|
||||
]
|
||||
~~~~
|
||||
|
||||
While consistency has the first priority, we also prefer short code: If it
|
||||
fits, we prefer the version/layout with less lines of code. So we wish to trade
|
||||
more lines for less columns, but only until things fit.
|
||||
|
||||
For simple cases we can give a trivial rule: If there is space
|
||||
for the one-line layout, use it; otherwise use the indented-multiline
|
||||
layout (this applies to `myRecord` example). The
|
||||
straight-forward approach works well: Calculate the length of the
|
||||
one-line layout for each node, then recurse top-down and make the choices
|
||||
according to that rule. Linear runtime with good results.
|
||||
|
||||
Things get more interesting with nesting, and with structures with more than
|
||||
two alternatives. Consider the following four alternative layouts of the
|
||||
same expression:
|
||||
|
||||
~~~~.hs
|
||||
-- 10 20 30 40
|
||||
-- 1) -- | lets assume the user wants
|
||||
nestedCaseExpr = case e1 of -- | 40 columns max.
|
||||
Left x -> if func x then "good" else "bad" -- too long
|
||||
-- 2) --
|
||||
nestedCaseExpr = case e1 of --
|
||||
Left x -> if func x --
|
||||
then "good" --
|
||||
else "bad" --
|
||||
-- 3) --
|
||||
nestedCaseExpr = case e1 of --
|
||||
Left x -> --
|
||||
if func x then "good" else "bad" --
|
||||
-- 4) --
|
||||
nestedCaseExpr = case e1 of --
|
||||
Left x -> --
|
||||
if func x --
|
||||
then "good" --
|
||||
else "bad" --
|
||||
~~~~
|
||||
|
||||
- We nest a case-expression and an if-expression; for both we consider two
|
||||
alternative layouts, meaning that
|
||||
- There are a total of four, and in general an exponential amount of
|
||||
possible layouts;
|
||||
- With a top-down approach, one would choose 2), as "if func x" has space in
|
||||
the current line. But..
|
||||
- Layout 3) is the optimal solution considering lines-of-code and the 40-column
|
||||
limit.
|
||||
|
||||
So our question: Is there an algorithm which, given some input syntax tree,
|
||||
returns the/an optimal (least lines-of-code while respecting max-column limit)
|
||||
layout (out of a potentially exponential number of valid layouts). Further,
|
||||
this algorithm's time/space usage should be linear in the size of the input
|
||||
(we might weaken this to some polynomial upper bound - but that should be the
|
||||
limit).
|
||||
|
||||
If we pessimistically assume that such algorithm does not exist, we may ask
|
||||
alternatively: What linear algorithm returns solutions which,
|
||||
on average, are closest to the optimal solution?
|
||||
|
||||
In the following we will describe _one_ such algorithm that seems to return
|
||||
the optimal solution in several non-trivial cases, and near-optimal solutions
|
||||
in many other. We won't try to prove that it is the best algorithm, but we will
|
||||
consider the circumstances for which a non-optimal solution is returned.
|
||||
|
||||
## The Reasoning
|
||||
|
||||
A top-down approach is so bad, because when there are exponentially many
|
||||
layouts to consider, there information passed down from the parents does
|
||||
not help at all in pruning the alternatives on a given layer. In the above
|
||||
`nestedCaseExpr` example, we might obtain a better solution by looking not
|
||||
at just the first, but the first n possible layouts, but against an exponential
|
||||
search-space, this does not scale: Just consider the possibility that there
|
||||
are exponentially many sub-solutions for layout 2) (replace "good" with some
|
||||
slightly more complex expression). You basically always end up with either
|
||||
"the current line is not yet full, try to fill it up" or
|
||||
"more than n columns used, abort".
|
||||
|
||||
But a (pure) bottom-up approach does not work either: If we have no clue about
|
||||
the "current" indentation while layouting some node of our syntax tree,
|
||||
information about the (potential) size of (the layout of) child-nodes does
|
||||
not allow us to make good decisions.
|
||||
|
||||
So we need information to flow bottom-to-top to allow for pruning whole trees
|
||||
of possible layouts, and top-to-bottom for making the actual decisions.. well,
|
||||
this can be arranged.
|
||||
|
||||
# The Algorithm
|
||||
|
||||
This algorithm works in two passes over the input syntax tree. The first pass
|
||||
is bottom-up and adds a label to each node. This label contains (a set of)
|
||||
"spacings" - certain information regarding the number of columns/lines needed
|
||||
for the textual representation of potential layouts of that node.
|
||||
For example we might return two possible spacings for the following (same)
|
||||
expression:
|
||||
|
||||
~~~~.hs
|
||||
if func x then "good" else "bad"
|
||||
-- => 1 line, 32 columns used
|
||||
if func x
|
||||
then "good"
|
||||
else "bad"
|
||||
-- => 3 lines, 13 columns used
|
||||
~~~~
|
||||
|
||||
This is heavily simplified; in Brittany spacing information is (as usual) a
|
||||
bit more complex.
|
||||
|
||||
We restrict the size of these sets. Given the sets of spacings for the
|
||||
child-nodes in the syntax-tree, we generate a limited number of possible
|
||||
spacings in the current node. We then prune nodes that already violate desired
|
||||
properties, e.g. any spacing that already uses more columns locally than
|
||||
globally available.
|
||||
|
||||
The second pass is top-down and uses the spacing-information to decide on one
|
||||
of the possible layouts for the current node. It passes the current
|
||||
"cursor position" (indentation, etc.) downwards, allowing to check that the
|
||||
layout fits (e.g. ensure "current indentation + columns used < max columns").
|
||||
|
||||
This algorithm is trivially linear - two traversals and only linear space
|
||||
required per node of the input.
|
||||
|
||||
### Consequences
|
||||
|
||||
- when calculating spacings during bottom-up, several spacings are combined
|
||||
into a single one via min/max/some other associative operation. Perhaps
|
||||
consider how in a multi-line list layout the "columns used" will be derived
|
||||
from the element spacings. For additional information embedded into the
|
||||
spacings, we will need at least one such Semigroup instance.
|
||||
|
||||
- We require an order on the alternatives for each syntactical construct.
|
||||
The first alternative where we find some combination of spacings for the
|
||||
children which is acceptable will be used.
|
||||
|
||||
In unlucky cases, all spacings might get pruned. In that case, we will
|
||||
default to the last alternative, which therefor should be the most
|
||||
"conservative" choice, i.e. the one that gives the child-nodes the most
|
||||
space.
|
||||
|
||||
The first alternative should always be the "one-liner" layout, if it exists;
|
||||
this alternative is preferable in general and will be filtered first should
|
||||
it not fit.
|
||||
|
||||
In between those two extremes can be other choices that gradually trade
|
||||
"columns used" for "lines used".
|
||||
|
||||
- Sometimes there exist several (near) optimal solutions. E.g. when there exist
|
||||
multiple nodes where we can trade lines for columns in such a way that the
|
||||
whole result fits.
|
||||
|
||||
In such cases, Brittany makes the following choice: We prefer "trading"
|
||||
in the node higher up in the syntax tree. The reasoning is that we rather
|
||||
spend one line early on to gain additional horizontal space for _all_ the
|
||||
children than the other way around. (In those cases where there are
|
||||
alternatives to choose from, there are often several children.) We can apply
|
||||
this to the `nestedCaseExpr` example above: The options are to either
|
||||
put the right-hand-side of the case-alternative into a new line, or
|
||||
split up the if-then-else. The "case" is the outer one, so Brittany will
|
||||
prefer 3) over 2), which proves to be the right choice (here).
|
||||
|
||||
As a consequence, we are most interested in the maximum spacings; yet, we do
|
||||
not have a total order because we cannot generally prefer one of two spacings
|
||||
where the first uses more columns, the second more lines.
|
||||
|
||||
- The number of syntactical constructs in haskell is large. If we were to
|
||||
work directly on the syntax tree to do our traversals, we'd have to write two
|
||||
(or even three - one to generate spacings, one to make the choices, one to
|
||||
do the output and insert comments) different functions each respecting every
|
||||
syntactical construct in haskell. It would be an incredible amount of code
|
||||
(and work) times three.
|
||||
|
||||
Instead we can use some recursive data-type describing a structured document,
|
||||
which abstracts of different syntactical constructs and only considers the
|
||||
things relevant for layouting. This data-type is called `BriDoc`.
|
||||
|
||||
- The `BriDoc` tree has an exponential number of nodes, but it is linear when
|
||||
sharing is considered - the child-nodes can (and must) be re-used across
|
||||
different alternatives. In the `nestedCaseExpr` example above, note how
|
||||
there are four layouts, but essentially only two ways in which the "if" is
|
||||
layouted. Either as a single line or with then/else on new lines. We can
|
||||
handle spacings in such a way that we can share them for 1/3 and 2/4. This
|
||||
already hints at how "columns used" will need to be redesigned slightly so
|
||||
that 2/4 really have the same spacing label at the "if".
|
||||
|
||||
### Concessions/non-Optimality
|
||||
|
||||
- Prefering to trade in the higher node can give non-optimal results, e.g.
|
||||
when there is only one child.
|
||||
|
||||
- Spacings and the pruning of spacings happens in bottom-up fashion, meaning
|
||||
that we might not prune a "70 columns used" spacing even though we later
|
||||
notice that the current indentation already needs to be 20 and we only have
|
||||
60 columns left. If all spacings found in one label have this property,
|
||||
the top-down traversal might be forced to make the "conservative" choice,
|
||||
trading more lines for less indentation to prevent potential (but unknown)
|
||||
overflow during child-node layouting.
|
||||
|
||||
- We can increase the limits arbitrarily (i.e. increase the constant factor
|
||||
while remaining linear) to get better (optimal) results in more cases. We can
|
||||
choose constant factors in such a way that we get optimal results for all
|
||||
inputs up to a certain size. Of course calling them "constant factors" is
|
||||
delusional if they grow exponentially - but it is still nice that this
|
||||
heuristical algorithm can trivially be transformed into the perfect
|
||||
(but exponential) algorithm.
|
||||
|
||||
### Examples of non-Optimality and Border-Cases in General
|
||||
|
||||
TODO
|
||||
|
||||
# Practical details
|
||||
|
||||
## Comments
|
||||
|
||||
Comments don't affect semantics and thus are free-form; as the user we almost
|
||||
never want them reformatted in any way. The tempting approach is to keep them
|
||||
entirely separate: Don't layout comments, and don't let comments affect
|
||||
layout. However, this not possible always:
|
||||
|
||||
~~~~.hs
|
||||
val = f -- useful comment here
|
||||
x
|
||||
-- reformatted to
|
||||
val = f -- useful comment here x
|
||||
~~~~
|
||||
|
||||
Does not work too well, even when the one-liner "f x" would certainly be the
|
||||
layout of choice.
|
||||
|
||||
Brittany handles this in two ways:
|
||||
|
||||
1) Some alternatives are pruned based on the existence of comments;
|
||||
2) Always insert newlines after end-of-line comments, and restore the
|
||||
current indetation to prevent violation of the layouting rule:
|
||||
|
||||
~~~~.hs
|
||||
val = do
|
||||
myAction -- useful comment here
|
||||
x
|
||||
g
|
||||
~~~~
|
||||
|
||||
## CPP
|
||||
|
||||
Conditional text-based source-code insertion is horrible.
|
||||
"#if"-guarded code in one line can easily affect in which way all other code
|
||||
is parsed. This makes it more or less impossible to layout code involving
|
||||
the preprocessor. One might be able to re-implement the preprocessor to then
|
||||
determine when things are safe to layout, but that is no fun, and Brittany
|
||||
does not bother.
|
||||
|
||||
Brittany allows one thing: Working on the already pre-processed code, treating
|
||||
any disabled sections like comments. This is not safe in general, but is nice
|
||||
when the user makes only responsible usage of CPP (e.g.: only put "#if"-guards
|
||||
around module top-level constructs).
|
||||
|
||||
## Horizontal alignment
|
||||
|
||||
Sometimes horizontal alignment can make things more readable. However there
|
||||
are good reasons against using such whitespace: It can cause larger diffs
|
||||
on simple changes and it is rather subjective in which cases things are
|
||||
"more readable" rather than "annoyingly spaced".
|
||||
|
||||
Nonetheless Brittany has a fully-featured implementation for horizontal
|
||||
alignment:
|
||||
|
||||
~~~~.hs
|
||||
func (MyLongFoo abc def) = 1
|
||||
func (Bar a d ) = 2
|
||||
func _ = 3
|
||||
~~~~
|
||||
|
||||
Again we have to ask the question: Does alignment affect the layouting choices,
|
||||
and does layouting affect alignment? The answer is clear in this case: No.
|
||||
First we make layouting-choices, then, independently, we add alignment but only
|
||||
in those cases where it does not cause overflows.
|
||||
|
Loading…
Reference in New Issue