PHP Cross Reference of Phabricator

/src/docs/user/userguide/ -> utf8.diviner (source)

@title User Guide: UTF-8 and Character Encoding
@group userguide

How Phabricator handles character encodings.

= Overview =

Phabricator stores all internal text data as UTF-8, processes all text data
as UTF-8, outputs in UTF-8, and expects all inputs to be UTF-8. Principally,
this means that you should write your source code in UTF-8. In most cases this
does not require you to change anything, because ASCII text is a subset of
UTF-8.

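To see why plain ASCII files need no changes, here is a small illustration in
Python (not part of Phabricator; the sample strings are made up). ASCII bytes
decode to the same text under either encoding, while a non-ASCII character
needs more than one byte in UTF-8:

```python
# Pure-ASCII bytes: every byte value is below 0x80.
ascii_source = b'printf("Hello World!\\n");'
assert all(byte < 0x80 for byte in ascii_source)

# Decoding as ASCII and as UTF-8 yields the same text, so a file like this
# is already valid UTF-8 and needs no conversion.
assert ascii_source.decode("ascii") == ascii_source.decode("utf-8")

# A non-ASCII character (here U+00E9, "e" with an acute accent) is encoded
# as a multi-byte sequence in UTF-8.
e_acute = "\u00e9".encode("utf-8")
print(e_acute)  # b'\xc3\xa9'
```
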
If you have a repository with source files that are not encoded in UTF-8, you
have two options:

  - Convert all files in the repository to ASCII or UTF-8 (see "Detecting and
    Repairing Files" below). This is recommended, especially if the encoding
    problems are accidental.
  - Configure Phabricator to convert files into UTF-8 from whatever encoding
    your repository is in when it needs to (see "Support for Alternate
    Encodings" below). This is not completely supported, and repositories with
    files that have multiple encodings are not supported.

= Detecting and Repairing Files =

It is recommended that you write source files only in ASCII text, but
Phabricator fully supports UTF-8 source files.

If you have a project which isn't valid UTF-8 because a few files have random
binary nonsense in them, there is a script in libphutil which can help you
identify and fix them:

  project/ $ libphutil/scripts/utils/utf8.php

Generally, run this script on all source files with "-t" to find files with bad
byte ranges, and then run it without "-t" on each file to see where the
problems are. For example:

  project/ $ find . -type f -name '*.c' -print0 | xargs -0 -n256 ./utf8.php -t
  ./hello_world.c

If this script exits without output, you're in good shape and all the files
that were checked are valid UTF-8. If it found some problems, you need to
repair them. You can identify the specific problems by omitting the "-t" flag:

  project/ $ ./utf8.php hello_world.c
  FAIL  hello_world.c

    3  main()
    4  {
    5      printf ("Hello World<0xE9><0xD6>!\n");
    6  }
    7

This shows the offending bytes on line 5 (in the actual console display, they'll
be highlighted). Often a codebase will mostly be valid UTF-8 but have a few
scattered files that have other things in them, like curly quotes which someone
copy-pasted from Word into a comment. In these cases, you can just manually
identify and fix the problems pretty easily.

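If you want to see the idea behind this kind of check, the following is a
minimal Python sketch. It is not how ##utf8.php## is implemented, and the
function name `find_invalid_utf8` is made up for illustration. It scans each
line and records the bytes that cannot be decoded as UTF-8:

```python
def find_invalid_utf8(data: bytes):
    """Return (line_number, byte_offset_in_line, offending_bytes) tuples."""
    problems = []
    for line_number, line in enumerate(data.split(b"\n"), start=1):
        pos = 0
        while pos < len(line):
            try:
                line[pos:].decode("utf-8")
                break  # the rest of this line is valid UTF-8
            except UnicodeDecodeError as err:
                # err.start and err.end are relative to the slice we decoded.
                start, end = pos + err.start, pos + err.end
                problems.append((line_number, start, line[start:end]))
                pos = end  # skip the bad bytes and keep scanning


    return problems


# Hypothetical sample resembling the file in the example above.
sample = b'main()\n{\n    printf ("Hello World\xe9\xd6!\\n");\n}\n'
for line_number, offset, bad in find_invalid_utf8(sample):
    print(line_number, offset, bad.hex())
```

An empty result means every line decoded cleanly, which corresponds to the
"no output" case described above.
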
If you have a prohibitively large number of UTF-8 issues in your source code,
Phabricator doesn't include any default tools to help you process them in a
systematic way. You could hack up ##utf8.php## as a starting point, or use other
tools to batch-process your source files.

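As one example of "other tools", here is a rough Python sketch of a batch
conversion. It is not shipped with Phabricator. It assumes that every invalid
file is in one known legacy encoding (Windows-1252 is used here purely as an
assumption; substitute your own), and the function names are hypothetical. Run
something like this only on a clean checkout so you can review the diff.

```python
from pathlib import Path


def convert_to_utf8(path: Path, source_encoding: str = "cp1252") -> bool:
    """Rewrite `path` as UTF-8 if it is not already; return True if rewritten."""
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8 (or plain ASCII); leave untouched
    except UnicodeDecodeError:
        pass
    # Raises UnicodeDecodeError if a byte is undefined in the assumed encoding.
    text = raw.decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))
    return True


def convert_tree(root: Path, pattern: str = "*.c", source_encoding: str = "cp1252"):
    """Convert every matching file under `root`; return the files rewritten."""
    return [
        path
        for path in sorted(root.rglob(pattern))
        if convert_to_utf8(path, source_encoding)
    ]
```

Files that are already valid UTF-8 are skipped, so the script can be re-run
safely after fixing any files it could not decode.
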
= Support for Alternate Encodings =

Phabricator has some support for encodings other than UTF-8.

NOTE: Alternate encodings are not completely supported, and a few features will
not work correctly. Codebases with files that have multiple different encodings
(for example, some files in ISO-8859-1 and some files in Shift-JIS) are not
supported at all.

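To illustrate why mixed encodings are a problem in general, here is a short
Python example. It does not describe Phabricator's internals. It only shows
that converting with a single configured encoding works when the guess is
right, and silently produces wrong text when it is not:

```python
# Three kanji characters, written with escapes so this snippet is pure ASCII.
text = "\u65e5\u672c\u8a9e"
sjis_bytes = text.encode("shift_jis")

# Decoding with the correct encoding and re-encoding gives valid UTF-8.
converted = sjis_bytes.decode("shift_jis").encode("utf-8")
assert converted.decode("utf-8") == text

# ISO-8859-1 assigns a character to every byte value, so decoding the same
# bytes with the wrong encoding raises no error. The result is just garbage.
wrong = sjis_bytes.decode("iso-8859-1")
assert wrong != text
```

Because the wrong guess fails silently, one encoding setting per repository
cannot correctly handle files in several different encodings.
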
To use an alternate encoding, edit the repository in Diffusion and specify the
encoding to use.

Optionally, you can use the `--encoding` flag when running `arc`, or set
`encoding` in your `.arcconfig`.


Generated: Sun Nov 30 09:20:46 2014 Cross-referenced by PHPXref 0.7.1