API

Build Unihan into a tabular-friendly format and export it.

unihan_tabular.process.ALLOWED_EXPORT_TYPES = [u'json', u'csv', u'yaml']

Allowed export types

unihan_tabular.process.DESTINATION_DIR = u'/home/docs/.local/share/unihan_tabular'

Directory path to which the built CSV file is written.

unihan_tabular.process.INDEX_FIELDS = [u'ucn', u'char']

Default index fields for Unihan CSVs. You probably want these.

unihan_tabular.process.UNIHAN_FIELDS = [u'kAccountingNumeric', u'kBigFive', u'kCCCII', u'kCNS1986', u'kCNS1992', u'kCangjie', u'kCantonese', u'kCheungBauer', u'kCheungBauerIndex', u'kCihaiT', u'kCompatibilityVariant', u'kCowles', u'kDaeJaweon', u'kDefinition', u'kEACC', u'kFenn', u'kFennIndex', u'kFourCornerCode', u'kFrequency', u'kGB0', u'kGB1', u'kGB3', u'kGB5', u'kGB7', u'kGB8', u'kGSR', u'kGradeLevel', u'kHDZRadBreak', u'kHKGlyph', u'kHKSCS', u'kHanYu', u'kHangul', u'kHanyuPinlu', u'kHanyuPinyin', u'kIBMJapan', u'kIICore', u'kIRGDaeJaweon', u'kIRGDaiKanwaZiten', u'kIRGHanyuDaZidian', u'kIRGKangXi', u'kIRG_GSource', u'kIRG_HSource', u'kIRG_JSource', u'kIRG_KPSource', u'kIRG_KSource', u'kIRG_MSource', u'kIRG_TSource', u'kIRG_USource', u'kIRG_VSource', u'kJIS0213', u'kJapaneseKun', u'kJapaneseOn', u'kJis0', u'kJis1', u'kKPS0', u'kKPS1', u'kKSC0', u'kKSC1', u'kKangXi', u'kKarlgren', u'kKorean', u'kLau', u'kMainlandTelegraph', u'kMandarin', u'kMatthews', u'kMeyerWempe', u'kMorohashi', u'kNelson', u'kOtherNumeric', u'kPhonetic', u'kPrimaryNumeric', u'kPseudoGB1', u'kRSAdobe_Japan1_6', u'kRSJapanese', u'kRSKanWa', u'kRSKangXi', u'kRSKorean', u'kRSUnicode', u'kSBGY', u'kSemanticVariant', u'kSimplifiedVariant', u'kSpecializedSemanticVariant', u'kTaiwanTelegraph', u'kTang', u'kTotalStrokes', u'kTraditionalVariant', u'kVietnamese', u'kXHC1983', u'kXerox', u'kZVariant']

Default Unihan fields

unihan_tabular.process.UNIHAN_FILES = [u'Unihan_RadicalStrokeCounts.txt', u'Unihan_NumericValues.txt', u'Unihan_Variants.txt', u'Unihan_DictionaryIndices.txt', u'Unihan_DictionaryLikeData.txt', u'Unihan_OtherMappings.txt', u'Unihan_Readings.txt', u'Unihan_IRGSources.txt']

Default Unihan Files

unihan_tabular.process.UNIHAN_URL = u'http://www.unicode.org/Public/UNIDATA/Unihan.zip'

URI of Unihan.zip data.

unihan_tabular.process.UNIHAN_ZIP_PATH = u'/home/docs/.cache/unihan_tabular/downloads/Unihan.zip'

Filepath to download the zip file to.

unihan_tabular.process.WORK_DIR = u'/home/docs/.cache/unihan_tabular/downloads'

Directory to use for processing intermediate files.

unihan_tabular.process.download(url, dest, urlretrieve_fn=<function urlretrieve>, reporthook=None)

Download a file to a destination.

Parameters:
  • url (str) – URL to download from.
  • dest (str) – file path where the download is to be saved.
  • urlretrieve_fn (function) – function used to download the file.
  • reporthook (function) – function to write a progress bar to the stdout buffer.
Returns:

destination where file downloaded to.

Return type:

str
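The urlretrieve_fn parameter suggests that the downloader can be swapped out, for example in tests. The following is a minimal sketch of that idea, assuming the function simply forwards to a urlretrieve-compatible callable and returns dest. The names download_sketch and fake_urlretrieve are hypothetical and are not part of unihan_tabular.

```python
import os
import tempfile
from urllib.request import urlretrieve


def download_sketch(url, dest, urlretrieve_fn=urlretrieve, reporthook=None):
    """Hypothetical stand-in for download(); not the library's own code."""
    dest_dir = os.path.dirname(os.path.abspath(dest))
    os.makedirs(dest_dir, exist_ok=True)  # make sure the target directory exists
    urlretrieve_fn(url, dest, reporthook)  # same argument order as urlretrieve
    return dest


def fake_urlretrieve(url, filename, reporthook=None):
    """Test double: writes a placeholder file instead of using the network."""
    payload = b"placeholder"
    with open(filename, "wb") as f:
        f.write(payload)
    if reporthook is not None:
        # urlretrieve hooks receive (block count, block size, total size).
        reporthook(1, len(payload), len(payload))
    return filename, None


calls = []
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "downloads", "Unihan.zip")
    saved = download_sketch(
        "http://example.invalid/Unihan.zip",  # never contacted by the fake
        target,
        urlretrieve_fn=fake_urlretrieve,
        reporthook=lambda *args: calls.append(args),
    )
    exists = os.path.exists(saved)
```

Injecting the fake keeps the example offline; passing no urlretrieve_fn would fall back to the standard library's urlretrieve.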

unihan_tabular.process.extract_zip(zip_path, dest_dir)

Extract zip file. Return zipfile.ZipFile instance.

Parameters:
  • zip_path (str) – filepath to extract.
  • dest_dir (str) – (optional) directory to extract to.
Returns:

The extracted zip.

Return type:

zipfile.ZipFile
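As a rough illustration of the behaviour described above, the sketch below extracts an archive with the standard zipfile module and returns the open ZipFile. It assumes nothing about the library's internals; extract_zip_sketch and the sample archive are hypothetical.

```python
import os
import tempfile
import zipfile


def extract_zip_sketch(zip_path, dest_dir):
    """Hypothetical stand-in for extract_zip(); not the library's own code."""
    z = zipfile.ZipFile(zip_path)
    z.extractall(dest_dir)  # creates dest_dir if it does not exist yet
    return z


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny archive so the example needs no download.
    zip_path = os.path.join(tmp, "sample.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("Unihan_Readings.txt", "U+3400\tkCantonese\tjau1\n")

    out_dir = os.path.join(tmp, "out")
    z = extract_zip_sketch(zip_path, out_dir)
    names = z.namelist()
    extracted = os.path.exists(os.path.join(out_dir, "Unihan_Readings.txt"))
    z.close()  # the caller owns the returned ZipFile and should close it
```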

unihan_tabular.process.files_exist(path, files)

Return True if all files exist in specified path.
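A check like this can be sketched with os.path alone. The version below is an assumption about the behaviour, not the library's code, and files_exist_sketch is a hypothetical name.

```python
import os
import tempfile


def files_exist_sketch(path, files):
    """Return True only if every name in files exists under path."""
    return all(os.path.exists(os.path.join(path, name)) for name in files)


with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "Unihan_Readings.txt"), "w").close()
    one_present = files_exist_sketch(tmp, ["Unihan_Readings.txt"])
    one_missing = files_exist_sketch(
        tmp, ["Unihan_Readings.txt", "Unihan_Variants.txt"]
    )
```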

unihan_tabular.process.filter_manifest(files)

Return filtered UNIHAN_MANIFEST from list of file names.

unihan_tabular.process.get_fields(d)

Return list of fields from dict of {filename: [‘field’, ‘field1’]}.
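To make the input and output shapes concrete, here is a small sketch that flattens such a dict into one list. Deduplicating and sorting the result is an assumption made for this example, and get_fields_sketch is a hypothetical name.

```python
def get_fields_sketch(d):
    """Flatten {filename: [field, ...]} into a sorted list of unique fields."""
    return sorted({field for fields in d.values() for field in fields})


manifest = {
    "Unihan_Readings.txt": ["kMandarin", "kCantonese"],
    "Unihan_Variants.txt": ["kZVariant"],
}
fields = get_fields_sketch(manifest)
```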

unihan_tabular.process.get_parser()

Return argparse.ArgumentParser instance for CLI.

Returns:

argument parser for CLI use.

Return type:

argparse.ArgumentParser

unihan_tabular.process.has_valid_zip(zip_path)

Return True if valid zip exists.

Parameters:
  • zip_path (str) – absolute path to zip
Returns:

True if valid zip exists at path

Return type:

bool

unihan_tabular.process.in_fields(c, fields)

Return True if string is in the default fields.

unihan_tabular.process.listify(data, fields)

Convert tabularized data to a CSV-friendly list.

Parameters:
  • data (list) – list of dicts
  • fields (list) – keys/columns, e.g. ['kDictionary']

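One way to picture the conversion is a header row followed by one row per dict, which the csv module can then write directly. This is a sketch under assumptions: ordering each row by fields and filling missing keys with an empty string are choices made here, and listify_sketch is a hypothetical name.

```python
import csv
import io


def listify_sketch(data, fields):
    """Turn a list of dicts into rows, with the field names as the first row."""
    rows = [list(fields)]
    for record in data:
        rows.append([record.get(field, "") for field in fields])
    return rows


data = [
    {"ucn": "U+4E00", "char": "一", "kTotalStrokes": "1"},
    {"ucn": "U+4E8C", "char": "二", "kTotalStrokes": "2"},
]
rows = listify_sketch(data, ["ucn", "char", "kTotalStrokes"])

# The row list can be handed straight to csv.writer.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
```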
unihan_tabular.process.load_data(files)

Extract zip and process information into CSVs.

Parameters:
  • files (list) – list of files to load
Returns:

string of combined data from files

Return type:

str

unihan_tabular.process.normalize(raw_data, fields)

Return normalized data from UNIHAN data files.

Parameters:
  • raw_data (str) – combined text files from UNIHAN
  • fields (list) – list of columns to pull
Returns:

list of unihan character information

Return type:

list
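Unihan data files hold one tab-separated entry per line (code point, field name, value), with comment lines starting with #. The sketch below shows how such text could be folded into one dict per character. It is an illustration under assumptions, not the library's implementation: normalize_sketch is a hypothetical name, and characters with no requested field are simply omitted here.

```python
def normalize_sketch(raw_data, fields):
    """Group Unihan lines into one dict per character, keeping chosen fields."""
    records = {}
    for line in raw_data.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        ucn, field, value = line.split("\t", 2)
        if field not in fields:
            continue
        record = records.setdefault(
            ucn, {"ucn": ucn, "char": chr(int(ucn[2:], 16))}
        )
        record[field] = value
    return list(records.values())


raw = (
    "# Sample lines in the Unihan_Readings.txt layout\n"
    "U+4E00\tkMandarin\tyī\n"
    "U+4E00\tkDefinition\tone; a, an; alone\n"
    "\n"
    "U+4E8C\tkMandarin\tèr\n"
)
normalized = normalize_sketch(raw, ["kMandarin"])
```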

unihan_tabular.process.not_junk(line)

Return False on newlines and comment lines (Unihan data files mark comments with a leading #).

unihan_tabular.process.zip_has_files(files, zip_file)

Return True if zip has the files inside.

Parameters:
  • files (list) – list of files inside zip
  • zip_file (zipfile.ZipFile) – zip file to look inside.
Returns:

True if all files are present in zipfile.ZipFile.namelist().

Return type:

bool
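The two zip checks documented on this page (has_valid_zip and zip_has_files) can be approximated with the standard zipfile module. The sketch below assumes a subset test against namelist() and uses zipfile.is_zipfile for validity; both function names are hypothetical.

```python
import os
import tempfile
import zipfile


def has_valid_zip_sketch(zip_path):
    """True if zip_path exists and looks like a zip archive."""
    return os.path.isfile(zip_path) and zipfile.is_zipfile(zip_path)


def zip_has_files_sketch(files, zip_file):
    """True if every name in files appears in the archive's name list."""
    return set(files).issubset(zip_file.namelist())


with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "Unihan.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("Unihan_Readings.txt", "")
        zf.writestr("Unihan_Variants.txt", "")

    valid = has_valid_zip_sketch(zip_path)
    missing = has_valid_zip_sketch(os.path.join(tmp, "absent.zip"))
    with zipfile.ZipFile(zip_path) as zf:
        has_both = zip_has_files_sketch(
            ["Unihan_Readings.txt", "Unihan_Variants.txt"], zf
        )
        has_extra = zip_has_files_sketch(["Unihan_IRGSources.txt"], zf)
```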

Utility and helper methods for the script.

util

unihan_tabular.util.ucn_to_unicode(ucn)

Return a python unicode value from a UCN.

Converts a Unicode Universal Character Number (e.g. “U+4E00” or “4E00”) to Python unicode (u'\u4e00').
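The conversion amounts to parsing the hexadecimal code point and mapping it to a character. A minimal Python 3 sketch, assuming the input is a valid code point with or without the U+ prefix (ucn_to_unicode_sketch is a hypothetical name, not the library function):

```python
def ucn_to_unicode_sketch(ucn):
    """Convert 'U+4E00' or '4E00' to the corresponding character."""
    hex_digits = ucn[2:] if ucn.upper().startswith("U+") else ucn
    return chr(int(hex_digits, 16))


with_prefix = ucn_to_unicode_sketch("U+4E00")
without_prefix = ucn_to_unicode_sketch("4E00")
```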

unihan_tabular.util.ucnstring_to_python(ucn_string)

Return string with Unicode UCNs (e.g. “U+4E00”) converted to native Python Unicode (u'\u4e00').
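For a string that mixes UCNs with other text, a regular-expression substitution is one possible approach. The sketch below is an assumption about the behaviour rather than the library's code: ucnstring_to_python_sketch and UCN_PATTERN are hypothetical names, and the pattern assumes every match is a valid code point of four to six hex digits.

```python
import re

# Matches 'U+' followed by 4 to 6 hexadecimal digits.
UCN_PATTERN = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def ucnstring_to_python_sketch(ucn_string):
    """Replace each UCN in the string with the character it names."""
    return UCN_PATTERN.sub(lambda m: chr(int(m.group(1), 16)), ucn_string)


converted = ucnstring_to_python_sketch("U+4E00 U+4E8C")
```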

unihan_tabular.util.ucnstring_to_unicode(ucn_string)

Return ucnstring as Unicode.

Test helper functions for downloading and processing Unihan data.