Monthly Archives: June 2012

GIS references in data.un.org

I’ve been playing with most of the data.un.org dataset: all 5577 csv files and 21195188 rows of it. And it’s a fascinating dataset when you see it all together. I’ve already written about how to access the data.un.org datasets from an external application: now it’s time to look at the headings and indices (first row and first column) in them all.

Details: I looked at 5577 csv files in the data.un.org dataset, and automatically excluded the footnotes at the end of each dataset. Data.un.org data that was excluded from the investigation were as follows: the UN interface limits downloads to 50000 rows of data, so 159 files in the set are incomplete; and 25 files were excluded because they’re in a format (multi-sheet Excel files) that needs further work to separate comments from data. In all, there are 21195188 rows of data in the remaining dataset, so much of the following work had to be automated. Every cell in the first row of each of these files was added to file “headings.txt”; every cell in the first column of each of these files (excluding the first row) was added to file “indices.txt”.

Data.un.org files excluded (because they were in Excel format) were:

  • Human Development Indices: A statistical update 2011
  • Indicators on Women and Men
  • OECD Data
  • World Tourism Data
  • World Fertility Data
  • World Marriage Data
  • World Contraceptive Use
  • Key Indicators of the Labour Market, 7th Edition
  • WTI Data

My goal here was to survey the types of headers used, so I could create lists of errors, inconsistencies etc. for known index types (e.g. country names), and start the work of cross-matching indices and headers to each other.

So. The results are: the indices contain these types:

  • Most files: Country, region and economic group names, with a variety of spelling errors.
  • International Monetary Fund files: a code called OID. The second column (country name) from these files has also been added to the list of indices.

Headings are more varied:

  • Dates/times (Jan, Feb, year, number of years etc)
  • Age-related (e.g. age, age groups)
  • Geographical (e.g. country, OID, national station id, station name, residence area, WMO station number, city, country of birth, country of citizenship, country or territory, country or area, reference area, area etc)
  • City type
  • Birth/death details (cause of death, birth weight etc)
  • Number of people (e.g. number of refugees, children etc)
  • Personal (gender, sex, religion, marital status)
  • Nationality/language (citizenship, ethnic group, native or foreign born, language)
  • Household (household size, type of household etc)
  • Financial value (USD, wealth quintile, currency, SNA)
  • Employment/education (occupation, industry, education, literacy)
  • Trade (Commodity/ISIC rev 3, flow=import or export, etc)
  • General measure words (observation value, quantity, quantity name, weight, measure etc)
  • Classification words (subgroup, item, type etc)
  • Metadata (Description/source/series, survey coverage, activity status, record type, reliability, variant, footnotes ““ these headings are used to refer to footnotes at the end of the file)

More detailed data should be attached to this post.

Indices and headings in data.un.org

‘Data.un.org is the UN Statistics Division’s Internet-accessible repository for data. It’s potentially incredibly useful to the world, but is lagging behind sites like data.worldbank.org and data.gov because it doesn’t have a machine API, it isn’t normalized (the datasets aren’t in a form that makes them easy to use with each other) and there are spelling errors in some of the headers and indices, notably in the names used for geographical locations (e.g. “USA” and “United States” are both used).

I’ve written already about how to access the data.un.org datasets from an external application and about the types of headers and indices within it. Now it’s time to make the geographical references more usable.

I looked at 5577 csv files in the data.un.org dataset, and automatically excluded the footnotes at the end of each dataset. Data.un.org data that was excluded from the investigation were as follows: the UN interface limits downloads to 50000 rows of data, so 159 files in the set are incomplete; and 25 files were excluded because they’re in a format (multi-sheet Excel files) that needs further work to separate comments from data. In all, there are 21195188 rows of data in the remaining dataset, so much of the following work had to be automated.

I collected the indices and headers from all the datasets into lists: the headers list was searched for geographical references, and the indices list was used to produce a list of corrections from the data.un.org geographical indices into both ISO standard 3166 (countrynames) and UNSTATS’ list “Country and Region Codes for Statistical Use” of region, country and economic group names on data.un.org.

Geographical references in the headers are:

  • country of birth, country of citizenship, country or area, country or territory, country or territory of asylum or residence, country or territory of origin, reference area.
  • OID.
  • WMO station number, station name, national station id number.
  • City.
  • Area, residence area, city type.

The GIS naming standards found in the dataset were:

Country names: Most of the indices and country references are a close match to the UNSTATS standard or ISO3166 (as used by the World Bank etc), although country names in particular are very inconsistent in this dataset. The indices discussion below gives instructions on

  • how to correct all of the country and region names to either of these standards. This applies to headers country of birth, country of citizenship, country or area, country or territory, country or territory of asylum or residence, country or territory of origin and reference area. OID: International Monetary Fund’s internal GIS standard.
  • WMO station number, Station Name, National Station Id Number: World Meteorological Organisation references to meteorological stations.
  • City: UNSD Demographic Statistics (code: POP) includes city names. No standard has yet been identified for these names.
  • Area: UNSD Demographic Statistics (code: POP) and World Health Organisation (code: WHO) use the headings Area and Residence Area to classify the geographic extent of coverage (e.g. “Total”, “Urban”, “Rural”). UNSD Demographic Statistics also uses the heading “City type” to subclassify cities too (e.g. “City proper”, “Urban agglomeration”).

Most of the data.un.org datasets contain information that is listed by country (e.g. Yemen), region (e.g. West Africa) or economic group (e.g. Developing Regions). Looking at the placenames in the indices, we see that they are a mix of country, region and economic group names, with different spellings and formats for similar names (e.g. “Yemen”, “YEMEN”, “Yemen,Rep.”, “Yemen, Republic of” etc).

Two standards are similar to the placenames used in these files: ISO3166 and the “composition of regions” list published by data.un.org.

  • ISO3166 is a widely-used standard, but contains code for countries and their subregions only (e.g. has no official lists of larger regions or economic areas) and is published as tables online and available (although without the list of withdrawn codes) in the Python library pycountry.
  • The Unstats list (which ISO3166 is partially based on) contains countries, regions and economic areas, but is available only as an html table at http://unstats.un.org/unsd/methods/m49/m49regin.htm. The countries list is available as a ScraperWiki dataset which needs some editing to make it usable. The regions list has been scraped by hand for now. There are two main lists in it: the regions, subregions and countries by physical location, and the economic status (e.g. “Developing regions”, “Least developed countries”) of each country and region. These are mostly consistent, with a couple of oddities. For instance, Netherland Antilles doesn’t appear on the list of countries, but does appear on the list of small island developing states. Luckily it has a country code, so it’s been included in the list of countries.

The indices were checked against both these standards. Suggested improvements to the indices and standards include:

  • Make the regions list available as a csv file online, to include withdrawn country codes, assignment dates and withdrawal dates (these are needed to match names for earlier years).
  • Make the economic status list available as a csv file online.
  • Lobby ISO to create a region (Africa, West Africa, North America etc.) code standard, if it doesn’t already exist.
  • Lobby ISO to correct inconsistencies in the ISO countries list (e.g. republic not Republic in Bolivia’s name).
  • Make a definitive statement about which GIS naming standard (ISO, UNstats etc) UN online data should attempt to adhere to.
  • Change all the data.un.org datafiles to meet this standard.

Against the ISO3166 standard, the data.un.org csv index errors were:

  • Withdrawn countries with no ISO3166 code: “East Timor\”, \”Czechoslovakia, Czechoslovak Socialist Republic”, \”USSR, Union of Soviet Socialist Republics\”, “Yemen, Yemen Arab Republic\”, “Yemen, Democratic, People’s Democratic Republic of\”, “Yugoslavia, Socialist Federal Republic of”, “Germany, Federal Republic of”, “German Democratic Republic”, “US Miscellaneous Pacific Islands\”, “Wake Island\”, “Serbia and Montenegro\”.
  • Abbreviation, e.g. “Rep.” for “Republic”, “St.” for “Saint”, “Is.” For “Island”, “Isds” for “Islands”, “&” for “and”.
  • Added markers, e.g. “+” added to the end of region names, to differentiate them from countrynames.
  • Capitalisation, e.g. “YEMEN” for “Yemen”, “republic” for “Republic”, “The” for “the” and the” for “The”.
  • Brackets: UNICEF in particular uses brackets “()” instead of commas in placenames
  • Standards confusion: the ISO3166 labels “name” and “official_name” were both used in the same datasets (“name” is available for all countries; “official_name” is not).

Tables of other misspellings against both standards are given below. Some of these errors are the use of familiar names (e.g. Brunei, Ivory Coast, China) or issues with character translation (e.g. Cote d’Ivoire). Some names could not be resolved: remaining queries include the code for French Polynesia, whether “Christmas Is.(Aust)” is Christmas Island, whether St. Helena refers to just the island of Saint Helena, or “Saint Helena, Ascension and Tristan da Cunha” and whether Palestine and Palestinian Territories refer to “Palestinian Territory, Occupied”