The coolest thing about data analysis…

…is that it makes you think very hard about the world.  I’m having a short break between jobs. After 2 weeks of cycling, kayaking, camping, tidying, diy, meditation, niggling chores and piano playing, I’m finally ready to start thinking about data again.

Kaggle is a lovely thing. Not only does it encourage learning and cooperation between data scientists and create better algorithms for worthy causes, it also provides nice clean data to play with (believe me, raw data is way messier than this).  I’m currently geeking out on the San Francisco crime dataset – basically a huge CSV containing the date, category (and details), location (with police district) and outcomes of SF crimes from 2003 to 2015.

I could just throw a bunch of learning algorithms at this dataset (and, in truth, am, because that’s what Kaggle’s about) but I’m finding it interesting to think first about what I know about crime patterns from my time in the UK, whether those transfer to the US, and what else might be interesting to look for if I know more about the area.  For instance, are car crimes grouped near easy escape routes (thank you SFdata.gov for the major roads map); are there more thefts on area ‘boundaries’ (a UK study a while back showed more crimes near areas that perpetrators felt comfortable in); are there more juvenile and petty crimes near schools (also in SFdata.gov)?  I know that this is consciously testing out my own biases, but I’m on holiday and it’s an interesting thought experiment.
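For the curious, my first pass at a dataset like this is usually just a few lines of pandas. Here’s a rough sketch; the column names (Dates, Category, PdDistrict, and X/Y for longitude/latitude) are what I remember from the Kaggle train.csv, so check them against your own download:

import pandas as pd

# A first look at the Kaggle SF crime file. Column names are from memory
# (Dates, Category, PdDistrict, X=longitude, Y=latitude) -- adjust if yours differ.
crimes = pd.read_csv("train.csv", parse_dates=["Dates"])
print(crimes["Category"].value_counts().head(10))           # most common crime types
print(crimes.groupby("PdDistrict").size().sort_values())    # reports per police district

# e.g. a first stab at the car-crime question: where do vehicle thefts cluster?
vehicle = crimes[crimes["Category"] == "VEHICLE THEFT"]
print(vehicle[["X", "Y"]].describe())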

There’s a similar style of dataset (Taarifa’s water point challenge) on DrivenData. Some similar thinking might be interesting there too.

This is not my journey

I spent some of my Christmas break thinking about work styles: what worked last year, what didn’t, and what I could do to improve my own.  I’ve got it down to just two things: “this is not my journey” and “do what the boss asks for”.

People often talk of their jobs (and themselves) as something that they do now, as in at one particular point in time. That’s a little like saying “I’m in seat 29C” instead of “I’m flying from New York to Japan and when I get there I’m going to try out the heated toilet seats” when someone asks you where you are.  We are all on journeys – sometimes literally, but always on journeys through time, careers, relationships.  And if you want to think about your career, a journey is a useful idea.

So last year I got really frustrated because I ended up doing lots of things that needed to be done, leaving little time for the things I loved (and am good at). And at the same time, I watched people around me refusing to do those things, yet doing so much better in their careers and gaining so much more respect for what they did. “This is not my journey” came out of that.  The big thing I realised was that the people who were doing well were playing by big-company rules, while I was still working as though I was in a startup or community organisation.

Startups are different. Over the years I’ve started companies, helped start companies and communities, been in small companies that grew, worked in government agencies and 90,000-strong multinationals.  And I’ve noticed two transition points: around 5 and 120 people.  With 5 people, you’re completely flat, and to be honest, also flat-out: everyone does what needs to be done without “but I don’t do that”, and although you may have “roles”, you’re generally just working as a team.  That’s often the founding team: once they start hiring, the concept of staff happens, and it takes leadership not to divide into ‘founders’ and ‘others’.  Communities tend to run this way too, with everyone pitching in and helping where they can. Up to about 120 people, companies are “small”, with everyone knowing each other and helping each other out where they can.  But at about 120 people, divisions start: you can remember the names of about 100-120 people, but beyond that new people become faces unless they’re directly working with you.  At this point, people have defined job roles and a limited group of connections in the organisation, and there are just too many things that need to be done to be able to do them informally any more. And at this point (or sooner), big-company rules start to apply; these can basically be summarised as “do what the boss asks for” and “do what’s on your journey”.

“Do what the boss asks for” is obvious when you have a defined boss – in the small flat organisation (no defined “bosses”) it doesn’t make sense, but it’s absolutely the path to happiness under big-company rules (with negotiation about what’s a fair workload etc of course). It’s a simple sorting question for any new task: “is this what the boss asked for?”.

“Do what’s on your journey” covers what you do with the interstitial time: the times when you’ve done what the boss asked for (or just plain need a short break from it) and are filling in with small jobs, training courses etc. It’s about doing the things that grow you as a person, in the directions that you want to grow and become stronger, but to do this you do need a journey: a knowledge of who you are, what you want to be, what you want to be able to do and be known for.  I spent part of Christmas working on that too. My journey is the same as it was 3, 5, 10 years ago, but now I have a clear description (thank you, social-worker sisters-in-law and “Business Model You”) of where I’m going.  My sorting question for this isn’t “do what’s on your journey” because that’s a terrible way to test anything; instead it’s “this is not my journey” – if I can say that about a non-boss task, then it’s now not on my list.

It’s already a much saner year. Apart from the odd boss-overload, my filters have kept my workload both manageable and relevant to my career. I also seem to be getting a bit more respect for it: when I mentioned the plan to a Wall Street friend, she said “oh, you’re playing by the blokes’ rules” and explained that nobody respects the person who picks up odd jobs, so perhaps that shouldn’t be too much of a surprise.  What has been a surprise is that in remote organisations, the switch from small to big-company rules happens at a much much smaller number of people (around 20, sometimes as low as 12), although again the larger distances, timezones and smaller bandwidths (as in you’re not seeing people across the office floor, and it’s hard to have watercooler conversations with everyone) should make that somewhat less surprising.

Web Scraping, part 1: files and APIs

Web scraping is extracting information from webpages, usually (but not always) as tables of data that you can save to csv files, json/xml files or databases.

Design it first, then scrape it

When you start on any piece of code, try asking yourself some design questions first; definitely do this if you’re thinking about something as potentially complex as web scraping code.   So you’ve seen a dataset hiding in a website – it might be a table of data that you need, or lists of data, or data spread across multiple pages on the site. Here are some questions for you:

1. Do you need to write scraper code at all?

    • Is the dataset very small – if you’re talking about 10 values that don’t get updated, writing a scraper will take longer than just typing all the numbers into a spreadsheet.
    • Has the site owner made this data available via an API (search for this on the site) or downloadable datafiles, either on this site or connected sites?  APIs are much much faster (and less messy) to use than writing scraper code.
    • Did someone already scrape this dataset and put it online somewhere else?  Check sites like scraperwiki.com and datahub.io.

2. What do you need to scrape?

    • What format is the data currently in?  Is it on HTML webpages, or in documents attached to the site?  Is it in Excel or PDF files?
    • Does the data on this site get updated?  e.g. do you need to scrape this data regularly, or just once?

3. What’s your maintenance plan?

    • Where are you planning to put the data when you’ve scraped it?  How are you going to make it available to other people (this being the good and polite thing to do when you’ve scraped out some data)? How will you tell people that the data is one-off or regularly updated, and who’s responsible for doing that?
    • Who’s going to maintain your scraper code?  Website owners change their sites all the time, often breaking scraper code – who’s going to be around in a year, two years etc to fix this?

Reading files

Okay. Questions over. Let’s get down to business, working from easy to less-easy, starting with “you got to the website, there’s a file waiting for you to download and you only need to do this once”.

You’ve got the data, and it’s a CSV file

Lucky you: pretty much any visualization package and language out there can read CSV files.  You’ll still have to check the data (e.g. look for things like messed-up text, and be suspicious if all the biggest files are the same size) and clean it (e.g. check that your date columns all contain properly formatted dates, and that you have the right number of codes for gender – and no, it’s not always two), but as far as scraping goes, you’re done here.
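If you’d rather do those checks in code than by eyeballing the file, a few lines of pandas will get you started (a minimal sketch, with an entirely made-up filename):

import pandas as pd

# Hypothetical filename -- substitute your own download.
df = pd.read_csv("downloaded_data.csv")

print(df.shape)     # how many rows and columns did we actually get?
print(df.dtypes)    # did date and number columns come in as dates/numbers, or as text?
print(df.head())    # eyeball the first few rows for messed-up text
print(df.nunique()) # e.g. a gender column with 37 unique values deserves a closer look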

You’ve got the data, and it’s a file with loads of brackets in it

Also, the file extension (the part after the last “.” in a filename) is probably “json”.  This is a json file  – not all data packages will read in this format, so you might have to convert it to CSV (and it might not quite fit the rows-by-columns format so you’ll have to do some work there too), but again, no scraping needed.
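If you do end up converting JSON to CSV yourself, Python’s standard library will get you most of the way there. A sketch, assuming a hypothetical file whose top level is a list of flat records (real JSON is often more nested than this, so expect some extra flattening work):

import csv
import json

# Hypothetical filenames; adjust to your own data.
with open("downloaded_data.json") as f:
    records = json.load(f)   # assumes the top level is a list of dicts

# Collect every key that appears in any record, then write rows-by-columns.
fieldnames = sorted({key for record in records for key in record})
with open("converted_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)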

You’ve got the data, and it’s a file with loads of <>s in it

Either you’ve got HTML files (look for obvious things like HTML tags: <html>, <head>, <p>, <h1>, etc., and text outside the opening <name> and closing </name> brackets) or you’ve got an XML file.  Another big hint is if the file extension is “.xml”.  Like JSON, XML is read in by many but not all data visualization packages, and might need converting to CSV files; a few quirks make this a little harder than converting JSON, but there’s a lot of help online on this.
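Simple XML cases can also be handled with the standard library. A sketch, assuming a hypothetical file in which each <record> element holds one row of data (inspect your file first to see what its repeating element is actually called):

import csv
import xml.etree.ElementTree as ET

# Hypothetical file and tag names.
tree = ET.parse("downloaded_data.xml")
rows = []
for record in tree.getroot().iter("record"):
    rows.append({child.tag: child.text for child in record})

fieldnames = sorted({key for row in rows for key in row})
with open("converted_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)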

You’ve got the data and it’s a PDF file

Ah, the joys of scraping PDF files. PDF files are difficult because even though they *look* like text files on your screen, they’re not nearly as tidy as that behind the scenes.  You need a PDF converter: these take PDF files and (usually) convert them into machine-readable formats (CSV, Excel etc). This means that the 800-page PDF of data tables someone sent you isn’t necessarily the end of your plan to use that data.

First, check that your PDF can be scraped.  Open it, and try to select some of the text in it (as though you were about to cut and paste).  If you can select the text, that’s a good sign – you can probably scrape this pdf file. Try one of these:

  • If you’ve got a small, one-off PDF table, either type it in yourself or use Acrobat’s “save as tables” function
  • If you’ve got just one large PDF to scrape, try a tool like PDFTables or Cometdocs
  • If you want to use open source code – and especially if you want to contribute to an open-source project for scraping PDF data – use Tabula (there’s a small code sketch below).
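If you’d rather drive Tabula from code, there’s a Python wrapper called tabula-py (it needs Java installed). A minimal sketch, with a hypothetical filename:

import tabula

# read_pdf hands back pandas DataFrames, one per table it finds.
# Filename is hypothetical; tabula-py needs Java on your machine.
tables = tabula.read_pdf("data_tables.pdf", pages="all", multiple_tables=True)
for i, table in enumerate(tables):
    table.to_csv("table_{}.csv".format(i), index=False)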

If you can’t select the text, that’s bad: the PDF has probably been created as an image of the text – your best hope at this point is using OCR software to try to read it in.

Using APIs

An Application Programming Interface (API) is a piece of software that allows websites and programs to communicate with each other.

So how do you check if a site or group has an API?  Usually a Google search for their name plus “API” will do, but you might also want to try  http://api.theirsitename.com/ and http://theirsitename.com/api (APIs are often in these places on a website).  If you still can’t find an API, try contacting the group and asking them if they have one.

Using APIs without coding

APIs are often used to output datasets requested using a RESTful interface, where the dataset request is contained in the address (URL) used to ask for it.   For example, http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:2015 is a RESTful call to the World Bank API that gives you the rural population (as a percentage) of all countries for the years 2000 to 2015 (try it!).  If you’ve entered the RESTful URL and got a page with all the data that you need, you don’t have to code: just save the page to a file and use that. Note that you could also get the information on rural populations from http://data.worldbank.org/indicator/SP.RUR.TOTL.ZS but you won’t be able to use that directly in your program (although most good data pages like this will also have a “download” button somewhere too).

APIs with code

Using code to access an API means you can read data straight into a program and either process it there or combine and use it with other datasets (as a “mashup”). I’ll use Python here, but most other modern languages (PHP, Ruby etc) have functions that do the same things.  So, in Python, we’re going to use the requests library (here’s why).

import requests
worldbank_url = "http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:2015"
r = requests.get(worldbank_url)

Erm. That was it.  The variable r contains several things: information about whether your request to the API was successful, error codes if it wasn’t, and the data you requested if it was. You can quickly check for success by seeing if r.ok is equal to True; if it is, your data is in r.text; if it isn’t, look at r.status_code and r.text, take a big deep breath and start working out why (you’ll probably see 200, 400 and 500 status codes: here’s a list to get you started).

Many APIs will offer a choice of formats for their output data. The World Bank API outputs its data in XML format unless you ask for either json (add “&format=json” to the end of worldbank_url) or CSV (add “&format=csv”), and it’s always worth checking for these if you don’t want to handle a site’s default format.
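Putting those last two paragraphs together, checking r.ok and asking for json so that Python can parse the result directly, looks something like this (same World Bank indicator as above):

import requests

worldbank_url = "http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:2015"
r = requests.get(worldbank_url, params={"format": "json"})

if r.ok:
    data = r.json()   # in json format the World Bank returns [metadata, list of records]
    print(data[0])    # metadata: page, pages, total records
else:
    print(r.status_code, r.text)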

Sometimes you’ll need to give a website a password before you can get data from its API.  Here, you add a second parameter to requests.get, “authenticating” that it’s you using the API:

import requests
r = requests.get('https://api.github.com/user', auth=('yourgithubname', 'yourgithubpassword'))
dataset = r.text

That’s enough to get you started. The second half of this post is on scraping websites when you don’t have an API. In the meantime, please use the above to get yourself some more data.  Places you could start at include HDX and datahub.io (these are both CKAN APIs).
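CKAN sites all share the same “action” API, so one piece of code works on both. A sketch; I’m assuming here that HDX’s CKAN lives at data.humdata.org, and the search term is just an example:

import requests

# CKAN action API: the same package_search call works on datahub.io too.
r = requests.get("https://data.humdata.org/api/3/action/package_search",
                 params={"q": "typhoon", "rows": 5})
r.raise_for_status()
for dataset in r.json()["result"]["results"]:
    print(dataset["title"])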

Ruby day 9: Local power!

[Image: OSM edits for Typhoon Ruby]

This. Just this: local mappers made more changes to the map of the Philippines during Typhoon Ruby than anyone else in the world (by a very very big margin). Anyone who doesn’t believe in the strength of local people to build their own resilience should look very, very hard at these numbers.

Ruby’s all over now for the mappers – DHN is de-activated, everyone’s gone back to work.  There’s still a lot of work to do on the cleanup: MarkC mentioned 35000+ houses destroyed and 200000 people without shelter, and there will still be OSM mapping to do for that.  This weekend Celina’s running a “train the trainers” OSM event in Manila: if you’re one of the people who created the figures above, please please go and help spread your skills further!

Ruby day 8: Next

Ruby has gone now – “Goodbye, Ruby Tuesday” is apparently becoming a popular song here.  But the cleanup work is only just starting.  Celina spends a lot of the day trying to get UAV stuff sorted out; we get word that the team is getting imagery in the worst-affected areas, and she works on getting that data stored and back to the mapping groups that need it.

Response teams are moving into the field – many by boat because they can’t get flights. Requests still come in – one for an assessment of damage to communications and media stations (I suggest that Internews might have a list for this, then find that Agos is tracking communications outages). I stop with the lunchtime work and get back to the day job.

ACAPS puts out an overview map but still needs a rainfall map for it. Maning from the Philippines observatory just happens to have one (he’s been working on this for a while).

6pm: A government sit rep has data in it that needs scraping: I get out the pdf tools. Cometdocs dies on it, so I move over to PDFTables. We will slowly teach governments to release tables as datasets not pdfs. We will.  The ACAPS HIRA work starts. First, I have some pdfs to scrape. Except the sitreps from http://www.ndrrmc.gov.ph are PDFs with images of excel spreadsheets cut-and-pasted in. Start reaching out, trying to find out who has the original datafiles, but rein in when gently reminded that these things are political.

Connecting Ushahidi data to the HDX repository

I’ve been talking to the HDX team for some time now (well, since before HDX was a thing, but then so have many of us).  HDX is a data repository for humanitarian data: basically, it’s a place to put machine- and human-readable datasets so that other people can use them too.

Ushahidi tools (Ushahidi platform instances, Crowdmap instances) often have datasets in them that could be useful to other people, so part of the conversation has been about how to share data from Ushahidi sites, both on the HDX site and in the HXL humanitarian standard.

Ushahidi CSV to HDX CSV

First let’s look at how to share the CSV file that Ushahidi creates when you click on the “download reports” button. Before you do this, please, please, please read my post about mitigating potential risks to people from sharing your Ushahidi data.

Converting that CSV into HXL format is pretty easy: it’s as simple as adding a row to your CSV file.  Under the heading row (“#”, “INCIDENT TITLE”, ”INCIDENT DATE”, “LOCATION”, “DESCRIPTION”, “CATEGORY”, “APPROVED”, “VERIFIED”, “LATITUDE”, ”LONGITUDE”), add another row with the HXL tags (#x_uniqueid, #x_reporttitle, #report_date, #loc, #x_description, #x_category, #x_approved, #x_verified, #lat_deg, #lon_deg).  Erm… that’s it.
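If you’d rather not edit the file by hand, adding that row from Python takes only a few lines (a sketch, using a hypothetical filename and the column/tag order listed above):

import csv

# HXL tags in the same order as the Ushahidi CSV columns listed above.
hxl_tags = ["#x_uniqueid", "#x_reporttitle", "#report_date", "#loc", "#x_description",
            "#x_category", "#x_approved", "#x_verified", "#lat_deg", "#lon_deg"]

with open("ushahidi_reports.csv", newline="") as f:   # hypothetical filename
    rows = list(csv.reader(f))

with open("ushahidi_reports_hxl.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(rows[0])      # original heading row
    writer.writerow(hxl_tags)     # HXL tag row goes directly under it
    writer.writerows(rows[1:])    # then the data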

Loading the CSV file into HDX is also pretty easy: you go to HDX, create an account (and an organisation, if you need to), then click on the ‘upload data’ button and follow the instructions.

Ushahidi API to HDX CSV

I did this recently for some Ushahidi location data (I used the API to pull out all latitude-longitude-placename rows from the Ushahidi instances used for the last Philippines typhoons, then wrote them to a CSV which I cleaned up a bit before posting manually to HDX).

If you can write (or run) Python, you can access a Ushahidi Platform API and get all the reports (or locations or categories etc) from an Ushahidi Platform instance – IF that instance has turned on its API (most have). This allows you to do more complex things, like get data from multiple Ushahidi instances, process it a bit (e.g. removing all those repeat locations) or merge it with other data (e.g. Pcodes).
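Here’s a minimal sketch of that, assuming a classic (v2) Ushahidi instance at a hypothetical address with its API switched on. The endpoint and field names vary a little between platform versions, so treat these as a starting point rather than gospel:

import csv
import requests

# Hypothetical site; the v2 API usually lives at /api with a "task" parameter.
base_url = "http://www.yoursite.com/api"
r = requests.get(base_url, params={"task": "incidents", "by": "all"})
r.raise_for_status()

# Field names below are the common v2 ones -- check your own instance's output.
incidents = r.json().get("payload", {}).get("incidents", [])
with open("ushahidi_reports.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "date", "latitude", "longitude"])
    for item in incidents:
        inc = item.get("incident", {})
        writer.writerow([inc.get("incidenttitle"), inc.get("incidentdate"),
                         inc.get("locationlatitude"), inc.get("locationlongitude")])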

Geeky bit (non-coders, look away now): This API to HDX transfer could be done as a 1-click button on the reports page, if we add an “HDX” plugin to the Ushahidi Platform.  The design for the plugin is basically: a) allow users to add their HDX settings using the plugin’s “settings” page, and b) add a  button to the reports page, and use the existing Ushahidi CSV creation code to create a CSV file, then push that file through the HDX API using the settings that the user’s already added. This would be a pretty sweet Ushahidi plugin for a community coder to write: it’s well-defined, and most of the code needed is already there.

Ushahidi API to HDX API call

HDX not only allows you to upload datasets, it also allows you to add a link to an API, so HDX sees the current information from a site (not a snapshot that you’ve turned into a CSV file).  You can do this by adding one of the Ushahidi API calls to the HDX “link” field – e.g. if your site is called www.yoursite.com, the link would be one of that site’s API call URLs (like the reports call sketched above).

Again, please please please read my notes about mitigating potential risks to people from sharing your Ushahidi data before doing any of these.

HDX to Ushahidi

The HDX datastores include files in CSV, JSON, XML and Shapefile formats.  You can import Ushahidi reports data from HDX into an Ushahidi instance (but be sure you have permission to do this before you do); you can also upload shapefiles as layers (converted to WMS or KML format: the ogr2ogr tool can do this for you) into an Ushahidi site.
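ogr2ogr is a command-line tool; if you’re already working in Python you can just shell out to it. A sketch with hypothetical filenames, assuming the GDAL/OGR tools are installed:

import subprocess

# Convert a shapefile to KML so it can be uploaded as an Ushahidi layer.
# Filenames are hypothetical; ogr2ogr ships with the GDAL/OGR tools.
subprocess.run(["ogr2ogr", "-f", "KML", "flood_zones.kml", "flood_zones.shp"], check=True)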

Helping out with HDX and Ushahidi

HDX is an open-source project (https://github.com/OCHA-DAP/): if you’re a Python coder, please consider helping out with its  codebase.

The HDX team are also working on unlocking useful data that’s still in PDFs. If you want a small task to help them with that, try this: unlocking humanitarian planning data.

The Ushahidi Platform is also open-source (http://github.com/ushahidi): if you’re a PHP coder, please consider helping out there too.  The plugin specification above could be a nice place to start…

Design: Knowing what you want to build and why

Cross-posted from http://icanhazdatascience.blogspot.com/.

I recently taught a coding course for international development students (yes guys, I’m still marking your assignments). But teaching coding isn’t useful in itself – it’s like teaching someone a new language without understanding what they’d want to communicate in it – so woven throughout the course was a design exercise, so students would end the course with a design for a system they wanted to build, and a tangible use for their new skills.

We focussed on user experience (UX): the process of designing a system from the user’s point of view, rather than the coder’s assumptions about what the user might like. And specifically on a set of UX tools that together define a system:

  • personas
  • user journeys
  • sitemaps
  • wireframes
  • graphic designs
  • user stories

I pointed students at two great resources on this: the beginner’s guide to UX and UX Apprentice, but here’s a quick run-down on the basics.

Design tools

If you’re doing visual designs (wireframes, site maps etc), pen and paper is both free and easy to edit, but not so good at storing and sharing edits between people electronically. Better tools for that are Balsamiq (free for students) and Pencil (less powerful, but free for everyone).  And Trello is a good tool for user stories.

Top-level design

The top-level UX elements are personas, user journeys and sitemaps.

Personas are examples of each type of user you expect to interact with your system. They’re used to get to know the people who will use your system, understand the problem that they’re trying to solve with it, and how people already solve that problem.

User journeys (aka scenarios: they’re almost the same thing) describe how the user will solve their problem.  A user journey contains elements including

  • Context – where?
  • Progression – what are the steps in this journey?
  • Devices – mobile? laptop? smartphone?
  • Functionality – what do they expect?
  • Emotion – angry? tired? scared? time-pressured?

A sitemap lists all the pages on your site, and how they’re related to each other. Sitemaps can be generated by a “card sort”, where you write all your system functions (or user stories) on post-its, then sort them into groups.

Page design

The UX elements for individual pages on your site are wireframes, mockups and graphics.

A wireframe is a deliberately rough sketch of what’s on the page. It’s deliberately rough to force your stakeholders (users and customers) to concentrate on *what’s* on the page, not what colour it is, which images you’re using etc.  You might see the text “lorem ipsum” on mockups: this is standard ‘nonsense’ placeholder text used by designers, and can be found on sites like http://www.lipsum.com/.

A mockup is more detailed, and used as a proxy for the completed page in exercises like talking potential users through the functions of a site. Clickable mockups are good for this (Balsamiq wireframes are clickable, as in you can click a button on one wireframe and the code will display the page that the button goes to), as are printouts of mockups if you don’t have a machine available.

Graphics are the look-and-feel of your site: its branding and the things that make it seem ‘familiar’ to your users.  This includes the colour palette you use, fonts, shapes and images (icons, header images, backgrounds etc). Not everyone who codes can design graphics, so your main choice is either to find someone who can design and build all the graphic elements, or to use packages of ready-made graphical elements like Bootstrap or Foundation with tools like DivShot.

When you design your pages, it helps to look at similar websites, and ask yourself what you do and don’t like about them, then consider that in your designs.

Back-end design

The above has been all about designing the front-end of a system, e.g. the things that a user can see and the interactions the user has with them.  Back-end design (e.g. algorithm design) is a different beastie, and one to be covered later.

Project management design

Design elements used for communicating with a coding team (including back-end coders) include user stories, kanban and minimum viable products.  These are the foundation of “agile development”, a process designed around building parts of each system quickly in “sprints” (short cycles of coding and feedback) so you can get user feedback on them (and adapt your design to that user feedback) as you go.

User stories are used for planning code sprints, and look like this: as a <role> I want to <goal> in order to <benefit>

Kanban charts are those cards (or post-its) on the wall that you see in every photo of a modern software company.  They usually have cards organized into 5 columns: backlog, ready, coding, testing and done, with a user story or task on each card. An agile project manager (“scrum master”) will pick the most important user stories on their list, and add them to the kanban chart at the start of each code sprint. As programmers write code for each user story, the story will move from left to right across the chart until it ends up in the “done” column.

A minimum viable product (MVP) is the smallest system that you can build and still get useful feedback on. You build this, get feedback, adapt your design and repeat.

That’s it: a basic run-through of design tools for coding.  Now go out there and design something!

Ruby Day 7: Back to Work

7am. Today is a work day, and my deal with work was that I’d be working-from-Philippines, not bunking-off-to-do-disasters, so I’ll only be popping in and out of the chats (and here) from now on.

Jean Cailton from VISOV (the French VOST) has popped up overnight: many VOSTies are online working under the flags of other groups, which is kinda normal for mappers. Someone asks for rainfall data: the image a couple of days ago was based on TRMM data, so I wonder if NASA has daily updates to share (I already have scraper code for this from another project, and I remember that Lea Shanley is working on communities there now); Maning points out that the data is updated 3-hourly, which is good. SBTF data is going to be posted on the Rappler map. And ACAPS puts out their first briefing note: volunteers are working with them, gathering secondary data (news, data etc) for the notes.  Teams have all settled down into a rhythm, so I don’t feel too bad about having to drop out of helping for the day.

11am. Eat breakfast, with triple espressos all round (twice). Stock up on buko juice (coconut water) and head back to work.  The typhoon is winding down: it’s now rated as a severe tropical storm (yawn-worthy in this land of 20 typhoons a year), Borongan airport is open to receive disaster response teams and relief, Guiuan airport is being cleared, and “national roads affected by #RubyPH are now passable” (DPWH). Rappler people are reporting from some of the typhoon-hit areas (Dagami Leyte, Calabarzon, Camarines Sur) and are connected to the SBTF team.  Investigate flights to Dumaguete tomorrow, to catch up with some old friends (and work from a local coffeeshop).

Do dayjob work today. Spend lunchtime being massaged: hunching over a laptop 20/7 for a week has left me with two shoulder knots so big that I’ve given them names.  Try to book a flight to Dumaguete: tomorrow’s flights get switched from the airline to an online reseller whilst I’m trying to book, and the reseller’s site is down. *sigh*. Book a flight for Wednesday – I haven’t come all this way *not* to check in on friends here.  It rains – drizzles – all day; at 5pm the sky is dark enough to need the lights on indoors, but there’s nothing more than a little light rain and wind going on outside. Skypechats are now silent, save for the occasional person coming in to start volunteering. Turning up 3 days late isn’t cool, dudes: volunteer help is usually needed as a disaster hits, not afterwards.

9pm.  Ruby isn’t quite done with me yet: I find myself in another Skypechat, this time the one for helping ACAPS gather the background data needed for future event reports. I’ll be doing a bit each day on that for a couple of weeks, hopefully alongside lots of Filipino volunteers too. It’s still raining a bit here in Manila, but if we didn’t know a typhoon (sorry, tropical storm) was passing overhead, we really wouldn’t notice it.

Sharing community data

I’ve been thinking lately about how open data and community data fit together.  Actually, I’ve been thinking about it for a long time – since we launched OpenCrisis to try to get more data tools and ideas into the hands of crisismappers, and started the long work of trying to archive and share as much mapping data and information as we could. Here are some first thoughts on restarting this work.

Some of the big questions that came up then, and continue to come up, are about ownership, risk and responsibility.  For instance:

  • Ownership. If a community of people add reports to a site, and that site also sucks in data from social media, who owns that data?  That question’s already been asked and answered in many open data and social media sites, often involving much work and lost data as licenses are transferred (see OpenStreetMap’s license moves, for example).  Having your mappers sign a contributor agreement, and having a data license statement on any dataset you put out is a good start here.
  • Risk. This is something we’ve also always dealt with. The types of information that mappers handle mean you need to do a data risk analysis that covers population, mappers, organizations and leads.
  • Responsibility. If you make a dataset public, you have a responsibility – to the best of your knowledge, skills and the advice you can obtain – to do no harm to the people connected to that dataset. That’s a big responsibility, and one that involves balancing making data available to people who can do good with it against protecting the data’s subjects, sources and managers.

This week, I used some Python code to share all the geolocations (latitude, longitude and location name) from 2013 Ushahidi instances for Philippines typhoons.  That was a pretty clear-cut example: it wasn’t a conflict situation, none of the locations were sensitive (e.g. personal homes or addresses of places like rape crisis centers that need to be kept secret), and the data was gleaned from public sites created by large NGOs and open data teams.  Life isn’t always that clear-cut.

Datasets: where and what

Some of the places that mappers hold data are:

  • Google spreadsheets (tables of information),
  • Skypechats (informal information transfers) and googledocs/emails (e.g. annotated highlights from a day’s Skype discussions),
  • OpenStreetMap
  • Micromappers data (often visible as google spreadsheets) and
  • Ushahidi instances (unfortunately afaik, there weren’t any Ushahidi instances updated for Typhoon Ruby, so I couldn’t compare the two sets of data).

Some of the data collected by those mappers includes:

  • Geolocation: Latitude and longitude for each location name in a list.  These are usually either a) found automatically by the platform using a gazetteer like Nominatim, b) input by site administrators, or c) input by the person submitting a direct report.
  • Direct reports: Messages sent by reporters or general public via SMS or web form.  These generally have the standard set of Ushahidi report entries (title, description, category list etc), but might also include custom form fields to capture items of specific interest to the map owner.
  • Indirect reports: Messages scraped from Twitter, Facebook etc.
  • Category lists: The categories that reports are tagged with; these lists are usually created by the site administrator.
  • API data: data input into a platform by another system, using the platform’s API. This includes sensor data, and datasets scraped and aggregated from another platform instance.
  • Media: Images, video, audio files added to a report by the reporter or an administrator.

Not all of this data is owned by the site owner.  For example, third-party data (e.g. Twitter messages) comes with restrictions on storage and ownership that, even with the original sender’s permission, could make it illegal for you to distribute it or keep it on your site.

Who and why?

Open data is, by its nature, open, and it’s difficult to anticipate all the uses that people have for a dataset you release.  Some examples from experience are:

  • Academics – analysis of social media use, group dynamics etc.
  • People in your organization – for lessons learned reports, for illustration, for visualizations, for analysis of things like report tempo (how frequently did information arrive for processing, translation etc)
  • Data subjects – to check veracity of data about them, and to ask for data removal (sometimes under laws designed for this purpose, e.g. EU privacy laws). I haven’t seen this happen yet, but I really really want it to.

If you release a dataset to anyone, you have to assume there’s a risk that it will make its way into the public domain. We’ve seen too many instances of datasets that should have been kept private making it into the public domain (and, to be fair, also instances of datasets that should have become public, and of carefully pruned datasets being criticized for release).  Always have the possibility of accidental release in mind when you assess the risks of opening up data.

How?

Sharing data shouldn’t just be about clicking on a “share” button. There are processes to think about, and risk analysis to do:

  • Ethical process: Assessing the potential risks in sharing a dataset, and selecting which data you should and should not share. Always weigh the potential harm of sharing information from the deployment against the potential good.  If you’re not sure, don’t share; but if you’ve checked, cleaned, double-checked and the risk is minimal (and sharing is ethical: you’re working with other people’s information here), seriously consider it. If the data has come from a personal source (SMS, email etc), check it. At least twice. I generally have a manual investigation done first by someone who already has access to the deployment dataset, with them weeding out all the obvious PII and worrisome data points, then ask someone local to the deployment area to do a manual investigation for problems that aren’t obvious to people outside the area (see under: Homs bakery locations).
  • Legal process: choosing who to share with, writing nondisclosure agreements, academic ethics agreements etc.  You might want to share data that’s been in the media because it’s already out there, but you could find yourself in interesting legal territory if you do (see under: GDELT). In some countries, slander and libel laws could also be a consideration.
  • Physical process: where to put cleaned data, and how to make it available.  There are many “data warehouses” which specialise in hosting datasets; these include the Humanitarian Data Exchange (HDX), which specialises in disaster-related data.  You can also share data by making it public on an Ushahidi instance (e.g. Crowdmap), or by making data available to people on request – see crowdmap.com’s API and public CSV download button.

Some of the things I look for on a first pass include:

  • Identification of reports and subjects: Phone numbers, email addresses, names, personal addresses
  • Military information: actions, activities, equipment
  • Uncorroborated crime reports: violence, corruption etc that aren’t also supported by local media reports
  • Inflammatory statements (these might re-ignite local tensions)
  • Veracity:  Are these reports true – or at least, are they supported by external information?

Things that will mess up your day doing this include untranslated sections of text (you’ll need a native speaker or good auto translate software), codes (e.g. what does “41” mean as a message?) and the amount of time it takes to check through every report by hand.  But if you don’t do these things, you’re not doing due diligence on your data, and that needs to happen.

Other questions that you might want to ask (and that could make your checking easier) include:

  • How geographically accurate does your data release have to be?  E.g. would it be okay (or even better) to release data at a lower geographical precision (e.g. at town level rather than street level)? There’s a small sketch of this after the list.
  • Do you need to release every report?  Most deployments have a lot of junk messages (usually tagged unverified) – remember, the smaller amount of data you release, the less you have to manage (and worry about).
  • Would aggregated data match the need of the person asking for data?  e.g. if they just need dates, locations and categories, do you need to release text too?
  • Time. You might want to leave time after your deployment for the dataset to be less potentially damaging. When you release a dataset, you should also have a “data retirement” plan in place, detailing whether there’s a last date that you want that data to be available, and a process to archive it and any associated comments on it.
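On the precision question: simply rounding latitude and longitude is a crude but effective way of blurring point data before release. A sketch (two decimal places is roughly a kilometre, one is roughly ten; pick whatever your risk assessment says):

def blur_location(lat, lon, decimals=2):
    # Round coordinates to reduce precision before release:
    # 2 decimal places is roughly 1 km, 1 decimal place roughly 10 km.
    return round(lat, decimals), round(lon, decimals)

print(blur_location(14.5995, 120.9842))   # Manila, blurred to roughly town level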

Further reading

There’s a worldwide discussion going on right now about how private individuals, groups and companies can release open data, and on the issues and considerations needed to do that. More places to look include: