Why spreadsheets are hard

Thinking about machine-reading human-generated spreadsheets today, and I think I’ve got a handle on why this is a problem.

  • Data nerds think of data in its back-end sense, of “what form of data do I need to be able to analyse/visualise this”.  We normalise, we worry about consistency, we clean out the formatting.
  • People used to creating spreadsheets for other people think of it more in the front-end sense, of “how do I make this data easily comprehensible to someone looking at this”.

Each has merits/demerits (e.g. reading normalised data and seeing patterns in it can be hard for a human; reading human-formatted data is hard for machines) and part of our work as data nerds is working out how to bridge that divide.  Which is going to take work in both directions, but it’s necessary and important work to do.

WriteSpeakCode/ PyLadies joint meetup 2015-10-22: Tales of Open Source: rough notes

Pyladies: international mentorship program for female python coders

  • meetup.com, NYC PyLadies
  • Lisa moderating, Panelists: Maia McCormick, Anna Herlihy, Julian Berman, Ben Darnell, David Turner
  • Intros:
    • Maia: worked on Outreachy (formerly OPW) – gives stipends to women and minorities to work on OS code; currently at Spring
    • Anna: works at MongoDb, does a lot of Mongo OS work.
    • Julian: works at Magnetic (ad company); worked on Twisted, started OS project (a schema library for validating JSON)
    • Ben: Tornado maintainer, working on OS distributed database in Go.
    • David: ex FSF, OpenPlans, now at Twitter, “making git faster”.
  • Q: how to find OS projects, how to get started?
    • D: started contributing to Xchat… someone said “wish chat had the following feature”… silence… recently, whatever the company is working on. Advice: find the right project, see if they’re interested, then write the feature.
    • B: started on python interpreter, was using game library, needed bindings for library
    • J: looked at OpenHatch OS projects.  Found Twisted – told that if want to get code in there, there’s a review process. Found feature/bug, wrote patch, waited for response – that got him in… vehicle for other people to read and respond to code.
    • A: first OS commit was to Mongodb – interned there after college. Couldn’t get feature to work on her mac, fixed it ’til it ran, then someone asked “are you going to put in a core request”…was first experience of request politics.  Hard to find projects that both need help, and want help. Best to contact first, e.g. “are you interested in a fix for OSX”. Most people’s experience of OS has been rejection or a negatively tinged experience.
    • M: first pull request got landed… top-down approach, “how do I get work experience on a big codebase – obvious answer is OS… applied to outreachy, who have a list of orgs who want donations”… found Gnome music on the list… iTunes for Gnome… looked at list of beginner-friendly bugs, built that (“approx a million years”) on own machine.  Gnome are particularly newbie-friendly.
    • Outreachy deadline is Nov 2nd.
  • Q: how do you find a project that wants your contribution? (or tips for what to avoid)
    • D: avoid people who are loudly mean (e.g. the Linux kernel lists).  Responsiveness beyond everything… e.g. a friendly community that took a month to fix their instructions… sat on a patch for 1-2 months.  Good sign: an active community where you can see closed pull requests (Linux/git use mailing lists instead, but those are active).
    • B: has a list of newbie-friendly bugs.
    • J: gauge on whether want to use that software or not.
    • A: bugs are best place to start. Filing a bug report tells you a lot about the maintainers, e.g. on it immediately, starting a conversation about it, you can follow the progress of the bug – see the conversations between the contributors, reminds you that there are humans behind it… “any kind of form of life”.
    • M: probably would have started on bpython (shinier ipython), because peer was really excited about it… peer recommendation, people excited about a project = project probably doesn’t suck.
  • Q: suggestions for good places to find lists of welcoming OS projects
    • OpenHatch
    • Hacktoberfest (organised by DigitalOcean) – everyone submitting 4 pull requests to projects from the list gets a free t-shirt
    • Look at the projects that OS projects include… those tools are also interesting projects.
    • Go to your bosses and ask if you can release the company software as OS.
  • Q: about your projects, features, bugs – something you’d like to share
    • M: dev environment – hard to build these. Long slog through virtual machine (e.g. fedora 2.1 was still in alpha)… lots of patience, and a new computer. Taking notes – wrote everything down, error messages etc so can do on next install, take to project maintainer as suggestions for things to go into instructions.
    • A: pymongo sometimes gets a bug that spirals out of control, and ends up being a python bug (that’s already been reported)… e.g. multiprocessing bug that took time to figure out. Getting a copy of the project is a big step towards actually contributing.
    • J: like perfect storm types of bugs, e.g. json schema had a bug… likes semantic versioning, maintaining backwards compatibility… a release was broken and put out, got bug report 6 hours after release from people in big orgs (e.g. openstack, mediawiki)… tiny detail – pip environment markers – broke the release; lots of people; doesn’t like fixing bugs until have regression test in place = pressure is on… did in 24 hours…
    • B: async in Tornado – an iterator returning awaitable objects; the Python library asyncio had a different interpretation – trying to mix them, got stack overflows endlessly trying to convert objects. Still an open issue – did a workaround, but other code will have similar problems with it.
    • D: rewrote a hash table function in git, git merge started crashing… because of the fix… the git index is also called the cache and the staging area, depending on which part of the code you’re in… created nightmares on Macs… weirdass pointer being pulled out from under code whilst still in use – only happened on a Mac on certain large merges… but the patch wasn’t accepted the first way it was written, so rewrote it a different way
  • Q: anything your OS projects want help with now?
    • M: has a bug list – look for Gnome Music getting started page. “Gnome Love Bugs” https://bugs.debian.org/cgi-bin/pkgreport.cgi?pkg=gnome-music;dist=unstable
    • A: lots of MongoDB driver things to work on… any time there’s a release, they’re looking for people to test for bugs – finding one = starting a conversation. Drivers have various levels of accessible bugs… MongoDB itself is too big a place to start in.  And MongoDB is hiring.
    • J: Twisted has tedious but beginner-friendly work; J has proof of concept projects that wrote parts he needed (e.g. docker python bindings are literal translations of command line commands –  can jump in and extend out that library), etc. code lives on github https://github.com/Julian… not heavily organised.
    • B: cockroachdb has well-organised bug list. https://github.com/cockroachdb/cockroach – can talk to B about stuff that’s not well-organised.
    • D: git doesn’t have a public bug list, but can look at unit tests and see known failures… need to ask git if they’re things that people care about.  Also e.g. “git rm” removes entire account? (is not filed yet). (all panelists are hiring!)
  • Audience questions:
  • Q: Dropbox might be a good starter project.
  • Q: Setting aside time to work on OS? A: motivated by other people – find someone interested in working on a project. Take advantage of frustration – immediately after frustration, try to work something out.
  • Q: How do you deal with ownership in companies based on OS? Ordinary employee = work for hire. Contract employee = 20-point test, but can override that in the contract. Ownership matters if you want to enforce the license – need copyrights to do this.
  • Q: licenses? Apache vs MIT vs GPL? Prefer for most things copyleft (e.g. GPL), otherwise adoption. More permissive, e.g. Apache, MIT. But use FSF-approved license, e.g. Apache, MIT or GPL.

Looking at data with Python: Matplotlib and Pandas

I like python. R and Excel have their uses too for data analysis, but I just keep coming back to Python.

One of the first things I want to do once I’ve finally wrangled a dataset out of various APIs, websites and pieces of paper, is to have a good look at what’s in it.  Two python libraries are useful here: Pandas and Matplotlib.

  • Pandas is Wes McKinney’s library for R-style dataframe (data in rows and columns) manipulation, summary and analysis.
  • Matplotlib is John D Hunter’s library for Matlab-style plots of data.

Before you start, you’ll need to type “pip install pandas” and “pip install matplotlib” in the terminal window.   It’s also convention to load the libraries into your code with these two lines:

import pandas as pd
import matplotlib.pyplot as plt

Some things in Pandas (like reading in datafiles) are wonderfully easy; others take a little longer to learn. I’ll meander through a few of them here.  I’ll be using Nepal medical shipments data as an example.

Reading in data

Pandas makes this easy. Reading in a CSV file is as simple as:

df = pd.read_csv(csvfilename) # Comma-separated file
df = pd.read_csv(csvfilename, sep='\t') # Tab-separated file

There’s also pd.read_json, pd.read_html, pd.read_sas, pd.read_stata and pd.read_sql_table to read in other data formats.  Be careful with read_html though: it only reads in html tables, and you’ll need lxml, beautifulsoup or suchlike if you want to read tables straight from a webpage.
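For example, a minimal sketch (the filename and URL here are made up for illustration, and read_html assumes you have lxml or beautifulsoup4 installed):

df = pd.read_json('shipments.json') # hypothetical JSON datafile
tables = pd.read_html('http://example.com/stats') # returns a *list* of dataframes, one per html table
first_table = tables[0]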

First-look at the dataframe

I like to know what I’m dealing with before starting analysis.  I usually use Tableau or R for this, but that’s not always possible, and Pandas is a good alternative.

df.columns # List all the column headings
df.head(4) # The first 4 rows of data, same as df.head(n=4)
df[['column1','column2', 'column3']].head(10) # Just some of the columns
df.describe() # Basic statistics for every numerical column

That tells you what your columns are, what your first few rows look like (df.tail(4) will give the last rows) and some basic statistics for numerical columns, but you’re probably more curious than that.


value_counts will tell you what’s in a single column.  If you want to know what’s in a pair or combination of columns, you’ll need to start using pivot tables or groupby.
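For example (the column names here are placeholders):

df['columnx'].value_counts() # counts of each value in a single column
df.groupby(['columnx', 'columny']).size() # row counts for each combination of two columns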

You might know pivot tables from Excel.  They’re ways of creating a new datatable whose rows, columns and content are defined by you.  This function, for example, gives you a new table whose rows are column x values, columns are column y values, and contents are the number of rows that contained those combinations of x and y values.

x_by_y = df.pivot_table(index='columnx',columns='columny', values='columnz', aggfunc='count', fill_value=0)

Column z gets involved here because if you don’t nominate a column for the values, Pandas will return an array with counts for every combination of columns. I’ve included fill_value=0 because I’m counting, and Pandas would otherwise put NaN (not a number) in the cells for combinations that never appear.

x_by_y is a data frame. You can plot this, for example:

x_by_y.head(10).plot(kind='bar', stacked=True)

You’re now using Matplotlib.  And that was quite a complex plot: a stacked bar chart, with a legend.   Note that .plot creates a plot object: if you want to *see* your plot, you need to type “plt.show()”.  This will put up a plot window and stop your code until you close the window again.
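Putting that together, a minimal plot-and-show sketch looks like:

x_by_y.head(10).plot(kind='bar', stacked=True)
plt.show() # opens the plot window; your code pauses here until you close it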

Basic data manipulation

I’ve had a look at the dataset, got some ideas for more things to look at in it, and some of them need calculations. Pandas handles this too.  More soon. Meanwhile, here’s some stuff I did with the Nepal dataset.

pivot1 = df.pivot_table(
 index='Material Hierarchy Family',
 columns='Final Recipient Name',
 values='Dollar Value',
 aggfunc='count').plot(kind='barh', stacked=True)

recipientsize = df.groupby('Final Recipient Name').size()

pivot2 = df.pivot_table(
 columns='Material Hierarchy Family',
 index='Final Recipient Name',
 values='Dollar Value',
 aggfunc='count') # aggfunc='count' is assumed here, matching pivot1 above

pivot2.plot(kind='bar', stacked=True, legend=False)



I’ve been thinking today about the singularity: the point at which machines become smarter than humans, and about an internet of things so smart that we don’t know how to manage it with our existing software paradigms.  And I wondered: a good manager will already be managing entities that are much smarter than them (because you don’t want your best thinkers doing the paperwork, management is another discipline/skill etc etc); is it perhaps time to think about how to use those management skills on clusters of machines?

Notes from meetup: data-driven design 2.0 (Data-driven architecture), 2015-08-24

Meetup: data-driven design 2.0 (Data-driven architecture), 2015-08-24


Melissa Marsh on intros and bios… 

  • “Transforming architectural practice series” = thinking differently about the process of arch: tools, practice, how they run their business (leads to thinking differently about product). 
  • Panelists showing how taken on data-led practice changes how arch does their work… incorporating different methodologies, s/m/l/xl data. 
  • Today = moving from data sources and collection to examples within projects, how to set up projects and client relationships differently.  Came out of feedback from June event.  Continuing looking at future of design relationships. 
  • Panelists: 
    • Jeff Ferzoco (linepointpath), 
    • Zak Kostura (ARUP, high performance structures – currently a form-found roof system for Mexico City)… thinking about project setup and info sharing and how it’s changing client relationships.
    • Darrick Borowski – on tools and techniques… data-driven design = ask better questions at the beginning… back and forth with Q&A. 
    • Shawn Rickenbacker – intersection of data and design; urban, architectural, interior design: opportunity to link the scales of design and learn from the scales how to apply design. “love of systems, problem-solving skills”. 
    • (Panelists all teach class: uni?)
  • Later:
    • Oct 12th: coworking and the future of architecture.
    • tues pm: measuring architecture (shift from geometric to other measures, including financial and social)… exhibit and event.  
  • Program: 10-min presentations from each panelist, then Q&A on each topic. 

Jeff Ferzoco (@zingbot): 

  • Practice: info design, mapping and experience design.  Tonight: examples of what he does with data.  Practice is almost all mapping at this point.  Looking at 3 problems and how he solved them. 
  • “hardest part about data is understanding what to do with it when you get it”
  • Old job: 8 years at regional scale, 31 NY counties, NYC not as a city, but as counties and larger areas. e.g. America2050 project, on high-speed rail in US… maps, event, work with stakeholders on what HS rail would look like (america2050.org, linepointpath.com).  Train, traffic, cultural data, produced massive-scale maps. 
  • Since then, looking only at neighbourhood and city scale issues. 
    • NYC released open data. Started working with Geonyc, Betanyc on this… data for NYC policy… 
    • 2013, Citibike called clients… opening data, called Sarah Kaufman at NYU… wanted to see what people were doing.  Dataset was 4GB for the year, 1 row per ride (1.5M rides; pickup place, drop-off place, times, casual/member, gender, zipcode of member… more data got released later).  Put into OpenRefine, looked at favourite stops…
    • Next was DOB site… has 4’ profile of all jobs done.  (http://www.nyc.gov/html/dob/html/bis/bis.shtml?).  Jean wanted to know about renovations in the city.  Had database from 2004, lots of human errors.  Pulled into map, can see where all residential renovations have been.  Very messy data.  See sweeten.com http://linepointpath.com/111242/4498744/work/sweeten-renovation-map
    • Current project: finding where all gay nightlife spots have ever been in NYC.  Went through old guidebooks (60s, 70s) about gay life, scraped them… about 7 sources, 2 archives (gay and lesbian centre) – about 1000 gay spots since 1859.
  • Citibike – automatically generated data; DOB = human-generated; gay = historic data.  
  • Need to understand the tools.  
    • Tools better over last 3 years 
    • Favorite tool is CartoDb for mapping. 
    • Take data and put through other tools, like google refine (“sophisticated pivot tables, non-destructive”), google drive.  Tools for  big datasets listed on last slide of his presentation.
    • Jeff posting presentation on the meetup.

Zak Kostura (ARUP)

  • ARUP people – lots of them, different fascinations including data.
  • Excited can get a lot of data, not know why looking at it, but by exploring it, find opportunities wouldn’t otherwise have known. 
  • Arup were doing taxi commission data on rides: 10-15 years of data collected by the commission.  Arup were handed a hard drive of data… taxi commission went to them with the data.  
  • Zak = structural engineer. “never accept data unless I know what I need it for”. Tonight, talking about 2 projects as examples of this. 
  • Need all structural systems to be sized for any event. 
  • First project was Fulton Centre – the net on the inside, alongside Jamie Carter on the engineering of it. 
    • Soft structure, takes the form that it wants to take, like a hammock.  When forces on the net change (e.g. wind), the shape changes too… in design, needed to understand all the changes possible… e.g. smoke exhausts create wind forces, the building moves, heat rises and changes it.  About 1000 panels.  Was 2007… needed algorithms they could interrogate on the fly.
    • Can stop the process of having to idealise.. can look at exact systems, not estimates, can aggregate info and not be afraid of it. 
  • Mexico City airport.  Largest roof in the world… giant X… road around it is 2.9 miles. Only 21 touchdown points, big space-frame system. 
    • Principles simple: forces on it, how much force it takes an element to yield, how much force before it buckles… do this calc for each of 1m+ elements, or approximate it (but approx = inefficiency). 
    • Started data-driven… from a database. Used to use Excel, but not think about using a database… excel not designed for e.g. fast v-lookups. 
    • Could only design because knew exactly what the forces would be on each panel. 
    • <SJT: wasn’t the Guggenheim Bilbao designed like this some time ago?>
    • Processing comes down to a pipe, calculated 3/4m times.  To do this in the timeframe, everyone on the team has to be confident interacting with the data, e.g. using the right SQL commands to do this. Currently working with fire teams, architect etc on a database. Hoping to get to a point where can hand the database over to contractors. 

Darrick Borowski (design director, ARExA)

  • Follow-up to a talk at the AIA.  This time, talking about the tools used to do projects. 
  • “The generative nature of information, or the dirty things people do with data”
  • Tools: Grasshopper, cellular models, Dijkstra’s, A*, etc.
  • Last time they were talking about material experiments, human behavior as a source of data, extracting learnings from natural systems as a data source, and cultural systems as a source of data (and particularly cities), all in Grasshopper.
  • Grasshopper is a graphical system editor.  Can model behavior in this.  Looking for the learnings that are available in our world (human, natural, cultural systems). https://en.wikipedia.org/wiki/Grasshopper_3D
  • System: data in, computational model, data out.  Use simulation in all 3 aspects of that.  
    • Data in = pounding pavements, streetcorner counters, material experiments, simulation. 
    • Model: create an algorithm to process the data and map it onto the problem you’re trying to solve.  Sometimes the data is in the natural system they’re extrapolating from.  Key thing is the mapping from what you’re trying to accomplish to the model you’re using.
    • Data out. Can do many things with this. Data in itself, e.g. 3d model.  Data for decision making, e.g. hand back to designer. Data fed back into another algorithm, adding complexity to that solution. 
  • Project: workspace layout based on connectivity that different layouts enabled. 
    • Tied into: the more frequent conversations people have, the more innovation comes out; the closer you are to other people, the more likely you are to communicate (Allen Curve).
    • Put in different design layouts, measured distance between, looked for layout with best average distance. 
    • Desks are agents; agents are sent to every other person in the room, by navigating around obstacles in the room… draw a vector and test options for routes… 
    • Different desk layouts, paths travelled… fed into excel, averaged, extracted average distance travelled.
    • Inefficiencies in building that behavior from scratch: phase 2 = pre-developed algorithms using efficiencies they couldn’t see in phase 1.  E.g. use A* etc. for routes (like Google Maps does), and looking at how slime mould aggregates. http://christinehastie.com/2015/01/collaboration-can-learn-slime-mould/
  • Project: greenhouse. 
    • Problem distributing structural nodes across a dome. Looked at phyllotaxis (e.g. how sunflowers distribute seed heads – how to pack things into a space)… extrapolated principles of the algorithm, added into Grasshopper (which scales out and repeats) – produced a structural network with equally-sized members over a surface.
    • Examples of data in a natural system serving as a computational model
  • Example: based on caloric intake of average american, land needed to create that, designed city tissues – areas of land to feed people; reaction to fuel crisis, global warming etc. 
    • Showed power of building things up in layers. Complicated algorithm: tempting to pack everything into one Grasshopper algorithm… robotics etc will build one small piece, e.g. walk, and build on that and build on that again to create the systems. 

Shawn Rickenbacker, Urban Data + Design.

  • Use Rhino and Grasshopper quite a bit for complex geometries. 
  • Example: designing a wavy, moving wall.
  • “Somehow ended up as data analysts, working with the Chinatown partnership in downtown Manhattan on how tourism was shaping their environment”.
    • One characteristic: knock-off goods; this black market enormously important part of downtown economy. 
    • Video of group of swallows… look choreographed: interested in why they do this: barometric pressure, predators etc? This is the problem of data: understanding why. 
    • Big data is about… tracking individuals… Borrowed S/M/L/XL from Kolleeny
    • R visualisation… R as a useful tool.
    • Tourism used counters (interns) on streets.  Generated heat maps from the dataset (easy to get from R; had timestamps). R still needs human input to recognize the patterns.  
  • User groups: did some work with Sony. 
    • Client understands the user… each one of the data points is a different individual.  Can’t understand the flock because you don’t understand the individual.  Buildings are reacting to the environment, e.g. wind blows – spatial element.
  • Chinatown again:
    • Distributed QR codes across Chinatown.  Borrowed idea from stores showing pictures with QR codes to purchase things instantly on the street. 
    • Client asked “what happens at night?”.  Started projecting QR codes on walls at night… Chinatown has different night and day populations. 
    • Marc Andreessen: “software is eating the world”… shakes the core of arch, which is about physicality.  SW was designed as a tool to solve physical problems.
    • Got into distributed networks… Xbee+ Arduino to create a wireless mesh network, can add sensors and cameras to these. 
      • Software eating the world: working less with physical architecture, more with non-discrete architecture (e.g. no physical component, like wireless networks). 
      • E.g. QR code as a pass to get into a Nike event.  this is convergence between the digital and physical.  “felt lame using QR codes back in 2012”.  Nike was pop-up event, large size… had huge LED basketball court projected inside… access only gained by using QR codes set up around the event. 
  • How to deal with software eating the world spatially, architecturally. 
    • Video is the best way to track the flock of birds.  Programming the Xbox Kinect, then manipulating real-time data from it.  Used Processing, Arduino, R; gestures as data compiled by the camera then processed through P/A/R; creates action = movement of swallows.
  • Kimono Labs are good place for data
    • https://www.kimonolabs.com/
    • Can go to any website, scrape data and produce an API from it. 
    • <SJT: can also do this with Google Spreadsheet!>
    • e.g. taking data from video – bus going by, using data for next time bus goes by… creating interactive.  
  • M: “If software is eating the world, architects are making that world more nutritious”? “N… maybe get a lot skinnier”. 

Q&A session with panelists:

  • Q: If dealing with large amounts of data, how do you share raw data and communicate results with other team members (consultants etc)?
    • A: Zak: Newforma, file transfer protocols; lawyers need a legal protocol for sending data, e.g. timestamped, who sent, who received: this is a barrier for collaborating over data.  E.g. the airport model can’t be done in a live collaborative fashion because strips have to be packaged and sent.  Tech: set up a database server and use that, but lose the ability to see who did what and why.  This hampers the ability to share effectively.
    • M: need a next version that takes on responsibility and ownership of data by parties, e.g. all becoming stakeholders collaboratively: share blame, but can do great design.
    • Darrick: problem with Grasshopper is they’re the only people in the team interested in behavior; everyone else wants geometry; tend to bake data into the 3d model for the rest of the team to work with; sometimes geometry, sometimes an Excel spreadsheet, sometimes an evaluation of the spreadsheet. All about shared language.
    • Jeff: use SQL to access mass amounts of taxi data; communicating, usually a layer on top, e.g. google chat. BigQuery for large storage; if smaller, then Dropbox (or Box).  
    • Shawn: AutoDesk pilot project called Dreamcapture (=AutoCad Artificial Intelligence). Future of large data affecting design isn’t as large as commerce data. Dreamcatcher premise is collaborative environment, where can see effect of team member changes.  Security: if you have a secure cloud with limited access, get round problem of waiting for save, send, send back etc.
    • M: data volumes; if you look at buildings as they’re lived in and occupied, you get to consumer-scale volumes of data. Important that archs have a thoughtfulness about that data. NB Autodesk is a sponsor of the global quantified self annual meeting.

The coolest thing about data analysis…

…is that it makes you think very hard about the world.  I’m having a short break between jobs. After 2 weeks of cycling, kayaking, camping, tidying, diy, meditation, niggling chores and piano playing, I’m finally ready to start thinking about data again.

Kaggle is a lovely thing. Not only does it encourage learning and cooperation between data scientists and create better algorithms for worthy causes, it also provides nice clean data to play with (believe me, raw data is way messier than this).  I’m currently geeking out on the San Francisco crime dataset – basically a huge CSV containing the date, category (and details), location (with police district) and outcomes of SF crimes from 2003 to 2015.

I could just throw a bunch of learning algorithms at this dataset (and, in truth, am, because that’s what Kaggle’s about) but I’m finding it interesting to think first about what I know about crime patterns from my time in the UK, whether those transfer to the US, and what else might be interesting to look for if I know more about the area.  For instance, are car crimes grouped near easy escape routes (thank you SFdata.gov for the major roads map); are there more thefts on area ‘boundaries’ (a UK study a while back showed more crimes near areas that perpetrators felt comfortable in); are there more juvenile and petty crimes near schools (also in SFdata.gov)?  I know that this is consciously testing out my own biases, but I’m on holiday and it’s an interesting thought experiment.

There’s a similar style of dataset (Taarifa’s water point challenge) on DrivenData. Some similar thinking might be interesting there too.

This is not my journey

I spent some of my Christmas break thinking about work styles: what worked last year, what didn’t, and what I could do to improve my own.  I’ve got it down to just two things: “this is not my journey” and “do what the boss asks for”.

People often talk of their jobs (and themselves) as something that they do now, as in at one particular point in time. That’s a little like saying “I’m in seat 29C” instead of “I’m flying from New York to Japan and when I get there I’m going to try out the heated toilet seats” when someone asks you where you are.  We are all on journeys – sometimes literally, but always on journeys through time, careers, relationships.  And if you want to think about your career, a journey is a useful idea.

So last year I got really frustrated because I ended up doing lots of things that needed to be done, leaving little time for the things I loved (and am good at). And at the same time, I watched people around me refusing to do those things, but doing so much better in their careers and the respect they gained for what they did. “This is not my journey” came out of that.  The big thing that I realised was that the people who were doing well were playing by big-company rules, and I was still working as though I was in a startup or community organisation.

Startups are different. Over the years I’ve started companies, helped start companies and communities, been in small companies that grew, worked in government agencies and 90,000-strong multinationals.  And I’ve noticed two transition points: around 5 and 120 people.  With 5 people, you’re completely flat, and to be honest, also flat-out: everyone does what needs to be done without “but I don’t do that”, and although you may have “roles”, you’re generally just working as a team.  That’s often the founding team: once they start hiring, the concept of staff happens, and it takes leadership not to divide into ‘founders’ and ‘others’.  Communities tend to run this way too, with everyone pitching in and helping where they can. Up to about 120 people, companies are “small”, with everyone knowing each other and helping each other out where they can.  But at about 120 people, divisions start: you can remember the names of about 100-120 people, but beyond that new people become faces unless they’re directly working with you.  At this point, people have defined job roles and a limited group of connections in the organisation, and there are just too many things that need to be done to be able to do them informally any more. And at this point (or sooner), big company rules start to apply.  And can be basically summarised as “do what the boss asks for” and “do what’s on your journey”.

“Do what the boss asks for” is obvious when you have a defined boss – in the small flat organisation (no defined “bosses”) it doesn’t make sense, but it’s absolutely the path to happiness under big-company rules (with negotiation about what’s a fair workload etc of course). It’s a simple sorting question for any new task: “is this what the boss asked for?”.

“Do what’s on your journey” covers what you do with the interstitial time: the times when you’ve done what the boss asked for (or just plain need a short break from it) and are filling-in with small jobs, training courses etc. It’s about doing the things that grow you as a person, in the directions that you want to grow and become stronger, but to do this you do need a journey: a knowledge of who you are, what you want to be, what you want to be able to do and be known for.  I spent part of Christmas working on that too. My journey is the same as it was 3, 5, 10 years ago, but now I have a clear description (thank you, social-worker sisters-in-law and “Business Model You”) of where I’m going.  My sorting question for this isn’t “do what’s on your journey” because that’s a terrible way to test anything; instead it’s “this is not my journey” – if I can say that about a non-boss task, then it’s now not on my list.

It’s already a much saner year. Apart from the odd boss-overload, my filters have kept my work both manageable and relevant to my career. I also seem to be getting a bit more respect for this: when I mentioned the plan to a Wall Street friend, she said “oh, you’re playing by the blokes’ rules” and explained that nobody respects the person who picks up odd jobs, so perhaps that shouldn’t be too much of a surprise.  What has been a surprise is that in remote organisations, the switch from small to big-company rules happens at a much, much smaller number of people (around 20, sometimes as low as 12), although again the larger distances, timezones and smaller bandwidths (as in you’re not seeing people across the office floor, and it’s hard to have watercooler conversations with everyone) should make that somewhat less surprising.

Web Scraping, part 1: files and APIs

Web scraping is extracting information from webpages, usually (but not always) as tables of data that you can save to csv files, json/xml files or databases.

Design it first, then scrape it

When you start on any piece of code, try asking yourself some design questions first; definitely do this if you’re thinking about something as potentially complex as web scraping code.   So you’ve seen a dataset hiding in a website – it might be a table of data that you need, or lists of data, or data spread across multiple pages on the site. Here are some questions for you:

1. Do you need to write scraper code at all?

    • Is the dataset very small – if you’re talking about 10 values that don’t get updated, writing a scraper will take longer than just typing all the numbers into a spreadsheet.
    • Has the site owner made this data available via an API (search for this on the site) or downloadable datafiles, either on this site or connected sites?  APIs are much much faster (and less messy) to use than writing scraper code.
    • Did someone already scrape this dataset and put it online somewhere else?  Check sites like scraperwiki.com and datahub.io.

2. What do you need to scrape?

    • What format is the data currently in?  Is it on html webpages, or in documents attached to the site?  Is it in excel or pdf files?
    • Does the data on this site get updated?  e.g. do you need to scrape this data regularly, or just once?

3. What’s your maintenance plan?

    • Where are you planning to put the data when you’ve scraped it?  How are you going to make it available to other people (this being the good and polite thing to do when you’ve scraped out some data)? How will you tell people that the data is one-off or regularly updated, and who’s responsible for doing that?
    • Who’s going to maintain your scraper code?  Website owners change their sites all the time, often breaking scraper code – who’s going to be around in a year, two years etc to fix this?

Reading files

Okay. Questions over. Let’s get down to business, working from easy to less-easy, starting with “you got to the website, there’s a file waiting for you to download and you only need to do this once”.

You’ve got the data, and it’s a CSV file

Lucky you: pretty much any visualization package and language out there can read CSV files.  You’ll still have to check (e.g. look for things like messed-up text, be suspicious if all the biggest files are the same size, etc) and clean the data (e.g. check that your date columns all contain formatted dates, that you have the right number of codes for gender – and no, it’s not always two – etc), but as far as scraping goes, you’re done here.
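If you’re working in Python, a quick sanity-check sketch with Pandas might look like this (the filename and the ‘gender’ column are assumptions for illustration):

import pandas as pd

df = pd.read_csv('downloaded_data.csv') # hypothetical filename
print(df.dtypes) # date columns that come through as plain 'object' need attention
print(df['gender'].value_counts()) # how many distinct codes do you actually have?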

You’ve got the data, and it’s a file with loads of brackets in it

Here, the file extension (the part after the last “.” in a filename) is probably “.json”.  This is a JSON file – not all data packages will read this format, so you might have to convert it to CSV (and it might not quite fit the rows-by-columns format, so you’ll have to do some work there too), but again, no scraping needed.
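If you do need to convert, a minimal sketch for simple, flat JSON (the filenames are made up) is:

import json
import pandas as pd

with open('data.json') as f: # hypothetical filename
    data = json.load(f) # now ordinary Python lists/dicts
df = pd.DataFrame(data) # works when the JSON is a list of flat records
df.to_csv('data.csv', index=False) # deeply nested JSON will need flattening first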

You’ve got the data, and it’s a file with loads of <>s in it

Either you’ve got HTML files (look for obvious things like HTML tags: <html>, <head>, <p>, <h1>, etc. and text outside the opening <name> and closing </name> brackets) or you’ve got an XML file.  Another big hint is if the file extension is “.xml”.  Like JSON, XML is read by many but not all data visualization packages, and might need converting to CSV files; a few quirks make this a little harder than converting JSON, but there’s a lot of help out there online.
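One possible sketch using Python’s standard library, assuming a simple structure with one child element per record (the filenames are made up):

import csv
import xml.etree.ElementTree as ET

root = ET.parse('data.xml').getroot() # hypothetical filename
with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for record in root: # assumes each child of the root element is one row
        writer.writerow([field.text for field in record])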

You’ve got the data and it’s a PDF file

Ah, the joys of scraping PDF files. PDF files are difficult because even though they *look* like text files on your screen, they’re not nearly as tidy as that behind the scenes.  You need a PDF converter: these take PDF files and (usually) convert them into machine-readable formats (CSV, Excel etc). This means that the 800-page PDF of data tables someone sent you isn’t necessarily the end of your plan to use that data.

First, check that your PDF can be scraped.  Open it, and try to select some of the text in it (as though you were about to cut and paste).  If you can select the text, that’s a good sign – you can probably scrape this pdf file. Try one of these:

  • If you’ve got a small, one-off PDF table, either type it in yourself or use Acrobat’s “save as tables” function
  • If you’ve got just one large PDF to scrape, try a tool like pdftables  or cometdocs
  • If you want to use open source code – and especially if you want to contribute to an open-source project for scraping PDF data, use Tabula.

If you can’t select the text, that’s bad: the PDF has probably been created as an image of the text – your best hope at this point is using OCR software to try to read it in.
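If you do end up in OCR-land and want to try it from Python, a rough sketch using the pdf2image and pytesseract libraries (which wrap the poppler and tesseract tools – both need installing separately) looks something like this; expect to spend a lot of time cleaning the output:

from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path('scanned_tables.pdf') # hypothetical filename; one image per page
text = "\n".join(pytesseract.image_to_string(page) for page in pages)
print(text) # raw OCR output – any tables will need a lot of reassembly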

Using APIs

An Application Programming Interface (API) is a piece of software that allows websites and programs to communicate with each other.

So how do you check if a site or group has an API?  Usually a Google search for their name plus “API” will do, but you might also want to try  http://api.theirsitename.com/ and http://theirsitename.com/api (APIs are often in these places on a website).  If you still can’t find an API, try contacting the group and asking them if they have one.

Using APIs without coding

APIs are often used to output datasets requested using a RESTful interface, where the dataset request is contained in the address (URL) used to ask for it.   For example, http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:2015 is a RESTful call to the World Bank API that gives you the rural population (as a percentage) of all countries for the years 2000 to 2015 (try it!).  If you’ve entered the RESTful URL and you’ve got a page with all the data that you need, you don’t have to code: just save the page to a file and use that. Note that you could also get the information on rural populations from http://data.worldbank.org/indicator/SP.RUR.TOTL.ZS but you won’t be able to use that directly in your program (although most good data pages like this will also have a “download” button somewhere too).

APIs with code

Using code to access an API means you can read data straight into a program and either process it there or combine and use it with other datasets (as a “mashup”). I’ll use Python here, but most other modern languages (PHP, Ruby etc) have functions that do the same things.  So, in Python, we’re going to use the requests library (here’s why).

import requests
worldbank_url = "http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:2015"
r = requests.get(worldbank_url)

Erm. That was it.  The variable r contains several things: information about whether your request to the API was successful or not, error codes if it isn’t, the data you requested if it is. You can quickly check for success by seeing if r.ok is equal to True; if it is, your data is in r.text; if it isn’t, then look at r.status_code and r.text, take a big deep breath and start working out why (you’ll probably see 200, 400 and 500 status codes: here’s a list to get you started).
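In code, that check looks something like this (building on the request above):

if r.ok: # True when the status code is below 400
    dataset = r.text # the raw response body (XML by default for this API)
else:
    print(r.status_code, r.text) # 4xx usually means your request, 5xx usually means their server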

Many APIs will offer a choice of formats for their output data. The World Bank API outputs its data in XML format unless you ask for either json (add “&format=json” to the end of worldbank_url) or CSV (add “&format=csv”), and it’s always worth checking for these if you don’t want to handle a site’s default format.
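For example, to ask the World Bank API for JSON and read it straight into Python structures (a small sketch – it’s worth checking the exact shape of the response yourself):

r = requests.get(worldbank_url + "&format=json")
data = r.json() # parses the JSON body into Python lists/dicts
records = data[1] # for this API, the first element is usually paging metadata, the second the records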

Sometimes you’ll need to give a website a password before you can get data from its API.  Here, you add a second parameter to requests.get, “authenticating” that it’s you using the API:

import requests
r = requests.get('https://api.github.com/user', auth=('yourgithubname', 'yourgithubpassword'))
dataset = r.text

That’s enough to get you started. The second half of this post is on scraping websites when you don’t have an API. In the meantime, please use the above to get yourself some more data.  Places you could start at include HDX and datahub.io (these are both CKAN APIs).

Ruby day 9: Local power!

OSM Changes for Typhoon Ruby

OSM Edits for Typhoon Ruby

This. Just this: local mappers made more changes to the map of the Philippines during Typhoon Ruby than anyone else in the world (by a very very big margin). Anyone who doesn’t believe in the strength of local people to build their own resilience should look very, very hard at these numbers.

Ruby’s all over now for the mappers – DHN is de-activated, everyone’s gone back to work.  There’s still a lot of work to do on the cleanup: MarkC mentioned 35000+ houses destroyed and 200000 people without shelter, and there will still be OSM mapping to do for that.  This weekend Celina’s running a “train the trainers” OSM event in Manila: if you’re one of the people who created the figures above, please please go and help spread your skills further!

Ruby day 8: Next

Ruby has gone now – “goodbye Ruby, Tuesday” is apparently becoming a popular song here.  But the cleanup work is only just starting.  Celina spends a lot of the day trying to get UAV stuff sorted out; we get word that the team is getting imagery in the worst affected areas, and she works on getting that data stored and back to the mapping groups that need it.

Response teams are moving into the field – many by boat because they can’t get flights. Requests still come in – one for an assessment of damage to communications and media stations (I suggest that Internews might have a list for this, then find that Agos is tracking communications outages). I stop with the lunchtime work and get back to the day job.

ACAPS puts out an overview map but still needs a rainfall map for it. Maning from the Philippines observatory just happens to have one (he’s been working on this for a while).

6pm: A government sit rep has data in it that needs scraping: I get out the pdf tools. Cometdocs dies on it, so I move over to PDFTables. We will slowly teach governments to release tables as datasets not pdfs. We will.  The ACAPS HIRA work starts. First, I have some pdfs to scrape. Except the sitreps from http://www.ndrrmc.gov.ph are PDFs with images of excel spreadsheets cut-and-pasted in. Start reaching out, trying to find out who has the original datafiles, but rein in when gently reminded that these things are political.