Greater Greater Washington

Posts about Data Openness

Government


DC's laws aren't yours

There's a deep, persistent, and crippling problem with the laws of DC: you can't download a copy.


Photo by PublicResource.org on Flickr.

Due to a weak contract and a variety of legal techniques, it's not possible to create better ways to read the law or download it for offline access, or even to try to do better than the crummy online portal that serves as its official source.

It also means that it's hard to discuss legal matters online, since you can't link to specific laws. This Salon.com article about David Gregory has had a broken link to the law in question since 40 minutes after it was posted, months ago.

How the law became scarce

How did this happen? It's a tricky answer of access, ownership, and contracts.

The DC Council writes and publishes bills, which are additions and subtractions to the law itself. The law is compiled by a contractor (previously WestLaw, now LexisNexis). So the contractor holds a complete copy of the law.

The contractor publishes a few different versions of the "compiled law," each of which comes with restrictions.

Unfortunately, courts have upheld these types of restrictions in the CD and website Terms of Service. They get further support from the wire fraud statute, which prosecutors used in the Aaron Swartz case to escalate charges to felonies. And in all of these versions, the contractor tries to claim copyright through compilation copyright and additional content like citations and prefaces.

In the face of these strong guards against freeing the law, the most reasonable avenue for creating a freely-accessible copy is buying and scanning the printed copies, which is exactly what some citizens are starting to do.

Why this matters

This has effects in many places. Advocacy organizations pushing for changes can't reference laws by linking to them, so they have to copy & paste relevant sections and hope that people trust their versions. Of course, when laws go out of date, these copy and pasted guides stop working.

The goal of better educating the police about laws (like the rules of the road for bicyclists) is harder. Police can't have an offline copy of the law for quick access in the field, and the online version is near-useless on smartphones.

It's also locking the DC Council into using a contractor for this purpose. DC's contracts with WestLaw and LexisNexis aren't strong enough to force the contractors to provide them with a copyright-cleaned version, so the council itself doesn't have a compiled copy of the law that they can publish by themselves if they want to take this in-house.

What's Next

This is a hard problem to unwrap and fix, and there are multiple efforts afoot.

Waldo Jaquith is building The State Decoded, an open-source system for storing and displaying state codes. It's already deployed with Virginia's laws. PublicResource.org is working on the long task of scanning and digitizing the print edition. And a group of residents is encouraging the council to write a better contract than the current one with LexisNexis, which doesn't provide for copyright-free copies.

Meanwhile, it'll be months or years until it's possible to download DC's laws onto your iPhone and clarify whether it is, indeed, legal to bike on a sidewalk (sometimes) or drink in public space (never).

Education


How school tiers match up with Walk Score

One of the best effects of open data is when people correlate data sets from very different places to generate interesting information. This graph cleverly combines DC's school quality tiers (known as "accountability categories") with Walk Score:

Sandra Moscoso wrote yesterday about how Code for DC's School Decisions Project has been gathering coders who want to use open data to help parents, students, and policymakers. This is one of the graphs they created at the recent Open Data Day using data from the Office of the State Superintendent of Education (OSSE).

I've asked to get access to the raw spreadsheet for this graph so we can look at, for example, which schools each dot represents. Here are the accountability categories by school. I will add the spreadsheet with WalkScore matched up with category when it's available. Update: here's the data as a CSV file.
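For anyone who wants to try this kind of matching themselves, here is a minimal Python sketch of joining a list of accountability categories with a list of Walk Scores and averaging by category. The column names (school, category, walk_score) and the sample rows are made up for illustration; the real CSV may be laid out differently.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical sample data; the real files' columns and values may differ.
CATEGORIES_CSV = """school,category
School A,Reward
School B,Rising
School C,Rising
School D,Reward
"""

WALK_SCORES_CSV = """school,walk_score
School A,90
School B,55
School C,45
School D,80
"""


def mean_walk_score_by_category(categories_text, walk_scores_text):
    """Join the two tables on school name and average Walk Score per category."""
    walk = {
        row["school"]: float(row["walk_score"])
        for row in csv.DictReader(io.StringIO(walk_scores_text))
    }
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(categories_text)):
        score = walk.get(row["school"])
        if score is not None:  # skip schools with no Walk Score match
            grouped[row["category"]].append(score)
    return {category: mean(scores) for category, scores in grouped.items()}


averages = mean_walk_score_by_category(CATEGORIES_CSV, WALK_SCORES_CSV)
print(averages)
```

Matching on school name is fragile when two sources spell names differently, so a shared school ID would be a safer join key if the data has one.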

A few things immediately jump out. The most successful DCPS schools have high Walk Scores, while the least successful ones mostly (but not entirely) cluster in the lower range. This may reflect the fact that a public school's success has a lot to do with the socioeconomic status of the neighborhood, and the local retail that is a big part of Walk Score locates in areas with higher incomes.

That income effect is also very pronounced in the graph Sandra posted yesterday:

That's not the case with charter schools. 3 of the 5 "reward" charters are in low-Walk Score areas (which could mean something, or just be a consequence of little data), while the "Rising" charters are basically all over the place. This may have a lot to do with the simple fact that since charters have to find and pay for their own space, they're in all manner of locations.

An interesting future step might be to correlate the school tiers with some data set about land prices or rents, or resident incomes. That could help illuminate whether charters end up locating in less-expensive areas, because they want to serve poorer residents and/or because they need cheaper land.

What do you see from looking at this data?

Bicycling


Another great Capital Bikeshare visualization

Starting at 12:06, Greater Greater Washington contributor Veronica Davis, WABA head Shane Farthing, and Arlington bike planner Chris Eatough will talk about bicycling in DC on the Kojo Nnamdi Show. Listen live or catch the archived audio once it's posted this afternoon.

They also posted this video which visualizes a few days of Capital Bikeshare trips:

This is yet another consequence of Capital Bikeshare's excellent decision to provide anonymous trip data. People have done all kinds of useful things with the data, like MV Jantzen's similar video and interactive visualization tool.

Budget


Visualize the DC budget

At the recent International Open Data Hackathon, Justin Grimes put the DC budget into a "treemap," a chart that shows a lot of items as rectangles of different sizes. This makes it very easy to understand how much money is going to different functions.


Since Justin's spreadsheet was public, I was able to make a copy to tweak a few things. I modified some of the titles to get the agency's abbreviation to the start, so that you can understand more of them in the top-level chart, and revised the color scale to one that should be more perceptible to color-blind readers.

The colors represent which categories increased or decreased in FY2013, the budget approved last year for the fiscal year we're in now. Green boxes increased, while purple boxes decreased. Categories in the DC budget sometimes grow and shrink because functions get shifted from one to another, though, so it can be tricky to really understand increase and decrease numbers without delving into the budget deeply.
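A treemap needs two numbers per box: a size and a color value. As a rough Python sketch, assuming the spreadsheet gives a prior-year and current-year amount for each line (the agency names and dollar figures here are invented), the size can be each item's share of the total and the color can be its percent change:

```python
def treemap_inputs(budget):
    """budget maps a name to (prior_year, current_year) amounts.

    Returns rows sorted by share of the current total, largest first.
    """
    total = sum(current for _, current in budget.values())
    rows = []
    for name, (prior, current) in budget.items():
        rows.append({
            "name": name,
            "share": current / total,                        # box size
            "pct_change": (current - prior) * 100 / prior,   # box color
        })
    return sorted(rows, key=lambda row: row["share"], reverse=True)


# Invented figures, purely for illustration.
sample_budget = {
    "Agency A": (100.0, 120.0),
    "Agency B": (80.0, 60.0),
    "Agency C": (20.0, 20.0),
}

rows = treemap_inputs(sample_budget)
for row in rows:
    print(row["name"], round(row["share"], 2), round(row["pct_change"], 1))
```

This does not account for functions moving between agencies, which is exactly the case where a raw percent change can mislead.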

What do you notice in the budget?

And if you make a better treemap using a tool without some of the limitations of the Google one, or make a treemap for another area jurisdiction's budget, let us know at info@ggwash.org.

Thanks to Sandra Moscoso for the tip.

Transit


WMATA might offer open data for all regional transit

WMATA planners helped STLTransit create an animation of transit across the entire Washington region. That's possible because WMATA has a single data file with all regional agencies' schedules. They hope to make that file public; that would fuel even more tools that aid the entire region.



One of the obstacles for people who want to build trip planners, analyze what areas are accessible by transit, design visualizations, or create mobile apps is that our region has a great many transit agencies, each with their own separate data files.

Want to build a tool that integrates Metrobus, Fairfax Connector, and Ride On? You have to chase down a number of separate files from different agencies in a number of different places, and not all agencies offer open data at all.

The effect is that many tool builders, especially those outside the region, don't bother to include all of our regional systems. For example, the fun tool Mapnificent, which shows you everywhere you can reach in a set time from one point by transit, only includes WMATA, DC Circulator, and ART services. That means it just won't know about some places you can reach in Fairfax, Alexandria, Montgomery, or Prince George's.

Sites like this can show data for many cities all across the world without the site's author having to do a bunch of custom work in every city, because many transit agencies release their schedules in an open file format called the General Transit Feed Specification (GTFS). Software developer Matt Caywood has been maintaining a list of which local agencies offer GTFS files as well as open real-time data.

We've made some progress. Fairfax Connector, for example, recently started offering its own GTFS feed. But while DASH has one, you have to email them for it, and there's none for Prince George's The Bus.

The best way to foster more neat tools and apps would be to have a single GTFS file that includes all systems. As it turns out, there is such a beast. WMATA already has all of the schedules for all regional systems for its own trip planner. It even creates a single GTFS file now.

Michael Eichler wrote on PlanItMetro that they give this file to the regional Transportation Planning Board for its modeling, and offered it to STLTransit, who have been making animations showing all transit in a region across a single day.

This is one of many useful ways people could use the file. How about letting others get it? Eichler writes, "We are working to make this file publicly available."

Based on the STLTransit video, WMATA's file apparently includes 5 agencies that Caywood's list says have no public GTFS files: PG's TheBus, PRTC OmniLink and OmniRide, Fairfax CUE, Frederick TransIT, and Loudoun County Transit. It also covers Laurel Connect-a-Ride, Reston LINK, Howard Transit, the UM Shuttle, and Annapolis Transit, which aren't even on that list and which most software developers might not even think to look for even if they did have available files.

Last I heard, the obstacles to the file being public included WMATA getting permission from the regional transit agencies, and some trepidation by folks inside the agency about whether they should take on the extra work to do this or would get criticized if the file has any errors.

Let's hope they can make this file public as soon as possible. Since it already exists, it should be a no-brainer. If any regional agencies or folks at WMATA don't understand why this is good for transit, a look at this video should bring it into clear focus.

Transit


What's up with NextBus, part 3: Where Ride On is the leader

Which Washington-area bus system was the first to offer its bus position data in an open standard? Would you believe it's Ride On?

In part 2, we talked about how there are many different APIs (application programming interfaces), the way one computer system, like an app, gets data from another, like bus positions from a transit agency. The fact that there are so many APIs means many apps don't include all of the types of buses in the region that have real-time positions and predictions.

Prince George's The Bus, Fairfax City CUE and the DC Circulator are available using NextBus, Inc.'s API, which is one of the most common because many agencies contract with NextBus, Inc. WMATA also contracts with NextBus, Inc. but doesn't use its API; WMATA built its own. ART has a different one entirely.

Since NextBus is the most common, some residents asked Montgomery County officials why Ride On is not part of NextBus, too. One was Evan Glass, who tweeted last May:

Why is MoCo's Ride On bus system not accessible on the #NextBus app when all other jurisdiction are? #14bus cc @hansriemer (@EvanMGlass)

Note that Glass was talking about the "NextBus DC" app, the one that died this past December and, people discovered, actually wasn't from the same company as the one that provides bus prediction services to many transit agencies.

Councilmember Hans Riemer passed the question on to Ride On officials. Carolyn Biggins replied:

Recently, our staff met with a representative of NextBus to discuss products and costs. Although NextBus has not yet given Montgomery County a firm price quote, they offered a ballpark figure of approximately $55,000 per year for operating costs. This would cover a barebones system which would only have their mobile and desktop web site along with a suite of management tools. There are also undetermined setup fees, probably starting around $15,000 but possibly much higher. ...

At this point the inclusion of NextBus into the Ride On Real Time customer information product line is actively open for discussion. Feedback from our customers and industry critics point us in various directions and toward various apps; and, interestingly, NextBus is not at the top of our customer's request list.

Besides our Eastbanc/Nerds Ride On Real Time App (available for iPhone and android) which, by the way, includes integrated real time data from Metrobus, Metrorail and several Northern Virginia jurisdictions, our customers have asked us to integrate into the "DC Metro Transit App" and "OneBusAway."

We have been working with developers for DC Metro Transit App who recently responded to us with a very encouraging post about our open data: "This seems well thought out and documented. It is also nice that you can get the data in both JSON or XML [the 2 most popular formats for getting data from APIs] in a restful service [basically, a way of making APIs easier for the app developer to use]. I'll give it a try in the app and let you know if I have any questions. You guys are ahead of the curve compared to other agencies."

As you mentioned in your recent e-letter, Open Data and public/private initiatives, such as 3rd party app development, is the wave of the future: to "disrupt and create." 3rd party app development not only unleashes the initiative of the private sector but also provides varied choices for our citizens: the delivery of information in many different formats to suit different consumers with varied needs and tastes.

In developing our Ride On Real Time system, Transit Services has taken this approach, both through internal product development but also by providing its data in as many different formats as possible while trying to maintain fiscal responsibility. We will continue to work with NextBus and other vendors to try and provide Montgomery County citizens the very best in transit information and customer service.

(Notes in brackets added.)

Biggins is right. The solution to the problem of Ride On not being part of many existing apps is not to work with any particular vendor, but to provide open data in more formats.

It's particularly good to hear this from Ride On, because at first they did it wrong, and contracted with a software developer just to build them a website where people can track buses, but with no way for 3rd party app developers (in other words, people who aren't the agency or one of its contractors) to access the data.

Following prodding from Kurt Raschke, us, and others, Ride On started offering an API, and even fairly quickly improved it based on feedback from Raschke and other developers.

Why doesn't everyone just use GTFS?

In the area of transit schedules, one standard has largely emerged as the most common, and one all transit agencies ought to offer: the General Transit Feed Specification, or GTFS. GTFS is basically a set of big files that contain every single stop location and all of the schedules for the transit system. You can download it, write code to analyze it, and then do whatever you want.
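To give a sense of what "download it and write code to analyze it" looks like, here is a short Python sketch that reads stop locations out of a GTFS zip file. The file name stops.txt and the fields stop_id, stop_name, stop_lat, and stop_lon come from the GTFS specification; the two sample stops are invented, and a real feed would be downloaded from the agency rather than built in memory.

```python
import csv
import io
import zipfile


def read_stops(gtfs_zip_bytes):
    """Return the stops listed in a GTFS feed's stops.txt as a list of dicts."""
    with zipfile.ZipFile(io.BytesIO(gtfs_zip_bytes)) as feed:
        with feed.open("stops.txt") as raw:
            # utf-8-sig tolerates the byte-order mark some feeds include.
            reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8-sig"))
            return [
                {
                    "stop_id": row["stop_id"],
                    "name": row["stop_name"],
                    "lat": float(row["stop_lat"]),
                    "lon": float(row["stop_lon"]),
                }
                for row in reader
            ]


def make_sample_feed():
    """Build a tiny in-memory feed with invented stops, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as feed:
        feed.writestr(
            "stops.txt",
            "stop_id,stop_name,stop_lat,stop_lon\n"
            "S1,Example St & 1st Ave,38.9000,-77.0300\n"
            "S2,Example St & 2nd Ave,38.9010,-77.0310\n",
        )
    return buffer.getvalue()


stops = read_stops(make_sample_feed())
print(len(stops), "stops; first is", stops[0]["name"])
```

A full feed also has files such as routes.txt, trips.txt, and stop_times.txt, which can be read the same way.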

There's an analogue of GTFS for real-time buses, called GTFS-realtime. However, real-time is not the same as schedules. With schedules, you can download the whole thing once and it basically won't change except every few months. With real-time bus tracking, the positions change every minute.

GTFS-realtime lets you download the entire set of bus positions as they constantly change. It's a huge amount of data. For some applications, like if you're making a live map showing buses, that's what you want. For the typical smartphone app, where you just want one bus position at a time, it's too much. That much data would overtax the user's data plan and burden the phone trying to deal with it all.

Other APIs, like the NextBus and WMATA APIs, work differently. With those, an app sends only the specific question it wants answered, like asking for next arrivals at a particular bus stop.

Twitter, as an analogy, has both types of APIs. For most uses, you use a more transactional API. You ask Twitter for a list of recent tweets matching a hashtag, or ask it to post a specific tweet. But Twitter also offers a "firehose" API where certain users, who have to be approved ahead of time, can get the entire stream of all tweets, everywhere.

We need GTFS-realtime AND a transactional API

Ultimately, for transit, there needs to be both. If you're building a smartphone app, it's too hard to get the firehose of all bus positions, and easier to ask one simple question. But if you're designing a real-time screen, it's a burden to ask for each possible bus and bus stop every minute; you'd rather just get all the data at once.

WMATA's API also goes through another service, called Mashery, which limits how many of these API questions you can ask in a set period of time. The intent is to keep someone from overwhelming WMATA's systems and crashing them. But when Eric Fidler was building the real-time screen demos, he found that just asking for a few bus lines at nearby bus stops every minute, his system quickly hit the limit.

Plus, since one server was running many screens at once, the more screens, the quicker you hit the limit. We kept asking WMATA to increase the limit, and they did, but for many applications these limits will quickly become untenable.
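Some back-of-the-envelope arithmetic shows why. This Python sketch assumes each screen asks one question per route per stop every minute; the daily limit of 50,000 calls is a made-up number for illustration, not WMATA's actual limit.

```python
MINUTES_PER_DAY = 60 * 24


def requests_per_day(screens, stops_per_screen, routes_per_stop, polls_per_minute=1):
    """Total API calls per day if every screen polls every route at every stop."""
    return (
        screens * stops_per_screen * routes_per_stop
        * polls_per_minute * MINUTES_PER_DAY
    )


def max_screens(daily_limit, stops_per_screen, routes_per_stop, polls_per_minute=1):
    """How many screens one API key can serve before hitting the daily limit."""
    per_screen = requests_per_day(1, stops_per_screen, routes_per_stop, polls_per_minute)
    return daily_limit // per_screen


HYPOTHETICAL_DAILY_LIMIT = 50_000  # assumption for illustration only

one_screen = requests_per_day(screens=1, stops_per_screen=3, routes_per_stop=4)
supported = max_screens(HYPOTHETICAL_DAILY_LIMIT, stops_per_screen=3, routes_per_stop=4)
print(one_screen, "calls per day for one screen;", supported, "screens fit under the limit")
```

With these assumed numbers, a single screen watching 4 routes at each of 3 stops makes 17,280 calls a day, so only 2 screens fit under the limit. A single GTFS-realtime download per minute would be 1,440 requests a day no matter how many screens share it.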

Every transit agency ought to provide GTFS-realtime feeds for those that need them. ART's vendor, Connexionz, now also offers it, making 2 area agencies that do. Others should join Ride On and ART and offer this feed as well. Often it will be the agency's API contractor that offers it; agencies that pay NextBus for bus tracking services should require NextBus to offer a GTFS-realtime feed.

What's the common transactional API?

At the same time, we need a transactional API, ideally a common one. If everyone used the same API, it would be really easy for app developers to support all of the region's (or the nation's or world's) bus systems.

Unfortunately, there is no consensus here, unlike with GTFS. Most APIs are nonstandard ones an agency's IT staff or its contractor devised. New York uses the European standard SIRI, but had to make some changes of its own, and few US agencies use that. NextBus's is pretty widespread since that company serves a lot of agencies.

What to do? There are a few solutions.

First of all, everyone could get together and try to coalesce around an existing standard. It doesn't really matter which. It doesn't have to be the best one. Most standards are pretty imperfect; we type on QWERTY keyboards, which are one of the least efficient keyboard layouts you could devise, but any effort to come up with something else has failed. There's a strong lock-in, but to some extent, it doesn't really matter; we manage to type fine.

We could use SIRI; Europe does. Or NextBus could make their API a standard. Google did this with GTFS. Google initially created GTFS, but then they stopped controlling it and let the community of developers and agencies take control. They changed the "G" to stand for "General" instead of "Google." Many standards in computing started out as some company's property, until the company transferred them to a national or international committee to shepherd.

If NextBus wanted to do this, they would probably want to give it a different trademark, so an agency offering the API wouldn't be saying they offer "NextBus" (we've had enough problems with NextBus trademark confusion already). And they would need to let other agencies and developers make changes, through some process, without the company having control.

Another approach would be to not worry about this at all. It's not all that hard to write some code to interact with multiple APIs, as long as they have a few features that you need to make them interoperable, like common identifiers. In the next part, we'll talk about this.
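As a rough illustration of how little code that can take, here is a Python sketch that converts responses from two different APIs into one common shape. Both response layouts and the agency names are hypothetical; they are not the actual NextBus or WMATA formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arrival:
    """One predicted arrival, in a shape shared across agencies."""
    agency: str
    stop_id: str
    route: str
    minutes: int


def from_agency_a(payload):
    # Hypothetical layout: {"stop": "1001", "predictions": [{"route": "S2", "min": 4}]}
    return [
        Arrival("agency_a", payload["stop"], p["route"], int(p["min"]))
        for p in payload["predictions"]
    ]


def from_agency_b(payload):
    # Hypothetical layout: {"StopID": "77", "Arrivals": [{"RouteName": "10", "Seconds": 420}]}
    return [
        Arrival("agency_b", payload["StopID"], a["RouteName"], a["Seconds"] // 60)
        for a in payload["Arrivals"]
    ]


def merged_arrivals(*batches):
    """Combine arrivals from any number of agencies, soonest first."""
    return sorted(
        (arrival for batch in batches for arrival in batch),
        key=lambda arrival: arrival.minutes,
    )


sample_a = {"stop": "1001", "predictions": [{"route": "S2", "min": 9}, {"route": "S4", "min": 2}]}
sample_b = {"StopID": "77", "Arrivals": [{"RouteName": "10", "Seconds": 420}]}

combined = merged_arrivals(from_agency_a(sample_a), from_agency_b(sample_b))
for arrival in combined:
    print(arrival.agency, arrival.route, arrival.minutes)
```

The per-agency functions are the only part that changes when another system is added, but the stop and route identifiers still have to mean the same thing across agencies for the merged list to be useful.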

Some other company or entity could also set up an intermediary computer system that takes in all of the data on one end, and lets app developers connect to it. It would have to get the "firehose" style data from the agencies, and can then even offer 5, 10, or 50 different styles of APIs on the other.
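In rough outline, such an intermediary keeps the latest full snapshot in memory and answers small questions from it. The Python sketch below uses plain dictionaries with made-up field names to stand in for the feed; real GTFS-realtime data is in a binary format and would need to be decoded first.

```python
from collections import defaultdict


class ArrivalIndex:
    """Holds the latest full snapshot and answers per-stop questions from it."""

    def __init__(self):
        self._by_stop = {}

    def load_snapshot(self, predictions):
        """Replace the index with a new full snapshot (the "firehose" side)."""
        by_stop = defaultdict(list)
        for prediction in predictions:
            by_stop[prediction["stop_id"]].append(prediction)
        for items in by_stop.values():
            items.sort(key=lambda p: p["minutes"])
        self._by_stop = dict(by_stop)

    def next_arrivals(self, stop_id, limit=3):
        """Answer one narrow question (the "transactional" side)."""
        return self._by_stop.get(stop_id, [])[:limit]


# Invented snapshot for illustration.
snapshot = [
    {"stop_id": "1001", "route": "S2", "minutes": 12},
    {"stop_id": "1001", "route": "S4", "minutes": 3},
    {"stop_id": "2002", "route": "10", "minutes": 6},
]

index = ArrivalIndex()
index.load_snapshot(snapshot)
print(index.next_arrivals("1001"))
```

The agencies only have to serve one download per refresh, while the intermediary absorbs however many small questions the apps ask.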

What has to happen for that to be possible? For one, someone has to maintain it and pay for the bandwidth. An organization like COG, or a partnership of the DC, Maryland, and Virginia state DOTs, could do it. Or, to go national, a group like APTA or a federal agency could provide it. Or, perhaps some private entity would find it worthwhile, though the amount of revenue they could make is probably limited.

But for that to happen, the agencies have to offer the "firehose" of GTFS-realtime. For that reason, while there isn't consensus around all of the APIs, our region's transit agencies can and should take one step now, to offer GTFS-realtime, as Ride On and ART now do.

Bicycling


Where are people riding CaBi?

MV Jantzen has created another one of his interactive visualization tools, this time for Capital Bikeshare's public trip data. The tool lets you see where people ride to and from a particular station:

The tool looks at the trips in the third quarter of 2012. Click on any station to see what other stations people ride to and from, as lines of varying thickness. Click on any line to see how many trips there were between that pair of stations, and how "unbalanced" the trips are (whether the number of trips in one direction is close to that in the other, or if one direction dominates).
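If you want to compute something similar from the trip data yourself, here is a small Python sketch. It treats each trip as a (start station, end station) pair and uses one possible definition of imbalance: the difference between the two directions divided by the total. That definition and the station names are assumptions, not necessarily what Jantzen's tool uses.

```python
from collections import Counter


def pair_balance(trips):
    """Summarize trips between each pair of stations, ignoring direction.

    trips is an iterable of (start_station, end_station) tuples.
    Imbalance is 0.0 when both directions are equal and 1.0 when
    every trip goes the same way.
    """
    directed = Counter(trips)
    pairs = {}
    for start, end in directed:
        if start == end:
            continue  # round trips to the same dock have no direction
        a, b = sorted((start, end))
        if (a, b) in pairs:
            continue  # already handled from the opposite direction
        a_to_b = directed[(a, b)]
        b_to_a = directed[(b, a)]
        total = a_to_b + b_to_a
        pairs[(a, b)] = {
            "total": total,
            "imbalance": abs(a_to_b - b_to_a) / total,
        }
    return pairs


# Invented trips for illustration.
sample_trips = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("A", "C")] * 2 + [("C", "A")] * 2
    + [("A", "A")]
)

balance = pair_balance(sample_trips)
for pair, stats in sorted(balance.items()):
    print(pair, stats)
```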

This isn't Jantzen's first interactive tool. He made one earlier this year for Metro ridership patterns. Around the same time, graduate student Rahul Nair made a tool for CaBi that has a lot in common with Jantzen's new one.

What do you notice with the tool?

Transit


Watch the patterns of Metro ridership

As a Metro train rolls along the tracks, who gets on and off? Where are they going? You can't read minds, but thanks to Metro's ridership data, you can watch patterns of riders on a typical train in a great new tool.


Morning peak riders at Union Station on a Shady Grove-bound Red Line train.
Image from RidingMetro.com.

When public agencies release data sets, people can do all kinds of fascinating things with them. Yesterday, Matt Johnson used the Metro ridership data to show us which stations are busiest (with more to come), and Aaron Wiener looked at the most popular trips on different lines.

Reader Graham MacDonald sent along this interactive tool he created, RidingMetro.com. Pick a train line, a direction, and a time of day, click play, and see a simulated train pick up and drop off passengers.

At each stop, the symbol for the train gets larger or smaller as the number of passengers on board changes. Meanwhile, circles at other stations on the map show where the passengers on the train are going.
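The arithmetic behind the growing and shrinking train symbol is a running total: at each stop, add the riders who board and subtract those who get off. Here is a minimal Python sketch with invented station names and counts, not Metro's actual figures or the tool's actual code.

```python
def passenger_loads(stops):
    """Return the number of riders on board after each stop.

    stops is a list of (station, boardings, alightings) tuples in travel order.
    """
    on_board = 0
    loads = []
    for station, boardings, alightings in stops:
        on_board += boardings - alightings
        loads.append((station, on_board))
    return loads


# Invented numbers for illustration.
sample_run = [
    ("Station A", 40, 0),
    ("Station B", 25, 5),
    ("Station C", 10, 30),
    ("Station D", 0, 40),
]

loads = passenger_loads(sample_run)
for station, on_board in loads:
    print(station, on_board)
```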

Look below the map, and bar graphs show how the ridership of trains at this particular stop compare to equivalent stops along other lines.

It's all aggregate data showing a typical train (total numbers of riders along segments of the lines, not one actual train), but you can almost imagine the riders on board a train all going to their many destinations.

What interesting patterns do you notice from playing with this tool?

CC BY-NC