Greater Greater Washington

Posts about Data Openness


How school tiers match up with Walk Score

One of the best effects of open data is when people correlate data sets from very different places to generate interesting information. This graph cleverly combines DC's school quality tiers (known as "accountability categories") with Walk Score:

Sandra Moscoso wrote yesterday about how Code for DC's School Decisions Project has been gathering coders who want to use open data to help parents, students, and policymakers. This is one of the graphs they created at the recent Open Data Day using data from the Office of the State Superintendent of Education (OSSE).

I've asked to get access to the raw spreadsheet for this graph so we can look at, for example, which schools each dot represents. Here are the accountability categories by school. I will add the spreadsheet with WalkScore matched up with category when it's available. Update: here's the data as a CSV file.

A few things immediately jump out. The most successful DCPS schools have high Walk Scores, while the least successful ones mostly (but not entirely) cluster in the lower range. This may reflect the fact that a public school's success has a lot to do with the socioeconomic status of the neighborhood, and the local retail that is a big part of Walk Score locates in areas with higher incomes.

That income effect is also very pronounced in the graph Sandra posted yesterday:

That's not the case with charter schools. 3 of the 5 "reward" charters are in low-Walk Score areas (which could mean something, or just be a consequence of little data), while the "Rising" charters are basically all over the place. This may have a lot to do with the simple fact that since charters have to find and pay for their own space, they're in all manner of locations.

An interesting future step might be to correlate the school tiers with some data set about land prices or rents, or resident incomes. That could help illuminate whether charters end up locating in less-expensive areas, because they want to serve poorer residents and/or because they need cheaper land.
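For anyone who wants to try that kind of correlation, here is a minimal sketch in Python of joining two small data sets on school name and summarizing one variable by tier. The school names and numbers are made-up placeholders, not the OSSE or Walk Score data, and the function name is just for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows; real data would come from the OSSE category list
# and a Walk Score (or income, or rent) lookup for each school.
tiers = {"School A": "Reward", "School B": "Rising", "School C": "Rising", "School D": "Reward"}
walk_scores = {"School A": 92, "School B": 61, "School C": 48, "School D": 85, "School E": 70}


def median_by_tier(tiers, scores):
    """Group scores by tier, skipping schools missing from either table."""
    grouped = defaultdict(list)
    for school, tier in tiers.items():
        if school in scores:
            grouped[tier].append(scores[school])
    return {tier: median(values) for tier, values in grouped.items()}


summary = median_by_tier(tiers, walk_scores)
for tier, value in sorted(summary.items()):
    print(tier, value)
```

Swapping the second dictionary for neighborhood incomes or rents would give the comparison suggested above, as long as both tables share an identifier for each school.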

What do you see from looking at this data?


Another great Capital Bikeshare visualization

Starting at 12:06, Greater Greater Washington contributor Veronica Davis, WABA head Shane Farthing, and Arlington bike planner Chris Eatough will talk about bicycling in DC on the Kojo Nnamdi Show. Listen live or catch the archived audio once it's posted this afternoon.

They also posted this video which visualizes a few days of Capital Bikeshare trips:

This is yet another consequence of Capital Bikeshare's excellent decision to provide anonymous trip data. People have done all kinds of useful things with the data, like MV Jantzen's similar video and interactive visualization tool.


Visualize the DC budget

At the recent International Open Data Hackathon, Justin Grimes put the DC budget into a "treemap," a chart that shows a lot of items as rectangles of different sizes. This makes it very easy to understand how much money is going to different functions.


Since Justin's spreadsheet was public, I was able to make a copy to tweak a few things. I modified some of the titles to get the agency's abbreviation to the start, so that you can understand more of them in the top-level chart, and revised the color scale to one that should be more perceptible to color-blind readers.

The colors represent which categories increased or decreased in FY2013, the budget approved last year for the fiscal year we're in now. Green boxes increased more, while purple boxes decreased. Keep in mind that categories in the DC budget sometimes grow and shrink because functions get shifted from one to another, so it can be tricky to really understand increase and decrease numbers without delving into the budget deeply.
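To make the treemap idea concrete, here is a rough Python sketch of the simplest possible layout, where each item gets a strip whose width is proportional to its value, plus a color rule for increases and decreases. The budget lines are invented, the function names are hypothetical, and real treemap tools use fancier layouts than this one.

```python
def slice_layout(items, width, height):
    """Split a width x height rectangle into vertical strips sized by value.

    items is a list of (label, value) pairs; returns (label, x, y, w, h) tuples.
    """
    total = sum(value for _, value in items)
    rects, x = [], 0.0
    for label, value in items:
        w = width * value / total
        rects.append((label, x, 0.0, w, height))
        x += w
    return rects


def change_color(previous, current):
    """Green for growth, purple for a cut, gray for no change."""
    if current > previous:
        return "green"
    if current < previous:
        return "purple"
    return "gray"


# Made-up budget lines in millions of dollars, for illustration only.
budget = [("Agency A", 600), ("Agency B", 300), ("Agency C", 100)]
rects = slice_layout(budget, width=100, height=50)
for rect in rects:
    print(rect)
```

Because every strip has the same height, area is proportional to value, which is the property that makes a treemap readable at a glance.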

What do you notice in the budget?

And if you make a better treemap using a tool without some of the limitations of the Google one, or make a treemap for another area jurisdiction's budget, let us know at

Thanks to Sandra Moscoso for the tip.


WMATA might offer open data for all regional transit

WMATA planners helped STLTransit create an animation of transit across the entire Washington region. That's possible because WMATA has a single data file with all regional agencies' schedules. They hope to make that file public; that would fuel even more tools that aid the entire region.

Click full screen and HD to see the most detail.

One of the obstacles for people who want to build trip planners, analyze what areas are accessible by transit, design visualizations, or create mobile apps is that our region has a great many transit agencies, each with their own separate data files.

Want to build a tool that integrates Metrobus, Fairfax Connector, and Ride On? You have to chase down a number of separate files from different agencies in a number of different places, and not all agencies offer open data at all.

The effect is that many tool builders, especially those outside the region, don't bother to include all of our regional systems. For example, the fun tool Mapnificent, which shows you everywhere you can reach in a set time from one point by transit, only includes WMATA, DC Circulator, and ART services. That means it just won't know about some places you can reach in Fairfax, Alexandria, Montgomery, or Prince George's.

Sites like this can show data for many cities all across the world without the site's author having to do a bunch of custom work in every city, because many transit agencies release their schedules in an open file format called the General Transit Feed Specification (GTFS). Software developer Matt Caywood has been maintaining a list of which local agencies offer GTFS files as well as open real-time data.

We've made some progress. Fairfax Connector, for example, recently started offering its own GTFS feed. But while DASH has one, you have to email them for it, and there's none for Prince George's The Bus.

The best way to foster more neat tools and apps would be to have a single GTFS file that includes all systems. As it turns out, there is such a beast. WMATA already has all of the schedules for all regional systems for its own trip planner. It even creates a single GTFS file now.

Michael Eichler wrote on PlanItMetro that they give this file to the regional Transportation Planning Board for its modeling, and offered it to STLTransit, who have been making animations showing all transit in a region across a single day.

This is one of many useful ways people could use the file. How about letting others get it? Eichler writes, "We are working to make this file publicly available."

Based on the STLTransit video, WMATA's file apparently includes 5 agencies that Caywood's list says have no public GTFS files: PG's TheBus, PRTC OmniLink and OmniRide, Fairfax CUE, Frederick TransIT, and Loudoun County Transit. It also covers Laurel Connect-a-Ride, Reston LINK, Howard Transit, the UM Shuttle, and Annapolis Transit, which aren't even on that list. Most software developers might not think to look for those systems even if they did have files available.

Last I heard, the obstacles to the file being public included WMATA getting permission from the regional transit agencies, and some trepidation by folks inside the agency about whether they should take on the extra work to do this or would get criticized if the file has any errors.

Let's hope they can make this file public as soon as possible. Since it already exists, it should be a no-brainer. If any regional agencies or folks at WMATA don't understand why this is good for transit, a look at this video should bring it into clear focus.


What's up with NextBus, part 3: Where Ride On is the leader

Which Washington-area bus system was the first to offer its bus position data in an open standard? Would you believe it's Ride On?

In part 2, we talked about how there are many different APIs (application programming interfaces), the way one computer system, like an app, gets data from another, like bus positions from a transit agency. The fact that there are so many APIs means many apps don't include all of the types of buses in the region that have real-time positions and predictions.

Prince George's The Bus, Fairfax City CUE and the DC Circulator are available using NextBus, Inc.'s API, which is one of the most common because many agencies contract with NextBus, Inc. WMATA also contracts with NextBus, Inc. but doesn't use its API; WMATA built its own. ART has a different one entirely.

Since NextBus is the most common, some residents asked Montgomery County officials why Ride On is not part of NextBus, too. One was Evan Glass, who tweeted last May:

Why is MoCo's Ride On bus system not accessible on the #NextBus app when all other jurisdiction are? #14bus cc @hansriemer (@EvanMGlass)

Note that Glass was talking about the "NextBus DC" app, the one that died this past December and, people discovered, actually wasn't from the same company as the one that provides bus prediction services to many transit agencies.

Councilmember Hans Riemer passed the question on to Ride On officials. Carolyn Biggins replied:

Recently, our staff met with a representative of NextBus to discuss products and costs. Although NextBus has not yet given Montgomery County a firm price quote, they offered a ballpark figure of approximately $55,000 per year for operating costs. This would cover a barebones system which would only have their mobile and desktop web site along with a suite of management tools. There are also undetermined setup fees, probably starting around $15,000 but possibly much higher. ...

At this point the inclusion of NextBus into the Ride On Real Time customer information product line is actively open for discussion. Feedback from our customers and industry critics point us in various directions and toward various apps; and, interestingly, NextBus is not at the top of our customer's request list.

Besides our Eastbanc/Nerds Ride On Real Time App (available for iPhone and android) which, by the way, includes integrated real time data from Metrobus, Metrorail and several Northern Virginia jurisdictions, our customers have asked us to integrate into the "DC Metro Transit App" and "OneBusAway."

We have been working with developers for DC Metro Transit App who recently responded to us with a very encouraging post about our open data: "This seems well thought out and documented. It is also nice that you can get the data in both JSON or XML [the 2 most popular formats for getting data from APIs] in a restful service [basically, a way of making APIs easier for the app developer to use]. I'll give it a try in the app and let you know if I have any questions. You guys are ahead of the curve compared to other agencies."

As you mentioned in your recent e-letter, Open Data and public/private initiatives, such as 3rd party app development, is the wave of the future to "disrupt and create." 3rd party app development not only unleashes the initiative of the private sector but also provides varied choices for our citizens: the delivery of information in many different formats to suit different consumers with varied needs and tastes.

In developing our Ride On Real Time system, Transit Services has taken this approach, both through internal product development but also by providing its data in as many different formats as possible while trying to maintain fiscal responsibility. We will continue to work with NextBus and other vendors to try and provide Montgomery County citizens the very best in transit information and customer service.

(Notes in brackets added.)

Biggins is right. The solution to the problem of Ride On not being part of many existing apps is not to work with any particular vendor, but to provide open data in more formats.

It's particularly good to hear this from Ride On, because at first they did it wrong, and contracted with a software developer just to build them a website where people can track buses, but with no way for 3rd party app developers (in other words, people who aren't the agency or one of its contractors) to access the data.

Following prodding from Kurt Raschke, us, and others, Ride On started offering an API, and even fairly quickly improved it based on feedback from Raschke and other developers.

Why doesn't everyone just use GTFS?

In the area of transit schedules, one standard has largely emerged as the most common, and one all transit agencies ought to offer: the General Transit Feed Specification, or GTFS. GTFS is basically a set of big files that contain every single stop location and all of the schedules for the transit system. You can download it, write code to analyze it, and then do whatever you want.
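As a small illustration of "download it and write code to analyze it," here is a sketch that reads a GTFS stops file with Python's standard csv module. A GTFS feed is a zip of plain CSV text files, and stops.txt uses the columns stop_id, stop_name, stop_lat, and stop_lon; the sample rows below are invented, and load_stops is a hypothetical helper name.

```python
import csv
import io

# A tiny stand-in for stops.txt; the stop IDs, names, and coordinates are invented.
STOPS_TXT = """stop_id,stop_name,stop_lat,stop_lon
1001,Example St & 1st Ave,38.9000,-77.0300
1002,Example St & 2nd Ave,38.9010,-77.0315
"""


def load_stops(text):
    """Return a dict mapping stop_id to (name, latitude, longitude)."""
    stops = {}
    for row in csv.DictReader(io.StringIO(text)):
        stops[row["stop_id"]] = (
            row["stop_name"],
            float(row["stop_lat"]),
            float(row["stop_lon"]),
        )
    return stops


stops = load_stops(STOPS_TXT)
print(len(stops), "stops loaded")
```

With a real feed, the same function would take the text of the stops.txt file extracted from the agency's zip.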

There's an analogue of GTFS for real-time buses, called GTFS-realtime. However, real-time is not the same as schedules. With schedules, you can download the whole thing once and it basically won't change except every few months. With real-time bus tracking, the positions change every minute.

GTFS-realtime lets you download the entire set of bus positions as they constantly change. It's a huge amount of data. For some applications, like if you're making a live map showing buses, that's what you want. For the typical smartphone app, where you just want one bus position at a time, it's too much. That much data would overtax the user's data plan and burden the phone trying to deal with it all.

Other APIs, like the NextBus and WMATA APIs, work differently. With those, an app sends only the very specific question it wants answered, like asking for the next arrivals at a particular bus stop.

Twitter, as an analogy, has both types of APIs. For most uses, you use a more transactional API. You ask Twitter for a list of recent tweets matching a hashtag, or ask it to post a specific tweet. But Twitter also offers a "firehose" API where certain users, who have to be approved ahead of time, can get the entire stream of all tweets, everywhere.

We need GTFS-realtime AND a transactional API

Ultimately, for transit, there needs to be both. If you're building a smartphone app, it's too hard to get the firehose of all bus positions, and easier to ask one simple question. But if you're designing a real-time screen, it's a burden to ask for each possible bus and bus stop every minute; you'd rather just get all the data at once.

WMATA's API also goes through another service, called Mashery, which limits how many of these API questions you can ask in a set period of time. The intent is to keep someone from overwhelming WMATA's systems and crashing them. But when Eric Fidler was building the real-time screen demos, he found that just asking for a few bus lines at nearby bus stops every minute, his system quickly hit the limit.

Plus, since one server was running many screens at once, the more screens, the quicker you hit the limit. We kept asking WMATA to increase the limit, and they did, but for many applications these limits will quickly become untenable.
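Some back-of-the-envelope arithmetic shows why. The sketch below counts transactional requests for a set of screens that each poll a few routes at a few stops every minute. Every number in it is hypothetical, including the daily limit, and the function names are only for illustration.

```python
def requests_per_day(screens, stops_per_screen, routes_per_stop, polls_per_minute=1):
    """Count transactional API calls if every screen asks about every route at every stop."""
    per_minute = screens * stops_per_screen * routes_per_stop * polls_per_minute
    return per_minute * 60 * 24


def screens_supported(daily_limit, stops_per_screen, routes_per_stop, polls_per_minute=1):
    """Largest number of screens that stays within a daily request limit."""
    per_screen = requests_per_day(1, stops_per_screen, routes_per_stop, polls_per_minute)
    return daily_limit // per_screen


# All numbers below are hypothetical assumptions, not WMATA's actual limit.
print(requests_per_day(screens=1, stops_per_screen=4, routes_per_stop=3))
print(screens_supported(daily_limit=50_000, stops_per_screen=4, routes_per_stop=3))
```

A single feed with every bus position, downloaded once a minute, would instead cost 1,440 requests a day no matter how many screens share it.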

Every transit agency ought to provide GTFS-realtime feeds for those that need them. ART's vendor, Connexionz, now also offers it, making 2 area agencies that do. Others should join Ride On and ART and offer this feed as well. Often it will be the agency's API contractor that offers it; agencies that pay NextBus for bus tracking services should require NextBus to offer a GTFS-realtime feed.

What's the common transactional API?

At the same time, we need a transactional API, ideally a common one. If everyone used the same API, it would be really easy for app developers to support all of the region's (or the nation's or world's) bus systems.

Unfortunately, there is no consensus here, unlike with GTFS. Most APIs are nonstandard ones an agency's IT staff or its contractor devised. New York uses the European standard SIRI, but had to make some changes of its own, and few US agencies use that. NextBus's is pretty widespread since that company serves a lot of agencies.

What to do? There are a few solutions.

First of all, everyone could get together and try to coalesce around an existing standard. It doesn't really matter which. It doesn't have to be the best one. Most standards are pretty imperfect; we type on QWERTY keyboards, which are one of the least efficient keyboard layouts you could devise, but any effort to come up with something else has failed. There's a strong lock-in, but to some extent, it doesn't really matter; we manage to type fine.

We could use SIRI; Europe does. Or NextBus could make their API a standard. Google did this with GTFS. Google initially created GTFS, but then they stopped controlling it and let the community of developers and agencies take control. They changed the "G" to stand for "General" instead of "Google." Many standards in computing started out as some company's property, and the company later transferred them to a national or international committee to shepherd.

If NextBus wanted to do this, they would probably want to give it a different trademark, so an agency offering the API wouldn't be saying they offer "NextBus" (we've had enough problems with NextBus trademark confusion already). And they would need to let other agencies and developers make changes, through some process, without the company having control.

Another approach would be to not worry about this at all. It's not all that hard to write some code to interact with multiple APIs, as long as they have a few features that you need to make them interoperable, like common identifiers. In the next part, we'll talk about this.
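As a taste of what that code might look like, here is a sketch with two small adapters that translate differently shaped responses into one common record. Both response formats and all field names are invented for illustration and don't describe any real agency's API. The key assumption is that both feeds identify the stop with the same shared ID.

```python
from dataclasses import dataclass


@dataclass
class Arrival:
    stop_id: str   # shared stop identifier, assumed common to both feeds
    route: str
    minutes: int


def from_feed_a(payload):
    """Adapter for a hypothetical API that reports arrivals in seconds."""
    return [
        Arrival(item["stop"], item["line"], item["eta_seconds"] // 60)
        for item in payload["arrivals"]
    ]


def from_feed_b(payload):
    """Adapter for a second hypothetical API that reports arrivals in minutes."""
    return [
        Arrival(payload["stop_code"], item["route_name"], item["eta_min"])
        for item in payload["results"]
    ]


# Invented responses from two imaginary agencies serving the same stop.
feed_a = {"arrivals": [{"stop": "1001", "line": "A1", "eta_seconds": 420}]}
feed_b = {"stop_code": "1001", "results": [{"route_name": "B2", "eta_min": 3}]}

arrivals = sorted(from_feed_a(feed_a) + from_feed_b(feed_b), key=lambda a: a.minutes)
for arrival in arrivals:
    print(arrival)
```

Without the shared stop ID, the app would have no reliable way to know that the two feeds are describing the same corner.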

Some other company or entity could also set up an intermediary computer system that takes in all of the data on one end and lets app developers connect to it on the other. It would have to get the "firehose" style data from the agencies, and could then offer 5, 10, or 50 different styles of APIs to developers.
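Here is a toy sketch of such an intermediary: it ingests a full snapshot of predictions and then answers narrow, transactional-style questions from memory. Real GTFS-realtime feeds are encoded as Protocol Buffers, so the plain dictionaries below stand in for already-decoded messages, and the class name, method names, and data are all hypothetical.

```python
from collections import defaultdict


class Intermediary:
    """Toy in-memory relay: ingest full snapshots, answer narrow questions."""

    def __init__(self):
        self._by_stop = defaultdict(list)

    def ingest(self, snapshot):
        """Replace all stored data with a fresh snapshot of every prediction."""
        by_stop = defaultdict(list)
        for record in snapshot:
            by_stop[record["stop_id"]].append((record["minutes"], record["route"]))
        self._by_stop = by_stop

    def next_arrivals(self, stop_id, limit=3):
        """Transactional-style query: the soonest arrivals at one stop."""
        return sorted(self._by_stop.get(stop_id, []))[:limit]


# Invented snapshot; a real one would be decoded from each agency's feed.
relay = Intermediary()
relay.ingest([
    {"stop_id": "1001", "route": "A1", "minutes": 7},
    {"stop_id": "1001", "route": "B2", "minutes": 3},
    {"stop_id": "1002", "route": "A1", "minutes": 12},
])
print(relay.next_arrivals("1001"))
```

The agencies only have to publish one feed each, and the intermediary absorbs the cost of answering many small questions.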

What has to happen for that to be possible? For one, someone has to maintain it and pay for the bandwidth. An organization like COG, or a partnership of the DC, Maryland, and Virginia state DOTs, could do it. Or, to go national, a group like APTA or a federal agency could provide it. Or, perhaps some private entity would find it worthwhile, though the amount of revenue they could make is probably limited.

But for that to happen, the agencies have to offer the "firehose" of GTFS-realtime. For that reason, while there isn't consensus around all of the APIs, our region's transit agencies can and should take one step now, to offer GTFS-realtime, as Ride On and ART now do.


Where are people riding CaBi?

MV Jantzen has created another one of his interactive visualization tools, this time for Capital Bikeshare's public trip data. The tool lets you see where people ride to and from a particular station:

The tool looks at the trips in the third quarter of 2012. Click on any station to see what other stations people ride to and from, as lines of varying thickness. Click on any line to see how many trips there were between that pair of stations, and how "unbalanced" the trips are (whether the number of trips in one direction is close to that in the other, or if one direction dominates).
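For the curious, here is a sketch of how one might compute the trip count and an "unbalanced" measure for a pair of stations from a list of trips. The imbalance formula is an assumption for illustration, not necessarily the one Jantzen's tool uses, and the station names and counts are invented.

```python
from collections import Counter


def pair_stats(trips, a, b):
    """Return (total trips, imbalance) between stations a and b.

    Imbalance here is |a->b - b->a| / total: 0.0 means perfectly balanced,
    1.0 means every trip ran in one direction. This definition is an
    assumption, not necessarily the one the tool uses.
    """
    counts = Counter(trips)
    forward, backward = counts[(a, b)], counts[(b, a)]
    total = forward + backward
    imbalance = abs(forward - backward) / total if total else 0.0
    return total, imbalance


# Invented (start, end) station pairs standing in for rows of the trip data.
trips = [("Station X", "Station Y")] * 30 + [("Station Y", "Station X")] * 10
print(pair_stats(trips, "Station X", "Station Y"))
```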

This isn't Jantzen's first interactive tool. He made one earlier this year for Metro ridership patterns. Around the same time, graduate student Rahul Nair made a tool for CaBi that has a lot in common with Jantzen's new one.

What do you notice with the tool?


Watch the patterns of Metro ridership

As a Metro train rolls along the tracks, who gets on and off? Where are they going? You can't read minds, but thanks to Metro's ridership data, you can watch patterns of riders on a typical train in a great new tool.

Morning peak riders at Union Station on a Shady Grove-bound Red Line train.
Image from

When public agencies release data sets, people can do all kinds of fascinating things with them. Yesterday, Matt Johnson used the Metro ridership data to show us which stations are busiest (with more to come), and Aaron Wiener looked at the most popular trips on different lines.

Reader Graham MacDonald sent along this interactive tool he created. Pick a train line, a direction, and a time of day, click play, and see a simulated train pick up and drop off passengers.

At each stop, the symbol for the train gets larger or smaller as the number of passengers on board changes. Meanwhile, circles at other stations on the map show where the passengers on the train are going.

Look below the map, and bar graphs show how the ridership of trains at this particular stop compares to equivalent stops along other lines.

It's all aggregate data showing the total numbers of riders on a typical train along segments of the lines, not one actual train, but you can almost imagine the riders on board a train all going to their many destinations.
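The arithmetic behind a simulated train like this is a running total: riders on board after a stop equal riders on board before it, plus those who get on, minus those who get off. Here is a minimal sketch with invented averages (not Metro's figures) and a hypothetical function name.

```python
from itertools import accumulate


def onboard_after_each_stop(boardings, alightings):
    """Running passenger load after each stop, given per-stop ons and offs."""
    net = [on - off for on, off in zip(boardings, alightings)]
    return list(accumulate(net))


# Invented averages for a five-stop trip, not Metro's figures.
stops = ["Stop 1", "Stop 2", "Stop 3", "Stop 4", "Stop 5"]
boardings = [120, 80, 40, 10, 0]
alightings = [0, 10, 60, 90, 90]

loads = onboard_after_each_stop(boardings, alightings)
for stop, load in zip(stops, loads):
    print(stop, load)
```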

What interesting patterns do you notice from playing with this tool?


Which Metro stations are busiest?

Thanks to data from Metro's planning department, we have the ability to analyze many different ridership patterns. Today, let's take a look at stations, and see which are the busiest.


During the weekday morning rush period, many people are entering the Metro system to get to work. The busiest stations for entering customers fall all across the region.

Here's a table of the top 10:

Metro AM Peak period entries: Top 10 stations
Rank  Station            Avg. entries
1     Union Station      9,711.7
3     Shady Grove        9,557.4
4     West Falls Church  6,816.1
6     New Carrollton     6,320.9
8     Silver Spring      6,026.7
10    Pentagon City      5,714.9

Half of these stations are end-of-line stations with large park-and-ride lots. Pentagon and West Falls Church are both major bus hubs, as is Silver Spring. Union Station, of course, is at the top because it's where many commuter rail riders enter the Metro system.

The entries at these 10 stations account for 30.7% of all entries during the AM peak across the system.

And where are these riders going? The busiest stations for exits are all in the region's core. Here's the top 10:

Metro AM Peak period exits: Top 10 stations
Rank  Station           Avg. exits
1     Farragut North    16,573.7
2     Farragut West     15,497.7
3     Metro Center      15,358.6
4     L'Enfant Plaza    13,143.5
5     Union Station     12,029.7
6     McPherson Square  11,185.4
7     Gallery Place     10,682.5
8     Foggy Bottom      10,529.9

Of all the people who exit the Metro system during the morning peak period, 50.3% of them exit at one of the top 10 stations. These 10 stations account for more exits than all the other stations combined, with 118,757 people exiting these stations on average each morning.

Also of note, the 2 Farragut Square stations combined handle more than twice as many exits as the third place station, Metro Center. Without the objection of the National Park Service, the Farragut stations would have been one station, and a crowded one at that.
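These claims are easy to check with a few lines of Python, using only the figures quoted above. The systemwide total isn't given directly, so the last step backs it out from the top 10 count and its share, which makes it an approximation.

```python
farragut_north = 16_573.7
farragut_west = 15_497.7
metro_center = 15_358.6

# Combined exits at the two Farragut stations versus third-place Metro Center.
combined = farragut_north + farragut_west
print(round(combined, 1), round(combined / metro_center, 2))

# The top 10 stations' 118,757 exits are 50.3% of the total, so the
# systemwide AM peak total is implied rather than stated.
top10_exits = 118_757
top10_share = 0.503
implied_total = top10_exits / top10_share
print(round(implied_total))
```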

Afternoon rush

We can see similar patterns during the evening rush hour.

The top 10 evening entry stations are all in the regional core, with just one, Rosslyn, outside downtown Washington. The only station in the AM peak top 10 exit list that is not in the evening entry list is Pentagon (which is 13th place). It's been replaced by Smithsonian (which is 14th in the AM exits list).

The top 10 entry stations for the PM peak represent 45.7% of all PM peak entries systemwide, a slightly smaller share than the share of the top 10 morning exit stations.

Metro PM Peak period entries: Top 10 stations
Rank  Station           Avg. entries
1     Farragut North    15,948.4
2     Metro Center      15,675.7
3     Farragut West     13,594.5
4     L'Enfant Plaza    13,196.7
5     Union Station     12,563.9
6     Gallery Place     12,089.8
7     Foggy Bottom      11,099.5
8     McPherson Square  9,830.1

And where are these evening commuters headed?

Metro PM Peak period exits: Top 10 stations
Rank  Station            Avg. exits
1     Union Station      11,587.7
3     Shady Grove        8,320.5
4     Pentagon City      7,636.7
5     Gallery Place      6,985.8
6     West Falls Church  6,555.5
7     Dupont Circle      6,282.5
9     Silver Spring      5,782.3
10    New Carrollton     5,645.5

The evening exits top 10 looks a lot like the morning entries top 10. But Huntington and Franconia-Springfield, which are the #7 and #9 top entry stations in the morning, have dropped to #12 and #11, respectively. In their place are 2 central stations, Gallery Place and Dupont Circle.

This difference can probably be attributed to the entertainment venues and restaurants near these stations. Dupont Circle and Gallery Place are known for their nightlife opportunities, and passengers headed there probably drive the numbers up a bit.

The top 10 PM peak exit stations account for 28.3% of all exiting passengers systemwide on average.


The time between the morning and evening rush hours is what Metro calls the midday period. It's probably marked not just by people running errands or going to lunch, but also by workers who commute slightly later in the morning or earlier in the afternoon than most or who have jobs that don't have 9-5 hours.

Metro midday period entries and exits: Top 10 stations
Rank  Entry station   Avg. entries  Exit station    Avg. exits
1     Union Station   6,209.5       Union Station   7,114.5
2     Metro Center    5,003.6       Metro Center    7,085.3
3     Gallery Place   4,419.5       Gallery Place   6,151.8
4     Foggy Bottom    4,311.3       Farragut North  5,866.7
5     Farragut North  4,308.0       Smithsonian     5,135.9
6     Dupont Circle   3,776.0       Foggy Bottom    4,812.2
7     L'Enfant Plaza  3,721.1       Farragut West   4,488.9
8     Farragut West   3,572.9       L'Enfant Plaza  4,076.9
9     Pentagon City   3,532.5       Dupont Circle   4,055.2
10    Rosslyn         3,437.5       Pentagon City   3,781.6

I think the fact that the top 3 midday entry stations are the same as the top 3 exit stations is interesting. Union Station makes a lot of sense, considering its role as an intermodal hub. The reasons for Gallery Place and Metro Center are less clear. Keep in mind that people changing trains aren't counted; only people leaving or entering the faregates appear in these numbers.

Additionally, 9 stations are in both lists. Rosslyn, #10 in the midday entries list, does not appear in the exits list because it has fallen to #12. Instead, Smithsonian appears in 5th place on the exits list. This is probably because many people (especially tourists) are headed to see the monuments or museums in the vicinity. Few are leaving the Mall area yet, though, perhaps accounting for Smithsonian's absence from the top entry stations list (it's 16th).
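Comparing two top 10 lists like this is a natural job for set operations. The short Python sketch below uses the station names from the midday table to find which stations appear in both lists and which appear in only one.

```python
midday_entries = ["Union Station", "Metro Center", "Gallery Place", "Foggy Bottom",
                  "Farragut North", "Dupont Circle", "L'Enfant Plaza", "Farragut West",
                  "Pentagon City", "Rosslyn"]
midday_exits = ["Union Station", "Metro Center", "Gallery Place", "Farragut North",
                "Smithsonian", "Foggy Bottom", "Farragut West", "L'Enfant Plaza",
                "Dupont Circle", "Pentagon City"]

# Set intersection and difference give the overlap and the one-list-only stations.
in_both = set(midday_entries) & set(midday_exits)
entries_only = set(midday_entries) - set(midday_exits)
exits_only = set(midday_exits) - set(midday_entries)

print(len(in_both), sorted(entries_only), sorted(exits_only))
```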


The period after the PM rush is the evening period. Note that these numbers do not include the average ridership for the after midnight service provided on Fridays.

Metro evening period entries and exits: Top 10 stations
Rank  Entry station     Avg. entries  Exit station      Avg. exits
1     Gallery Place     7,489.0       Dupont Circle     2,884.3
2     Metro Center      5,897.4       Gallery Place     2,803.5
3     Foggy Bottom      4,533.8       Columbia Heights  2,772.5
4     Farragut North    4,523.3       Pentagon City     2,512.6
5     Union Station     4,126.5       Silver Spring     2,493.6
6     Dupont Circle     3,963.4       Shady Grove       2,349.8
7     Farragut West     3,875.4       Vienna            2,261.1
8     Navy Yard         3,494.1       Rosslyn           2,163.7
9     Pentagon City     2,519.4       Union Station     2,034.9
10    McPherson Square  2,345.8       Fort Totten       1,969.5

As expected, Gallery Place and Dupont Circle, major nightlife areas, appear in both the evening entry and exit top 10. Most of the other entry stations are in the core. Navy Yard comes in at number 8, perhaps due to Nats games during May, when the data were collected.

Shady Grove, Vienna, and Silver Spring are all major suburban hubs, and their presence in the top 10 exit list isn't surprising. Columbia Heights and Fort Totten are both stations that haven't appeared in other top 10 counts, so their inclusion is somewhat surprising.

What surprises you about these numbers?


Post-turkey video: That's a lot of dots

Using GTFS data, STLTransit has created videos showing all of the transit vehicles in a city over one day. Here's Washington's.

Via PlanItMetro.

The video shows one dot for each scheduled Metrorail, Metrobus and Circulator vehicle. View the video in full screen (click the rectangular icon in the lower right of the video) to more clearly see the trains, which the video shows in a color corresponding to their line.
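How STLTransit actually draws its dots isn't described here, but one plausible approach is to take the scheduled times at two consecutive stops from the GTFS stop_times.txt file and interpolate the vehicle's position in between. The sketch below does that with invented coordinates and times; the function name is hypothetical, and straight-line motion at constant speed is a simplifying assumption.

```python
def position_at(t, depart, arrive, start, end):
    """Linearly interpolate a vehicle's (lat, lon) between two scheduled stops.

    depart and arrive are seconds after midnight; start and end are (lat, lon).
    Straight-line motion at constant speed is a simplifying assumption.
    """
    if t <= depart:
        return start
    if t >= arrive:
        return end
    fraction = (t - depart) / (arrive - depart)
    return (
        start[0] + fraction * (end[0] - start[0]),
        start[1] + fraction * (end[1] - start[1]),
    )


# Invented stop coordinates and times, for illustration only.
point = position_at(t=30_030, depart=30_000, arrive=30_120,
                    start=(38.90, -77.03), end=(38.92, -77.03))
print(point)
```

Repeating this for every trip in the feed at each frame's timestamp would produce one dot per scheduled vehicle.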
