Government Legislative Data, The Other Great White Fail Whale & How To Avoid It

Data: Big Data
Location: B118-119

In 2009, a new majority at the New York State Senate, in line with Open Government mandates issued by the White House, began an Open Senate initiative. A suite of tools was envisioned, designed and deployed, creating an unprecedented level of transparency in a legislative body historically known for its closed doors. Open Legislation was one of the first tools released, allowing public consumption of legislative data in human-readable, state legal and machine-readable formats.

In its initial stages, Open Legislation was designed and maintained by a single developer who was stretched thin across other tasks. His job was to develop iterations quickly, with the option to fail publicly. The technology stack at this time consisted of the Tomcat servlet engine, a MySQL database accessed through DataNucleus, Lucene for full-text search, and OSCache, all hosted on an Amazon EC2 instance. This succeeded in opening the door to legislative data, but the service was plagued by downtime, inefficiency and data quality issues. Just nine months later it was clear that a second look was needed.

To offer faster load times and eliminate the issues causing downtime, the controller for the front end was redesigned in coordination with an intern from Rensselaer Polytechnic Institute. Taking into consideration that Open Legislation ultimately serves infrequently updated documents, we sought to remove the bottleneck created by our database. With Lucene already in place, we began loading the search index with pre-serialized versions of documents in both JSON and XML. That way, when our API was hit, we simply pulled up the document and returned the latest version of the data. Without expensive, time-consuming queries to the database, uptime and performance greatly improved, but in being forced to “eat our own dog food” we discovered substantial issues with data quality.
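The idea of serializing once at index time and serving the stored text on every request can be sketched as follows. This is a minimal illustration in Python, not the project's actual Lucene code: the `PreSerializedIndex` class, the in-memory dictionary standing in for the search index, and the sample bill fields are all assumptions made up for the example.

```python
import json
from xml.etree import ElementTree as ET


class PreSerializedIndex:
    """Hypothetical in-memory stand-in for a search index that stores
    ready-to-serve documents alongside their identifiers."""

    def __init__(self):
        self._store = {}  # (doc_id, fmt) -> serialized string

    def index(self, doc_id, fields):
        # Serialize once, at index time, in every format the API offers.
        self._store[(doc_id, "json")] = json.dumps(fields, sort_keys=True)
        root = ET.Element("bill", id=doc_id)
        for name, value in sorted(fields.items()):
            ET.SubElement(root, name).text = str(value)
        self._store[(doc_id, "xml")] = ET.tostring(root, encoding="unicode")

    def fetch(self, doc_id, fmt="json"):
        # Request time: a plain lookup, with no database query and no
        # serialization work. Returns None for unknown documents.
        return self._store.get((doc_id, fmt))


# Made-up sample data, for illustration only.
index = PreSerializedIndex()
index.index("S1234-2011", {"sponsor": "SMITH", "title": "An act to amend the education law"})
json_body = index.fetch("S1234-2011")
xml_body = index.fetch("S1234-2011", "xml")
```

The trade-off is extra storage and work at index time in exchange for cheap reads, which suits documents that are updated infrequently.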

Sponsors were missing, bills were fragmented, actions (bill statuses) were duplicated, votes and committee agendas were completely missing, and so on. We had given our service a better face, but on the inside we were still plagued by latent issues stemming from the speedy creation of the database, and in some cases from the quality of the data we were receiving. There was an option to reprocess data after tweaking our parsers, but with DataNucleus still our primary backup, such a process could have taken a great deal of time and offered no guarantee of fixing similar issues in the future. Following the same line of thought that led us to remove the database from our front end, we now considered how we could remove it from our application altogether.

The data feed used to generate documents for Open Legislation was our final place to look. Data is typically sent in a fixed-format file (SOBI) that contains references to many documents. For the past two years we had been using SOBI documents to update our database, but we didn’t have a transportable document that could be easily generated or read, just pieces of documents spread across ten to twenty files. What we created to replace our heavily coupled data stream and database is a flat file hierarchy of JSON documents organized by session year and type. Decoupling this process gave us a huge advantage when parsing issues were discovered: instead of using a process that took hours or days to complete, we could now completely reprocess 2.5 years’ worth of legislative data within a matter of minutes.
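A flat file hierarchy keyed by session year and document type might look like the sketch below. The directory layout (`<root>/<year>/<type>/<id>.json`), the function names, and the sample documents are assumptions chosen for illustration; the abstract does not describe the project's actual layout.

```python
import json
import tempfile
from pathlib import Path


def document_path(root, year, doc_type, doc_id):
    # Assumed layout: <root>/<session year>/<document type>/<id>.json
    return Path(root) / str(year) / doc_type / f"{doc_id}.json"


def write_document(root, year, doc_type, doc_id, data):
    path = document_path(root, year, doc_type, doc_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data, indent=2, sort_keys=True), encoding="utf-8")
    return path


def iter_documents(root, year=None, doc_type=None):
    # Walk the hierarchy, optionally narrowed to one year and/or one type.
    pattern = f"{year or '*'}/{doc_type or '*'}/*.json"
    for path in sorted(Path(root).glob(pattern)):
        yield path, json.loads(path.read_text(encoding="utf-8"))


# Made-up sample documents written to a temporary directory.
root = tempfile.mkdtemp()
write_document(root, 2011, "bill", "S1234", {"sponsor": "SMITH"})
write_document(root, 2011, "calendar", "floor-20110601", {"bills": ["S1234"]})
bills_2011 = [data for _, data in iter_documents(root, year=2011, doc_type="bill")]
```

Because each document is a self-contained file, reprocessing after a parser fix amounts to regenerating files and re-reading the tree, with no database migration involved.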

Since deploying this new process, our issues are beginning to transition from “what are we doing wrong?” to “what’s broken in our data feed?” In its current iteration our service is still hosted on an EC2 instance with Apache Tomcat, but it is now fully powered by Lucene and uses Varnish as a caching solution.

The final result of this work is the internal rebuilding of trust. It was easy enough to include a beta warning on the application about data quality issues, but beyond serving external users, a major purpose of the application is to provide readily accessible data for Senate-specific applications. Those who could best scrutinize what we were offering found gaping flaws, and in some places where integration had already begun, steps were taken back. With these new enhancements, though, we are finally offering a consistently accurate data set. With a solid application in place, we now have time to look forward to what’s next. The future of this project includes “diffing” the generated JSON documents to see how legislative data is built and maintained, adding new data types, and continuing efforts to make this open source application more of a framework and less NYSS-specific.
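As a rough idea of what “diffing” two versions of a generated JSON document could involve, here is a minimal sketch that compares top-level fields only. The `diff_documents` function and the sample bill versions are hypothetical; the abstract names diffing as future work and does not say how it would be implemented.

```python
def diff_documents(old, new):
    """Return top-level field changes between two versions of a document."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        if key not in old:
            changes[key] = ("added", new[key])
        elif key not in new:
            changes[key] = ("removed", old[key])
        elif old[key] != new[key]:
            changes[key] = ("changed", old[key], new[key])
    return changes


# Made-up versions of the same bill, before and after a feed update.
before = {"title": "An act", "sponsor": "SMITH", "actions": ["referred to committee"]}
after = {
    "title": "An act",
    "sponsor": "SMITH",
    "actions": ["referred to committee", "reported"],
    "votes": [],
}
changes = diff_documents(before, after)
```

A real implementation would likely need to recurse into nested structures such as action lists, but even a shallow comparison shows which fields a given update touched.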

Jared Williams

New York State Senate

Jared is part of the Open Government movement started by the New York State Senate. He began as an intern in February of 2010 and was hired on in August of that year. He graduated with a BS in Computer Science from SUNY Albany.


Noel Hidalgo

World Economic Forum

Noel Hidalgo works at the intersection of politics, community, technology and art. An Eagle Scout and an advocate for free and open government data, open source software, open communities, free culture, Dutch utility bicycles and transparent government, he is an established global leader in progressive political and technology communities. He is a member of the Royal Society of Arts (UK), a member of the British Council’s Transatlantic Network 2020, and on the board of advisors for Digital Democracy, a non-profit in New York City. He is one of three co-organizers of New York’s only meetup dedicated to open government, the Open New York Forum.

Graylin Kim

New York State Senate

Graylin is a recent graduate of Rensselaer Polytechnic Institute and a member of the Rensselaer Center for Open Source Software at RPI. He officially joined the development team at the NY Senate CIO in June of 2011 and has been working informally with them since the summer of 2010.

Comments on this page are now closed.


Sheeri K. Cabral
09/04/2011 10:03pm PDT

Video for this talk can be found at

Berend Tober
07/27/2011 6:07am PDT

Decent presentation (both in describing the problem being addressed and in explaining how the project addresses the problem) about an interesting and useful application domain.