Put open source to work
July 16–17, 2018: Training & Tutorials
July 18–19, 2018: Conference
Portland, OR

Deconstructing the US Patent Database

Van Lindberg (Python Software Foundation)
4:15pm–4:55pm Wednesday, July 18, 2018
Artificial intelligence
Location: Portland 251
Level: Intermediate
Average rating: 3.50 (2 ratings)

Who is this presentation for?

  • Data scientists, natural language researchers, and lawyers

Prerequisite knowledge

  • Familiarity with the general terms associated with natural language processing (corpus, n-grams, etc.) and common concepts and tools used in language-oriented machine learning (word2vec, doc2vec, gensim, nltk, TensorFlow, RNN, CNN, VAE, etc.)

What you'll learn

  • Explore the US Patent Database, a huge, focused, and underappreciated source of knowledge


The US Patent Database is a huge, focused, and underappreciated source of knowledge, and a large, technically focused corpus. Join Van Lindberg to see how applying natural language processing to USPTO patent data can teach us about the state of technology and, along the way, help the patent database fulfill its mission of being a freely available source of technical knowledge.

Topics include:

  • The USPTO as a data source: Previously, the easiest way to get the full text of a patent was to scrape it from the Patent Office site. Now, what took months takes minutes. Van explores a number of preprocessed collections of patent data ready to be used.
  • Technological word vectors: Many language-oriented machine learning applications use word2vec, fastText, or a similar embedding of word context into a multidimensional space. We have all seen the “magic” of word vector algebra: king – man + woman = queen. But most word vectors have been generated using web crawl or news data. Van shows how things change when the word vectors are based on a large corpus of technology-related material. What does “computer – keyboard + phone” equal in an algebra defined in a technological space?
  • The web of technology: Improved techniques (and improved computing capacity) allow us to categorize and draw connections across the entire set of patents issued since 1976. Doing so gives a unique look at the “web of technology” defined by the connections that people (and computers) make between different documents. Van demonstrates that it’s possible to see clear “clusters” of technologies and the linkages between them. A different type of “technology web” can be created by going the other way: looking at common associations between parts of different technologies. It is possible to build up a probabilistic view of “what is in a computer” or “what is in a car” based upon the connections made by millions of inventors.
  • What’s next?: This just scratches the surface of what we can do. What happens when we set up a variational autoencoder to generate patentable concepts? How about having the computer generate patent text or patent drawings? Can we get to the point where we take the human out of the loop?
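
For readers unfamiliar with the word vector algebra mentioned under "Technological word vectors," here is a minimal sketch of how such a query is typically answered: add and subtract the vectors, then pick the remaining word whose vector is closest by cosine similarity. The three-dimensional vectors and the names TOY_VECTORS, cosine, and analogy are hypothetical and hand-made for illustration; they do not come from the talk or from any patent corpus.

```python
import math

# Hypothetical toy vectors, invented for this example only.
# Dimensions (loosely): royalty, gender (+ male / - female), fruit.
TOY_VECTORS = {
    "king":  [0.9,  0.9, 0.0],
    "queen": [0.9, -0.9, 0.0],
    "man":   [0.1,  0.9, 0.0],
    "woman": [0.1, -0.9, 0.0],
    "apple": [0.0,  0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative).

    The query words themselves are excluded from the candidates,
    as is conventional for this kind of analogy lookup.
    """
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    exclude = set(positive) | set(negative)
    candidates = (w for w in vectors if w not in exclude)
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# king - man + woman
result = analogy(TOY_VECTORS, positive=["king", "woman"], negative=["man"])
print(result)
```

With a real model, gensim's KeyedVectors.most_similar(positive=[...], negative=[...]) performs a comparable lookup. What a query such as computer – keyboard + phone returns depends entirely on the corpus the vectors were trained on, which is the point of training on patent text.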

Van Lindberg

Python Software Foundation

Van Lindberg is an open source and intellectual property lawyer based out of San Antonio. Van’s professional work focuses on the intersection of technology and law, with particular expertise in the area of open source. Over his career, he has helped businesses with everything from open source compliance to business strategy and represents companies in high-stakes IP litigation and inter partes review proceedings before the Patent Trial and Appeal Board. Van has represented companies on Capitol Hill, before Congress, and in industry associations; has led teams through successful mergers and acquisitions and restructurings; and has organized employee agreements to create greater employee satisfaction and promote higher compliance with internal policies.

Van is a regular speaker on everything from community dynamics to graph theory and has testified in Congressional proceedings as an expert on both copyright and encryption policy. In 2012, he was named one of “America’s top 12 techiest attorneys” by the American Bar Association Journal. He is the author of Intellectual Property and Open Source.