• Item-Item Collaborative Filtering Workshop

    Recommendation systems are in use everywhere, yet I found it difficult to find helpful material on implementing them. This workshop aims to make item-item collaborative filtering easy to both understand and apply. It will consist of 30 minutes of theory and one hour of practice. The practice will be hands-on using Jupyter notebooks, so please install Anaconda ahead of time.
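
    To give a flavor of the technique ahead of the workshop, here is a minimal sketch of item-item collaborative filtering. It is not the workshop's material: the ratings, names, and the choice of plain cosine similarity are all assumptions made for illustration. Each item is represented by the ratings users gave it, items are compared with cosine similarity, and a user's unknown rating is estimated as a similarity-weighted average of that user's known ratings.

```python
import math

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ana": {"alien": 5, "blade_runner": 4, "casablanca": 1},
    "ben": {"alien": 4, "blade_runner": 5},
    "cy": {"alien": 1, "casablanca": 5},
    "dee": {"blade_runner": 1, "casablanca": 4},
}


def item_vectors(ratings):
    """Invert user -> item ratings into item -> {user: rating}."""
    items = {}
    for user, user_ratings in ratings.items():
        for item, rating in user_ratings.items():
            items.setdefault(item, {})[user] = rating
    return items


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(a[user] * b[user] for user in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(ratings, user, target_item):
    """Similarity-weighted average of the user's ratings on other items."""
    items = item_vectors(ratings)
    numerator = denominator = 0.0
    for item, rating in ratings[user].items():
        similarity = cosine(items[target_item], items[item])
        numerator += similarity * rating
        denominator += abs(similarity)
    return numerator / denominator if denominator else None


items = item_vectors(ratings)
sim_alien_blade = cosine(items["alien"], items["blade_runner"])
sim_alien_casablanca = cosine(items["alien"], items["casablanca"])
ben_casablanca = predict(ratings, "ben", "casablanca")
```

    With non-negative ratings, a weighted average like this always falls between the user's lowest and highest known ratings, which is one reason practical systems often mean-center ratings first.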

    About the presenter:

    Merlyn is an experienced software engineer working at New Relic. He studied machine learning in Berlin where he also worked as a research assistant on SVMs. He’s particularly keen on reinforcement learning and on implementing AIs that compete in simulated environments. Merlyn regularly does “ML katas” with folks inside and outside of work—which you’re welcome to join. He also always has a side project going that he’ll gladly discuss with you.


  • Shiny Night: Tutorial and Use Cases

    Let’s have a Shiny night! If you’re not familiar with it, Shiny is a web application framework for R. Many people use Shiny to convey business or research findings to collaborators, managers, and others without requiring them to be able to code. There are some really cool examples online.

    We’re going to have a brief tutorial for those who haven’t used Shiny, as well as some Shiny use cases to see what you can do with Shiny in the real world.


    • Winston Saunders @winstononenergy - intro to Shiny

    • Jessica Minnier @datapointier - speaking on - Slides

    • Kinga Farkas - CUSUM Anomaly Detection App Using Shiny (presentation slides)

    • John Smith @smithjd - Rise and shine. Shiny as an everyday tool

  • Data Dogs - The cycle of Data

    Our discussions will be targeted around four distinct areas of the data cycle:

    Cycle 1: Data Origins – Sources, generation and ownership

    Cycle 2: Data Growth – Science, assembly and storage

    Cycle 3: Data Maturity – Appeal, wisdom and power

    Cycle 4: Data Destruction – Burial, deletion and ceremony

    Here is our Handout

    While prior attendance is not required to participate, please watch or read a link of interest below:

    1. The Point of Collection

    2. The Illusion of Agency

    3. Origin problems at Google

    4. Stories VS Statistics

    5. Random or Systematic Error?

  • Object Oriented Data Science in Python

    A lot of focus in the data science community is on reducing the complexity and time involved in data gathering, cleaning, and organization. In this talk Sev Leonard will discuss how object-oriented design techniques from software engineering can be used to reduce coding overhead and create robust, reusable data acquisition and cleaning systems. Sev will provide an overview of object-oriented design and walk through an example of using these techniques to get and clean data from a web API in Python. You can find the Jupyter Notebook for this talk on GitHub. This talk is based on Sev’s recent ODSC article.
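
    As a rough illustration of the idea (not Sev's actual code), the sketch below uses an abstract base class to fix the fetch-then-clean workflow once, so that each new data source only has to say how to fetch and how to clean a record. The class names, the record fields, and the in-memory records standing in for a web API response are all hypothetical.

```python
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Shared workflow: fetch raw records, then clean each one."""

    def load(self):
        return [self.clean(record) for record in self.fetch()]

    @abstractmethod
    def fetch(self):
        """Return an iterable of raw records."""

    @abstractmethod
    def clean(self, record):
        """Return a tidy version of one raw record."""


class CampgroundSource(DataSource):
    """Hypothetical source; raw_records stands in for a web API response."""

    def __init__(self, raw_records):
        self.raw_records = raw_records

    def fetch(self):
        return self.raw_records

    def clean(self, record):
        # Normalize the name and convert the coordinate from text to float.
        return {
            "name": record["FacilityName"].strip().title(),
            "latitude": float(record["Latitude"]),
        }


# Made-up records with the kind of mess a real API might return.
raw = [
    {"FacilityName": "  lost lake ", "Latitude": "45.49"},
    {"FacilityName": "TRILLIUM LAKE", "Latitude": "45.27"},
]
campgrounds = CampgroundSource(raw).load()
```

    Adding another source then means writing one more subclass, while `load` and any code built on it stay unchanged.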

    For a more in-depth treatment of the subject, see Sev’s tutorial on object-oriented data pipelines from PyCon 2016 on YouTube and GitHub.

    If you want to run the sample notebook, you’ll need to set it up with an API key for RIDB, which you can get at:


    Bio: Sev Leonard is a Python developer and sciencer of data who works as a consultant, writer, and trainer. He’s been working with data for 10+ years in high-volume circuit design, targeted advertising, and data-driven product development.

    Connect: The Data Scout (Sev’s company) Twitter LinkedIn

  • The Foibles of Organizational Data Management

    Real-world business data systems are often a mess. With customer and other data spread across multiple silos and complex technology stacks, it can be very difficult to get a complete picture of how data can fulfill the needs an organization has. With different teams (IT, marketing, executives, etc.) having divergent needs, it can get even messier. Tonight we are going to talk about some of the foibles faced by organizations, and ways to overcome them.

    Bio: Lars serves as the Customer Data and Insights Lead for Connective DX (formerly ISITE Design), a Portland-based digital experience agency. With over a decade’s experience generating actionable insights from interesting data, he manages the development of customer-centric data and analytics measurement strategies for a variety of organizations. Lars’s experience spans B2B and B2C, and his clients have included Honda, Autodesk, Coca Cola, Kraft, Xerox, OHSU and Wharton School of Business.

    Connect with Lars:

    • LinkedIn:

    • Twitter:

    • Company:

    Background reading:






    Presented by Lars von Sneidern, Customer Data and Insights Lead for Connective DX

  • Collaborative coding with GitHub and RStudio server

    Andrew Bray will discuss collaborative coding with GitHub and RStudio Server. He is on the faculty at Reed College here in Portland.

  • Unsupervised Politics, Machine Learning & R

    Speaker: Winston Saunders

    Abstract: Word vectors, derived by deep learning algorithms applied to billions of words of text, provide powerful semantic models of language. R code demonstrating [queen] + [man] - [woman] ~ [king] to about 90% accuracy will be reviewed. Building first on an exploratory “bag of words” analysis of Presidential debate texts, we’ll use pre-computed GloVe vectors (Pennington et al.) to explore relationships like [sanders] + [trump] - [clinton] ~ [cruz] and how candidate positions align with rhetorical sentiment like [government] + [people] - [tax]. This analysis is a work in progress. We’ll also test empirical limits (aka failed experiments).
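
    The word-vector arithmetic in the abstract can be sketched as follows. The talk's code is in R and uses real GloVe vectors; this Python version is only an illustration, and its tiny three-dimensional vectors are invented so that the analogy works by construction. The query vector is built by adding and subtracting word vectors, and the answer is the remaining vocabulary word with the highest cosine similarity to it.

```python
import math

# Made-up toy vectors (NOT GloVe); dimensions loosely mean royalty, gender, food.
vectors = {
    "king": (0.9, 0.8, 0.0),
    "queen": (0.9, -0.8, 0.0),
    "man": (0.1, 0.8, 0.0),
    "woman": (0.1, -0.8, 0.0),
    "apple": (0.0, 0.0, 1.0),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(vectors, plus_a, plus_b, minus):
    """Return the word closest to [plus_a] + [plus_b] - [minus]."""
    query = tuple(
        a + b - c
        for a, b, c in zip(vectors[plus_a], vectors[plus_b], vectors[minus])
    )
    # Exclude the input words so the query cannot simply match itself.
    candidates = [w for w in vectors if w not in (plus_a, plus_b, minus)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


result = analogy(vectors, "queen", "man", "woman")
```

    With real pre-trained vectors the match is approximate rather than exact, which is why the abstract talks about accuracy and empirical limits.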

    Here are the materials:

    1. Slides from talk:

    2. Candidate Heat Map:

    3. Canonical Word Vectors:

    The word vector files are huge, and you’ll need to download them manually if you want to experiment. The word vectors and presidential debates analysis is still experimental; you can find it in this boneyard repo:

  • Data Engineering Architecture at Simple with Rob Story

    Speaker: Rob Story

    Abstract: Simple’s Data Engineering team has spent the past year and a half building data pipelines to enable the customer service, marketing, finance, and leadership teams to make data-driven decisions.

    We’ll walk through why the data team chose certain open source tools, including Kafka, RabbitMQ, Postgres, Celery, and Elasticsearch. We’ll also discuss the advantages of using Amazon Redshift for data warehousing and some of the lessons learned operating it in production.

    Finally, we’ll touch on the team’s choices for the languages used in the data engineering stack, including Scala, Java, Clojure, and Python.

    About the Speaker: Rob is a Data Engineer at Simple in Portland, OR. Once upon a time he was a Python developer, but now he spends most of his time reading the docs for JVM command line arguments and writing Clojure whenever possible.

    Blog post about Building Analytics at Simple

  • R at Microsoft and R in Visual Studio

    Two talks from Microsoft/Revolution Analytics:

    Slides for Joseph’s talk

  • Machine Learning in the Cloud: A Geekout Session with Poul Petersen

    The Corvallis, OR-based BigML team has been working for the past four years to democratize machine learning in the cloud – making it more consumable, programmable, and scalable. The net result is an intuitive platform that can be leveraged equally by business analysts, developers and data scientists who are eager to perform a variety of predictive analytics and machine learning tasks.

    In this session, Poul Petersen, Chief Infrastructure Officer at BigML (an ML SaaS company), talked about machine learning in the cloud (the basics of decision trees, ensembles, association, anomaly detection, and clustering) and how using the past to predict the future is now easier and more accessible than ever, thanks to the converging trends of open source technologies, cloud-based computing and a growing ‘big data’ imperative.

    Poul’s slides

    Some resources include:

  • Application of analytics for optimizing wireless plans for IoT devices

    Satish Doguparthy talked about analytics using R for optimizing tens of thousands of wireless plans for Internet of Things devices.

    Opportunities are in Plain Sight… the challenge is Business Engagement and Justification …

    The insights in Satish’s conclusion were:

    • The more you look at the data, the more insights you find
    • It’s not easy to make people believe you; it takes time to engage with business partners
    • Interactive analytics products (e.g., Shiny apps) are more useful than sharing static charts
    • More time goes into data gathering, data cleaning and understanding the data than into analysis
    • It is important to change the mindset around data analysis
      • from: we take the order from our business partners and deliver
      • to: we explore the possibilities and insights with our partners


  • Introduction to Neural Networks. Part One: Perceptron

    The first talk of the Introduction to Machine Learning and Neural Networks series will introduce the general ideas behind machine learning and lay the foundation for future neural network implementations by deriving a simple mathematical model of a single neuron, called the Perceptron.

    You will implement a Perceptron in Python and use it for predictions of iris flowers and credit card fraud. The talk will also highlight the limitations of the Perceptron model and provide an outlook of the steps towards a neural network.
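
    For readers who want a preview, here is a minimal Perceptron sketch. It is not the workbook's code: the function names, the tiny logical-AND dataset, and the learning rate are assumptions chosen to keep the example short. The model computes a weighted sum plus a bias, outputs 1 when that sum is positive, and nudges the weights toward each misclassified example.

```python
def predict(weights, bias, x):
    """Step activation: output 1 if the weighted sum plus bias is positive."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, learning_rate=1.0, epochs=20):
    """Perceptron learning rule: adjust weights only on misclassified samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)  # -1, 0, or +1
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


# Logical AND is linearly separable, so the Perceptron can learn it.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
```

    Swapping the labels for XOR (`[0, 1, 1, 0]`) shows the limitation the talk mentions: no single line separates the classes, so no choice of weights classifies all four points correctly.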

    Slides and workbook: Introduction to Neural Networks. Part One: Perceptron. Presented by Hannes.

  • CUSUM Anomaly Detection (CAD) -- A novel anomaly detection algorithm


    Co-hosted with the Portland Data User Meetup Group.

    Brief description of the topic:

    CAD is an anomaly detection method developed for time series of network traffic flow measurements. It searches time series of internet performance variables (download throughput, packet retransmit rate, round trip time) for anomalous subsequences that indicate internet performance degradation.

    CAD was developed and implemented during a 3 month long Outreachy Internship at M-Lab.  CAD is written in R.

    The aim of this talk is to explain the main ideas behind CAD and to illustrate how it works via some real life examples.

    Talk given by Kinga Farkas.
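
    For background, the sketch below shows the textbook one-sided CUSUM (cumulative sum) detector that gives CAD its name. It is not the CAD algorithm itself, which is written in R and described in the talk; the function name, parameters, and sample round trip times are assumptions for illustration. The detector accumulates how far each value exceeds a target plus a slack allowance, and raises an alarm when the running sum crosses a threshold.

```python
def cusum_upper(values, target, slack, threshold):
    """Return the indices where the upper CUSUM statistic exceeds threshold."""
    alarms = []
    running_sum = 0.0
    for index, value in enumerate(values):
        # Accumulate only the excess above target + slack; never drop below 0.
        running_sum = max(0.0, running_sum + (value - target - slack))
        if running_sum > threshold:
            alarms.append(index)
            running_sum = 0.0  # restart accumulation after an alarm
    return alarms


# Hypothetical round trip times in milliseconds with a sustained upward shift.
round_trip_ms = [50, 51, 49, 50, 52, 50, 60, 61, 62, 60]
alarms = cusum_upper(round_trip_ms, target=50, slack=2, threshold=15)
```

    Because the sum builds up over consecutive readings, a sustained shift triggers an alarm even when no single reading is extreme, while isolated small fluctuations are absorbed by the slack.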

  • Data Modeling with Caret

    Kyle Joecken will lead a data modeling workshop using the caret library in R.


  • Hacking Oregon's Hidden Political Connections

    Hobson Lane will present a case study of recent work on Hack Oregon’s Campaign Finance project.

    He’ll walk through several things:

    • Using Python sets and pandas to find relationships between database tables, between politicians and the Ashley Madison dump, and between political action committees that you’d never imagine would support each other.
    • Using scikit-learn and TF-IDF to compare committee descriptions.

    • Using D3 force-directed graphs to do interactive visualization and clustering on that network of connections.
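
    The first step above can be sketched with plain Python sets. This is not Hobson's code, and the committee and donor names are made up: each committee maps to the set of its donors, and any pair of committees with a non-empty intersection becomes an edge, weighted here by Jaccard similarity (an assumption; the talk may weight connections differently).

```python
from itertools import combinations

# Hypothetical data: committee name -> set of donor identifiers.
donors_by_committee = {
    "Committee A": {"donor_1", "donor_2", "donor_3"},
    "Committee B": {"donor_2", "donor_3", "donor_4"},
    "Committee C": {"donor_5"},
}


def shared_donor_links(donors_by_committee):
    """List committee pairs that share donors, with a Jaccard overlap score."""
    links = []
    pairs = combinations(sorted(donors_by_committee.items()), 2)
    for (name_a, donors_a), (name_b, donors_b) in pairs:
        shared = donors_a & donors_b  # set intersection
        if shared:
            jaccard = len(shared) / len(donors_a | donors_b)
            links.append((name_a, name_b, sorted(shared), jaccard))
    return links


links = shared_donor_links(donors_by_committee)
```

    A list of edges like this is the kind of input a force-directed graph layout consumes, with the score available as an edge weight.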


    Data Python Javascript Site (way to go Ken Whaler and Sam Higgins!) Hack Oregon (way to go Cat and Portland volunteer spirit!) Slides RFP

  • Text Mining Meets Neural Nets: Mining the Bio-medical Literature

    Text mining and natural language processing employ a range of techniques, from syntactic parsing and statistical analysis to, more recently, deep learning. Dan Sullivan will present on recent advances in dense word representations, also known as word embeddings, and their advantages over sparse representations, such as the popular term frequency-inverse document frequency (tf-idf) approach. He will also discuss convolutional neural networks, a form of deep learning that is proving surprisingly effective in natural language processing tasks. Dan will also suggest reference papers and tools for those interested in further details. Examples will be drawn from the bio-medical domain.
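
    To make the sparse representation concrete, here is a minimal tf-idf sketch. It is an illustration rather than anything from the talk: the three sentences are invented, and the weighting is the plain textbook form (term frequency times log of total documents over documents containing the term), whereas libraries typically apply smoothed variants. Each document becomes a dictionary holding only the terms it contains, which is what makes the representation sparse.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using plain tf-idf."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up sentences in the spirit of the bio-medical examples.
documents = [
    "the gene regulates the protein",
    "the protein binds the receptor",
    "the drug blocks the receptor",
]
vectors = tfidf_vectors(documents)
vocabulary = {term for vector in vectors for term in vector}
```

    A word that appears in every document ("the") gets weight zero, and each vector covers only a fraction of the vocabulary. Dense embeddings instead give every word a short vector of real numbers, so related words can end up close together even when they never co-occur in a document.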


    Presented by Dan Sullivan, Enterprise Architect at Cambia Health Solutions.