News

  • Collaborative coding with GitHub and RStudio server

    Andrew Bray will discuss collaborative coding with GitHub and RStudio Server. He is on the faculty at Reed College here in Portland.


  • Unsupervised Politics, Machine Learning & R

    Speaker: Winston Saunders

    Abstract: Word vectors, derived by deep learning algorithms applied to billions of words of text, provide powerful semantic models of language. We’ll review R code demonstrating [queen] + [man] - [woman] ~ [king] to about 90% accuracy. Building first on exploratory “bag of words” analysis of Presidential debate texts, we’ll use pre-computed GloVe vectors (Pennington et al., https://github.com/ww44ss/Presidential_Debates_2015) to explore relationships like [sanders] + [trump] - [clinton] ~ [cruz] and how candidate positions align to rhetorical sentiment like [government] + [people] - [tax]. This analysis is a work in progress. We’ll also test empirical limits (aka failed experiments).
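
    The analogy arithmetic in the abstract can be sketched in a few lines. The talk used R and pre-computed GloVe vectors; the Python sketch below instead uses tiny 3-dimensional vectors invented purely for illustration, and the names VECTORS, cosine, and analogy are hypothetical. It adds and subtracts vectors, then picks the remaining word whose vector has the highest cosine similarity to the result.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real GloVe vectors are learned from text and have far more dimensions.
VECTORS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(plus, minus, vectors=VECTORS):
    """Return the word closest to sum(plus) - sum(minus), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in plus:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in minus:
        target = [t - v for t, v in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in plus and w not in minus]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# [queen] + [man] - [woman] ~ [king]
print(analogy(plus=["queen", "man"], minus=["woman"]))
```

    Excluding the input words from the candidates is a common convention in analogy tests, since the nearest vector to the result is often one of the inputs themselves.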

    Here are the related materials:

    1. Slides from talk: https://github.com/tactical-Data/SlidesPDXDataScienceApril2016

    2. Candidate Heat Map: https://github.com/tactical-Data/CandidateSpeechHeatMap

    3. Canonical Word Vectors: https://github.com/tactical-Data/CanonicalWordVectorExamples

    The word vector files are huge, and you’ll need to download them manually if you want to experiment. The word vectors and presidential debates analysis is still experimental; you can find it in this boneyard repo: https://github.com/ww44ss/Presidential_Debates_2015


  • Data Engineering Architecture at Simple with Rob Story

    Speaker: Rob Story

    Abstract: Simple’s Data Engineering team has spent the past year and a half building data pipelines to enable the customer service, marketing, finance, and leadership teams to make data-driven decisions.

    We’ll walk through why the data team chose certain open source tools, including Kafka, RabbitMQ, Postgres, Celery, and Elasticsearch. We’ll also discuss the advantages of using Amazon Redshift for data warehousing and some of the lessons learned operating it in production.

    Finally, we’ll touch on the team’s choices for the languages used in the data engineering stack, including Scala, Java, Clojure, and Python.

    About the Speaker: Rob is a Data Engineer at Simple in Portland, OR. Once upon a time he was a Python developer, but now he spends most of his time reading the docs for JVM command line arguments and writing Clojure whenever possible.

    Blog post about Building Analytics at Simple


  • R at Microsoft and R in Visual Studio

    Two talks from Microsoft/Revolution Analytics:

    Slides for Joseph’s talk


  • Machine Learning in the Cloud: A Geekout Session with Poul Petersen

    The Corvallis, OR-based BigML team has been working for the past four years to democratize machine learning in the cloud by making it more consumable, programmable, and scalable. The net result is an intuitive platform that business analysts, developers, and data scientists can all use to perform a variety of predictive analytics and machine learning tasks.

    In this session, Poul Petersen, Chief Infrastructure Officer at BigML (an ML SaaS company), talked about machine learning in the cloud (the basics of decision trees, ensembles, association, anomaly detection, and clustering) and how using the past to predict the future is now easier and more accessible than ever, thanks to converging trends in open source technologies, cloud-based computing, and a growing ‘big data’ imperative.

    Poul’s slides

    Some resources include:


  • Application of analytics for optimizing wireless plans for IoT devices

    Satish Doguparthy talked about analytics using R for optimizing tens of thousands of wireless plans for Internet of Things devices.

    Opportunities are in Plain Sight… the challenge is Business Engagement and Justification …

    Insights from Satish’s conclusion:

    • The more you look at the data, the more insights you find
    • It’s not easy to make people believe you; it takes time to engage with business partners
    • Interactive analytics products (e.g., Shiny apps) are more useful than sharing static charts
    • More time goes into data gathering, data cleaning, and understanding the data than into analysis
    • It is important to change the mindset around data analysis
      • from: we take the order from our business partners and deliver
      • to: we explore the possibilities and insights with our partners

    Contact:


  • Introduction to Neural Networks. Part One: Perceptron

    The first talk of the Introduction to Machine Learning and Neural Networks series will introduce the general ideas behind machine learning and lay the foundation for future neural network implementations by deriving a simple mathematical model of a single neuron, called the Perceptron.

    You will implement a Perceptron in Python and use it for predictions of iris flowers and credit card fraud. The talk will also highlight the limitations of the Perceptron model and provide an outlook of the steps towards a neural network.
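
    For readers who missed the workshop, here is a minimal Perceptron sketch in Python. It is not the workbook code from the talk: the function names are hypothetical, and it trains on toy logic-gate data rather than the iris or credit card datasets. It applies the classic Perceptron learning rule, then shows the well-known limitation on XOR, which is not linearly separable.

```python
def predict(weights, bias, x):
    """Step activation: output 1 when the weighted sum is positive, else 0."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, learning_rate=1.0, epochs=20):
    """Classic Perceptron rule: nudge weights by the error on each sample."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)  # one of -1, 0, +1
            weights = [w + learning_rate * error * xi
                       for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND_LABELS = [0, 0, 0, 1]  # linearly separable
XOR_LABELS = [0, 1, 1, 0]  # not linearly separable

and_weights, and_bias = train_perceptron(INPUTS, AND_LABELS)
and_predictions = [predict(and_weights, and_bias, x) for x in INPUTS]

xor_weights, xor_bias = train_perceptron(INPUTS, XOR_LABELS)
xor_predictions = [predict(xor_weights, xor_bias, x) for x in INPUTS]

print("AND learned:", and_predictions == AND_LABELS)
print("XOR learned:", xor_predictions == XOR_LABELS)
```

    A single Perceptron draws one straight decision boundary, so no amount of training lets it reproduce XOR. That limitation is what motivates stacking neurons into a network.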

    Slides and workbook: Introduction to Neural Networks. Part One: Perceptron. Presented by Hannes


  • CUSUM Anomaly Detection (CAD) -- A novel anomaly detection algorithm

    Slides

    Co-hosted with the Portland Data User Meetup Group.

    Brief description of the topic:

    CAD is an anomaly detection method developed for time series of network traffic flow measurements. CAD searches time series of internet performance variables (download throughput, packet retransmit rate, round trip time) for anomalous subsequences that indicate degraded internet performance.
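
    As background, the CUSUM idea that CAD takes its name from can be sketched as follows. This is the textbook one-sided CUSUM control chart written in Python, not the CAD algorithm itself (which is written in R and is not described in detail here). The function name, parameter names, and throughput numbers are all made up for illustration.

```python
def cusum_alarms(series, target, slack, threshold):
    """Return the indices where a lower one-sided CUSUM statistic raises an alarm.

    The statistic accumulates how far each value falls below `target`,
    after allowing `slack` of ordinary variation, and never goes negative.
    """
    alarms = []
    cumulative = 0.0
    for index, value in enumerate(series):
        cumulative = max(0.0, cumulative + (target - value - slack))
        if cumulative > threshold:
            alarms.append(index)
            cumulative = 0.0  # restart the statistic after each alarm
    return alarms


# Invented download throughput readings with a sustained dip in the middle.
throughput = [50, 51, 49, 50, 52, 48, 40, 39, 41, 38, 50, 51]
alarms = cusum_alarms(throughput, target=50, slack=2, threshold=15)
print(alarms)
```

    Because the statistic accumulates small deviations, a sustained dip triggers an alarm even when no single reading looks extreme on its own.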

    CAD was developed and implemented during a three-month Outreachy internship at M-Lab. CAD is written in R.

    The aim of this talk is to explain the main ideas behind CAD and to illustrate how it works via some real life examples.

    Talk given by Kinga Farkas.


  • Data Modeling with Caret

    Kyle Joecken will lead a data modeling workshop using the caret library in R.

    Slides


  • Hacking Oregon's Hidden Political Connections

    Hobson Lane will present a case study of recent work on Hack Oregon’s Campaign Finance project.

    He’ll walk through several things:

    • Using Python sets and pandas to find relationships between database tables, between politicians and the Ashley Madison dump, and between political action committees that you’d never imagine would support each other.
    • Using sklearn and TFIDF to compare committee descriptions.
    • Using d3 force-directed graphs to do interactive visualization and clustering on that network of connections.
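
    The first item, using sets to find relationships between tables, might look something like the sketch below. It is an assumption about the general approach rather than Hobson’s actual code: the tables, column names, and values are invented, and plain Python dicts stand in for pandas DataFrames. It scores every pair of columns by the Jaccard overlap of their value sets, so columns that likely refer to the same entities rise to the top.

```python
def column_values(rows, column):
    """Collect the distinct non-missing values found in one column."""
    return {row[column] for row in rows if row.get(column) is not None}


def overlap_scores(left_rows, right_rows):
    """Jaccard overlap of value sets for every (left column, right column) pair."""
    left_columns = sorted({col for row in left_rows for col in row})
    right_columns = sorted({col for row in right_rows for col in row})
    scores = {}
    for left in left_columns:
        left_values = column_values(left_rows, left)
        for right in right_columns:
            right_values = column_values(right_rows, right)
            union = left_values | right_values
            if union:
                scores[(left, right)] = len(left_values & right_values) / len(union)
    return scores


# Invented example tables; real campaign finance data would be far larger.
committees = [
    {"committee_id": 101, "name": "Friends of Parks"},
    {"committee_id": 102, "name": "Bridge Builders PAC"},
    {"committee_id": 103, "name": "Clean Rivers Fund"},
]
transactions = [
    {"filer_id": 101, "payee": "Bridge Builders PAC", "amount": 500},
    {"filer_id": 103, "payee": "Print Shop LLC", "amount": 120},
    {"filer_id": 101, "payee": "Clean Rivers Fund", "amount": 250},
]

scores = overlap_scores(committees, transactions)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 2))
```

    In this toy data the overlap between name and payee is also high, which hints at committees paying other committees, the kind of connection the talk set out to uncover.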

    Resources:

    • Data
    • Python
    • Javascript
    • Site (way to go Ken Whaler and Sam Higgins!)
    • Hack Oregon (way to go Cat and Portland volunteer spirit!)
    • Slides
    • RFP


  • Text Mining Meets Neural Nets: Mining the Bio-medical Literature

    Text mining and natural language processing employ a range of techniques, from syntactic parsing to statistical analysis and, more recently, deep learning. Dan Sullivan will present recent advances in dense word representations, also known as word embeddings, and their advantages over sparse representations such as the popular term frequency-inverse document frequency (tf-idf) approach. He will also discuss convolutional neural networks, a form of deep learning that is proving surprisingly effective in natural language processing tasks. Dan will also suggest reference papers and tools for those interested in further details. Examples will be drawn from the bio-medical domain.
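
    To make the sparse-representation point concrete, here is a small Python sketch of tf-idf vectors and cosine similarity. It is not from Dan’s talk: the three example sentences are invented, and the weighting shown (term frequency times the log of inverse document frequency, with no smoothing) is just one common variant. Each document becomes a sparse dictionary of term weights, and documents with no words in common score exactly zero even when they mean nearly the same thing.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build one sparse {term: weight} dict per document (unsmoothed tf-idf)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


# Invented sentences; the third paraphrases the first with different words.
docs = [
    "aspirin reduces fever",
    "aspirin reduces inflammation",
    "acetylsalicylic acid lowers temperature",
]
vectors = tfidf_vectors(docs)
shared_terms = cosine(vectors[0], vectors[1])
paraphrase = cosine(vectors[0], vectors[2])
print(shared_terms > 0, paraphrase)
```

    Dense word embeddings aim to close this gap by placing related words near each other, so that paraphrases can still score as similar.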

    Slides

    Presented by Dan Sullivan, Enterprise Architect at Cambia Health Solutions.