Interested in a data science career fair?
We are trying to organize a career fair for PDX Data Science. As a first step, we want to ascertain whether there is enough interest in attending such an event, so we are asking people to fill out this two-minute survey.
Additional information about the Career Fair will be available on http://pdxdata.org and on the #careerfair channel on Slack. If you have not done so before, you are welcome to join our Slack team, PdxData, by inviting yourself here: http://pdxdata.org/slack
Item-Item Collaborative Filtering Workshop
Recommendation systems are in use everywhere, and yet I found it difficult to find helpful material on implementing them. This workshop aims to make item-item collaborative filtering both easy to understand and easy to apply. It will be 30 minutes of theory and one hour of practice. The session will be hands-on using Jupyter notebooks, so please install Anaconda ahead of time.
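To give a flavor of the technique before the workshop, here is a minimal item-item collaborative filtering sketch in plain Python. The users, movies, and ratings are invented for illustration and are not from the workshop notebook; the idea is to compare items by the cosine similarity of their rating columns, then predict a missing rating as a similarity-weighted average of the user's other ratings.

```python
from math import sqrt

# Toy user-item ratings (names and numbers are illustrative).
ratings = {
    "alice": {"matrix": 5, "inception": 4, "up": 1},
    "bob":   {"matrix": 4, "inception": 5, "up": 2},
    "carol": {"matrix": 1, "inception": 2, "up": 5},
    "dave":  {"matrix": 5, "inception": 4},          # hasn't rated "up"
}
items = ["matrix", "inception", "up"]

def item_vector(item):
    """One item's ratings across all users (0 when unrated)."""
    return [user.get(item, 0) for user in ratings.values()]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def predict(user, item):
    """Similarity-weighted average of the user's ratings on other items."""
    num = den = 0.0
    for other in items:
        if other != item and other in ratings[user]:
            sim = cosine(item_vector(item), item_vector(other))
            num += sim * ratings[user][other]
            den += sim
    return num / den if den else 0.0

prediction = predict("dave", "up")
```

One caveat worth noticing: with raw cosine on all-positive ratings, every similarity is positive, which inflates predictions; practical systems often use adjusted (mean-centered) cosine instead.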
About the presenter:
Merlyn is an experienced software engineer working at New Relic. He studied machine learning in Berlin where he also worked as a research assistant on SVMs. He’s particularly keen on reinforcement learning and on implementing AIs that compete in simulated environments. Merlyn regularly does “ML katas” with folks inside and outside of work—which you’re welcome to join. He also always has a side project going that he’ll gladly discuss with you.
Shiny Night: Tutorial and Use Cases
Let’s have a Shiny night! If you’re not familiar with Shiny, check it out at http://shiny.rstudio.com - Shiny is a web application framework for R. Many people use Shiny to convey business or research findings to collaborators and managers without requiring them to write code. Check out https://shiny.rstudio.com/gallery to see some really cool examples.
We’re going to have a brief tutorial for those who haven’t used Shiny, followed by some use cases showing what you can do with Shiny in the real world.
Data Dogs - The Cycle of Data
Our discussions will be targeted around four distinct areas of the data cycle:
Cycle 1: Data Origins – Sources, generation and ownership
Cycle 2: Data Growth – Science, assembly and storage
Cycle 3: Data Maturity – Appeal, wisdom and power
Cycle 4: Data Destruction – Burial, deletion and ceremony
Here is our Handout
While prior attendance is not required to participate, please watch or read a link of interest below:
Object Oriented Data Science in Python
A lot of focus in the data science community is on reducing the complexity and time involved in data gathering, cleaning, and organization. In this talk, Sev Leonard will discuss how object-oriented design techniques from software engineering can be used to reduce coding overhead and create robust, reusable data acquisition and cleaning systems. Sev will provide an overview of object-oriented design and walk through an example of using these techniques to get and clean data from a web API in Python. You can find the Jupyter notebook for this talk on GitHub. This talk is based on Sev’s recent ODSC article.
If you want to run the sample notebook, you’ll need to set up a config.py with an API key for RIDB, which you can get at: https://ridb.recreation.gov/?action=register
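The general shape of the pattern can be sketched roughly as follows. The class, field names, and cleaning rules here are hypothetical, not taken from Sev's notebook; the fetcher is injected as a callable so the cleaning logic can run (and be tested) without a live API key.

```python
import json

class CampsiteSource:
    """Encapsulates acquisition and cleaning for one data source.

    A hypothetical sketch of the object-oriented pattern: the raw
    fetch and the cleaning rules live together behind one interface,
    so callers never see half-cleaned data.
    """

    def __init__(self, fetcher):
        self.fetcher = fetcher  # callable returning raw JSON text

    def fetch(self):
        return json.loads(self.fetcher())

    def clean(self, records):
        # Drop records with no name; normalize whitespace and case.
        return [
            {"name": r["name"].strip().title()}
            for r in records
            if r.get("name", "").strip()
        ]

    def get(self):
        return self.clean(self.fetch())

# Offline stand-in for an API call (a real one would hit the web API):
fake_api = lambda: json.dumps([{"name": "  eagle creek "}, {"name": ""}])
sites = CampsiteSource(fake_api).get()
```

Swapping `fake_api` for a function that calls the real API is the only change needed to go live, which is the reusability payoff the talk is about.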
Bio: Sev Leonard is a Python developer and sciencer of data, working as a consultant, writer, and trainer. He’s been working with data for 10+ years in high-volume circuit design, targeted advertising, and data-driven product development.
The Foibles of Organizational Data Management
Real-world business data systems are often a mess. With customer and other data spread across multiple silos and complex technology stacks, it can be very difficult to get a complete picture of how data can meet an organization’s needs. With different teams (IT, marketing, executives, etc.) having divergent needs, it can get even messier. Tonight we are going to talk about some of the foibles faced by organizations, and ways to overcome them.
Bio: Lars serves as the Customer Data and Insights Lead for Connective DX (formerly ISITE Design), a Portland-based digital experience agency. With over a decade’s experience generating actionable insights from interesting data, he manages the development of customer-centric data and analytics measurement strategies for a variety of organizations. Lars’s experience spans B2B and B2C, working with clients including Honda, Autodesk, Coca-Cola, Kraft, Xerox, OHSU, and the Wharton School of Business.
Presented by Lars von Sneidern, Customer Data and Insights Lead for Connective DX
Collaborative coding with GitHub and RStudio server
Andrew Bray will discuss collaborative coding with GitHub and RStudio Server. He’s on the faculty at Reed College here in Portland.
Unsupervised Politics, Machine Learning & R
Speaker: Winston Saunders
Abstract: Word vectors, derived by deep learning algorithms applied to billions of words of text, provide powerful semantic models of language. We’ll review code in R demonstrating [queen] + [man] - [woman] ~ [king] to about 90% accuracy. Building first on an exploratory “bag of words” analysis of presidential debate texts, we’ll explore, using pre-computed GloVe vectors (Pennington et al., https://github.com/ww44ss/Presidential_Debates_2015), relationships like [sanders] + [trump] - [clinton] ~ [cruz], and how candidate positions align with rhetorical sentiment like [government] + [people] - [tax]. This analysis is a work in progress. We’ll also test empirical limits (aka failed experiments).
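The analogy arithmetic can be illustrated with toy vectors. The three-dimensional vectors below are hand-made for demonstration; real GloVe vectors have 50-300 dimensions, are learned from large corpora, and must be downloaded separately.

```python
from math import sqrt

# Tiny hand-made "embeddings" (illustrative only, not real GloVe data).
vec = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.1, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def analogy(a, b, c):
    """Nearest word (by cosine) to vec[a] + vec[b] - vec[c],
    excluding the three query words themselves."""
    target = [x + y - z for x, y, z in zip(vec[a], vec[b], vec[c])]
    candidates = [w for w in vec if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vec[w], target))

result = analogy("queen", "man", "woman")  # -> "king" on these vectors
```

With real embeddings the same three lines of arithmetic, run over a vocabulary of hundreds of thousands of words, recover analogies like the ones in the abstract.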
Here are the slides and code from the talk:
- Slides: https://github.com/tactical-Data/SlidesPDXDataScienceApril2016
- Candidate Heat Map: https://github.com/tactical-Data/CandidateSpeechHeatMap
- Canonical Word Vectors: https://github.com/tactical-Data/CanonicalWordVectorExamples
The word vector files are huge, and you’ll need to download them manually if you want to experiment. The word-vectors-and-presidential-debates analysis is still experimental; you can find it in this boneyard repo: https://github.com/ww44ss/Presidential_Debates_2015
Data Engineering Architecture at Simple with Rob Story
Speaker: Rob Story
Abstract: Simple’s Data Engineering team has spent the past year and a half building data pipelines to enable the customer service, marketing, finance, and leadership teams to make data-driven decisions.
We’ll walk through why the data team chose certain open source tools, including Kafka, RabbitMQ, Postgres, Celery, and Elasticsearch. We’ll also discuss the advantages of using Amazon Redshift for data warehousing and some of the lessons learned operating it in production.
Finally, we’ll touch on the team’s choices for the languages used in the data engineering stack, including Scala, Java, Clojure, and Python.
About the Speaker: Rob is a Data Engineer at Simple in Portland, OR. Once upon a time he was a Python developer, but now he spends most of his time reading the docs for JVM command line arguments and writing Clojure whenever possible.
Blog post about Building Analytics at Simple
R at Microsoft and R in Visual Studio
Two talks from Microsoft/Revolution Analytics:
Slides for Joseph’s talk
Machine Learning in the Cloud: A Geekout Session with Poul Petersen
The Corvallis, OR-based BigML team has been working for the past four years to democratize machine learning in the cloud, making it more consumable, programmable, and scalable. The net result is an intuitive platform that can be leveraged equally by business analysts, developers, and data scientists who are eager to perform a variety of predictive analytics and machine learning tasks.
In this session, Poul Petersen, Chief Infrastructure Officer at BigML (an MLaaS company), talked about machine learning in the cloud (the basics of decision trees, ensembles, association discovery, anomaly detection, and clustering) and how predicting the future from the past is now easier and more accessible than ever, thanks to the converging trends of open source technologies, cloud-based computing, and a growing ‘big data’ imperative.
Some resources include:
Application of analytics for optimizing wireless plans for IoT devices
Satish Doguparthy talked about analytics using R for optimizing tens of thousands of wireless plans for Internet of Things devices.
Opportunities are in plain sight… the challenge is business engagement and justification…
The insights in Satish’s conclusion were:
- The more you look at the data, the more insights you find
- It’s not easy to make people believe you; it takes time to engage with business partners
- Interactive analytics products (e.g., Shiny apps) are more useful than sharing static charts
- More time goes into data gathering, data cleaning, and understanding the data than into the analysis itself
- It’s important to change the mindset around data analysis:
  - from: we take the order from our business partners and deliver
  - to: we explore the possibilities and insights with our partners
Introduction to Neural Networks. Part One: Perceptron
The first talk of the Introduction to Machine Learning and Neural Networks series will introduce the general ideas behind machine learning and lay the foundation for future neural network implementations by deriving a simple mathematical model of a single neuron, called the Perceptron.
You will implement a Perceptron in Python and use it for predictions of iris flowers and credit card fraud. The talk will also highlight the limitations of the Perceptron model and provide an outlook of the steps towards a neural network.
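As a small taste of what the session covers, here is a minimal Perceptron sketch in plain Python. The workshop notebook works on the iris and credit-card data sets; this toy example instead learns logical AND, which is linearly separable and so within the Perceptron's reach.

```python
# A minimal Perceptron: weighted sum + step activation, trained with
# the classic perceptron update rule (illustrative, not the workshop's
# notebook code).

def train_perceptron(samples, epochs=20, lr=0.1):
    """samples: list of (inputs, label) pairs with label in {0, 1}."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in samples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred                       # 0 when already correct
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Logical AND is linearly separable, so the Perceptron converges on it:
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
```

Trying the same code on XOR, which is not linearly separable, shows exactly the limitation the talk highlights and motivates the step to multi-layer networks.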
Slides and workbook: Introduction to Neural Networks. Part One: Perceptron. Presented by Hannes.
CUSUM Anomaly Detection (CAD) -- A novel anomaly detection algorithm
Co-hosted with the Portland Data User Meetup Group.
Brief description of the topic:
CAD is an anomaly detection method developed for time series of network traffic flow measurements. It searches time series of internet performance variables (download throughput, packet retransmission rate, round-trip time) for anomalous subsequences that indicate internet performance degradation.
CAD was developed and implemented during a three-month Outreachy internship at M-Lab. It is written in R.
The aim of this talk is to explain the main ideas behind CAD and to illustrate how it works with some real-life examples.
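The core CUSUM idea can be sketched generically: a running sum accumulates how far the series falls below its expected level, resetting to zero while the signal looks healthy, and an alarm fires once the sum crosses a threshold. Note that CAD itself is implemented in R and is considerably more elaborate; the Python below, with its made-up series and thresholds, only illustrates the underlying statistic.

```python
# One-sided CUSUM for a downward shift, e.g. a drop in download
# throughput (illustrative sketch, not CAD's actual R code).

def cusum_low(series, target, slack, threshold):
    """Flag indices where the cumulative shortfall below `target`
    exceeds `threshold`; `slack` absorbs normal noise near the target."""
    s, alarms = 0.0, []
    for i, x in enumerate(series):
        s = max(0.0, s + (target - slack) - x)   # accumulate shortfall
        if s > threshold:
            alarms.append(i)
    return alarms

# Stable throughput around 10, then a sustained degradation to ~4:
data = [10, 11, 9, 10, 10, 4, 4, 5, 4, 4]
alarms = cusum_low(data, target=10, slack=1, threshold=8)  # -> [6, 7, 8, 9]
```

Because the statistic accumulates over time, CUSUM-style detectors catch sustained small shifts that a single-point outlier test would miss, which is exactly what matters for performance degradation.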
Talk given by Kinga Farkas.
Data Modeling with Caret
Hacking Oregon's Hidden Political Connections
Hobson Lane will be doing a case study on recent work done on Hack Oregon’s Campaign Finance project.
He’ll walk through several things:
- Using Python sets and pandas to find relationships between database tables, between politicians and the Ashley Madison dump, and between political action committees that you’d never imagine would support each other.
- Using sklearn and TF-IDF to compare committee descriptions.
- Using d3 force-directed graphs to do interactive visualization and clustering on that network of connections.
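The committee-description comparison can be sketched with a hand-rolled tf-idf; the project itself uses sklearn (e.g. its TfidfVectorizer), and the three descriptions below are invented stand-ins for the real Hack Oregon data.

```python
from collections import Counter
from math import log, sqrt

# Invented committee descriptions (illustrative only):
docs = [
    "support rural schools and roads",
    "support urban schools and transit",
    "oppose new taxes on business",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n = len(docs)

def tfidf(doc):
    """tf-idf vector over the shared vocabulary for one token list."""
    tf = Counter(doc)
    return [
        tf[w] * log(n / sum(1 for d in tokenized if w in d))
        for w in vocab
    ]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

vecs = [tfidf(d) for d in tokenized]
# The two school-funding committees score as more similar to each
# other than either does to the anti-tax committee.
```

Running pairwise cosine over all committee descriptions yields the similarity network that the d3 force-directed graph then visualizes.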
Text Mining Meets Neural Nets: Mining the Bio-medical Literature
Text mining and natural language processing employ a range of techniques, from syntactic parsing and statistical analysis to, more recently, deep learning. Dan Sullivan will present recent advances in dense word representations, also known as word embeddings, and their advantages over sparse representations such as the popular term frequency-inverse document frequency (tf-idf) approach. He will also discuss convolutional neural networks, a form of deep learning that is proving surprisingly effective in natural language processing tasks. Dan will suggest reference papers and tools for those interested in further details. Examples will be drawn from the bio-medical domain.
Presented by Dan Sullivan, Enterprise Architect at Cambia Health Solutions.