Thursday, October 31, 2019

Tales From the Teenage Cancel Culture

Source: https://www.nytimes.com/2019/10/31/style/cancel-culture.html?emc=rss&partner=rss
October 31, 2019 at 10:39PM

What’s cancel culture really like? Ask a teenager. They know.

Amazon Tests ‘Soul of Seattle’ With Deluge of Election Cash

Source: https://www.nytimes.com/2019/10/30/us/seattle-council-amazon-democracy-vouchers.html?emc=rss&partner=rss
October 31, 2019 at 09:54PM

Amazon is flexing its financial power in Seattle’s City Council elections, putting an innovative program meant to combat such influence at risk.

California Blackouts Hit Cellphone Service, Fraying a Lifeline

Source: https://www.nytimes.com/2019/10/28/business/energy-environment/california-cellular-blackout.html?emc=rss&partner=rss
October 31, 2019 at 08:14PM

As power is cut to reduce the wildfire risk from electrical wires and towers, a primary source of emergency communication is put in jeopardy.

Netflix Expands Into a World Full of Censors

Source: https://www.nytimes.com/2019/10/31/arts/television/netflix-censorship-turkey-india.html?emc=rss&partner=rss
October 31, 2019 at 06:04PM

The streaming giant is having to navigate different political and moral landscapes, and calls for government oversight, as it seeks subscribers worldwide.

A Physics Magic Trick: Take 2 Sheets of Carbon and Twist

Source: https://www.nytimes.com/2019/10/30/science/graphene-physics-superconductor.html?emc=rss&partner=rss
October 31, 2019 at 04:16PM

The study of graphene was starting to go out of style, but new experiments with sheets of the ultrathin material revealed there was much left to learn.

2 Plead Guilty in 2016 Uber and Lynda.com Hacks

Source: https://www.nytimes.com/2019/10/30/technology/uber-lyndacom-hacks-guilty-plea.html?emc=rss&partner=rss
October 31, 2019 at 03:43PM

Guilty pleas to charges of hacking and an extortion conspiracy cap a legal saga that ensnared the tech companies in data breach scandals.

Facebook’s Earnings and Revenue Jump, Topping Forecasts

Source: https://www.nytimes.com/2019/10/30/technology/facebooks-earnings-and-revenue-jump-topping-forecasts.html?emc=rss&partner=rss
October 31, 2019 at 07:41AM

The company’s financial performance is a regular bright spot for the social giant, which has been embroiled in scandals in recent years.

Lyft Focuses on Profitability as Cash-Burning Companies Lose Luster

Source: https://www.nytimes.com/2019/10/30/technology/lyft-earnings-profitability.html?emc=rss&partner=rss
October 31, 2019 at 03:30AM

Lyft had said that it would lose record amounts of money, but it changed its tune as the sentiment toward prominent technology start-ups has soured.

Apple Offers Upbeat Forecast While Profit Continues to Fall

Source: https://www.nytimes.com/2019/10/30/technology/apple-earnings.html?emc=rss&partner=rss
October 31, 2019 at 01:36AM

The company said its profit dropped 3.1 percent from a year ago, but there are early hints that new iPhones are selling well heading into the holidays.

This Guy Thought He Beat Facebook’s Controversial Political Ads Policy

Source: https://www.nytimes.com/2019/10/30/technology/facebook-political-criticism.html?emc=rss&partner=rss
October 30, 2019 at 11:11PM

“Apparently, it’s only O.K. to lie on Facebook if you don’t tell them you’re lying,” said Adriel Hampton, who is running for California governor in protest of the site’s reluctance to fact-check politicians.

Apple Enters Show Business With a Black-Carpet Premiere

Source: https://www.nytimes.com/2019/10/30/business/media/apple-tv-plus.html?emc=rss&partner=rss
October 30, 2019 at 07:49PM

Apple TV Plus arrives Friday with “The Morning Show,” starring Reese Witherspoon and Jennifer Aniston, as its flagship streaming program.

HBO Max, Out in May, Will Cost More Than Netflix

Source: https://www.nytimes.com/2019/10/29/business/media/hbo-max-price.html?emc=rss&partner=rss
October 30, 2019 at 05:47PM

The streaming service will offer some 10,000 hours of programming, including a “Game of Thrones” prequel in place of one that was scrapped earlier in the day.

Amazon Turns to More Free Grocery Delivery to Lift Food Sales

Source: https://www.nytimes.com/2019/10/29/technology/amazon-prime-fresh-whole-foods-grocery-delivery.html?emc=rss&partner=rss
October 30, 2019 at 07:57AM

The company announced free two-hour food delivery for Prime members in about 20 major metropolitan areas.

Sony to Shut Down PlayStation Vue, a Cable Alternative

Source: https://www.nytimes.com/2019/10/29/business/sony-playstation-vue.html?emc=rss&partner=rss
October 30, 2019 at 03:43AM

Vue was started in early 2015 as a cheaper version of cable TV. It drew a slew of copycats, but customer growth slowed for many of these services as prices rose.

WhatsApp Says Israeli Firm Used Its App in Spy Program

Source: https://www.nytimes.com/2019/10/29/technology/whatsapp-nso-lawsuit.html?emc=rss&partner=rss
October 30, 2019 at 03:01AM

In a federal lawsuit, the messaging service said 1,400 WhatsApp accounts were targeted, including those of 100 journalists and human-rights activists.

‘OK Boomer’ Marks the End of Friendly Generational Relations

Source: https://www.nytimes.com/2019/10/29/style/ok-boomer.html?emc=rss&partner=rss
October 29, 2019 at 08:07PM

Now it’s war: Gen Z has finally snapped over climate change and financial inequality.

Australia Proposes Face Scans for Watching Online Pornography

Source: https://www.nytimes.com/2019/10/29/world/australia/pornography-facial-recognition.html?emc=rss&partner=rss
October 29, 2019 at 03:58PM

As a government agency seeks approval of a facial recognition system, it says one use for it could be verifying the age of people who want to view pornography online.

F.C.C. Plans Vote to Restrict Huawei and ZTE Purchases

Source: https://www.nytimes.com/2019/10/28/technology/huawei-zte-fcc-china.html?emc=rss&partner=rss
October 29, 2019 at 06:07AM

Rural carriers have raised concerns about moves to crack down on Huawei gear, saying it will cut them off from a supply of affordable equipment.

Google, in Rare Stumble, Posts 23% Decline in Profit

Source: https://www.nytimes.com/2019/10/28/technology/google-alphabet-earnings.html?emc=rss&partner=rss
October 29, 2019 at 02:03AM

Alphabet, Google’s parent, said its profits fell after a sharp increase in spending for research and development.

Dissent Erupts at Facebook Over Hands-Off Stance on Political Ads

Source: https://www.nytimes.com/2019/10/28/technology/facebook-mark-zuckerberg-political-ads.html?emc=rss&partner=rss
October 28, 2019 at 10:13PM

In an open letter, the social network’s employees said letting politicians post false claims in ads was “a threat” to the company.

Read the Letter Facebook Employees Sent to Mark Zuckerberg About Political Ads

Source: https://www.nytimes.com/2019/10/28/technology/facebook-mark-zuckerberg-letter.html?emc=rss&partner=rss
October 28, 2019 at 08:24PM

Hundreds of Facebook employees signed a letter decrying the decision to let politicians post any claims they wanted — even false ones — in ads on the site.

The Advertising Industry Has a Problem: People Hate Ads

Source: https://www.nytimes.com/2019/10/28/business/media/advertising-industry-research.html?emc=rss&partner=rss
October 28, 2019 at 07:01AM

Ad-blocking consumers and cost-cutting clients make for “dangerous days for advertisers,” according to a new report.

Disney Is New to Streaming, but Its Marketing Is Unmatched

Source: https://www.nytimes.com/2019/10/27/business/media/disney-plus-marketing.html?emc=rss&partner=rss
October 28, 2019 at 06:19AM

The company’s synergistic approach — think trailers playing on TVs in 22,000 Disney World hotel rooms — has made Disney Plus known to millions.

California Attorney General Is a No-Show on Tech Investigations

Source: https://www.nytimes.com/2019/10/31/technology/tech-investigations-california-attorney-general-becerra.html?emc=rss&partner=rss
October 31, 2019 at 07:45PM

Attorney General Xavier Becerra is in Google and Facebook’s backyard. But unlike nearly all other state attorneys general, he won’t say whether he’s investigating them.

Voices in AI – Episode 99 – A Conversation with Patrick Surry

Source: https://gigaom.com/2019/10/31/voices-in-ai-episode-99-a-conversation-with-patrick-surry/
October 31, 2019 at 03:00PM

About this Episode

In this episode of Voices in AI, Byron speaks with Patrick Surry of Hopper about the nature of intelligence and the path that our relationship with AI is taking.

Listen to this episode or read the full transcript at www.VoicesinAI.com

Transcript Excerpt

Byron Reese: This is Voices in AI, brought to you by GigaOm. I’m Byron Reese. Today my guest is Patrick Surry. He is the Chief Data Scientist at Hopper. He holds a PhD in math and statistics from the University of Edinburgh. Welcome to the show, Patrick.

Patrick Surry: It’s great to be here.

I like to start our journey off with the same question for most guests, which is: What is artificial intelligence? Specifically, why is it artificial?

That’s a really interesting question. I think there’s a bunch of different takes you get from different people about that. I guess the way I think about it [is] in a pragmatic sense of trying to get computers to mimic the way that humans think about problems that are not necessarily easily broken down into a series of methodical steps to solve.

Is it getting computers to think like humans, or is it getting computers to solve problems that only humans used to be able to solve?

I think for me the way that AI started was this whole idea of trying to understand how we could mimic human thought processes, so thinking about playing chess, as an example. We were trying to understand – it was hard to write down how a human played chess, but we wanted to make a machine that could mimic that human ability. Interestingly enough, as we build these machines, we often come up with different ways of solving the problem that are nothing like the way a human actually solves the problem.

Isn’t that kind of almost the norm in a way? Taking something pretty simple, why is it that you can train a human with a sample size of one? “This is an alien. Find this alien in these photos.” Even if the alien is upside down or half obscured or under water, we’re like “there, there, and there.” Why can’t computers do that?

I think computers are getting better at those kinds of problems. I think humans have a whole set of not greatly understood pattern-matching abilities that we’ve evolved over thousands of years and trained since we were born as individuals. They limit the kinds of problems we can solve and the way that we solve them, but they do it in a really interesting way that allows us to solve the kinds of practical problems that we’re actually interested in as a species: to be able to survive and eat and find a mate and those kinds of things.

You know, it’s interesting because you’re right. It took us a long time, but it shouldn’t take computers nearly that long. They’re moving at the speed of light, right? If it takes a toddler five years, won’t we eventually be able to train a blank slate of a computer in five minutes?

Yes. I think you’re starting to see evidence of that now, right? I think we sort of started from a different place with computers. We started with this very predictable step-by-step binary system. We could show mathematically you could solve any kind of well-formulated mathematical problem. Then we decided [with] this universal computing device, it would be cool if we could make it solve the kinds of problems that humans solve. It’s almost like we started from the wrong place, in a sense. If you were trying to mimic humans, maybe we should have gone a lot farther down the analog computing path instead of trying to build everything on top of this binary computer, which doesn’t really match the underlying hardware of a human very well.

We’re massively parallel, and computers are just enormously fast sequentially.

Also, this sort of digital versus analog thing is always interesting. The way human brains seem to work is with lots of gradients of electricity and chemicals and that is very different from the fundamental unit of a computer, which is this 0 or 1 bit. I think when you look at a lot of the recent work that’s being done in computer vision and these generative networks and so forth, the starting point is first of all to construct something that looks a lot more analog and a lot more like things that you find in someone’s brain out of these fundamental units that we originally built in the computer.

You know, records, LPs, they’re analog. CDs came along and they’re digital. Do you think people can tell the difference between the two when they listen to them?

I certainly cannot.

I can’t either. Yet, I think maybe it’s my own shortcoming. I don’t know. That’s not an approximation of an analog experience. It’s beyond an approximation to me at least.

There are people I know who claim that they can tell the difference. I think it’s that way with a lot of things: we’ve got to a point where the approximation is such high fidelity that you can’t really tell it’s different. You look back at the early days of television, or the first computer monitor I had way back in the day with my Apple IIe or whatever it was: there were four colors. You could individually see every box on the screen as a little pixel.

Now you have an 8K TV. If you’re not within an inch of the screen, it looks like a completely continuous picture. It’s the same sort of thing. I think with the CD, once you get to a certain level of digital approximation, it may not be the most efficient, but you can trick most of the people.
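
A rough, hypothetical illustration of this digital-approximation point (not from the episode): the Python sketch below samples a continuous 440 Hz tone at the CD rate of 44.1 kHz, quantizes it to 16 bits, and measures how small the resulting error is.

    import numpy as np

    # Sample one second of a 440 Hz "analog" tone at the CD rate of 44.1 kHz.
    sample_rate = 44_100
    t = np.arange(sample_rate) / sample_rate
    analog = np.sin(2 * np.pi * 440 * t)

    # Quantize each sample to a signed 16-bit integer, as a CD does.
    quantized = np.round(analog * 32767).astype(np.int16)

    # Reconstruct and measure the worst-case error the quantization introduced.
    reconstructed = quantized / 32767
    print(f"max quantization error: {np.max(np.abs(analog - reconstructed)):.2e}")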

Listen to this episode or read the full transcript at www.VoicesinAI.com

Byron explores issues around artificial intelligence and conscious computers in his new book The Fourth Age: Smart Robots, Conscious Computers, and the Future of Humanity.

Wednesday, October 30, 2019

Stick to Sports? No Way. Deadspin Journalists Quit en Masse.

Source: https://www.nytimes.com/2019/10/30/business/media/deadspin-sports-staff.html?emc=rss&partner=rss
October 31, 2019 at 03:30AM

At least eight writers and editors resigned on the day after the firing of the interim editor in chief.

A Greater Understanding of Race and Identity Through Tech

Source: https://www.nytimes.com/2019/10/30/technology/personaltech/lauretta-charlton-race-related.html?emc=rss&partner=rss
October 30, 2019 at 09:58PM

Podcasts featuring inmates. The evolution of social media. The writing of Stephen Hawking. All are routes to thinking about race and identity, says Lauretta Charlton, editor of Race/Related.

Facebook’s Earnings and Revenue Jump, Topping Forecasts

Source: https://www.nytimes.com/2019/10/30/technology/facebooks-earnings-and-revenue-jump-topping-forecasts.html?emc=rss&partner=rss
October 31, 2019 at 01:28AM

The company’s financial performance is a regular bright spot for the social giant, which has been embroiled in scandals in recent years.

A Physics Magic Trick: Take 2 Sheets of Carbon and Twist

Source: https://www.nytimes.com/2019/10/30/science/graphene-physics-superconductor.html?emc=rss&partner=rss
October 30, 2019 at 11:59PM

The study of graphene was starting to go out of style, but new experiments with sheets of the ultrathin material revealed there was much left to learn.

Apple Has Upbeat Forecast While Profit Continues to Fall

Source: https://www.nytimes.com/2019/10/30/technology/apple-earnings.html?emc=rss&partner=rss
October 30, 2019 at 11:42PM

The company said its profit dropped 3.1 percent from a year ago, but there are early hints that new iPhones are selling well heading into the holidays.

Twitter Will Ban All Political Ads, C.E.O. Dorsey Says

Source: https://www.nytimes.com/2019/10/30/technology/twitter-political-ads-ban.html?emc=rss&partner=rss
October 30, 2019 at 11:30PM

The social media company’s action is a stark contrast to Facebook, which is taking a hands-off approach to political advertising.

Facebook’s Revenue Jumps 29%

Source: https://www.nytimes.com/2019/10/30/technology/facebooks-revenue-jumps-29.html?emc=rss&partner=rss
October 30, 2019 at 11:17PM

The company’s financial performance is a regular bright spot for the social giant, which has been embroiled in scandals in recent years.

This Guy Thought He Beat Facebook’s Controversial Political Ads Policy

Source: https://www.nytimes.com/2019/10/30/technology/facebook-political-criticism.html?emc=rss&partner=rss
October 30, 2019 at 11:11PM

“Apparently, it’s only O.K. to lie on Facebook if you don’t tell them you’re lying,” said Adriel Hampton, who is running for California governor in protest of the site’s reluctance to fact-check politicians.

Lyft Focuses on Profitability as Cash-Burning Companies Lose Luster

Source: https://www.nytimes.com/2019/10/30/technology/lyft-earnings-profitability.html?emc=rss&partner=rss
October 30, 2019 at 11:09PM

Lyft had said that it would lose record amounts of money, but it changed its tune as the sentiment toward prominent technology start-ups has soured.

2 Plead Guilty in 2016 Uber and Lynda.com Hacks

Source: https://www.nytimes.com/2019/10/30/technology/uber-lyndacom-hacks-guilty-plea.html?emc=rss&partner=rss
October 30, 2019 at 09:10PM

Guilty pleas to charges of hacking and an extortion conspiracy cap a legal saga that ensnared the tech companies in data breach scandals.

Amazon Tests ‘Soul of Seattle’ With Deluge of Election Cash

Source: https://www.nytimes.com/2019/10/30/us/seattle-council-amazon-democracy-vouchers.html?emc=rss&partner=rss
October 30, 2019 at 08:44PM

Amazon is flexing its financial power in Seattle’s City Council elections, putting an innovative program meant to combat such influence at risk.

Russia Tests New Disinformation Tactics in Africa to Expand Influence

Source: https://www.nytimes.com/2019/10/30/technology/russia-facebook-disinformation-africa.html?emc=rss&partner=rss
October 30, 2019 at 05:01PM

Facebook said it removed three Russian-backed influence networks aimed at African countries. The activity by the networks suggested Russia’s approach was evolving.

Ready. Set. Write a Book.

Source: https://www.nytimes.com/2019/10/30/technology/personaltech/national-novel-writing-month-apps-tools.html?emc=rss&partner=rss
October 30, 2019 at 12:00PM

The National Novel Writing Month event challenges people to crank out 50,000 words in 30 days. Here are the digital tools to help you make a go of it.

Voices in Innovation – Andrew Brust Interviews Sean Knapp of Ascend.IO

Source: https://gigaom.com/2019/10/24/voices-in-innovation-simon-gibson-interviews-sean-knapp-of-ascend-io/
October 24, 2019 at 03:00PM

Audio: Voices in Innovation – Andrew Brust Interviews Sean Knapp (https://voices-in-innovation.s3.amazonaws.com/001-ViI-(00-32-06)-AB-Ascend.mp3)

Guest

Sean Knapp is the founder and CEO of Ascend.io. Prior to Ascend.io, Sean was a co-founder, CTO, and Chief Product Officer at Ooyala. At Ooyala, Sean played key roles in raising $120M, scaling the company to 500 employees, Ooyala’s $410M acquisition, as well as Ooyala’s subsequent acquisitions of Videoplaza and Nativ. He oversaw all Product, Engineering, and Solutions, and defined Ooyala’s product vision for their award-winning analytics and video platform solutions. Before founding Ooyala, Sean worked at Google, where he was the technical lead for Google’s legendary Web Search Frontend team, helping that team increase Google revenues by over $1B. Sean has both B.S. and M.S. degrees in Computer Science from Stanford University.

Transcript

Andrew Brust: All right, welcome to this GigaOm podcast. This is Andrew Brust from GigaOm. I’m a lead analyst focusing on data analytics, BI, AI, and all good things data. With us, we have Sean Knapp, who is the founder and CEO at Ascend. Sean, welcome.

Sean Knapp: Thank you.

I’m hoping we can talk a little bit about data pipelines but gosh, if that’s all we’re talking about, that can be a bit dry. I think what, hopefully, we can talk about is how to get beyond what has been and get to maybe a better place where data pipelines feel less like a necessary evil and more like something that can really help and perhaps are less burdensome. I may have loaded all the questions there, but hopefully that’s okay.

Can you maybe just introduce yourself a bit and the premise and the rubric of what Ascend is up to? Then maybe we can drill down on a few questions from there.

Yeah, happy to. As I introduce myself, Sean Knapp, founder and CEO of Ascend.io, I’m a software engineer by training. I’ve been building data pipelines for actually 15 years now. I wrote my first MapReduce job in Sawzall back at Google in 2004 and since then have gone on to build companies and teams really heavily focused around data, analytics, ETL, and more. About four years ago, I founded Ascend.io to really help automate and optimize a lot more of how we build data pipelines and make it a lot easier for us to, frankly, do more of them faster and more efficiently.

You’ve got your work cut out for you. I forget what date you said you wrote your first MapReduce job, but were you really coding MapReduce in Java? Was that the substance of what you were up to at that time?

Yeah, so back when I started at Google in 2004, Google had their MapReduce framework. You usually wrote your jobs in a language called Sawzall. I was the tech lead for the front-end team on web search, and we were always writing a lot of large MapReduce jobs to analyze how our users were engaging with content on web search, what links they were clicking on, the efficacy of our various experimentation systems. We wrote a lot of analysis on usage even 15 years ago.
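
For readers unfamiliar with the pattern Sean is describing, here is a minimal, hypothetical sketch of the MapReduce idea in Python: a map phase that emits a count for every clicked link and a reduce phase that sums counts per URL. It is not Sawzall or Google’s framework, and the log format is invented for illustration.

    from collections import defaultdict

    # Hypothetical click log: one (query, clicked_url) pair per search event.
    log = [
        ("weather boston", "weather.example.com"),
        ("python tutorial", "docs.example.org"),
        ("weather boston", "weather.example.com"),
    ]

    # Map phase: emit a (key, 1) pair for every clicked link.
    mapped = [(url, 1) for _query, url in log]

    # Shuffle/reduce phase: group by key and sum the counts per URL.
    counts = defaultdict(int)
    for url, n in mapped:
        counts[url] += n

    print(dict(counts))  # {'weather.example.com': 2, 'docs.example.org': 1}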

I know from a little bit of dabbling I’ve done with Hadoop and that flavor of MapReduce, that can get pretty in the weeds pretty quickly. Gosh, I guess even before there was MapReduce, there were all kinds of ETL packages. Pipelines are not a new concept; they’ve been with us for quite a long time. I go back 20 years in doing BI work, and we were writing SQL scripts then to do a lot of this stuff.

What have pipelines been good for, and where have they hindered us, if I can ask you those two oppositional questions?

I think pipelines have been of tremendous value and also pain for most companies for decades now. They are certainly not a new concept. I’d say the technologies that they employ and leverage have changed tremendously, but the core concepts of really doing data movement and transformation, the classic ETL approach, haven’t changed much at all.

I’d say what pipelines have been incredibly valuable for is pulling data out of one or many systems, doing complex and interesting transformations on it, and then of course, loading it back into others: standard ETL. And they have become increasingly valuable for companies and organizations because as those transformations get more and more complex, the pipelines get more and more valuable.

I would say the pain really starts to emerge as you have more data sets, higher data volumes, and more people tapping into those pipelines; that’s when you get this exponential increase in complexity tied to the interconnectedness of these systems. That’s really where most of the pain points are felt: simply trying to maintain and sustain these systems as they become increasingly complex, interconnected highways, if you will, of data flowing through the enterprise.
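
To make the classic extract-transform-load shape concrete, here is a minimal, hypothetical Python sketch: it reads rows from a CSV, cleans them, and loads them into a SQLite table. The file name, columns, and target table are assumptions for illustration only, not anything specific to Ascend.

    import csv
    import sqlite3

    # Extract: read raw order records from a (hypothetical) CSV export.
    with open("orders.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: parse amounts and drop rows that are missing one.
    cleaned = [
        {"order_id": r["order_id"], "amount_usd": round(float(r["amount"]), 2)}
        for r in rows
        if r.get("amount")
    ]

    # Load: write the transformed rows into a reporting table.
    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount_usd REAL)")
    conn.executemany(
        "INSERT INTO orders (order_id, amount_usd) VALUES (:order_id, :amount_usd)",
        cleaned,
    )
    conn.commit()
    conn.close()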

I guess one thing I’ve seen over the years is that we’ve written scripts really with the very tactical goal of just getting data from A to B and from Form X to Form Y. Once we got it to work, we were pretty happy. That’s very ad hoc, and then to go from there to something that is operationalized is another thing. It seems like some of these scripts just became operationalized almost accidentally and they weren’t necessarily up to the task…

Yeah, I totally agree.

Go for it, yeah. I’m listening.

Yeah, I totally agree. I would say the vast majority of pipelines that have been built were built in isolation, oftentimes scripted to meet a very specific and certain need. Then lo and behold, all of a sudden, you found ten other teams were building on top of that derived data set, and what was probably somebody’s weekend or hackathon project is now a critical and core piece of the business.

Maybe not a passion project, but a project under duress to meet an immediate requirement may be a little more focused on getting something done than on making something that’s really well-engineered and built to be repeatable and last. Here’s the thing: I’ve already shown my age. I’m coming from the old data warehouse days and the BI days, and along comes all this big data stuff and notions of ‘We’re not going to do ETL anymore; we’re going to do ELT. We’re going to do Schema on Read. We’re not going to worry so much about transforming the data until it’s time to query it.’

Also, [with] this notion of data virtualization and bringing the compute to the data rather than the other way around: the implication has been that eliminates the need for pipelines. What do you think of that prospect? Is there a kernel of truth to that? Is there no truth to that? Is it just that things are more complex than we sometimes make them out to be in taglines?

Yeah, I’d say there’s certainly some truth to it. I would say the challenge with the ELT approach is frankly, you’re repeating yourself. If you’re performing the same operation on data over, and over, and over again, it becomes inefficient and it becomes expensive. The same core principles explain why ETL is really valuable: if you know you’re going to be doing the same kind of transformation to a piece of data to get it into a shared, aligned system where you can do more complex things with it, pipelines are a really good fit.

What we saw, and the reason that we’ve seen a surge of the ELT approach, is frankly, we have a lot bigger hammers than we used to. When you get bigger hammers, a lot more things start to look like nails. There are use cases where we can take an ELT approach where previously we just simply couldn’t. It makes life easier for a little while.

What we find, however, is eventually those approaches do start to break down a little bit as either your data volumes or your data complexities increase. Then we see, similar to what we’ve seen with some of the most progressive data engineering teams today: they still have a very strong need to pull data from disparate systems back into unified systems to do complex transformations on those data sets, and to actually store and persist those derived data sets so they’re optimized, curated, and available for those downstream teams to really iterate on. That’s where we figure there’s a fit, but I do believe that over time the industry will evolve not to be just ETL or even just ELT but a cascading approach to ETL, TL.
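
For contrast with the ETL sketch above, here is a minimal, hypothetical ELT-style sketch in Python: the raw records are landed untransformed, and every consumer re-parses and re-cleans them at read time, which is exactly the repeated work Sean notes gets expensive as volumes and consumers grow.

    import json
    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Load: land the raw records as-is, one JSON blob per row, no upfront transform.
    conn.execute("CREATE TABLE raw_orders (payload TEXT)")
    records = [{"order_id": "A1", "amount": "19.99"}, {"order_id": "A2", "amount": None}]
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?)", [(json.dumps(r),) for r in records]
    )

    # Transform at read time: each consumer repeats the same parsing and cleaning.
    def clean_orders():
        for (payload,) in conn.execute("SELECT payload FROM raw_orders"):
            r = json.loads(payload)
            if r.get("amount") is not None:
                yield r["order_id"], round(float(r["amount"]), 2)

    print(list(clean_orders()))  # [('A1', 19.99)]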

I think maybe we need to acknowledge a nuance, which is that in the world of analytics, some things are done repeatedly and operationally, and some stuff is more exploratory. Maybe for more exploratory stuff, leaving things less structured in storage makes sense. Then doing the transformations at query time makes sense.

Once we really get into productionalized questions and analyses and necessary insights, that’s where we need a greater deal of engineering and structure, and that really was the rubric of data warehouses all along. Maybe that requirement hasn’t really gone away with that in mind. That’s more of a statement than a question, I guess. Do you agree?

It is a statement, and I do agree. I do think there’s overhead to the construction of pipelines, so if you have an exploratory ad hoc use case, it really is important to be able to just do an ad hoc export of the data. We do think that that is why there’s this really powerful combination of that desire to do ELT and how pipelines work together. We think they both fit complementary needs.

That makes sense. Now we’re starting to get to some nuance instead of extremes. I think that’s good. We know that there’s lots of folks out there – arguably, I used to be one of them – that felt a lot more in control and comfortable just writing their own code for something every single time. That may not scale so well, so we’ve had systems for some time and we have a lot of them in the market now that let us construct pipelines visually through some kind of schematic approach where things are declared, even where we’re not coding. Does that address the issue? Does that make things less brittle? Why isn’t that good enough?

That’s a great question. We certainly have seen these systems pop up that make it easier to architect and design pipelines. In fact, even Ascend has a visual interface that we know our users really enjoy using. I think the thing that makes this harder is it’s not just about the construction of the pipeline, but also the operation of the pipeline. What we’ve seen is: most of the tools in the industry to date help you describe declaratively in far less code the operations you want to make to your data.

When it comes to the actual operation of that pipeline itself, it’s still hard. You still have to answer questions like ‘what sets of data should I persist? How do I intelligently back-fill data? How do I retract previously calculated statements? Should I partition my data?’ All these things you had to worry about when you wrote a ton of code to build a pipeline, you still actually have to worry about even in these higher-level systems.

I’d say they’re a step in the right direction, but this is also where we believe in having a fully declarative system, one that is less task-centric and far more data-centric, that understands the data far closer to how a database engine or a data warehouse engine understands it. That is really where things are going.

We’re big fans and believers in this notion of declarative systems similar to what we’ve seen in infrastructure technologies like Kubernetes and similar to what we’ve seen in these warehouses with their database engines. If you can build a system that really deeply understands how the pipelines work and the nature of the data and the dependency of transformations, you can offload a lot more of that complexity to the underlying engine itself.
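
A toy sketch of the declarative idea being described, with hypothetical names and nothing resembling Ascend’s actual API: each dataset declares its dependencies and how it is derived, and a small engine resolves the build order instead of the developer scripting each task.

    # Each dataset declares what it depends on and how it is derived.
    pipeline = {
        "raw_orders": {"deps": [], "build": lambda deps: [10.0, 19.99, None]},
        "clean_orders": {
            "deps": ["raw_orders"],
            "build": lambda deps: [x for x in deps["raw_orders"] if x is not None],
        },
        "daily_revenue": {
            "deps": ["clean_orders"],
            "build": lambda deps: sum(deps["clean_orders"]),
        },
    }

    def materialize(name, spec, cache):
        # Build upstream datasets first, memoizing results, so order is inferred.
        if name not in cache:
            inputs = {dep: materialize(dep, spec, cache) for dep in spec[name]["deps"]}
            cache[name] = spec[name]["build"](inputs)
        return cache[name]

    print(materialize("daily_revenue", pipeline, {}))  # 29.99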

You segued into my next question, perhaps unwittingly, but after I had worked on a number of these things for a while, I started to see patterns. As I went from one project to the next, I was able to apply some rules of thumb or heuristics to get these things done a little faster. That was just me doing it, in effect, manually.

As an industry, it feels like we’ve been building these pipelines for a long time. Are there learnings that can be applied in an automated fashion such that almost like we have rules or an expert system or something that really understands the generalized prospect of creating pipelines such that maybe a lot of this stuff doesn’t have to be reinvented from database to database and project to project? First, let me just ask that about the initial authoring of a pipeline, and then I have a follow-up.

I think we, as an industry, are right on the cusp of doing so. It’s interesting because when we look at adjacent industries, very few folks are building a new database engine or a new data warehouse engine. It’s generally accepted that there’s some pretty standard technologies and winners in that space.

What’s interesting that we observed is – and even my team and previous companies have done the same – you always end up with a team grabbing some of the existing open-source technologies like Spark, for example, but then building, as we all have, these abstraction layers on top to try and better structure and automate a lot of these repeatable patterns; we’ve seen this time and time again. Even at Ascend as we engage with dozens and dozens of different companies, we see every company trying to abstract away these complexities to make it easier for the rest of their company to self-serve and create data pipelines.

What’s really interesting is: everybody is following these same patterns but all implementing them slightly different[ly]. They all tend to be very bespoke. To your question, I do believe the industry is right on the cusp of tipping into an exciting new era in which there are standard, well-informed ways of having a standardized intelligence system that knows how to essentially be the automated engine for pipelines, doing something similar to what a database engine, a query planner, and a query optimizer do for databases. We will soon start to see the emergence of these really intelligent layers for how pipelines themselves work, layers that understand far more about the data as opposed to the tasks and, as a result, offload huge amounts of the developer burden required to build and maintain pipelines.

That’s interesting, and I have a follow-up, but I also wanted to dial back to something you said earlier. My follow-up is: can this approach, in addition to being applied to create a pipeline, also deal with the stuff that always caused loss of sleep on projects I was on? Sometimes some flat file that was supposed to get put somewhere didn’t get delivered, or gosh, schemas can change, either the schema in the source data or the schema perhaps in my warehouse or my OLAP cube or something like that. Again, I’m dating myself, but there it is.

Can we anticipate some of that and automate the adjustment of the pipeline? That’s my follow-up; then I have one more follow-up on the follow-up, but we’ll go one at a time.

The short answer to your question is ‘yes.’ I think the core key notion behind a lot of this is: ‘Do you have that intelligence layer that understands where your data’s coming from and where it’s going to, as well as how it’s being used?’ At Ascend, we call this our control plane and various other teams and companies may implement it and call it something slightly different.

The key concept behind this is: ‘Do you have a system that isn’t looking at data as it moves through just in the context of one transformation, or one query, or one stage, or even a whole pipeline, but instead looks at the entire ecosystem and does two things?’ One, can it detect changes and even potential breaks far upstream before they even trickle down through systems? Then two, to your specific question, could [it] also catch a change that somebody may be wanting to make to that OLAP cube or to a data set and say if you do this, it could have this impact downstream, because that system, that control plane, is monitoring the entire data ecosystem?

This is a whole new level of intelligence that we’re certainly capable of building. I think it tends to just be hard technology, which takes a lot of time. So, honestly, did database engines, which took a long time to develop and refine as well, but the benefits are astronomically high as a result.
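
Here is a minimal, hypothetical sketch of the kind of upstream schema check such a control plane might run: it compares the columns that actually arrived against the expected schema and reports which downstream datasets are at risk. The schemas and dependency map are invented, and this is not how Ascend’s control plane is implemented.

    # Expected upstream columns and the downstream datasets that depend on each one.
    expected_columns = {"order_id", "amount", "created_at"}
    downstream_uses = {
        "amount": ["daily_revenue", "finance_dashboard"],
        "created_at": ["daily_revenue"],
    }

    def check_schema(observed_columns):
        """Report missing or unexpected columns and the downstream datasets at risk."""
        missing = expected_columns - observed_columns
        added = observed_columns - expected_columns
        at_risk = sorted({ds for col in missing for ds in downstream_uses.get(col, [])})
        return {"missing": sorted(missing), "added": sorted(added), "at_risk": at_risk}

    # Simulate a delivery where the upstream team renamed 'amount' to 'amount_usd'.
    print(check_schema({"order_id", "amount_usd", "created_at"}))
    # {'missing': ['amount'], 'added': ['amount_usd'],
    #  'at_risk': ['daily_revenue', 'finance_dashboard']}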

You’re getting me excited because it’s almost like – gosh, now I have this idea in my head that we could almost have a trigger in the pipeline as opposed to a trigger in the database so that if something changes in the schema or something changes in a requirement, a source file, or source system that isn’t responding to a query or where a file’s not present, that we have a way of dealing with that. I don’t know if I’m extrapolating too far or if that’s somewhere in the right neighborhood, but that’s intriguing.

You mentioned before the whole notion of whether we have to partition the data. That’s been important for a long time, too, but it also just seems like in the world of data lakes, when we’re working with these columnar file formats like ORC and Parquet, there are umpteen opportunities to partition the data: into an arbitrary number of chunks, or across certain dimensions or certain levels within a dimension, partitioning across year, partitioning across month, and then just saying “oh, partition it down into 200 pieces.”

Of course, the hope there is when you run a somewhat founded query, you can skip over a lot of the data that’s not important. The little bit of that that I’ve done, I feel like I’m guessing a little bit. What can you do to really automate not just the action items for implementing partitions but for developing a partitioning strategy and helping me with that so that I don’t have to guess at it?

It’s a really good question. The general thesis and position here is nobody should ever really have to worry about partitioning. That should ultimately be the job of the pipeline engine, if you will. Just as a database engine worries about how it lays out the data on disk and how indices are built to optimize how you pull it off of disk or out of memory, the very same thing should be happening with that pipeline engine.

When we think about where we can and should be evolving to as an industry, we have these really powerful technologies like our cloud BLOB stores and our processing engines like Spark that are really powerful, capable, and scalable, but they don’t maintain that same context around the data as it persists across the life cycle. You just don’t know the context of usage.

It really comes down to that pipeline engine when it has a broader context. Take, for example, how Ascend’s control plane maintains context. It knows everything that you’re doing downstream on a data set and is smart enough to say you’re doing things like daily, hourly, or monthly roll-ups of data and analytic-style use cases downstream from this data set. It can be really smart and partition that data dynamically for you so you don’t have to worry about it.

This is really, I think, just one of many categories where an intelligent control plane can help over time. As a user, I get to focus more on those individual use cases and that control plane can intelligently move that data around on the data lake and tune the pipelines for me based off of how I’m accessing and using the data downstream.
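
As a rough sketch of usage-aware partitioning (an illustration only, not Ascend’s engine), the snippet below picks the most frequently filtered column from a hypothetical log of downstream queries and writes the data as partitioned Parquet with pandas; it assumes pandas and pyarrow are installed, and the column names are invented.

    import pandas as pd

    # Hypothetical log of which columns downstream queries filter on most often.
    observed_filters = ["event_date", "event_date", "region", "event_date"]

    # Choose the most frequently filtered column as the partition key.
    partition_cols = [pd.Series(observed_filters).value_counts().idxmax()]

    df = pd.DataFrame(
        {
            "event_date": ["2019-10-29", "2019-10-30", "2019-10-30"],
            "region": ["us", "eu", "us"],
            "amount": [10.0, 12.5, 7.25],
        }
    )

    # Write partitioned Parquet so queries that filter on event_date can skip files.
    df.to_parquet("events_parquet", partition_cols=partition_cols, index=False)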

I was going to say maybe there’s even a component that can observe the kinds of queries that are getting run so that we can understand the most advantageous partitioning approach. Sounds like you answered that question before I even asked it, so that works well.

Also, the notion you brought up before of how ELT, while it has its advantages, can also put us in a place where we’re just repeating a lot of the same transformations because we didn’t do them upfront. Actually, at a higher level, at a meta level, it seems like – it just occurred to me while you were talking – that’s kind of what we’re doing with partitioning.

It’s not that we’re partitioning the same data over and over again but as we go from data domain to data domain, we’re thinking through those same questions and applying the same stuff somewhat manually over and over again. Why not generalize that? Why not factor that out into a platform that’s dedicated to that?

We whole-heartedly agree, and we do believe that as these intelligent control planes continue to evolve, we’ll even start to see a really exciting hybrid world of warehouses and pipelines, a far more seamless fabric. Even if I am trying to do my ad hoc exploration, if there’s a control plane observing and monitoring this, that control plane can and should be able to take what was more of an ELT approach, dynamically construct pipelines, move them to an ETL-optimized system, and simply help optimize downstream. For me as a user, I shouldn’t have to worry about that. That intelligent control plane really can help hybridize this model and give me as a user a higher level of benefit from it.

To use your term, which was bespoke, there’s a place for bespoke effort, but at some point there’s a threshold where the bespoke effort is getting replicated frequently enough that it really becomes more of something that should be engineered and not hand-crafted. Maybe just mapping out where that threshold is has been a little bit difficult.

It sounds like what you’re saying is if we can identify it, then we can say “We’ll take the baton from here and handle the drudge work of making this really bulletproof and managed and with the exception management in there and just making it work in production.”

What we’ve seen over time – and this is really the history of technology – is that the things that a few years ago really differentiated your company as a technological advantage become table stakes and commonplace as the industry evolves. Five, ten years ago, it may have been a strategic advantage for your company to be able to run a bigger Hadoop cluster than everybody else. As that’s become commoditized, it’s evolved. We saw it move from ‘how much can your Hadoop or HDFS cluster store’ to ‘how tuned can you make your Spark pipelines,’ and now to how well you construct, tune, and optimize your pipelines. The code that you write to marshal those underlying resources is also now being commoditized and should be heavily automated, as it’s not core to your business.

What’s really core to your business is how you apply your data to your business logic and your business understanding. It’s the outcome and the output of your pipelines that differentiates you. We’ll continue to see this from technologies that really help people focus on the differentiated part of their business.

Just so we can end on a high note, I think some people who are really sophisticated engineers and developers in this sphere hear about automating a bunch of the work that maybe was bespoke beforehand. As we automate, does that mean we’re sun-setting the need for the engineers, or does it mean that maybe what’s become ‘old hat’ for them can be standardized and then they can move on to even higher level stuff where their talents can really be leveraged?

Oh, yeah, I whole-heartedly agree with the latter. I don’t think there are many engineers who like getting paged at 3 in the morning because some JVM process ran out of memory and they had to go tune the cluster or repartition the database. I’d say what we’ve seen, especially with data engineering, is that we’ve been solving a lot of really painful problems and just grinding through the muck for years.

I think this lets us really start to move on from the data movement part of the problem and free up the far more interesting and exciting parts: how do we apply these large data volumes and incredible insights to far more automated and intelligent layers that really fuel great and exciting new products for the business? Hopefully we see a surge of innovation over the next few years as, frankly, we’re taking some of the best and brightest and freeing them up to go tackle bigger and more interesting challenges.

We never get to the point where engineers’ skills are not in demand. It’s just a question of how high up the value chain we want them to apply their skills. With that, we’ve actually come to the end of our half-hour conversation. I know we’re also going to have a webinar where we can focus even more. I will look forward to that, and Sean, I thank you very much for your time today. This has been a great discussion.

Fantastic. Thanks so much, Andrew.

For GigaOm, this is Andrew thanking Sean and thanking all of you who listened in. We bid you good day, good evening, good night, depending on your time zone and so forth. Thanks very much.

Apple Enters Show Business With a Black-Carpet Premiere

Source: https://www.nytimes.com/2019/10/30/business/media/apple-tv-plus.html?emc=rss&partner=rss
October 30, 2019 at 12:00PM

Apple TV Plus arrives Friday with “The Morning Show,” starring Reese Witherspoon and Jennifer Aniston, as its flagship streaming program.

Tuesday, October 29, 2019

Sony to Shut Down PlayStation Vue, a Cable Alternative

Source: https://www.nytimes.com/2019/10/29/business/sony-playstation-vue.html?emc=rss&partner=rss
October 30, 2019 at 03:43AM

Vue was started in early 2015 as a cheaper version of cable TV. It drew a slew of copycats, but customer growth slowed for many of these services as prices rose.

HBO Max Will Go Live in May, WarnerMedia Announces

Source: https://www.nytimes.com/2019/10/29/business/media/hbo-max-price.html?emc=rss&partner=rss
October 30, 2019 at 02:10AM

It’s presentation day for the future online home of “Friends” and “Game of Thrones.”

Voices in Innovation – Andrew Brust Interviews Sean Knapp of Ascend.IO

Source: https://gigaom.com/2019/10/24/voices-in-innovation-simon-gibson-interviews-sean-knapp-of-ascend-io/
October 24, 2019 at 03:00PM

[audio_player title=”Voices in Innovation Andrew Brust Interviews Sean Knapp” audio=”https://voices-in-innovation.s3.amazonaws.com/001-ViI-(00-32-06)-AB-Ascend.mp3″]

Guest

Sean Knapp is the founder and CEO of Ascend.io. Prior to Ascend.io, Sean was a co-founder, CTO, and Chief Product Officer at Ooyala. At Ooyala Sean played key roles in raising $120M, scaling the company to 500 employees, Ooyala’s $410M acquisition, as well as Ooyala’s subsequent acquisitions of Videoplaza and Nativ. He oversaw all Product, Engineering and Solutions, as well as well as defined Ooyala’s product vision for their award-winning analytics and video platform solutions. Before founding Ooyala, Sean worked at Google where he was the technical lead for Google’s legendary Web Search Frontend team, helping that team increase Google revenues by over $1B. Sean has both B.S. and M.S. degrees in Computer Science from Stanford University.

Transcript

Andrew Brust: All right, welcome to this GigaOm podcast. This is Andrew Brust from GigaOm. I’m a lead analyst focusing on data analytics, BI, AI, and all good things data. With us, we have Sean Knapp, who is the founder and CEO at Ascend. Sean, welcome.

Sean Knapp: Thank you.

I’m hoping we can talk a little bit about data pipelines but gosh, if that’s all we’re talking about, that can be a bit dry. I think what, hopefully, we can talk about is how to get beyond what has been and get to maybe a better place where data pipelines feel less like a necessary evil and more like something that can really help and perhaps are less burdensome. I may have loaded all the questions there, but hopefully that’s okay.

Can you maybe just introduce yourself a bit and the premise and the rubric of what Ascend is up to? Then maybe we can drill down on a few questions from there.

Yeah, happy to. As I introduce myself, Sean Knapp, founder and CEO of Ascend.io, I’m a software engineer by training. I’ve been building data pipelines for actually15 years now. I wrote my first MapReduce job in Sawzall back at Google in 2004 and since then have gone on to build companies and teams really heavily focused around data, analytics, ETL, and more. About four years ago, I founded Ascend.io to really help automate and optimize a lot more of how we build data pipelines and make it a lot easier for us to, frankly, do more of them faster and more efficiently.

You’ve got your work cut out for you. I forget what date you said you said you wrote your first MapReduce job, but were you really coding MapReduce in Java? Was that the substance of what you were up to at that time?

Yeah, so back when I started at Google in 2004, Google had their MapReduce framework. You usually wrote your jobs in a language called Sawzall, which I was the tech lead for the front-end team on web search. We were always writing a lot of large MapReduce jobs to analyze how our users were engaging with content on web search, what links they were clicking on, the efficacy of our various experimentation systems. We wrote a lot of analysis on usage even 15 years ago.

I know from a little bit of dabbling I’ve done with Hadoop and that flavor of MapReduce, that can get pretty in the weeds pretty quickly. Gosh, I guess even before there was MapReduce, there were all kinds of ETL packages. Pipelines are not a new concept; they’ve been with us for quite a long time. I go back 20 years in doing BI work, and we were writing SQL scripts then to do a lot of this stuff.

What have pipelines been good for, and where have they hindered us, if I can ask you those two oppositional questions?

I think pipelines have been of tremendous value and also pain for most companies for decades now. They are certainly not a new concept. I’d say the technologies that they employ and that they leverage have changed tremendously, but the core concepts of really doing data movement and transformation, the classic ETL approach, really hasn’t changed tremendously.

I’d say what pipelines have been incredibly valuable for is pulling data out of one or many systems, doing complex and interesting transformations on it, and then of course, loading it back into others, standard ETL. Where they have been increasingly valuable for companies and organizations is as those transformations get more and more complex, the pipelines get more and more valuable.

I would say where the pain really starts to emerge is frankly, as you have more data sets, higher data volumes, and more people tapping into those pipelines, this is really where the pain emerges as we get this exponential increase in complexities tied to the interconnectedness of these systems. That’s really where most of the pain points are felt is simply trying to maintain and sustain these systems as they become increasingly complex, interconnected highways, if you will, of data flowing through the enterprise.

I guess one thing I’ve seen over the years is that we’ve written scripts really with the very tactical goal of just getting data from A to B and from Form X to Form Y. Once we got it to work, we were pretty happy. That’s very ad hoc, and then to go from there to something that is operationalized is another thing. It seems like some of these scripts just became operationalized almost accidentally and they weren’t necessarily up to the task…

Yeah, I totally agree.

Go for it, yeah. I’m listening.

Yeah, I totally agree. I would say the vast majority of pipelines that have been built were built in isolation, oftentimes scripted to meet a very specific and certain need. Then lo and behold, all of a sudden, you found ten other teams were building on top of that data set that was derived and what’s probably somebody’s weekend or hack-a-thon project now is a critical and core piece of the business.

Maybe not a passion project but maybe a project under duress to meet an immediate requirement may be a little bit more focused on getting something done and not making something that’s really well-engineered and built to be repeatable and last. Here’s the thing: I’ve already shown my age. I’m coming from the old data warehouse days and the PI days, and along comes all this big data stuff and notions of ‘We’re not going to do ETL anymore; we’re going to do ELT. We’re going to do Schema on Read. We’re not going to worry so much about transforming the data until it’s time to query it.’

Also, [with] this notion of data virtualization and bringing the compute to the data rather than the other way around: the implication has been that eliminates the need for pipelines. What do you think of that prospect? Is there a kernel of truth to that? Is there no truth to that? Is it just that things are more complex than we sometimes make them out to be in taglines?

Yeah, I’d say there’s certainly some truth to it. I would say the challenge with the ELT approach is frankly, you’re repeating yourself. If you’re performing the same operation on data over, and over, and over again, it becomes inefficient and it becomes expensive. The same core principles behind why ETL is really valuable is if you know you’re going to be doing something to a piece of data, the same kind of transformation, to get it into a shared, aligned system where you can do more complex things with that piece of data – for example, pipelines are a really good fit.

What we saw and the reason that we’ve seen a surge of the ELT approach is frankly, we have a lot bigger hammers than we used to. When you get bigger hammers, a lot more things start to look like nails. There’s use cases where we can take an ELT approach where previously we just simply couldn’t. It makes life easier for a little while.

What we find, however, is eventually those approaches do start to break down a little bit as either your data volumes or your data complexities increase. Then we see similar to what we’ve seen with some of the most progressive data engineering teams today is: they still have a very strong need to pull data from disparate systems back into unified systems to do complex transformations on those data sets and actually store and persist those derived data sets as they’re optimized, and curated, and available for those downstream teams to really iterate all. That’s where we figure there’s a fit, but I do believe that over time, I would say the industry evolved not to just be ETL or even just ELT but a cascading approach to ETL, TL.

I think maybe we need to acknowledge a nuance, which is that in the world of analytics, some things are done repeatedly and operationally, and some stuff is more exploratory. Maybe for more exploratory stuff, leaving things less structured in storage makes sense. Then doing the transformations at query time makes sense.

Once we really get into productionalized questions and analyses and necessary insights, that’s where we need a greater deal of engineering and structure, and that really was the rubric of data warehouses all along. Maybe that requirement hasn’t really gone away with that in mind. That’s more of a statement than a question, I guess. Do you agree?

It is a statement, and I do agree. I do think there’s overhead to construction of pipelines that if you have an exploratory ad hoc use case, it really is important to be able to just ad hoc export data. We do think that that is why there’s this really powerful combination of that desire to do ELT and how pipelines work together. We think they both fit complementary needs.

That makes sense. Now we’re starting to get to some nuance instead of extremes. I think that’s good. We know that there’s lots of folks out there – arguably, I used to be one of them – that felt a lot more in control and comfortable just writing their own code for something every single time. That may not scale so well, so we’ve had systems for some time and we have a lot of them in the market now that let us construct pipelines visually through some kind of schematic approach where things are declared, even where we’re not coding. Does that address the issue? Does that make things less brittle? Why isn’t that good enough?

That’s a great question. We certainly have seen these systems pop up that make it easier to architect and design pipelines. In fact, even Ascend has a visual interface that we know our users really enjoy using. I think the thing that makes this harder is it’s not just about the construction of the pipeline, but also the operation of the pipeline. What we’ve seen is: most of the tools in the industry to date help you describe declaratively in far less code the operations you want to make to your data.

When it comes to the actual operation of that pipeline itself, it’s still hard. You still have to answer questions like ‘what sets of data should I persist? How do I intelligently back-fill data? How do I retract previously calculated statements? Should I partition my data?’ All these things you had to worry about when you wrote a ton of code to go to pipeline, you still actually have to worry about even in these higher-level systems.

I’d say they’re a step in the right direction, but this is also where we believe in having a fully declarative system, one that is less task-centric but far more data-centric, that understand the data far closer to how a database engine or a data warehouse engine understands the data is really where things are going.

We’re big fans and believers in this notion of declarative systems similar to what we’ve seen in infrastructure technologies like Kubernetes and similar to what we’ve seen in these warehouses with their database engines. If you can build a system that really deeply understands how the pipelines work and the nature of the data and the dependency of transformations, you can offload a lot more of that complexity to the underlying engine itself.

You segued into my next question, perhaps unwittingly. After I had worked on a number of these things for a while, I started to see patterns. As I went from one project to the next, I was able to apply some rules of thumb or heuristics to get them done a little faster. But that was just me doing it, in effect, manually.

As an industry, it feels like we’ve been building these pipelines for a long time. Are there learnings that can be applied in an automated fashion, almost as if we had rules or an expert system that really understands the generalized problem of creating pipelines, so that a lot of this stuff doesn’t have to be reinvented from database to database and project to project? First, let me just ask that about the initial authoring of a pipeline, and then I have a follow-up.

I think we, as an industry, are right on the cusp of doing so. It’s interesting because when we look at adjacent industries, very few folks are building a new database engine or a new data warehouse engine. It’s generally accepted that there are some pretty standard technologies and winners in that space.

What’s interesting, and what we’ve observed (even my team at previous companies has done the same), is that you always end up with a team grabbing some of the existing open-source technologies, like Spark, for example, and then building, as we all have, abstraction layers on top to try to better structure and automate a lot of these repeatable patterns; we’ve seen this time and time again. Even at Ascend, as we engage with dozens and dozens of different companies, we see every company trying to abstract away these complexities to make it easier for the rest of their company to self-serve and create data pipelines.

What’s really interesting is that everybody is following these same patterns but implementing them slightly differently. They all tend to be very bespoke. To your question, I do believe the industry is right on the cusp of tipping into an exciting new era in which there are standard, well-understood ways of building an intelligence layer that essentially acts as the automated engine for pipelines, much as a database engine, with its query planner and query optimizer, does for databases. We will soon start to see these really intelligent layers emerge for pipelines themselves, layers that understand far more about the data, as opposed to the tasks, and as a result offload huge amounts of the developer burden required to build and maintain pipelines.

That’s interesting, and I have a follow-up, but I also wanted to dial back to something you said earlier. My follow-up is: can this approach, in addition to being applied to create a pipeline, also deal with the stuff that always cost me sleep on projects? Sometimes a flat file that was supposed to be put somewhere didn’t get delivered, or, gosh, schemas change, either the schema in the source data or the schema in my warehouse or my OLAP cube or something like that. Again, I’m dating myself, but there it is.

Can we anticipate some of that and automate the adjustment of the pipeline? That’s my follow-up; then I have one more follow-up on the follow-up, but we’ll go one at a time.

The short answer to your question is ‘yes.’ I think the core notion behind a lot of this is: ‘Do you have that intelligence layer that understands where your data’s coming from and where it’s going to, as well as how it’s being used?’ At Ascend we call this our control plane; other teams and companies may implement it and call it something slightly different.

The key concept behind this is: do you have a system that isn’t looking at data just in the context of one transformation, or one query, or one stage, or even a whole pipeline, but instead looks at the entire ecosystem and does two things? One, can it detect changes, and even potential breaks, far upstream before they trickle down through the systems? Two, to your specific question, can it also catch a change that somebody wants to make to that OLAP cube or to a data set and say, ‘if you do this, it will have this impact downstream,’ because that system, that control plane, is monitoring the entire data ecosystem?
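
A toy sketch of both halves of that idea, assuming the control plane keeps a dependency graph and the last-seen schema of each data set (all of the names below are hypothetical): it can flag schema drift at the source before downstream jobs fail, and it can answer ‘what is affected if this changes?’ by walking the graph.

# Dependency graph: data set -> the data sets that consume it directly.
consumers = {
    "raw_orders": ["clean_orders"],
    "clean_orders": ["daily_revenue", "ml_features"],
    "daily_revenue": ["finance_dashboard"],
}

expected_schema = {"raw_orders": {"order_id": "INT", "amount": "FLOAT", "dt": "DATE"}}

def schema_drift(data_set: str, observed: dict) -> dict:
    """Columns added, removed, or retyped relative to what the pipeline expects."""
    expected = expected_schema[data_set]
    return {
        "added": sorted(set(observed) - set(expected)),
        "removed": sorted(set(expected) - set(observed)),
        "retyped": sorted(c for c in expected if c in observed and observed[c] != expected[c]),
    }

def downstream_impact(data_set: str) -> list:
    """Everything that would be affected by a change to this data set."""
    impacted, stack = [], list(consumers.get(data_set, []))
    while stack:
        node = stack.pop()
        if node not in impacted:
            impacted.append(node)
            stack.extend(consumers.get(node, []))
    return impacted

# An upstream column was retyped and a new column appeared:
print(schema_drift("raw_orders", {"order_id": "INT", "amount": "STRING", "dt": "DATE", "channel": "STRING"}))
# Everything that change could ripple into:
print(downstream_impact("raw_orders"))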

This is a whole new level of intelligence that we’re certainly capable of building. I think it just tends to be hard technology, which takes a lot of time; honestly, database engines took a long time to create and then refine as well, but the benefits are astronomically high as a result.

You’re getting me excited, because now I have this idea in my head that we could almost have a trigger in the pipeline, as opposed to a trigger in the database, so that if something changes in the schema, or something changes in a requirement, or a source system isn’t responding to a query, or a source file isn’t present, we have a way of dealing with that. I don’t know if I’m extrapolating too far or if that’s somewhere in the right neighborhood, but it’s intriguing.

You mentioned before the whole notion of whether we have to partition the data. That’s been important for a long time, too, but it also seems like in the world of data lakes, when we’re working with columnar file formats like ORC and Parquet, there are umpteen opportunities to partition the data: into an arbitrary number of chunks, or across certain dimensions or certain levels within a dimension, partitioning across year, partitioning across month, and then just saying, “oh, partition it down into 200 pieces.”

Of course, the hope there is that when you run a reasonably bounded query, you can skip over a lot of the data that’s not important. The little bit of that I’ve done, I feel like I’m guessing. What can you do to automate not just the mechanics of implementing partitions but the development of a partitioning strategy, and help me with that so I don’t have to guess at it?
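
For readers who haven’t had to make that guess by hand, it usually looks something like the following Spark sketch (the paths and column names are made up, and whether year/month, or 200 chunks, is the right choice is exactly the judgment call being described):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("manual-partitioning").getOrCreate()

events = spark.read.json("s3://my-bucket/raw/events/")  # hypothetical source

# Hand-picked strategy: derive year/month columns and hope they match how
# downstream queries actually filter the data.
(events
    .withColumn("year", F.year("event_ts"))
    .withColumn("month", F.month("event_ts"))
    .repartition(200)                      # the arbitrary "200 pieces" guess
    .write.mode("overwrite")
    .partitionBy("year", "month")
    .parquet("s3://my-bucket/curated/events/"))

# A query that filters on the partition columns can skip most of the files.
spark.read.parquet("s3://my-bucket/curated/events/") \
    .filter("year = 2019 AND month = 10") \
    .count()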

It’s a really good question. The general thesis and position here is that nobody should ever really have to worry about partitioning. That should ultimately be the job of the pipeline engine, if you will. Just as a database engine worries about how it lays data out on disk and maintains indices to optimize how you pull it off disk or out of memory, the very same thing should be happening in the pipeline engine.

When we think about where we can and should be evolving as an industry: we have these really powerful technologies, like our cloud blob stores and processing engines like Spark, that are powerful, capable, and scalable, but they don’t maintain that same context around the data as it persists across the life cycle. You just don’t know the context of usage.

It really comes down to the pipeline engine having that broader context, which is, for example, what Ascend’s control plane maintains. It knows everything you’re doing downstream of a data set and is smart enough to recognize that you’re doing daily, hourly, or monthly roll-ups and analytic-style use cases downstream from it. It can be really smart and partition that data dynamically for you so you don’t have to worry about it.

This is really, I think, just one of many categories where an intelligent control plane can help over time. As a user, I get to focus on the individual use cases, and the control plane can intelligently move the data around on the data lake and tune the pipelines for me based on how I’m accessing and using the data downstream.
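
One way to picture what such a control plane might do under the hood (a purely illustrative sketch, not a description of Ascend’s actual implementation; the query log format is invented): tally the columns that downstream queries filter and group on, and propose partition keys from the most heavily used ones.

from collections import Counter

# Observed downstream usage: which columns each query filtered or grouped on.
query_log = [
    {"filters": ["event_date"], "group_by": ["event_date"]},
    {"filters": ["event_date", "region"], "group_by": ["region"]},
    {"filters": ["event_date"], "group_by": ["event_date", "product_id"]},
]

def suggest_partition_keys(log, max_keys=2):
    """Rank columns by how often downstream queries prune or aggregate on them."""
    usage = Counter()
    for q in log:
        usage.update(q["filters"])
        usage.update(q["group_by"])
    return [column for column, _ in usage.most_common(max_keys)]

print(suggest_partition_keys(query_log))  # ['event_date', 'region']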

I was going to say maybe there’s even a component that can observe the kinds of queries that are getting run so that we can understand the most advantageous partitioning approach. Sounds like you answered that question before I even asked it, so that works well.

Also, there’s the notion you brought up before: ELT, while it has its advantages, can also put us in a place where we’re just repeating a lot of the same transformations because we didn’t do them upfront. Actually, at a higher level, at a meta level, it seems like (it just occurred to me while you were talking) that’s kind of what we’re doing with partitioning.

It’s not that we’re partitioning the same data over and over again, but as we go from data domain to data domain, we’re thinking through the same questions and applying the same approaches somewhat manually, over and over again. Why not generalize that? Why not factor it out into a platform that’s dedicated to it?

We wholeheartedly agree, and we do believe that as these intelligent control planes continue to evolve, we’ll start to see a really exciting hybrid world of warehouses and pipelines, a far more seamless fabric. Even if I’m doing ad hoc exploration, if there’s a control plane observing and monitoring it, that control plane can and should be able to take what was an ELT approach, dynamically construct pipelines, move them to an ETL-optimized system, and simply help optimize things downstream. As a user, I shouldn’t have to worry about that. That intelligent control plane really can hybridize this model and give me, as a user, a higher level of benefit from it.
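
A toy sketch of that hybridization, under the assumption that the control plane can see the ad hoc queries being run (the threshold and the query fingerprinting below are arbitrary illustrations): when the same transformation keeps recurring, it gets promoted from query-time ELT to a persisted pipeline stage.

from collections import Counter

PROMOTION_THRESHOLD = 3  # arbitrary: how many repeats before we materialize

def fingerprint(sql: str) -> str:
    # Crude normalization so trivially different spellings of the same query match.
    return " ".join(sql.lower().split())

observed = Counter()
materialized = set()

def on_ad_hoc_query(sql: str) -> None:
    key = fingerprint(sql)
    observed[key] += 1
    if observed[key] >= PROMOTION_THRESHOLD and key not in materialized:
        materialized.add(key)
        print("promoting to a persisted pipeline stage:", key)

for _ in range(4):
    on_ad_hoc_query("SELECT region, SUM(amount) FROM clean_orders GROUP BY region")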

To use your term, ‘bespoke’: there’s a place for bespoke effort, but at some point there’s a threshold where the bespoke effort is getting replicated frequently enough that it really becomes something that should be engineered, not hand-crafted. Maybe mapping out where that threshold is has been a little bit difficult.

It sounds like what you’re saying is that if we can identify that threshold, then we can say, “We’ll take the baton from here and handle the drudge work of making this really bulletproof and managed, with the exception management in there, and just make it work in production.”

What we’ve seen over time (and this is really the history of technology) is that the things that differentiated your company as a technological advantage a few years ago become table stakes and commonplace as the industry evolves. Five or ten years ago, it may have been a strategic advantage for your company to be able to run a bigger Hadoop cluster than everybody else. As that’s become commoditized, it’s evolved. We saw it move from how much your Hadoop or HDFS cluster can store, to how well-tuned you can make your Spark pipelines, to what we’re seeing now: how well you construct, tune, and optimize your pipelines. The code you write to marshal those underlying resources is also being commoditized and should be heavily automated, because it’s not core to your business.

What’s really core to your business is how you apply your data to your business logic and your business understanding. It’s the outcome and the output of your pipelines that differentiates you. We’ll continue to see technologies that really help people focus on the differentiated part of their business.

Just so we can end on a high note: I think some really sophisticated engineers and developers in this sphere hear about automating a bunch of the work that used to be bespoke and wonder what that means for them. As we automate, does that mean we’re sunsetting the need for those engineers, or does it mean that what’s become ‘old hat’ for them can be standardized, and they can move on to even higher-level work where their talents can really be leveraged?

Oh, yeah, I wholeheartedly agree with the latter. I don’t think there are many engineers who like getting paged at 3 in the morning because some JVM process ran out of memory and they had to go tune the cluster or repartition the database. I’d say what we’ve seen, especially in data engineering, is that we’ve been solving a lot of really painful problems and just grinding through the muck for years.

I think this lets us free ourselves from the data-movement part of the problem and move on to the far more interesting and exciting parts: how do we apply these large data volumes and incredible insights to more automated and intelligent layers that really fuel great and exciting new products for the business? Hopefully we’ll see a surge of innovation over the next few years as, frankly, we take some of the best and brightest and free them up to go tackle bigger and more interesting challenges.

We never get to the point where engineers’ skills are not in demand; it’s just a question of how high up the value chain we want them to apply those skills. With that, we’ve come to the end of our half-hour conversation. I know we’re also going to have a webinar where we can focus even more. I’ll look forward to that, and Sean, thank you very much for your time today. This has been a great discussion.

Fantastic. Thanks so much, Andrew.

For GigaOm, this is Andrew thanking Sean and thanking all of you who listened in. We bid you good day, good evening, good night, depending on your time zone and so forth. Thanks very much.

WhatsApp Says Israeli Firm Used Its App in Spy Program

Source: https://www.nytimes.com/2019/10/29/technology/whatsapp-nso-lawsuit.html?emc=rss&partner=rss
October 29, 2019 at 10:34PM

In a federal lawsuit, the messaging service said the firm had targeted 1,400 WhatsApp accounts, including 100 journalists and human-rights activists.

Debunking 4 Viral Rumors About the Bidens and Ukraine

Source: https://www.nytimes.com/2019/10/29/business/media/fact-check-biden-ukraine-burisma-china-hunter.html?emc=rss&partner=rss
October 29, 2019 at 10:13PM

As lawmakers examine whether President Trump pushed Ukraine to investigate the Biden family, here are some of the most prominent falsehoods that have spread online and an explanation of what really happened.

Google, in Rare Stumble, Posts 23% Decline in Profit

Source: https://www.nytimes.com/2019/10/28/technology/google-alphabet-earnings.html?emc=rss&partner=rss
October 29, 2019 at 02:03AM

Alphabet, Google’s parent, said its profits fell after a sharp increase in spending for research and development.

Read the Letter Facebook Employees Sent to Mark Zuckerberg About Political Ads

Source: https://www.nytimes.com/2019/10/28/technology/facebook-mark-zuckerberg-letter.html?emc=rss&partner=rss
October 28, 2019 at 08:24PM

Hundreds of Facebook employees signed a letter decrying the decision to let politicians post any claims they wanted — even false ones — in ads on the site.

Using Virtual Reality to Plan Your Actual Retirement

Source: https://www.nytimes.com/2019/10/25/business/retirement-savings.html?emc=rss&partner=rss
October 25, 2019 at 08:39PM

V.R. and online visual aids are giving workers a better idea of how much they need to save.

Australia Proposes Face Scans for Watching Online Pornography

Source: https://www.nytimes.com/2019/10/29/world/australia/pornography-facial-recognition.html?emc=rss&partner=rss
October 29, 2019 at 03:58PM

As a government agency seeks approval of a facial recognition system, it says one use for it could be verifying the age of people who want to view pornography online.

Amazon Turns to More Free Grocery Delivery to Lift Food Sales

Source: https://www.nytimes.com/2019/10/29/technology/amazon-prime-fresh-whole-foods-grocery-delivery.html?emc=rss&partner=rss
October 29, 2019 at 01:00PM

The company announced free two-hour food delivery for Prime members in about 20 major metropolitan areas.

‘OK Boomer’ Marks the End of Friendly Generational Relations

Source: https://www.nytimes.com/2019/10/29/style/ok-boomer.html?emc=rss&partner=rss
October 29, 2019 at 12:00PM

Now it’s war: Gen Z has finally snapped over climate change and financial inequality.

