Jekyll2021-12-25T22:21:52+00:00https://mokshith.xyz/feed.xmlMokshith’s ThoughtsThis is just a page where I talk about random things I feel like talking about once a week. Mostly, I think I can look back at this in like 20 years and be like, damn.Mokshith Voodarlamvoodarla@gmail.comSolving Computer Vision: Applying First Principles2021-12-25T07:00:00+00:002021-12-25T07:00:00+00:00https://mokshith.xyz/tech/2021/12/25/revisiting-computer-vision<h2 id="introduction">Introduction</h2>
<p>In my last post, from a few months ago, I shared my view on the future of computer vision. I suggest reading that first to see how my perspective has changed.</p>
<p>The biggest change in perspective comes from a practice I’ve been using for some time now. I only realized recently that this practice is called <a href="https://fs.blog/first-principles/">First Principles Thinking</a>. It’s the idea that you condense a complicated problem down to the fewest facts you know to be true about it, and then extrapolate everything else you can from those simple facts.</p>
<h2 id="is-it-really-a-problem-with-computer-vision">Is it really a problem with computer vision?</h2>
<p>Our claim is that computer vision has many issues today, preventing it from becoming ubiquitous. More specifically, but still pretty generally, we claim that the bottleneck has to do with the data we use to train computer vision models. Let’s take a step back, though, and understand what got us to this stage and how computer vision fits into a more general class of data analysis problems. To do that, let’s start with the history of storing data on computers.</p>
<h3 id="history-of-data-storage-and-processing">History of Data Storage and Processing</h3>
<p>The database was pioneered in the 1960s and 70s. From the <a href="https://www.geeksforgeeks.org/history-of-dbms/">DBMS</a> to the <a href="https://en.wikipedia.org/wiki/Relational_database">RDBMS</a>, which brought more efficient ways of storing, processing, and later querying data using languages like SQL, the database has constantly improved. Sometime later came <a href="https://www.mysql.com/">MySQL</a>, an open-source RDBMS.</p>
<p>In the late 90s and early 2000s, NoSQL became much more popular. The internet needed faster processing of unstructured data, and NoSQL could handle both structured and unstructured data while helping scale some of the internet’s most well-known services, like Google, Facebook, and Twitter. The main themes were scalability (large datasets) and flexibility (many schemas).</p>
<p>At some point down the line, these services were scaling really well but had become messy: different devices or clients were writing to different databases, and different modes of information were stored in different places. The buzz-phrase was that data had become siloed, which made it hard for businesses to make data-driven decisions. This led to the creation of services like <a href="https://www.snowflake.com/">Snowflake</a>, which provide what we call <a href="https://aws.amazon.com/data-warehouse/">data warehousing</a>. Other companies like <a href="https://segment.com/">Segment</a> built businesses around this problem of siloed data in their own niches, like the consolidation of customer data into a single API. A whole separate set of services grew out of this, with companies like <a href="https://www.fivetran.com/">Fivetran</a> and <a href="https://airbyte.io/">Airbyte</a> acting as easy plug-and-play connectors that update your data warehouse periodically from hundreds of different sources.</p>
<p>All things considered, the ecosystem is large and prosperous today. However, a key thing to note is that this large ecosystem is limited to the most primitive data types, like text and numbers. Sure, we can store and retrieve other types of raw data like audio and images, but nothing at the level of the operations we can perform on text and numbers, such as aggregation, counts, and transformations.</p>
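To make the asymmetry concrete, here’s a minimal sketch using Python’s built-in <code>sqlite3</code> (the table and data are made up for illustration): the database can aggregate the numeric column instantly, while the image column is just opaque bytes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (name TEXT, price REAL, photo BLOB)")
conn.executemany(
    "INSERT INTO parts VALUES (?, ?, ?)",
    [("bolt", 0.10, b"<image bytes>"), ("nut", 0.05, b"<image bytes>")],
)

# Text and numbers: rich, queryable semantics out of the box.
count, avg_price = conn.execute(
    "SELECT COUNT(*), AVG(price) FROM parts"
).fetchone()

# The photo column: the database can store and return the raw bytes,
# but it has no notion of what the images actually contain.
photos = [row[0] for row in conn.execute("SELECT photo FROM parts")]
```

Counting and averaging the primitive columns is one line of SQL; there is no equivalent one-liner for “count the photos that contain a defect.”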
<h3 id="new-types-of-data">New Types of Data</h3>
<p>There is already so much attention put toward companies solving these traditional data problems and, don’t get me wrong, there’s still so much to solve there. From <a href="https://www.secoda.co/">better internal data search tools</a> to <a href="https://supabase.com/">open-source database alternatives</a>, there is still so much going on. But I think we’re set for a larger revolution within the niche of non-traditional data types. Think images, videos, LiDAR, RADAR, audio, and more.</p>
<p>Think about robotics, for example. At the end of 2019, there were about 2.7 million operational industrial robots, and the projection for the end of 2021 is close to 3.8 million (source: <a href="https://ifr.org/ifr-press-releases/news/record-2.7-million-robots-work-in-factories-around-the-globe">IFR</a>). A growing number of these robots will have all types of sensors, either to control the robot or to collect data for later analysis.</p>
<p>Imagine parallel applications in factories that monitor worker safety using cameras or detect defective parts. Or applications in city streets and stores for security, shopper and traffic analysis, and so on. We are in an age where our ability to work with these more complex data types has increased by multiple orders of magnitude thanks to analytic techniques like deep learning. This will only accelerate the number of applications we come up with for this kind of technology.</p>
<h3 id="whats-actually-going-wrong">What’s actually going wrong</h3>
<p>In my last post, I wrote about the ML-specific issues with computer vision today and why we’re struggling from that perspective. We struggle to build good computer vision models because it’s really hard to build good datasets, and building good datasets is hard because of poor tooling. But I’d now argue that such a statement doesn’t get to the root cause of why the tooling is poor. Analyzing the history of databases, I’d now argue the following:</p>
<p><strong>Our databases weren’t designed for handling more than simple text and numbers.</strong></p>
<p>There is a lot of work happening on building ML-specific tools for this, but I’m not sure that’s the best mindset to have as we build these things. The problems we’re talking about are first being experienced in communities like computer vision and ML because that’s where these new types of data are being most quickly aggregated today. They’re not the only industries, though, that can benefit from storage more purpose-built for richer data types. How do e-commerce stores manage their visual assets? Or gaming and animation studios?</p>
<h2 id="pondering-a-solution">Pondering a Solution</h2>
<p>The biggest difference between traditional data like numbers and text and newer types of data like images, videos, audio, and 3D scans is that these new types of data carry greater semantic meaning. They’re still stored solely as their individual bytes, however, rather than with a stronger connection to what they actually represent. There are a few reasons for this, including the lack of robust all-purpose models and high compute costs. Imagine being able to perform operations like equality, difference, grouping, and counting with these non-traditional data types as trivially as we do with text and numbers.</p>
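As a sketch of what “semantic equality” could look like, assuming we had an embedding model that maps each image to a vector (the vectors below are toy values, not real model outputs), comparing two images reduces to comparing their vectors:

```python
import numpy as np

def cosine_sim(a, b):
    """Similarity in [-1, 1]; near 1 means 'semantically close'."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for what a vision model might produce.
cat_photo_1 = np.array([0.9, 0.1, 0.0])
cat_photo_2 = np.array([0.8, 0.2, 0.1])
truck_photo = np.array([0.0, 0.1, 0.9])

# "Equality" becomes a similarity threshold; grouping and counting
# would follow from clustering vectors in the same way.
same_thing = cosine_sim(cat_photo_1, cat_photo_2) > 0.9
```

The design choice here is that the database would index meaning (vectors) alongside bytes, so grouping and counting become nearest-neighbor queries rather than manual labeling jobs.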
<p>I think we have the necessary tech today to build such a solution. Questions remain about who wants this right now, how they would use it, and how much they would pay for it.</p>
<p>Answering those questions is still a work in progress. Keep following these posts; I guarantee clear answers to those questions within three months’ time.</p>Mokshith Voodarlamvoodarla@gmail.comIntroductionThe Future of Computer Vision2021-08-21T07:00:00+00:002021-08-21T07:00:00+00:00https://mokshith.xyz/tech/2021/08/21/future-of-computer-vision<h2 id="introduction">Introduction</h2>
<p>Computer vision has the potential to impact so much of our world, but little has actually come to fruition. If you think about it, the only widespread uses of computer vision have been selfie filters, Amazon product recommendations, Facebook feed tagging, and a select few others. There are the applications in robotics and self-driving, which we’ve been working on for 10+ years but with no widespread use as of yet, along with all the other small solutions in security cameras, etc. My view is that computer vision can become “Google Analytics but for the real world”. Google Analytics made it really easy for companies to manage and track site traffic and then make changes and optimizations to the way their sites work. In the same way, computer vision will enable this in factories, malls, town centers, traffic-ridden streets, shipping docks, and almost any other physical place you can think of.</p>
<p>This changes the way our global supply chain works and can create a lot of value through simple collection of these insights.</p>
<h2 id="whats-wrong">What’s Wrong?</h2>
<p>What’s wrong though? Why hasn’t computer vision been able to be deployed in most of these scenarios? It comes down to the way these algorithms work today, how they’re built, and how quickly they can break down.</p>
<p>Most of the recent success of computer vision algorithms has come from <a href="https://en.wikipedia.org/wiki/Deep_learning">deep learning</a> and the ability to fit more and more complex functions. Deep learning is what most self-driving cars use today to detect objects of interest while driving. It turns out, however, that these algorithms aren’t really generalizable. François Chollet, the creator of Keras, exemplifies this perfectly with the following tweet:</p>
<p><img src="https://drive.google.com/uc?id=1sabYomoqRX59VsvuoSxMtRDE07xDkl-H" alt="fchollet tweet" /></p>
<p>We haven’t been working on self-driving for 10+ years because we feel like it. It’s because we’re trying to collect as many “edge cases” as we can in our datasets until we feel comfortable enough leaving it up to the cars.</p>
<p>While this is <a href="https://twitter.com/FSD_in_6m/status/1400207129479352323">best exemplified with self-driving cars</a>, this limitation exists in virtually every application of computer vision. Teams across industry sometimes assume these algorithms are magic and automatically generalize. This is a myth, and a dangerous one. Self-driving car companies are some of the most <a href="https://www.marketplace.org/shows/marketplace-tech/funding-is-pouring-in-to-companies-trying-to-crack-self-driving-tech/">well-funded companies</a> and still have these issues. Imagine the mistakes a smaller, scrappier team might make.</p>
<p>There are two important takeaways from the above.</p>
<ol>
<li>Computer vision models aren’t generalizable.</li>
<li>Few teams have the resources necessary to follow best-practices when building these models.</li>
</ol>
<h5 id="computer-vision-models-arent-generalizable">Computer vision models aren’t generalizable</h5>
<p>Earlier in this post, we saw examples of how computer vision algorithms break down, even when built by the most qualified and well-equipped teams. This breakdown in model performance can be attributed to the following two factors:</p>
<ol>
<li>Building the right datasets to train models on</li>
<li>Properly evaluating real-world performance of models</li>
</ol>
<p>Building the right datasets to train models on involves <a href="https://scale.com/">amazing label quality</a> (in the case of supervised problems) and a way to pick which data to actually label and train on. Models can struggle with <a href="https://en.wikipedia.org/wiki/Concept_drift">data drift</a> and <a href="https://en.wikipedia.org/wiki/Catastrophic_interference">catastrophic forgetting</a>. Given that our datasets always differ in distribution from the real world, whose distribution constantly changes and may under-represent scenarios and edge cases important to our use case, we must be wary of these issues. Methods such as <a href="https://en.wikipedia.org/wiki/Active_learning_(machine_learning)">active learning</a> are being more widely used by teams like <a href="https://www.tesla.com/autopilot">Autopilot at Tesla</a> and are being integrated into services like <a href="https://scale.com/nucleus">Scale</a> and <a href="https://www.aquariumlearning.com/">Aquarium Learning</a> to make it easier for other teams to do the same.</p>
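The core step of active learning can be sketched with uncertainty sampling: score each unlabeled example by the model’s predictive entropy and send the least-confident ones off for labeling. This is a minimal illustration, not any particular team’s pipeline; the probabilities below are made up.

```python
import numpy as np

def uncertainty_sample(probs, k):
    """Indices of the k pool samples the model is least sure about,
    scored by predictive entropy over class probabilities."""
    probs = np.asarray(probs, dtype=float)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(-entropy)[:k]  # highest entropy first

# Class probabilities for three unlabeled images (toy values).
pool_probs = [
    [0.98, 0.02],  # confident: labeling this adds little
    [0.55, 0.45],  # uncertain: the most informative to label
    [0.90, 0.10],
]
to_label = uncertainty_sample(pool_probs, k=1)
```

Labeling budget then goes to the samples the model finds hardest, rather than to a random draw from the pool.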
<p>There is still a lot of active work happening in this space, however, with some controversy around whether active learning works better than random sampling, and around figuring out at which point we can say these models are actually ready.</p>
<p>Then comes the problem of actually evaluating models properly after they’re trained. Most of the time, teams take a single aggregate metric, like accuracy on a test set, as their north-star metric and compare the different models they train solely on that.</p>
<p>This methodology is flawed, however, because it faces the issue of <a href="https://arxiv.org/abs/1909.12475">hidden stratification</a>. Take the example of two models trained to detect people. Say one has an overall accuracy of 90% and the other 95%, but model 1 performs at 95% at night while model 2 performs at 90% at night. If you’re deploying in a room that tends to stay dark, model 1 is better even though its overall accuracy is lower. The methodology also fails because of the way a test set is sourced: a random sample’s distribution already differs from the real world’s, and so isn’t representative of real-world performance.</p>
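The arithmetic behind that example is worth spelling out. In this hypothetical, a test set dominated by daytime images makes model 2 look better overall even though model 1 is stronger at night (the per-condition accuracies are invented to match the figures above):

```python
# Hypothetical per-condition accuracies consistent with the example.
model_1 = {"day": 0.894, "night": 0.95}   # overall ~90%
model_2 = {"day": 0.956, "night": 0.90}   # overall ~95%

# A test set that is 90% daytime images hides the night-time gap.
day_frac, night_frac = 0.9, 0.1

def overall(acc):
    """Aggregate accuracy implied by the test set's day/night mix."""
    return day_frac * acc["day"] + night_frac * acc["night"]
```

The single aggregate number ranks model 2 higher, yet for a dark-room deployment the relevant slice says the opposite.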
<p>A better way to break this problem down is in terms of unit tests: scenarios that are conceptually important to wherever you might be deploying a model. Taking the people-detection model again, we might build different test sets based on the following factors to ensure the model performs at a similar level regardless of:</p>
<ol>
<li>Lighting</li>
<li>Gender</li>
<li>Type of clothing</li>
<li>Race</li>
<li>Scene layout type</li>
<li>more…</li>
</ol>
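A sketch of what evaluating along these factors could look like, assuming each test sample carries metadata tags (the records and field names below are toy data invented for illustration):

```python
from collections import defaultdict

def slice_accuracy(records, factor):
    """Accuracy per value of one metadata factor, e.g. 'lighting'."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        key = r["meta"][factor]
        total[key] += 1
        correct[key] += int(r["pred"] == r["label"])
    return {k: correct[k] / total[k] for k in total}

# Toy people-detection results with metadata attached to each sample.
records = [
    {"pred": 1, "label": 1, "meta": {"lighting": "day"}},
    {"pred": 1, "label": 1, "meta": {"lighting": "day"}},
    {"pred": 0, "label": 1, "meta": {"lighting": "night"}},
    {"pred": 1, "label": 1, "meta": {"lighting": "night"}},
]
by_lighting = slice_accuracy(records, "lighting")
```

One call per factor yields a per-slice report instead of a single aggregate number, which is exactly the context needed for deployment decisions.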
<p>These factors help us decide which scenarios our model performs better in, and let us make deployment decisions comparing models with this context. The issue right now is that these sets are really hard to build: most teams don’t get these other types of metadata labeled as part of each sample (<a href="https://www.marketplace.org/shows/marketplace-tech/funding-is-pouring-in-to-companies-trying-to-crack-self-driving-tech/">unless they have a lot of budget</a>), which makes building these subsets a manual process that engineers have to go through.</p>
<h5 id="few-teams-have-the-resources-necessary-to-follow-best-practices-when-building-these-models">Few teams have the resources necessary to follow best-practices when building these models</h5>
<p>Companies like Google, Facebook, Tesla, Amazon, and Snap are the only ones who’ve been able to successfully deploy computer vision at a larger scale. It isn’t a coincidence that all of these companies are valued at over $100B. They all…</p>
<ol>
<li>have orders-of-magnitude larger budgets for CV projects</li>
<li>are talent aggregators</li>
<li>have more people working on these problems</li>
<li>have built a lot of internal tooling to help themselves</li>
</ol>
<p>Smaller teams will constantly take shortcuts like the following, just to name a few:</p>
<ol>
<li>Less rigorous hyper-parameter tuning</li>
<li>Random sample / manual test set curation rather than metadata tagging to build better test subsets</li>
<li>Non-implementation of active learning (hard to implement internally)</li>
</ol>
<p>This makes models produced by these smaller teams objectively worse. It also seems inefficient that every small company tries to attract talent to work on its specific computer vision problem when, in reality, most of these companies work with similar paradigms, substituting only the thing they’re predicting and the scenario they’re deploying in.</p>
<h2 id="scaling-computer-vision">Scaling Computer Vision</h2>
<p>My view here is that it’s infeasible to continue with the development of computer vision in the way it’s going now. There are so many applications of the technology in our supply chain that it requires smaller teams and startups to tackle them. However, these teams should not keep taking shortcuts and should not have to build computer vision teams and infrastructure from the ground up.</p>
<p>Why not build a talent aggregator that’s specifically equipped to build computer vision for everyone, and serve it as models-as-a-service? This seems so much more efficient. This talent aggregator would build a team and all the internal infrastructure necessary to quickly and reliably ship computer vision models to companies that need to apply them to a specific task. The goal of other companies would then be to collect as much data as they can and build the other software and insights around the model’s results. We’re already seeing this with teams like <a href="https://scale.com/blog/scale-document-ai">Scale Document</a>. Abstract away the model development lifecycle.</p>
<p>The way to build a successful business here is to find the right attack vector to then scale to a talent aggregator that builds all the relevant infrastructure to quickly ship good computer vision.</p>
<p>This is what I think makes sense to dedicate the next decade of my life to given past failed ventures like <a href="https://placeware.io/">Placeware</a> and the constant pains I’ve seen working at <a href="https://corporate.ford.com/operations/autonomous-vehicles.html">Ford</a>, <a href="https://scale.com/blog/scale-document-ai">Scale</a>, and <a href="https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/">NVIDIA</a>. It’s the single greatest thing I can do with the priors of my past experiences, current interests, and current timing of problems.</p>Mokshith Voodarlamvoodarla@gmail.comIntroductionFrictionless Software2021-07-25T07:00:00+00:002021-07-25T07:00:00+00:00https://mokshith.xyz/tech/2021/07/25/users<p>This post was written after reading the <a href="https://1729.com/the-billion-user-table">The Billion User Table</a> on <a href="https://1729.com/">1729</a>.</p>
<h2 id="introduction">Introduction</h2>
<p>If you’re an avid user of any service on the internet, you probably have an account for it. Of course there are exceptions, but this is mostly true. Some of these online services have become so valuable and are used by billions (and are worth billions). This post will explore how some of these services became valuable over time, why that value can’t be so easily duplicated, and how this might change in the future.</p>
<h2 id="internet-services-of-value">Internet Services of Value</h2>
<h3 id="consumer-services">Consumer Services</h3>
<p>To start, let’s take the example of Twitch.</p>
<p><a href="https://www.twitch.tv/">Twitch</a> is an online streaming platform, targeted toward gamers but really for any type of streaming. Twitch started off as something called Justin.tv, which was literally a guy named <a href="https://en.wikipedia.org/wiki/Justin_Kan">Justin</a> living his boring Silicon Valley life, streamed through a clunky camera; that was all you could watch on the site. Eventually they pivoted to a few different things, with gaming being the one that took off. Now there is a whole community around Twitch, with donations, cheers, community-specific language like <a href="https://www.urbandictionary.com/define.php?term=Pog">pog</a>, and more.</p>
<p>If we think about the value Twitch provided, it was a platform that focused on a specific community, building community-specific features around the general idea of streaming. Twitch became its own little community with its own slang and jokes, and that’s what made joining more valuable for each user.</p>
<p>Sure, someone could go create a streaming site again in 2021 but nobody would want to use it. That’s because everyone is on Twitch.</p>
<p>This is the same idea behind sites like YouTube, TikTok, and Twitter. Their value isn’t the actual technology itself; it’s the entire network. This is more commonly referred to as <a href="https://en.wikipedia.org/wiki/Network_effect">network effects</a>.</p>
<h3 id="enterprise-software">Enterprise Software</h3>
<p>It might be harder to pinpoint these network effects in other software like enterprise. However, these effects are still there.</p>
<p>Let’s take the example of <a href="https://slack.com/">Slack</a>. Slack changed the way we communicate within companies (some arguing for the better, others for the worse). Regardless, thousands of companies use it today because of how much more efficient communication became, almost making email an obsolete way to communicate within companies.</p>
<p>One example of Slack’s network effect is that as employees switch companies, they already know how to use Slack and don’t need as long an onboarding as they would with other software. This is powerful, but not as powerful as the network effects of something like Twitch.</p>
<p>A more powerful network effect was introduced with a feature called <a href="https://slack.com/connect">Slack Connect</a>, which introduced the idea of creating a channel for collaboration between two different companies. So when working with clients, you don’t have to email anymore; you just use another Slack channel embedded within your own workspace. This makes communication quicker, and as efficient as speaking with co-workers. Now that’s something powerful, since it has direct implications for contract dollar amounts and customer satisfaction. If you don’t use Slack, you can’t do this.</p>
<h2 id="limitations">Limitations</h2>
<p>Many successful companies today have built amazing walled gardens that leave you quasi-locked-in to a piece of software once you’ve used it for a while. This is good for business, but is it good for the users?</p>
<p>My answer to this is no.</p>
<p>Take the example of some incumbent software like AWS. Say they start slowly hiking prices, having more downtime, or suffering more security breaches. You’re inclined to switch to another provider, but you can’t, because you’re locked into their service and switching would cost too much engineering time.</p>
<p>Take another example, <a href="https://scale.com/">Scale AI</a>. Scale’s main business is labeling data for companies. This could be images, text, documents, 3D, anything really. They’ve built an amazing company around the marriage of tech and operations, which has enabled high-quality data labels and now the pursuit of being the provider for almost anything MLOps-related. They’ve started building out <a href="https://scale.com/nucleus">Scale Nucleus</a>, for example, which acts as a “Mission Control for Data”, allowing teams to explore, group, and find failures within datasets to then help improve models. Many other companies are building similar software, like <a href="https://www.aquariumlearning.com/">Aquarium</a> and <a href="https://voxel51.com/">Voxel51</a>, but Scale’s main selling point is the “one click send to Scale for labeling” feature, which few other companies can match unless they put in extra work to integrate with Scale’s API. Though Scale might allow this integration today, it may choose to disable it in the future, which would kill competition for its Nucleus platform. If you like everything else about a different platform better but still need the one-click-to-label feature, you are forced to use Nucleus.</p>
<p>This is a hyper-specific use case, but the same holds for a lot of other consumer and enterprise software: you can’t “transfer a social following” to another site or “transfer message history” from Slack to Microsoft Teams, for example.</p>
<h2 id="solution-the-billion-user-table">Solution: The Billion User Table</h2>
<p>The billion user table is an idea, powered by crypto, that essentially says one login for everything. It’s different from something like “Sign in with Google”, though, because it can store information of any form with a user account: anything from a list of followers to a list of data labeling projects. Rather than information being siloed in Scale’s database (preventing companies like Aquarium from integrating) or being locked into a cloud provider like AWS, things become frictionless: there are agreed-upon protocols, and any provider who follows those protocols can access the full set of users, regardless of how big or small the provider is. It could mean that a click of a button switches a company from Google Cloud to AWS, from Slack to Microsoft Teams, or from Scale AI to another labeling competitor.</p>
<p>The benefits of this are immense. It means companies have to keep innovating to keep their active user base, driving greater quality in products. As a result, the big guy won’t always stay the big guy if others keep innovating. There is no “network effect” advantage.</p>
<h2 id="conclusion">Conclusion</h2>
<p>There are probably a lot of technical details I glossed over in my opinions and explanations, but I think the important thing is that something like this could be possible. The crypto space is constantly changing, and so are things like the world’s opinion of the technology, so there is no telling how close to or far from the truth this is. Regardless, I really enjoyed writing this post and I hope it keeps us all dreaming about what’s possible.</p>Mokshith Voodarlamvoodarla@gmail.comThis post was written after reading the The Billion User Table on 1729.Abstracting Away Crypto2021-07-11T07:00:00+00:002021-07-11T07:00:00+00:00https://mokshith.xyz/tech/2021/07/11/non-crypto<p>This post was written after reading the recent post on <a href="https://1729.com/crypto-for-people-who-dont-follow-crypto">1729</a>.</p>
<h2 id="introduction">Introduction</h2>
<p>The crypto space has reached new heights in 2021, with Bitcoin reaching <a href="https://cointelegraph.com/news/bitcoin-suddenly-hits-60k-as-a-new-resistance-battle-liquidates-850m">all-time highs</a>, <a href="https://www.theverge.com/2021/2/25/22300835/coinbase-s1-bitcoin-going-public-profit-revenue-invest-crypto">Coinbase going public</a>, a hot <a href="https://www.theverge.com/2021/3/11/22325054/beeple-christies-nft-sale-cost-everydays-69-million">digital art market</a>, and more institutional adoption. So many people are drinking the crypto Kool-Aid: social media influencers <a href="https://mashable.com/article/influencers-altcoin-scams">promoting shitcoins</a>, and <a href="https://www.nasdaq.com/articles/about-46-million-americans-now-own-bitcoin-2021-05-14">46 million Americans</a> holding Bitcoin as of mid-May. So much market value has been created in a much shorter time than with other technologies like the internet. The weird thing, though, when comparing crypto with technologies like the computer or the internet, is that the technology was created before the application. There is very little practical use of crypto right now. That doesn’t mean, however, that there’s a shortage of projects like <a href="https://ethereum.org/">Ethereum</a>, <a href="https://uniswap.org/">Uniswap</a>, and <a href="https://solana.com/">Solana</a>, which are helping build the infrastructure needed for more applications.</p>
<p>This is why I want to dedicate this post to the applications of this basic blockchain technology and how it might replace so much of the tech we’re used to today. Let’s abstract away smart contracts and the blockchain. Assume a decentralized database that anyone can write to and that is extremely difficult to modify or delete. Also assume a simple way to execute code on this database that modifies certain values when different requirements are met.</p>
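That abstraction can be sketched as a toy in a few lines (all names and rules here are invented for illustration, not any real chain’s API): an append-only log plus a piece of code that changes values only when its requirements are met.

```python
ledger = []                         # append-only history of transfers
balances = {"alice": 10, "bob": 0}  # state anyone can read

def transfer(sender, receiver, amount):
    """Executes only when the requirement (sufficient balance) is met."""
    if balances.get(sender, 0) < amount:
        return False                # requirement not met: no state change
    balances[sender] -= amount
    balances[receiver] += amount
    ledger.append((sender, receiver, amount))
    return True
```

Everything below builds on this picture: shared state, open writes, and rules that fire automatically.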
<h3 id="online-social-identity">Online Social Identity</h3>
<p>Right now, most of us use Twitter, Instagram, Facebook, LinkedIn, or some other form of social networking. We have a set of friends, a group of followers, or follow different topics ourselves on these sites. These sites are great ways for us to share and consume content posted by others.</p>
<p>However, there are some big problems with most of these sites. First, they’re insecure: many top figures have been <a href="https://twitter.com/TwitterSupport/status/1283591848729219073">hacked on Twitter</a>, for example. Another is censorship and de-platforming, where the company behind the site can remove your account or prevent you from posting, one of the most famous incidents being with <a href="https://en.wikipedia.org/wiki/Social_media_use_by_Donald_Trump">President Donald Trump</a>. The biggest problem by far, however, is that if you do get de-platformed, or feel like posting somewhere else, moving to an alternative while preserving the free speech you want to exercise is difficult. You can’t take a following from Twitter, for example, and move it to your own personal domain.</p>
<p>Imagine, however, if things like posts and a public profile were all stored on-chain, with other private information accessible only via a private key. This is what <a href="http://bitclout.com/">bitclout</a> is trying to build as an alternative to Twitter. There is a bitclout chain, and using the network isn’t restricted to bitclout.com. Anyone can build their own client with different recommendation algorithms, etc., and people use the one they like most. This prevents being hacked or any form of de-platforming.</p>
<h3 id="app-stores">App Stores</h3>
<p>The ongoing case between <a href="https://en.wikipedia.org/wiki/Epic_Games_v._Apple">Epic Games and Apple</a>, related to Apple’s restrictions on in-app purchasing methods outside Apple’s payment system (which takes a 30% cut), has gotten a lot of press coverage. Epic argues that as companies grow large mobile app businesses, Apple will always be able to take a significant cut, decreasing margins and making it harder to compete. Apple’s side of the argument is that it built an amazing phone and OS that gave these companies access to millions of consumers, and thus deserves this cut. Apple also has its own arbitrary safeguards as to which apps it accepts onto the store. The case is complex, and it’s hard to side with one or the other.</p>
<p>Another story is one where <a href="https://www.npr.org/2020/10/22/926290942/google-paid-apple-billions-to-dominate-search-on-iphones-justice-department-says">Google paid billions</a> to Apple so it could keep Google search on iPhones.</p>
<p>Let’s imagine an alternative, however. What if we had a decentralized store with some type of voting process for adding or removing apps, which still takes a cut, but a much smaller one? This cut wouldn’t go to a single business; it would go to the computers hosting apps for download and to those putting effort into quality reviews for app acceptance decisions. This makes it easier for companies to compete and makes it impossible for a single entity to prevent an app from existing on a store.</p>
<h3 id="aws-s3">AWS S3</h3>
<p>Right now, AWS is the most dominant cloud provider out there. Unless you’re a giant company, you’re not building your own secure infrastructure to host your site or other operations; you’re most likely using AWS products like S3, Compute, and Databases. There are some potential issues with this, however. What if AWS goes down? There’s nothing you can do about it. What if Amazon chooses to raise prices? There are a few competitors right now, GCP and Azure, so it’s unlikely, but it would be a pain to switch to an alternative immediately. And do we trust Amazon to never look at our data? Most cloud providers ensure secure encryption but usually store the encryption keys themselves. This may keep third parties out of your data, but it still leaves the door open for AWS to access it if it needs to (say, if the government comes knocking).</p>
<p>Now let’s think of an alternative. AWS is of course more than just storage now, but it started with just S3. What if there were a way for any person to lease storage space on their personal computer for some price and securely store bits of your information for you? No one computer would store an entire file; each would store small parts of it, which can only be reassembled with a private key. This is essentially what <a href="https://filecoin.io/">Filecoin</a> is. It prevents the risk of government seizures, unforeseen price hikes, or any real downtime, given the properties of decentralization and the ability for files to be stored across an entire network.</p>
<h3 id="owning-things">Owning Things</h3>
<p>Right now, we carry with us a lot of documentation. This includes things like our passport, driver’s license, birth certificate, property documents, etc. It even includes smaller things like a ticket to a concert or a movie. Most of these things are stored physically on paper or in some PDF. There is however a blossoming market for fake IDs, and in places like India, cases where people <a href="https://www.bbc.com/news/world-asia-india-20457766">make up property documents</a> and claim they own a piece of property when they actually don’t. This is because the entities which issue these documents aren’t transparent, and nobody can publicly access information that should be at least semi-accessible.</p>
<p>Now imagine if these agencies which issued some credential, document, or proof of ownership did it on-chain? Then a person can always prove that they’re the real owner of something or verify their identity without the burden of carrying physical documents which can be faked. That would be so much more convenient and also just pretty cool.</p>
<h3 id="fake-news">Fake News</h3>
<p>We are all probably too familiar with the issue of fake news and the spread of misinformation. I wrote a <a href="https://blog.mokshith.xyz/tech/2021/01/07/state-of-privacy.html">different post</a> about this when the Capitol riot happened in January of 2021. Events like that highlight the importance of having some way of fact-checking information. We don’t have good ways of doing this right now, so social media sites have their own private teams of people or sets of models that do it manually. This has led to many cases where something false spreads like wildfire or something true is de-platformed.</p>
<p>Now imagine if there was a way to give badges to any piece of content online that communicated some sense of whether it is true or not. This could be a system where organizations or groups, represented as <a href="https://ethereum.org/en/dao/">DAOs</a>, are given a certain amount of credibility over certain topics. These organizations can then give badges to content they support or disagree with, at which point a decision can be made about whether it can cause harm and should be removed. Obviously, I don’t have the perfect idea here of how this would work exactly, but it goes to show how diverse the applications of this blockchain database are.</p>
<h3 id="paywalls">Paywalls</h3>
<p>Right now, I can’t read any content I want on sites like the Wall Street Journal or the New York Times. I need to pay. This is the same for many top scientific journals like Nature or IEEE. Why does knowledge sometimes have a monetary barrier? This definitely shouldn’t be the case. The reason these journals or sites exist is because they’ve all formed a notion of credibility around their name. This is really powerful for good but also allows for these centralized organizations to slowly sneak in opinion or accept papers which don’t deserve to be accepted based on the personal motives of a few.</p>
<p>Imagine an on-chain way of sharing papers and giving them some sense of credibility through systems of upvotes, weighted by the reader’s credentials, which are also stored on-chain. Imagine if, rather than working for a bigger organization, journalists just worked for themselves and published freely on their own sites, which become popular based on the value of their journalism. People could then support them through donations or by buying ownership in a journalist’s token, which increases in value with the value of the content they produce.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I’ve listed a few applications of blockchain technology, but there are a lot more that I haven’t mentioned, including everything happening in DeFi, which will enable a lot more financial freedom. It goes to show how many applications of this technology there are, how it can rewrite almost everything we’ve built on the web already, and how it can digitize some other things that hadn’t yet made it there (like our passports).</p>
<p>This post was written after reading the recent post on 1729.</p>
<p>Mokshith Voodarla (voodarla@gmail.com)</p>
<p>Testing ML Models in Production (2021-06-30): https://mokshith.xyz/tech/2021/06/30/production-ml-testing</p>
<h3 id="problem">Problem</h3>
<h4 id="introduction">Introduction</h4>
<p>Testing is an important part of any production software. Without it, simple mistakes that developers never meant to make seep into production and can result in lost revenue, unmet SLAs, etc.</p>
<p>One universal practice software developers follow is unit testing and continuous integration, where a test suite is run against the whole codebase every time something is committed. This is a major contributor to finding failures before moving to production. This type of testing, however, only works on what we call the Software 1.0 stack.</p>
<p>Software 1.0 is being replaced, and ML models are the major driver. Instead of writing code that solves a problem, we’re writing code that trains a model which solves the problem. This is called Software 2.0.</p>
<p><img src="https://miro.medium.com/max/1400/1*5NG3U8MsaTqmQpjkr_-UOw.png" alt="Andrej Karpathy Software 2.0" /></p>
<p>Our workflow changes from write code, test code, deploy code to collect data, label data, train model, evaluate model, tune hyperparameters or curate dataset, repeat, deploy. The latter part of this new process is where things get a little hand-wavy. What are the most important metrics to evaluate on? What should we change to improve model performance? Autonomous driving is a perfect example of why this is really hard. You could build a great prototype for a self-driving car in 2015 with little budget (<a href="https://comma.ai/">Comma AI</a> did this). But 5 years and billions of dollars later, the best autonomous driving systems are still ~L2+ (see <a href="https://twitter.com/fchollet/status/1373112777519230977">this</a>).</p>
<p>When it comes to evaluation, ML practitioners tend to pick one or two north-star metrics and stick with those for the rest of their project. This could be their aggregated cross-validation score on a validation set, a different score on a held-out test set, or just manually inspecting individual samples and evaluating visually. One important thing to note, though, is that in an ideal world all these metrics are trying to represent one thing: how well will our model do on data it has never seen before, in the real world? This is the constant struggle of turning the flashy ML demo into something that works in production.</p>
<p>Autonomous driving is the seminal example of this and has been the field that has helped most with the development of new ML algorithms, but there are bound to be dozens of other applications of the same scale with the same high stakes. The billion-dollar question will be how we can bring the prototype-to-production cycle down from an unknown number of years (AVs are still not solved) to just one or less.</p>
<p>Evaluating correctly and being able to surface exact failure patterns allows for the right data to be labeled next, and for a team to know whether a system is production ready.</p>
<h4 id="test-driven-development">Test-Driven Development</h4>
<p>This is where we might want to take inspiration from test-driven development philosophies which have worked flawlessly as we’ve built some of the most sophisticated software systems. At its core, it means the following (a slide taken from Tesla’s Andrej Karpathy):</p>
<p><img src="https://i0.wp.com/mlinproduction.com/wp-content/uploads/2020/05/karpathy-unit-tests-1.png" alt="Test Driven Computer Vision" /></p>
<p>Just like with software, let’s come up with a set of cases, or critical populations of data, that we can separately evaluate our model on (before we ever train a model). If it passes each of these tests separately, it is ready for deployment. This, in fact, is what most top-notch ML projects follow and is regarded as the gold standard for model evaluation.</p>
<p>The problem arises when teams try to compile a meaningful set of critical populations of data which makes sense to evaluate on. After all, there are infinite edge cases and no way to test all of them.</p>
<p><img src="https://pbs.twimg.com/media/E4aa7m9VIAQEgSl?format=jpg&name=large" alt="Test Driven Triggers" /></p>
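<p>To make this concrete, here is a minimal sketch of what a per-population test suite could look like. Everything here is a toy stand-in: the parity “model”, the populations, and the 0.9 threshold are invented for illustration, not taken from any real system.</p>

```python
def accuracy(model, samples):
    """Fraction of (input, label) pairs the model gets right."""
    return sum(1 for x, y in samples if model(x) == y) / len(samples)

def run_population_tests(model, populations, threshold=0.9):
    """Evaluate the model separately on each named critical population."""
    return {name: (accuracy(model, samples), accuracy(model, samples) >= threshold)
            for name, samples in populations.items()}

# Toy model and populations; real ones would be curated slices of a test set.
model = lambda x: x % 2  # "predicts" the parity of an integer
populations = {
    "sun_glare":  [(2, 0), (4, 0), (7, 1)],
    "night_rain": [(1, 1), (3, 0), (6, 0)],
}
results = run_population_tests(model, populations)
failing = [name for name, (_, passed) in results.items() if not passed]
```

<p>The point is the shape: each critical population gets its own pass/fail verdict, instead of one aggregate score that can hide a failing slice.</p>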
<h4 id="current-solutions-and-issues-with-them">Current Solutions and Issues with Them</h4>
<p>One way teams implement this today is through access to external data. This could be a set of sensors or some other extra data available in hindsight. An example is an extremely heavy ML model which can’t be run in production but can be run during evaluation to provide more accurate results that we treat as ground truth. This allows one to make simple classifications like “pictures with more than 5 people”, “pictures with less than 30% road”, “sequence where bounding box flickers”, “sun glare”, etc. These are then used as the critical populations to evaluate on.</p>
<p>Let’s discuss the issues with this.</p>
<p>First is the engineering time that goes into putting systems in place which detect sun glare, or other types of evaluations which aren’t directly related to the problem or are too heavy to use at test time. What if you want an evaluation set of all images where the sun is about to set? This “sunset” characteristic hasn’t been labeled and can’t be detected trivially by a sensor. This highlights the fact that the relevant “critical populations” aren’t necessarily just what is labeled in your data or what is detected by your main system. Most of the time, external unlabeled factors influence a model in more significant ways than teams pay attention to.</p>
<p>Second, what if new data is constantly coming in and there are new edge cases to potentially explore and automatically add to the evaluation suite? There is a lot of boring, annoying code to write to make this work seamlessly.</p>
<p>Finally, who likes writing tests anyways? Many bureaucratic companies force their software teams to do this but many fast-growing startups don’t have strict requirements due to the pace of change. With regular software, it’s hard to automate the creation of tests. With ML, there is at least some hope for making it easier to automate a test suite.</p>
<h3 id="what-to-solve">What to Solve</h3>
<p>At its core, we want to build an easy way to do test-driven development in production ML with datasets. This means making it easy to find critical populations through some sort of query language. We’ll focus on computer vision first.</p>
<h3 id="solution">Solution</h3>
<p>The first part of a solution is a tech stack that allows users to easily search for data. In the context of images, it could mean any of the following things.</p>
<h5 id="clip-for-text-searches-on-a-dataset-of-images">CLIP for text searches on a dataset of images?</h5>
<p>“Windy road with no trees”<br />
“Dimly lit factory floor”<br />
“Ocean view on a bridge”<br /></p>
<h5 id="search-by-image-use-one-image-to-find-similar-images-to-add-to-a-set-using-embeddings">Search by image? Use one image to find similar images to add to a set using embeddings.</h5>
<p>Click on an image of a bathroom → set of many bathrooms<br />
Click on an outdoor image of a forest → many forests</p>
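<p>Under the hood, search-by-image is just nearest-neighbor lookup in an embedding space. Here is a minimal sketch with made-up 3-d vectors; a real system would get embeddings from a trained encoder (a CNN or CLIP) and use an approximate-nearest-neighbor index rather than a full sort.</p>

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def most_similar(query, embeddings, k=2):
    """Return the k image ids whose embeddings are closest to the query's."""
    return sorted(embeddings, key=lambda name: cosine(query, embeddings[name]),
                  reverse=True)[:k]

# Toy 3-d embeddings keyed by image id.
embeddings = {
    "bathroom_1": [0.9, 0.1, 0.0],
    "bathroom_2": [0.8, 0.2, 0.1],
    "forest_1":   [0.1, 0.9, 0.3],
}
hits = most_similar([1.0, 0.0, 0.0], embeddings)
```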
<h5 id="automatic-classification-detection-segmentation-models-run-on-data-for-metadata-search">Automatic classification, detection, segmentation models run on data for metadata search?</h5>
<p>SELECT image, num_people<br />
FROM dataset<br />
WHERE num_people > 5 AND weather = 'sunny'<br /></p>
<p>SELECT image_name<br />
FROM dataset<br />
WHERE road_pct < 0.1 AND ocean_pct > 0.7</p>
<p>This searching stack should then be part of a managed platform which can automatically run the required models for automatic data tagging and neural-network-based embedding or text search. Searches should be possible either via a web UI or a Python API.</p>
<p>Along with the searching stack comes organization into different sets/groups that can be used as unit test sets. A set of training / testing samples can then easily be pulled using the Python API as a part of a team’s training / evaluation code.</p>
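<p>As a sketch of what that Python API might feel like, here is an invented <code>DatasetClient</code> backed by an in-memory store. The class and method names are hypothetical; they only illustrate the save-a-query-as-a-set and pull-a-split workflow described above.</p>

```python
class DatasetClient:
    """Hypothetical client for the managed platform (in-memory stand-in)."""

    def __init__(self):
        self._sets = {}

    def save_query_as_set(self, name, samples, holdout_pct=0.2):
        """Freeze a query's results into a named set with a held-out slice."""
        cut = int(len(samples) * (1 - holdout_pct))
        self._sets[name] = {"train": samples[:cut], "test": samples[cut:]}

    def pull(self, name, split):
        """Fetch one split of a named set, e.g. inside a training script."""
        return self._sets[name][split]

client = DatasetClient()
client.save_query_as_set("sunny_5plus_people", ["img_%d.jpg" % i for i in range(10)])
train = client.pull("sunny_5plus_people", "train")
test = client.pull("sunny_5plus_people", "test")
```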
<h3 id="example-workflow">Example Workflow</h3>
<p>Data either exists in some system, or is in a bucket in the cloud. Regardless, that bucket can keep updating as more data is collected and acts as the place for all your raw collected data (maybe with labels included). It is somehow hooked up to the managed platform.</p>
<p>Use the managed platform to write queries for interesting populations of data (see Solution section). Give the query a name and choose a percentage of this type of data that you want to add to a hold-out test set.</p>
<p>See the percentage breakdown of how much data in training, test, and validation sets fit / do not fit into each of these categories.</p>
<p>Use Python API to upload evaluation scores / predictions for each sample and automatically visualize across different interesting populations.</p>
<p>See the subsets in which the model is performing poorly to narrow in on failure patterns. Then either apply new pre-processing or post-processing, or collect new data to improve on that subset.</p>
<p>Make new queries easily and see whether there is drastically low performance on any of them. If so, it might be important to add that as another unit test.</p>
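<p>The per-subset evaluation step in this workflow could be sketched as follows: upload a score per sample along with its population tags, then aggregate per tag to find the weakest slice. The records and tags here are toy examples.</p>

```python
from collections import defaultdict

def scores_by_population(records):
    """Average a per-sample score within each tagged population."""
    buckets = defaultdict(list)
    for rec in records:
        for tag in rec["tags"]:
            buckets[tag].append(rec["score"])
    return {tag: sum(vals) / len(vals) for tag, vals in buckets.items()}

# Toy uploaded evaluation scores; tags come from the saved queries.
records = [
    {"score": 0.95, "tags": ["sunny"]},
    {"score": 0.90, "tags": ["sunny", "crowded"]},
    {"score": 0.40, "tags": ["sun_glare"]},
    {"score": 0.50, "tags": ["sun_glare", "crowded"]},
]
breakdown = scores_by_population(records)
worst = min(breakdown, key=breakdown.get)  # the population to focus on next
```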
<h3 id="core-value-adds">Core Value Adds</h3>
<h5 id="have-10x-better-insight-into-where-model-is-performing-goodbad">Have 10x better insight into where the model is performing well or poorly</h5>
<ul>
  <li>Spend less time debugging the model and more time improving it</li>
  <li>Better communication with relevant stakeholders beyond arbitrary accuracy/F1 scores</li>
  <li>More self-aware product experience depending on model performance issues</li>
</ul>
<h5 id="have-10x-better-oversight-over-collected-data">Have 10x better oversight over collected data</h5>
<ul>
  <li>It is expensive to get all data labeled; tagging provides visibility into the type of data you have and the type of data you need</li>
  <li>As troves of data pile up, having information that doesn’t have to be hand-transcribed becomes extremely important</li>
</ul>
<h3 id="potential-initial-target-markets">Potential Initial Target Markets</h3>
<p>Logistics companies handling documents<br />
AV companies<br />
Companies deploying algorithms on IP cameras</p>
<h3 id="other-related-projects">Other Related Projects</h3>
<p><a href="https://www.amundsen.io/amundsen/">https://www.amundsen.io/amundsen/</a><br />
<a href="https://kolena.io/">https://kolena.io/</a></p>
<p>Mokshith Voodarla (voodarla@gmail.com)</p>
<p>Going from Collected Data to Curated Datasets (2021-03-14): https://mokshith.xyz/tech/2021/03/14/collecting-and-curating-data</p>
<p>For every ML project that is worked on in industry, there exists the process of going from collected data to curated datasets. We will go over the current workflow for doing this and the problems that exist within it. We’ll take the example of Zillow wanting to automatically label which part of the house each listing image on their site is from. My thoughts below have been compiled after talking to 50+ ML engineers and researchers in industry. Also note: this writing was free-form, so don’t expect perfect thoughts.</p>
<h3 id="current-workflow">Current Workflow</h3>
<h5 id="1-figure-out-how-youre-getting-the-data">1. Figure out how you’re getting the data</h5>
<p>If you’re a drone company or a self-driving car company, this might be going out to take the vehicle for a spin either in the real-world or in simulation. But, if you’re a company like Zillow, you’ll have access to all these images in a database. The team first begins by querying their database for a set of listing image samples along with relevant metadata (house address, timestamps, etc). There are 100M+ listings on Zillow with about say ~10 images per listing which amounts to about 1B images. The team however, is looking for a dataset which is probably around ~100K images.</p>
<h5 id="2-curate-the-data">2. Curate the data</h5>
<p>Since the team at Zillow is smart, they’ll want to diversify their dataset based on a few factors which could include an even distribution of timestamps, housing prices, housing locations, year built, etc. Writing a query for this though is pretty tough because it’s over many axes. So, what the team might do is first sample ~10M images randomly into each of multiple buckets of timestamp ranges. Then, the team may look at each bucket and recursively create new buckets for each of these buckets based on say housing prices until they reach a threshold of about ~5M. They do this for all axes and eventually land on a finite set of buckets of about ~500K images each which fall into each combination of the metadata properties. They can then randomly toss out a fraction of the images from each bucket and have a dataset that’s curated according to these 5-6 axes. Systems like <a href="https://eng.uber.com/michelangelo-machine-learning-platform/">Michelangelo</a> help with this stuff.</p>
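<p>The recursive bucketing described above is essentially stratified sampling over metadata axes. A simplified sketch, with made-up price/year bands standing in for Zillow’s real metadata:</p>

```python
import random
from collections import defaultdict

def stratified_sample(samples, axes, per_bucket, seed=0):
    """Bucket samples by a tuple of metadata axes, then cap each bucket."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[tuple(s[a] for a in axes)].append(s)
    rng = random.Random(seed)
    curated = []
    for group in buckets.values():
        rng.shuffle(group)            # random subset within each bucket
        curated.extend(group[:per_bucket])
    return curated

# Toy listings with two metadata axes; the real version would add timestamps,
# location, year built, etc., and recurse over many more axes.
listings = [{"id": i, "price_band": i % 2, "year_band": i % 3} for i in range(60)]
curated = stratified_sample(listings, ["price_band", "year_band"], per_bucket=5)
```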
<h5 id="3-augment-the-data">3. Augment the data</h5>
<p>Though the team went through the due diligence of trying to create an evenly distributed dataset, chances are that they didn’t really. This is mainly because even though they curated over ~5 axes, there are probably hundreds of relevant ones across the dataset. Some of these might not even lie directly within the metadata and could be things such as image lighting, camera focal length, average distance from subjects, moving objects in images, etc. All of these things have the ability to bias the dataset in some way. So, what they decide is that they’ll try using some augmentation to de-bias the dataset a little more. This could be with things like vertical/horizontal flips and artificially changing the lighting within the images. We now ideally have a dataset which is completely “de-biased” and gives us a good representation of the houses we’ll see in the real world or in the future on the site.</p>
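<p>The flips themselves are trivial operations on the pixel grid. A toy sketch on nested lists (a real pipeline would use an augmentation library on actual image arrays, but the underlying operation is just this):</p>

```python
def hflip(img):
    """Horizontal flip: reverse each row of the pixel grid."""
    return [row[::-1] for row in img]

def vflip(img):
    """Vertical flip: reverse the order of the rows."""
    return img[::-1]

# A tiny 2x3 "image" of pixel values.
img = [[1, 2, 3],
       [4, 5, 6]]
flipped = hflip(img)
```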
<h5 id="4-pre-process-the-data">4. Pre-process the data</h5>
<p>Now, it’s the job of the team to figure out the best way to feed these images into the model. Much of the time, it’s a matter of simply resizing the image to the fixed dimensions expected by the model and then one-hot encoding the classification labels. However, there are possibilities of doing other things like “un-distorting” the images based on camera parameters one may know depending on what phone the image was taken on, or purposefully blacking out all non-static objects within the image (people, animals, cars, etc). The motivation for things like this is rooted in the task you’re trying to perform and what it depends on.</p>
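<p>The resize-and-encode step can be sketched in a few lines. This uses nearest-neighbor resizing on a toy pixel grid and a plain one-hot encoder, standing in for whatever image library a real pipeline would use.</p>

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor resize of a 2-d pixel grid to a fixed model input size."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def one_hot(label, classes):
    """One-hot encode a classification label."""
    return [1 if c == label else 0 for c in classes]

# Toy 2x2 "image" upscaled to 4x4, plus an encoded room label.
img = [[1, 2],
       [3, 4]]
resized = resize_nearest(img, 4, 4)
label = one_hot("kitchen", ["bathroom", "kitchen", "bedroom"])
```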
<h5 id="5-train-on-the-data">5. Train on the data</h5>
<p>Now you can just send this data off to some AWS instance that loads up a training script and ingests all this data.</p>
<h5 id="6-diagnose-model-failures-and-restart">6. Diagnose model failures and restart</h5>
<p>At this stage, your model probably isn’t performing well on some subset of your test set. You have to go through these failure cases and figure out why that is. You then either choose to change something about the model architecture, the hyperparameters, or the dataset itself, and run a new job.</p>
<h3 id="problems-with-current-workflow">Problems with Current Workflow</h3>
<p>Now we try to analyze what’s wrong with this workflow and list potential problems below.</p>
<h5 id="1-manual-axis-picking-for-data-curation">1. Manual axis picking for data curation</h5>
<p>The first red flag one might see is within the “curating” step. The team hand-picked a certain set of axes that they thought might be relevant to the task at hand and tried to balance the data according to these metrics. However, these are never the only axes that matter. Against our intuition, they may even be axes that barely matter to the task, because deep learning models are black boxes that we don’t understand.</p>
<h5 id="2-the-movement-of-data">2. The movement of data</h5>
<p>Though it may not be immediately obvious, there is a tedious process of having to download the data just to view it as an ML practitioner. Viewing the data is what allows you to decide which metadata to use to balance your dataset, or which types of pre-processing and augmentation might help. The first time you do this is in the data curation step, where you might download a portion of the dataset to a local workstation to view. The next time data moves is when you run the recursive bucketing task on some remote machine, because it is presumably compute intensive. It moves again when we pre-process the data, because deep-learning-based pre-processing (i.e. removing people from images) may require specialized GPU hardware. We then move it a final time to an AWS bucket which the training script accesses.</p>
<h5 id="3-augmentation--pre-processing-is-messy-and-annoying">3. Augmentation + pre-processing is messy and annoying</h5>
<p>Sometimes, when dealing with data samples that contain RGB images along with LiDAR scans, depth maps, bounding boxes, etc., things get tough to manage. With deep-learning-based pre-processing, this may happen because you’re trying multiple off-the-shelf models on a set of data and seeing which one gives the best pre-processed results. The takeaway here is that augmentation and pre-processing can be annoying both because of how messy it is to manage multiple pre-processing experiments and because of how long they may take to run on an entire dataset.</p>
<h5 id="4-diagnosing-issues-with-the-model-and-improving-the-dataset">4. Diagnosing issues with the model and improving the dataset</h5>
<p>These days, model architectures have been standardized by companies like Google, Microsoft, and OpenAI to the point at which it makes no sense to experiment with it yourself. The only real way to improve your model is to figure out how to improve your data. This involves sourcing samples that are similar to our failure cases to include in our next training run or seeing if a pre-processing technique negatively affected the model’s performance for some reason.</p>
<h3 id="solving-the-problems">Solving the Problems</h3>
<p>In my last post, I wrote about an IDE for datasets. I think some of my unorganized ideas from there are part of the solution, but since that post I’ve organized them further and am trying to build some of them out myself. Will keep the blog posted…</p>
<p>Mokshith Voodarla (voodarla@gmail.com)</p>
<p>An IDE for Datasets? (2021-02-15): https://mokshith.xyz/tech/2021/02/15/dataset-ide</p>
<h3 id="introduction">Introduction</h3>
<p>ML systems. A whole practice that has come out of trying to take these amazing algorithms out of research and bring them to the forefront of the core technologies we all use today. ML systems is all about building tools around ML which make it easier to work with and deploy. There are a few different categories that we can put these tools into.</p>
<ol>
<li>Data collection + labelling</li>
<li>Data management + storage</li>
<li>Scaled training (distributed, on AWS, etc)</li>
<li>Experiment management (keep track of models, accuracies, etc)</li>
<li>Easy model deployment</li>
</ol>
<p>The left of the image below is a cool summary of some of the companies in this space (peep Scale).</p>
<p><img src="https://s3.amazonaws.com/basecase.vc/defensible-ml-market-map.png" alt="ml systems diagram" /></p>
<h3 id="intro-to-data-management-for-computer-vision">Intro to Data Management for Computer Vision</h3>
<p>I’m super interested in the data management + storage part, and it’s what I’ll talk about in this post. I think I had a different post on a related topic but now I’m less of a noob since I’ve been super into all of this stuff for the past month and a half now. Also, disclaimer: I’ll be mostly talking about all of this in the frame of computer vision.</p>
<p>The big problem ML teams experience with data isn’t necessarily the ability to collect a lot of it. It is rather the ability to take all that collected data, and understand what’s important, what’s not, and where bias may exist. After all, the model is directly learning from the data and any bias or uneven spread will be learned by the model as the truth. This is why finding bias within data is important.</p>
<p>Now, are there tools that help us solve this problem? Well, yes and no. Most ML teams tend to calculate many statistics about their data to make sure it is balanced across various types of metadata (class, timestamp, etc). However, the more complicated the data gets (especially with many images), the easier it is to miss bias that exists in non-metadata form. For example, what are the general colors of the image? What is the lighting like? Are there always exactly 5 cars there? What type of camera took the picture? This is why teams then try to visualize their data in an <a href="https://cs.stanford.edu/people/karpathy/cnnembed/">embedding space</a>. An embedding space is essentially a higher-dimensional space calculated by running images a few layers through some model. For CNNs, this gives you the representation that the rest of the network uses to decide on some sort of classification, detection, etc. Algorithms like <a href="https://distill.pub/2016/misread-tsne/">T-SNE</a> help us take that embedding and bring it down to a lower-dimensional space we’re able to view and explore.</p>
<p>So, I’ve now named two main things: dataset statistics/summarization and the embedding space view. These are things every ML team should do when working with image data, but most of the time teams move forward with one or none. What’s the main reason for this? Well, even though at face value these two things are pretty easy to write custom scripts for, those scripts need to be maintained to work with small new variations in the data and also need to be made more specific to each use case (the types of scores used to evaluate spread, what’s relevant). They also aren’t always optimized or the easiest to open up and use (because the main job of an ML team is to build models, not tooling). Companies like Google, Facebook, and Apple, and self-driving car companies like Cruise, Tesla, and Waymo, have all built redundant tooling out of their needs.</p>
<h3 id="current-tools-and-their-problems">Current Tools and Their Problems</h3>
<p>The closest thing to what I’ve described above is <a href="https://www.aquariumlearning.com/">Aquarium</a>. Aquarium offers a webgui for exploring the samples in a dataset while also being able to see various statistics about it along with a nice embedding view. This enables ML teams to select parts of their dataset where they see issues (based on misclassifications, high concentrations of similar images, etc). They also have a pretty intuitive GUI with some nice settings.</p>
<p>But why might a platform like this be limited? A good way to think about this is through the lens of the IDEs we use to write code. There’s Atom, PyCharm, VSCode, Sublime, Vim, etc. All great options for Python and many other languages. They come with nice plugins and a place to write and compile code. What if we frame these data management platforms the same way? After all, we are talking about a tool to build good datasets from mounds of collected data. In an IDE for code, I’m in control of what code is written, when I want to compile it, custom unit tests, and other plugins based on my specific use case. It all changes based on what I want. However, with something like Aquarium, you are limited to the set of tools Aquarium has built. You can only view a specific set of statistics; you can only filter or select in ways the Aquarium GUI defines and allows. You can’t evaluate model performance on something as custom as transforming many camera views into a bird’s-eye view and the metrics related to that, for example.</p>
<h3 id="building-on-top-of-current-tools">Building On Top of Current Tools</h3>
<p>This is why my take on data management tools is a little different than what we see exists today with Aquarium or <a href="https://scale.com/nucleus">Scale Nucleus</a>. I think that there should be a platform that allows the exploration of data in embedding spaces, as well as a view to understand various statistics, histograms about the spread of the data. However, I think there should be two main changes to what is being built right now.</p>
<ol>
<li>
    <p>We should build a platform catered to specific modes of data (i.e. images with computer vision). There are so many processes that apply to images and not to other forms of data like text or sound. This includes, for example, things like mean subtraction, blurring, edge enhancing, contour detection, specific data augmentation, etc. All possible things we might want to do before sending the data through a model. There are also more specific forms of images we may input (birds-eye view, selfie, professional camera, RAW, fish-eye). There are all these small things we can build features around to make the life of someone who works with this data easier. For example, why not offer automated ways to <a href="https://github.com/ethz-asl/image_undistort">rectify stereo pairs, or undistort a set of images based on camera parameters</a>? If we’re able to build a tool where raw data is uploaded and can then all be processed and analyzed in place, we’ve built the ideal tool.</p>
</li>
<li>
    <p>Engineers like custom plugins and tools. They like to write code. Give them what they want. Implement ways engineers can programmatically specify certain slices or filters of a set of data, specific comparators to sort the data in a particular manner, or ways to cluster it into groups. We could also build the ability to programmatically specify <a href="https://github.com/aleju/imgaug">augmentations</a> and pre-processing for the data. The options become endless when applied in the real world. The platform should thus not constrain this need; it should embrace it.</p>
</li>
</ol>
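To make the second point concrete, here is a minimal sketch of what programmatically composed pre-processing and augmentation could look like. This is plain numpy with hypothetical function names — not any existing product’s API — and a real platform would draw augmentations from a library like imgaug:

```python
import numpy as np

def mean_subtract(images):
    # Subtract the per-channel mean computed across the whole batch,
    # a common normalization step before feeding images to a model.
    mean = images.mean(axis=(0, 1, 2), keepdims=True)
    return images - mean

def horizontal_flip(images):
    # A simple augmentation: mirror every image left-to-right (width axis).
    return images[:, :, ::-1, :]

def make_pipeline(*steps):
    # Compose pre-processing steps into one callable, applied in order.
    def run(images):
        for step in steps:
            images = step(images)
        return images
    return run

# A batch of four 8x8 RGB images with random pixel values in [0, 1).
batch = np.random.rand(4, 8, 8, 3).astype(np.float32)
pipeline = make_pipeline(mean_subtract, horizontal_flip)
processed = pipeline(batch)
print(processed.shape)  # (4, 8, 8, 3)
```

The point of the composition is that each slice, filter, or augmentation stays a small user-written function the platform merely chains together.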
<p>While my thoughts may not be perfect here, I genuinely believe these are a few of the things we need in order to build the ideal tool for computer vision/ML engineers. How cool would it be if I didn’t have to download my data locally anywhere in order to analyze it and curate the perfect dataset to train my model on? It would be crazy if a few pre-written inspections and clicks were all it took before the data was ready and packaged for training. Right now, tools exist but they’re all fragmented and separate. Maybe this is starting to sound like Palantir, since they build tools that make it easier for humans to make decisions on unorganized data coming from many sources. But I think that whoever builds this tool for ML engineers will have solved a great problem. One that hasn’t yet been solved.</p>Formatting Data is Hard2021-01-15T07:00:00+00:002021-01-15T07:00:00+00:00https://mokshith.xyz/tech/2021/01/15/data-processing-tool<p>I’m in Hawaii right now and I woke up early so I decided to write this.</p>
<p>There are so many tools making the development of ML models easier. This starts at the data collection step, where services can help with both the collection and the labelling of data (Appen, Scale, and Labelbox). Tools like PyTorch and TensorFlow then make it easy for us to both train and deploy our ML models. However, the two main pain points I observe are the data processing step and the model deployment step.</p>
<p>I want to talk more about the data processing step in this post, so I’ll describe the model deployment issue quickly. When deploying models, there are many situations where we don’t want to also run the models from Python due to the slow runtime. For this reason, we require an ML/data engineer to take the trained model file and write separate inference code against the C++ APIs of PyTorch or TensorFlow, which runs faster but is harder to write and less amenable to quick-and-dirty experimentation. It would be interesting to create a tool here.</p>
<p>But the main issue is the data processing step. This step entails taking the labelled data and formatting it the right way to train the model. For example, with image data we tend to do things like batch resizing (via some interpolation) so it fits the model’s input dimensions. Most data engineers just resize their input images and save the interpolated versions on disk to access later. They then also have to save variations based on the augmentations they decide to perform, the ablation study they’re running, any alternative formats they’re trying, and so on. The same holds for text data or any other data where similar/parallel operations need to be performed. It just gets too messy too quickly.</p>
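For illustration, here is the kind of batch resizing being described, done as a naive nearest-neighbour resize in plain numpy — a real pipeline would use bilinear or bicubic interpolation from Pillow, OpenCV, or the framework’s own transforms:

```python
import numpy as np

def resize_nearest(image, out_h, out_w):
    # Naive nearest-neighbour resize: for each output pixel, pick the
    # source row/column its index maps back onto.
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return image[rows][:, cols]

img = np.arange(16).reshape(4, 4)   # a toy 4x4 "image"
small = resize_nearest(img, 2, 2)   # keeps source rows/cols 0 and 2
print(small)
```

It is exactly these interpolated copies — one per target resolution, per augmentation, per experiment — that end up scattered across disk.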
<p>I’m thinking of a tool that makes it easy to pass data in through batches via a generator. It could be an API with access to a base dataset that has been uploaded to some S3 bucket. Within the online tool, you then have the ability not only to organize all your data (view it, query through it, etc.) but also to add things like augmentations and specific subsets, which can then be consumed via a generator: a single line of Python calling an API for the next batch of data from the server.</p>
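A rough sketch of how such a generator-backed API could feel to use — `DatasetClient` and its methods are entirely hypothetical, and the in-memory array stands in for the remote S3-backed dataset:

```python
import numpy as np

class DatasetClient:
    # Hypothetical client for a hosted dataset. Here it is backed by an
    # in-memory array; in the imagined tool each batch would instead be
    # fetched from a remote server fronting an S3 bucket.
    def __init__(self, data, augmentations=None):
        self.data = data
        self.augmentations = augmentations or []

    def batches(self, batch_size):
        # Generator yielding augmented batches, one request's worth at a time.
        for start in range(0, len(self.data), batch_size):
            batch = self.data[start:start + batch_size]
            for aug in self.augmentations:
                batch = aug(batch)
            yield batch

# The training loop only ever sees batches; nothing is saved to disk.
images = np.zeros((10, 32, 32, 3), dtype=np.float32)
client = DatasetClient(images, augmentations=[lambda b: b / 255.0])
shapes = [b.shape for b in client.batches(batch_size=4)]
print(shapes)  # three batches: 4, 4, and 2 images
```

Augmentations and subsets live server-side as configuration, so the same base dataset never needs to be duplicated per experiment.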
<p>I don’t see any tools trying to do this, and I think it would be an important problem to solve that saves time for all the ML engineers out there.</p>My Thoughts on Internet Echo Chambers2021-01-07T07:00:00+00:002021-01-07T07:00:00+00:00https://mokshith.xyz/tech/2021/01/07/state-of-privacy<p>The recent news about the riots at <a href="https://www.cnn.com/politics/live-news/washington-dc-election-riots/index.html">Capitol Hill</a> brought back to light the reasons why all this happened. This meme that Elon Musk tweeted is pretty characteristic of that.</p>
<p><img src="https://pbs.twimg.com/media/ErGeSQTVoAMmwdb?format=jpg&name=large" alt="Domino Meme" /></p>
<p>Over the last few days, I’ve been thinking about some of this a little more deeply, but it’s still hard to make sense of exactly what causes this sort of polarization. The two main ways I know people get their news are social media sites like Facebook and Twitter, and news outlets (TV, websites, or social media posts).</p>
<p>In terms of news outlets, most people subconsciously listen to the ones that feed them the information they want to hear. This leaves them feeling that the truth is exactly what they hear, without considering the ulterior motives of the news anchors.</p>
<p>However, social media is by far the biggest contributor to the echo chambers behind the riots and protests we see. As we know, platforms like Facebook track almost all of our internet usage and use it not only to feed us relevant ads but also to target the content we’ll engage with most, because these companies want to increase our engagement with the app. Most of the time, though, that engagement is exactly where the most polarization exists, without real regard for right or wrong. It’s also the space that leads people to mindlessly believe actual fake news and cause harm to the world because of it. A president like Donald Trump then supporting these people amplifies all of these effects.</p>
<p>To break the effects of social media down into steps, we have the following.</p>
<ul>
<li>People use a platform because they want to interact with people on that platform</li>
<li>People see content they want to see and things they want to interact with</li>
<li>People create echo chambers based on their usage of a platform</li>
<li>People start believing in things that may not be true, unwilling to consider another possibility</li>
<li>Violence or hatred is incited</li>
</ul>
<p>If we look at this list, things start going downhill at the point where people only see things they want to see rather than a reflection of the real world (they see their ideal world instead). The main culprit here is the news feed algorithm, which targets content so as to maximize Facebook’s ad revenue, when the constraint should really be fidelity to how the real world works. Ad revenue is the quantity being maximized because there is no other way Facebook makes money. So, in order for the news feed algorithm to change, we need a different way for a platform like this to survive and make money.</p>
<p>An obvious first option could be to make it a subscription-based service with prices set by the government and partially funded through taxpayer money. There are probably problems with this idea at face value, but it does remove any ad and content-targeting incentives.</p>
<p>Another option would be separate data stores and cloud data providers that encrypt the data stored within the platform, which companies like Facebook could then access only as a third party. This ensures there are multiple entities keeping each other in check as to how data is being used.</p>
<p>Another option would be trusted arbiters of truth (some sort of service) which these platforms are then forced to use to monitor and censor content that is blatantly false.</p>
<p>For other solutions, I’d probably have to think longer. Overall though, things are really screwed up right now, and I don’t think there are enough engineers focused on this problem.</p>State of Deep Learning Workflows2021-01-06T07:00:00+00:002021-01-06T07:00:00+00:00https://mokshith.xyz/tech/2021/01/06/state-of-dl<p>Below is a breakdown of the current gaps and inefficiencies in the workflow for deep learning.</p>
<p>Deep learning is a field that has been under intense focus for the last 5-7 years now. People are trying to make the tools we use to build deep learning models easier to use (so you don’t need a PhD to actually build something useful with them). There’s already been so much progress here, with accessible libraries like Keras along with new tools for getting labelled data.</p>
<h3 id="workflow-breakdown">Workflow Breakdown</h3>
<p>I can break the workflow of building ML/deep learning models down into the following steps:</p>
<ul>
<li>Understanding the task</li>
<li>Training Data
<ul>
<li>Collect + Label</li>
<li>Create environment + simulate</li>
</ul>
</li>
<li>Develop model pipeline/architecture</li>
<li>Format data to be fed in properly (preprocessing)</li>
<li>Train model
<ul>
<li>Remote Cluster</li>
<li>Google Cloud/AWS Instance</li>
<li>Local Machine</li>
<li>Other Options</li>
</ul>
</li>
<li>Evaluate model</li>
<li>Deploy model</li>
<li>Monitor failures, then report and fix (edge cases)</li>
</ul>
<p>Training data steps are mostly solved by companies like Scale AI, Appen, Labelbox (only for real-world data though). If we’re looking to build realistic simulated environments, there’s still a gap to fill (though there are companies like Applied Intuition which focus specifically on self-driving cars).</p>
<p>Building the model has been made easy with PyTorch, TensorFlow, Keras, and other libraries and repos. There may be a gap to fill here in making the code people write in their research projects cleaner, more accessible, and easier to deploy in production, since most of the time code from research projects is really messy or not easily adaptable.</p>
<p>Formatting data has, in my experience, been something that takes a lot of time. Especially when working with multiple data streams (e.g. an RGB image, a segmentation map, a depth map) and trying to combine them all, it becomes really messy to get them into the right format via some batch operation and then save them on disk. There’s definitely some thought I need to put in here about how we can improve current workflows.</p>
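As a small illustration of the multi-stream problem, here is one way (a sketch, assuming the streams already share spatial dimensions) to combine parallel per-sample streams into a single training array with numpy:

```python
import numpy as np

def combine_streams(rgb, seg, depth):
    # Stack parallel per-sample streams (RGB image, segmentation map,
    # depth map) along the channel axis. Single-channel streams first
    # gain a trailing channel dimension so shapes line up.
    seg = seg[..., np.newaxis].astype(np.float32)    # (N, H, W) -> (N, H, W, 1)
    depth = depth[..., np.newaxis].astype(np.float32)
    return np.concatenate([rgb, seg, depth], axis=-1)  # (N, H, W, 3+1+1)

rgb = np.random.rand(2, 16, 16, 3).astype(np.float32)
seg = np.random.randint(0, 5, size=(2, 16, 16))
depth = np.random.rand(2, 16, 16)
combined = combine_streams(rgb, seg, depth)
print(combined.shape)  # (2, 16, 16, 5)
```

Even this toy version hints at the messiness: every stream needs its own dtype handling and shape massaging before a single batch operation is possible.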
<p>Training the model has been made relatively easy with the use of things like Docker along with newer companies like Anyscale which allow for some distributed operation. I don’t have deep enough knowledge of this part to assess the current state of tools here.</p>
<p>Evaluating the model can be done right now, but no tool makes it easy to organize all the results and view them over the specific slices of data you care about. Scale AI is building a tool called Nucleus that tries to make this easier, though.</p>
<p>Deploying models is still pretty hard. Depending on the use case, you want to do this on-edge, in the cloud via an instance, or in the cloud via some serverless backend. A good amount of the time, it’s important that a model trained with some Python package is then easily usable from C++, which allows for faster, more efficient runtimes per inference when deployed. From my knowledge, this is still somewhat tough to do and requires a lot of custom optimization.</p>
<p>Monitoring a model in production is probably the biggest thing we can solve in this workflow. Companies like Tesla have probably perfected this with the pipeline they use for Autopilot, but there isn’t really any open-source way to effectively monitor when a model fails and then report and iterate on those cases of data.</p>
<h3 id="painpoints">Pain Points</h3>
<p>There are a few different pain points we can observe in the overall deep learning workflow above. The only thing I left out is possible solutions, which is the hard part.</p>