Update README.md
README.md CHANGED
@@ -58,49 +58,8 @@ widget:
     \ and parameters 0, and generalization is influenced by the inductive bias of\
     \ this function space (Section 5)."
     example_title: scientific paper
-  - text: ' the
-
-    is generated in various environments and scenarios, before looking at what should
-    be done with this data and how to design the best possible architecture to accomplish
-    this The evolution of IT architectures, described in Chapter 2, means that the
-    data is no longer processed by a few big monolith systems, but rather by a group
-    of services In parallel to the processing layer, the underlying data storage has
-    also changed and became more distributed This, in turn, required a significant
-    paradigm shift as the traditional approach to transactions (ACID) could no longer
-    be supported. On top of this, cloud computing is becoming a major approach with
-    the benefits of reducing costs and providing on-demand scalability but at the
-    same time introducing concerns about privacy, data ownership, etc In the meantime
-    the Internet continues its exponential growth: Every day both structured and unstructured
-    data is published and available for processing: To achieve competitive advantage
-    companies have to relate their corporate resources to external services, e.g.
-    financial markets, weather forecasts, social media, etc While several of the sites
-    provide some sort of API to access the data in a more orderly fashion; countless
-    sources require advanced web mining and Natural Language Processing (NLP) processing
-    techniques: Advances in science push researchers to construct new instruments
-    for observing the universe O conducting experiments to understand even better
-    the laws of physics and other domains. Every year humans have at their disposal
-    new telescopes, space probes, particle accelerators, etc These instruments generate
-    huge streams of data, which need to be stored and analyzed. The constant drive
-    for efficiency in the industry motivates the introduction of new automation techniques
-    and process optimization: This could not be done without analyzing the precise
-    data that describe these processes. As more and more human tasks are automated,
-    machines provide rich data sets, which can be analyzed in real-time to drive efficiency
-    to new levels. Finally, it is now evident that the growth of the Internet of Things
-    is becoming a major source of data. More and more of the devices are equipped
-    with significant computational power and can generate a continuous data stream
-    from their sensors. In the subsequent sections of this chapter, we will look at
-    the domains described above to see what they generate in terms of data sets. We
-    will compare the volumes but will also look at what is characteristic and important
-    from their respective points of view. 3.1 The Internet is undoubtedly the largest
-    database ever created by humans. While several well described; cleaned, and structured
-    data sets have been made available through this medium, most of the resources
-    are of an ambiguous, unstructured, incomplete or even erroneous nature. Still,
-    several examples in the areas such as opinion mining, social media analysis, e-governance,
-    etc, clearly show the potential lying in these resources. Those who can successfully
-    mine and interpret the Internet data can gain unique insight and competitive advantage
-    in their business An important area of data analytics on the edge of corporate
-    IT and the Internet is Web Analytics.'
-    example_title: data science textbook
+  - text: "Is a else or outside the cob and tree written being of early client rope and you have is for good reasons. On to the ocean in Orange for time. By's the aggregate we can bed it yet. Why this please pick up on a sort is do and also M Getoi's nerocos and do rain become you to let so is his brother is made in use and Mjulia's's the lay major is aging Masastup coin present sea only of Oosii rooms set to you We do er do we easy this private oliiishs lonthen might be okay. Good afternoon everybody. Welcome to this lecture of Computational Statistics. As you can see, I'm not socially my name is Michael Zelinger. I'm one of the task for this class and you might have already seen me in the first lecture where I made a quick appearance. I'm also going to give the tortillas in the last third of this course. So to give you a little bit about me, I'm a old student here with better Bulman and my research centres on casual inference applied to biomedical disasters, so that could be genomics or that could be hospital data. If any of you is interested in writing a bachelor thesis, a semester paper may be mastathesis about this topic feel for reach out to me. you have my name on models and my email address you can find in the directory I'd Be very happy to talk about it. you do not need to be sure about it, we can just have a chat. So with that said, let's get on with the lecture. There's an exciting topic today I'm going to start by sharing some slides with you and later on during the lecture we'll move to the paper. So bear with me for a few seconds. Well, the projector is starting up. Okay, so let's get started. Today's topic is a very important one. It's about a technique which really forms one of the fundamentals of data science, machine learning, and any sort of modern statistics. It's called cross validation. I know you really want to understand this topic I Want you to understand this and frankly, nobody's gonna leave Professor Mineshousen's class without understanding cross validation. So to set the stage for this, I Want to introduce you to the validation problem in computational statistics. So the problem is the following: You trained a model on available data. You fitted your model, but you know the training data you got could always have been different and some data from the environment. Maybe it's a random process. You do not really know what it is, but you know that somebody else who gets a different batch of data from the same environment they would get slightly different training data and you do not care that your method performs as well. On this training data. you want to to perform well on other data that you have not seen other data from the same environment. So in other words, the validation problem is you want to quantify the performance of your model on data that you have not seen. So how is this even possible? How could you possibly measure the performance on data that you do not know The solution to? This is the following realization is that given that you have a bunch of data, you were in charge. You get to control how much that your model sees. It works in the following way: You can hide data firms model. Let's say you have a training data set which is a bunch of doubtless so X eyes are the features those are typically hide and national vector. It's got more than one dimension for sure. And the why why eyes. Those are the labels for supervised learning. As you've seen before, it's the same set up as we have in regression. And so you have this training data and now you choose that you only use some of those data to fit your model. You're not going to use everything, you only use some of it the other part you hide from your model. And then you can use this hidden data to do validation from the point of you of your model. This hidden data is complete by unseen. In other words, we solve our problem of validation."
+    example_title: transcribed audio - lecture
   - text: "Transformer-based models have shown to be very useful for many NLP tasks.\
     \ However, a major limitation of transformers-based models is its O(n^2)O(n 2)\
     \ time & memory complexity (where nn is sequence length). Hence, it's computationally\
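
Each `widget` entry above is only a sample input for the Hub's hosted inference widget, but the same texts can be run locally. Below is a minimal sketch, assuming this card belongs to a summarization model (which the long-form examples suggest); the model id is a placeholder, not something named in this diff:

```python
from transformers import pipeline

# Placeholder id -- substitute the repo this README belongs to.
summarizer = pipeline("summarization", model="<this-repo-id>")

# Any widget text above (scientific paper, transcribed lecture, ...) works here;
# the hosted widget performs essentially this call.
long_text = "Is a else or outside the cob and tree written being of early client rope ..."

result = summarizer(long_text, max_length=256, min_length=32, no_repeat_ngram_size=3)
print(result[0]["summary_text"])
```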