2. Data Collection Techniques and Program Design


GABRIEL SANCHEZ-MARTINEZ:
Any questions on Homework 1 before we get started? AUDIENCE: Yeah. GABRIEL SANCHEZ-MARTINEZ:
OK, fire away. AUDIENCE: I guess,
first, do you think we have like this
minimum cycle time, like a theoretical minimum cycle
time and then what was actually [INAUDIBLE] cycle time? GABRIEL SANCHEZ-MARTINEZ: So
cycle time, just to review– it’s the time it takes a bus to complete a full round trip. It goes all the way one way,
has to wait at the other end to recover the schedule,
comes back, waits to recover, and is ready to
begin the next round. So that’s a cycle. AUDIENCE: Since you have
[INAUDIBLE] going on, if you had 4.1 buses,
then you use a cycle time. Then obviously,
you can’t do that? [INTERPOSING VOICES] GABRIEL SANCHEZ-MARTINEZ: So
you would need five buses– AUDIENCE: Yeah. GABRIEL SANCHEZ-MARTINEZ:
–if that’s what you’ve got. Or you would have to do a
trade-off with reliability if that were to happen.
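For reference, the fleet-size relationship implied by this exchange (not written out in the lecture) is the standard one: with cycle time $C$ and headway $H$,

$$N_{\min} = \left\lceil \frac{C}{H} \right\rceil, \qquad \text{e.g., } \frac{C}{H} = 4.1 \;\Rightarrow\; N_{\min} = 5 \text{ buses.}$$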
AUDIENCE: I think most of my questions were on this very last
couple of questions. GABRIEL SANCHEZ-MARTINEZ: Yeah. AUDIENCE: We were aggregating
a bunch of data for– [INAUDIBLE] you did it
across both directions and then asked,
how does it change when you would like to evaluate
each direction separately in layover time? GABRIEL SANCHEZ-MARTINEZ: This
is the penultimate question, correct? AUDIENCE: Yeah. GABRIEL SANCHEZ-MARTINEZ:
So that’s the hardest question
on the assignment. AUDIENCE: OK. GABRIEL SANCHEZ-MARTINEZ:
It is a challenge question because there are different
cases that you have to analyze. That’s maybe the hint, right? There are some cases. And for each case,
there is a probability that that case will occur. AUDIENCE: Yeah. GABRIEL SANCHEZ-MARTINEZ: And–
let’s see if this starts– there’s a probability
that it will occur and then a consequence, or
something happens in that case. So you have to look at each case
and then aggregate the cases together, if that makes sense. AUDIENCE: Yes. GABRIEL SANCHEZ-MARTINEZ: We’re
taking questions for Assignment 1, which is due on Thursday. Any other questions? AUDIENCE: That’s it. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:
It is due at 4:00, so at class time essentially, yeah. Actually [AUDIO OUT] I said 4:05, so you
have five minutes. AUDIENCE: Can you [INAUDIBLE]
what assumptions there are [INAUDIBLE]? GABRIEL SANCHEZ-MARTINEZ:
In what question? AUDIENCE: When you
said it seems to be the reasoning or assumption
about the schedule [INAUDIBLE]? Which metric do you use? Based on the data, which [INAUDIBLE]? GABRIEL SANCHEZ-MARTINEZ: Yeah,
so that’s Question 3, correct? AUDIENCE: Yeah. GABRIEL SANCHEZ-MARTINEZ:
So I can’t really explain. I can’t give you the
answer to the question. So what I’m looking for
there is your intuition and your understanding of why
you would pick which statistics from Question 2, where it tells
you calculate all these things. Now I’m saying pick
from those statistics what you would use
for t and for r. And you may want to combine
different statistics for the computation of r. Yeah? AUDIENCE: [INAUDIBLE]
multiple valid responses but– GABRIEL SANCHEZ-MARTINEZ: Yes,
some more valid than others, but some that are
definitely invalid and some that are almost 100%
valid but not 100% valid. So there are several
correct answers, and some that are
very good answers because you can justify
the choice of the statistic conceptually. Yeah. Any other questions
on Homework 1? I can take some more questions
after class, if that’s OK. So we had a snow day. I hope you had a good time, or at least could use it to catch up. So the schedule is a
little different now. I’ve posted an update about
that on Stellar (class site). There’s a new syllabus. And we’re going to do
some [AUDIO OUT] different [AUDIO OUT]. You may remember that we have
three introductory classes on topics of [INAUDIBLE]. And then, we had modal characteristics and roles. And then, [AUDIO OUT]. We’re going to
shuffle a little bit. [AUDIO OUT] Microphone working? So because the second assignment
is on data collection, we’re going to cover that today. And we’re going to give
you that homework today, so that you can get started
on the data collection side. Then, we’re going to cover some
of the short-range [INAUDIBLE] of planning concepts. Nema is going to do that– Nema Nassir. You might recall him from
the previous lecture. And then, we’ll finish
with [INAUDIBLE] and costs on March the 2nd, OK? So remember, there’s no
class on Monday the 21st. AUDIENCE: You mean Tuesday? GABRIEL SANCHEZ-MARTINEZ:
Sorry, yes, Tuesday. I think there’s no class on Monday. And then Tuesday, there are classes, but it’s Monday’s schedule. So we don’t have class. Thank you for bringing that up. OK. I’ll leave Homework 2 for when
we finish with the lecture. But I’ll distribute it later. So let’s just get
started on that. So data collection techniques
and program design– that’s the topic for today. Here’s the outline. So we’re going to cover a
summary of current practice quite quickly. Then, we’re going to talk about
data collection program design process, the needs, the data
needs, the techniques for data collection, the sampling. We’re going to get
into the details of how we get sample sizes. And we’re going to finish
with special considerations for surveys and
surveying techniques. So where are we? Where is the transit industry
in terms of data collection, and sampling, and these things? Largely, there’s
been a transition from manual to automatic
data collection. As you might imagine, with
the internet of things, and sensors, and the
internet, and wireless, it used to be that
if you wanted to have statistics on your
running times, you had to send people out. We call those people checkers. And those checkers would
have notebooks and record running times, and number
of people boarding, and these things. Nowadays, with the modern
systems, especially the modern systems, we have
several sensors and types of sensors that collect
some of that data for us. So we’re going to
cover both approaches. Manual data collection to supplement automatic data collection. And if you happen to be
consulting for a developing country that is working with a
system that has not yet brought in automatic data
collection technologies, it’s also useful to know
all about the manual design and manual data
collection process. [AUDIO OUT] Students who took this
class and ended up working in large consulting
firms have gone off to help countries put
in new transit systems. And one of the first
things they have to do is go back to these slides and see
what the plan is going to be, and how many people you
need, and how much it’s going to cost. So very useful topic. So as I said, there’s
automatic data collection. There’s manual data collection. There’s sometimes a mix of
data collection techniques. Often, what happens is
that we just send people out and collect data. Or we just extract a sample of
automatically collected data. And we don’t really think about
sampling, and the confidence interval, and how sure
are we of that result that we’re going to influence
policy or make decisions that will affect service. How sure are we of those? So, statistical validity. Often, there’s an inefficient use of data. And ADCS, which is Automatic Data Collection Systems– we’ll use that abbreviation throughout the course– presents a major opportunity
for strengthening data to support decision making. We’ll talk about
how that happens. Let’s first compare manual
and automatic data collection. So what happens with
manual data collection? You hire people, as I said. You hire checkers. So initially, there’s
no setup cost. There’s a low
capital cost to that. But there’s a high marginal
cost because if you want to collect more data,
you have to hire more people. Does that make sense? If you want to bring in an
automatic data collection system, you might have to
retrofit all your buses with AVL sensors. And that’s going to
cost you initially. So that’s a high
capital cost relatively. But low marginal cost– once
you have those systems in place, they keep collecting
data for you. And it’s almost free. You do need some maintenance
on this equipment. But compared to
manual data collection, you have low marginal cost. Because of that marginal
cost difference, it tends to happen that when
you have manual data collection, you only pay checkers
for small sample sizes– just what you need. Whereas, once you put in
automatic data collection systems, they keep
collecting data. So you get much bigger data sets. Bless you. OK, in both cases, we can
collect data and analyze it for aggregate analysis
and disaggregate analysis. So you might want
passenger-specific data on things. Or you might want things
like just averages and aggregate things,
total number of passengers using the system. And when you’re doing
manual data collection, you can look at
quantitative things, things you can measure and count. Or you can also observe
things qualitatively. One example that I
saw in a recent paper was considering the
[? ridership ?] by students in some country. And they didn’t ask people if they were students. They were looking at people’s– more or less, are they young? Are they carrying a backpack? And that would be the labeling for “student.” So that’s something that a
sensor might not do so well. Although now with machine
learning, who knows? But we haven’t seen that yet. So you can do
qualitative observations when you’re doing
manual data collection. Manual data collection
tends to be unreliable, especially when people
aren’t very well trained and when you have a group of
different people collecting data. So each person might
have different biases. It’s hard to reproduce the
exact bias across persons. With automatic data
collection, you do get errors. And often, they are not corrected. But if you do correct them, and you estimate those biases and adjust for them, you can end
up with a better result. Because of the small
sample sizes in manual data collection, you tend to
have to have limited spatial and temporal coverage of data. So for example, if you’re
interested in ridership in the system, it’s unlikely
that you will cover ridership on holidays for
[INAUDIBLE] system because there are
only a few holidays. And usually, you’re mostly not interested in holidays. So chances are, you won’t have
data collection for holidays. Whereas once you install
automatic data collection systems, they keep
collecting data. So you get data at midnight
on President’s Day. So they’re always on. They’re always collecting data. Manual data needs to be checked,
cleaned, analyzed, coded, and sometimes put into systems
before they can be analyzed. That could take a while. You need to hire
people to do that. Whereas automatic data
collection systems often send their data to databases
in real-time or very close to real-time. [INAUDIBLE] you can start
analyzing things the next day. So you arrive in the morning to
your desk at a transit agency, and you have performance
metrics for yesterday. So you wouldn’t be able to do
that unless you have people working very hard if
you’re using a manual data collection system. When we talk about automatic
data collection systems, there are many. But there are three types that
we refer to very, very often. And so the first one is AFC, Automatic Fare Collection Systems. This is your fare box or your fare gates and your smart card, your CharlieCard. You’re in Boston. You tap to enter the bus. And you tap to enter
the subway system. Increasingly, it’s based
on contactless smart cards. And those contactless
smart cards have some sort of
RFID technology with a unique identifier. When you tap that
card to the sensor, the sensor will read
that identifier. And it’ll do things like
fare calculation for you. But that record gets
sent to a database. And it’s there for people
like us to analyze and make good use of it for planning. So it tends to provide entry
information almost always. In some systems, like the
Washington, DC metro or the TFL subway, you tap in
to enter and exit. So you have both origin
and destinations. And if you always
have the systems on, then you have full spatial
and temporal coverage of all of the use of the system
at an individual passenger level. So very disaggregate–
sorry about that. Traditionally, these
systems are not real-time. So it might take a while
for those transactions to make it to the
data warehouse, where they’re available for
planners to analyze it. The fare calculation in some systems is in real-time. In other systems, like the CharlieCard, the stored value that you
have is stored on your card. So it may take a while if
you tap at a bus for that bus to go to a garage
and get probed– and for the data that has
been stored in that bus to be extracted from that
bus to the central server. There is a move– and we’ll talk more
about this when we get to fare policy and technology– towards using mobile
phone payments and using contactless
bank card payment systems. And those systems often
do the full transaction over the air in real-time. So we’re starting to
look at the possibility of having all this data
in real-time or almost in real-time. But it’s not there yet. AUDIENCE: [INAUDIBLE] can I
ask a question about that? GABRIEL SANCHEZ-MARTINEZ:
Yeah, of course. AUDIENCE: In terms
of smart card, where this balance is
stored on the card– GABRIEL SANCHEZ-MARTINEZ: Yeah. AUDIENCE: –if one can figure
out how to hack that card– GABRIEL SANCHEZ-MARTINEZ: Yeah. AUDIENCE: –then
what can [INAUDIBLE] fares through an elaborate
technology that I couldn’t do and most people couldn’t do. But maybe some could. GABRIEL SANCHEZ-MARTINEZ:
Yeah, definitely. So the Charlie Card system
is an example about– actually, MIT students
were the first to hack it. AUDIENCE: I’m not surprised. GABRIEL SANCHEZ-MARTINEZ:
So it’s older technology. It used a low-bit
encryption key. That’s a symmetric
encryption key. And they just brute forced it. They figured out what the key was. They happened to use the
same key for every card. So once you broke that key,
you could take any card. And with the right hardware,
you could add however much value you want to that card. And– AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:
Yeah, yeah, exactly. We don’t think it’s
been a major problem. AUDIENCE: But it happens. GABRIEL SANCHEZ-MARTINEZ:
I haven’t seen MIT students selling special MIT cards. But that would be
criminal, of course. Yeah, so newer systems have
much stronger encryption. And they have different
encryption keys for each card. And certainly, when we’re moving
towards contactless bank cards, we’re talking about a much
more secure encryption. It’s your credit card
that you’re using to tap or your Android or Apple Pay. AUDIENCE: Account
based [INAUDIBLE]. GABRIEL SANCHEZ-MARTINEZ:
Account based– and essentially, what you
have is a token with an ID. And then, the balance is not
even stored on your card. The account server is handling
the balance and those things. So much more difficult to break. Yup. OK, AVL systems, or Automatic
Vehicle Location systems– so these are systems that
track vehicle movement. So for bus, they tend
to be based on GPS. You have GPS on a bus, on the
top of the bus, a little hub. And it collects data every five
seconds or every 10 seconds. And these positions might
get sent either in real-time, or maybe they get stored
on the onboard computer and then are extracted when
the bus reaches the garage. So just GPS– sophisticated
AVL systems for bus also have gyroscopes to do
inertial navigation and dead reckoning, especially when
the GPS precision drops. And that happens especially
with the urban canyon effect. If you have tall buildings,
GPS signal bounces around. The dilution of precision messes
up the position of the bus. Or maybe you’re
entering a tunnel, and you want to
continue to get updates of positions inside the tunnel. So this is a
system that kicks in temporarily and interpolates
positions and figures out how the bus is moving. For a train, it’s usually
based on track circuits. So we’re going to talk
more about track circuits. But essentially, a
track knows if a train is occupying that segment or
not occupying that segment. And there are often some sensors
that read with RFID technology the ID number of a car. And sometimes, you have a
sensor in the front of each car and [AUDIO OUT] each car. And so a computer will look
up the sequence of readings and follow track circuits
as they are being occupied and unoccupied– and in that manner, track
trains throughout the system. These systems were put in
place mostly for safety to prevent train crashes. And for that, you need to know where each train is. They are available in real-time. They were designed
from the beginning to track vehicles in real-time. So that’s what we have. I guess what’s newer
is that now, we’re collecting them and keeping
them in a data warehouse so that we can
analyze running times. AUDIENCE: [INAUDIBLE]
these systems have benefit to the consumer? GABRIEL SANCHEZ-MARTINEZ:
They do. And that’s the newest
thing that has happened– that nobody thought
about consumers when they were put in place. So yeah, we are
talking about tracking, knowing how many minutes
I have to wait for my bus, for example. And those things are pushed
through a public API, so that if I’m a
smartphone app developer, I can go ahead and pull that next-bus data and make an app. And so people can download it,
and they know how many minutes they have to wait. Yeah, so definitely. So we have seen a lot of AVL
being pushed in that manner. We have not seen so much AFC
data or APC data being pushed. Obviously, you wouldn’t
want all the details of AFC being pushed. But you might want to know
how crowded is my next bus, or how crowded is my next train. And you might actually
alter your decision whether to wait
for a crowded train or walk a longer time
based on that information. So that’s coming. I think, in the next
few years, that’s going to start happening. So passenger counting– many
different technologies exist. For bus, we tend to have these
optical sensors in the back. You might see them if
you pay attention– broken beam sensors. They look like two little
eyes with two little mirrors on each door. And so when you cross
the beams, if you press one beam first
and then the other, that sensor will know– is a person coming into the bus? Or is a person exiting the bus? And you have that at each door. And it counts those beams
going in and going out.
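To make that concrete, here is a minimal sketch of the broken-beam direction logic– the event format is hypothetical, since real APC firmware is vendor-specific:

```python
# A minimal sketch of broken-beam direction detection (assumed event format).
def classify(first_beam_broken: str) -> str:
    # Outer beam broken first: person moving inward (boarding).
    # Inner beam broken first: person moving outward (alighting).
    return "boarding" if first_beam_broken == "outer" else "alighting"

events = ["outer", "outer", "inner", "outer", "inner"]
ons = sum(classify(e) == "boarding" for e in events)
offs = len(events) - ons
print(ons, offs)  # 3 boardings, 2 alightings
```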
And often, this is slightly inaccurate. So you might get more boardings
and alightings for a given trip. So at the end of
a trip, whatever remains in terms of imbalance
between boardings and alightings gets zeroed out. And the error is distributed
throughout that trip that was just run. And often, you still have
to do some error correction after that. But it’s a way of counting
people getting on and off. And that’s useful to get
how many people are riding the system and also
the passenger miles– the passengers multiplied
by distance, which is often a required reporting element
in things like the NTD, the National Transit Database. So for rail systems,
we have gates that count how many
times they open and how many times they close. So you might have that
kind of counting in rail. You also have
video-based counting– so camera feeds that
can be hooked up to a system that will
essentially track nodes moving inside that frame. And you can count things
that cross a certain line, for example. And you could do
that to count flows. And then for train, we also
have the weight systems. So this is only in trains. The braking systems in
trains apply braking force in proportion to the
load on each car. So if you have a
very heavy car, you need to apply stronger braking
force than in a car that is almost empty. If you don’t do that, then
you apply a lot more force per weight on the lighter car. That car is going to be the
one pushing the other cars or pulling the other cars
through the coupling. And that will eventually
break the [INAUDIBLE] at a faster rate. So what you want is,
each car to slow down at the same rate by itself
as much as possible. And for that, you need to brake
in proportion to the weight. And therefore, you have
these weight systems. They used to just do that. And more recently,
we hooked them up to a little
storage device that keeps track of the
weight and maybe Wi-Fi, so that each time it reaches
a station or the terminal, it sends the data off. And we might have a rather
somewhat [? unprecise ?] idea of how many people
are in the car just based on an average
weight of a person. And these are traditionally not
available in real-time. [INAUDIBLE] you have questions? Yeah? AUDIENCE: You could
also just reconcile it with the other system, right? GABRIEL SANCHEZ-MARTINEZ:
Of course, yeah. AUDIENCE: So if you have– [INTERPOSING VOICES] GABRIEL SANCHEZ-MARTINEZ: Yeah. AUDIENCE: –people early
can transport to get on to. GABRIEL SANCHEZ-MARTINEZ: Yeah. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:
Yeah, definitely. Yeah. And that’s cutting edge research
that’s happening right now. How do you do data fusion
and merge different systems? They all have errors. And how do you
detect when one is more erroneous than the other? And how do you mix
these data sources to get the most precise,
not just loads, but paths within a network and
things like that. Yeah. So any questions on these three
very important automatic data collection systems? AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ: Yup. AUDIENCE: So if
there [INAUDIBLE] AVL, what kind of reason
can be [INAUDIBLE]? GABRIEL SANCHEZ-MARTINEZ:
So the question is, why might some of these
technologies produce errors? And in particular,
you’re asking about AVL. So each of these has
a different behavior. And within each of these
categories of technologies, each vendor’s system might have
specific things that happen. With AVL, the most
common thing is end-of-route problems–
detecting when a trip actually begins and ends. So AVL systems,
you have this GPS coming in every five seconds. Depending on your chip set, you
might get it more frequently than that. But you also actually
sometimes hook it to the doors. So if the door is opening, you
say, well, I must be at a stop. And therefore, let me
find which one is closest. So there are ways to correct it. But when you get to
the end of the route, it’s not clear always–
have you finished your trip? Or rather, are you
starting your trip already? So maybe if the terminal is at
the same place on the trip– the previous trip
ends at the same place that the next trip
begins, there might be a time where the doors
open and close various times. And the trip isn’t
ready to leave yet. And so you really have to
wait to see the bus leaving that terminal and moving. Sometimes, there
are false starts. So maybe another bus comes
along, and it needs that space. So the driver moves the
bus a few meters forward. And the system thinks
my trip has started. And then, when you’re
looking at aggregate data, you’re looking at, say, running
times at the trip level. You see these outliers
with very long times. And if you were to
plot them by stop, you see that the link
between the first stop and the second stop is
sometimes very high, 15 minutes. And so you can throw those out. Or you can do some interpolation
or imputation of data. Some systems that care
very much about that will purposely
place the terminal stops sufficiently far
apart to prevent that from happening because
it is a problem. And this data is crucial to
planning service and figuring out how much resource you’re
going to put into each route. So yup. AUDIENCE: For tap cards,
[INAUDIBLE] and metros, some of them we have
to tap out to exit. Is it because of variable [INAUDIBLE]? GABRIEL SANCHEZ-MARTINEZ: Yes. AUDIENCE: But in some systems,
it’s still a flat fare. You still have to tap out. Is the reason behind that
mostly data collection? Or is there anything
[INAUDIBLE] you’re going to still have to
tap out [INAUDIBLE]? GABRIEL SANCHEZ-MARTINEZ:
So yeah, no examples of it come to mind. You might know one. AUDIENCE: MARTA? GABRIEL SANCHEZ-MARTINEZ:
OK, I haven’t visited. So yeah, data collection
might be a reason to do that. But I’ll have to get back to
you on why MARTA did that. But yeah, most systems that
have controls in and out are for fare policy
reasons and not for data collection reasons. We’re starting to see more
interest in data collection and in investing on
these technologies just for data collection. So maybe– but I’ll have to
check and get back to you. AUDIENCE: You mentioned some
systems separate their depots to not confuse the end
[? from the start point. ?] [INTERPOSING VOICES] GABRIEL SANCHEZ-MARTINEZ:
Their terminal stops, yeah. AUDIENCE: What are
some examples of those? GABRIEL SANCHEZ-MARTINEZ: TFL
will do that in London, yeah. Yeah, so they’ll monitor this. And if they see that
this is occurring often, they will separate
the stops a bit. And the reason they do
that is because they have people whose job
it is to impute data when it’s incorrect. So if they don’t do that, and
the system is consistently producing bad data,
then that means they’re going to have to spend
human resources on correcting that data. So at some point,
it’s just easier to move the stop a little bit. It doesn’t have to
be a long distance. AUDIENCE: Got it. GABRIEL SANCHEZ-MARTINEZ:
It does not have to be the same stop– just make it far enough apart that the geofences can be told apart from each other. All right? AUDIENCE: Really small-scale
data at EZRide, who I work for– actually, you could
see real-time bus loads [INAUDIBLE]– GABRIEL SANCHEZ-MARTINEZ:
Oh, interesting. AUDIENCE: –which was actually
helpful if you’re dispatching, and you know a bus is
getting full of people. [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:
Yeah, for real-time control. [INTERPOSING VOICES] AUDIENCE: But the
terminal at our station had a drop-off point
and a pick-up point. The drop-off point was
before layover [INAUDIBLE] was after for this exact
reason to make sure that it will go through
the drop-off point, reset, until people
get off of it. GABRIEL SANCHEZ-MARTINEZ: Yeah. Yeah, so it happens. [INTERPOSING VOICES] AUDIENCE: Definitely. [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:
That sounds about right. OK, if there are
no more questions on the three very important
categories of automated data collection systems,
let’s talk a little bit about the data collection
program design process. So this comes from before
automatic data collection. And nowadays, we think a
little bit less about this. But it’s still important. So if you do need to
collect some data, there’s a structure that you
can follow to do it properly and to make sure that you
collect data efficiently, so that you don’t spend too many
resources on data collection and that you can answer
your policy or your planning questions. So based on your needs and
the properties of your agency, I say here, determine
property characteristics. That’s a North American term. A property is an agency. So if you see that,
that’s an agency. So based on the characteristics
of the service you’re running and your data needs, you can
select some data collection technique. We’ll get into what
some of these are. Then, you can develop
route-by-route sampling plans based on how variable
the data is in each case. And you can determine how
many checkers you need. A checker is a person who
goes out and collects data. And then from that, the cost– so human resources. It’s a planning exercise. And what we do usually is that
we conduct a baseline phase. So that’s the first time
you go out and collect data. You don’t know much
about what you want to collect data on. So it might be OD matrices, or loads– the people getting on and off. So you have to go out
and do a bigger effort. And that’s called the
baseline phase effort. Once you’ve done that and you’ve
established some tendencies, you might want to monitor
that to see if it changes. So then, you do a lighter weight
data collection effort, where you go out less frequently and, using fewer resources, you sometimes collect the same thing. Or sometimes, you observe
something else that is related or can be correlated with
what you really want. And then based on a
relationship between the two, you can estimate
what you really want. So you can monitor
what you collected. And then, if you
detect that there’s been a trend or a change, and
you need to investigate it further, you might go ahead
and repeat the baseline phase to increase your accuracy. So one of the catches of
this is that to determine sampling plans, to
determine required sample sizes to achieve some
confidence interval, you need to know how
variable your data is. And if you haven’t collected
it yet, you don’t know. So you might have some default
values that you resort to. And we’ll get to that
later in this lecture. But you might also
do a pre-test, where you send some
people out, and you collect some data
to really start to get a sense of
how variable is it, and how big will my
sample requirements be, and how much will it
cost for me to do this. So this is the process
that you might follow. And there are different
data needs by the question that you’re trying to answer. So one way of looking
at that is, are you collecting things that
are for specific routes, or for specific route
segments, or at the stop level? Or are you using more aggregate
system level data collection? Are your questions
more system level? So system-level things
are more about reporting, and they might be tied to
things like federal funding. Whereas route-level things
and stop-level things are more important for planning. So when we talk about route
and route segment level, we’re looking at things like
loads at the peak load points or at some other key points. How many people are in the bus? The running time
is by the segment to do schedule that
has time points or maybe end-to-end to
your operations plan. Schedule adherence– are
these buses running on time? Or are my schedules
not realistic? Total boardings or
revenue, two things that are highly correlated–
so number of passenger trips. Boardings by fare
category– so you might say, well, I want
boardings, but I want to know how many
seniors are using this, and how many students are using
this, and how many people are using monthly passes,
and how many people are using pay-per-ride. So you have different
fare categories. And you might want to
segregate the data by that. You might want passenger
boarding and alighting by stop. So that’s what
APC would give you if you have an automated system. But you might also use a ride
checker, who sits on the bus and counts people
boarding and alighting. Transfer rates between routes– maybe you’re looking at changing
service so that people don’t have to transfer. Passenger characteristics
and attitudes– this usually requires some degree
of survey, where you ask people things,
passenger travel patterns. At the system level,
we have things like unlinked passenger
trips, passenger miles, linked passenger trips. These are at the whole-system level. So sometimes, you do route
level or route segment level analysis, and
then, you aggregate to get the system-level things. That’s usually how you proceed. But the requirements in
terms of how many of these you have to sample
might be different. So if you want to achieve a
certain accuracy at the system level, you don’t need
to achieve the accuracy for each of the routes
that are in that system because you might have– so if you want to
say 90% confidence in some system-level
data element, you might only need 80% or
70% at the element level. And once you bring
those altogether, you achieve the
90% that you need. So data inference, I talked
about how sometimes we can infer items if we don’t
observe them directly. So from AFC with AFC is a
low-fare collection system, we have boardings because
people are tapping into the bus or tapping into
the subway system. And if we have APC, we
count people getting on. So we can look at total
number of boardings that way, if that makes sense. That’s pretty direct. Sometimes, you want to correct
for errors in the APC system, or you might have things like fare evasion affecting that number– like how it goes
from AFC to how many people were actually in that bus. How many people
actually boarded? So you might do a little
bit of manual surveys to check what that relationship
is and apply some correction. For passenger miles,
we need to know how many people are on the bus between each stop pair. So AFC gives you boardings
and only boardings. APC gives you ons and offs. If every bus had APC, then you
could calculate passenger miles directly. But often, you have systems
where only a portion of the fleet has APC. So maybe 15% of your fleet
is equipped with APC. And from that, you get
the sample OD matrix. And you can use that
OD matrix to convert from boardings only to the
distribution and the ons and offs at all bus routes. And from that, you can
get passenger miles. Or you might just
use your buses that have APC, if that suffices
for your data collection unit. Same thing with peak
point load– similar idea. The AFC only measures boardings. So it doesn’t give you the
peak point load automatically. But from APC, you could get it. And if you can establish a
relationship between boardings and the peak load
point, then you can use that model to
infer the peak load point from just boardings. So this is a key thing to
be efficient about data collection.
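To make that concrete, here is a minimal sketch of this inference with hypothetical numbers– the data and the simple conversion factor are assumptions, not from the lecture:

```python
import numpy as np

# Trips with APC give both boardings and the observed peak load,
# so we can fit a simple conversion factor between the two.
apc_boardings = np.array([42, 55, 38, 61, 49])
apc_peak_load = np.array([28, 37, 24, 43, 33])
factor = apc_peak_load.sum() / apc_boardings.sum()

# For trips where AFC gives boardings only, the model fills in peak load.
afc_boardings = np.array([47, 52, 40])
print(np.round(factor * afc_boardings, 1))  # estimated peak loads
```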
Any questions on this idea? Yup. AUDIENCE: So to get passenger miles, you’re also going
to have a GPS system as well to know the distance? Or are we just basically
[INAUDIBLE] this is the routing [INAUDIBLE]? GABRIEL SANCHEZ-MARTINEZ: Both. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:
Yeah, both. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:
What tends to happen is that the APC, it’ll come in. And it’ll say, at this stop,
this many people boarded. This many people alighted. So you have other
layers in your database that say where the bus is and what the distance is between stops, at the stop-pair level. So you then essentially
know how many people are riding on each link
and how long that link is, and you multiply the two. So yeah, passenger miles.
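A minimal sketch of that multiplication, with hypothetical link data:

```python
# Load riding each inter-stop link times that link's length, summed.
link_loads = [12, 18, 25, 21, 9]        # passengers riding each link
link_miles = [0.4, 0.6, 0.5, 0.8, 0.3]  # link lengths in miles
passenger_miles = sum(l * d for l, d in zip(link_loads, link_miles))
print(passenger_miles)  # 47.6
```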
Yeah, more questions. AUDIENCE: Yeah, for these checks that are going on, like the more manual checks– GABRIEL SANCHEZ-MARTINEZ: Yeah. AUDIENCE: –I know
often, there’s fare checkers who are coming in to check. GABRIEL SANCHEZ-MARTINEZ:
That’s right, yeah. AUDIENCE: Do they also use
that data to cross-reference the passenger counts? As in, [? this ?]
person gets on, and they check everyone’s
fares to [INAUDIBLE]. GABRIEL SANCHEZ-MARTINEZ: Yeah. AUDIENCE: They then know
exactly how many got on the bus. GABRIEL SANCHEZ-MARTINEZ: Yes. Yeah. AUDIENCE: Do they use that data? GABRIEL SANCHEZ-MARTINEZ:
Yeah, they can. In the APC, sometimes
there’s reliability problems, especially when
vehicles are very full because
sometimes, people will block the sensor by the door. Actually, people like
to stand by the door all the time, even when
the bus isn’t full. And that kind of affects APC. You might notice
this on the 1. If you take the 1– so yeah, you sometimes have a
little bit of a manual effort to figure out. Just learn about
your APC system, and what are the errors,
and when do you see them. It often happens that you
have more variation when you have very high loads. And that’s when APC
is least accurate. So it all comes together. Yeah. Questions on the back? I think I saw a question. No? AUDIENCE: Yeah, I
noticed that in Chicago, when the bus would be crowded,
then people get off the bus. They let people off– GABRIEL SANCHEZ-MARTINEZ:
That’s right. AUDIENCE: –and then back on. GABRIEL SANCHEZ-MARTINEZ: Yeah. Yeah. Those get double-counted. But somebody might be by
the door just blocking the two little sensors– [INTERPOSING VOICES] GABRIEL SANCHEZ-MARTINEZ:
–the two little eyes. And that’s it, no records
of people getting on or off. So if you’re doing a little
data collection, as I said, we use checkers. And actually, your
second assignment, you will be checkers of some kind. The typical checkers, which you won’t be in this assignment, are ride checkers
and point checkers. So a ride checker sits
in the vehicle and rides with the vehicle. And the typical thing that these
ride checkers are looking at is, how long did it take
to cover some distance? So what was the running
time for that trip? And also, people
getting on and off– so they act as APC essentially. And they act as AVL. So AVL and APC
together might replace most of the functionality
of a ride checker. Although a ride checker often
can conduct an onboard survey, asking passengers about
where are they going, or their trip purpose, or things
related to social demographics, which are qualitative and cannot
be collected with the sensors. Point checkers stand
outside of the vehicle. They stay at a specific
place, and they can look at headways
between buses– so how long did it take
between each bus to come by, and how loaded were these buses? So if you’re interested
in the peak load point, and you know where the
peak load point is, and you just want
to observe, measure what are the loads
of the peak load point, then you can
just station a point checker at the peak load point. And if that person
is trained, they’ll be able to more or less say how
many people are in the vehicle from looking at the vehicle. With automated data
collection systems– yeah, with a fare system, we have passenger counts. We have transaction
data, which is very rich. It will tell you not
only that somebody is entering or exiting,
but also how much they’re paying, sometimes information
about the fare product type, which might help you
infer if this person is a senior, or a student, or a
frequent user, an infrequent user– so many things that are
very useful for planning. And we’ll get to play with some
of these later in the course. And then, there’s Automatic
Passenger Counters, APC. So as more and more systems
switch to automatic data collection, we still use
some manual data collection, but not in the
traditional sense. Now, we reserve those
resources for things like surveys about social
demographics and other things. And we also carry out
web-based surveys, which would have some biases. But if people
registered their cards, and you have email
accounts, you can maybe send a mass email to everyone
and carry out surveys. The MBTA does that. Maybe some of you
are in the panel of people who are e-mailed
every now and then. Is anybody in that panel? No hands. I’m in that panel. But I know somebody must be. So yeah, they send an email, and
they ask about your last ride. And they say, where
did you start from? What were you doing
this trip for? How long did you have to walk? Are you happy with the system? Was your bus on time? Yeah, things like that– how satisfied are you? It’s a survey with
qualitative questions that you couldn’t
collect automatically. It’s [INAUDIBLE] seeing
things about your experience outside of the bus, which
there are no sensors for. All right, sampling
strategies– a bunch of different ones
and the simplest one is called simple random
sampling– very, very simple. So when you have
simple random sampling, what happens is that
every trip, if you’re looking at surveying trips,
for things like how many people boarded this trip– let’s take that as an example. Then, if you’re using
simple random sampling, every trip has equal likelihood
of being picked and being surveyed. So if you go through
your process, and you determine that
you need to observe 100 trips to get an average reliably, and you’re going to use that to plan something, then you need to
look at 100 trips. So if you use simple
random sampling, you take your schedule, and
you randomly pick 100 trips. And that’s your sample. Those are the ones that you
send people out to collect data.
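A minimal sketch of simple random sampling over a schedule– the trip IDs are hypothetical; the point is that every trip has the same probability of being selected:

```python
import random

random.seed(42)  # reproducible for illustration
scheduled_trips = [f"trip_{i:04d}" for i in range(1, 1201)]  # assumed schedule
sample = random.sample(scheduled_trips, 100)  # 100 trips, without replacement
print(sample[:5])
```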
Now, there’s a little bit of a problem with that. It’s not the most
efficient method because if you’re going
to send someone out, and that person is going to be
active, and require some time to get to the site and
some time to return, then once they’re out
there, you want them to collect as much as they can. So that’s not simple
random sampling. That’s cluster sampling. Before we get to that
systematic sampling– so typically, instead of
picking randomly, we say, OK, we need to get
10% of the trips. So let’s just make it
such that we count. And maybe it’s every five trips, we have to survey one. So now, it’s evenly spaced. And this is useful
for some things. One example is weekday,
picking the weekday that you’re going to survey on. So the technique that is often
used is sample every six days. Why would that be? Yeah. So if you do it every seven,
then you always have a Monday. And that’s going
to get some bias if Mondays happen to
be low ridership days or high ridership days. So if you do every sixth day over a year, you have a good sample of every weekday. So that’s an example
of systematic sampling. But you still have
that issue that it might not be the most efficient.
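A minimal sketch of the every-sixth-day trick– the start date is arbitrary; the point is that because 6 and 7 are coprime, the sampled dates rotate through all days of the week, whereas every-7th-day sampling would always land on the same weekday:

```python
from datetime import date, timedelta

start = date(2017, 1, 2)  # arbitrary start date (a Monday)
sample_days = [start + timedelta(days=6 * k) for k in range(14)]
for d in sample_days:
    print(d, d.strftime("%A"))  # Monday, Sunday, Saturday, ...
```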
Cluster sampling, sometimes it’s more efficient once you send out
a person to collect data to do as much as possible. And you survey a cluster. So one example is, if
you’re distributing surveys to passengers, and you need
to distribute 100 surveys. If you do 100 simple
random sample, then those people might be in
different parts of the system. And one might be
the first person you see getting off
at South Station. And then another
one by me might be the first person you see getting
off at the Kendall station. So that’s very inefficient. So a cluster might be
everybody on board a bus, and that will get a
bunch of people together. However, it’s not as efficient
statistically to do that. So you can’t just add
up to 100, and you’re done because there might be some
correlation within the people riding that vehicle
that they will tend to answer in a similar way. So you might need to
increase your sample size when you use this technique. But still, you might have a
more efficient sampling plan. Then, there is the
ratio estimation and conversion factors. We gave examples
of this already. This is in the context
of baseline phase and then monitoring phase. So you start out with
a baseline phase. And in the baseline
phase, you collect the thing you really
want and something that is very easily collected
with lower resources. And you make a model
of the thing you really want as a function
of the thing that is cheap and easy to collect. And then, on the
monitoring phase, you only measure the thing that
is cheap, and easy, and quick. And you then use the model to
estimate what you really want. So converting AFC boarding
to passenger miles, we give an example of that. We’re converting
loads at checkpoints to load somewhere else. So maybe only measure
loads with a point checker at the peak load point. And you have some relationship
to convert those loads to loads at other key transfer
stations, as an example.
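A minimal sketch of ratio estimation across the two phases, with hypothetical numbers– in the baseline phase you observe both the cheap quantity (AFC boardings) and the expensive one (passenger miles, from the APC-equipped portion of the fleet) and fit a conversion factor; in the monitoring phase you only measure the cheap one:

```python
import numpy as np

baseline_boardings = np.array([310, 280, 450, 390, 520])    # cheap to collect
baseline_pax_miles = np.array([930, 850, 1290, 1150, 1600])  # expensive
ratio = baseline_pax_miles.sum() / baseline_boardings.sum()

# Monitoring phase: only boardings are measured; the model fills in the rest.
monitoring_boardings = 415
print(f"factor {ratio:.2f}, estimate {ratio * monitoring_boardings:.0f}")
```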
And then, stratified sampling– so one of the things that
determines how big of a sample you need is the
variability in the data that you’re collecting. So correlation,
when you’re looking at a whole system with multiple
routes or multiple segments– maybe when you
look at one route, there’s some variability
of running times. But they have a central
tendency as well. And when you’ve
got a second route, you have also some
variability and a different central tendency. So if you bunch all the data together, some of the variability across data points in your data set is going to be the inherent
variability of each route. And some of it will be
systematic– the differences between both routes. So if you do a
simple random sample, and you don’t separate
the systematic variability from the inherent
variability, then you’re going to get a
wider variability. And you will require
a bigger sample size. Stratified sampling
is an approach where you determine sample sizes
for each of these separately. And it’s more efficient
if you do it well because you eliminate
the need, or you at least reduce the need, to
collect data for the sake of the systematic differences
between different parts of the system.
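A minimal sketch of why that works, with hypothetical running times for two routes– pooled, the variance picks up the systematic between-route difference; within each stratum, only the inherent variability remains:

```python
import numpy as np

rng = np.random.default_rng(0)
route_a = rng.normal(30, 3, 500)  # route A: about 30 minutes end to end
route_b = rng.normal(50, 3, 500)  # route B: about 50 minutes end to end

pooled = np.concatenate([route_a, route_b])
print(round(pooled.std(), 1))                            # ~10.4, inflated
print(round(route_a.std(), 1), round(route_b.std(), 1))  # ~3.0 each
```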
Any questions on these methods? Yes. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ: Yeah, so let’s maybe pick
another example. Let’s say that you’re looking
at the proportion of passengers in a bus who are students. And you’re
distributing a survey. And they tell you whether
they’re students or not. And you want this
for the whole system or for at least a
group of routes. And it tends to be that some
routes don’t serve universities and don’t serve schools. So they have a lower
proportion of people. And then, some routes that
do go through universities, and they have a higher
proportion of students. So if you just want the
system-wide proportion of people who are students, and
you join all these data points together, there’s going
to be a lot of variability in what proportion that
is across every trip that you survey, correct? So in some sense,
it will indicate that because of
that variability, you’re going to need a
higher sampling size. You’re going to have
to survey more trips to get at your desired
accuracy level and tolerance. But now, if you say no, I’m
going to split routes in two, into two strata. One is the routes that
serve the universities. And these tend to have
around 50% proportion. And then, there’s the routes
that don’t serve universities. And these tend to have
proportions near 0. So for those near 0, you
might require a lower sample size to cover those. And you can just
very efficiently cover most of your
bus routes that way. And then, focus your
efforts on just the ones that have a higher proportion. And you achieve your system-level tolerance requirements with far fewer resources required to collect the data. Does that answer your question? Yeah. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:
So what I meant by inherent is that within each bus
route or within each strata, there will be some variability. Even within the trips that
are serving universities, every trip might have
a different proportion. So there’s going to be a little
bit of variability in that. But if you mix that with trips
that are not serving students, then you pool all
that data together. Then, it’s going to look like
the variance of that data set is much higher. All right, so we’ve
tossed these terms around– tolerance,
confidence level, accuracy. So let’s define
them more precisely. Accuracy– when we
talk about accuracy, that has two dimensions. So somebody might say, the
average boardings per trip is 33.1. And then, the
question that follows is, do you mean exactly 33.1? How certain are you of that? And how accurate is that? So when we talk about tolerance,
there’s relative tolerance, and there’s absolute tolerance. Relative tolerance
is expressed in terms of a percent of the amount you
were collecting or a fraction. So you might say mean
boardings per trip is 33.1, plus or minus 10%. And that’s the 10% of 33.1. That’s why it’s
relative tolerance. Then, there’s
absolute tolerance. So mean boardings per trip
is 33.1, plus or minus 3.3. Now, in this case, these
two are equivalent. 3.3 in absolute
terms is 10% of 33.1. But this was expressed
in absolute terms, and the previous one was
expressed in relative terms. So don’t always assume
that if you see a percent, it’s relative because if what
you’re measuring is in itself a percent, unless you’re
using a percent of a percent, then it’s absolute. So here’s an example. Mean percentage of students
is 23%, plus or minus 5%. That’s absolute because
it’s 5%, not 5% of 23%. First, we talked about,
is that exactly 33.1? Or is it something
different from 33.1? Then, the second question
is, how sure are you, how confident are you
that the number you give, plus or minus the tolerance
you give, is the right answer? So now, you say
I’m 95% confident that the mean boardings per
trip is 33.1, plus or minus 10%. So now, you combine
the tolerance with the confidence level. And that’s the full
expression of your accuracy. And that’s what you need when
we look at the data collection. So you have two different
things that you could play with. And what happens typically
is that you choose a high confidence level– 90% or 95% are typical. And then, you hold that fixed. And you calculate what
level of accuracy you need. Or rather, you decide
what level of accuracy you need, depending on the
question you want to answer, and the impact it could
have on the system. So if you’re looking to
[INAUDIBLE] something that will have very significant
effects on the service plan or maybe on investment
in the system, then you might need
a higher accuracy. But if you’re collecting
data just for reporting, maybe it doesn’t matter as much. And you don’t need to spend as
much money on data collection. So as an example here, the
National Transit Database– NTD, we call it NTD– for annual boardings and
passenger miles, it says, you should collect
data to achieve an accuracy of 10%, relative
tolerance at 95% confidence level. You need both. So that’s the take-home message about this. The other thing,
the t distribution– so this is a probability
distribution that is bell-shaped. It kind of looks like
the normal distribution. And it approaches the
normal distribution as the sample size
gets very large. This is the distribution
that arises naturally when you’re estimating the
mean of a population that is normally distributed with
unknown mean and variance and some known sample size. So to the right here,
we have the equations that I’m sure you’ve seen before for sample mean and sample variance.
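For reference, the standard estimators those slide equations presumably show– the “roughly equals” on the variance is exactly what the Q&A below picks at:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma_x^2 \approx s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2$$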
And I guess, what’s important to think about is that the distribution of what you’re collecting–
number of people boarding route 1. So that might have
some distribution. As you collect
more and more data, so as you survey
more and more trips, the distribution of how
many people board each trip does not necessarily
have to be normal. But it turns out from
the Central Limit Theorem and other laws and properties
of statistics and probability that the distribution
of the estimator– so the distribution of the
mean that you calculate based on that sample that
you collected– is normally distributed as
the sample size increases. So if you have a
lower sample size, instead of using the
normal distribution, use t distribution. Sometimes, we call that
a student, the t student distribution. And this distribution gets wider
as the variability increases and as the sample
size gets smaller. It has a property called
degrees of freedom, which is sample size minus 1. And you can see from this
chart right here when you have degrees
of freedom equals 1, which means you
collected two data points, it’s wider than when
V approaches infinity. And what you have in black here,
the thinnest and least variable of these, is essentially
a normal distribution. And this is the distribution
not of what you collected. It’s not the distribution
of the number of people who boarded route 1. It’s the distribution of
the mean that you estimate. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:
Exactly, it’s a sampling distribution
of the mean. And if you were to repeat that experiment with the same number of trips but different trips, you might get a
slightly different mean. So if you were to repeat
that many, many times, the distribution of those means
would be shaped in this manner. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ: Yeah,
well, student t distributed. And as sample size increases to
infinity, normally distributed. Harry. AUDIENCE: So just for V equals
5, I think you [INAUDIBLE]. GABRIEL SANCHEZ-MARTINEZ: 4. AUDIENCE: 4. GABRIEL SANCHEZ-MARTINEZ:
Sorry, 6. 6. AUDIENCE: Approximately
5 [INAUDIBLE]. GABRIEL SANCHEZ-MARTINEZ: Yes, 6. Yeah. I misspoke. [INAUDIBLE] AUDIENCE: When there’s a sample
variance, sigma x squared roughly equals. Is that not supposed to be an equals? Is that not the way the sample variance is defined? Because I thought it’s the– GABRIEL SANCHEZ-MARTINEZ:
So– it’s almost the variance of the distribution. But that’s roughly [INAUDIBLE]. AUDIENCE: Yeah, I guess
the issue is that you don’t know the true mean. So you’re using an estimate to
calculate the sample variance. And therefore, it’s almost,
almost the sample variance. GABRIEL SANCHEZ-MARTINEZ: Right. But I thought– AUDIENCE: You’re
using an estimator to do the– that’s
what you have to do. [INTERPOSING VOICES] AUDIENCE: He’s
incorporating the fact we’re dividing by n minus 1
rather than dividing by [INAUDIBLE]. GABRIEL SANCHEZ-MARTINEZ:
No, so n minus 1, that has to do with the
degrees of freedom issue. And that’s to go from population
variance to sample variance. But the other thing
that happens is that if you’re doing
the population, then you know exactly
what your mean is. It’s exact, right? AUDIENCE: Yeah. GABRIEL SANCHEZ-MARTINEZ:
And then in that case, you would know what the
exact variance as well. Yeah. So the n minus 1
is just to remove a bias that would arise from
collecting only a sample. AUDIENCE: But here
for example, you can say this is
equal to [INAUDIBLE]. GABRIEL SANCHEZ-MARTINEZ:
Yeah, yeah, yeah, yeah. AUDIENCE: You’re
working with the sample to know it would be an
approximate [INAUDIBLE]. GABRIEL SANCHEZ-MARTINEZ: Yeah, in practice, equal to. AUDIENCE: As your
sample distribution increases, then obviously,
your sample increases– [INTERPOSING VOICES] GABRIEL SANCHEZ-MARTINEZ:
And therefore, this becomes more and more accurate. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:
Exactly. AUDIENCE: It should be
approaching more [INAUDIBLE]. GABRIEL SANCHEZ-MARTINEZ:
Yeah, so I guess what’s
important to realize is that this is an estimate of
the population variance, which in itself uses another estimate. And I guess, that’s
why that’s there. But it’s a very small detail. I didn’t mean to distract you. AUDIENCE: So for the n, is it
the sum of all the different samples of [INAUDIBLE]
or is it just– [INTERPOSING VOICES] GABRIEL SANCHEZ-MARTINEZ:
So you don’t ever repeat the experiment like this. This is more of a
theoretical explanation to why there is a
distribution to the mean, even though you only have one. You only have one mean, right? Because you’re going
to collect data. And once you finish
collecting data, you’re going to calculate
the mean of all that data. So you only have one mean. If you were hypothetically
to repeat that experiment, and you calculated separate
means for each one, then you would
get a distribution that would look like this. In practice, you would just
increase your sample size and still compute one mean,
which would be more accurate. Yeah. OK, let’s move on. So tolerance and
confidence level– so we have these distributions. These are the distributions
of the statistics, of the mean in this case. They are bell-shaped. As your sample size increases,
the degrees of freedom goes up. And your accuracy goes up. And the variance of that
statistic distribution decreases. So it gets thinner. So here in red, you have what a distribution with a smaller sample– and therefore less accuracy, or less confidence– would look like. And then as you increase
your sample size, you see that it
becomes more peaky. So when we talk about
tolerance, and let’s come back to the concept
of absolute tolerance in particular, we’re
talking about the distance between the center of
that distribution, which is a symmetrical
distribution, and some limit. So we’re saying, if you have a tolerance of plus or minus 10, then you’re going to
measure 10, say 10 boardings, from the center to the right
and from the center to the left. And that’s your
absolute tolerance. So when you calculate
absolute tolerance, you can express that
tolerance as a function of the variance and/or
the standard deviation, rather of your mean. So instead of saying 10,
you could say 2 times the standard deviation of that
distribution using the equation that we just calculated. And that’s very convenient. Why would we do that? Why would I want to
complicate things that way? AUDIENCE: [? Outside ?]
[? of ?] a cumulative GABRIEL SANCHEZ-MARTINEZ:
No, I mean, there’s a mathematical convenience here. What is this a function of? It’s a function of
the standard deviation of the thing you were collecting
and your sample size, right? And what do we want to do? We want to determine
how many things we need to collect, right? So here we go– we have n. And now we can solve for
n, and we have the sample size that we require for
a given tolerance. So we’re going to decide
what the tolerance is and calculate sample size, a
minimum required sample size. You can always
collect more data. All right.
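Here is a minimal sketch of that calculation in Python. The standard deviation and tolerance values are assumed for illustration, and scipy's t distribution stands in for the table lookup discussed below:

```python
import math
from scipy import stats

def sample_size_absolute(s, d, confidence=0.95):
    # Smallest n such that t(n - 1) * s / sqrt(n) <= d.
    q = 1 - (1 - confidence) / 2        # two-sided tail point
    t = stats.norm.ppf(q)                # large-sample start, e.g. 1.96
    n = math.ceil((t * s / d) ** 2)
    while True:
        t = stats.t.ppf(q, df=n - 1)     # refine with the implied df
        n_new = math.ceil((t * s / d) ** 2)
        if n_new <= n:                   # stabilized; n meets the tolerance
            return n
        n = n_new

# e.g. an assumed standard deviation of 120 boardings, tolerance +/- 10:
print(sample_size_absolute(s=120.0, d=10.0))
```

Because the tolerance d appears squared in the denominator, halving the tolerance roughly quadruples the required sample.

So again, to review,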
this is the same equation I had in the last slide. You have absolute tolerance. You can express
that as a multiplier times the standard
deviation of the mean. And then you solve for n,
and you get this equation right here. t is your tolerance
and you can– oh, sorry. t is the number
of standard deviations from the mean. d is your tolerance,
which you choose. And this is something
that you know, or collect, or approximate. So these are all given. Where does t come from? Well, we said that
we’re going to use the t distribution, right? So the t distribution
has a table– or it has a certain
shape, rather. And using Excel or
looking it up in some table, you can figure out
what t is for two times the standard deviation
from the center. So you can just plug it
in from Excel or from– it’s a property of the
distribution, essentially. Once you pick a confidence
interval, you know t. If you want to go to 95,
it’s a certain value. If you want to go to 90,
it’s a different value. OK. When we look at
relative tolerance, relative tolerance is
just absolute tolerance divided by the mean that
you are collecting, correct? Because instead of saying
plus or minus 10 boardings, we’re saying plus or
minus 5% of the mean. So we just take absolute
tolerance and divide by x bar, the sample mean. And we solve for n again. So what we have now looks very similar to the equation right here. But now we have the mean in the denominator. OK, this quantity,
standard deviation divided by mean, sample
standard deviation divided by sample mean, is called
the coefficient of variation. And there’s a
convenience to this. And there’s actually
a reason why sometimes relative
tolerance is preferred to absolute tolerance. It’s because of
this, because there’s a mathematically convenient
characteristic, or property, coming out of this– that you don't need to know
the standard deviation of what you’re collecting to figure
out your sample size. We’re kind of running
in circles here, right? We’re saying that to
determine sample size, you need to know the
standard deviation. Well, I haven’t collected data. So I don’t know how
variable the data is. So that’s an issue. Now I have to
estimate what that is. It tends to happen that the
coefficient of variation is a more stable property
than the variance or the standard deviation itself. So you're more
likely to get away with using default values for
the coefficient of variation than you are with assuming a
specific standard deviation. AUDIENCE: It should be noted
that it’s unitless, coefficient of variation. GABRIEL SANCHEZ-MARTINEZ:
Yes, it is unitless. Thank you. OK. So what happens is that relative
tolerances are typically used for averages. So here’s an example– you measured 5720
boardings plus minus 5%. So if you were to get
the absolute equivalent of that tolerance, it would be 5% of 5720, which is 286 passengers. That's a weird thing to report. 5% is more
understandable, right? And it kind of makes more sense. So that’s what we want
naturally, anyway. So as I said, the
coefficient of variation is typically easier to guess
than the mean and the variance separately. So we use that. Here’s an example using
the t distribution, where the sample
is not large enough to assume a normal distribution. So we say, let’s have a relative
tolerance of plus minus 5%, a confidence level of
95%, and a coefficient of variation of 0.3. So we start out
assuming large sample, and therefore degrees
of freedom is infinity. We can use the
normal distribution. If we look at the
normal distribution, with plus minus 5%, confidence
level 95%, the t is 1.96. So we look that up on a table,
or we use Excel norm dist, or– yeah. t dist for t and
norm dist for normal. We got 1.96. We plug in the
relative tolerance, the 0.3– we get 140. 140 is not quite
infinity, right? So if we look at 140
as a sample size, that would imply that the degrees of freedom are 139. Now we go back and
look at the t dist, and we change 1.96 to the
value from the t distribution for that number of degrees of freedom. And we get 140.73. So you're sort of seeing
that you were almost right. 140 is very large. In practice, you would
just round up a little bit and get a nice round number, and
you would even play with this once you’re looking at planning
who you’re going to send out and how many hours
you’re going to collect. You want to get at
least 141, but if you’re going to have people in units
of eight hours, for example, or units of four hours, then you
might as well finish the batch for four hours, the last one. Maybe you’ll get
150, 160 from that.
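That two-step calculation, as a sketch– the intermediate numbers differ slightly from the ones just quoted because of rounding, with scipy standing in for the Excel or table lookup:

```python
import math
from scipy import stats

cv, r, conf = 0.3, 0.05, 0.95   # coefficient of variation, relative tolerance, confidence
q = 1 - (1 - conf) / 2

t = stats.norm.ppf(q)                 # large-sample (normal) start: 1.96
n = math.ceil((t * cv / r) ** 2)      # about 139
t = stats.t.ppf(q, df=n - 1)          # redo with df = n - 1: about 1.98
n = math.ceil((t * cv / r) ** 2)      # about 141 after rounding up
print(n)
```

Here's an example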
of that equation with different assumptions
of confidence and tolerance. And so we’re using
90% confidence, and we’re assuming a
certain sample size here. So you can see that, as the
tolerance decreases, which means that you require
a greater accuracy for different
coefficients of variation, the sample size can
get really large. So if your data is
not very variable, then you can sample
just a few trips. And you know because
they don’t vary that much what the mean is. But if there’s a lot of
variability across trips, then you need more. So that's what you see as you
go down the rows on this table. Here we have tolerance. If you only have to be 50%
accurate, plus minus 50%, then you don’t have to
collect that much data. If you want to be
more precise, and you want to say plus minus 5%, then
you need a bigger sample size, right? OK. Proportions– and the
homework, actually, is based on proportions,
so this is important. Consider sampling a group of passengers to estimate the proportion of passengers who are students. So from probability,
when you are looking at an event
that can either be 0 or 1, or black or white– in this case, students
or non-students– there’s a certain probability
that that person is a student, right? And what you want to
estimate is that probability or, in other words, what
percent of the things you observe are students. So from the properties of
the Bernoulli distribution, the variance is p
times 1 minus p. So if everybody is a student,
or nobody is a student, either way there’s no
variability, right? So you would have 1 times 1
minus 1, 1 times 0, 0– no variability. The peak variability,
the highest variance of this distribution, is
when 50% of your people are students, so 0.5
times 1 minus 0.5, 0.25. That’s the highest variance, OK? So the tolerance is
typically specified in absolute terms when you’re
estimating proportions, because the proportion
is in itself a percent. So you use absolute tolerance. And you just substitute,
essentially, this variance. You put in the variance of
the Bernoulli distribution, which is p times 1 minus p. And that’s how you get the
sampling equation, sample size requirement equation. Here’s a problem. We don’t know in advance what
the proportion will be, right? And we need that to know how
many riders we need to survey to figure out what the proportion of students is. OK, so– AUDIENCE: And it's also
a [INAUDIBLE] p times 1 minus p [INAUDIBLE] is
a constrained number. GABRIEL SANCHEZ-MARTINEZ:
It is a constrained number, and that’s exactly
where we’re going. So we use something called
absolute equivalent tolerance instead of absolute tolerance. We assume that p is 0.5– that's where the variance is at its maximum. So let's go ahead with
a worst case scenario. And then what happens
with p itself? Well, if your percent
is high, then you can tolerate a
bigger number, right? So if it’s 32%, you’re
probably OK with plus minus 5%. If your average were
1.2%, plus minus 5% is not that good, right? You need a higher– you need a much stricter,
tighter confidence interval for that. So probably not good to do
plus minus 5% in that case. AUDIENCE: [? Well, do ?]
[? you mean ?] you have a plus minus 5% absolute percentage? GABRIEL SANCHEZ-MARTINEZ:
Absolute, yeah. AUDIENCE: And you’d be
going negative [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:
Negative, which is possible but
difficult to interpret. AUDIENCE: Sorry, so this isn’t
actually 32% plus or minus 5% of 32 [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:
It is not– yeah, it’s absolute tolerance, not
relative tolerance, right. So what’s convenient about this
is that these two factors work in opposite directions. So as you get bigger, as the
proportion gets closer to 50%, the variance increases. So oh, well, we need
a bigger sample. But your tolerance
increases as well, so you don’t need
as big of a sample. And so it’s convenient. And the practical solution
is assume p is 0.5 and work in terms of absolute
equivalent tolerance. So you pick a tolerance
under the assumption that our proportion is 50%. And here’s what happens. Yeah, if the expected
proportion is 50%, and you say plus minus 5
percent, what you would get is this 5%, if it
turns out that p is 50%. But if p works out more to the
extremes, like 5% or 95%, what you would actually achieve
from having planned the survey, assuming 50%, is 2.2– so much better,
much more acceptable to say 5% plus
minus 2.2%, right? So it works out. And there’s a
convenient equation if you assume a very large sample, or a large enough sample, and you pick 95% confidence. Then 0.25, which is the variance, times the normal distribution t squared, 1.96 squared, is about 0.96, which is almost 1. So then you get this equation. You take 1, you divide
it by the tolerance that you want, your equivalent
tolerance, and that's your sample size.
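A sketch of that whole proportion calculation– 95% confidence and a 5% absolute equivalent tolerance, as just described; the 2.2% figure presumably comes from the unrounded sample size of about 385:

```python
import math
from scipy import stats

z = stats.norm.ppf(0.975)               # 1.96 for 95% confidence
d = 0.05                                 # absolute equivalent tolerance

n = math.ceil(z**2 * 0.5 * 0.5 / d**2)   # worst case p = 0.5: 385
print(n, math.ceil(1 / d**2))            # shortcut, since z^2 * 0.25 ~ 1: 400

# achieved tolerance if the true proportion turns out extreme, say p = 0.05:
p = 0.05
print(z * math.sqrt(p * (1 - p) / n))    # about 0.022, i.e. +/- 2.2%
```

So it doesn't depend on anything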
about the data in itself. You just say if I want, on
whatever I’m collecting, whatever proportion
I’m collecting, a 5% absolute
equivalent tolerance, then I need 400
surveys to be answered. Yeah? AUDIENCE: So this
assumes a random– GABRIEL SANCHEZ-MARTINEZ:
Simple random sample. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:
Yes, a simple random sample. So you would increase
these numbers if you are using
cluster sampling to account for correlation. You would have to increase
them if you’re giving people a survey, and not all of them
answer the survey, because you need 400 surveys answered. So if only half of the
people answer the survey, then you need to
distribute 800 surveys.
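That expansion is simple arithmetic; as a sketch, with assumed numbers:

```python
import math

n_required = 400      # answered surveys needed, from the tolerance calculation
response_rate = 0.5   # assumed fraction of distributed surveys that get answered

print(math.ceil(n_required / response_rate))       # 800 surveys to distribute
# with cluster sampling, apply the rule-of-thumb factor of 4 as well:
print(math.ceil(4 * n_required / response_rate))   # 3200 under that assumption
```

AUDIENCE: Do you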
recommend calculating also that the standard error after
this so that [INAUDIBLE] make sure? GABRIEL SANCHEZ-MARTINEZ:
Absolutely, yeah. You want to go back and
check with the standard error and what your confidence
interval is and see if you meet it or
if you need to add a few days of data collection. AUDIENCE: Right. GABRIEL SANCHEZ-MARTINEZ: Yeah.
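That post-hoc check might look like this sketch– the survey counts are hypothetical, and for a proportion the standard error comes from the Bernoulli variance:

```python
import math
from scipy import stats

answered, yes = 430, 116                        # hypothetical survey results
p_hat = yes / answered                           # estimated proportion
se = math.sqrt(p_hat * (1 - p_hat) / answered)   # standard error of p_hat
t = stats.t.ppf(0.975, df=answered - 1)          # 95% two-sided critical value
print(f"p = {p_hat:.3f} +/- {t * se:.3f}")       # half-width vs. target, e.g. 0.05
```

OK, so with proportions, you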
need a very large sample size to estimate a proportion
if you want accuracy. If you say an absolute equivalent tolerance of 4%, then you need 600. That's a big number, so it
just gives you an idea of that. If you get greedy
with the tolerance, you have to pay for the
surveyors to go out. OK. So the process is you determine
the needed sample size using the equations that we just discussed. Then you multiply
the sample sizes. If you’re using stratified
sampling or if you have questions that
have multiple variables, you need to then make sure
that you achieve that sample size for each combination of
things that you’re measuring. So if you’re, for example,
looking at not just boardings, but the proportion of passengers who own a car and who are pleased. So you could just independently measure who is pleased and independently measure who owns a car. And you might have the
tolerance you need on each one, but if you want the
combination of that, now you need a higher
sample, because you need that number for the
combination of those things. Then there’s a
clustering effect, so a typical thing
if you’re doing the clustering of a whole
vehicle of passengers is to multiply by 4. And then for things like OD
matrices, the rule of thumb is 20 times the number of cells. What does that mean? That if your OD matrix
is quite aggregate, and it’s at the
segment level– so say you divide a route
into two segments, then your OD matrix
has four cells. Four cells times 20, that’s how
many people you have to survey. If you do it
at the stop level, then you have many more stops
and, therefore, many more cells and, therefore, a
much higher sample size. If you have a response
rate that is not 100%, which is always
the case, then you have to expand by the reciprocal– 1 over the response rate. And then you get a
very large number, and you say I don’t have
the budget for that. And you have to make tradeoffs
and figure out what you can do. And maybe you have to– maybe you can’t collect
this combination and know that accurately, right? So you revise your expectations.
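Chaining those expansion rules together for the OD example– every factor below is one of the rules of thumb just mentioned, applied to assumed numbers:

```python
import math

segments = 2                  # route divided into two segments
cells = segments ** 2         # 4 OD cells at the segment level
base = 20 * cells             # rule of thumb, 20 responses per cell: 80
clustered = 4 * base          # clustering factor of 4: 320
response_rate = 0.4           # assumed
print(math.ceil(clustered / response_rate))   # 800 surveys to distribute
```

OK, with response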
rates, you are concerned with getting the
correct answers. You also want to be getting
a high response rate. If you don’t get a high response
rate, there might be a bias. So you have to worry about that. If you have low
response rates, that means you need to
distribute more surveys, and that costs money. And there’s the bias
that I just mentioned, so people who don’t respond may
not be responding for a reason. And then that
might bias your results. And that might make you
decide something in planning that is not the right decision
based on what actually happens. So we call that the
non-response bias. OK, so what happens? People who don’t respond
might be different or might have responded
differently to the question had they responded. So here’s some examples. If you’re surveying
people who are standing, maybe on a crowded bus– they are less comfortable. Or maybe they're getting off at one of the stops that is coming up, so they are
less likely to have the time to respond to your survey. People with low
literacy, teenagers, people who don’t
speak the language, are less likely to respond. And they might have
different travel patterns. So if you understand
those things, and you get lower
samples for them, you might be able to do
some sort of correction to those biases. But you have to pay attention. How do you improve
your response rate? Well, you can make your
questions shorter. You can do a quick oral survey. That’s what we’re going
to do for this homework. You can try to get information
from automatic sources whenever possible. So if you have an AFC system,
let’s not collect boardings, because we know that. And then of course some
training, and just being kind, and having supervision
helps a lot. OK, here’s some
suggested tolerances for different things. So we’re looking here at
boardings or the peak load. And you see here that
the suggested tolerance is 30%, plus minus 30%, when
you have a route with one to three buses. And then as you have
more and more buses, the tolerance decreases. That means you require
a larger sample. Why is that? Why do you need a
bigger sample if you have a route with more buses? AUDIENCE: You’re less likely
to sample a different bus. GABRIEL SANCHEZ-MARTINEZ: Yes,
and when you have higher– when you have more buses, you
tend to have higher frequency. There’s bunching. OK, so if you then survey
loads, for example, and you only get a few
because of the bunching effect and because there
are more buses, and you’re observing a
smaller percentage of them for a given time
period, say, you’re less likely to have observed
the bus that was really crowded, right? So that means that you need
to decrease your tolerance. And therefore, it’s more
expensive to survey that. OK, good. Trip time– 10% for routes
with less than 20 minutes, 5% for routes of greater than 20 minutes. Similar concept: if you have
greater than 20 minutes– there can be just
more variability, and you really want
to get that right. When you have less
than 20 minutes, your decisions on cycle times and things like this are not going to have
as much impact on the fleet size that you require. As you get bigger running
times, a small percentage change in the mean could
influence how many buses you need to dedicate to
that and the cost of running that service. On-time performance– 10%
absolute equivalent tolerance. These are typical values– don’t
take them as gospel, please. And these are for reporting,
not for anything that’s very critical for operations. Some of them are. Yeah, 30% at least, I would
say, is for reporting. I wouldn’t make any
critical decisions with 30%. On-time performance– we’re
talking here about whether a trip is on time
or not on time– so Bernoulli trials, right? And there’s a
proportion of trips that are on time, and what
we do is essentially say: if we say plus minus 10%, then we're saying that the sample size should be 1 over 0.1 squared, which is 100. Yeah. All right, default
coefficient– these are default values
for coefficient of variation of key data items. Ideally, you have your
own data that you look at, and you don’t resort to this. But if you ever find
yourself in a situation where you need to start
out with something. Here are some based on studies
that previous [AUDIO OUT] They took different routes
and looked at loads and running times for
different time periods and found what the coefficients
of variation were. And here they are on a
table for you to use. In the interest of time, since
I want to discuss the homework, I’m going to stop
here with slide 25. And I’m going to not cover
the whole process, which includes the monitoring phase. And in this slide
here, we have how you establish a conversion factor. The conversion factor in
itself has a variance. So there’s some uncertainty
about the relationship that you estimate between
your baseline data item and your auxiliary data item. So you need to consider
that in your sample size. And here are some tables
with some examples of what happens when you require
different– well, when you’re variability of or
your coefficient of variation of your
relationship increases or decreases. OK, let’s look at the homework. I really want to use these
last five minutes for that. So please take one and pass. OK, so the MBTA, there’s
a proposal here in Boston of taking Route 70 and 70A– they run through
Waltham, and they go into the Central Square area. And some people are saying those
two routes should be extended to Kendall Square,
because a lot of people are actually going to MIT, or
Kendall Square, or the Kendall Square area– not just Kendall Square Station,
but the whole area around. So if it’s true, A
lot of people could benefit from that extension. And we don’t know. So what are you going to do? You’re going to go
to a specific stop where it is very likely that
the people who would be going to MIT or those areas of Kendall
Square that would benefit from this extension
would alight, and you’re going to
ask people, would you have stayed on your bus
if this bus had continued to MIT and Kendall Square? It’s a simple oral survey, yes
or no question, one question. You’re going to work in
teams of four people. The stop that you’re going
to station yourselves at is shown in Figure 3. And you're going to collect
data for the AM peak, from 7:30 to 9:30. You pick the day. The teams are
assigned on Stellar, so please log into Stellar
and see what your team is and coordinate with
them to pick a day. And tell me what that
day is, because– actually, right after
class, I’m going to set up a shared spreadsheet
that you can all access. And just go into that
spreadsheet and pick a day. I’m going to put all the
days that are available, and you can say team
1, team 2, et cetera. Make sure that two teams
don’t go on the same day. We want data from
different days. And you’re going to all
bring that data together in that same
spreadsheet, and there are some questions
for you to analyze the data that you collected,
all of the class collected together. You’re measuring the
percent of people who would have stayed on the bus, right? So it’s a proportion. And one submission per team
in PDF format to Stellar. This is due March
7, but in order to leave you enough
time to do the analysis, the data collection efforts
should be done by February 28. So please submit your data by
the end of Tuesday, February 28 at midnight, say, or sometime
before the beginning of March in the morning,
when a person will be trying to analyze your data. OK, if you have
questions, let me know. And if not, have fun. Remember that assignment
1 is due Thursday. Eric? AUDIENCE: Just the one question:
[? is that ?] [? this is ?] going to miss anyone who is
transferring to the Red Line to then go to Kendall Square? GABRIEL SANCHEZ-MARTINEZ:
And going back to– let’s see. I forget where I had it. Well, I guess what I– there
was a point I made earlier where we can measure that from
automatically collected data, right? AUDIENCE: OK. GABRIEL SANCHEZ-MARTINEZ:
Does that make sense? AUDIENCE: Yeah, people who
[? car up ?] come from 70. GABRIEL
SANCHEZ-MARTINEZ: So if I see you tapping on
the 70 or the 70A, and then I see you
tapping at Central Square, I can infer that you
were using the service to transfer at Central Square. And then we'll
cover ODX, which is an inference model
for destinations later in this course. But looking at the sequence
of taps, I can infer– we can infer– what the
destination of that bus trip was. We can infer that
it was the stop that was closest to Central. And later that day,
presumably the person who might be going to Kendall
Square Station after work taps to Kendall Square. So I might think, oh, he took
the Red Line from Central to Kendall. So I don’t need to ask those
people where they’re going. And anyway, they might not
care about this extension. So we’re going to stand
at the bus stop that is after Central Square and see
where those people are going and whether they would
have stayed on that bus. AUDIENCE: Is this an
actual [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:
Some people are proposing it. It is a real proposal. The MBTA is a big organization. So I can’t say that the
MBTA wants to do this or doesn’t want to do this. But some people are interested. And it will get looked into. So it’s useful. AUDIENCE: [? Can ?] [? we ?]
[? share ?] [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:
Yeah, why not? AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ: Yeah. And I guess one other
thing that I– yeah, so we’re going to
probably make this a theme of the assignments. So there's going to be another assignment on service planning, operations planning. So we're going to start looking
at this combination of Route 70 and 70A, and we’re going
to essentially make a thread of this and do
some serious planning on some scenarios where the 70
and the 70A could be merged. And they could maybe be
terminated a little– yeah, we’ll make some
changes to the service plan under some
hypothetical scenarios. And you’ll get a chance to do
an operations plan on these. And then the last homework
will be on policy, so there might be
some policy questions that I have in mind about
what we could do about service outside, on the outer
parts of the 70 and 70A. All right?
