Discussion:
Possible interest in a webcast/presentation about a Django site with 40mil+ rows of data?
Cal Leeming [Simplicity Media Ltd]
2011-06-22 13:15:48 UTC
Permalink
Hi all,

Some of you may have noticed that in the last few months I've done quite a few
posts/snippets about handling large data sets in Django. At the end of this
month (after what seems like a lifetime of trial and error), we're finally
going to be releasing a new site which holds 40mil+ rows of data, grows by
roughly 300-500k rows each day, handles 5GB of uploads per day, and can
sustain around 1024 requests per second under stress testing on a moderately
spec'd server.

As the entire thing is written in Django (and a bunch of other open source
products), I'd really like to give something back to the community. (The stack
includes Celery/RabbitMQ/Sphinx SE/PyQuery/Percona
MySQL/NGINX/supervisord/Debian etc.)

Therefore, I'd like to see if there would be any interest in a webcast in
which I would explain how we handle such large amounts of data, the
trial-and-error process we went through, some really neat tricks we've used to
avoid bottlenecks, our own approach to smart content filtering, and some of
the valuable lessons we have learned. The webcast would be completely free
of charge, last a couple of hours (with a short break), and anyone could
attend. I'd also offer up a Q&A session at the end.

If you're interested, please reply on-list so others can see.

Thanks

Cal
Michał Sawicz
2011-06-22 13:20:39 UTC
Permalink
On Wed, 2011-06-22 at 14:15 +0100, Cal Leeming [Simplicity Media Ltd] wrote:
If you're interested, please reply on-list so others can see.
Sure, I'd attend.
--
Michał (Saviq) Sawicz <***@sawicz.net>
Thomas Weholt
2011-06-22 13:31:44 UTC
Permalink
Yes! I'm in.

Out of curiosity: when inserting lots of data, how do you do it? Using
the ORM? Have you looked at http://pypi.python.org/pypi/dse/2.1.0 ? I
wrote DSE to solve inserting/updating huge sets of data, but if
there's a better way to do it, that would be especially interesting to
hear more about (and sorry for the self-promotion).

Regards,
Thomas

--
Mvh/Best regards,
Thomas Weholt
http://www.weholt.org
Cal Leeming [Simplicity Media Ltd]
2011-06-22 13:36:16 UTC
Permalink
Hey Thomas,

Yeah, we actually spoke a little while ago about DSE. In the end, we went
with a custom approach which analyses data in blocks of 50k rows, builds a
list of rows which need changing to the same value, then applies them in
bulk using update() + F().

Here's our benchmark:

(42.11s) Found 49426 objs (match: 16107) (db writes: 50847) (range: 72300921
~ 72350921), (avg 13.8 mins/million) - [('is_checked', 49426),
('is_image_blocked', 0), ('has_link', 1420), ('is_spam', 1)]
(44.50s) Found 49481 objs (match: 16448) (db writes: 50764) (range: 72350921
~ 72400921), (avg 14.6 mins/million) - [('is_checked', 49481),
('is_image_blocked', 0), ('has_link', 1283), ('is_spam', 0)]
(55.78s) Found 49627 objs (match: 18516) (db writes: 50832) (range: 72400921
~ 72450921), (avg 18.3 mins/million) - [('is_checked', 49627),
('is_image_blocked', 0), ('has_link', 1205), ('is_spam', 0)]
(42.03s) Found 49674 objs (match: 17244) (db writes: 51655) (range: 72450921
~ 72500921), (avg 13.6 mins/million) - [('is_checked', 49674),
('is_image_blocked', 0), ('has_link', 1971), ('is_spam', 10)]
(51.98s) Found 49659 objs (match: 16563) (db writes: 51180) (range: 72500921
~ 72550921), (avg 16.9 mins/million) - [('is_checked', 49659),
('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]

Could you let me know if those benchmarks are better/worse than using DSE?
I'd be interested to see the comparison!

Cal
Thomas Weholt
2011-06-22 13:45:10 UTC
Permalink
On Wed, Jun 22, 2011 at 3:36 PM, Cal Leeming [Simplicity Media Ltd] wrote:
Hey Thomas,
Yeah, we actually spoke a little while ago about DSE. In the end, we went
with a custom approach which analyses data in blocks of 50k rows, builds a
list of rows which need changing to the same value, then applies them in
bulk using update() + F().
Hmmm, what do you mean by "bulk using update() + F()"? Something like
"update sometable set somefield1 = somevalue1, somefield2 = somevalue2
where id in (1,2,3 .....)"? Does "avg 13.8 mins/million" mean you
processed 13.8 million rows per minute? What kind of hardware did you
use?

Thomas
--
Mvh/Best regards,
Thomas Weholt
http://www.weholt.org
Cal Leeming [Simplicity Media Ltd]
2011-06-22 13:52:55 UTC
Permalink
Sorry, let me explain a little better.

(51.98s) Found 49659 objs (match: 16563) (db writes: 51180) (range:
72500921 ~ 72550921), (avg 16.9 mins/million) - [('is_checked',
49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]

That bracketed list at the end is just:

map(lambda x: (x[0], len(x[1])), _obj_incs.iteritems()) = [('is_checked',
49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]

In the above example, it found 49659 rows which need 'is_checked' changing
to the value '1' (the same principle applies to the other 3 flags), giving a
total of 51,180 database writes, split into 4 queries.
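
So by the end of a block the accumulator looks roughly like this (the IDs
below are made up purely for illustration - each list just holds the PKs
collected while scanning those 50k rows, with the counts from the log line
above):

_obj_incs = {
    'is_checked':       [72500921, 72500922, 72500923],  # ...49659 IDs in total
    'is_image_blocked': [],                               # 0 IDs
    'has_link':         [72500950, 72500987],             # ...1517 IDs in total
    'is_spam':          [72501003],                       # ...4 IDs in total
}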

Each of those 4 fields gets row IDs appended to it as the block is scanned,
e.g. for blocked images:

if _f == 'block_images':
    _obj_incs.get('is_image_blocked').append(_hit_id)
    # also flag the parent row, if there is one
    if _parent_id:
        _obj_incs.get('is_image_blocked').append(_parent_id)

Then I loop through those fields, and do an update() using the necessary
IDs:

# now apply the obj changes in bulk (massive speed improvements)
for _key, _value in _obj_incs.iteritems():
    # update every child object that needs this flag, in a single query
    Post.objects.filter(
        id__in=_value
    ).update(
        **{_key: 1}
    )

So in simple terms, we're not doing 51 thousand individual update queries;
instead we're grouping them into bulk queries based on the field being
updated. It doesn't yet do grouping based on key AND value, simply because we
didn't need it at the time, but if we release the code for public use, we'd
definitely add this in.
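
If it helps, here's a rough, stripped-down sketch of the whole idea in one
place. It's only a sketch - the Post model below is a minimal stand-in, the
block size and the has_link check are placeholders based on the examples
above, and the real spam / blocked-image checks are a lot more involved:

from django.db import models

class Post(models.Model):
    # minimal stand-in for the real model
    body = models.TextField()
    is_checked = models.BooleanField(default=False)
    is_image_blocked = models.BooleanField(default=False)
    has_link = models.BooleanField(default=False)
    is_spam = models.BooleanField(default=False)

BLOCK_SIZE = 50000
FLAGS = ('is_checked', 'is_image_blocked', 'has_link', 'is_spam')

def process_block(start_id):
    # one list of primary keys per flag we might set to 1
    obj_incs = dict((flag, []) for flag in FLAGS)

    # scan one block of 50k rows, collecting the IDs that need each flag
    qs = Post.objects.filter(
        id__gte=start_id, id__lt=start_id + BLOCK_SIZE
    ).only('id', 'body')
    for post in qs.iterator():
        obj_incs['is_checked'].append(post.id)
        if 'http://' in post.body:
            obj_incs['has_link'].append(post.id)
        # ...spam / blocked-image checks would go here...

    # apply the changes in bulk: one UPDATE per flag,
    # rather than one UPDATE per row
    for field, ids in obj_incs.iteritems():
        if ids:
            Post.objects.filter(id__in=ids).update(**{field: 1})

The win is simply that the number of UPDATE statements is bounded by the
number of flags, not by the number of rows in the block.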

Hope this makes sense, let me know if I didn't explain it very well lol.

Cal