Post by Andre Terra
Hello, Cal
First of all, congrats on the newborn! The Django community will surely
benefit from having yet another success story, especially considering how
big this project sounds. Is there any chance you could open-source some of
your custom-made improvements, so that they could eventually be merged into
trunk?
Thank you! Yeah, the plan is to release as many of the improvements as
possible as open source, although I'd rely heavily on the community to make
them 'patch worthy' for the core, as the amount of spare time I have is
somewhat limited.
The improvements list is growing by the day, and I usually try to post as
many snippets and/or tickets as I can.
It sounds like Thomas's DSE might be the perfect place for the bulk update
code too.
Post by Andre Terra
I definitely noticed how you mentioned large dbs in the past few months. I,
along with many others I assume, would surely like to attend the webcast,
with the only impediment being my schedule/timezone.
Once we've got a list of all the people who want to attend, I'll send out a
mail asking for everyone's timezone and availability, so we can figure out
what works best for everyone.
Post by Andre Terra
I recently asked about working with temporary tables for filtering/grouping
data from uploads, and then inserting the results from that temporary table
into permanent tables. To make matters worse, I wanted to make this as flexible
as possible (i.e. dynamic models) so that everything could be managed from a
web app. Do you have any experience you could share about any of these use
cases? As far as I know, there's nothing in the ORM that replicates
PostgreSQL's CREATE TEMPORARY TABLE. My experience with SQL is rather
limited, but from asking around, it seems like my project could indeed
benefit from such a feature. If I had to guess, I would assume other DBMSs
would offer something similar, but being limited to Postgres is okay for me,
for now, anyway.
I haven't had any exposure to Postgres, but my experience with temporary
tables hasn't been a nice one (with MySQL, at least). MySQL has many
gotchas when it comes to temporary tables and indexing, and on more than one
occasion I found it was actually quicker to analyse/mangle/re-insert the
data via Python code than it was to attempt the modifications within MySQL
using a temporary table.
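If you do want to experiment with them on Postgres anyway, as far as I know
you'd have to drop down to a raw cursor; something like this (completely
untested on my side since we're on MySQL, and the table/column names are
made up):

from django.db import connection

def stage_upload(rows):
    # rows: iterable of (payload,) tuples parsed from the upload
    cursor = connection.cursor()
    # temp tables are private to this connection/session; ON COMMIT DROP
    # cleans them up when the surrounding transaction ends, so this
    # assumes you're running inside a transaction
    cursor.execute("""
        CREATE TEMPORARY TABLE upload_staging (
            payload text NOT NULL
        ) ON COMMIT DROP
    """)
    cursor.executemany(
        "INSERT INTO upload_staging (payload) VALUES (%s)", list(rows))
    # filter/group in SQL, then move the survivors into the permanent
    # table in a single statement
    cursor.execute("""
        INSERT INTO myapp_record (payload)
        SELECT DISTINCT payload FROM upload_staging
    """)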
It really does depend on what your data is, and what you want to do with it,
which can make planning ahead somewhat tedious lol.
For our stuff, when we need to do bulk modifications, we have a filtering
rules list which is run every hour against new rows (with is_checked=1 set
on rows which have been checked). We then use bulk range queries of 50k rows
at a time (id >= 0 AND id < 50000), rather than using LIMIT/OFFSET (because
LIMIT/OFFSET gets slower and slower the larger the result set gets). Those
batches are analysed/mangled within a transaction, and bulk updated using
the method mentioned in the reply to Thomas below.
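As a rough sketch of the ranged approach in ORM terms (the queryset and the
is_checked flag here are placeholders, not our actual code):

from django.db.models import Max

BATCH = 50000  # rows per window

def new_row_batches(qs):
    # yield querysets covering fixed id windows, so each query hits the
    # primary key index directly instead of paying LIMIT/OFFSET's
    # growing scan cost on deep pages
    max_id = qs.aggregate(Max('id'))['id__max'] or 0
    lower = 0
    while lower <= max_id:
        yield qs.filter(id__gte=lower, id__lt=lower + BATCH, is_checked=0)
        lower += BATCH

Each yielded batch is then analysed and bulk updated inside its own
transaction.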
Sadly though, I can't say whether the methods we use would be suitable for
you, as we haven't tried them against Postgres, and we've only tested them
against our own data set + requirements. This is what I mean by trial and
error: it's a pain in the ass :)
Post by Andre Terra
Cheers,
André
On Wed, Jun 22, 2011 at 10:56 AM, Cal Leeming [Simplicity Media Ltd] wrote:
Post by Cal Leeming [Simplicity Media Ltd]
Also, the 13.8 minutes per million is basically a benchmark based on the
number of db writes and the total amount of time it took to execute (which
was 51s).
Please also note, this code is doing a *heavy* amount of content analysis,
but if you were to strip that out, the only overheads would be the
map/filter/lambda, the time it takes to transmit to MySQL, and the time it
takes for MySQL to perform the writes.
Hardware spec: 1x X3440 quad core (2 cores assigned to MySQL).
12GB memory (4 GB assigned to MySQL).
/var/lib/mysql mapped to 2x Intel M3 SSD drives in RAID 1.
Cal
On Wed, Jun 22, 2011 at 2:52 PM, Cal Leeming [Simplicity Media Ltd] wrote:
Post by Cal Leeming [Simplicity Media Ltd]
Sorry, let me explain a little better.
(72500921 ~ 72550921), (avg 16.9 mins/million) - [('is_checked', 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]
map(lambda x: (x[0], len(x[1])), _obj_incs.iteritems()) = [('is_checked', 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]
In the above example, it has found 49,659 rows which need 'is_checked'
changing to the value '1' (with the same principle applied to the other 3
fields), giving a total of 51,180 database writes, split into 4 queries.
Whenever a row needs a flag set, we append its id to the per-field list, e.g.:

_obj_incs.get('is_image_blocked').append(_hit_id)
_obj_incs.get('is_image_blocked').append(_parent_id)

Then I loop through those fields, and do an update() using the collected ids:

# now apply the obj changes in bulk (massive speed improvements)
for _key, _value in _obj_incs.iteritems():
    # update the child objects with a single UPDATE per field
    Post.objects.filter(id__in=_value).update(**{_key: 1})
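Where the new value needs deriving from the existing one rather than being
set to a literal, the same bulk shape works with an F() expression, e.g.
(a trivial sketch, using a made-up hit_count field):

from django.db.models import F

# a single UPDATE which increments the counter on every row in the batch
Post.objects.filter(id__in=_value).update(hit_count=F('hit_count') + 1)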
So in simple terms, we're not doing 51 thousand individual update queries;
instead we're grouping them into bulk queries based on the field to be
updated. It doesn't yet do grouping based on key AND value, simply because
we didn't need it at the time, but if we release the code for public use,
we'd definitely add this in.
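If we did add that, it'd presumably look something like this (a rough
sketch, not our actual code; Post as in the snippet above):

from collections import defaultdict

# group pending changes by (field, value) instead of just field, so rows
# needing the same field set to *different* values still batch cleanly
pending = defaultdict(list)            # (field, value) -> [row ids]
pending[('is_checked', 1)].append(42)  # made-up ids, for illustration
pending[('is_spam', 0)].append(43)

for (_key, _value), _ids in pending.items():
    Post.objects.filter(id__in=_ids).update(**{_key: _value})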
Hope this makes sense, let me know if I didn't explain it very well lol.
Cal
Post by Thomas Weholt
On Wed, Jun 22, 2011 at 3:36 PM, Cal Leeming [Simplicity Media Ltd] wrote:
Post by Cal Leeming [Simplicity Media Ltd]
Hey Thomas,
Yeah we actually spoke a little while ago about DSE. In the end, we
actually used a custom approach which analyses data in blocks of 50k rows,
builds a list of rows which need changing to the same value, then applied
them in bulk using update() + F().
Hmmm, what do you mean by "bulk using update() + F()"? Something like
"update sometable set somefield1 = somevalue1, somefield2 = somevalue2
where id in (1,2,3 .....)"? Does "avg 13.8 mins/million" mean you
processed 13.8 million rows per minute? What kind of hardware did you
use?
Thomas
Post by Cal Leeming [Simplicity Media Ltd]
(72300921 ~ 72350921), (avg 13.8 mins/million) - [('is_checked', 49426), ('is_image_blocked', 0), ('has_link', 1420), ('is_spam', 1)]
(72350921 ~ 72400921), (avg 14.6 mins/million) - [('is_checked', 49481), ('is_image_blocked', 0), ('has_link', 1283), ('is_spam', 0)]
(72400921 ~ 72450921), (avg 18.3 mins/million) - [('is_checked', 49627), ('is_image_blocked', 0), ('has_link', 1205), ('is_spam', 0)]
(72450921 ~ 72500921), (avg 13.6 mins/million) - [('is_checked', 49674), ('is_image_blocked', 0), ('has_link', 1971), ('is_spam', 10)]
(72500921 ~ 72550921), (avg 16.9 mins/million) - [('is_checked', 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]
Could you let me know if those benchmarks are better/worse than using DSE?
I'd be interested to see the comparison!
Cal
On Wed, Jun 22, 2011 at 2:31 PM, Thomas Weholt wrote:
Post by Thomas Weholt
Yes! I'm in.
Out of curiosity: when inserting lots of data, how do you do it? Using the
ORM? Have you looked at http://pypi.python.org/pypi/dse/2.1.0 ? I wrote DSE
to solve inserting/updating huge sets of data, but if there's a better way
to do it, that would be especially interesting to hear more about (and
sorry for the self-promotion).
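For the curious, the basic idea DSE automates is batching inserts rather
than doing a round-trip per object; a minimal sketch of the general idea
(not DSE's actual API, with made-up table/column names):

from django.db import connection

def batched_insert(rows, batch_size=50000):
    # rows: iterable of (payload,) tuples; one executemany per batch
    # avoids millions of single-row INSERT round-trips
    cursor = connection.cursor()
    buf = []
    for row in rows:
        buf.append(row)
        if len(buf) >= batch_size:
            cursor.executemany(
                "INSERT INTO myapp_post (payload) VALUES (%s)", buf)
            buf = []
    if buf:
        cursor.executemany(
            "INSERT INTO myapp_post (payload) VALUES (%s)", buf)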
Regards,
Thomas
On Wed, Jun 22, 2011 at 3:15 PM, Cal Leeming [Simplicity Media Ltd] wrote:
Post by Cal Leeming [Simplicity Media Ltd]
Hi all,
Some of you may have noticed, in the last few months I've done quite a lot
of posts about handling large dbs in Django. This month (after what seems
like a lifetime of trial and error), we're finally going to be releasing a
new site which holds around 40mil+ rows of data, grows every day, and can
handle around 1024 requests per second on stress test on a moderately
spec'd server.
As the entire thing is written in Django (and a bunch of other open source
products), I'd really like to give something back to the community. (Stack
includes Celery/RabbitMQ/Sphinx SE/PYQuery/Percona
MySQL/NGINX/supervisord/debian etc.)
Therefore, I'd like to see if there would be any interest in a webcast, in
which I would explain how we handle such large amounts of data, the trial
and error processes we went through, and some really neat tricks we've
done, scheduled so that anyone who wants to can attend. I'd also offer up a
Q&A session at the end.
If you're interested, please reply on-list so others can see.
Thanks
Cal
--
You received this message because you are subscribed to the Google Groups "Django users" group.
To post to this group, send email to django-***@googlegroups.com.
To unsubscribe from this group, send email to django-users+***@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-users?hl=en.